In my previous post, we looked at the first “D” of the 3D approach to identifying and extracting value from your transaction data: Determination.
If you recall, I proposed a three-step approach (the 3D approach) to realising value from a variety of large data sets:
- Determination – scour data sources to establish if and where there might be value
- Development – create models for the decision areas where value was identified, using data that is predictive of the predetermined outcome
- Deployment – implement and run the developed models
In this post, we will continue looking at the five steps of the Determination phase:
- Incorporating the data
- Aggregating the data
- Identifying the target areas of value
- Scouring the data for value
- Reviewing and planning mini-projects
Let’s now look at steps 4 and 5.
4. Scouring the data for value
The process of data scouring, or prospecting, involves examining the observation data – in this case, the aggregated data – and assessing whether it might predict the target areas of value. Essentially: is there a correlation between an historic piece or group of data and a future outcome?
To determine this, analysts will typically assess the strength of the relationship between each observation characteristic and the outcome (target) variable. Initially the analysis will be univariate (single-variable); later, multivariate analysis can take place.
Univariate analysis involves taking an observation characteristic, creating attribute groups and then calculating the information value (IV). The equation to calculate this is:

IV = Σ (%Bᵢ − %Gᵢ) × ln(%Bᵢ / %Gᵢ), summed over the attribute groups i = 1 … n

where %Bᵢ is the percentage of one outcome class (in the example below, customers with no music purchases) that falls into attribute group i, and %Gᵢ is the percentage of the other class (customers with music purchases) that falls into the same group.
The information value calculated for each field and target variable indicates whether the field has predictive value or not. An indicative table of strength bands is displayed below.
| Range | Strength |
|-------|----------|
| 0 – 0.02 | Non-predictive |
| 0.02 – 0.1 | Weak |
| 0.1 – 0.3 | Medium |
| 0.3 – 0.5 | Strong |
| 0.5 and above | Very Strong |
An example of the calculation follows. Here an online retailer wants to understand whether customers who bought books in January 2015 – March 2015 were likely to buy music in April 2015 – June 2015. The rows represent the observation data and the columns represent the outcome, or target, data.
Does 3 months of book purchases predict a music purchase 3 months later?

| Book purchases (Jan 2015 – Mar 2015) | No music purchases (Apr 2015 – Jun 2015) | Music purchases (Apr 2015 – Jun 2015) | ALL customers |
|---|---|---|---|
| Customers with no book purchases | 9,000 | 1,000 | 10,000 |
| Customers with book purchases | 1,500 | 500 | 2,000 |
| ALL customers | 10,500 | 1,500 | 12,000 |
This is then converted to column percentages:
| Book purchases (Jan 2015 – Mar 2015) | No music purchases | Music purchases |
|---|---|---|
| Customers with no book purchases | 86% | 67% |
| Customers with book purchases | 14% | 33% |
| ALL customers | 100% | 100% |
The calculations are then run, where %B refers to the “No music purchases” column and %G to the “Music purchases” column:
| Item | Calculation | i = 1 | i = 2 |
|---|---|---|---|
| A | %B / %G | 129% | 43% |
| B | ln(%B / %G) | 0.251314 | −0.847298 |
| C | %B − %G | 19% | −19% |
| D | B × C | 0.047869 | 0.161390 |
| E | Sum of D (i = 1) and D (i = 2) | 0.209259 | |
Therefore, with an information value of 0.209259 (in the 0.1 – 0.3 band), book purchases in the first quarter of 2015 had a medium correlation to music purchases in the second quarter of 2015.
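To make the arithmetic concrete, here is a minimal Python sketch of the same information value calculation, using the counts from the retailer example above (the variable names and data structure are my own, purely for illustration):

```python
import math

# Counts from the worked example: each attribute group of the observation
# characteristic maps to (customers with no music purchase, customers with
# a music purchase) in Apr 2015 - Jun 2015.
counts = {
    "no_book_purchases": (9_000, 1_000),
    "book_purchases":    (1_500,   500),
}

total_no_music = sum(no for no, yes in counts.values())   # 10,500
total_music    = sum(yes for no, yes in counts.values())  # 1,500

iv = 0.0
for group, (no_music, music) in counts.items():
    pct_b = no_music / total_no_music  # %B: group's share of non-buyers
    pct_g = music / total_music        # %G: group's share of buyers
    iv += (pct_b - pct_g) * math.log(pct_b / pct_g)

print(f"Information value: {iv:.6f}")  # 0.209259 -> "Medium" (0.1 - 0.3)
```

Running this reproduces the 0.209259 figure from the calculation table.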
This calculation is run for all observation variables against each of the target variables. The results can be displayed in a data-scouring heat map. An example is displayed here.
The rows represent the observation, or aggregated, characteristics. The columns represent each outcome (target) variable. The colours represent the degree of correlation, according to the mapping table above – red being the strongest and blue being the weakest.
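As a sketch of how such a heat map might be produced – assuming the information values have already been computed into a table of characteristics by targets – the characteristic names, target names and values below are invented for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical information values: one row per observation characteristic,
# one column per target variable (all names and numbers are illustrative).
iv = pd.DataFrame(
    {"music_purchase_next_3m": [0.21, 0.05, 0.34],
     "offer_response_next_1m": [0.12, 0.41, 0.02]},
    index=["book_purchases_3m", "avg_basket_value_3m", "store_visits_1m"],
)

fig, ax = plt.subplots()
im = ax.imshow(iv.values, cmap="coolwarm")  # red = strongest, blue = weakest
ax.set_xticks(range(len(iv.columns)))
ax.set_xticklabels(iv.columns, rotation=45, ha="right")
ax.set_yticks(range(len(iv.index)))
ax.set_yticklabels(iv.index)
fig.colorbar(im, label="Information value")
fig.tight_layout()
plt.show()
```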
This exercise, if done well, will set the business up for a well-ordered series of projects to extract value from the data.
Measuring incremental lift
Once the univariate analysis is complete, it is worthwhile assessing how correlated each characteristic is with the others. Correlation analysis is often run to determine this. For example, “Number of purchases in the last 1m” may be positively correlated with response to a marketing offer, and “Number of purchases in the last 3m” may be too. The naïve conclusion would be that both characteristics should be used together to predict response to an offer. In reality, these two characteristics are highly correlated with each other, so the second adds little incremental lift over the first.
Many statistical techniques and measures can be used to determine the correlation. Ultimately, the characteristics can be grouped into correlated clusters, leaving a smaller set of mutually low-correlated but predictive groups. This information is important when it comes to determining which model – for example, decision tree/segmentation, clustering or a scorecard – should be considered in the next phase.
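A minimal sketch of such a correlation check, assuming the aggregated characteristics sit in a pandas DataFrame; the data here is simulated so that the one-month and three-month purchase counts come out highly correlated, as in the example above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Simulated aggregated characteristics for 1,000 customers; purchases_3m is
# constructed to be highly correlated with purchases_1m, as in the example.
purchases_1m = rng.poisson(2, size=1_000)
df = pd.DataFrame({
    "purchases_1m": purchases_1m,
    "purchases_3m": purchases_1m * 3 + rng.poisson(1, size=1_000),
    "avg_basket_value": rng.gamma(2.0, 15.0, size=1_000),
})

corr = df.corr()  # pairwise Pearson correlations

# Flag pairs so correlated that using both adds little incremental lift.
threshold = 0.8
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) >= threshold:
            print(f"{a} and {b} are highly correlated (r = {corr.loc[a, b]:.2f})")
```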
5. Reviewing and planning mini-projects
Once the heat maps are developed, the following should be determined:
- For each target variable – was there a significant amount of data with a strong prediction of the target?
- Could this data be used to create models, such as scores, segmentation or clustering? The variables identified should be tested for collinearity/correlation to determine whether the strong characteristics add value.
- Can the data merging, aggregation and model deployment be successfully accomplished given current hardware and software constraints? If not, what needs to change and is the business ready for it?
- Which exercise is likely to produce the most value?
- Should the business run projects in parallel sprints or in a relay (one-at-a-time) fashion?
Conclusion
The Determination step gives data analysts and risk, loyalty and marketing managers very good insight into which fields might be valuable, and for what purposes. In my next blog post, I’ll take you through the next phase of finding value in your transaction data: Development.