Using Statistical Modeling to Predict Product Purchase in Microsoft Azure

It’s been a long time since my last post so I thought I’d return with something a little different than in the past. Rather than speak about “market research done wrong” (as my former posts tend to dwell on), I’d like to demonstrate some of the applications of predictive modeling/machine learning/[INSERT “DATA SCIENCE” BUZZWORD HERE] within the market research realm. As the industry moves towards methodologies that blend both “hard” and “soft” data (e.g. transaction data and surveys, respectively), these techniques will become more important to learn and apply in order to stay relevant and continue to provide our clients with the best direction and research ROI. The goal of this post is to show a simple demonstration of predictive modeling applied to a common market research question:



While I have spent most of my self-study time learning the R programming language, I wanted to try Microsoft’s Azure Machine Learning. Azure is part of Microsoft’s Cortana Analytics Suite, which is the company’s foray into the data analytics market (now part of Azure’s AI Platform — Microsoft has a way of playing mix-and-match with their various product suites). You may be familiar with the Cortana Personal Assistant that came with the recent Windows 10 deployment (similar to Apple’s Siri or Amazon Echo’s Alexa). All is not lost for R-lovers, though — Microsoft’s acquisition of Revolution R back in 2015 has helped with the integration of an R coding module within the Azure environment (there is one for Python too if that’s your preferred language).

To oversimplify, Azure is a drag-and-drop machine learning workflow designer. It removes much of the coding/scripting inherent to a machine learning (from now on, just “ML”) project. While this makes things easier for the user, it can give one a false sense of understanding exactly what their experiment (the name given to an Azure project) is doing to achieve its results. I suggest getting at least an intermediate understanding of applied ML before delving into Azure. Azure experiments are created by dragging “modules” onto the work space and connecting them like a flow chart. Each module represents an operation such as filtering columns in a data set, applying an ML algorithm, or converting output to a specific format such as .CSV.
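The module-and-arrow idea can be sketched outside Azure too. Below is a minimal pure-Python analogy — the module names and toy data are my own, not Azure’s actual modules — in which each “module” is a function over a data set and an “experiment” simply chains them together.

```python
# Illustrative sketch of Azure's module-based workflow: each module is a
# function that takes a list of record dicts and returns a new one.

def select_columns(rows, keep):
    """Mimics a column-selection module: keep only the listed fields."""
    return [{k: r[k] for k in keep} for r in rows]

def filter_rows(rows, predicate):
    """Mimics a row-filtering module."""
    return [r for r in rows if predicate(r)]

def run_experiment(rows, modules):
    """Wire modules together in sequence, like drawing arrows in Azure."""
    for module in modules:
        rows = module(rows)
    return rows

data = [
    {"C2": 1, "C3": 9, "target_purchase": 1},
    {"C2": 0, "C3": 4, "target_purchase": 0},
]

result = run_experiment(data, [
    lambda r: select_columns(r, ["C2", "target_purchase"]),
    lambda r: filter_rows(r, lambda row: row["target_purchase"] == 1),
])
# result → [{"C2": 1, "target_purchase": 1}]
```

Azure’s canvas is essentially this composition, drawn visually instead of written in code.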


Innocentive is a website similar to Kaggle: it offers data science competitions with monetary rewards for building predictive models used by real-world organizations. Sure, the data is often “cleaner” than in a real-world setting, but for learning the trade (with the incentive of winning some cold, hard cash along the way) both are great sites. Details on this particular challenge are located here.

In a nutshell, the project is a classification problem — who will purchase a product based on the given set of features? Participants are given two data sets: a training set used to build the model and a test set to evaluate the model on new data not included in the model-building process. Both sets are sizable (training set: ~643,000 records; test set: ~782,000 records) but shouldn’t require any fancy distributed computing systems (e.g. Hadoop) to analyze. Since Azure is cloud-based this becomes a non-issue because the data is stored out-of-memory. The outcome variable we are trying to predict using the remaining features, “target purchase,” is marked “1” for purchasers and “0” for non-purchasers. The challenge requires that models be at least 72% accurate, or:
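Since the formula itself isn’t reproduced here, I’ll assume the threshold refers to plain classification accuracy — the share of predictions matching the true 0/1 labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true 0/1 labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels for illustration: 4 of 5 predictions are correct.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

acc = accuracy(y_true, y_pred)   # 0.8
meets_threshold = acc >= 0.72    # the challenge's minimum bar
```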

One aspect of this data challenge that intrinsically adds difficulty is the fact that the features’ meanings have been obscured: they have been given generic names like “C2,” “C3,” etc. All that participants are given is a simple metadata file indicating whether each feature is numeric or character in nature. Because of the nebulous a priori knowledge we have about the data, feature engineering and feature selection can be laborious and fraught with guesswork.


Admittedly, I made one of the first mistakes in ML — I threw every algorithm I could at the training set without first analyzing the features! While you can do this (and you’ll usually fail fast) keep in mind that one of the most important steps in ML is feature selection and feature engineering. Arguably, data cleansing/preparation is the most important but these are a close second. Thankfully, the data set provided for training has zero missing values and most of our numeric features share similar standard deviations and means (although we will need to revisit the numeric variables later).

Before diving into the “fun” stuff, I did perform a few maintenance routines: fixing metadata so variables are treated correctly according to the file mentioned above, sampling from the training file to reduce the overall size of the data set we are working with (this should help with computation time), and splitting the data set into training and validation sets (while we already have a test set, the validation set will allow us to “pre-test” our model on the training data prior to using it in a “real” evaluation).
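The sampling and splitting steps have direct code equivalents. Here’s a rough pure-Python sketch — the sizes and fractions are illustrative, not the ones used in the actual experiment:

```python
import random

random.seed(42)  # reproducibility

# Stand-in for the ~643K training rows, scaled down for illustration.
records = list(range(643))

# Sample a fraction of the rows to cut computation time
# (mirrors Azure's "Partition and Sample" step described above).
sample = random.sample(records, k=int(len(records) * 0.25))

# Split the sample 70/30 into training and validation sets, so the model
# can be "pre-tested" on held-out data before the real test set.
random.shuffle(sample)
cut = int(len(sample) * 0.7)
train, validation = sample[:cut], sample[cut:]
```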

Azure only has a handful of built-in ML algorithms but they cover most of the commonly used ones. A variety of options are available for two-class problems so I began with Two-Class Boosted Decision Trees, Two-Class Decision Forest, Two-Class Support Vector Machines, and Two-Class Logistic Regression. Since building any ML model is an iterative process, I started by using the “Sweep Parameters” module for each algorithm. This allows Azure to test a range of tuning parameter options and choose the model with the highest predictive accuracy.
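To illustrate what a parameter sweep does, here’s a toy version with a one-parameter threshold classifier standing in for the real algorithms (the model, grid, and data are mine, not Azure’s): try every value in a grid, score each candidate on held-out data, and keep the winner.

```python
# Toy "Sweep Parameters": the tuning parameter here is a decision threshold.

def predict(xs, threshold):
    """Classify as 1 when the feature value reaches the threshold."""
    return [1 if x >= threshold else 0 for x in xs]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Held-out validation data (illustrative).
X_val = [0.1, 0.4, 0.55, 0.7, 0.9, 0.2]
y_val = [0, 0, 1, 1, 1, 0]

grid = [0.3, 0.5, 0.6, 0.8]  # the parameter range to sweep
scores = {t: accuracy(y_val, predict(X_val, t)) for t in grid}
best_threshold = max(scores, key=scores.get)   # 0.5 on this data
```

Azure does the same thing at scale: each candidate parameter set trains a model, and the one with the best validation metric is kept.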

On first pass, each of our models scored similarly (~71% – ~74% accuracy). While many may hang their hats up and be satisfied with these results, I believe we can do better!

Our best model utilized Two-Class Boosted Decision Trees. While I played around with the tuning parameters for the other models, for the sake of brevity (and I’m already pushing that!) please note that this model held up after several iterations and continued to have the best accuracy among all those tested. I’ll stick with it for the remainder of the post.

Aside: Two-Class Boosted Decision Trees

What exactly is this algorithm? Tree building algorithms have been around for a long time and are most-closely associated with CART (Classification and Regression Tree) models. A basic CART model (for classification rather than regression in this post) iterates over all features and selects the feature that best splits the data by the outcome variable (in our case, “target purchase”). By “best splits” I mean the feature that most accurately divides the data into two subgroups with the greatest proportion of cases falling into the purchaser/non-purchaser classification (i.e. our “1s” and “0s”). Further, the algorithm recursively applies this splitting procedure using the remaining features within each subgroup and continues this way until some threshold is reached. When the splits no longer help divide the data anymore, we’ve reached a “terminal node” or “leaf” in our tree. Here is an example:
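The “best split” criterion can be made concrete in code. A common choice for classification trees is Gini impurity (CART’s usual criterion — the post doesn’t specify which measure Azure uses, so take this as an assumption): the split whose two subgroups have the lowest weighted impurity wins.

```python
# Scoring a candidate split with Gini impurity (assumed criterion).

def gini(labels):
    """Impurity of a group of 0/1 labels; 0.0 means perfectly pure."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)          # proportion of "1"s (purchasers)
    return 2 * p * (1 - p)

def split_impurity(values, labels, threshold):
    """Weighted impurity of the two subgroups created by the split."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

feature = [1, 2, 3, 8, 9, 10]
target = [0, 0, 0, 1, 1, 1]               # "target purchase"

# A threshold of 5 separates the classes perfectly → impurity 0.0.
perfect = split_impurity(feature, target, 5)
# A threshold of 2 leaves a mixed right-hand group → impurity > 0.
imperfect = split_impurity(feature, target, 2)
```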

After some tweaking, new cases can be passed through the tree to receive a classification at a “leaf.” Azure’s Boosted Decision Trees work by building several “weak learner” trees, averaging how well they classify cases as a group, or “ensemble,” then adjusting the parameters to “fix” incorrect classifications in the next iteration of tree building (the “boosting” part). I admit, this is a lousy explanation for a highly technical algorithm (and I don’t pretend to fully understand the specific mathematics behind it), but to simplify, just think of the algorithm adhering to the proverb “if at first you don’t succeed, try, try again — with incremental improvements.”
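Here’s a stripped-down sketch of that loop, assuming a squared-error flavor of boosting with one-split “stumps” as the weak learners (Azure’s actual implementation differs in its details): each new stump is fit to whatever error the ensemble still makes.

```python
# Minimal boosting sketch: stumps fit to residuals, summed into an ensemble.

def fit_stump(xs, residuals):
    """Find the threshold split that best reduces squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lmean = sum(left) / len(left) if left else 0.0
        rmean = sum(right) / len(right) if right else 0.0
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x, t=t, l=lmean, r=rmean: l if x < t else r

def boost(xs, ys, rounds=20, lr=0.3):
    """Each round, fit a stump to the remaining error and add a small correction."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]   # what's still wrong
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [1, 2, 3, 8, 9, 10]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
labels = [1 if model(x) >= 0.5 else 0 for x in xs]
```

Each stump alone is a weak learner; the sum of many small corrections is what makes the ensemble strong.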


Now that I’ve decided that our boosted tree model is the way to go, let’s see if there are any improvements I can make by adjusting the feature inputs. Please note, in a typical setting this is working a little backwards. The first round of feature engineering/selection is typically performed prior to modeling, but considering that this data set is anonymized we don’t really have any knowledge of what each feature is measuring. Unsupervised learning methods (clustering, correlation analysis, etc.) could potentially provide some insight, but I tried something a little different.

Azure contains a set of modules called “Learning with Counts” which take features in a data set and count the occurrences of unique values grouped by the outcome variable (“target purchase”). This is a sort of feature engineering that uses the frequencies of values rather than the values themselves. It also calculates the difference between the group’s counts.
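A rough sketch of that counting idea, with illustrative column names and data (the exact features Azure emits are more elaborate, so treat this as an approximation):

```python
from collections import defaultdict

# "Learning with Counts" sketch: replace each raw value with how often it
# co-occurs with target=1 vs target=0, plus the difference between the two.

def build_count_table(values, targets):
    """Map each unique value to [count with target=0, count with target=1]."""
    counts = defaultdict(lambda: [0, 0])
    for v, t in zip(values, targets):
        counts[v][t] += 1
    return counts

def count_features(values, counts):
    """Turn raw values into count-based features."""
    out = []
    for v in values:
        c0, c1 = counts.get(v, [0, 0])
        out.append({"count_0": c0, "count_1": c1, "diff": c1 - c0})
    return out

C2 = ["a", "a", "b", "b", "b", "c"]      # illustrative feature column
target = [1, 1, 0, 0, 1, 0]              # "target purchase"

table = build_count_table(C2, target)
features = count_features(C2, table)
# "a" co-occurs with target=1 twice and target=0 never
# → {"count_0": 0, "count_1": 2, "diff": 2}
```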

I gave this a whirl since a lot of the columns marked as “numeric” in the metadata in fact only contain a handful of unique values. I suppose “numeric” should not be confused with “continuous” in all cases with this data set. The transformed data set now contains new features with these count values. Let’s run our boosted tree algorithm again to see if any improvement occurred.

Wow! The accuracy shot up enormously! This is the territory I’d be happy in with a ML model. Yet, nothing really matters until we apply the model to predict “target_purchase” on our test set. After applying the same counting transformation to our test set and using Innocentive’s online scoring tool we are able to upload our predictions for our ~782K-record test data.

Drum roll please…

Oh no! A dismal 48% accuracy! This is actually worse than just guessing. How can this be? Two things come to mind:

  1. The counting transformation failed on the test set due to the lack of a “target purchase” outcome variable (this column is blank in this set and the transformation may not have run correctly).
  2. The model overfit the training data (i.e. the model is so biased towards the nuances of the training data that it only does a good job of modeling that set — not the test set or any other similar data we’d send through it).

In order to make our model generalize to unseen data, we will most likely have to sacrifice some predictive accuracy when we train the model.


Although this is titled “3rd Attempt,” I’ve skipped ahead a few steps since this post is already getting long. Here is a brief summary of how we got here:

  • Rather than “Learning with Counts” on the numeric variables with only a few unique values (see above), I instead converted these to categorical variables in both the training and test sets. This transformation will not cause the same error that happened before due to the lack of the “target purchase” outcome variable in the test set.
  • I’ve identified the optimal model-building parameters to maximize our model’s accuracy on the test set. This means I can now simply use the “Train Model” module to create a single model using one parameter set. Because we’ve reduced the computation time at this step, it is more feasible to use a greater portion of the training data. Therefore, I’ve eliminated the “Partition and Sample” module to allow the algorithm to use all the training data for model building.
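The conversion in the first bullet might look like this in code; the `max_unique` cutoff and column values are my own assumptions, not anything specified by the challenge:

```python
# Treat a "numeric" column as categorical when it has only a few distinct
# values. No target column is needed, so the same transformation can be
# applied identically to the training and test sets.

def to_categorical(values, max_unique=5):
    """Return string category labels if the column has few distinct values."""
    if len(set(values)) <= max_unique:
        return [f"level_{v}" for v in values]
    return values                      # genuinely continuous: leave as-is

C7 = [0, 1, 1, 3, 0, 3]                # "numeric" but really categorical
C8 = [0.12, 3.4, 7.7, 1.1, 9.8, 2.2]   # looks genuinely continuous

C7_cat = to_categorical(C7)            # converted to string levels
C8_cat = to_categorical(C8)            # left unchanged
```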

Our final trained model is 72% accurate, meaning it correctly classifies purchasers and non-purchasers 72% of the time. Now the true test…how will it perform on the test set?

A dramatic improvement! We went from 48% accuracy to 72% accuracy (you can see from the image that I made several submissions to the online scoring tool before reaching this level of accuracy so a lot of trial and error was involved). While I would have liked to do better, I was at least able to cross the minimal submission threshold. Yet more uplifting is the fact my model came in 19th place out of the 735 participants in the competition! Not enough to crack the top ten and give me a chance at the $20,000 prize but satisfying nevertheless.


In a real-world setting…

  • …we’d typically have a lot more knowledge about the data we were modeling prior to reaching that step. This could aid tremendously in the feature engineering/selection phase (although tree-based methods like the one we used here inherently have feature selection built into their algorithm). We’d also investigate our features through visualization which makes it much easier to determine the relationships between our features and the outcome as well as between the features themselves. Azure unfortunately does not have any built-in modules for exploring data through visualization although both the R and Python modules may allow this (I have yet to try).
  • …we’d probably weigh the value of our model based on its sensitivity (how well it predicts true positives) or specificity (how well it predicts true negatives) rather than overall accuracy. The reason for this is that sometimes the cost of incorrectly classifying a case as positive, and vice versa, will have a bigger impact on the model’s value. For instance, if you build a model that identifies potential purchasers of your product within a zip code for a direct mail campaign, you’ll be more concerned with the model’s sensitivity than its specificity — you want to maximize the number of buyers and are probably not too concerned with a few false positives (Type I Error) ending up in the distribution. Alternatively, a model used to predict the malignancy of potentially cancerous cells may want a more balanced sensitivity/specificity since a false negative (Type II Error) could lead to potentially fatal results — accuracy might be the appropriate measure.
  • …our final model would need to balance parsimony with business need. A great example of a model that failed in this regard is the Netflix Prize which in order to achieve such great results utilized an ensemble of 107 algorithms! Yet the bigger issue came with the shift in Netflix’s business model from mail delivery DVD/Blu-ray to streaming. This completely changed how people chose what to watch now that they could “sample” content before watching. A model is only as good as its reflection (and impact) on real-time business needs!
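The sensitivity/specificity distinction from the second bullet can be computed directly from true and predicted labels (the labels below are illustrative):

```python
# Sensitivity (true positive rate) and specificity (true negative rate)
# from a confusion matrix.

def sensitivity_specificity(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)       # how well we catch actual buyers
    specificity = tn / (tn + fp)       # how well we rule out non-buyers
    return sensitivity, specificity

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
# sens = 0.75 (one buyer missed), spec = 0.5 (two false positives)
```

For the direct-mail example above, this model’s two false positives cost a couple of wasted mailers, while its one false negative is a missed buyer — which error matters more depends on the business.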

One of the oldest uses of market research is to determine who will or won’t buy a product. Traditionally, this type of study would involve long, tedious surveys, minimally informative focus groups, and a turnaround time of several weeks or months. If the Netflix Prize teaches us anything, it’s that business does not stand still for long anymore, and today’s research insights may already be invalid before they are presented to a client. I was able to put together this simple model by myself over the course of one week, and that’s only because I worked on it part time. Writing this post took nearly the same amount of time!