In this tutorial, we will try to predict click-through rate of ads with the Decision Tree algorithm we learnt in the last post. Before continuing, I would recommend you to first read that post for a theoretical understanding of Decision Trees.
What does Click-Through Rate Prediction mean?
Let’s assume that you are designing an algorithm for a search engine, and your task is to maximize revenue by displaying the best ads — ads that are both related to search results and are most likely to be clicked — at the top of the search results. How would you do it?
You might say that revenue can be maximized by first filtering on relevancy and then displaying the ad from the highest bidder at the top. There is a problem with this solution. Most advertisers pay for the clicks and not for ad views — they want to be sure that ads lead to their website — so just displaying the ad from the highest bidder wouldn’t maximize revenue. Under this cost-per-click model, advertisers will only be charged if the users actually click on their ads. So how to correctly approach this problem?
Let’s say you have two ads:
- A 1.00$ ad with a 10% probability of being clicked
- A 2.00$ ad with a 1% probability of being clicked
Which one would provide the most revenue? The expected revenue (cost * click-probability) for the first ad is higher than that of the second one, even though its dollar-cost is lower, because there is a higher chance that the user will click the first ad. In other words, we can look at the problem of maximizing revenue in terms of accurately predicting the probability that a given ad will be clicked i.e. “click-through rate” (CTR).
In this tutorial, I’ll walk you through an example of predicting CTR. We will use the dataset from a Kaggle competition, Click-Through Rate Prediction, sponsored by Avazu.
2. Exploratory Data Analysis
Our dataset comprises of the following features:
- id: ad identifier
- click: 0 for non-click, 1 for click
- hour: in the format of YYMMDDHH
- C1: some anonymized categorical variable e.g. 1002
- banner_pos: where a banner is located, 1 and 0
- site_id: site identifier
- site_domain: hashed site domain
- site_category: hashed site category e.g. 28905ebd
- app_id: mobile app identifier
- device_id: mobile device identifier
- device_ip: IP address
- device_model: hashed model e.g. iPhone 6, Samsung
- device_type: hashed device type e.g. tablet, smartphone
- device_conn_type: hashed type of connection e.g. Wi-Fi, 4G
- C14-C21: some more anonymized categorical variables
Anonymized and hashed values: these are categorical features and their values correspond to some real and meaningful values. They are presented this way for privacy concerns.
We will be using Python 3 and the following packages:
numpy– allows efficient numerical computations
pandas– provides data structures for data analysis
scikit-learn– for machine learning algorithms
matplotlib– for plotting data
seaborn– for extra plot types and for more elegant and readable plots
So let’s start by importing libraries:
Our source files are in compressed format but no worries! We can use some python magic to unzip files into .csv format as here:
Now that we have unzipped the source file, let’s try to understand what story the data is telling us. We are going to take only the first 100,000 samples from the train.csv file (unzipped from the train.gz). Working with a smaller subset allows to make faster calculations, which works for the purpose of this tutorial.
We can obtain the column headers of our data by using the .columns command:
The click column, in the snippet above, is our target variable, and the other columns are our potential features.
Let’s take a quick look at the first rows to see how our data looks like:
From the data description provided to us, we know that click=0 means the ad was not clicked, and click=1 means the ad was clicked.
We can use pandas df.describe() function to see the statistical distribution of our variables (count, mean, min, max, quartiles). These can also be computed separately:
df.count(), df.min(), df.max(), df.median(), df.quantile(q) . Especially, we would like to see the summary statistics of our target variable, click.
From the mean value, we can see that the number of ads clicked was just 17.5% — value of click can only be 0 or 1, so the mean value is also the click ratio.
Now let’s explore our data a bit more. device_type seems to be an interesting variable, so let’s try to understand it better. We can separate out our feature of interest, device_type, and our target variable, click, as below:
We can see that there are four types of devices, with type 1 being the most prevalent.
If we compare the click ratios per device_type, we can see that there is a marked difference in the average click-through rate depending on device_type — highest click ratio is for device_type_0, 22.7%, and lowest is for device_type_4, 7.3%. This tells us that device_type is a good feature for predicting our target variable.
As an exercise, try to analyze some other features, and see if you can find some interesting insights or some other good features.
Correlations between attributes
Correlation is a statistical technique that can tell us whether pairs of variables are related. For example, your calorie intake and weight are related; people with a high calorie intake tend to be heavier. In an ideal situation, we would have an independent set of features, but real data is unfortunately not ideal. So it is useful to know whether some pairs of attributes are correlated and by how much.
Pandas allows us to easily get correlation coefficients. Pearson correlation coefficient is the most common one, and it is used to test for linear relationships between data. In short, it returns pairs of all attributes and their correlation coefficients in range [-1; 1], where 1 indicates positive correlation (i.e. both variables follow the same trend), -1 negative correlation (i.e. when one variable increases, the other decreases) and 0 means no relationship between variables at all.
We can obtain correlations among variables with a single line of code (Python awesomeness!). We can also plot the value of these correlations as a heatmap — a great way to get a quick overview of the relationships between attributes.
The heatmap allows us to easily see that:
- the highest positive correlation (red squares) is between C1, banner_pos and device_type. Correlation between the position of the banner and type of device seems to make logical sense as well.
- we have negatively correlated variables (blue squares) — C18 and C21.
- the variables C16 and C21 are highly correlated to our target variable, click.
- id is not correlated to any of the features, as one would expect.
Now we can drop some non-informative variables:
Let’s check if we have any missing values:
All the 0s above (counts of missing values per variable) indicate that we do not need to worry about dealing with missing data.
If we look at our data types, we can see that all of our variables are not numerical — we have some categorical variables (ones with type = object).
Categorical variables are variables that fall into a specific category; they have descriptive and non-quantitative values. ‘device_model’ and ‘site_category’ are such examples. It is possible to represent each possible value for a category as a separate feature by a technique called One-hot encoding.
For example, the site_category feature has three possible values. If we assume that its three possible values are news, education, and sports, one-hot encoding will convert them into three binary features, is_news, is_education, and is_sports.
Before we can proceed, we need to transform our categorical variables (stored as text values) into vectors. The reason we need such transformations is because many ML algorithms, including Decision Trees in Scikit-Learn, do not support them directly — they require numerical inputs. Pandas can help us to easily perform these transformations via the get_dummies() function as below:
As you can see, the number of columns has now increased from a mere 24 to 4060!
For more detailed data exploration and data visualization, please refer to the previous hands-on tutorial. In this tutorial, I have only gone through the essentials to solve the problem at hand. I have skipped some of the steps that were thoroughly explained in the past post (e.g. an in-depth analysis of correlations among attributes and a detailed look at the distribution of variables).
3. Predictive Modelling
So let’s move to predictive modelling. We want to create a model that can generalize to new or future data well. To this aim, we will split our data into training set and testing set.
We are going to use a 80/20 split — 80% of the data is used for training and 20% for testing:
Now let’s create a basic Decision Tree model, fitted on
Y_train, using Scikit-learn. In the first step, we are going to use the default values of the parameters for the DecisionTreeClassifier. For example, as you can see in the snippet below, default value for max_depth parameter is “None”. Please refer to the theoretical post on Decision Trees for understanding the meaning of the main parameters.
Now we will predict the likelihood of clicks for the test set (unseen cases) by passing the test data to the “fitted model”, and we will store the output probabilities in y_predict. Then we are going to evaluate the model by comparing the predicted probabilities versus the actual values — y_predict vs Y_test — using appropriate evaluation metrics:
We can see that we have achieved an accuracy score of 0.81.
Let’s look at another evaluation metric, AUC value: A ROC curve plots the True Positive Rate (i.e. actual equals clicked and predicted equals clicked) against the False Positive Rate (i.e. predicted equals clicked but actual equals not-clicked). One way to compare classifiers is to measure the area under the ROC curve — AUC value. A perfect classifier will have AUC value equal to 1, whereas a purely random classifier will have AUC equal to 0.5. Scikit-Learn provides a function, roc_auc_score, to compute the AUC value:
The AUC we have achieved is 0.65. Let’s see if we can improve our performance by tweaking the values of the parameters for the Decision Tree i.e. by not using default values.
GridSearchCV is a technique that performs an exhaustive search over the best set of parameters. Our only job is to specify the parameters we want to tune and the values we want to explore. Then we can use the optimal values (output from GridSearchCV) as input parameters for our classifier; this in return allows us to obtain the “best” model.
Let’s try to explore the max_depth paramter:
The best value for max_depth according to GridSearch is 5. Now let’s get the evaluation metrics for our “optimal” model (given by grid_search.best_estimator_):
As you can see in the snippet above, we have improved our performance by just tweaking one parameter — AUC value has increased from 0.65 to 0.68! However, it is still not a great value, but click-through prediction is a difficult problem.
Note that although accuracy has increased, the True Positive Rate has decreased, so optimizing for accuracy is not always the correct approach. Accuracy is an intuitive and simple measure, but when we have highly skewed data, it is not a good measure. A model that always assigns the value of the most prevalent class to new instances (e.g. the model takes the easy path and always predicts not-clicked) would have a higher accuracy than a model that tries to assign some values from the other class as well (e.g. it predicts some output labels as clicked).
Also note that Decision Trees tend to cause overfitting — it is likely that the optimal values for the splits at each step only work for the training instances and not on new data. In practice, one could compare different algorithms and then pick the best model.
Anyway, now that we have the click-probabilities, finding the ad that maximizes revenue is easy — multiply these probabilities by the cost-per-click and sort the results in descending order.
Visualizing the Decision Tree
As a last step, we are going to use an open source visualization library called Graphviz for visualizing our Decision Tree. Graphiz is widely used in networking applications to visualize connections between switch hubs and networks. In ML, it can be used to visualize Decision Trees and neural networks.
This is not really the end, it’s your turn now! For example, we have not done any feature engineering in this post. We just removed some variables like timestamp — one could apply some transformations and see if there is some interesting correlation between e.g. TimeOfDay and clicks. So try to do some feature engineering by creating new features. You could also try some other algorithms, and use the evaluation metrics to compare their relative performances. Have fun!
Thanks for reading and remember to click follow for receiving the latest posts! 🙂