
Data Science for Targeted Advertising: How to Display Relevant Ads by Leveraging Past User Behavior

The online advertising industry is bigger than you think

Online advertising has grown enormously over the past decade. It is a huge industry, with brands expected to spend more than 30 billion dollars on it in 2014. Online advertising gives companies instant feedback and gives publishers more knowledge about their users. Advertisers are very interested in precisely targeted ads: they want to spend as little money as possible and get the largest possible increase in profit. Targeted advertising addresses this need. The problem is to determine where, when and to whom a particular advertisement should be displayed on the Internet. Advertising systems deliver ads based on demographic, contextual or behavioral attributes. One example is sponsored search. It is the most profitable business model on the web and accounts for a large share of the income of the top search engines Google, Yahoo and Bing, generating at least 25 billion dollars per year.

There are several practical methods for targeted advertising:

  • Demographic Targeting – this approach defines the target audience by gender, age, income, location, etc. It is an old and efficient approach, because it is easy to project behavior for product categories. Demographic targeting is popular since it is easy to understand and implement. It gives the advertiser transparency and control over the audience selected for targeting.
  • Property Targeting – a simple and popular targeting mechanism. The advertiser specifies a set of pages where the ad should be shown. For example, a company that sells trucks could show its advertisement on a website about vehicles.
  • Behavioral Targeting – serves ads to users by leveraging their past behavior (searches, site visits, purchases). The most valuable resource for behavioral targeting is the network traffic of a particular user. The more such data you have, the better the targeting results you will achieve. Thus, even local ISPs can provide more accurate ads for consumers than Google or Yahoo.

Real-time bidding exchanges – the de facto standard for targeted advertising

The online advertising industry has grown significantly during the past few years, with extensive use of real-time bidding (RTB) exchanges. These auction platforms allow advertisers to bid in real time on the opportunity to place online display ads. Advertisers integrate with the exchange via an API and collect a variety of data to decide whether or not to bid and at what price. This has created a simple and efficient method for companies to target advertisements to particular users. By industry convention, showing a display ad to a consumer is called an “impression”. The auction is triggered the instant a user navigates to a web page and takes place while the page is loading in the user’s browser. During the auction, information about the location of the potential advertisement, along with user information, is passed to bidders in the form of a bid request. This data is often supplemented with information the advertiser has previously collected about the user. When an auction starts, each potential advertiser decides whether it wants to bid on the impression, at what price, and which advertisement to show if it wins the auction. There are billions of such real-time transactions each day, and advertisers need large-scale solutions to handle these auctions in milliseconds.

Such a complicated ecosystem is a perfect opportunity for applying machine learning techniques, which play a key role in optimizing ad bidding, increasing targeting accuracy and reaching the ultimate goal from the marketer’s perspective: “Address the right browsers with the right message at the right moment and preferably at the right price”.
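
To make the bidding decision concrete, here is a minimal Python sketch of how a bidder might turn a model's purchase estimate into a bid. It is an illustration under stated assumptions, not how any particular exchange or bidder works: the BidRequest fields, the floor_price parameter and the rule "bid the expected value of the impression" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BidRequest:
    """Hypothetical, simplified bid request; real exchanges send many more fields."""
    user_id: str
    page_url: str
    floor_price: float  # assumed minimum acceptable bid for this impression


def decide_bid(request: BidRequest,
               purchase_probability: float,
               value_per_purchase: float) -> Optional[float]:
    """Return a bid price for one impression, or None to skip the auction.

    Assumption: the bid equals the expected value of the impression, i.e. the
    estimated purchase probability times the advertiser's value of a purchase.
    """
    expected_value = purchase_probability * value_per_purchase
    if expected_value < request.floor_price:
        return None  # the impression is not worth the minimum price
    return expected_value


# Example with made-up numbers: a 0.1% purchase estimate and a purchase worth 50.
request = BidRequest(user_id="u1", page_url="cars.example/reviews", floor_price=0.01)
print(decide_bid(request, purchase_probability=0.001, value_per_purchase=50.0))
```

In a sketch like this, the quality of the purchase probability estimate drives both whether to bid and at what price, which is where the machine learning models discussed next come in.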

Improving ad relevance by applying machine learning techniques

The main task of a machine learning system is to identify prospective customers: online users who have a higher propensity to purchase a specific product in the near future after being shown the advertisement. The ultimate goal is to build a system that automatically learns a predictive model for each ad campaign. One of the challenges of building such systems is that different ad campaigns may have different performance measures. However, each of these criteria can be approximately represented as a ranking of potential purchasers by purchase propensity. A primary source of input features for behavioral targeting is the user’s browsing history, recorded as the set of web pages visited in the past. The target labels can be specific to each campaign and based on actual purchases of the specific product. At a high level, this looks like a straightforward predictive modeling problem. On closer inspection, however, it turns out to be impossible to obtain the necessary amount of training data directly for this problem. First, the probability of a purchase in the 7 days after seeing the ad is very low, ranging from 0.0000001 to 0.001 depending on the advertising campaign. Second, the input feature vector includes more than one million features even in the simplest case (consider a user’s browsing history encoded as a set of hashed URLs). These properties of the dataset make training difficult; however, there are efficient approaches designed to predict consumer purchase propensity under such difficult circumstances.
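
A quick back-of-the-envelope calculation shows why positive examples are so scarce. The Python sketch below uses the purchase rates quoted above; the target of 1,000 purchases is an arbitrary number chosen for illustration, and impressions_needed is a hypothetical helper name.

```python
def impressions_needed(target_purchases: int, purchase_rate: float) -> float:
    """Expected number of impressions needed to observe `target_purchases` purchases."""
    if not 0.0 < purchase_rate <= 1.0:
        raise ValueError("purchase_rate must be in (0, 1]")
    return target_purchases / purchase_rate


# Purchase rates from the text: from 0.0000001 up to 0.001 per impression.
for rate in (1e-7, 1e-3):
    needed = impressions_needed(1000, rate)
    print(f"purchase rate {rate:g}: about {needed:,.0f} impressions for 1,000 purchases")
```

At the low end of that range, collecting even a thousand positive examples for a single campaign takes on the order of ten billion impressions, which is why proxies for purchases are attractive.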

Site visits as better purchase predictors than click-through rate (CTR)

As noted above, a purchase after seeing an advertisement is a rare event. This means the model must be trained on a highly imbalanced class distribution (skewed classes). The simplest and most widely used remedy is to train the model on a proxy for purchases. Currently, the most common proxy is a click on the advertisement. The efficiency of campaigns is often evaluated by “click-through rate” (CTR), and as a result campaigns are optimized towards higher CTR. In this approach, clicks on the advertisement are treated as positive samples: the model is trained on clicks instead of conversions, but the test set is still labeled by conversions. A recent study [1] tested this approach on 10 different ad campaigns. The results imply that targeting based on clicks does not necessarily maximize conversions.


Figure 1. Improvement in prediction accuracy by using conversions for training instead of clicks. Testing is done using conversions in both cases.

Are there other good proxy candidates for evaluating and optimizing advertising campaigns? Recent research [2] has answered this question. In contrast to clicks, site visits turned out to be generally good proxies for purchases. Specifically, site visits do remarkably well as the basis for building models to target browsers who will purchase after being shown the ad. In some cases, models trained on site visits even produce better results than models trained on conversions.


Figure 2. Distribution of AUC performance, with respect to purchase prediction, of models trained on clicks, site visits and purchases respectively.

The results show that site visitors are more likely than ad clickers to be purchasers.
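
Whichever proxy a model is trained on, it is evaluated against actual purchases, and AUC is the measure used in the comparison above. The following Python sketch shows one simple way to compute AUC from purchase labels and model scores. The labels and scores are made up purely for illustration and do not reproduce any result from the cited studies; the pairwise loop is chosen for clarity, not for efficiency.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up test set: 1 means the user purchased, 0 means no purchase.
purchased = [0, 0, 1, 0, 1, 0]
model_a_scores = [0.9, 0.2, 0.3, 0.8, 0.4, 0.1]  # hypothetical model A
model_b_scores = [0.2, 0.1, 0.9, 0.3, 0.7, 0.4]  # hypothetical model B

print("model A AUC:", auc(purchased, model_a_scores))
print("model B AUC:", auc(purchased, model_b_scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of purchasers above non-purchasers.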

Dimensionality reduction techniques improve model accuracy

As mentioned earlier, another difficulty in predicting purchases is the huge input feature space, which typically requires dimensionality reduction. In most cases an ad targeting system tracks over 100 million unique URLs, any of which could be used in a predictive model. It is very expensive to build and store such high-dimensional models. A number of dimensionality reduction techniques are available nowadays, but not all of them are well suited to the ad targeting problem.

The simplest method for reducing a massive binary feature space is feature hashing. It transforms a bag of words into a bag of hashed IDs. Given a set of tokens and a hash function h(), we apply the hash function to each token, and the new feature space is simply the set of hashed values. In other words, the hash function generates a column index for each token. The output range of the hash function should be large enough to avoid excessive collisions even with a million unique tokens. The pseudocode is as follows:

function hash_vectorizer(features : string array, N : integer):
     x := new vector[N]
     for f in features:
         h := hash(f)
         x[h mod N] += 1
     return x

Dimensionality reduction results from hash collisions. For example, if a set of URLs contains three URLs {u1, u2, u3} with h(u1) = 6, h(u2) = 6 and h(u3) = 8, then in the new space the hashed URL set has values only for features 6 and 8. Hash functions typically produce 32-bit or 64-bit values, so to project into an arbitrary k-dimensional feature space we compute h() mod k.
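
The pseudocode above can be turned into a small runnable Python sketch. One assumption is made here: SHA-256 from the standard library is used as the hash function so that results are stable between runs (Python's built-in hash() is salted per process for strings). Any well-distributed hash function would serve, and stable_hash is a hypothetical helper name.

```python
import hashlib
from typing import Iterable, List


def stable_hash(token: str) -> int:
    """Deterministic 64-bit hash of a token, taken from the first 8 bytes of SHA-256."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_vectorizer(features: Iterable[str], n: int) -> List[int]:
    """Map tokens (for example visited URLs) to a count vector of length n."""
    if n <= 0:
        raise ValueError("n must be positive")
    x = [0] * n
    for f in features:
        # Tokens whose hashes agree modulo n share a column (a collision).
        x[stable_hash(f) % n] += 1
    return x


# Made-up browsing history; the repeated URL is counted twice in one column.
history = ["cars.example/reviews", "news.example/sports", "cars.example/reviews"]
print(hash_vectorizer(history, 16))
```

The length n plays the role of k in the text: a smaller n gives a more compact vector at the cost of more collisions.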

Another approach is contextual categories. A number of sources on the web, both proprietary and free, categorize specific web pages by their content. These categories serve as content-based groupings that can be used to reduce the dimensionality of the data. With category data, the original feature space of URLs becomes a feature space of categories.
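
As a rough illustration of the category approach, the Python sketch below maps a browsing history to category counts. The URL_CATEGORIES lookup table, its URLs and its category names are all hypothetical stand-ins for whatever categorization source is actually available.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical URL-to-category lookup; a real one would come from a
# proprietary or free categorization source.
URL_CATEGORIES: Dict[str, str] = {
    "cars.example/reviews": "autos",
    "trucks.example/deals": "autos",
    "news.example/sports": "sports",
}


def category_features(urls: Iterable[str],
                      lookup: Dict[str, str],
                      unknown: str = "uncategorized") -> Counter:
    """Replace each visited URL by its category and count visits per category."""
    return Counter(lookup.get(url, unknown) for url in urls)


history = ["cars.example/reviews", "trucks.example/deals",
           "news.example/sports", "blog.example/post"]
print(category_features(history, URL_CATEGORIES))
```

Here four distinct URLs collapse into three features, and URLs missing from the lookup fall into a single catch-all category.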

There are many other techniques for dimensionality reduction, including Singular Value Decomposition (SVD) and Principal Component Analysis (PCA), which have proved to be good alternatives for reducing the huge URL feature space.


Conclusion

This article gave a brief overview of the targeted advertising business, a multibillion-dollar industry that is growing dramatically. Most of the big players in the online advertising market work with real-time bidding (RTB) systems, which connect advertisers and publishers. RTB acts as an online auction, allowing advertisers to bid in real time on the opportunity to show online display ads to a particular user. Currently, the key industry metric for measuring the success of an ad campaign is click-through rate (CTR); however, recent studies have shown that site visits are a better predictor of conversions than clicks. At first sight, machine learning for targeted advertising seems trivial. On closer inspection, however, one notices underlying difficulties, including rare conversions, a lack of training data and a high-dimensional input feature space. Nevertheless, a number of studies have identified efficient solutions to these difficulties that provide good models for predicting future conversion events.


References

  1. S. Pandey, M. Aly, A. Bagherjeiran, A. Hatch, P. Ciccolo, A. Ratnaparkhi, M. Zinkevich. Learning to target: What works for behavioral targeting.
  2. B. Dalessandro, R. Hook, C. Perlich, F. Provost. Evaluating and Optimizing Online Advertising: Forget the click, but there are good proxies.
  3. C. Perlich, B. Dalessandro, O. Stitelman, T. Raeder, F. Provost. Machine learning for targeted display advertising: Transfer learning in action.