Metrics for Ads Click-Through Rate (CTR) Models
In this article, we cover the problems of ROC-AUC and NDCG, variations of log loss, the design of an auction simulator, and other issues on the topic
Introduction
Digital advertising is a huge industry with multi-billion dollar revenues. Ads allocation is a dynamic process and is based on user feedback. Click-through rate (CTR) models hold a very important place in advertising systems.
There is a huge variety of advertising systems: ad networks spread across thousands of web sites, and systems embedded in web platforms such as search engines, social networks and marketplaces. All these types of systems work differently, which is why there is no single recipe for developing a proper CTR model, and metric selection is one of the questions to be solved.
In this article I review research papers from top tech companies on metric selection for ads CTR models.
Disclaimer: Most advertising systems consist of two (or more) stages: candidate selection and subsequent ranking. Here I consider only the ranking stage.
Disclaimer 2: Some of the mentioned papers are 5+ years old. But in my opinion they are still quite relevant, since recent papers emphasise new model architectures rather than metric design.
Notes on Ads Auctions
Ads CTR models are tightly connected to auctions. Most advertising systems have an allocation mechanism based not only on click probabilities but also on bids. These bids may be provided by advertisers explicitly (sometimes via complex rules) or can be set by an autobidding algorithm.
There are several types of auctions used in practice, such as the first price auction, the second price auction (you may also see the term GSP, generalized second price auction) and the VCG (Vickrey-Clarke-Groves) auction. The details of these auctions are beyond the scope of this article. If you are interested, you may check out Tim Roughgarden’s lectures on algorithmic game theory. The first two lectures give a great overview of the topic.
Why metric selection for ads CTR models is a difficult question
The choice of metrics for evaluating ads CTR models differs dramatically from the classical ML case. There are a number of reasons:
The ranking power of the model is important, but it is not the only thing to consider. Because ads are allocated through an auction, CTR models need proper calibration: the predicted CTR gets multiplied by the bid during allocation, so the accuracy of the predicted probabilities is essential
The ultimate goal of an ads system is to maximize its revenue. This creates a desire to use money-related metrics for CTR models. However, accurately replicating real-world auctions is challenging, which makes such offline metrics hard to obtain
Estimating CTR for sponsored search can be tricky due to position bias. Even with explicit feedback on ad views, users tend to click less on the lowermost items, which is sometimes termed "position fatigue"
An overview of existing metrics
First of all, there are two types of metrics used for ML models:
Offline metrics. Metrics that are evaluated at the research stage, without tests on live traffic
Online metrics. Metrics that are evaluated in A/B tests and can be monitored in production
The choice of online metrics is a relatively simple question. There are a number of obvious metrics:
Ads system revenue
Average observed CTR
CTR@k for the sponsored search ads
Offline metrics are the main topic of this article. In Predictive model performance: offline and online evaluations, Bing, 2013 the authors provide the following classification of offline metrics:
Probability-based. Mainly ROC-AUC and its variations
Log Likelihood-based. Logloss, Relative Information Gain (RIG), etc
Prediction Error (PE). Root Mean Square Error, etc
DCG-based: NDCG, etc
Information Retrieval (IR): Precision/Recall, F1-score, etc
I would personally add calibration and money-based offline metrics to the list. They will be described in the next sections.
Why Prediction Error, DCG-based and Information Retrieval metrics are probably poor choices
Prediction Error metrics are hard to apply to ads CTR models. First, they don’t evaluate the quality of ranking. Second, they do not fit Bernoulli (0/1) random variables well. If you calculate the error between the predicted CTR and the click/not-click target, then it is better to use logloss, which is designed for Bernoulli random variables. If you calculate the error between the predicted CTR and the average historical CTR, then you will face many items that have zero or very few clicks.
NDCG (Normalized Discounted Cumulative Gain) is a good metric for ranking models. It can be calculated query-wise for search ads ranking.
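For a list of k ranked ads, the standard definitions are:

DCG@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}, \qquad NDCG@k = \frac{DCG@k}{IDCG@k}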
Here rel stands for relevance. That’s because NDCG is traditionally used with relevance scores, which can be arbitrary values (say, 1 to 5). In the case of CTR models, rel equals 1 for a click and 0 for a non-click. IDCG stands for ideal DCG, i.e. the DCG of the ideal ranking. See the details on Wikipedia.
The main problem of NDCG is that it is designed to prefer a ranking algorithm that places ads with higher CTR at earlier ranks. But an auction ranks ads in a different way, for example by (CTR * click_bid). This makes NDCG an inappropriate metric for an advertising system.
IR metrics. I have not seen IR metrics used in practice for ads CTR models at the ranking stage. I would say that precision and recall are more suitable for candidate generation than for ranking. It may be very informative to measure recall@k to evaluate how well the candidate generation stage finds relevant ads. But such an approach does not show how well the most profitable ads are pushed to the top.
ROC-AUC: problems and variations
ROC-AUC (Receiver Operating Characteristic Area Under the Curve) is a common ML metric for classification tasks. It is widely used for ads CTR models, although it is far from an ideal metric.
In Practical Lessons from Predicting Clicks on Ads at Facebook, 2014 and Predictive model performance: offline and online evaluations, Bing, 2013 the authors analyze the downsides of ROC-AUC:
ROC-AUC ignores the predicted probability values, which means it does not consider calibration
If used without adjustments, ROC-AUC is computed over the entire dataset, mixing users and queries. Because of that, a higher ROC-AUC does not necessarily mean better ranking for a specific user or query
ROC-AUC treats omission and commission errors equally, which is not appropriate for ads systems. In the context of sponsored search, not placing an optimal ad (an omission error) is more critical than placing a sub-optimal ad (a commission error)
ROC-AUC is highly dependent on the average CTR of the dataset. This means that you cannot compare ROC-AUC across ad placements with different average CTRs
The first problem can be solved by using additional metrics for calibration. The third and the fourth ones are significant but not critical.
The second problem is the critical one, but there are some possible adjustments. ROC-AUC can be calculated over groups and then averaged. You can see this in the following papers:
In Optimized Cost per Click in Taobao Display Advertising, 2017 the authors propose the Group AUC (GAUC) metric. The idea is to calculate AUC over user-position groups and then compute a weighted average proportional to group impressions
In Soft Frequency Capping for Improved Ad Click Prediction in Yahoo Gemini Native, 2023 the authors propose Stratified AUC (sAUC), which averages per-session AUC weighted by the number of clicks
The aggregation principle can be adjusted to the specifics of your ads system; a minimal sketch of such a group-wise AUC is given below. The only problem with this approach is that it does not work for groups with zero clicks or with 100% clicks, where AUC is undefined.
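As an illustration, here is a minimal sketch of such a group-wise AUC in Python. The column names (user_id, position, clicked, predicted_ctr) and the impression-count weighting are my assumptions; adapt them to your logs and to the paper you follow.

import pandas as pd
from sklearn.metrics import roc_auc_score

def group_auc(df: pd.DataFrame, group_cols, label_col="clicked", score_col="predicted_ctr"):
    # Weighted average of per-group ROC-AUC; weights are group sizes (impressions)
    aucs, weights = [], []
    for _, group in df.groupby(group_cols):
        # AUC is undefined when a group contains only clicks or only non-clicks, so skip it
        if group[label_col].nunique() < 2:
            continue
        aucs.append(roc_auc_score(group[label_col], group[score_col]))
        weights.append(len(group))
    return sum(a * w for a, w in zip(aucs, weights)) / sum(weights)

# GAUC over (user, position) groups, in the spirit of the Taobao paper:
# gauc = group_auc(logs, group_cols=["user_id", "position"])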
Log likelihood-based metrics: Logloss and its variations
Logloss (the same as cross-entropy, CE) is actually a natural metric that considers both the ranking and the calibration of probability scores.
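As a reminder, for clicks y_i \in \{0, 1\} and predicted probabilities p_i:

CE = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log p_i + (1 - y_i) \log (1 - p_i) \right)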
The problem with logloss is that it has no human-understandable interpretation. Say you have a logloss value of 3.1: by itself, it means nothing.
Also, in Predictive model performance: offline and online evaluations, Bing, 2013 the authors mention that different CTR predictions contribute differently to the metric. According to the authors, over-estimation of click probability has, in practice, less impact on online metrics.
The first popular metric is RIG (Relative Information Gain). Here are some details on its calculation. First, let’s define the entropy, which is calculated for the whole dataset:
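H(\bar{p}) = -\bar{p} \log \bar{p} - (1 - \bar{p}) \log (1 - \bar{p})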
Here \bar{p} (p with a bar) stands for the empirical (average) CTR of the dataset. Then RIG shows the value of the cross-entropy/logloss relative to this entropy:
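RIG = \frac{H(\bar{p}) - CE}{H(\bar{p})} = 1 - \frac{CE}{H(\bar{p})}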
There are some modifications of entropy-related metrics that are used in practice across different papers.
In Practical Lessons from Predicting Clicks on Ads at Facebook, 2014 the authors propose the Normalized Entropy (NE) metric, or Normalized Cross-Entropy to be more precise. This metric is very similar to RIG: it is the logloss normalized by the logloss obtained with the average empirical CTR used as the prediction:
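NE = \frac{-\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log p_i + (1 - y_i) \log (1 - p_i) \right)}{-\left( \bar{p} \log \bar{p} + (1 - \bar{p}) \log (1 - \bar{p}) \right)} = \frac{CE}{H(\bar{p})}

With the definitions above, RIG = 1 - NE, so the two metrics carry the same information (for NE lower is better, for RIG higher is better).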
Normalized metrics such as RIG and NE make cross-entropy comparable across different datasets.
In Click-through Prediction for Advertising in Twitter Timeline, 2015 the authors propose using Normalized Relative Information Gain. Normalized RIG means RIG computed on prediction scores that are normalized so that their average equals the empirical CTR.
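One simple way to perform such a normalization (this is my assumption; the paper may use a different scheme) is multiplicative rescaling of the scores before computing RIG:

p_i' = p_i \cdot \frac{\bar{p}}{\frac{1}{N} \sum_{j=1}^{N} p_j}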
The drawback of this metric is that the normalization removes the ability to evaluate calibration.
Calibration metrics
Calibration metrics should be considered additional metrics for CTR models. For example, you can use logloss or ROC-AUC as the main metric and a calibration metric as a secondary one.
In Practical Lessons from Predicting Clicks on Ads at Facebook, 2014 the authors mention a simple calibration metric, which is just the ratio between the average empirical CTR and the average predicted CTR. I find this metric to be very crude. In practice, you can have good average calibration but poor calibration in some categories or ad positions.
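In formula form, the calibration ratio is simply:

calibration = \frac{\bar{p}_{empirical}}{\bar{p}_{predicted}} = \frac{\frac{1}{N} \sum_i y_i}{\frac{1}{N} \sum_i p_i}

and it should be close to 1 (the direction of the ratio does not really matter).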
In practice, calibration is a less studied topic among data scientists compared to classical metrics such as ROC-AUC. Here I give some extended information without references to research papers, based purely on my own experience.
I propose using a Percentage Calibration Error (PCE) metric, which is calculated in the following manner:
Predicted CTR values are sorted and aggregated into ten bins (or some other number if you find it more appropriate)
For each bin, the average predicted CTR and the average empirical CTR are calculated
The final metric is the average percentage error between the two:
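PCE = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| p_i - \bar{p}_i \right|}{\bar{p}_i}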
Here p_i is the average predicted CTR of bin i and \bar{p}_i is the empirical one. PCE can also be calculated for various slices of the dataset, for example for different ad categories, query breadths and positions.
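A minimal sketch of the computation in Python (using pandas; the equal-frequency bins via qcut are my choice here, equal-width bins would also work):

import numpy as np
import pandas as pd

def percentage_calibration_error(y_true, y_pred, n_bins=10):
    # Bin by predicted CTR, then average |avg predicted - avg empirical| / avg empirical over bins
    df = pd.DataFrame({"y": y_true, "p": y_pred})
    df["bin"] = pd.qcut(df["p"], q=n_bins, duplicates="drop")
    stats = df.groupby("bin", observed=True).agg(pred=("p", "mean"), emp=("y", "mean"))
    stats = stats[stats["emp"] > 0]  # skip bins without clicks to avoid division by zero
    return float(np.mean(np.abs(stats["pred"] - stats["emp"]) / stats["emp"]))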
Also, Probability Calibration Curves can be used to better understand how your model behaves. For examples, check out the scikit-learn page on this topic.
Money-based offline metrics
None of the metrics mentioned above considers the auction nature of the system. The most obvious way to evaluate how the auction will perform under a new CTR model (and to get results closer to online experiments) is to run a simulation. Developing an auction simulator is a hard technical task for big web platforms with billions of daily ad impressions. Here are some references from research papers.
In Predictive model performance: offline and online evaluations, Bing, 2013 the authors propose auction simulation to estimate CTR model performance. They re-run the auction with the new model’s predictions on historic data, and auction clicks are simulated based on historic CTR.
During the simulation, user clicks are estimated using the historic clicks of the given (query, ad) pair available in the logs, in the following manner (a rough sketch in code follows the list):
If the (query, ad) pair is found in the logs at the same display location, the historic CTR is calculated as-is based on these data
If the (query, ad) pair is found at a different display location (position), a position correction is applied
If the (query, ad) pair is not found in the logs, the average CTR for the position is used
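A rough sketch of this fallback logic (the lookup tables and the multiplicative position correction are my assumptions; the paper does not prescribe an exact implementation):

def simulated_ctr(query, ad, position,
                  ctr_by_query_ad_position,   # (query, ad, position) -> historic CTR
                  ctr_by_query_ad,            # (query, ad) -> (historic CTR, historic position)
                  position_factor,            # position -> average positional click propensity
                  avg_ctr_by_position):       # position -> average CTR at that position
    # 1. Exact match: the (query, ad) pair was shown at the same display location
    if (query, ad, position) in ctr_by_query_ad_position:
        return ctr_by_query_ad_position[(query, ad, position)]
    # 2. The pair was shown at a different location: apply a position correction
    if (query, ad) in ctr_by_query_ad:
        hist_ctr, hist_position = ctr_by_query_ad[(query, ad)]
        return hist_ctr * position_factor[position] / position_factor[hist_position]
    # 3. The pair is not in the logs: fall back to the average CTR of the position
    return avg_ctr_by_position[position]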
In Offline Evaluation of Response Prediction in Online Advertising Auctions, Criteo, 2015 the authors propose a more theoretical approach: a simulation of a single-item second price auction (with truthful bids). They evaluate the profit of the auction. The approach is fairly sophisticated, so here I provide only the main ideas.
The profit of the auction is defined through the difference between the winning bid and the second bid:
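Roughly, the utility of a single auction can be written as:

U = (a \cdot v - c) \cdot \mathbb{1}\left[ p \cdot v > c \right]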
Here a is the actual click (0 or 1), v is the winning bid (and the true value, by design), c is the second bid and p is the predicted CTR. The indicator function encodes the condition that the winning bid (p * v) exceeds the second bid, i.e. that the auction is won.
The authors propose to run the simulation on the historical data of auction winners. The problem with such an approach is that the metric does not penalize over-prediction: for example, a model that outputs a CTR of 1.0 for every item would look optimal. To handle this problem, the authors propose a trick.
They introduce the Expected Utility (EU), a function that integrates over the distribution of the second bid:
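In a simplified form:

EU(p) = \int_{0}^{p \cdot v} (a \cdot v - c) \, \Pr(c) \, dc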
Here Pr is the distribution of the second bid.
The idea is that drawing the second bid from a suitable distribution penalizes over-prediction. If the predicted CTR is very high, the winner also wins auctions with a very high second price, which leads to negative EU for such auctions and decreases the metric.
In the paper, the authors consider different distribution functions and compare the simulation results with online experiments at Criteo.
Conclusions
We have gone through a huge variety of metrics in this article, and it may still be unclear which ones to use. Here are some takeaways:
The choice of online metrics should (in most cases) be straightforward: use online CTR and revenue
Logloss and its variations, such as Normalized Entropy, should be considered first-choice metrics, since they account for both ranking power and calibration
ROC-AUC can still be used as a metric, but you should use adjusted versions of it, such as grouped or stratified AUC
Consider using the Percentage Calibration Error metric to control calibration more precisely
An auction simulator can provide great support in CTR model evaluation, but building one is a rather difficult technical task
References in one place
Research papers
Soft Frequency Capping for Improved Ad Click Prediction in Yahoo Gemini Native, 2023
Optimized Cost per Click in Taobao Display Advertising, 2017
Click-through Prediction for Advertising in Twitter Timeline, 2015
Offline Evaluation of Response Prediction in Online Advertising Auctions, Criteo, 2015
Practical Lessons from Predicting Clicks on Ads at Facebook, 2014
Predictive model performance: offline and online evaluations, Bing, 2013
About me
I work as a Data Science Team Lead at Avito, a digital C2C marketplace. I work in the AdTech department and am responsible for ads mechanism design in both search and recommendations. We have over 1 million active promoted items and thousands of search RPS.