Stats Accelerator: acceleration under time-varying signals


This is part two of a series on Stats Accelerator. In the first part, we explained the when, why, and how of Stats Accelerator. In today’s installment we will discuss a major roadblock to successful productionization of bandits in the A/B testing context and how we eventually overcame it. This is a high-level overview. For a more detailed treatment, see the technical white paper.

Last fall, Optimizely announced Stats Accelerator, a pair of multi-armed bandit algorithms for accelerating experimentation. These algorithms are designed to either optimize rewards for a period of time or identify a statistically significant variant as quickly as possible by intelligently changing the allocation of traffic between variations (or arms, in machine learning lingo) of the experiment.

However, when underlying conversion rates or means are varying over time (e.g. due to seasonality), dynamic traffic allocation can cause substantial bias in estimates of the difference between the treatment and the control, a phenomenon known as Simpson’s Paradox. This bias can completely invalidate statistical results on experiments using Stats Accelerator, breaking usual guarantees on false discovery rate control.

To prevent this, we developed Epoch Stats Engine, a simple modification to our existing Stats Engine methodology that makes it robust to Simpson’s Paradox. At its core is a stratified estimate of the difference in means between the control and the treatment. Because it requires no estimation of the underlying time variation and is also compatible with other central limit theorem-based approaches such as the t-test, we believe it lifts a substantial roadblock to combining traditional hypothesis testing and bandit approaches to A/B testing in practice.

What is time variation?

A fundamental assumption behind many A/B tests is that the underlying parameter values we are interested in do not change over time. When this assumption is violated, we say there is time variation. In the context of A/B experiments using Stats Accelerator, Simpson’s Paradox can only occur in the presence of time variation, so a precise understanding of time variation will be useful going forward.

Let’s take a step back. In each experiment we imagine that there are underlying quantities that determine the performance of each variation; we are interested in measuring these quantities, but they cannot be observed directly. These are parameters. For example, in a conversion rate web experiment we imagine that each variation of a page has some true ability to induce a conversion for each visitor. If we express this ability as the probability that each visitor will convert, then these true conversion probabilities for each variation would be the parameters of interest. Since parameters are unobserved, we must compute point estimates from the data to infer parameter values and decide whether the control or treatment is better. In the conversion rate example, the observed conversion rates would be point estimates for the true conversion probabilities of each variation.

Noise and randomness will cause point estimates to fluctuate. In the basic A/B scenario, we view these fluctuations as centered around a constant parameter value. However, when the parameter values themselves fluctuate over time, reflecting underlying changes in the true performance of the control and treatment, we say that there is time variation.

The classic example is seasonality. It is often reasonable to suspect that visitors to a website during the workweek behave differently than visitors on the weekends. Conversion rate experiments may therefore see higher (or lower) observed conversion rates on weekdays compared to weekends, reflecting corresponding time variation in the true conversion rates, which represents genuine underlying differences in how visitors behave on weekdays and weekends.

Time variation can also manifest as a one-time effect. A landing page with a banner for a 20%-off sale valid for the month of December may generate a higher-than-usual number of purchases per visitor for December, but then drop off after January arrives. This would be time variation in the parameter representing the average number of purchases per visitor for that page.

Time variation can take different forms and affect your results in different ways. Whether time variation is cyclic (seasonality) or transient (a one-time effect) suggests different ways to interpret your results. Another key distinction is how the time variation affects different arms of the experiment. Symmetric time variation occurs when parameters vary across time in such a way that all arms of the experiment are affected equally (in a sense to be defined shortly). Asymmetric time variation covers a broad swath of scenarios where this is not the case. Optimizely’s Stats Engine currently has a feature to detect strong asymmetric time variation and will reset your statistical results accordingly to avoid misleading conclusions, but handling asymmetric time variation in general requires strong assumptions and/or a wholly different type of analysis. This remains an open area of research.

In what follows, we will restrict ourselves to the symmetric case with an additive effect for simplicity. Specifically, we imagine the parameters for the control and treatment, θ C (t) and θ T (t), may be written as θ C (t) = μ C + f(t) and θ T (t) = μ T + f(t), so that each decomposes into a non-time-varying component (μ C or μ T ) and a common source of time variation f(t). The underlying lift may therefore be expressed as the non-time-varying quantity θ T (t) – θ C (t) = μ T – μ C .
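This additive model can be sketched in a few lines of code. The values of μ C , μ T , and f below are purely illustrative, not estimates from real data:

```python
# Sketch of symmetric additive time variation: both arms share the same
# f(t), so the lift theta_T(t) - theta_C(t) stays constant over time.
mu_C, mu_T = 0.10, 0.20  # illustrative non-time-varying components


def f(t):
    # Common source of time variation, e.g. a seasonal bump after day 30.
    return 0.05 if t >= 30 else 0.0


def theta_C(t):
    return mu_C + f(t)


def theta_T(t):
    return mu_T + f(t)


# The individual parameters move over time, but their difference does not.
for t in [0, 15, 30, 45]:
    assert abs((theta_T(t) - theta_C(t)) - (mu_T - mu_C)) < 1e-12
```

Even though θ C (t) and θ T (t) each jump at t = 30, the lift is 0.10 at every time point.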

Generally, symmetric time variation of this sort occurs when the source of the time variation is associated not with a specific variation but with the entire population of visitors. For example, visitors during the winter holiday season are more inclined to purchase in general. A variation that induces more purchases will therefore tend to maintain its edge over the control, even when the holiday season raises the overall conversion rate for both the variation and the control.

In general, most A/B testing procedures, such as the t-test and sequential testing, are robust to this type of time variation. This is because we are often less interested in estimating the individual parameters of the control and treatment than in the difference between them. Therefore, if both parameters are shifted additively by the same amount, the time variation is cancelled out once differences are taken, and any subsequent inference will be relatively unaffected. Using the notation above, this can be seen in the fact that the difference in the time-varying parameters, θ T (t) – θ C (t) = μ T – μ C , does not contain the time-varying factor f(t).
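Here is a small numerical sketch of that cancellation, using made-up monthly conversion rates where both arms shift upward by the same 10 points from November to December. Under a fixed 50/50 split, the pooled estimate of the lift is untouched by the shared time variation:

```python
# Two months; both arms' true rates shift by the same seasonal amount.
# November: control 10%, treatment 20%; December: control 20%, treatment 30%.
# With a constant 50/50 split of 1000 visitors per month, pooling across
# months still recovers the true 10-point lift.
nov = {"n_C": 500, "n_T": 500, "p_C": 0.10, "p_T": 0.20}
dec = {"n_C": 500, "n_T": 500, "p_C": 0.20, "p_T": 0.30}


def pooled_rate(arm, months):
    # Overall conversion rate for one arm, aggregated across all months.
    conversions = sum(m["n_" + arm] * m["p_" + arm] for m in months)
    visitors = sum(m["n_" + arm] for m in months)
    return conversions / visitors


lift = pooled_rate("T", [nov, dec]) - pooled_rate("C", [nov, dec])
print(lift)  # ~0.10: the seasonal shift cancels in the difference
```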

If the traffic split in an experiment is adjusted in sync with underlying time variation, then a disproportionate amount of high- or low-performing traffic may be allocated to one arm relative to the other, biasing our view of the true difference between the two arms. This is a form of Simpson’s Paradox, the phenomenon of a trend appearing in several different groups of data which then disappears or reverses when the groups are aggregated together. This bias can completely invalidate experiments on any platform by tricking the stats methodology into declaring a larger or smaller effect size than what actually exists. Let’s motivate this with an example.

Suppose the control converts at 10% in November and 20% in December, while the treatment converts at 20% in November and 30% in December, so that the true lift is a constant 10 percentage points. If traffic is split 50% to treatment and 50% to control (or any other fixed proportion, for that matter) for the entire duration of the experiment, then it is clear that the final estimate of the difference between the two variations should be close to 10%. What happens if traffic is split 50/50 in November but then changes to 75% to control and 25% to treatment in December? For simplicity, let’s assume that there are 1000 total visitors to the experiment in each month. A simple calculation shows that:

Control: 500 visitors in November + 750 visitors in December = 1250 visitors
Treatment: 500 visitors in November + 250 visitors in December = 750 visitors

So both control and treatment have equal numbers of visitors from low-converting November, but the treatment has far fewer visitors than the control from high-converting December. This imbalance clues us in that there will be bias, and doing the math confirms this: the conversion rate for the control will be around 16%, and the conversion rate for the treatment will be around 23%, a difference of only 7% rather than the 10% that we would normally expect.
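The arithmetic can be checked directly. The monthly rates below (10%/20% for the control, 20%/30% for the treatment) are illustrative values chosen to be consistent with the pooled figures above:

```python
# Unequal allocation across months biases the pooled conversion rates.
# November (50/50 split): control 10%, treatment 20%.
# December (75/25 split): control 20%, treatment 30%.
nov = {"n_C": 500, "n_T": 500, "p_C": 0.10, "p_T": 0.20}
dec = {"n_C": 750, "n_T": 250, "p_C": 0.20, "p_T": 0.30}


def pooled_rate(arm, months):
    # Overall conversion rate for one arm, aggregated across all months.
    conversions = sum(m["n_" + arm] * m["p_" + arm] for m in months)
    visitors = sum(m["n_" + arm] for m in months)
    return conversions / visitors


control = pooled_rate("C", [nov, dec])    # (50 + 150) / 1250  = 0.16
treatment = pooled_rate("T", [nov, dec])  # (100 + 75) / 750  ~= 0.233
print(treatment - control)  # ~0.073: the 10-point lift appears as ~7%
```

The treatment’s December sample is starved of the high-converting holiday traffic, so its pooled rate understates the true lift.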

In this example, the diminished point estimate might cause a statistical method to fail to report significance when it otherwise would with an unbiased point estimate closer to 10%. But other adverse effects are also possible. When there is no true effect (e.g. we ran an A/A test), Simpson’s Paradox can cause the illusion of a significant positive or negative effect, leading to inflated false positives. Or, when the time variation is especially strong or the traffic allocation syncs up well with the time variation, this bias can be so drastic as to reverse the sign of the estimated difference between the control and treatment parameters (as seen in Figure 2a), completely misleading experimenters as to the true effect of their proposed change.

Epoch Stats Engine

Since Simpson’s Paradox manifests as a bias in the point estimate of the difference in means of the control and treatment, mitigating it requires a way to remove that bias. In turn, removing the bias requires accounting for one of the two factors causing it: time variation or dynamic traffic allocation. Since time variation is unknown and must therefore be estimated, while traffic allocation is directly controlled by the customer or Optimizely and is therefore known, we opted to focus on the latter.

Our solution for Simpson’s Paradox follows from the observation that bias due to Simpson’s Paradox cannot occur over periods of constant traffic allocation. Therefore, we may derive an unbiased point estimate by forming estimates within periods of constant allocation (called epochs) and then aggregating those individual within-epoch estimates into one unbiased across-epoch estimate. If we can show that this quantity is compatible with the sequential testing methodology underneath Stats Engine’s hood, then we have a plug-and-play solution that works seamlessly with experiments using Stats Accelerator. As we will see, this estimator is also simple enough to be applied to other statistical tests such as the t-test.

Let’s get into the math a bit. Suppose there are K(n) total epochs by the time the experiment has seen n visitors. Within each epoch k, denote by n k,C and n k,T the sample sizes of the control and treatment respectively, and by X̄ k and Ȳ k the sample means of the control and treatment respectively. Letting n k = n k,C + n k,T , the epoch estimator for the difference in means is

T n = Σ k (n k / n) (Ȳ k – X̄ k ), where the sum runs over epochs k = 1, …, K(n).
At a high level, T n is a stratified estimate where the strata represent data belonging to individual epochs of fixed allocation. At a low level, this is a weighted average of within-epoch estimates, where the weight assigned to each epoch is proportional to the total number of visitors within that epoch. This is also why we surface the epoch estimate as “Weighted improvement” on the results page for experiments using Stats Accelerator.
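A minimal sketch of the epoch estimator follows; the function name and data layout are illustrative, not Optimizely’s actual implementation. Within each epoch of constant allocation we take the difference of sample means, then weight by that epoch’s share of total traffic:

```python
def epoch_estimate(epochs):
    """Stratified (epoch) estimate of the difference in means.

    epochs: list of (n_C, n_T, xbar, ybar) tuples, one per period of
    constant traffic allocation, where xbar/ybar are the control/treatment
    sample means within that epoch.
    """
    n = sum(n_C + n_T for n_C, n_T, _, _ in epochs)
    return sum(((n_C + n_T) / n) * (ybar - xbar)
               for n_C, n_T, xbar, ybar in epochs)


# Revisiting the November/December example: within each epoch the treatment
# leads by the true 10 points, so the stratified estimate recovers the lift
# despite the allocation change from 50/50 to 75/25.
epochs = [(500, 500, 0.10, 0.20),   # November epoch
          (750, 250, 0.20, 0.30)]   # December epoch
print(epoch_estimate(epochs))  # ~0.10, versus ~0.073 for the naive estimate
```

Because each within-epoch difference is unbiased on its own, any convex combination of them is unbiased too; weighting by epoch size simply mirrors how much data each epoch contributes.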