4.13 A/B Testing Lesson 13
This is an application of Confidence Intervals and Hypothesis Testing.
A/B testing is a comparison between two groups (A and B), as illustrated in Figure 1.
A/B tests are used to test changes on a web page by running an experiment where a control group sees the old version, while the experiment group sees the new version. A metric is then chosen to measure the level of engagement from users in each group. These results are then used to judge whether one version is more effective than the other. A/B testing is very much like hypothesis testing with the following hypotheses:
\[H_0 : \text{The new version is equal or worse than the older version.} \\ H_1 : \text{The new version is better than the older version.} \]
Decision:
- If we fail to reject the null hypothesis, the results suggest keeping the old version.
- If we reject the null hypothesis, the results suggest launching the change.
Drawbacks
It can help you compare two options, but it can’t tell you about an option you haven’t considered. It can also produce biased results when tested on existing users, due to factors like change aversion and the novelty effect.
- Change Aversion: Existing users may give an unfair advantage to the old version, simply because they are unhappy with change, even if it’s ultimately for the better.
- Novelty Effect: Existing users may give an unfair advantage to the new version, because they’re excited or drawn to the change, even if it isn’t any better in the long run.
4.13.1 Example: New Homepage
The Audacity company wants to perform an A/B test comparing a new version of its homepage against the old one.
\[H_0 : CTR_{new} - CTR_{old} \leq 0 \\ H_1 : CTR_{new} - CTR_{old} > 0\]
Where CTR stands for Click Through Rate.
There are two versions: `control` and `experiment`. The difference between the CTRs is about 0.03.
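As a rough sketch of how the two rates might be computed (the file name, DataFrame layout, and column names below are assumptions, since the text does not show the code), one common definition of CTR is the number of unique users who clicked divided by the number of unique users who viewed:

```python
import pandas as pd

# Hypothetical layout: one row per event, with columns 'id' (user id),
# 'group' ('control' or 'experiment') and 'action' ('view' or 'click').
df = pd.read_csv("homepage_actions.csv")

def ctr(data):
    """CTR as unique users who clicked divided by unique users who viewed."""
    clickers = data.query("action == 'click'").id.nunique()
    viewers = data.query("action == 'view'").id.nunique()
    return clickers / viewers

control_ctr = ctr(df.query("group == 'control'"))
experiment_ctr = ctr(df.query("group == 'experiment'"))
obs_diff = experiment_ctr - control_ctr   # about 0.03 in the text

print(control_ctr, experiment_ctr, obs_diff)
```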
Bootstrapping
Example: New version of Home page.
Bootstrapping provides the histogram presented in Figure 2.
The null hypothesis histogram is shown in Figure 3.
Finally, Figure 4 illustrates both histograms.
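A minimal sketch of this bootstrapping step, reusing the assumed `df` and `ctr()` from the sketch above: resample the data with replacement many times, record the difference in CTRs each time (Figure 2), and simulate the null distribution as a normal centered at zero with the same spread (one common way to build Figure 3).

```python
import numpy as np

np.random.seed(42)

# Bootstrap the sampling distribution of the difference in CTRs (Figure 2).
diffs = []
for _ in range(10_000):
    sample = df.sample(df.shape[0], replace=True)
    ctrl = ctr(sample.query("group == 'control'"))
    expr = ctr(sample.query("group == 'experiment'"))
    diffs.append(expr - ctrl)
diffs = np.array(diffs)

# Simulate the null distribution: same spread, centered at zero (Figure 3).
null_vals = np.random.normal(0, diffs.std(), diffs.size)
```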
P-value
Based on the entire population (excluding duplicated user ids, etc.), I have calculated the `diff`:

diff = experiment_crt - control_crt = 0.030034443684015644

The `diff` can be interpreted as a threshold: I will calculate the proportion of the \(H_0\) distribution (orange histogram) that has a difference between CTRs higher than `diff`.
To do this, I will take the mean of a list of booleans, which returns exactly the proportion I want.
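Continuing the sketch above (variable names are hypothetical), the mean of a boolean array gives that proportion directly:

```python
# True counts as 1 and False as 0, so the mean of the comparisons is the
# proportion of simulated null differences larger than the observed diff.
p_value = (null_vals > obs_diff).mean()
print(p_value)   # about 0.005 (0.5%) according to the text
```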
Based on the `p_value` of 0.5%, we reject \(H_0\).
Conclusion: Audacity should launch the new version of the home page.
4.13.2 Example: Average Reading Time
Same idea: two versions of a website, one `control` and the other `experiment`.
- Average Reading time of control: 115.38637100678429
- Average Reading time of experiment: 131.3208410471793
- Difference observed: 15.9
On average, visitors using the experiment version of the website spent almost 16 more seconds reading.
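A sketch of how these averages might be computed, assuming a DataFrame `df2` with `group` and `reading_time` columns (file name and column names are hypothetical):

```python
import pandas as pd

# Hypothetical dataset: one row per visitor, with the group label and
# the time spent reading (in seconds).
df2 = pd.read_csv("reading_times.csv")

control_mean = df2.query("group == 'control'").reading_time.mean()
experiment_mean = df2.query("group == 'experiment'").reading_time.mean()
obs_diff = experiment_mean - control_mean   # about 15.9 seconds in the text

print(control_mean, experiment_mean, obs_diff)
```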
Hypotheses posed:
\[H_0 : ART_{new} - ART_{old} \leq 0 \\ H_1 : ART_{new} - ART_{old} > 0\]
Where ART stands for Average Reading Time.
Bootstrapping
Let’s apply bootstrapping and plot the resulting histogram in Figure 5.
The null hypothesis histogram is shown in Figure 6.
Finally, Figure 7 illustrates both histograms.
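The bootstrap follows the same pattern as in the CTR example, only with a difference in means instead of a difference in proportions (a sketch reusing the assumed `df2` and `obs_diff` from above):

```python
import numpy as np

np.random.seed(42)

# Bootstrap the sampling distribution of the difference in mean reading time.
diffs = []
for _ in range(10_000):
    sample = df2.sample(df2.shape[0], replace=True)
    ctrl = sample.query("group == 'control'").reading_time.mean()
    expr = sample.query("group == 'experiment'").reading_time.mean()
    diffs.append(expr - ctrl)
diffs = np.array(diffs)

# Null distribution centered at zero (Figure 6) and one-sided p-value.
null_vals = np.random.normal(0, diffs.std(), diffs.size)
p_value = (null_vals > obs_diff).mean()   # reported as zero in the text
```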
P-value
The `p_value` is zero.
Conclusion: Reject \(H_0\) because `p_value` < \(\alpha\), where \(\alpha\) is 0.05.
4.13.3 Example: Enrollment Rate
This example shows a case where we fail to reject \(H_0\).
I will use the same principle as the CTR example to evaluate the enrollment rate.
- Enrollment rate control: 0.23452157598499063
- Enrollment rate experiment: 0.2642986152919928
- Difference observed: 0.02977703930700215
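As with the CTR, each enrollment rate is just a proportion; a sketch assuming a DataFrame `df3` with `id`, `group`, and `action` columns, where `action` takes the values 'view' and 'enroll' (file name and column names are hypothetical):

```python
import pandas as pd

# Hypothetical dataset: one row per event (view or enroll) per user.
df3 = pd.read_csv("enrollments.csv")

def enrollment_rate(data):
    """Unique users who enrolled divided by unique users who viewed."""
    enrolled = data.query("action == 'enroll'").id.nunique()
    viewed = data.query("action == 'view'").id.nunique()
    return enrolled / viewed

control_er = enrollment_rate(df3.query("group == 'control'"))
experiment_er = enrollment_rate(df3.query("group == 'experiment'"))
obs_diff = experiment_er - control_er   # about 0.0298 in the text
```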
Hypotheses posed:
\[H_0 : ER_{new} - ER_{old} \leq 0 \\ H_1 : ER_{new} - ER_{old} > 0\]
Where ER stands for Enrollment Rate.
Bootstrapping
Let’s apply bootstrapping and plot the resulting histogram in Figure 8.
The null hypothesis histogram is shown in Figure 9.
Finally, Figure 10 illustrates both histograms.
P-value
The `p_value` is 0.0624.
Conclusion: Because `p_value` > \(\alpha\), where \(\alpha\) is 0.05, we fail to reject \(H_0\).
4.13.4 Bonferroni Correction
If you remember from the previous lesson, the Bonferroni Correction is one way we could handle experiments with multiple tests, or metrics in this case. To compute the Bonferroni-corrected alpha value, we divide the original alpha value by the number of tests.
The new \(\alpha\) will be:
\[\alpha_{adjusted} = \frac{\alpha}{4} = \frac{0.05}{4} = 0.0125\]
Based on the several tests we have done, the p-values are:
- Enrollment Rate: 0.0624 (see the Jupyter Notebook)
- Average Reading Duration: 0 (see the Jupyter Notebook)
- Average Classroom Time: 0.0384 (see the Jupyter Notebook)
- Completion Rate: 0.0846 (see the Jupyter Notebook)
With this new \(\alpha\), only one of the A/B tests remains statistically significant.
New Feature | p value | \(\alpha_{adjusted}\) | p value vs \(\alpha_{adjusted}\) | Decision on \(H_0\) |
---|---|---|---|---|
Enrollment Rate | 0.0624 | 0.0125 | > | Fail to reject \(H_0\) |
Average Reading Duration | 0 | 0.0125 | < | Reject \(H_0\) |
Average Classroom Time | 0.0384 | 0.0125 | > | Fail to reject \(H_0\) |
Completion Rate | 0.0846 | 0.0125 | > | Fail to reject \(H_0\) |
This is the reason the Bonferroni method is considered conservative.
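A minimal sketch of the correction applied to the four p-values listed above: divide \(\alpha\) by the number of tests and compare each p-value against the adjusted threshold.

```python
# p-values of the four metrics, as reported above.
p_values = {
    "Enrollment Rate": 0.0624,
    "Average Reading Duration": 0.0,
    "Average Classroom Time": 0.0384,
    "Completion Rate": 0.0846,
}

alpha = 0.05
alpha_adjusted = alpha / len(p_values)   # 0.05 / 4 = 0.0125

for metric, p in p_values.items():
    decision = "Reject H0" if p < alpha_adjusted else "Fail to reject H0"
    print(f"{metric}: p = {p} -> {decision}")
```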
4.13.5 Difficulties in A/B Testing
As you saw in the scenarios above, there are many factors to consider when designing an A/B test and drawing conclusions based on its results. To conclude, here are some common ones to consider.
- Novelty effect and change aversion when existing users first experience a change
- Sufficient traffic and conversions to have significant and repeatable results
- Best metric choice for making the ultimate decision (e.g., measuring revenue vs. clicks)
- Long enough run time for the experiment to account for changes in behavior based on time of day/week or seasonal events.
- Practical significance of a conversion rate (the cost of launching a new feature vs. the gain from the increase in conversion)
- Consistency among test subjects in the control and experiment group (imbalance in the population represented in each group can lead to situations like Simpson’s Paradox) — Udacity notebook