Mindmap on A/B tests designing and experiments

Udacity has a quite interesting course on A/B testing. I’ve made a mind map for one of the lessons Designing and Experiment to better understand the mechanics of tests. Also, there is a python script for empirical estimation of the required group size for an experiment. It’s a modified version of the R script by Udacity and uses binary search to speed up computations.

Intro

A/B testing is a method of comparing two versions of a product or service to determine which one is more effective. It is commonly used in the fields of marketing and user experience design to evaluate changes to a website or app, and to determine which version is more effective at achieving a specific goal, such as increasing conversions or improving user satisfaction. In an A/B test, a control group is shown the original version of the product, while a second group is shown the variation. The behavior of both groups is then compared to determining which version is more effective.

Unit of Diversion and Unit of Analysis

In A/B testing, a unit of diversion refers to the individual element that is being tested. For example, if you are conducting an A/B test on a website, the unit of diversion might be a specific button or piece of content. The control group would be shown the original version of this element, while the variation group would be shown a modified version. By comparing the behavior of both groups, you can determine which version is more effective.

In an A/B test, the unit of diversion is the individual user or visitor who is randomly assigned to either the control group or the variation group. The unit of analysis, on the other hand, refers to the level at which the data from the test is analyzed. For example, in an inter-user experiment, the unit of analysis might be the group as a whole, while in an intra-user experiment, the unit of analysis might be the individual user. The choice of unit of analysis can affect the conclusions that are drawn from the data, so it’s important to carefully consider which unit is most appropriate for your specific A/B test.

Units of diversion:

User ID
- Stable within all devices
- Personally identifiable
Anonymous ID of cookie
- Specific for a device
- Can be deleted
Event
- Only for changes that are not visible
Device ID
- Only for mobile
- Specific for a device
- Personally identifiable
IP address
- Can be changed

Important notes

Remember about consistency. Consistency is important for A/B tests because it helps to ensure that any differences between the control group and the variation group are due to the changes being tested, and not to other factors. For example, if an A/B test is conducted over a period of time during which the website or app being tested is also undergoing other changes, it may be difficult to determine which changes are responsible for any differences in the results. By keeping all other variables consistent, A/B tests can provide more reliable and accurate results.

Visibility is lower when unit of diversion and unit of analysis are the same. When the unit of diversion and the unit of analysis are the same, visibility may be lower because the data from the test is being analyzed at a more granular level. For example, if the unit of diversion and the unit of analysis are both individual users, the data from the test will be analysed on a user-by-user basis. This can make it more difficult to detect trends or patterns in the data, and may result in lower visibility of the overall results of the test. By contrast, if the unit of analysis is the group as a whole, the data can be more easily summarised and visualised, which may result in higher visibility of the results.

Inter- and Intra-User Experiments

Inter-user experiments and intra-user experiments are two different approaches to conducting A/B tests. In an inter-user experiment, the behavior of different users is compared to determining which version of a product or service is more effective. In an intra-user experiment, the behavior of the same user is compared under different conditions (for example, by showing them the original version of a website on one day, and the variation on another day). Intra-user experiments can be useful for testing changes that are too subtle to be detected in an inter-user experiment, but they can also be more difficult to implement and analyze.

It is difficult to say which type of experiment, inter-user or intra-user, is better because the choice of experiment will depend on the specific goals and circumstances of the A/B test. Inter-user experiments are typically easier to implement and analyze, and can be useful for testing changes that are likely to be noticed by a large number of users. Intra-user experiments, on the other hand, can be more useful for testing subtle changes that may be difficult to detect in an inter-user experiment. Ultimately, the best approach will depend on the specific goals of the A/B test and the resources available to conduct the experiment.

A target cohort is a group of users or visitors who are specifically selected to participate in the experiment. For example, a target cohort might consist of users who have previously completed a specific action on a website (such as making a purchase) or who fit a certain demographic profile (such as being located in a specific geographic region). By carefully selecting a target cohort, A/B tests can be more focused and effective, and can provide more reliable results.

Size and Duration

In an A/B test, size and duration are two important factors that can affect the reliability and accuracy of the results. The size of the test refers to the number of units of diversion (i.e. users or visitors) who are included in the experiment, while the duration refers to the length of time over which the test is conducted. A larger sample size and longer duration can help to increase the reliability of the results, but they may also require more resources and time to implement. It’s important to strike a balance between these factors to ensure that your A/B test is both effective and efficient.

Tips for conducting A/B tests

Here are some tips for conducting successful A/B tests:

Start with a clear hypothesis: Before you begin an A/B test, it’s important to have a clear idea of what you are trying to achieve and what you expect to see as a result. This will help you to design a test that is focused and effective.
Use a large enough sample size: In order for your A/B test to be statistically significant, you will need to include a large enough sample size. The exact number will depend on a variety of factors, including the size of your user base and the size of the effect you are trying to measure.
Randomize your samples: In order for your A/B test to be truly objective, it’s important to randomly assign users to the control group and the variation group. This will help to ensure that any differences between the two groups are due to the changes you are testing, and not to other factors.
Keep your test consistent: Once you have set up your A/B test, it’s important to keep all other variables consistent. This means using the same time period for both groups, and making sure that any external factors (such as changes to your website or app) do not affect the results of the test.
Analyse the results carefully: Once your A/B test is complete, it’s important to carefully analyze the results to determine which version is more effective. This may involve using statistical techniques to determine whether the differences between the two groups are statistically significant.
Conduct A/A experiments. An A/A experiment is a type of A/B test in which the control group and the variation group are shown the same version of a product or service. In other words, there is no “B” in an A/A experiment. A/A experiments are typically used as a way to test the validity of an A/B testing methodology, or to verify that the data from an A/B test is accurate and reliable. By comparing the results of an A/A experiment to the results of an A/B test, researchers can determine whether any observed differences between the two groups are due to the changes being tested, or to other factors.

Plus, there are articles to read:

Overlapping Experiment Infrastructure: More, Better, Faster Experimentation. Description of the Google’s overlapping experiment infrastructure.
Large-Scale Validation and Analysis of Interleaved Search Evaluation. Comprehensive analysis of interleaving using data from two major commercial search engines and a retrieval system for scientific literature.

import math

import scipy.stats as ss


def get_z_star(alpha):
    """
    Parameters
    ----------
    z_star : float
        A desired alpha for the two-tailed test

    Returns
    -------
        A z-critical value of the two-tailed test
    """
    return -ss.norm.ppf(alpha / 2)


def get_beta(z_star, s, d_min, N):
    """
    Parameters
    ----------
    z_star : 1d-array, shape (n_times)
        A z-critical value
    s : float
        A varianve of the metric at sample size n=1 in each group
    d_min : float
        A practival significance level
    N : int
        A sample size of the test group

    Returns
    -------
        A beta value of the two-tailed test
    """
    SE = s / math.sqrt(N)
    return ss.norm.cdf(z_star * SE, loc=d_min, scale=SE)


def required_size(s, d_min, ns=range(10000000), alpha=0.05, beta=0.2):
    """
    Parameters
    ----------
    s : float
        A varianve of the metric at sample size n=1 in each group
    d_min : float
        A practival significance level
    ns : range
        A range of sample sizes to consider
    alpha : float
        A desired alpha level of the test
    beta : float
        A desired beta level of the test

    Returns
    -------
        Smallest sizes of groups for an experiment
    """
    mid = len(ns) // 2
    n = ns[mid]
    b = get_beta(get_z_star(alpha), s, d_min, n)
    if b == beta or len(ns) <= 1:
        return n + ns.step
    elif b < beta:
        return required_size(s, d_min, ns[:mid], alpha, beta)
    else:
        return required_size(s, d_min, ns[mid:], alpha, beta)


print(required_size(s=0.0188 * math.sqrt(1000), d_min=0.015))