Lies, Damned Lies, and A/B Testing

Ben Moss By Ben Moss  |  Sep. 07, 2015

A/B testing is frequently billed as a scientific way to validate design decisions. Occasionally referred to as split testing, A/B testing is a simple process that on the surface appears to promise concrete results:

Create two variations on a design element, randomly swap them on your site and record how your users react, compare the results, implement whichever variation performed best. It makes sense.

The classic example is: a red button vs. a green button, which will be tapped more? However, the more interesting question is: a green button vs. the same green button, which will be tapped more?

What happens when we A/B test two identical variations? An A/A test, if you will.


Green button vs. green button

In order to test the validity of any A/B test, we need a test that has a ‘correct’ answer. We need a correct answer because we want to know, all things being equal, how likely is it that the A/B test will produce the result it should, and for that we need to know what result to expect.

If we A/B test two identical buttons, the result should be a dead heat

So, let’s assume that we’re testing a green button vs. the same green button, and that the button is so enticing that 100% of users will tap it.

(The percentage doesn’t actually matter, it could be 14.872%. What matters is that because the buttons are identical, the tap rate should also be identical.)

If we A/B test two identical buttons, the result should be a dead heat.


The coin toss test

Toss a coin. Which side will come up, heads or tails? We know there are two sides, both identical, so it’s a 50-50 chance.

If we toss our coin twice, we know that there are three possible outcomes: 2 heads, 2 tails, or 1 head and 1 tails. And so on…

Let’s say the coin toss is our A/A test; the odds of the heads side coming up are identical to the odds of the tails side coming up, just as the odds of either of our green buttons being tapped are equal.

So let’s write a quick script in the browser (because most A/B testing happens in the browser) to simulate users tapping one button or the other, depending on which one they’re presented with.

Remember: we’re testing two identical variations of a button, and the way we know they’re identical is that we’re treating the likelihood of them being tapped as identical. All we’re looking for is a consistent (and therefore correct) result.

Firstly, we need an HTML table to record our results in, the table will look like this:

<table id="results">
        <th>Margin of Error</th>

<div id="summary"></div>

In the first column we’ll record the number of the test (all good A/B tests, are repeated to verify results, so we’ll repeat the test a few times). Next we’ll record the number of Heads results, then the number of Tails results. The column after that will be the difference between the two results (which should be zero). Then we’ll record the margin of error (which again, should be 0%). Beneath the table we’ll print out a summary, the average of all the results, and the worst case result.

Here’s the script:

var bestOf = 12, // the number of times we want to run the test
    testRepeat = 12, // the number of times we’d like to repeat the test
    testCount = 0, // the number of the current test
    testInterval = setInterval(performCoinToss, 100), // call the coin toss function
    totalDifference = 0, // used for calculating the average difference
    worstDifference = 0; // the worst case

function performCoinToss()
    testCount++; // increment the current test

    var testCounter = bestOf, // the current iteration of the test
        headsCounter = 0, // the total number of times the script came up with "heads"
        tailsCounter = 0; // the total number of times the script came up with "tails"

    while(testCounter--) // loop 'testCounter' times
        Math.round(Math.random()) ? headsCounter++ : tailsCounter++; // finds 0 or 1 randomly, if 1 increments headsCounter, otherwise increments tailsCounter
    var difference = Math.abs(headsCounter - tailsCounter), // the difference between the two
        error = (difference / bestOf) * 100; // the error percentage

    document.getElementById("results").innerHTML += "<tr><td>" + testCount + "</td><td>" + headsCounter + "</td><td>" + tailsCounter + "</td><td>" + difference + "</td><td>" + error + "%</td></tr>"; // add result to table
    totalDifference += difference; // increments the difference counter
    worstDifference = difference > worstDifference ? difference : worstDifference; // updates worstDifference
    if(--testRepeat == 0) 
        var averageDifference = totalDifference / testCount, // finds average difference
            averageError = (averageDifference / bestOf) * 100; // finds the average error margin
        document.getElementById("summary").innerHTML = "<p>Average difference: " + averageDifference + "</p><p>Average margin of error: " + averageError + "%</p><p>Worst Case: " + worstDifference + "</p>"; // write summary to page
        clearInterval(testInterval); // if the test has been repeated enough times, clear the interval

The code is commented, so here are just the highlights:

Firstly we set up some variables including the number of times we want to toss the coin (bestOf) and the number of times we want to repeat the test (testRepeat).

Spoiler alert: we’re going to get into some fairly high loops, so to avoid breaking anyone’s browser we’re running the test on an interval every 100ms.

Inside the performCoinToss function we loop the required number of times, each iteration of the loop we use JavaScript’s random function to generate either a 1 or a 0, which in turn increments either the headsCounter, or the tailsCounter.

Next we write the result from that test to the table.

Lastly, if the test has been repeated the number of times we’d like, we find the averages, and the worst case, write them to the summary, and clear the interval.

Here’s the result. As you can see the average difference is, well it will be different for you, but as I’m writing this the average difference is 2.8333333333333335, the average error is therefore 23.611111111111114%.

Over 23% error does not inspire confidence, especially as we know that the difference should be 0%. What’s worse is that my worst case result is 8, that’s 10–2 in favor of heads.


Using some realistic numbers

Okay, so that test wasn’t fair. A real A/B test would never claim to find a conclusive result from just 12 users.

A/B testing uses something called “statistical significance” meaning that the test has to run enough times in order to achieve an actionable result.

So, let’s double the bestOf variable and see how far we need to go to reach a margin of error, of less than 1% — the equivalent of 99% confidence.

At a bestOf of 24 (at the time of writing) the average difference is 3.1666666666666665, which is 13.194444444444445%. A step in the right direction! Try it for yourself (your results will vary).

Let’s double it again. This time, my average difference 6.666666666666667, with a margin for error of 13.88888888888889%. Worse still, the worst case is 16, that’s an error of 33.33333333333333%! You can try that one for yourself too.

Actually, no prizes for guessing that we can keep going: best of 96, best of 192, best of 384, best of 768, best of 1536, best of 3072, best of 6144, best of 12288, best of 24576, best of 49152, best of 98304.

Finally, at a best of 98304, the worst case scenario drops below 1%. In other words we can be 99% confident that the test is accurate.

So, in an A/A test, the result of which we knew in advance, it took a sample size of 98,304 to reach an acceptable margin of error.


The $3,000,000,000 button

Whenever A/B testing is discussed, someone recalls a friend of a friend, who A/B tested a single button on his/her site, and promptly made some improbable profit (the actual dollar value of the button increases each time I hear the story).

In those tales, the buttons are usually tested for micro-copy, “Download my ebook” vs. “Download my free ebook”. It shouldn’t be a surprise that the latter wins. It’s an improvement that any good copywriter would make. A more appropriate A/B test would be “Download my ebook” vs. “Download the ebook” (my money’s on the latter).

If you find yourself with a result heavily weighted towards one of the options, it suggests that something is very wrong with one of your variations. More often, a good result will be an improvement of less than 5%, which presents a problem if you’re testing with around 1000 users (the margin for error of which is around 5%).

The more useful a test is, the tighter the margin of victory for one variation or the other. However, the tighter the margin of victory, the greater the sample size needed to give you an acceptably small margin of error.


Lies, damned lies, and A/B testing

Mark Twain, possibly quoting Disraeli, once used the phrase: lies, damned lies, and statistics. By which he meant that something proved by statistics, is not necessarily true. Statistics can be used to prove anything you want them to.

A/B testing will provide you with a result, but it’s a result that will say more about you and about what you expected to find, than about your customers

The most dangerous thing about A/B testing is that it can prove anything you want it to; it can produce false positives, and it enables us to discern patterns that aren’t properly supported.

Furthermore an A/B test may indicate that a green button outperforms a red button, but what about a blue button? Even successful A/B testing only allows us to validate our design decisions within the context of the test itself.

For an A/B test to function as intended, we need two opposing conditions to be true:

  1. there should be minimal variation between options, so the test is not weighted by our preference;
  2. the sample size should be sufficient that the margin of error is less than the strength of the result.

Unfortunately most sites don’t have a sample size large enough to reach a sufficiently small margin of error. And because we can’t increase our sample size (we would if we could), our only choice is to increase the variation of the options in order to produce a clear result, skewing the test by our preferences.

A/B testing will provide you with a result, but it’s a result that will say more about you and about what you expected to find, than about your customers. When it comes to making design decisions on any site other than those with very high volumes of traffic, we might as well toss a coin, as A/B test.


Featured image, coin toss image via Shutterstock.