Lies, damned lies, and A/B testing

Ben Moss.
September 07, 2015
A/B testing is frequently billed as a scientific way to validate design decisions. Occasionally referred to as split testing, A/B testing is a simple process that on the surface appears to promise concrete results:

Create two variations on a design element, randomly swap them on your site, record how your users react, compare the results, and implement whichever variation performed best. It makes sense.

The classic example is: a red button vs. a green button, which will be tapped more? However, the more interesting question is: a green button vs. the same green button, which will be tapped more?

What happens when we A/B test two identical variations? An A/A test, if you will.

Green button vs. green button

In order to test the validity of any A/B test, we need a test that has a ‘correct’ answer. We need a correct answer because we want to know, all things being equal, how likely is it that the A/B test will produce the result it should, and for that we need to know what result to expect.

So, let’s assume that we’re testing a green button vs. the same green button, and that the button is so enticing that 100% of users will tap it.

(The percentage doesn’t actually matter, it could be 14.872%. What matters is that because the buttons are identical, the tap rate should also be identical.)

If we A/B test two identical buttons, the result should be a dead heat.

The coin toss test

Toss a coin. Which side will come up, heads or tails? We know there are two sides, both equally likely, so it’s a 50-50 chance.

If we toss our coin twice, we know that there are three possible outcomes: 2 heads, 2 tails, or 1 head and 1 tails. And so on…
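To make the "and so on" concrete, here is a minimal sketch that lists how likely each heads count is for a fair coin. The helper names `choose` and `headsDistribution` are illustrative assumptions, not part of the script used later in this article.

```javascript
// Binomial coefficient C(n, k), built up iteratively to avoid large factorials.
function choose(n, k) {
  var result = 1;
  for (var i = 1; i <= k; i++) {
    result = result * (n - k + i) / i;
  }
  return Math.round(result);
}

// Probability of each possible heads count (0..n) for a fair coin tossed n times.
function headsDistribution(n) {
  var total = Math.pow(2, n),
      distribution = [];
  for (var heads = 0; heads <= n; heads++) {
    distribution.push(choose(n, heads) / total);
  }
  return distribution;
}

console.log(headsDistribution(2)); // [0.25, 0.5, 0.25]
console.log(headsDistribution(12)[6]); // chance of an exact 6-6 tie in 12 tosses
```

Even with a perfectly fair coin, an exact tie is only one of many possible outcomes, and it becomes less likely as the number of tosses grows.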

Let’s say the coin toss is our A/A test; the odds of the heads side coming up are identical to the odds of the tails side coming up, just as the odds of either of our green buttons being tapped are equal.

So let’s write a quick script in the browser (because most A/B testing happens in the browser) to simulate users tapping one button or the other, depending on which one they’re presented with.

Remember: we’re testing two identical variations of a button, and the way we know they’re identical is that we’re treating the likelihood of them being tapped as identical. All we’re looking for is a consistent (and therefore correct) result.

Firstly, we need an HTML table to record our results in. The table will look like this:

<table id="results">
 <tr><th>#</th><th>Heads</th><th>Tails</th><th>Difference</th><th>Margin of Error</th></tr>
</table>

<div id="summary"></div>

In the first column we’ll record the number of the test (all good A/B tests are repeated to verify results, so we’ll repeat the test a few times). Next we’ll record the number of Heads results, then the number of Tails results. The column after that will be the difference between the two results (which should be zero). Then we’ll record the margin of error (which, again, should be 0%). Beneath the table we’ll print out a summary: the average of all the results, and the worst case result.

Here’s the script:

var bestOf = 12, // the number of coin tosses in each test
 testRepeat = 12, // the number of times we’d like to repeat the test
 testCount = 0, // the number of the current test
 testInterval = setInterval(performCoinToss, 100), // call the coin toss function every 100ms
 totalDifference = 0, // used for calculating the average difference
 worstDifference = 0; // the worst case

function performCoinToss() {
 testCount++; // increment the current test

 var testCounter = bestOf, // the number of tosses left in this test
 headsCounter = 0, // the total number of times the script came up with "heads"
 tailsCounter = 0; // the total number of times the script came up with "tails"

 while(testCounter--) { // loop 'bestOf' times
  Math.round(Math.random()) ? headsCounter++ : tailsCounter++; // finds 0 or 1 randomly, if 1 increments headsCounter, otherwise increments tailsCounter
 }

 var difference = Math.abs(headsCounter - tailsCounter), // the difference between the two
 error = (difference / bestOf) * 100; // the error percentage

 document.getElementById("results").innerHTML += "<tr><td>" + testCount + "</td><td>" + headsCounter + "</td><td>" + tailsCounter + "</td><td>" + difference + "</td><td>" + error + "%</td></tr>"; // add result to table
 totalDifference += difference; // adds this test's difference to the running total
 worstDifference = difference > worstDifference ? difference : worstDifference; // updates worstDifference

 if(--testRepeat == 0) { // the test has been repeated enough times
  var averageDifference = totalDifference / testCount, // finds average difference
  averageError = (averageDifference / bestOf) * 100; // finds the average error margin
  document.getElementById("summary").innerHTML = "<p>Average difference: " + averageDifference + "</p><p>Average margin of error: " + averageError + "%</p><p>Worst Case: " + worstDifference + "</p>"; // write summary to page
  clearInterval(testInterval); // stop repeating the test
 }
}

The code is commented, so here are just the highlights:

Firstly we set up some variables including the number of times we want to toss the coin (bestOf) and the number of times we want to repeat the test (testRepeat).

Spoiler alert: we’re going to get into some fairly high loops, so to avoid breaking anyone’s browser we’re running the test on an interval every 100ms.

Inside the performCoinToss function we loop the required number of times. On each iteration of the loop we use JavaScript’s random function to generate either a 1 or a 0, which in turn increments either the headsCounter or the tailsCounter.

Next we write the result from that test to the table.

Lastly, if the test has been repeated the number of times we’d like, we find the averages, and the worst case, write them to the summary, and clear the interval.

As you can see the average difference is, well it will be different for you, but as I’m writing this the average difference is 2.8333333333333335, the average error is therefore 23.611111111111114%.

Over 23% error does not inspire confidence, especially as we know that the difference should be 0%. What’s worse is that my worst case result is 8, that’s 10–2 in favor of heads.
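A result like this is not a fluke of one run. As a rough cross-check, the sketch below works out the exact average difference between heads and tails that a fair coin should produce, by weighting every possible heads count by its probability. The function name `expectedDifference` is an illustrative assumption and is not part of the script above.

```javascript
// Exact expected value of |heads - tails| for a fair coin tossed n times.
// Assumption: n is small enough that Math.pow(0.5, n) does not underflow
// to zero (n up to roughly a thousand).
function expectedDifference(n) {
  var probability = Math.pow(0.5, n), // probability of exactly 0 heads
      expected = 0;
  for (var heads = 0; heads <= n; heads++) {
    expected += Math.abs(n - 2 * heads) * probability;
    // step to the probability of exactly (heads + 1) heads
    probability = probability * (n - heads) / (heads + 1);
  }
  return expected;
}

var average = expectedDifference(12);
console.log(average, (average / 12) * 100 + "%");
```

For 12 tosses this comes out at about 2.71, or roughly 22.6%, so an observed average in the low twenties is what we should expect from a sample this small.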

Using some realistic numbers

Okay, so that test wasn’t fair. A real A/B test would never claim to find a conclusive result from just 12 users.

A/B testing relies on something called “statistical significance”, meaning that the test has to run enough times to achieve an actionable result.

So, let’s double the bestOf variable and see how far we need to go to reach a margin of error of less than 1%, which we’ll treat here as the equivalent of 99% confidence.

At a bestOf of 24 (at the time of writing) the average difference is 3.1666666666666665, which is 13.194444444444445%. A step in the right direction!

Let’s double it again, to a bestOf of 48. This time, my average difference is 6.666666666666667, with a margin of error of 13.88888888888889%. Worse still, the worst case is 16; that’s an error of 33.33333333333333%!

Finally, at a bestOf of 98,304, the worst case scenario drops below 1%. In other words, by this measure we can be 99% confident that the test is accurate.

So, in an A/A test, the result of which we knew in advance, it took a sample size of 98,304 to reach an acceptable margin of error.
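The doubling experiment can also be run in one go, without the table or the interval. The following is a sketch under stated assumptions rather than the script used for the figures above: `runTests` and its optional `random` parameter are hypothetical names, and the coin is tossed with `random() < 0.5`, which is equally fair but not identical to rounding `Math.random()`.

```javascript
// Runs `testRepeat` tests of `bestOf` coin tosses each and reports the
// average and worst-case error as percentages of bestOf.
function runTests(bestOf, testRepeat, random) {
  random = random || Math.random; // assumption: callers may inject their own generator
  var totalDifference = 0,
      worstDifference = 0;
  for (var test = 0; test < testRepeat; test++) {
    var heads = 0;
    for (var toss = 0; toss < bestOf; toss++) {
      if (random() < 0.5) heads++;
    }
    var difference = Math.abs(heads - (bestOf - heads)); // |heads - tails|
    totalDifference += difference;
    worstDifference = Math.max(worstDifference, difference);
  }
  return {
    averageError: (totalDifference / testRepeat / bestOf) * 100,
    worstError: (worstDifference / bestOf) * 100
  };
}

// Double bestOf from 12 up to 98304 and log how the error shrinks.
for (var bestOf = 12; bestOf <= 98304; bestOf *= 2) {
  var result = runTests(bestOf, 12);
  console.log(bestOf, result.averageError.toFixed(2) + "%", result.worstError.toFixed(2) + "%");
}
```

The exact figures will differ on every run, but the error should shrink only slowly: each doubling of bestOf trims it by a modest fraction rather than halving it.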

The $3,000,000,000 button

Whenever A/B testing is discussed, someone recalls a friend of a friend, who A/B tested a single button on his/her site, and promptly made some improbable profit (the actual dollar value of the button increases each time I hear the story).

In those tales, the buttons are usually tested for micro-copy, “Download my ebook” vs. “Download my free ebook”. It shouldn’t be a surprise that the latter wins. It’s an improvement that any good copywriter would make. A more appropriate A/B test would be “Download my ebook” vs. “Download the ebook” (my money’s on the latter).

If you find yourself with a result heavily weighted towards one of the options, it suggests that something is very wrong with one of your variations. More often, a good result will be an improvement of less than 5%, which presents a problem if you’re testing with around 1000 users (the margin for error of which is around 5%).

The more useful a test is, the tighter the margin of victory for one variation or the other. However, the tighter the margin of victory, the greater the sample size needed to give you an acceptably small margin of error.
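To put rough numbers on that trade-off, here is a sketch using the conventional margin-of-error formula for a proportion (the normal approximation). This is a textbook formula swapped in for illustration, not the measure used by the coin toss script; the function names and the default `z` of 1.96 (the usual 95% confidence level) are assumptions.

```javascript
// Margin of error, in percentage points, for an observed rate (0..1)
// measured on `sampleSize` users, using the normal approximation.
function marginOfError(rate, sampleSize, z) {
  z = z || 1.96; // assumption: 1.96 is the conventional 95% confidence multiplier
  return z * Math.sqrt(rate * (1 - rate) / sampleSize) * 100;
}

// Smallest sample size whose margin of error is at most `target` percentage points.
function requiredSampleSize(rate, target, z) {
  z = z || 1.96;
  return Math.ceil(z * z * rate * (1 - rate) / Math.pow(target / 100, 2));
}

console.log(marginOfError(0.5, 1000)); // margin for 1000 users at a 50% tap rate
console.log(requiredSampleSize(0.5, 2)); // users needed for a 2-point margin
console.log(requiredSampleSize(0.5, 1)); // users needed for a 1-point margin
```

Because the sample size grows with the square of the precision, halving the margin of error takes roughly four times as many users.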

Lies, damned lies, and A/B testing

Mark Twain, possibly quoting Disraeli, once used the phrase: lies, damned lies, and statistics. By this he meant that something proved by statistics is not necessarily true. Statistics can be used to prove anything you want them to.

The most dangerous thing about A/B testing is that it can prove anything you want it to; it can produce false positives, and it enables us to discern patterns that aren’t properly supported.

Furthermore an A/B test may indicate that a green button outperforms a red button, but what about a blue button? Even successful A/B testing only allows us to validate our design decisions within the context of the test itself.

For an A/B test to function as intended, we need two opposing conditions to be true:

  1. there should be minimal variation between options, so the test is not weighted by our preference;
  2. the sample size should be sufficient that the margin of error is less than the strength of the result.

Unfortunately most sites don’t have a sample size large enough to reach a sufficiently small margin of error. And because we can’t increase our sample size (we would if we could), our only choice is to increase the variation of the options in order to produce a clear result, skewing the test by our preferences.

A/B testing will provide you with a result, but it’s a result that will say more about you, and about what you expected to find, than about your customers. When it comes to making design decisions on any site other than those with very high volumes of traffic, we might as well toss a coin as run an A/B test.
