Building Simple A/B Testing on top of Urban Airship’s Customer-Facing APIs

Urban Airship's Core Data team is really excited about the introduction of per-push reporting, as it affords a much richer understanding of end-user response than the purely time-sliced views we've had in the past. I am particularly excited about the API for per-push reporting, since it gives data-savvy customers the ability to build useful tools on top of it.

API v3 allows customers to retrieve information such as the number of sends and the number of influenced opens associated with each individual message. This kind of feedback is crucial for understanding the relative success of different messaging strategies.

To demonstrate what customers can do with this information, I put together a quick implementation of A/B testing using three of our API endpoints (in addition to reading per-push stats, I had to get a list of devices to push to and send some pushes). This only took a couple of days of part-time hacking, and while it's just a sketch, it's one an interested developer could build on.

This version of the A/B testing idea starts with a set of messages, a list of devices to push to, and a timeframe in which to complete the sends. At regular intervals within that timeframe, it sends batches of pushes, selecting each message from the provided set in proportion to the chance that it is the one most likely to influence a user to open the app. In this implementation, only the history of sends and influenced opens is considered in making this choice, but you can imagine extending the technique to take additional factors into account, such as the times of past sends in each recipient's time zone or characteristics of individual users that may make them more receptive to particular messages.

Here is the main execution loop of the program:
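(In sketch form rather than verbatim: "chooser", "statsChecker", and "pushSender" are assumed fields on the runner, and the constants and method names below are illustrative, not necessarily the ones in the repo.)

```java
// Sketch of the runner's main loop. The chooser is constructed up front with the
// candidate messages; pushSender wraps the push-sending endpoint; BATCH_SIZE and
// SEND_INTERVAL are configuration constants. All of these names are illustrative.
void run(List<String> deviceTokens, Duration timeframe) throws Exception {
    Instant deadline = Instant.now().plus(timeframe);
    List<String> remaining = new ArrayList<>(deviceTokens);
    List<String> sentPushIds = new ArrayList<>();

    while (Instant.now().isBefore(deadline) && !remaining.isEmpty()) {
        // Refresh send/open counts for everything pushed so far and let the
        // chooser update its per-message response model.
        chooser.update(statsChecker.fetchStats(sentPushIds));

        // Send the next batch, letting the chooser pick a message for each device.
        int batchSize = Math.min(BATCH_SIZE, remaining.size());
        for (String token : remaining.subList(0, batchSize)) {
            Message message = chooser.chooseMessage();
            String pushId = pushSender.send(token, message);
            chooser.recordSend(message, pushId); // tie the push ID back to its message
            sentPushIds.add(pushId);
        }
        remaining = new ArrayList<>(remaining.subList(batchSize, remaining.size()));

        Thread.sleep(SEND_INTERVAL.toMillis()); // wait for the next round of sends
    }
}
```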

There are two Java interfaces used here that I want to call out: MessageChooser and StatsChecker.
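Their shapes are roughly as follows (the exact signatures in the repo may differ a bit, but this is the idea):

```java
import java.io.IOException;
import java.util.Collection;
import java.util.Map;

// "Message" stands in for whatever object represents one candidate push payload.

// Decides which message to send next and keeps the bookkeeping needed to
// attribute per-push stats back to the message they were sent for.
public interface MessageChooser {
    Message chooseMessage();                           // pick a message for the next send
    void recordSend(Message message, String pushId);   // remember which push ID belongs to which message
    void update(Map<String, PushStats> statsByPushId); // fold fresh per-push stats into the model
}

// Pulls per-push send and response counts from the reporting API.
public interface StatsChecker {
    Map<String, PushStats> fetchStats(Collection<String> pushIds) throws IOException;
}

// Minimal holder for the two numbers this experiment cares about.
public final class PushStats {
    public final long sends;
    public final long influencedOpens;

    public PushStats(long sends, long influencedOpens) {
        this.sends = sends;
        this.influencedOpens = influencedOpens;
    }
}
```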

StatsChecker makes calls to our new per-push reporting API, which enables our customers to do things they couldn't do before. With my apologies if this isn't the most idiomatic usage of the HTTP and JSON libraries, here's the gist of the implementation (the superclass just manages authentication and holds an HTTP client object for reuse):
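(A stripped-down sketch rather than the production-ready version: authentication is inlined instead of living in the superclass, the HTTP and JSON work is done with java.net.http and Jackson, and the endpoint path and response field names should be checked against the per-push detail report docs.)

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Bare-bones StatsChecker that asks the per-push detail report for each push ID.
public class ReportingStatsChecker implements StatsChecker {
    private static final String BASE_URL = "https://go.urbanairship.com/api/reports/perpush/detail/";

    private final HttpClient http = HttpClient.newHttpClient();
    private final ObjectMapper mapper = new ObjectMapper();
    private final String authHeader;

    public ReportingStatsChecker(String appKey, String masterSecret) {
        this.authHeader = "Basic " + Base64.getEncoder()
                .encodeToString((appKey + ":" + masterSecret).getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public Map<String, PushStats> fetchStats(Collection<String> pushIds) throws IOException {
        Map<String, PushStats> stats = new HashMap<>();
        for (String pushId : pushIds) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(BASE_URL + pushId))
                    .header("Authorization", authHeader)
                    .header("Accept", "application/vnd.urbanairship+json; version=3;")
                    .GET()
                    .build();
            try {
                HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
                JsonNode body = mapper.readTree(response.body());
                // Field names here follow the per-push detail report; adjust if the docs say otherwise.
                stats.put(pushId, new PushStats(
                        body.path("sends").asLong(0),
                        body.path("influenced_responses").asLong(0)));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while fetching per-push stats", e);
            }
        }
        return stats;
    }
}
```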

MessageChooser is perhaps more interesting. This is where you need to implement the flavor of A/B testing that suits you. I chose to implement this version of the Bayesian bandit, which is perhaps the simplest possible form of Thompson sampling. In terms of how much you might 'regret' showing sub-optimal messages to users over many trials, Thompson sampling has nice theoretical and empirical performance characteristics. It also has the nice property that a statistically aberrant early sample won't ruin your long-term behavior, as it might with the naive "send some pushes, test for a significantly different response, always send the 'winning' push thereafter" approach. And it extends naturally to the more complicated, individualized statistical models mentioned above.

However, it is left as an exercise for the reader to replace the Beta prior/Bernoulli likelihood model implemented here with a regularized regression model. Or to implement UCB, if that's your thing.
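(For a taste of the latter, a UCB1-style chooseMessage would score each message deterministically instead of sampling, along these lines; this fragment is illustrative only.)

```java
// UCB1: pick the message with the highest optimistic estimate
// mean + sqrt(2 * ln(totalSends) / sends), trying each untested message first.
Message chooseUcb1(Map<Message, long[]> counts /* message -> {sends, opens} */) {
    long totalSends = counts.values().stream().mapToLong(c -> c[0]).sum();
    Message best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Map.Entry<Message, long[]> entry : counts.entrySet()) {
        long sends = entry.getValue()[0];
        long opens = entry.getValue()[1];
        double score = (sends == 0)
                ? Double.POSITIVE_INFINITY // send untested messages at least once
                : (double) opens / sends + Math.sqrt(2.0 * Math.log(totalSends) / sends);
        if (score > bestScore) {
            bestScore = score;
            best = entry.getKey();
        }
    }
    return best;
}
```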

Anyway, the Bayesian bandit chooser takes a few more than the 20 lines of code in Ted Dunning's post, because we aren't just ticking up a counter: we have to keep a list of push IDs associated with each of the messages we're experimenting on, collate those with the response counts from the reporting API, and proceed from there. Nonetheless, it's not bad:
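(Again in sketch form, with the Beta sampling delegated to commons-math3's BetaDistribution and the bookkeeping simplified.)

```java
import org.apache.commons.math3.distribution.BetaDistribution;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Thompson-sampling chooser with a Beta(1,1) prior on each message's open rate.
// Each update folds per-push stats back into per-message send/open totals.
public class BayesianBanditChooser implements MessageChooser {
    private final Map<Message, List<String>> pushIdsByMessage = new HashMap<>();
    private final Map<Message, long[]> totals = new HashMap<>(); // message -> {sends, opens}

    public BayesianBanditChooser(List<Message> messages) {
        for (Message m : messages) {
            pushIdsByMessage.put(m, new ArrayList<>());
            totals.put(m, new long[] {0L, 0L});
        }
    }

    @Override
    public void recordSend(Message message, String pushId) {
        pushIdsByMessage.get(message).add(pushId);
    }

    @Override
    public void update(Map<String, PushStats> statsByPushId) {
        // Collate the per-push counts into per-message totals.
        for (Map.Entry<Message, List<String>> entry : pushIdsByMessage.entrySet()) {
            long sends = 0, opens = 0;
            for (String pushId : entry.getValue()) {
                PushStats stats = statsByPushId.get(pushId);
                if (stats != null) {
                    sends += stats.sends;
                    opens += stats.influencedOpens;
                }
            }
            totals.put(entry.getKey(), new long[] {sends, opens});
        }
    }

    @Override
    public Message chooseMessage() {
        // Draw one sample from each message's Beta posterior and send the winner.
        Message best = null;
        double bestDraw = -1.0;
        for (Map.Entry<Message, long[]> entry : totals.entrySet()) {
            long sends = entry.getValue()[0];
            long opens = entry.getValue()[1];
            double draw = new BetaDistribution(1.0 + opens, 1.0 + Math.max(0, sends - opens)).sample();
            if (draw > bestDraw) {
                bestDraw = draw;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```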

You can download the rest of the code on github, and play with the DummyRunner class to see what happens if you fake the sends and users' app-opening behavior. But, just to give you a flavor, I ran an experiment with our internal development app.

I tested two pushes, hoping to see more people open one than the other, with the winner sent out more often over time. Let's see whether that's what happened, starting with the cumulative sends over time.

That data looks pretty good. A got sent about 124 times in total, compared to 30 for B. By the end we were definitely learning that users were more likely to respond to A than to B. The difference isn't as stark as you might see with more users involved in the test, though.

Since the basis of the algorithm is a model we're building of how likely a user is to open each push, let's look at how we modeled that probability over time. Here the X axis is time, and the Y axis is the estimated probability of a user opening each message at that time.
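(With the Beta prior/Bernoulli likelihood model, a natural estimate to plot here is each message's posterior mean open rate, (1 + opens) / (2 + sends) under the uniform Beta(1, 1) prior.)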

Early on it looked like A was going to have a much higher rate of opens, but then it tailed off and was even briefly passed by B. How did that happen? The final piece of this puzzle is in the app-open data:

During the second hour of the experiment we got a big spike in opens from users who received A. But in the few hours after that, we saw only three A-related opens against one B-related open. Over that same period we were also sending A somewhat more often, so that disparity in opens wasn't enough to keep A's estimated probability of generating opens higher. Clearly, my "compelling" A message was not that compelling. So for a few hours we sent about the same number of pushes with each message. Then, toward the end of the test, more people started to respond to message A and we resumed sending it much more often. Unfortunately, we ran out of users soon after that. With more users we would have seen A continue to receive the bulk of the sends, especially as we became more confident in its superior chance of influencing users to open the app.

This exercise is a simple demonstration of an A/B testing technique built on the flexibility of our per-push reporting API.

Customer success stories are a huge source of motivation for us at Urban Airship. If anyone tries testing deliveries with their users, we would love to hear how it turns out. Drop us a line, or add a reader comment.