Test Strategy for Location Data Filtering / Presentation Algorithms

One of the most fundamental elements of the Mobile App is the logic that takes a set of data points from the user’s location history, plus a set of points of concern, and turns it into something to present to the user.

Note: I talk about filtering rather than matching for reasons described in User Stories: What info about diagnosed cases & their trails do users really want from an App?

However, there is more than just filtering going on here: our current implementation does not actually share the point of concern itself, but rather some data derived from a set of points of concern that have been identified as relevant.

As I understand it, the current algorithm for filtering is: any match within (+/-20m, +/-4 hours). I have many doubts about this, per GitHub #516.

I have not yet seen our current presentation algorithm, but hopefully we will get that written up & described somewhere soon.

This function is the absolute heart of this App, and there is a lot of complexity.

Roughly, we can model it as follows.

There are 5 key processes / algorithms in play here:

  1. The process by which we map the end user’s behaviour into a set of data points. Currently there is a data point generated at a variable cadence, approx 4-6 mins between data points.

  2. The same process applied to an infected person. If they use the Safe Paths app for location tracking, this is the same algorithm as 1. If they use 3rd party location data such as Google Takeout, it may be higher resolution (approx every 2 mins, I believe). This data could potentially be normalized to a cadence closer to Safe Paths data under 3.

  3. The processing performed by Safe Places on an infected user’s data export. Currently this is just redaction. Exact redaction policies are yet to be clearly defined - they include redaction of the home address, and associates’ home addresses, for privacy reasons. Private car journeys may also be redacted, primarily for reasons of efficacy (on the understanding that people inside cars are unlikely to receive or transmit COVID-19).

  4. The algorithm by which we filter the public data about points of concern into a data set to expose to the user. Currently we do this simply by filtering for +/-20m, +/-4 hours, within the points in the user’s data trail (see the sketch just after this list). We treat the user as if he teleports from one point to the next, every 4-6 minutes, which can cause detection problems when two users' GPS logging cadences are in anti-phase - see #516. We do not currently incorporate any other data into this algorithm, though there may be reasons to do so - see e.g. User Stories: What info about diagnosed cases & their trails do users really want from an App?

  5. The algorithm by which we determine what to display to the user. I’ve not seen details of this, but it appears that we condense the set of filtered data points into a single number of minutes of exposure in that day. It would be possible to provide more information, such as specific place and time (though there may be privacy risks for infected persons in doing so - which may be what is behind the current design). Potentially the user’s own location data, and other known information such as their home address, could be factored into the presentation, though that is not the case today.
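
To make 4. and 5. concrete, here is a minimal Python sketch of the naive point-wise filter and the condensing step as I understand them. All names here are hypothetical, the flat x/y coordinates stand in for real GPS co-ordinates, and the assumption that each matched point contributes ~5 minutes of exposure is my guess at the condensing rule - this is an illustration, not the actual app code.

```python
import math
from dataclasses import dataclass

@dataclass
class GpsPoint:
    x: float  # metres east of an arbitrary origin
    y: float  # metres north of an arbitrary origin
    t: float  # seconds since an arbitrary epoch

def is_match(user: GpsPoint, concern: GpsPoint,
             dist_m: float = 20.0, window_s: float = 4 * 3600) -> bool:
    # A user point "matches" a point of concern if it is within
    # +/-20m AND +/-4 hours of it.
    close_in_space = math.hypot(user.x - concern.x, user.y - concern.y) <= dist_m
    close_in_time = abs(user.t - concern.t) <= window_s
    return close_in_space and close_in_time

def exposure_minutes(user_trail, points_of_concern,
                     minutes_per_match: float = 5.0) -> int:
    # Condense the filtered data points into a single "minutes of
    # exposure" figure, assuming each matched user point represents
    # roughly one logging interval (~5 minutes) of contact.
    matched = sum(1 for u in user_trail
                  if any(is_match(u, c) for c in points_of_concern))
    return int(matched * minutes_per_match)
```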

Most of the above algorithms could, in principle at least, be subject to variations based on user preferences, and possibly based on local Health Authority preferences, reflecting local conditions, known local transmission vectors etc. The final advice presented to the user is almost certainly in need of customization to the local situation.

None of the above elements works in isolation. As can be seen from e.g. the example in GitHub #516, the nature of 1. and 2. - specifically, the fact that these two processes are independent, so the data sets can be in anti-phase - has implications for what might be an effective algorithm in 4.

How to Test It?

Now that we have a model for what’s going on, we can use it to think about how to test this.

The first point is that we can avoid a lot of the risk of errors by testing this system end-to-end. It might be attractive to just test 4. with test data consisting of synthetic sets of GPS points. But then it would be very easy to construct test data sets that don’t reflect the full variety of possibilities: for example, it would be easy to just use test data on fixed 5 minute boundaries, and miss the issues that may arise from variable cadences and mismatched offsets in time series.

However, if you try to think about how to test this end-to-end using the complete App, you run into a different set of issues. It is difficult to feed false inputs into a phone’s GPS receiver. It is also non-trivial to feed points of concern into the app: it requires pushing the data up to an HTTP server, from where the app can retrieve it. These things are not impossible - see e.g. @Jonathon Wright's work on feeding fake GPS co-ordinates into a phone. But they do add a great deal of complexity, and will slow down the testing (thereby reducing the volume of tests we can execute).

The best solution appears to be to build a framework that incorporates all the elements above - hence end-to-end with regard to the above model - but does so outside the context of the mobile app.

Here’s a straw man:

  • We develop a simple syntax for describing a set of movements in space of an individual. This could be e.g. a set of data points & timestamps, with an express assumption that the individual moves linearly from one point to the next (if you want non-linear movement, either in space or time, add more data points). Call this a “movement set” (MS).

  • We develop a tool that maps an MS into a number of possible sets of GPS data points. We can implement our own model of this mapping - for example, pick data points at randomized intervals, evenly distributed between 4 & 6 minutes (a sketch of this mapping appears after the example table below). This is close enough to what we’ll see in reality (though note the caveat that Google data is finer grained, at ~2 mins).

  • Using this tool we can generate a number of data sets that represent possible GPS Representations (GRs) of those movements that the app might be exposed to. We might generate 100 such representations, to get good coverage of the many possible ways that set of movements could end up represented.

  • Now, we can define test cases in terms of MSs: one for the user, and one for the infected person whose interactions with the user we are analyzing. Note: I’m not aware of any reason why we’d need to consider more than one infected person at a time - in our current model the interactions between each (user, infected person) pair can be analyzed independently.

  • For a given pair of MSs, we can define a set of expected outcomes, or “user presentations” (UPs). These are data representations of what the user may see as a result of the two MSs. In our current output model, there seems to be just a single parameter, which can be 0 (no match), 5, 10, 15 etc. up to 60 minutes of detected contact.

  • In a perfect world, each pair of MSs would map to one & only one UP. But this is not a perfect world (if you hadn’t noticed!), so let’s accept that the same 2 MSs can lead to multiple UPs. And let’s group UPs into outcomes that are “OK”, “Tolerable” and “Unacceptable” for a given pair of MSs.

  • So I can define a Test Case in terms of 2 x MSs, one for the user & one for the infected person, and a set of possible outcomes. Something like this:

MS1: Starts at (0,0) → (0,1) (10 mins) → (1,1) (10 mins) → (1,2) (15 mins)
MS2: Starts at (1,0) → (1,2) (30 mins) → (2,2) (5 mins)
OK UPs: 15, 20, 25
Tolerable UPs: 5, 10, 30, 35
Unacceptable UPs: 0, 40+
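
To make the straw man concrete, here is a sketch of what the Movement Set type and the MS → GR mapping tool could look like. Everything here (Waypoint, position_at, gps_representation, the random initial offset) is a hypothetical design in the same spirit as the straw man, not an existing implementation:

```python
import random
from dataclasses import dataclass

@dataclass
class Waypoint:
    x: float  # metres
    y: float  # metres
    t: float  # seconds since the start of the test case

# A Movement Set (MS) is an ordered list of waypoints (timestamps
# strictly increasing); the individual is assumed to move linearly
# between consecutive waypoints.
MovementSet = list[Waypoint]

def position_at(ms: MovementSet, t: float) -> tuple[float, float]:
    # Linearly interpolate the individual's position at time t.
    if t <= ms[0].t:
        return (ms[0].x, ms[0].y)
    for a, b in zip(ms, ms[1:]):
        if a.t <= t <= b.t:
            frac = (t - a.t) / (b.t - a.t)
            return (a.x + frac * (b.x - a.x), a.y + frac * (b.y - a.y))
    return (ms[-1].x, ms[-1].y)

def gps_representation(ms: MovementSet, rng: random.Random,
                       min_gap_s: float = 240.0,
                       max_gap_s: float = 360.0) -> MovementSet:
    # Map an MS to one possible GPS Representation (GR): sample the
    # movement at randomized intervals, evenly distributed between
    # 4 & 6 minutes, starting from a random initial offset. The
    # sampled waypoints play the role of the GpsPoints in the earlier
    # filtering sketch (they share the same x/y/t shape).
    points, t = [], rng.uniform(0.0, max_gap_s)
    while t <= ms[-1].t:
        x, y = position_at(ms, t)
        points.append(Waypoint(x, y, t))
        t += rng.uniform(min_gap_s, max_gap_s)
    return points
```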

Then I can run my model (a Monte-Carlo simulation) over, say, 100 or even 1000 MS → GR mappings, and determine what percentage of those runs give an OK UP, what % give a Tolerable UP, and what % give an Unacceptable UP.

We might set pass criteria that we should be “OK” > 75% of the time, and “Unacceptable” < 5% of the time. We could tighten those targets as we step up our quality goals & tune the algorithm.
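
Building on the sketches above, the Monte-Carlo harness for a single test case might look like this (again hypothetical; note it treats the infected person’s GR directly as the points of concern, i.e. it ignores redaction, process 3, for now):

```python
# The example test case from the table above, taking one grid unit as
# (arbitrarily) 100 m, with times converted to seconds:
U = 100.0
ms1 = [Waypoint(0, 0, 0), Waypoint(0, U, 600),
       Waypoint(U, U, 1200), Waypoint(U, 2 * U, 2100)]
ms2 = [Waypoint(U, 0, 0), Waypoint(U, 2 * U, 1800),
       Waypoint(2 * U, 2 * U, 2100)]

def run_test_case(ms_user: MovementSet, ms_infected: MovementSet,
                  ok_ups: set, tolerable_ups: set,
                  runs: int = 1000, seed: int = 1) -> dict:
    # Monte-Carlo over possible MS -> GR mappings: for each run,
    # generate one GR per MS, apply the filter/presentation sketch,
    # and bucket the resulting UP as OK / Tolerable / Unacceptable.
    rng = random.Random(seed)  # fixed seed => reproducible results
    counts = {"ok": 0, "tolerable": 0, "unacceptable": 0}
    for _ in range(runs):
        up = exposure_minutes(gps_representation(ms_user, rng),
                              gps_representation(ms_infected, rng))
        if up in ok_ups:
            counts["ok"] += 1
        elif up in tolerable_ups:
            counts["tolerable"] += 1
        else:
            counts["unacceptable"] += 1
    return {k: v / runs for k, v in counts.items()}

results = run_test_case(ms1, ms2, ok_ups={15, 20, 25},
                        tolerable_ups={5, 10, 30, 35})
# Pass criteria as suggested above:
passed = results["ok"] > 0.75 and results["unacceptable"] < 0.05
```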

(Note: I am not claiming the above is an especially interesting test case, or that the UP categorization is correct - just intended to be an illustration).

Getting Closer to the Real World

The above model could work, but it’s still a bit abstract and hard to generate test cases for.

We can do better if we add an additional layer to the model, which I will call “Movement Set Descriptions”.

Here’s an example of a test case in Movement Set Descriptions.

MSD1: Gets on bus at 10:00. Bus drives along the A10 at 20kph, stopping every 3-4 mins (random) for 30 to 90 seconds (random). Rides on bus until the 5th stop, then gets off and walks south.
MSD2: Gets on same bus at the 2nd stop. Stays on the bus to the end of the test case.
OK UPs: 15, 20, 25, 30, 35
Tolerable UPs: 5, 10, 40
Unacceptable UPs: 0, 40+

These Movement Set Descriptions map to Movement Sets, but again in a random manner. The description of the bus journey above could lead to many possible Movement Sets, which in turn could lead to many possible GPS Representations, which would potentially lead to many different User Presentations.
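
As an illustration, here is one hypothetical way the bus-journey MSD above could be mapped to randomized Movement Sets, modelling the A10 as a straight line heading east and reusing the Waypoint type from earlier:

```python
def bus_movement_set(rng: random.Random, num_stops: int = 8) -> MovementSet:
    # One possible Movement Set for the bus itself: drives east at
    # 20kph, stopping every 3-4 mins (random) for 30-90 seconds
    # (random). Riders' Movement Sets can then be derived from it.
    speed_ms = 20_000 / 3600              # 20kph in metres/second
    ms = [Waypoint(0.0, 0.0, 0.0)]
    x = t = 0.0
    for _ in range(num_stops):
        drive_s = rng.uniform(180, 240)   # 3-4 minutes between stops
        x += speed_ms * drive_s
        t += drive_s
        ms.append(Waypoint(x, 0.0, t))    # arrive at the stop
        t += rng.uniform(30, 90)          # dwell 30-90 s at the stop
        ms.append(Waypoint(x, 0.0, t))    # depart
    return ms
```

MSD1’s Movement Set would then be the slice of the bus’s waypoints up to the 5th stop, followed by waypoints walking south; MSD2’s would be the slice from the 2nd stop onwards. Each call with a fresh random generator yields a different Movement Set, which is exactly the random MSD → MS mapping described above.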

Ideally we want to tune our algorithms so that a consistent set of Movement Set Descriptions maps fairly reliably to an acceptable set of User Presentations.

It should be possible to build a regression suite consisting of a set of Movement Set Descriptions (the exact syntax for these is still TBD), and a corresponding set of User Presentations that are OK, Tolerable and Unacceptable. We can then run this full suite of tests, and determine whether our algorithms are performing within tolerances.
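
On top of the Monte-Carlo harness sketched earlier, the suite runner itself could be very simple (again hypothetical - in the long run the entries would be generated from Movement Set Descriptions rather than hand-written Movement Sets):

```python
# Hypothetical suite: each entry names a test case plus its UP tolerances.
SUITE = [
    ("example from the table above", ms1, ms2,
     {15, 20, 25}, {5, 10, 30, 35}),
    # ... further test cases ...
]

def run_regression_suite(suite, runs: int = 1000) -> list:
    # Return the test cases whose Monte-Carlo results fall outside the
    # agreed tolerances (OK > 75% of runs, Unacceptable < 5%).
    failures = []
    for name, ms_a, ms_b, ok_ups, tolerable_ups in suite:
        r = run_test_case(ms_a, ms_b, ok_ups, tolerable_ups, runs=runs)
        if r["ok"] <= 0.75 or r["unacceptable"] >= 0.05:
            failures.append((name, r))
    return failures
```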

Runs are random, so there is some risk of unpredictable test results, but if we can crank the numbers in the Monte-Carlo simulations high enough, we should get good enough reliability.

All the above is a more abstract representation of a test idea I posted a few days ago (but haven’t yet managed to execute on). If you are struggling to get the general ideas, that may be a useful read too: User Story Location Mapping: Bus Journey #1

Implementation

I’m a big fan of incremental implementation. I’m also aware that if we were to rework the UX on User Presentation, that would cause us to substantially rework any code validating User Presentation output.

A proposed starting point is to build a test framework that isolates the relevant App code for algorithms 4 & 5, and allows us to build test data in terms of “Movement Sets”, where we can manually validate the User Presentation output.

As a 2nd phase, we could then either move towards developing the “Movement Set Description” language (which I think we should prioritize over building a large library of tests described in terms of Movement Sets), or, if we have confidence that the User Presentation UX is not going to change much, we could work on automated validation of User Presentation output.

Ultimately we’ll want to do both these steps, but we’ll want to react to the overall project context to decide which to do next.

Building the initial infrastructure that allows us to run Monte Carlo simulations of the App algorithm code using Movement Sets seems like an obvious place to start, and will deliver immediate value.

Before we do start, we should apply a bit of pressure to the question of whether GPS logging cadences are definitely going to remain variable & unsynchronized. There are good privacy reasons for cadence variability (it makes it much harder to de-pool data). But if we were able to synchronize all cadences (so that e.g. everyone always outputs data at 12:00, 12:05 etc. exactly), that would greatly simplify our matching / filtering algorithms, reduce the complexity of some of this testing, and change the implementation of the Movement Set → GPS Representation mapping somewhat.
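
For comparison, under fully synchronized cadences the Movement Set → GPS Representation mapping sketched earlier would collapse to something deterministic like this (hypothetical):

```python
def synchronized_gps_representation(ms: MovementSet,
                                    interval_s: int = 300) -> MovementSet:
    # With every device sampling at exactly the same instants
    # (12:00:00, 12:05:00, ...), each MS maps to a single
    # deterministic GR - no Monte-Carlo needed for this step.
    return [Waypoint(*position_at(ms, float(t)), float(t))
            for t in range(0, int(ms[-1].t) + 1, interval_s)]
```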


Final close: these are very early-stage ideas. Comments, feedback or any other form of engagement is always welcome - just contact @Diarmid Mackenzie - usually available on Slack.