I have developed some analysis based on the approach described here:
Test Strategy for Location Data Filtering / Presentation Algorithms
Headlines
As we already suspected, current GPS accuracy is not good enough, and the reliability issues are also a major problem.
When 2 people meet for 20 minutes, we will detect an exposure only 43% of the time.
Similarly when 2 people travel on a bus together for 20 minutes, we will detect an exposure only 37% of the time.
However, we are very prone to false positives: 2 people who spend 24 hours 150m apart will trigger an exposure on 58% of days, due to GPS inaccuracy.
Therefore we urgently need to fix both issues for the app’s exposure detection to be remotely viable.
The good news is that if we do, then we can get excellent exposure detection in all 3 basic scenarios analyzed. Other scenarios will be more problematic, but we have established a good basic level of utility for the current algorithm, and also learned that the odds of failing to detect exposure due to antiphase logging timers combined with movement (GitHub #516) may be lower than we had feared.
Technical Details
The analysis was done using Python, source code here: https://github.com/diarmidmackenzie/safepathstools/tree/master/algorithm-test
Due to technical challenges interfacing between Python & Javascript, this does not test the actual product code (https://github.com/tripleblindmarket/covid-safe-paths/blob/b9e862a51da39c1b34917310d8adf3e42c73d699/app/encryption/intersection.py ). Instead it uses a Python implementation of the same algorithm, which right now has some shortcuts (e.g. some hard-coded values for the distance of a degree of longitude or latitude).
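To illustrate the kind of shortcut mentioned above, the sketch below compares a single hard-coded meters-per-degree value with a conversion that scales longitude by the cosine of the latitude. The constant of 111,320 m per degree and both function names are assumptions for illustration, not values taken from the test tool or the product code.

```python
import math

# Approximate length of one degree of latitude in meters (assumed constant).
METERS_PER_DEGREE = 111_320.0


def distance_hardcoded(lat1, lon1, lat2, lon2):
    """Flat-earth distance that treats a degree of longitude like a degree of latitude."""
    dy = (lat2 - lat1) * METERS_PER_DEGREE
    dx = (lon2 - lon1) * METERS_PER_DEGREE
    return math.hypot(dx, dy)


def distance_scaled(lat1, lon1, lat2, lon2):
    """Flat-earth distance with longitude shrunk by cos(latitude)."""
    mean_lat = math.radians((lat1 + lat2) / 2.0)
    dy = (lat2 - lat1) * METERS_PER_DEGREE
    dx = (lon2 - lon1) * METERS_PER_DEGREE * math.cos(mean_lat)
    return math.hypot(dx, dy)


# Two points 0.001 degrees of longitude apart at latitude 60 N.
naive = distance_hardcoded(60.0, 10.0, 60.0, 10.001)
scaled = distance_scaled(60.0, 10.0, 60.0, 10.001)
print(round(naive, 1), round(scaled, 1))
```

How much this matters depends on which fixed values are used and on the latitude of the test data, so it is worth checking before drawing conclusions from thresholds of a few tens of meters.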
These imperfections can be worked on - or we could even tackle the Python-Javascript integration, or re-implement this design in Javascript. But for now, we have enough to give us a broad overall picture.
Scenarios
So far, I have implemented 3 test scenarios. These are quick to develop, and we could easily extend to more of these.
coffee-meeting.json - Two people meet for coffee, and stay in the same place for 20 minutes
150m-apart.json - Two people spend 24 hours stationary, 150m apart. This exists for false-positive detection
bus-trip.json - Two people travel on a bus which drives at 20kph between bus stops 1km apart, stopping for a minute at each bus stop.
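As a rough check on the bus-trip scenario, the sketch below models distance travelled over time for a bus moving at 20kph between stops 1km apart, with a one-minute stop at each. The function names, and the assumption that the trip begins with the bus pulling away from a stop, are illustrative only; the actual contents of bus-trip.json are not reproduced here.

```python
SPEED_KPH = 20.0          # bus speed while moving (from the scenario)
STOP_SPACING_KM = 1.0     # distance between bus stops (from the scenario)
STOP_SECONDS = 60.0       # dwell time at each stop (from the scenario)

# Time to drive from one stop to the next: 1 km at 20 kph is 180 s.
DRIVE_SECONDS = STOP_SPACING_KM / SPEED_KPH * 3600.0
CYCLE_SECONDS = DRIVE_SECONDS + STOP_SECONDS


def bus_distance_km(t_seconds):
    """Distance along the route after t seconds, starting as the bus leaves a stop."""
    full_cycles, remainder = divmod(t_seconds, CYCLE_SECONDS)
    driving = min(remainder, DRIVE_SECONDS)
    return full_cycles * STOP_SPACING_KM + driving / 3600.0 * SPEED_KPH


def is_stopped(t_seconds):
    """True while the bus is waiting at a stop."""
    return (t_seconds % CYCLE_SECONDS) >= DRIVE_SECONDS
```

Under these assumptions a 20 minute trip covers about 5km and includes five one-minute stops, so roughly a quarter of the trip is spent stationary.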
Noise Data
We use data sets to represent noise on GPS locations recorded, and variations in the frequency at which GPS position is recorded. These are based on real-world data, from Android devices, as recorded here: 25 April 2020 - Real world GPS #2
This includes large swings in GPS signal (50m+) and highly variable gaps between GPS logs, including rare gaps of multiple hours.
We have also performed tests with zero noise - to explore how good the algorithms could be if we were to fix these issues with GPS logs & timing.
Even with fixed 5 minute GPS logs, the model still allows for these to be out of sync with each other.
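One way to picture logging timers that are regular but out of sync is to give each device the same 5 minute interval and an independent random start offset. The sketch below is a minimal illustration; the function name and the uniform offset are assumptions, not necessarily the model the tool uses.

```python
import random

LOG_INTERVAL_SECONDS = 300  # fixed 5 minute logging interval


def log_times(duration_seconds, rng):
    """Timestamps of GPS logs over a period, with a random phase offset."""
    offset = rng.uniform(0, LOG_INTERVAL_SECONDS)
    times = []
    t = offset
    while t < duration_seconds:
        times.append(t)
        t += LOG_INTERVAL_SECONDS
    return times


rng = random.Random(42)  # seeded so the example is repeatable
times_a = log_times(20 * 60, rng)
times_b = log_times(20 * 60, rng)

# Phase difference between the two devices, folded into [0, interval / 2].
phase_gap = abs(times_a[0] - times_b[0])
phase_gap = min(phase_gap, LOG_INTERVAL_SECONDS - phase_gap)
```

A phase gap close to 150 seconds corresponds to the antiphase case, where each device logs midway between the other's logs.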
Test Method
Test cases are defined in terms of “movement sets” for two parties: A & B. A test also defines a “minimum” and “maximum” number of exposure detections that should occur with this set of movements.
Data on GPS noise & timing variations is used to map these movement sets to a number of different possible sets of GPS logs, which might be generated by a user following those movements.
Typically we generate 1000 pairs of sets of GPS logs, and apply the intersection algorithm to each of these 1000 pairs. For the 24 hour test, we run a small number of samples, because our very simplistic intersection detection algorithm scales n-squared with time.
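The quadratic scaling comes from comparing every log of one party with every log of the other. A minimal sketch of such a pairwise check follows; the 20m distance threshold, the 5 minute time window, and the point layout (time in seconds, x and y in meters) are assumptions for illustration rather than the parameters of the product algorithm.

```python
import math

MAX_DISTANCE_M = 20.0     # assumed proximity threshold
MAX_TIME_GAP_S = 300.0    # assumed time window


def count_exposures(logs_a, logs_b):
    """Count points in logs_a with at least one point of logs_b nearby in space and time.

    Each log entry is a (t_seconds, x_m, y_m) tuple. In the worst case every
    pair is compared, so the cost grows with len(logs_a) * len(logs_b).
    """
    exposures = 0
    for t_a, x_a, y_a in logs_a:
        for t_b, x_b, y_b in logs_b:
            close_in_time = abs(t_a - t_b) <= MAX_TIME_GAP_S
            close_in_space = math.hypot(x_a - x_b, y_a - y_b) <= MAX_DISTANCE_M
            if close_in_time and close_in_space:
                exposures += 1
                break  # count each point of A at most once
    return exposures


# Two people in the same place for 20 minutes, timers 150 s out of phase.
same_place_a = [(t, 0.0, 0.0) for t in (0, 300, 600, 900)]
same_place_b = [(t, 5.0, 0.0) for t in (150, 450, 750, 1050)]
# The same timings with the second person 150 m away.
far_away_b = [(t, 150.0, 0.0) for t in (150, 450, 750, 1050)]
```

With 5 minute logs, a 20 minute test has 4 points per party (16 comparisons at most), while a 24 hour test has 288 points per party (up to 82,944 comparisons), which is why the longer test is so much slower per sample.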
We then use this to determine how likely we are to fall within the “minimum” and “maximum” number of exposures, how likely we are to have too few (false negative), and how likely we are to have too many (false positive).
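The classification step can be sketched as follows: given the exposure count from each simulated pair of logs, tally how often the count is in range, too low, or too high. The function name and the example counts are hypothetical; they are not output from the real tests.

```python
from collections import Counter


def classify_runs(exposure_counts, minimum, maximum):
    """Return the fraction of runs in each outcome category."""
    if not exposure_counts:
        raise ValueError("at least one run is required")
    tally = Counter()
    for count in exposure_counts:
        if count < minimum:
            tally["too_few"] += 1
            if count == 0:
                tally["no_exposure"] += 1  # subset of too_few
        elif count > maximum:
            tally["too_many"] += 1
        else:
            tally["ok"] += 1
    total = len(exposure_counts)
    return {key: tally[key] / total
            for key in ("ok", "no_exposure", "too_few", "too_many")}


# Made-up counts from ten simulated runs, with an expected range of 2 to 4.
example = classify_runs([0, 0, 1, 2, 2, 3, 3, 4, 5, 0], minimum=2, maximum=4)
```

Note that in this sketch a count of zero is only reported as "no exposure" when zero is below the minimum, so a scenario with an expected range of 0 to 0 reports zero detections as in range.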
Due to the random elements in the scripts, results do vary from one run to the next, but usually by only 1-2%.
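Variation of this size is roughly what sampling error alone would predict. As a back-of-the-envelope check, assuming the 1000 sample pairs are independent (an assumption about the scripts), the standard error of a proportion p estimated from n runs is sqrt(p(1-p)/n):

```python
import math


def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)


# For a result near 40% measured over 1000 sample pairs.
se = standard_error(0.40, 1000)
print(f"{se:.1%}")
```

For a result near 40% this comes to about 1.5 percentage points. Tests run with fewer samples, such as the 24 hour scenario, would be expected to vary more from run to run.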
Results - with current GPS accuracy & Reliability
With current GPS accuracy & reliability, we get the following results.
Outcome | Coffee Meeting | 150m apart | Bus Trip |
---|---|---|---|
Minimum & maximum expected exposures | 2, 4 | 0 | 2, 4 |
OK (in range) | 40% | 42% | 20% |
No exposure (false negative) | 57% | 0% | 63% |
Too few exposures (false negative) [includes no exposures] | 60% | 0% | 80% |
Too many exposures (false positive) | 0% | 58% | 0% |
With reliability (timing) fixed, but accuracy unchanged, we see the following: major improvements on the Coffee Meeting and the Bus Trip, but now many more false positives on the “150m apart” scenario. Having GPS logging sometimes lock up completely actually helps us avoid false positives!
Outcome | Coffee Meeting | 150m apart | Bus Trip |
---|---|---|---|
Minimum & maximum expected exposures | 2, 4 | 0 | 2, 4 |
OK (in range) | 100% | 6% | 81% |
No exposure (false negative) | 0% | 0% | 2% |
Too few exposures (false negative) [includes no exposures] | 0% | 0% | 19% |
Too many exposures (false positive) | 0% | 94% | 0% |
With accuracy fixed, but reliability unchanged, we see the following. Dramatic improvements on false positives at distance. Slight improvements on the bus trip (though not as good as we got from fixing reliability), and almost no impact on the Coffee Meeting scenario.
Outcome | Coffee Meeting | 150m apart | Bus Trip |
---|---|---|---|
Minimum & maximum expected exposures | 2, 4 | 0 | 2, 4 |
OK (in range) | 41% | 100% | 41% |
No exposure (false negative) | 57% | 0% | 57% |
Too few exposures (false negative) [includes no exposures] | 59% | 0% | 59% |
Too many exposures (false positive) | 0% | 0% | 0% |
With both accuracy and reliability fixed, we get (unsurprisingly) the best results all round. What is slightly surprising, however, is that we get 100% in-range results across the board.
These might not be considered “perfect” in that in a 20 minute bus trip, we might ideally detect 3-4 exposures, whereas our range is 2-4 (and in fact, 55% of the time we only detect 2 exposures).
However these results are better than I expected, given the potential issues that exist when the GPS log timers are in anti-phase (see GitHub #516). The bus stopping for a minute every km is enough for us to detect people in the same place. A shorter bus trip (e.g. 10 mins) would probably still lead to some false negatives, as would a journey on a bus that stops less frequently (e.g. Greyhound) - but there are at least some public transit scenarios where the effects described in #516 are not too serious.
Outcome | Coffee Meeting | 150m apart | Bus Trip |
---|---|---|---|
Minimum & maximum expected exposures | 2, 4 | 0 | 2, 4 |
OK (in range) | 100% | 100% | 100% |
No exposure (false negative) | 0% | 0% | 0% |
Too few exposures (false negative) [includes no exposures] | 0% | 0% | 0% |
Too many exposures (false positive) | 0% | 0% | 0% |
What have we learned?
We have confirmed what we suspected: current GPS accuracy is not good enough, and the reliability issues are also a major problem.
GPS inaccuracy leads mostly to false positives.
Reliability issues mostly contribute to false negatives.
Just fixing the reliability issues without fixing GPS accuracy would actually make false positives a lot worse. (How many people might there be within 150m in an urban environment? We can expect quite a few!)
We have also learned that fixing both of these will result in excellent results in an initial selection of basic scenarios - and that the effects of antiphase logging timers (GitHub #516) may be less than feared.
What next?
Two key directions for future investigation:
We need to actually deliver the changes to GPS accuracy and reliability. Wherever we get to, it won’t be “perfect”, so we should revisit this analysis when we have updated data sets.
We should also extend the set of scenarios that we are modelling to cover a wider range of scenarios that might be problematic - more public transit situations like trains & Greyhound buses, as mentioned above, and more false-positive scenarios.
Test Log
For reference here are the commands run, and the output log for the tests.