Accuracy of Info (False Positives / False Negatives)
With reference to these two rows in our undiagnosed users value table in the Quality Map:
Accuracy (False Positives) | I don’t want to be warned about encounters where there was no plausible risk of infection |
Accuracy (False Negatives) | If I have an encounter with someone that risks infection, I want to be notified |
This article brainstorms some possible scenarios that might lead to false positives and false negatives when matching a user’s location data against HA data for infected persons.
CAVEAT: I have not yet found any documentation for the algorithm that we use to compare data sets for matches. Learning about this algorithm would be useful.
UPDATE: Some notes on the algorithm here: https://github.com/tripleblindmarket/covid-safe-paths/issues/516#issuecomment-615346720
Currently we simply compare point by point, and consider them a match if:
They are within a certain distance of each other (20m)
And within a certain time of each other (+/- 4 hours, but probably ought to be -5 minutes to +4 hours)
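As a rough illustration of the point-by-point rule above, here is a minimal sketch. It is not the app's actual code: the `Point` type, the function names and the use of haversine distance are all my assumptions, and I am reading "-5 minutes to +4 hours" as the user's point falling between 5 minutes before and 4 hours after the infected person's point.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000
MAX_DISTANCE_M = 20          # distance threshold quoted above
WINDOW_BEFORE_S = 4 * 3600   # user point may be up to 4 h before the infected point
WINDOW_AFTER_S = 4 * 3600    # user point may be up to 4 h after the infected point


@dataclass(frozen=True)
class Point:
    """Hypothetical location record; field names are illustrative only."""
    lat: float   # degrees
    lon: float   # degrees
    time: float  # Unix time in seconds


def haversine_m(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in metres."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def is_match(user: Point, infected: Point,
             before_s: float = WINDOW_BEFORE_S,
             after_s: float = WINDOW_AFTER_S) -> bool:
    """True if the two points are close enough in both time and space."""
    dt = user.time - infected.time
    return -before_s <= dt <= after_s and haversine_m(user, infected) <= MAX_DISTANCE_M


def find_matches(user_track, infected_track, **window):
    """Compare every user point with every infected point (O(n*m))."""
    return [(u, i) for u in user_track for i in infected_track
            if is_match(u, i, **window)]
```

Calling `is_match(user, infected, before_s=300)` gives the asymmetric -5 minutes to +4 hours window suggested above.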
Assumptions about nature of DATA: user data is recorded every ~5 minutes (though cadence may not be exact).
HA infection data may be in this format, or in some other format if the infected person had no historic location data and the HA instead generates synthetic data to match their reported movement history. In the latter case, the HA infection data can probably still fall on regular 5 minute intervals; for synthetic data, we might assume a perfect 5 minute cadence.
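To make the "perfect 5 minute cadence" assumption concrete, here is a small sketch of how synthetic points could be produced from a reported movement history. The waypoint tuple layout `(lat, lon, unix_time)` and the straight-line interpolation are my assumptions for illustration; I don't know how an HA tool would really do this.

```python
CADENCE_S = 300  # one point every 5 minutes


def synthetic_track(waypoints, cadence_s=CADENCE_S):
    """Interpolate (lat, lon, unix_time) waypoints onto a perfect cadence.

    Waypoints must be sorted by time. Movement between two waypoints is
    assumed to be a straight line at constant speed.
    """
    if len(waypoints) < 2:
        return list(waypoints)
    track = []
    seg = 0
    t = waypoints[0][2]
    while t <= waypoints[-1][2]:
        # Advance to the segment that contains time t.
        while t > waypoints[seg + 1][2]:
            seg += 1
        (lat0, lon0, t0), (lat1, lon1, t1) = waypoints[seg], waypoints[seg + 1]
        f = (t - t0) / (t1 - t0) if t1 > t0 else 0.0
        track.append((lat0 + f * (lat1 - lat0), lon0 + f * (lon1 - lon0), t))
        t += cadence_s
    return track


# Reported history: walk east for 10 minutes, then stay put for 10 minutes.
history = [(0.0, 0.0, 0), (0.0, 0.001, 600), (0.0, 0.001, 1200)]
track = synthetic_track(history)
```

Real user data would not be this regular, so a test set probably also wants some jitter added to the timestamps.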
False Positives
There are a range of scenarios that we can imagine triggering false positives. We’d be looking for ideas as to what we might do to reduce the impact of these false positives.
Also included, for comparison, are some scenarios (marked Counterpoint) that might look ~identical from a data pov but should trigger as genuine contacts.
(FP) Driving along the freeway. Rush hour traffic queued up in the other direction. How many C+ people do you pass?
(FP) Driving along the freeway for an hour, 50m behind an infected driver.
(Counterpoint) Driving along the freeway for an hour, with an infected passenger
(Counterpoint) On a Greyhound bus with an infected passenger
(FP) 20 storey apartment block. All residents will be within 70m of each other, from a GPS pov. (GPS can measure altitude too, though we don’t use that today; its accuracy is ~3x worse than x/y position, so approx +/-30m. 60m = ~15 storeys.)
Left phone in locker at gym. Infected person comes into changing room, leaves their phone in next locker.
False Negatives
Misses due to data point cadences in anti-phase - see #516 in GitHub
2 passengers on a flight. Phones switch to airplane mode, so no locations are logged. We fail to identify that they spent 2 hours together (perhaps we catch 5-10 mins before they turn their phones off and after landing).
When else do you turn your phone off?
When does a phone lose GPS? Underground trains? (Bluetooth to the rescue?) Cinema/theater, where everyone turns their phones off? Airplanes? Near medical equipment?
At the gym / swimming pool - leave your phone in a locker?
When might a point in time every 5 minutes fail to represent the user’s actual movements / risk?
Runner running laps in a park?
Short trips done in < 5 mins.
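The anti-phase miss in the first item of this list can be shown with a toy simulation. This is my reading of the problem, not code from #516: two people walk side by side along a straight line, each phone logs a point every 5 minutes, and we count how many point pairs fall within 20m. Positions are in metres along the line to keep the arithmetic obvious, and all names are made up for the example.

```python
CADENCE_S = 300
MAX_DISTANCE_M = 20
MAX_DT_S = 4 * 3600


def sample_positions(speed_mps, offset_s, duration_s, cadence_s=CADENCE_S):
    """(time, position_m) samples for someone walking in a straight line."""
    return [(t, speed_mps * t) for t in range(offset_s, duration_s + 1, cadence_s)]


def count_matches(track_a, track_b):
    """Number of point pairs within the distance and time thresholds."""
    return sum(
        1
        for ta, xa in track_a
        for tb, xb in track_b
        if abs(ta - tb) <= MAX_DT_S and abs(xa - xb) <= MAX_DISTANCE_M
    )


WALKING_MPS = 1.4
user = sample_positions(WALKING_MPS, offset_s=0, duration_s=1800)
in_phase = sample_positions(WALKING_MPS, offset_s=0, duration_s=1800)
anti_phase = sample_positions(WALKING_MPS, offset_s=150, duration_s=1800)

matches_in_phase = count_matches(user, in_phase)      # both phones log at the same moments
matches_anti_phase = count_matches(user, anti_phase)  # logs are 150 s apart
```

With the logs 150 seconds apart, any pair moving faster than 20m / 150s (about 0.13 m/s) is never recorded within 20m of each other, even though they were together for the whole half hour.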
How to Test
One key issue is how to get the data into the app.
For the infection data, that's easy. We just need to serve it up on an HTTP server somewhere. The only issue is generating realistic data, in the correct geo-location.
Correct geo-location is solved, but making the data realistic for that geo-location is something that needs a lot more work. Since Safe Places has a mapping tool, and spits out JSON files for HA servers, it might be we can leverage this somehow for generating realistic infection data.
For the user's data, if we weren't on lockdown, I'd just get users out on the streets with their phones. Through Applause (crowdsourcers) we may be able to get access to people in locations that aren't locked down (they have 500k testers globally).
Alternatively, we have on and off had a capability to import position history from Google (but it was disabled in v0.9.4 for usability reasons). Since I suspect it's hard to insert ourselves directly between the GPS and the phone, I think this is the best way to feed a user's position into the app (and we'll need to get the capability reinstated). So synthetic data just needs to be generated in the Google format.
Again the real challenge is getting realistic synthetic data in a given area.
And figuring out all the "interesting" use cases (as per discussion above).
UPDATE: Adam Leon Smith has made some big steps forwards in this area - using real data from Google as Seeds and generating further synthetic data using ML.
We also need to do some work clarifying redaction procedures - e.g. if standard procedure is to redact all car journeys (which I guess might be a solution for some of the false positive issues above) then we should mostly be building data sets that match that pattern.
(Also, Safe Places test cases are needed to figure out how that is actually done and whether it works.)
Ideas for fixes
Regularly prompt the user to record notes about their movements? This will help when they review the data for any matches.
In particular can be prompted by changes in movement - e.g. detect start/end of car journey, label as one period, and ask user to note if they were in a car, bus, train etc.
Add velocity info to location data? OK for private data. Not sure if this is OK for infection data (further privacy concerns).
Redact all Car journeys? Safe Places users would need support to help identify car journeys? Again based on velocity data? Or gaps between 5 min location pings?
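The velocity and car-journey ideas above could be prototyped from the existing 5 minute pings alone, without storing any extra data. A minimal sketch, assuming tracks are lists of `(lat, lon, unix_time)` tuples and using a made-up threshold of 7 m/s (about 25 km/h) as the cut-off for "probably in a vehicle":

```python
import math

EARTH_RADIUS_M = 6_371_000
CAR_SPEED_MPS = 7.0  # hypothetical threshold, roughly 25 km/h; would need tuning


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon pairs (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def segment_speeds(track):
    """Average speed (m/s) between each pair of consecutive points."""
    speeds = []
    for (lat0, lon0, t0), (lat1, lon1, t1) in zip(track, track[1:]):
        dt = t1 - t0
        speeds.append(haversine_m(lat0, lon0, lat1, lon1) / dt if dt > 0 else 0.0)
    return speeds


def redact_fast_points(track, threshold_mps=CAR_SPEED_MPS):
    """Drop every point that starts or ends a segment faster than the threshold."""
    fast = set()
    for i, speed in enumerate(segment_speeds(track)):
        if speed > threshold_mps:
            fast.update((i, i + 1))
    return [p for i, p in enumerate(track) if i not in fast]


# Walk, then a fast hop (about 5.4 km in 5 minutes), then walk again.
track = [(0.0, 0.0, 0), (0.0, 0.001, 300), (0.0, 0.05, 600), (0.0, 0.051, 900)]
redacted = redact_fast_points(track)
```

Dropping both ends of a fast segment is a blunt choice: it also removes the last point before the journey and the first point after it, which may themselves be genuine contacts. An average over 5 minutes will also miss slow city driving, so the threshold is only a starting point.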