What do Health Authorities really want from Exposure Notifications?
Kyle reported that a lot of our HAs weren’t bothered about our Exposure Notification function, and wouldn’t be bothered if we dropped it. He says they see 80% of the value being simply in getting the GPS trails of infected patients.
This concerned me because we have a huge amount of product function dedicated to Exposure Notifications: Safe Places Redaction & Publishing, Health Authority config in Safe Paths, and the Exposure Notification processing itself. Between them these add up to about 50% of the product. Is it really not needed?
So this got me asking: “What should HAs & Contact Tracers want from Exposure Notifications?”.
A useful reference here is Tomas Pueyo’s article on Contact Tracing.
This diagram sums up the problems with traditional contact tracing:
He explains that in South Korean contact tracing they were able to:
use location data from mobile phones, and credit card data, to jog the patient’s memory over where they had been
use CCTV footage from those locations to identify people to follow up with.
Aside from the obvious privacy concerns, this investigation took considerable elapsed time - too long given that the majority of contagions happen between days 3 & 7 after infection
(and bearing in mind that if the contact tracing starts with symptoms + testing, then we are probably already at day 4 or 5).
(model to the left is from this Oxford paper - note that there is not unanimous agreement that it is correct)
https://science.sciencemag.org/content/368/6491/eabb6936
So what HAs & Contact tracers should want is a way to identify more of the people in the groups of pale grey dots much more quickly.
That’s the function that Exposure Notifications should play.
This means:
We don’t need to care much about Exposure Notifications for the workplace (easy enough to follow up & figure out who is at risk there), or Bob’s personal contacts (his recollections should be enough).
The key value we can add is helping to identify strangers that Bob does not know, who spent time near him over the previous period, for example in restaurants, shops, buses etc.
And we should be particularly concerned about indoor spaces (including public transit) where Bob spent extended periods of time. See e.g. these collections of information from actual contact tracing
https://www.erinbromage.com/post/the-risks-know-them-avoid-them
So what technology can we provide to help the HA’s Contact Tracers to help to surface these people?
I think this description points to us moving away from talking about a patient’s “redacted trail”. There may be many places they have been which are not particularly sensitive or private, but also don’t actually represent any serious risk of having passed on the virus.
If 90% of transmissions happen indoors, then the chances of a patient having passed on the virus to someone they walked past on the street are pretty minimal. And because GPS data is not fine-grained (logs every 5 minutes), we’ll only pick out a totally arbitrary subset of such encounters anyway.
Furthermore, although none of these points individually feels like a privacy risk, in aggregate they may help somebody to stitch the patient’s trail together, and de-anonymize it.
Better to stop thinking about the HA publishing a “redacted trail”, and instead think of them as publishing a set of “zones of concern” arising from their (GPS-assisted) interview with Bob.
These areas of concern might be things like:
The number 2 bus, from Stop A at 10:04 to Stop C at 10:37
Bob’s Bistrot, from 7:30pm to 9:30pm on Tuesday.
The City Gym, from 6:30am to 7:45am on Monday
Etc.
By publishing just this specific set of “zones of concern”, and dropping the rest of the patient's trail, I think we substantially reduce the privacy risk & re-identification risk to the patient.
Conceivably, we’d be at the point that a short list of such “zones of concern” might not even be considered “personal data” (we’d need a privacy expert's view on this).
Implementation
Assuming the above is the correct view of the function we’d like, let’s shift back now to thinking about implementation.
Is the current implementation of sharing a subset of the patient’s GPS data points correct?
It feels clumsy and poorly targeted. You can make it work for some scenarios (e.g. the Restaurant), but it doesn’t look like the best way to communicate what’s really needed. It’s pretty problematic for public transit (the patient’s 5 minute pings may not align with those of the person sat next to them on the bus), and there are problems with other scenarios too, like the Cinema or Theater, where phones will be switched off for the majority of the period of concern.
A better implementation might be a set of space-time boxes, with defined levels of criticality.
For example, the restaurant might be a 30m x 30m box, with a 2 hour duration, and a minimum threshold for notification of 30 minutes.
The notification box(es) for the cinema or theater show might be a larger area (covering the whole cinema/theater lobby), but specifically targeting 10 minute windows at the start and end time of the show, with a goal to notify anyone who is spotted in that area, in that time window (even if they only match on a single GPS point). Even more sophisticated: we could also match on the fact that their phone goes offline for the duration of the show.
For the bus, you want a small moving box. This could be described as a very dense series of GPS space/time points, but might be better described as a series of vectors covering the bus route, that the Safe Paths App could match against. Potentially we could reduce our false negative rate by also collecting speed / bearing data on the Safe Paths App & also making that a part of the match algorithm.
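To make the restaurant-style box concrete, here is a minimal sketch of a static space-time box with a minimum dwell threshold. The class name `SpaceTimeBox`, its fields, and the rule that each logged point stands for one 5-minute logging interval are all assumptions made for illustration, not part of the current Safe Paths design. The `criticality` field is carried along but not used in the match.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Tuple

M_PER_DEG_LAT = 111_320.0  # approximate metres per degree of latitude

Point = Tuple[float, float, datetime]  # (lat, lon, timestamp)


@dataclass
class SpaceTimeBox:
    """Hypothetical description of a static zone of concern."""
    center_lat: float
    center_lon: float
    half_side_m: float    # 15 m gives a 30 m x 30 m box
    start: datetime
    end: datetime
    min_dwell: timedelta  # minimum threshold for notification
    criticality: float = 1.0  # carried for later weighting, unused here

    def contains(self, lat: float, lon: float, when: datetime) -> bool:
        if not (self.start <= when <= self.end):
            return False
        # Flat-earth approximation, fine at the scale of a building.
        north_m = (lat - self.center_lat) * M_PER_DEG_LAT
        east_m = ((lon - self.center_lon) * M_PER_DEG_LAT
                  * math.cos(math.radians(self.center_lat)))
        return (abs(north_m) <= self.half_side_m
                and abs(east_m) <= self.half_side_m)


def should_notify(box: SpaceTimeBox, trail: Iterable[Point],
                  log_interval: timedelta = timedelta(minutes=5)) -> bool:
    """Assume each logged point inside the box represents one logging interval."""
    hits = sum(1 for lat, lon, when in trail if box.contains(lat, lon, when))
    return hits * log_interval >= box.min_dwell


bistro = SpaceTimeBox(51.5000, -0.1200, 15.0,
                      datetime(2020, 5, 12, 19, 30),
                      datetime(2020, 5, 12, 21, 30),
                      timedelta(minutes=30))
# Six pings at 5-minute spacing inside the box: 30 minutes of dwell.
diner = [(51.5000, -0.1200, datetime(2020, 5, 12, 20, 0) + timedelta(minutes=5 * i))
         for i in range(6)]
passer_by = diner[:1]  # a single ping inside the box
```

With these inputs the diner reaches the 30 minute threshold and the passer-by does not. The cinema and bus cases would need extra fields (a single-point match rule, or a moving centre) that this sketch does not attempt.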
There are a lot of different possible situations to potentially consider, although we’ll get 80% of the benefit from the top 20%. Here’s a long list of potentially problematic scenarios I compiled - but in practice I don’t think we need to cover more than a few of these.
Encryption
Where does this put us with regard to encryption? I have written elsewhere about the importance of not serving up this kind of data as plain text.
A few points:
By dramatically reducing the amount of personal data we share, moving from a “redacted trail” model to a “zones of concern” model, the personal nature of the data is massively reduced, and therefore the privacy issues are reduced.
But I don’t think the privacy risks are eliminated - we need to think not only about the privacy of the infected patient, but also the privacy of businesses, and other individuals known to frequent affected locations. So I don’t think the encryption requirement goes away.
Potentially we need to serve up a much more semantically rich set of descriptions of “zones of concern” - not just space/time boxes like the restaurant example, but more sophisticated examples like the cinema/theater and bus example.
Our current encryption proposals assume that the data served by the HA is a homogeneous set of place/time data points. A different approach may be needed for these more sophisticated examples.
It’s not clear to me how a user’s space-time points can be assessed against a rich description of a “zone of concern”, without either the space-time point being disclosed to a server (server-side comparison), or the description of the “zone of concern” being disclosed to the App (client-side comparison). The problem being that a cryptographic hash function will not preserve any of the topology of the space-time region being hashed.
A possible solution would be for the “zone of concern” to be resolved into a discrete set of space-time points, which could be served to the client in a hashed form, and which the client could compare with their own hashes of their location data. These hashed space-time points could retain a “criticality” value without any obvious loss of privacy. Matching based on speed/bearing as well gets complicated, though!
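A rough sketch of that idea follows. It assumes a plain latitude/longitude grid as a stand-in for whatever spatial cell scheme is finally chosen, 5-minute time bins, and a `row:col:time_bin` key layout; the cell size, key layout, and function names are all hypothetical. The HA side expands a zone into hashed cells carrying a criticality, and the client hashes its own points the same way and looks them up.

```python
import hashlib
import math
from typing import Dict, Iterable, List, Tuple

CELL_DEG = 0.0005  # hypothetical grid step, roughly 55 m north-south
BIN_S = 300        # 5-minute time bins, in seconds


def cell_key(lat: float, lon: float, unix_s: int) -> str:
    """Quantise a space-time point to a grid cell and time bin."""
    row = math.floor(lat / CELL_DEG)
    col = math.floor(lon / CELL_DEG)
    return f"{row}:{col}:{unix_s // BIN_S}"


def hash_key(key: str) -> str:
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def resolve_zone(lat_min: float, lat_max: float,
                 lon_min: float, lon_max: float,
                 start_s: int, end_s: int,
                 criticality: float) -> Dict[str, float]:
    """HA side: expand a zone into {hash of cell/time-bin: criticality}."""
    published = {}
    rows = range(math.floor(lat_min / CELL_DEG), math.floor(lat_max / CELL_DEG) + 1)
    cols = range(math.floor(lon_min / CELL_DEG), math.floor(lon_max / CELL_DEG) + 1)
    bins = range(start_s // BIN_S, end_s // BIN_S + 1)
    for row in rows:
        for col in cols:
            for time_bin in bins:
                published[hash_key(f"{row}:{col}:{time_bin}")] = criticality
    return published


def client_matches(published: Dict[str, float],
                   trail: Iterable[Tuple[float, float, int]]) -> List[float]:
    """Client side: hash own points and return criticalities of any matches."""
    matches = []
    for lat, lon, unix_s in trail:
        digest = hash_key(cell_key(lat, lon, unix_s))
        if digest in published:
            matches.append(published[digest])
    return matches


zone = resolve_zone(51.5001, 51.5003, -0.1203, -0.1201,
                    1_589_300_000, 1_589_307_200, criticality=1.0)
inside = [(51.5002, -0.1202, 1_589_300_600)]
elsewhere = [(51.6000, -0.1202, 1_589_300_600)]
```

Because the hash hides all topology, only an exact cell and time-bin match counts, which is why a point just across a cell boundary would be missed without extra neighbouring lookups.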
Summing Up
The key points I want to pull out from the above are:
We should move away from a "redacted trail" model, to a "zones of concern" model. Just as effective & much more privacy protective.
"zones of concern" do not need to be comprised of location/time pairs from the patient's original trail. They could be made up of newly synthesized location/time pairs to better match the matching needs of the HA for a particular environment.
Location/time pairs can & should have a "criticality" associated with them.
Negative criticality may be a useful concept (e.g. in the theater/cinema case, anyone who had their phone on in the middle of the show at this location, was probably not watching the show).
It would be nice to have a much richer model for describing "zones of concern" (e.g. vector-based rather than point-based, and factoring in speed/bearing) to help with cases like bus journeys. But I can't see any way to do that in a manner that would enable encryption, and I am doubtful we are going to be able to invent such techniques quickly.
Plan for MVP1
If we agree on all of the above as our correct overall direction, what’s the bare minimum we need to do for MVP1?
Ensure that “Redaction” guidelines are up-to-date to ensure that all data that is likely to be ineffective for exposure detection (e.g. walking on the street outside) is redacted.
Consider renaming “Redaction” to shift emphasis from privacy towards efficacy. This would include the privacy language used towards users.
All data points are stored as one-way hashes of geohashes with +/-76m accuracy - see Hashing Details below. Whether or not to include salt in MVP1 is TBC.
Update Safe Paths App to match based on hashed geohashes of:
The recorded GPS point
Points 20m to the N, NE, E, SE, S, SW, W & NW (if these generate different Geohashes)
Update Safe Paths App to log a minimum number of points of concern before generating a notification (suggested default: > 66% of points over a 30 min period)
These parameters to be specified by the HA in their HA JSON file (as a global setting), with guidance provided on what we believe are suitable settings.
Reduce default exposure time for a point of concern from (0 mins to 4 hours) to (-5 mins to +5 mins). This reflects the fact that we believe that trying to capture fomite transmission will yield too many false positives, so we are only focussing on person-to-person transmission.
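The neighbouring-point and threshold items above could be sketched roughly as follows. This assumes a flat-earth offset (adequate at 20m, not valid at the poles) and interprets "> 66% of points over a 30 min period" as matched points divided by the number of points expected in the window at 5-minute logging, so missing pings count against the match. That interpretation and the function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

EARTH_RADIUS_M = 6_371_000.0


def offset_points(lat: float, lon: float,
                  distance_m: float = 20.0) -> List[Tuple[float, float]]:
    """Return the point itself, then points distance_m to the
    N, NE, E, SE, S, SW, W and NW (in that order)."""
    points = [(lat, lon)]
    for bearing_deg in range(0, 360, 45):
        bearing = math.radians(bearing_deg)
        d_north = distance_m * math.cos(bearing)
        d_east = distance_m * math.sin(bearing)
        d_lat = math.degrees(d_north / EARTH_RADIUS_M)
        d_lon = math.degrees(
            d_east / (EARTH_RADIUS_M * math.cos(math.radians(lat))))
        points.append((lat + d_lat, lon + d_lon))
    return points


def should_notify(matched_times_min: Sequence[int],
                  window_min: int = 30,
                  log_interval_min: int = 5,
                  min_fraction: float = 0.66) -> bool:
    """matched_times_min: times (in minutes) at which one of the user's
    logged points matched a point of concern.

    Notify if, in any window of window_min minutes, the matched points
    exceed min_fraction of the points expected at the logging interval.
    """
    expected = window_min // log_interval_min
    times = sorted(matched_times_min)
    for i, start in enumerate(times):
        in_window = sum(1 for t in times[i:] if t < start + window_min)
        if in_window / expected > min_fraction:
            return True
    return False
```

With the defaults, six points are expected per window, so four matches within 30 minutes trigger a notification and three do not. Each of the nine points from `offset_points` would be geohashed and hashed, with duplicates dropped, before matching.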
Hashing Details
Published data points should be geohashes (less-precise than specific GPS points), and stored as a SHA-256 hash of (geohash, time-bin) (where time-bin is a 5 minute rounded-down time interval in UTC).
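As a sketch of what this could look like, the block below writes out the standard geohash encoding in full (so it has no dependencies) and hashes the result with a time bin. The `geohash,time_bin` string layout, the use of Unix seconds for the time bin, and the names `time_bin` and `point_hash` are assumptions; the real serialisation would need to be agreed between Safe Places and the Safe Paths App.

```python
import hashlib
from datetime import datetime, timezone

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 7) -> str:
    """Standard geohash: interleave longitude/latitude bisection bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # the first bit refines longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits per base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def time_bin(when: datetime) -> int:
    """Unix seconds rounded down to a 5-minute boundary, in UTC."""
    if when.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    seconds = int(when.astimezone(timezone.utc).timestamp())
    return seconds - seconds % 300


def point_hash(lat: float, lon: float, when: datetime,
               precision: int = 7) -> str:
    """SHA-256 over an assumed 'geohash,time_bin' string."""
    payload = f"{geohash_encode(lat, lon, precision)},{time_bin(when)}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two points in the same geohash cell and the same 5-minute bin produce identical hashes; anything else produces an unrelated digest, so near misses have to be handled by hashing neighbouring points as well.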
Geohash accuracy (this is at the equator, slightly more accurate further from the equator)
Number of digits | Accuracy (m) |
---|---|
6 | +/-610 |
7 | +/-76 |
8 | +/-19 |
9 | +/-2.4 |
https://gis.stackexchange.com/questions/115280/what-is-the-precision-of-a-geohash
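The figures in the table can be reproduced from the way a geohash divides its bits. Each character carries 5 bits, split alternately between longitude (which gets the first bit) and latitude, and the error is half the cell size. The sketch below assumes a single metres-per-degree constant at the equator for both axes, which is an approximation, and reports the larger of the two errors.

```python
EQUATOR_M_PER_DEG = 40_075_017 / 360  # equatorial circumference / 360


def geohash_error_m(digits: int) -> float:
    """Worst-case +/- error in metres at the equator for a geohash length."""
    total_bits = 5 * digits
    lon_bits = (total_bits + 1) // 2  # longitude gets the extra bit
    lat_bits = total_bits // 2
    lon_err_deg = 180 / 2 ** lon_bits  # half of a 360 / 2**lon_bits cell
    lat_err_deg = 90 / 2 ** lat_bits   # half of a 180 / 2**lat_bits cell
    return max(lon_err_deg, lat_err_deg) * EQUATOR_M_PER_DEG


for digits in (6, 7, 8, 9):
    print(digits, round(geohash_error_m(digits), 1))
```

For odd lengths the two errors are equal in degrees; for even lengths the longitude error is twice the latitude error, which is the figure the table reports.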
For additional security a salt can be added to the hash. Ideally this is:
Specific to a single HA
Changes daily & is not pre-announced
Can be published by the HA alongside the points of concern
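One way the salt could travel with the published points is sketched below. The JSON field names (`authority`, `date`, `salt`, `concern_point_hashes`), the `salt|geohash|time_bin` layout, and the function names are all hypothetical; this is not the existing HA JSON format.

```python
import hashlib
import json
import secrets
from typing import Iterable, Tuple


def new_daily_salt() -> str:
    """Random, unpredictable salt; the HA would generate a fresh one each day."""
    return secrets.token_hex(16)


def salted_hash(salt: str, geohash: str, time_bin: int) -> str:
    payload = f"{salt}|{geohash}|{time_bin}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_publication(authority_id: str, day: str,
                      points: Iterable[Tuple[str, int]]) -> dict:
    """Bundle the day's salt with the salted hashes of the points of concern."""
    salt = new_daily_salt()
    return {
        "authority": authority_id,
        "date": day,
        "salt": salt,
        "concern_point_hashes": sorted(
            salted_hash(salt, geohash, time_bin) for geohash, time_bin in points
        ),
    }


publication = build_publication("ha-example", "2020-05-12",
                                [("u4pruyd", 1_589_311_800)])
print(json.dumps(publication, indent=2))
```

The App would read `salt` from the file and apply `salted_hash` to its own points before comparing. Since the salt is published with the data, its benefit is limited to preventing hashes being precomputed ahead of time or reused across HAs and days.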
Future Phases - all beyond MVP1
If we deliver MVP1 as above, what would future phases look like? (We can also consider whether any of these is so important it should be in MVP1.)
Variable geohash blurring depending on geography (urban vs. rural) and number of points of concern.
Add a “criticality” value to a point of concern, to allow the contribution a given point of concern makes towards hitting the threshold for notification to be different from the default value.
Add a “time-window” value to a point of concern: to allow the time-window that counts for an overlap to be different from the default value.
Add basic tools to Safe Places to allow “criticality” and “time window” to be set on individual data points.
Add targeted tools to Safe Places to replace user-provided data points with synthetic data points that are optimal for generating user matches, for example:
(e.g.) A “Bus” tool, which traces a bus route with a much finer set of data points, each with a very low “time window”
(e.g.) A “Cinema/Theater” tool, which sets high-criticality points of concern at the start and end times of a given show, and sets negative-criticality points of concern during the middle of the show.
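As an illustration of how such a “Cinema/Theater” tool might generate synthetic points, the sketch below emits high-criticality points for the 10 minutes before the show starts and the 10 minutes after it ends, and negative-criticality points during the show. The exact placement of those windows, the criticality values (3.0 and -1.0), and all names are assumptions chosen only to show the mechanism.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class ConcernPoint:
    geohash: str
    time_bin: datetime  # start of a 5-minute bin
    criticality: float


def cinema_points(geohash: str, show_start: datetime, show_end: datetime,
                  edge: timedelta = timedelta(minutes=10),
                  bin_size: timedelta = timedelta(minutes=5),
                  high: float = 3.0, negative: float = -1.0) -> List[ConcernPoint]:
    """Synthesise points: high criticality around the show, negative during it."""
    points = []
    t = show_start - edge
    while t < show_end + edge:
        in_show = show_start <= t < show_end
        points.append(ConcernPoint(geohash, t, negative if in_show else high))
        t += bin_size
    return points


def exposure_score(points: Iterable[ConcernPoint],
                   user_bins: Set[Tuple[str, datetime]]) -> float:
    """Sum the criticality of every point of concern the user matched."""
    return sum(p.criticality for p in points
               if (p.geohash, p.time_bin) in user_bins)


start = datetime(2020, 5, 12, 19, 0)
end = datetime(2020, 5, 12, 21, 0)
points = cinema_points("u4pruyd", start, end)

# Audience member: seen arriving and leaving, phone off during the show.
attendee = {("u4pruyd", datetime(2020, 5, 12, 18, 55)),
            ("u4pruyd", datetime(2020, 5, 12, 21, 5))}
# Someone nearby whose phone logged in every bin, so probably not in the show.
nearby = {(p.geohash, p.time_bin) for p in points}
```

Here the attendee scores 6.0 and the person who was logging throughout scores -12.0, so a positive notification threshold would separate the two.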