Why we need to encrypt HA JSON data

Our current plan for MVP1 is to ship with the current implementation of redacted location data shared in plain text from Public URLs.

However, this decision creates a number of problems.

While the data is redacted, we believe that publishing individual trails represents an unacceptable breach of privacy. Our intention is to protect users' privacy by pooling each user's data with that from other patients, and using this “crowd” to generate privacy. To this end, we are planning to determine some minimum number of user records that can be published at a time.

Because the data can trivially be scraped and stored by 3rd parties, who could therefore reconstruct the entire version history of the JSON data and isolate the records added in each update, this minimum also has to apply to the number of cases added in a single increment.

We don’t know what this number is yet, but let’s suppose it’s 10. That means you can only publish case data 10 cases at a time.

Until you have collected 10 cases, you just have to wait. In low-volume deployments (which all our initial targets are), this is a major problem.

In Haiti, for example, the number of new cases per day is often between 1 and 10 - in fact it’s often fewer than 5 (unfortunately the rates have been going up over the last week).

It’s unrealistic to expect 100% of these to agree to publish their data - that figure could easily be 50% or lower.

This means that it will be common for case data from individuals to have a publication delay of 2 or 3 days, perhaps even longer.
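The arithmetic behind this delay can be sketched as follows. This is a rough illustration only: the batch size of 10 is the supposition made above, the 5 cases per day and 50% consent rate are the illustrative figures from the Haiti example, and the function names are hypothetical. It assumes cases arrive at a steady rate, which real deployments will not.

```python
import math


def days_to_fill_batch(batch_size: int, daily_cases: float, consent_rate: float) -> int:
    """Whole days needed to collect `batch_size` consenting cases."""
    consenting_per_day = daily_cases * consent_rate
    if consenting_per_day <= 0:
        raise ValueError("no consenting cases arrive, so the batch never fills")
    return math.ceil(batch_size / consenting_per_day)


def mean_wait_days(batch_size: int, daily_cases: float, consent_rate: float) -> float:
    """Average time a case waits for its batch to fill, assuming steady arrivals.

    The k-th case arrives at k / rate and the batch fills at batch_size / rate,
    so the mean wait over k = 1..batch_size is (batch_size - 1) / (2 * rate).
    """
    consenting_per_day = daily_cases * consent_rate
    if consenting_per_day <= 0:
        raise ValueError("no consenting cases arrive, so the batch never fills")
    return (batch_size - 1) / (2 * consenting_per_day)


# Illustrative inputs: batch of 10, 5 new cases a day, 50% consent.
print(days_to_fill_batch(10, 5, 0.5))   # days until the batch can be published
print(mean_wait_days(10, 5, 0.5))       # average delay per case, in days
```

With these inputs the batch takes 4 days to fill and the average case waits a little under 2 days, with the earliest cases in each batch waiting the longest.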

The jurisdictions we are targeting have total cases as follows. At least 4 of the 7 targets would have this issue.

Mexico: 33,460, Puerto Rico: 2,173, KCMO: 664 (on 5 May), Lake County FL: 223, Haiti: 151, Guam: 151, Teton County: 97.

This chart from Oxford (https://science.sciencemag.org/content/368/6491/eabb6936) shows why this is such a big problem.

This shows transmission from day of infection.

Adding 2-3 days' delay in alerting (and hence testing and/or quarantining) patients has a huge impact on their onward transmission of the virus.

“Index patients” (not notified by the app), who decide to get tested based on symptoms, are typically already 4d post-infection. Allow 24h for testing, and we’re at day 5, already at peak transmission.

If we add a 2-3d delay before notifying their contacts, we’ll be at day 7-8, by which time the majority of transmission has already occurred.

For patients notified by the app, we have a bit more of a head start. However, I don’t believe we can go on and notify their contacts until they either have symptoms or a positive test.

Having been notified at day 7 of the index patient, these contacts are (on average) already 2d past infection (some may be much further advanced). Once they’ve been tested, they are at day 3. They will already have passed on the infection to a small but significant 3rd Tier. If we add another 2-3d delay before notifying (and therefore quarantining) the 3rd Tier, we are massively undermining the efficiency of the solution.

With every Tier we add this 2-3d delay, and miss a huge opportunity to contain the virus on each occasion.

I haven’t done detailed modelling of the impact of a 2-3d delay in notifications, but this modelling from Oxford shows the impact of 0 to 3 days delay (0d on the right, 3d on the left) on the % of transmissions that have to be identified, and the % of success with quarantine, to get R below 1.0.

To be successful with a 3d delay in the process, you basically have to identify 100% of infections and have 100% success with quarantine - both utterly implausible.

Note that this is delay vs. symptom onset, so simply waiting for a test after symptom onset already uses a day, and puts you in the box 2nd from the right (this is why Oxford advised NHSX to do notifications on symptoms, not diagnosis, which is what led them away from GAEN).

So a 2 day delay in JSON publication then puts us in the box on the far left.

We’ll do slightly better for non-index patients, who get tested based on an App notification, rather than symptoms - but the impact of a 2-3d delay is still huge.

In short - for the solution to be effective, we need to be able to publish individual users’ data promptly (within 12 hours; sooner would actually be significantly better). We can’t do that if we are dependent on “crowd” effects for privacy, because the crowds simply will not be there in many of our small-scale deployments.

Note also, my assumption that a crowd of 10 will be enough for privacy purposes may be a major underestimate. In some jurisdictions, it is not even clear that there is any value N for which the pooled data on N users can be considered to result in adequate privacy. Adam Leon Smith (Unlicensed) tells me that this would be the case in the EU (though that may change if we got a more explicit consent from users).

And beyond what the law says, I would not be surprised if there were significant public outcry when it is discovered that we publish this personal data in plain text. The only defense I can see against that is total transparency with patients about how their data will be published, and how exposed this could make them - and I suspect that such transparency will massively hinder uptake.

What options are available?

There are a couple of options available that can help here.

The first is to publish location data points with a one-way hash function applied to them.
The second is a more sophisticated scheme, with 2 independent servers, used to provide stronger cryptographic protection.

Another option might be to have the HA Servers authenticate Safe Paths Apps based on a secret built into release builds of the app, from outside our Open Source repo - I am not sure why this approach does not seem to be under consideration.

ALS: I think that this should be done regardless, it provides an additional control, albeit weak.
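As a rough illustration of what this app-authentication option could look like, the sketch below has the app sign each request with HMAC-SHA256 using the embedded secret, and the HA Server verify the signature. Everything here is an assumption made for illustration: the secret value, the function names, the request path and the 5-minute freshness window are hypothetical, and nothing in it reflects an existing Safe Paths or HA Server API.

```python
import hashlib
import hmac

# Hypothetical secret. In the scheme described above it would be injected
# into release builds from outside the open-source repo.
APP_SECRET = b"example-secret-not-for-production"


def sign_request(secret: bytes, path: str, timestamp: int) -> str:
    """App side: sign the requested path and a timestamp with the shared secret."""
    message = f"{path}\n{timestamp}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret: bytes, path: str, timestamp: int, signature: str,
                   now: int, max_age_seconds: int = 300) -> bool:
    """Server side: reject stale requests, then compare signatures in constant time."""
    if abs(now - timestamp) > max_age_seconds:
        return False
    expected = sign_request(secret, path, timestamp)
    return hmac.compare_digest(expected, signature)


# Example round trip with a hypothetical path.
signature = sign_request(APP_SECRET, "/safe_paths.json", 1_588_700_000)
print(verify_request(APP_SECRET, "/safe_paths.json", 1_588_700_000, signature,
                     now=1_588_700_060))
```

A secret shipped inside an app binary can be extracted by a determined attacker, which is consistent with the comment above that this is an additional but weak control.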

The first solution is described in this paper:

https://arxiv.org/pdf/2003.14412v2.pdf

And this WIRED interview with Ramesh.
https://www.wired.com/story/covid-19-contact-tracing-apps-cryptography/

It is known to have weaknesses (vulnerability to brute-force attacks), but it provides considerably more protection than plain text, and is relatively inexpensive to implement (Abhishek Singh (Unlicensed) tells me it is mostly implemented already).
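To make both the idea and its weakness concrete, here is a minimal sketch of hashing location points and matching them on a user's device. It is not the scheme from the paper: the grid size, the time bucket, the use of unsalted SHA-256 and all function names are assumptions chosen for illustration.

```python
import hashlib
from typing import Iterable, Optional, Set, Tuple

GRID_DEGREES = 0.0001       # assumed cell size (roughly 10 m of latitude)
TIME_BUCKET_SECONDS = 300   # assumed 5-minute time buckets


def hash_cell(lat_cell: int, lon_cell: int, time_bucket: int) -> str:
    """One-way hash of a quantized (place, time) cell."""
    key = f"{lat_cell}:{lon_cell}:{time_bucket}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()


def hash_point(lat: float, lon: float, unix_time: int) -> str:
    """Quantize a GPS point so nearby points share a cell, then hash the cell."""
    return hash_cell(round(lat / GRID_DEGREES),
                     round(lon / GRID_DEGREES),
                     unix_time // TIME_BUCKET_SECONDS)


def count_matches(published: Set[str],
                  own_trail: Iterable[Tuple[float, float, int]]) -> int:
    """On the user's device: hash the user's own points and compare with the published set."""
    return sum(1 for lat, lon, t in own_trail if hash_point(lat, lon, t) in published)


def brute_force(target: str, lat_cells: Iterable[int], lon_cells: Iterable[int],
                time_buckets: Iterable[int]) -> Optional[Tuple[int, int, int]]:
    """The known weakness: an attacker can hash every candidate cell in an area."""
    lon_cells, time_buckets = list(lon_cells), list(time_buckets)
    for a in lat_cells:
        for b in lon_cells:
            for t in time_buckets:
                if hash_cell(a, b, t) == target:
                    return (a, b, t)
    return None


# A patient's point is published as a hash; a nearby user finds a match.
published = {hash_point(18.5944, -72.3074, 1_588_699_900)}
print(count_matches(published, [(18.59441, -72.30739, 1_588_700_000)]))
```

Because the space of plausible cells in a small region is limited, `brute_force` over a town-sized range of cells and a few days of time buckets is cheap. That is the vulnerability noted above, and it is why the hash raises the effort required rather than making the data unreadable.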

The second solution is described in this paper (as solution #4), and also referred to in the WIRED interview above.
https://github.com/PrivateKit/PrivacyDocuments/blob/master/GpsEncryption.pdf

Our current thinking is that this is the ideal solution to the problem, but we are concerned that it is too complex & expensive to implement for MVP1 (1 June).

Based on my discussion above, I don’t believe that a plain text solution is acceptable.

I understand there is reluctance to deploy the hashing solution. There is a concern that we will be subject to ridicule for deploying such a solution.

However:

It is not clear to me that we will be subject to any less ridicule for deploying a plain text file with zero protection.
This is a solution that has already been presented publicly as an intermediate option in both the MIT paper above, and Ramesh’s interview with WIRED. I am not aware of us having been ridiculed for that yet.
There are many circumstances in which imperfect security measures deliver significant protection in spite of their imperfections - the locks that we mostly have on our front doors being a good example.
The main group I would see this protecting against would be the tech-literate (but not highly skilled with security) “concerned public” who might well be attracted to digging through a plain text file, but would mostly be put off by hashed data that would require substantial effort to reverse.

The key question for me regarding the Hashing solution is whether or not it would deliver enough protection that we could consider dropping the crowd-size N that represents the minimum number of cases that we can publish at one go.

If it does allow us to reduce this number to low figures, 1, 2 or 3, say - then I think it makes MVP1 viable (as per above, I don’t think the current MVP1 plan is viable).
If it does not, then there is little point in spending time on a Hashing solution, and we should be working on the “full” solution as a priority, as a necessary part of MVP1.

References:

https://www.worldometers.info/coronavirus/
https://www.worldometers.info/coronavirus/country/us/
https://www.kcmo.gov/Home/Components/News/News/332/16