Why we need to encrypt HA JSON data

Our current plan for MVP1 is to ship with the current implementation of redacted location data shared in plain text from Public URLs.

However, this decision creates a number of problems.

While the data is redacted, we believe that publishing individual trails represents an unacceptable breach of privacy. Our intention is to protect users' privacy by pooling each user's data with that of other patients, and using this “crowd” to provide privacy. To this end, we are planning to determine a minimum number of user records that can be published at a time.

Because the data can trivially be scraped and stored by 3rd parties, who could therefore reconstruct the entire version history of the JSON data and compare successive versions, this minimum is also the smallest number of cases that can be added in a single increment.

We don’t know what this number is yet, but let’s suppose it’s 10. That means you can only publish case data 10 cases at a time.

Until you have collected 10 cases, you just have to wait. In low-volume deployments (which all our initial targets are), this is a major problem.

In Haiti, for example, the number of new cases is often between 1 and 10 - in fact it’s often fewer than 5 (unfortunately the rates have been going up in the last week).

It’s unrealistic to expect 100% of these patients to agree to publish their data - that figure could easily be 50% or lower.

This means that it will be common for case data from individuals to have a publication delay of 2 or 3 days, perhaps even longer.
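As a rough illustration of where a 2-3 day delay comes from, here is a minimal sketch of the arithmetic. It assumes the case numbers above are daily figures, that consenting cases arrive at a steady rate, and that a batch is published as soon as it is full. The function names and the "half the fill time" approximation for the average wait are illustrative assumptions, not part of any existing implementation.

```python
from math import ceil


def days_to_fill_batch(batch_size: int, daily_cases: float, consent_rate: float) -> int:
    """Whole days needed to collect `batch_size` consenting cases at a steady rate."""
    consenting_per_day = daily_cases * consent_rate
    if consenting_per_day <= 0:
        raise ValueError("no consenting cases arrive, so the batch never fills")
    return ceil(batch_size / consenting_per_day)


def mean_wait_days(batch_size: int, daily_cases: float, consent_rate: float) -> float:
    """Approximate average wait per case (assumption: cases arrive evenly
    while the batch fills, so the average case waits about half the fill time)."""
    return days_to_fill_batch(batch_size, daily_cases, consent_rate) / 2


# Hypothetical inputs taken from the text: batch of 10, ~5 new cases, 50% consent.
print(days_to_fill_batch(10, 5, 0.5))  # days until the first batch can be published
print(mean_wait_days(10, 5, 0.5))      # approximate average delay per case
```

With fewer cases or a lower consent rate the batch takes longer to fill, and the earliest cases in each batch wait for the whole fill period, not just the average.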

This chart from Oxford (https://science.sciencemag.org/content/368/6491/eabb6936) shows why this is such a big problem.

This shows transmission from day of infection.

Adding 2-3 days' delay in alerting (and hence testing and/or quarantining) patients has a huge impact on their onward transmission of the virus.

“Index patients” (not notified by the app), who decide to get tested based on symptoms, are typically already 4d post-infection. Allow 24h for testing, and we’re at day 5, already at peak transmission.

If we add a 2-3d delay before notifying their contacts, we'll be at day 7-8, by which time the majority of transmission has already occurred.
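The index-patient timeline above is simple addition, sketched here so the effect of different publication delays can be compared side by side. The function name and parameters are hypothetical, and the default values are the rough figures from the text (symptoms at about day 4, about 24h for testing), not measured data.

```python
def contact_notification_day(
    symptom_day: float = 4.0,           # assumed: index patient seeks a test around day 4
    test_turnaround_days: float = 1.0,  # assumed: about 24h to get a result
    publication_delay_days: float = 0.0,
) -> float:
    """Day of the index patient's infection on which contacts can be notified."""
    return symptom_day + test_turnaround_days + publication_delay_days


for delay in (0, 2, 3):
    day = contact_notification_day(publication_delay_days=delay)
    print(f"publication delay {delay}d: contacts notified on day {day:g}")
```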

For patients notified by the app, we have a bit more of a head start. However, I don’t believe we can go on and notify their contacts until they either have symptoms or a positive test.

Having been notified at day 7 of the index patient's infection, these contacts are (on average) already 2d past infection (some may be much further advanced). Once they have been tested, they are at day 3. They will already have passed on the infection to a small but significant 3rd Tier. If we add another 2-3d delay before notifying (and therefore quarantining) the 3rd Tier, we are massively undermining the effectiveness of the solution.

With every Tier, we add this 2-3d delay and miss a huge opportunity to contain the virus.

I haven’t done detailed modelling of the impact of a 2-3d delay in notifications, but this modelling from Oxford shows the impact of 0 to 3 days' delay (0d on the right, 3d on the left) on the % of transmissions that have to be identified, and the % of success with quarantine, to get R below 1.0.

To be successful with a 3d delay in the process, you basically have to identify 100% of infections and have 100% success with quarantine - both utterly implausible.

Note that this is delay relative to symptom onset, so simply waiting for a test after symptom onset already uses a day, and puts you in the box 2nd from the right (this is why Oxford advised NHSX to do notifications on symptoms, not diagnosis, which is what led them away from GAEN).

So a 2 day delay in JSON publication then puts us in the box on the far left.

We’ll do slightly better for non-index patients, who get tested based on an App notification, rather than symptoms - but the impact of a 2-3d delay is still huge.

In short - for the solution to be effective, we need to be able to publish individual users' data promptly (within 12 hours; sooner would actually be significantly better). We can’t do that if we are dependent on “crowd” effects for privacy, because the crowds simply will not be there in many of our small-scale deployments.

Note also that my assumption that a crowd of 10 will be enough for privacy purposes may be a major underestimate. In some jurisdictions, it is not even clear that there is any value N for which the pooled data on N users can be considered to result in adequate privacy. Adam Leon Smith tells me that this would be the case in the EU (though that may change if we got more explicit consent from users).

And beyond what the law says, I would not be surprised if there were significant public outcry when it is discovered that we publish this personal data in plain text. The only defense I can see against that is total transparency with patients about how their data will be published, and how exposed this could make them - and I suspect that such transparency will massively hinder uptake.