Why we need to encrypt HA JSON data

Our current plan for MVP1 is to ship with the current implementation: redacted location data shared in plain text from public URLs.

However, this decision creates a number of problems.

First, to protect users' privacy, we need to determine some minimum number of user records that can be published at a time.

Because the data can trivially be scraped and stored by 3rd parties, who can therefore reconstruct the entire version history, this minimum also applies to every update: it is the smallest number of cases that can be added in a single increment.

We don’t know what this number is yet, but let’s suppose it’s 10. That means you can only publish case data 10 cases at a time.

In low-volume deployments (which all our initial targets are), this is a major problem.

In Haiti, for example, the number of new cases is often between 1 and 10 - in fact it’s often fewer than 5.

(Unfortunately, the rates have been going up over the last week.)

This means that it will be common for case data from individuals to have a publication delay of 2 or 3 days.

This chart from Oxford (https://science.sciencemag.org/content/368/6491/eabb6936) shows why this is such a big problem.

This shows transmission from day of infection.

Adding 2-3 days' delay in alerting (and hence testing and/or quarantining) patients has a huge impact on their onward transmission of the virus.

For “index patients” (not notified by the app), who decide to get tested based on symptoms, they are already typically 4d post-infection. Allow 24h for testing, and we’re at day 5, already at peak transmission.

If we add a 2-3d delay before notifying their contacts, we’ll be at day 7-8, by which time the majority of transmission has already occurred.

For patients notified by the app, I don’t believe we can go on and notify their contacts until they either have symptoms or a positive test.

Having been notified at day 7 of the index patient, these contacts are (on average) already 2d past infection (some may be much further advanced). Once they’ve been tested, they are at day 3. They will already have passed on the infection to a small but significant 3rd Tier. If we add another 2-3d delay before notifying (and therefore quarantining) the 3rd Tier, we are massively undermining the efficiency of the solution.
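
The day counts in the paragraphs above can be reproduced with a small Python sketch. The starting points (day 4 for index patients, day 2 on average for app-notified contacts, 1 day for testing) are the rough figures quoted in this page, not modelled values. The Timeline class and its field names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeline:
    """Rough timeline for one patient, counted in days since their own infection."""

    days_infected_when_test_sought: int  # assumption taken from the text above
    testing_days: int  # assumption: 24h turnaround for a test

    def day_contacts_notified(self, publication_delay_days: int) -> int:
        """Day of this patient's infection on which their contacts are notified."""
        return (
            self.days_infected_when_test_sought
            + self.testing_days
            + publication_delay_days
        )


INDEX_PATIENT = Timeline(days_infected_when_test_sought=4, testing_days=1)
APP_NOTIFIED_CONTACT = Timeline(days_infected_when_test_sought=2, testing_days=1)

for delay in (0, 2, 3):
    print(
        f"publication delay {delay}d:",
        f"index patient's contacts notified on day {INDEX_PATIENT.day_contacts_notified(delay)},",
        f"contact's own contacts on day {APP_NOTIFIED_CONTACT.day_contacts_notified(delay)}",
    )
```

Under these assumptions, a 2-3d publication delay moves notification from day 5 to day 7-8 for index patients, and from day 3 to day 5-6 for app-notified contacts.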

I haven’t done detailed modelling of the impact of a 2-3d delay in notifications, but this modelling from Oxford shows the impact of 0 to 3 days’ delay (0d on the right, 3d on the left) on the % of transmissions that have to be identified, and the % of success with quarantine, to get R below 1.0.

To be successful with a 3d delay in the process, you basically have to identify 100% of infections and have 100% success with quarantine - both utterly implausible.

Note that this is delay vs. symptom onset, so simply waiting for a test after symptom onset already uses a day, and puts you in the box 2nd from the right (this is why Oxford advised NHSX to do notifications on symptoms, not diagnosis, which is what led them away from GAEN).

So a 2 day delay in JSON publication then puts us in the box on the far left.

We’ll do slightly better for non-index patients, who get tested based on an App notification, rather than symptoms - but the impact of a 2-3d delay is still huge.

It is true that as the virus spreads in an area, the case volume will go up, and we’ll get to the point where we have enough cases/day to address our privacy concerns, and can therefore publish every 12 or 24 hours.

Personally I am not at all comfortable with a solution that only works if things get worse before they get better - and that, further, won’t be able to actually help eradicate the disease, because it is ineffective with small numbers of cases.

In short - for the solution to be effective, we need to be able to publish individual users’ data promptly (within 12 hours; sooner would actually be significantly better). We can’t do that if we are dependent on “crowd” effects for privacy, because the crowds simply will not be there in many of our small-scale deployments.

Further, in some jurisdictions, it is not even clear that there is any value N for which the pooled data on N users can be considered to result in adequate privacy. Adam Leon Smith tells me that this would be the case in the EU (though that may change if we got a more explicit consent from users).

And beyond what the law says, I would not be surprised if there were significant public outcry when it is discovered that we publish this personal data in plain text. The only defense I can see against that is total transparency with patients about how their data will be published, and how exposed this could make them - and I suspect that such transparency will massively hinder uptake.