Next-gen Privacy Design for published data

Proposed Algorithm

Solution #4 from this paper has been proposed as our best eventual solution for privacy-preserving matching of points of concern.
https://github.com/PrivateKit/PrivacyDocuments/blob/master/GpsEncryption.pdf

For me, this description is a bit dense, and not easy to understand. Let’s try to unpack it a bit…

Unpacking this Description

This solution involves two independent servers, and they must have no contact with each other. It remains an open question how we ensure & demonstrate that this is the case (see later).

Server 1 has responsibilities for:

  • FHE computation

  • Storage of encrypted locations of infected users.

Server 2 is responsible for:

  • Generating keys

  • Performing partial decryption of data, and returning it to the healthy patient (P).

So what’s FHE computation?

FHE = Fully Homomorphic Encryption.
You can read a nice intro here:
https://blog.cryptographyengineering.com/2012/01/02/very-casual-introduction-to-fully/
And more details here: https://en.wikipedia.org/wiki/Homomorphic_encryption

Basically:

  • “Homomorphic Encryption” means you can perform certain operations on the encrypted objects, and get the same outputs (still encrypted, of course) as if you had performed the operations on the unencrypted objects.

  • “Fully Homomorphic Encryption” allows arbitrary operations on the encrypted objects. Of particular relevance for our application, it supports both addition & multiplication.

So using FHE allows us to perform calculations on the encrypted data, and get results which, when decrypted, match the equivalent calculations on the unencrypted data.

A really important characteristic of homomorphic encryption is that the encryption process adds entropy, so that a given value x will be encrypted to multiple different values (most likely a different value every time) - this will come up later.
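As a concrete (toy) illustration, here is a minimal sketch using the Paillier cryptosystem. Paillier is only additively homomorphic - not fully homomorphic, so it is not the scheme the paper proposes - but it demonstrates both properties: operating on ciphertexts yields an encryption of the result, and encrypting the same value twice yields different ciphertexts.

```python
import math
import random

# Toy Paillier setup - additively homomorphic only. Tiny primes are
# used for readability; real deployments use 2048-bit (or larger) moduli.
p, q = 101, 103
n = p * q                      # public modulus (part of the public key)
n2 = n * n
lam = math.lcm(p - 1, q - 1)   # private key component
g = n + 1                      # standard choice of generator
mu = pow(lam, -1, n)           # precomputed decryption constant

def encrypt(m: int) -> int:
    # Fresh randomness r makes every encryption of m look different.
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n * mu) % n

a, b = 12, 34
ca, cb = encrypt(a), encrypt(b)

# Homomorphic property: multiplying ciphertexts adds the plaintexts.
assert decrypt(ca * cb % n2) == (a + b) % n

# Randomized encryption: same plaintext, different ciphertexts
# (they collide only with negligible probability).
assert encrypt(a) != encrypt(a)
```

An FHE scheme additionally supports multiplying encrypted values together, which is what the matching computation below relies on.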

 

So how do we use this to solve our problem? Let’s go through the details:

 

Server 2 generates a public/private keypair (pk_Q, sk_Q). (If you aren’t familiar with public/private keys, see: https://en.wikipedia.org/wiki/Public-key_cryptography.)

  • pk_Q is a public key - this can be shared anywhere, and anyone can use it to encrypt data.

  • sk_Q will be kept secret by Server 2. This private key is needed to decrypt any data that was encrypted using pk_Q.

We have a set of points of concern for our infected patient Q. We call these y_i - i.e. y_1, y_2, etc.

These are encrypted using pk_Q, and sent to Server 1. The encrypted versions are known as c_i - i.e. c_1, c_2, etc. Since Server 1 does not know sk_Q, it cannot interpret this data.

  • The paper talks about Q (i.e. the infected user’s phone) sending these directly to Server 1. In practice, in our solution, they would be sent first to the Health Authority, who would redact the data in Safe Places before encrypting it and sending it to Server 1.

 

So now we have the points of concern, encrypted, available to be served by Server 1.

 

When another user, P, wants to check for intersections with Q, they get the public key from Server 2, and prepare two pieces of data:

  • The location they want to query: x

  • An additional random piece of information: r

These are encrypted as d & R respectively, and sent to Server 1.

Server 1 now computes the difference between d (encrypted x) and each of the encrypted points of concern, c_i (encrypted y_i).

The reason Server 1 can’t just compare the encrypted values directly is that a homomorphic encryption will add entropy, so the exact same value will generate different results each time that it is encrypted.

Server 1 then multiplies together all of these encrypted differences. It also multiplies the result by another encrypted random number s, and adds the encrypted random number R (which came from P).
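In symbols, the value Server 1 computes (with s also encrypted under pk_Q) is:

CT = s · (d − c_1) · (d − c_2) · … · (d − c_n) + R

which, by the homomorphic property, decrypts to s · (x − y_1) · (x − y_2) · … · (x − y_n) + r.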

Why do all that?

Well, by multiplying all the numbers together, if any of the differences is zero (i.e. we have an intersection), the result will be zero.

The random numbers add security:

  • By adding a random number, r (encrypted as R), that only P knows, only P will be able to recognize the result as “zero”. Note that Server 2 never sees R by itself, so although it could decrypt R to determine r, it never has an opportunity to do so.

  • The random number s (known only to Server 1) prevents Server 2 from being able to obtain any useful information, even if it had access to the infected person’s information (by design it does not have access, but under a security breach, it might).

So now, the resulting output (which the paper calls CT) is sent to Server 2. Server 2 decrypts this (recall, Server 2 is the only system that can decrypt anything), and sends the result back to P.

The result will be either r (if there was a match) or r + some random number (if there was no match).

  • Note: under ix) the paper says “R”, which is the encrypted r - that looks like an error to me. When decrypted, the value will be r.

Server 2 can’t tell the difference between these two values, because it doesn’t know what r or R were. And even if it somehow got access to the full set of “points of concern” y_i (through collusion with whoever holds that data), it wouldn’t know what s was either.

But when Server 2 passes this value back to P, P can tell whether it got back r (in which case there was a match) or something different (in which case there was no match).
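Putting the whole flow together, here is a minimal Python sketch. To keep it runnable without an FHE library, Ct is a stand-in “ciphertext” that simply carries its plaintext so we can check the algebra - it provides no secrecy at all, and the names and parameters are illustrative, not the real implementation.

```python
import random

class Ct:
    """Stand-in 'ciphertext': carries the plaintext so the protocol's
    arithmetic can be traced. It provides NO secrecy whatsoever."""
    def __init__(self, v): self.v = v
    def __add__(self, o): return Ct(self.v + o.v)
    def __sub__(self, o): return Ct(self.v - o.v)
    def __mul__(self, o): return Ct(self.v * o.v)

encrypt = Ct                 # stands in for encryption under pk_Q
decrypt = lambda c: c.v      # stands in for Server 2 decrypting with sk_Q

# Q's discretized points of concern, encrypted and stored on Server 1.
ys = [101, 202, 303]
cs = [encrypt(y) for y in ys]

def check_intersection(x):
    r = random.randrange(1, 2**32)           # P's secret mask r
    d, R = encrypt(x), encrypt(r)            # P sends d and R to Server 1
    s = encrypt(random.randrange(1, 2**32))  # Server 1's blinding factor
    prod = d - cs[0]                         # product of encrypted differences:
    for c in cs[1:]:                         # zero iff x equals some y_i
        prod = prod * (d - c)
    CT = prod * s + R                        # Server 1 sends CT to Server 2
    return decrypt(CT) == r                  # Server 2 decrypts; P checks for r

assert check_intersection(202)        # x matches a point of concern
assert not check_intersection(999)    # no match: result is r + nonzero
```

Over the integers this check is exact; with a real FHE scheme the arithmetic is modular, so parameters would need to be chosen so that a nonzero product can’t accidentally wrap around to zero.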

The paper contains some more analysis of possible attacks, and also of the computation cost of the various operations - I’m not going to cover those details here.

But in summary, the only technical weakness in this scheme is if Server 1 & Server 2 somehow collude with each other, in which case privacy can be breached.

What about Timestamps? And do Locations have to match exactly?

The above description focussed on locations, and the distances between them.

Initially the paper describes data points as combined (position + timestamp) values. Later it talks about subtracting one point from another, without clarifying exactly what that means.

Note that for matching (position + timestamp) points x & y, x - y must be exactly zero (not just close to zero), or the multiplication won’t work to identify a zero point.

Since in reality, locations & timestamps won’t match exactly, both must be discretized, so that a close match results in an exact match.

This is similar to what we have done with geohashes and 5-minute time buckets in the MVP1 implementation. The same solution could be re-used in this implementation.

Then we can define a “subtraction” function as e.g. |geohash_x - geohash_y| + |timestamp_x - timestamp_y|
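Here is a minimal sketch of that discretization (parameters are illustrative; the real implementation would reuse the MVP1 geohash / 5-minute-bucket code). Rather than summing two absolute differences, this variant packs the discretized fields into a single integer, so that plain subtraction x - y is exactly zero on a match, which is the form the multiplication trick needs:

```python
# Illustrative discretization. TIME_BUCKET_S and GRID are assumed
# parameters; MVP1's geohash encoding would replace the lat/lon grid.
TIME_BUCKET_S = 5 * 60   # 5-minute time buckets
GRID = 10_000            # ~11 m of latitude per grid step

def discretize(lat: float, lon: float, ts: int) -> int:
    """Map a (lat, lon, unix-timestamp) reading to a single integer so
    that nearby readings collide exactly."""
    t = ts // TIME_BUCKET_S
    la = round(lat * GRID) & 0x1FFFFF   # 21 bits each; the mask gives a
    lo = round(lon * GRID) & 0x1FFFFF   # two's-complement for negatives
    return (t << 42) | (la << 21) | lo

x = discretize(42.36010, -71.05890, 1600000000)
y = discretize(42.36012, -71.05888, 1600000120)  # ~2 m and 2 minutes away
assert x - y == 0   # close readings become an exact match
```

As with any bucketing scheme, two nearby readings that straddle a bucket boundary won’t match; querying neighbouring buckets is the usual mitigation.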

My expectation is that such a calculation should still be compatible with the Fully Homomorphic Encryption, and won’t blow up the computation costs - but we should confirm details with Abhishek.

Other Concerns

One problem with this solution is that it does not seem to be consistent with our position that

“An uninfected user’s location data never leaves their phone.”

In this scheme, the user’s data is encrypted, and sent to Server 1. Since the data can only be decrypted by Server 2, and Server 1 will never send it to Server 2, it should be secure - but it may be difficult for a user to trust us regarding what happens to their data once it leaves their phone.

It is unclear how to solve this.

One solution (subject to performance concerns) might be to implement the Server 1 function on the user’s phone.

This may be plausible, since the analysis has shown that collusion between Server 1 and a healthy person’s phone does not result in any information leakage. In fact this would look rather similar to the MVP1 design, with encrypted versions of the “points of concern” data being published to every healthy person’s phone, and a calculation on that phone to determine whether there was an intersection.

The key difference would be that the phone would not be able to determine the meaning of the output of that calculation without consulting Server 2.

One point of concern with such an approach would be the following:

With Server 1 function moved onto the phone, it would not be possible to rate-limit queries to Server 1. However it would still be possible to rate-limit queries to Server 2.

Since no information is gained without the involvement of Server 2, this should still be sufficient.

However there are difficulties in scoping any rate-limiting that applies.

  • Rate-limiting per IP address could be circumvented by an attacker.

  • Rate-limiting based on an API key creates challenges in terms of how the distribution of these API keys is controlled.

  • Global rate-limiting would leave us vulnerable to DOS attacks.

Without rate limiting, it becomes possible to brute-force attack the data, simply by iterating through every conceivable point of concern.

 

Even if we disaggregate Server 1 out of the App and back to a discrete server, it is not clear how this would help with rate limiting.

  • The same issues with rate-limiting requests from the App still apply (now at Server 1).

  • It’s not clear how we would rate-limit between Server 1 and Server 2 without making the system vulnerable to a DOS or DDOS attack.

Solutions here are not clear to me - for discussion with Abhishek.