Design for MVP1 HA JSON Changes
This doc is currently in the Testing Confluence space, for convenience. That’s not the right long-term home - we should move it.
Status & issues:
Draft written by Diarmid following discussions with Sam, Kyle, Abhishek, Adam.
Proposal to move to “accurate” (8 character) geohashes, rather than the previous agreement to use “blurred” (7 character) geohashes, needs review with Sam, Kyle, Abhishek, Adam. The key missing piece of the previous discussion was the way this impacted entropy & hence made brute force attacks 32 times easier. << I think everyone is on board with this, but we need to make sure GPS accuracy is going to be at the level needed.
Full detail to be reviewed by Safe Paths & Safe Places Dev nominees.
More design details now in place here: Hashing of GPS points - Design
About this Page
This is a design for a proposed set of changes to the JSON data format published by Health Authorities for Exposure Notifications.
It details:
The proposed new data format for MVP1
The exact details of the hashes used, together with the design rationale for these, and some design implications for Safe Paths & Safe Places
This page does not yet include a design for migration from the old format to the new format; for that we would need to begin with a discussion about requirements, particularly with regard to seamless migration.
New JSON Format for HA Exposure Data
The initial fields in the JSON format are unchanged. They are as follows:
{
"authority_name": "Test Authority",
"publish_date_utc": 1589310827,
"info_website": "https://raw.githack.com/tripleblindmarket/safe-places/develop/examples/portal.html",
New top-level fields - notification sensitivity
Two new top-level fields are added:
"notification_threshold_percent": 66,
"notification_threshold_timeframe": 30
Defaults are 66 and 30 (minutes). These two parameters work together to control the sensitivity of exposure notifications for this data. The Health Authority can tune these based on their experience and feedback from their community.
With the defaults, we trigger an exposure notification if more than 66% of the 6 consecutive location data points that should appear in any 30 minute window are matches (i.e. if we match 4 or more of them).
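As a minimal sketch of how the app might apply these two parameters (function and variable names here are illustrative, not part of the spec):

// Given a chronological list of per-data-point match results (one point every 5 minutes),
// fire a notification if, in any 30-minute window, more than 66% of the points match.
function shouldNotify(matches: boolean[], thresholdPercent = 66, timeframeMinutes = 30): boolean {
  const windowSize = timeframeMinutes / 5;                 // data points per window (6 with the defaults)
  for (let start = 0; start + windowSize <= matches.length; start++) {
    const hits = matches.slice(start, start + windowSize).filter(Boolean).length;
    if (hits * 100 > thresholdPercent * windowSize) {      // "more than 66%" => 4 or more of 6
      return true;
    }
  }
  return false;
}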
New format for concern points
The format of concern points is changed:
Previously it was like this:
"concern_points":[
{
"time":1589224427000, "longitude":1.00000919, "latitude":2.00000943
}
]
Now it is like this:
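For illustration (the entry key name used here is a placeholder, to be pinned down in the detailed design), each concern point is reduced to a single location hash:
"concern_points":[
{
"hash": "c9414c55812d796a"
}
]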
This location hash is a 16 character hex string representing a 64 bit hash. It is calculated as described in the following section (which also explains why 64 bits is enough).
Note that we don’t include the timestamp in plain text. We don’t need it because it is encoded in the hash, and including timestamps in the clear would help someone trying to crack the file by telling them which particular timestamps have data associated with them.
Hash Calculation
Data points are generated as follows:
The Longitude & Latitude of the data point are converted into an 8 character geohash, which gives a region approx 38m x 19m at the equator (slightly narrower away from the equator).
See e.g. https://www.movable-type.co.uk/scripts/geohash.html
A data string is then created by concatenating the following:
The 8 character geohash
The UTC timestamp in seconds of the datapoint, rounded down to the nearest 5 minutes. See: https://www.unixtimestamp.com/index.php
This data string is then hashed using scrypt with:
A “cost” of 2^18 (may be customizable by Health Authorities in future - see below).
A “maxmem” of (2^18 + 3) * 1024 = ~262MB (see: https://github.com/nodejs/node/issues/21524 )
A salt of “” (empty string) (see below for future customization by Health Authorities)
A block size of 8
A keylen (output) of 8 bytes (16 hex characters).
This generates a 64-bit key, which is represented in the JSON file as hexadecimal.
Example:
Latitude: 51.5019, Longitude: -0.1415
Time: 14 April 2020, 12:03:12pm (which rounds down to 12:00:00pm)
Gives us:
8 character geohash: gcpuuz8u
UTC timestamp: 1586865600000 (msecs, for consistency with other timestamps)
Data string: “gcpuuz8u1586865600000”, Salt: “salt”
Output hash: “c9414c55812d796a”
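As a concrete sketch of the calculation above, in Node.js (assuming the ngeohash npm package for geohash encoding; it follows the parameter list in the Hash Calculation section, i.e. timestamp in seconds and an empty salt, so the open questions about seconds vs msecs and the salt value apply here too):

import { scryptSync } from "crypto";
import * as geohash from "ngeohash";

const FIVE_MINUTES = 5 * 60; // seconds

// Hash one GPS data point: 8 character geohash + UTC timestamp rounded down to
// the nearest 5 minutes, fed through scrypt with the parameters listed above.
function hashPoint(latitude: number, longitude: number, timeUtcSeconds: number, salt = ""): string {
  const gh = geohash.encode(latitude, longitude, 8);               // e.g. "gcpuuz8u" for the example above
  const rounded = timeUtcSeconds - (timeUtcSeconds % FIVE_MINUTES);
  const data = `${gh}${rounded}`;
  return scryptSync(data, salt, 8, {                               // keylen 8 bytes = 16 hex characters
    cost: 2 ** 18,
    blockSize: 8,
    maxmem: (2 ** 18 + 3) * 1024,                                  // see nodejs/node#21524
  }).toString("hex");
}

// e.g. hashPoint(51.5019, -0.1415, 1586865792)  // 14 April 2020, 12:03:12 UTC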
Hashing Design Notes
Given the same location, 5 minute time-window, and salt, the Safe Paths App will compute the same Hash. By comparing hashes, it can therefore determine whether it has a location match with the data shared by the Health Authority.
Hashes are computationally difficult to reverse. The only feasible way to do so is to pre-compute all the possible hash values in a target area, and check for them.
Our initial proposed implementation using SHA-256 hashes of a 7 character (+/-76m) geohash + timestamp was very easy to brute force, for a modest target area.
a 10km x 10km area will contain about 6,000 distinct 7 character geohashes.
so to crack all the data for a given day (288 x 5 minute windows), you’d need to compute about 1.7M SHA-256 hashes (let’s say 1M hashes, on the basis that some nighttime hours are probably not of interest).
A powerful GPU can compute 3 Billion SHA-256 hashes / second, so this computation is utterly trivial, and doesn’t represent any significant protection against a remotely sophisticated attacker.
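To make the scale of that concrete, here is a rough sketch of the dictionary attack (assuming the original data string was simply the 7 character geohash concatenated with the window timestamp; names are illustrative):

import { createHash } from "crypto";

// Enumerate every geohash cell in the target area and every 5-minute window in the
// day, hash each combination, and test it against the published set of hashes.
// ~6,000 cells x 288 windows is only ~1.7M SHA-256 hashes.
function attackOneDay(areaGeohashes: string[], windowTimestamps: number[], published: Set<string>): Array<[string, number]> {
  const hits: Array<[string, number]> = [];
  for (const gh of areaGeohashes) {
    for (const t of windowTimestamps) {
      const candidate = createHash("sha256").update(`${gh}${t}`).digest("hex");
      if (published.has(candidate)) {
        hits.push([gh, t]);
      }
    }
  }
  return hits;
}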
There are two key strategies against brute force attacks:
Increase the entropy in the data space you are hashing (i.e. more possible values).
Move from a fast hash algorithm (like SHA-256), to a slow hash algorithm.
The idea of a slow hash algorithm is to make it computationally hard to calculate the hash in the first place. That comes at a cost, as Safe Paths & Safe Places must compute these hashes, but the idea is that the cost to the attacker is much higher.
Let’s look at some numbers.
Safe Paths has to hash one data point every 5 minutes, for a total of about 4,000 stored data points (288/day x 14 days).
Safe Places has to hash approx 1,000 data points per published case (assuming 25% of the original set of points are kept). Publishing 100 cases a day would thus require 100,000 hashes. Note that data published the previous day does not need to be re-computed. If the App is widespread, Safe Places can get these hashes transferred from the App (which will already have done the computation), but for MVP1 we expect most Safe Places data to be generated manually, as most infected users won’t have the App pre-installed. So Safe Places may have to compute 100k hashes itself, and we’ll have to tune the cost so that this takes a viable amount of time (2-4 hours is maybe acceptable). Note that in the very short term, we don’t expect as many as 100 cases/day.
Suppose we can spare 1 CPU-second on a Smartphone to compute a hash (we may have to perform up to 4 hashes per data point due to corner-cases). That should be acceptable once every 5 mins without excessive battery drain.
If we can pick a slow hash algorithm that might take 1 CPU second to compute on a Smartphone, that might take 0.1 CPU seconds on a more powerful desktop PC, or 0.01 CPU seconds on a powerful server (these are wild guesses that would benefit from some validation).
Then Safe Places (on a desktop PC) can complete the hashing required to publish 100 records in about 10,000 seconds (~3 hours). That might be borderline acceptable; 10 records in ~20 mins is probably OK.
An attacker using a powerful server could generate the hashes to scour a 10km x 10km area for a day in 10,000 seconds = 3 hours. That’s OK, though it would be nice to do better - see below.
An attack covering a wider area (e.g. 3,000 sq km, or South Lake Florida) would take about 90 hours to analyze the data for one day.
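The back-of-envelope arithmetic behind those figures, using the (wild-guess) per-hash times above:

// 1 s/hash on a phone, 0.1 s on a desktop, 0.01 s on a powerful server (guesses needing validation).
const hashesPerCase = 1_000;                                    // ~25% of 4,000 stored points
const safePlacesSeconds = 100 * hashesPerCase * 0.1;            // 100 cases/day on a desktop: 10,000 s ~ 3 hours
const attackerSecondsSmallArea = 1_000_000 * 0.01;              // ~1M-hash dictionary for 10km x 10km, 1 day: 10,000 s ~ 3 hours
const attackerSecondsWideArea = attackerSecondsSmallArea * 30;  // ~30x the area (3,000 sq km): ~300,000 s ~ 90 hours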
What’s a suitable slow hash algorithm? There are 3 obvious candidates (see: https://en.wikipedia.org/wiki/PBKDF2)
PBKDF2 - Way better than SHA-256, but not ideal as it is vulnerable to attacks using dedicated hardware such as ASICs or GPUs
bcrypt - Great for passwords, but not viable (as far as I can see) for this application, because the hashes use random salts and so are not consistent.
scrypt - Gives consistent hashes, and by design has high memory requirements, which protects against attacks using ASICs or GPUs.
Hence scrypt is the obvious choice to use. The key parameters chosen above are as follows:
A “cost” of 2^18 - this takes ~1 second to compute on my 4-year-old Windows PC. Testing on Smartphones is needed, and might lead to slight adjustments.
A “maxmem” of (2^18 + 3) * 1024 = ~262MB (this is the memory needed to work with a “cost” of 2^18; see: https://github.com/nodejs/node/issues/21524 )
A block size of 8 - this is the default.
A keylen (output) of 8 bytes = 16 hex digits - this is way smaller than the default of 64, but it’s large enough for our purposes, given the fairly low entropy of the data we are hashing (only about 5 x 10^16 possible values - see below).
Based on the above, it might take a typical high-end server ~3 hours to brute force attack a 10km x 10km area for a single day.
We could do better if we could increase the entropy of our possible values.
One easy win here is to move from 7 character to 8 character geo-hashes. This would increase our entropy by a factor of 32 (moving from +/-76m accuracy to +/-19m accuracy).
That would push the time to execute an attack on a day of data covering a 10km x 10km area to approximately 96 hours (32 x the ~3 hours above) of computation time on a typical high-end server. That’s a dramatically improved level of protection.
The rationale for using 7 character geohashes was to protect privacy by blurring individuals' locations. However, the consequence is that we have reduced entropy 32-fold, making brute force attacks far easier.
Privacy would be better preserved by a move to 8 character geohashes with +/-19m accuracy.
Note that there is no value in moving to more precise geohashes: since GPS inaccuracy means we need to match over a ~20m range, with more precise geohashes the app would end up having to compute many more of them.
If we do this, what is our overall Entropy (space of possible values)?
There are approx 100k 5-min intervals in a year
Surface area of Earth is approx 510M sq km => roughly 500 Billion 8-character geohashes
Total possible points in a year = (1 x 10^5) x (5 x 10^11) = 5 x 10^16.
That’s a lot smaller than 16^16 (~1.8 x 10^19), which explains why a 16 character hex string / 64 bits is plenty long enough as a hash.
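The same entropy calculation, spelled out:

// Possible (8-character geohash, 5-minute window) pairs in a year.
const windowsPerYear = 365 * 24 * 12;             // ~105,000 five-minute windows
const geohashCells = (510e6 * 1e6) / (38 * 19);   // Earth's surface / cell area ~ 7 x 10^11 (call it 5 x 10^11 as above)
const possibleValues = windowsPerYear * 5e11;     // ~5 x 10^16, far below 2^64 ~ 1.8 x 10^19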
(Of course a given attacker is probably interested in a small number of days, in a small area, so they will be able to target a much smaller space - e.g. the 96 hours to cover a 10km x 10km area for 1 day.)
Increasing entropy further would require some information that is known to both phones at the point they are logging a GPS data point, but is not easily known to a brute force attacker. In theory there may be data from mobile sensors that could be incorporated here (which cell towers are available? what WiFi networks are visible? etc.), but these all seem complicated to incorporate in a reliable way.
Therefore the best we can do to protect our data is:
Increase entropy as high as we reasonably can (i.e. to 8 character geo-hashes / +/-19m accuracy)
Use the best slow hash algorithm, which for this application looks to be scrypt.
The analysis above suggests that this will do a good job of protecting most HA published data from all but the very most determined attackers.
Future Enhancements to JSON Format for HA Exposure Data
The linked article includes some thoughts about how we might extend this spec in future:
Future Enhancements to MVP1 JSON Format
These are definitely post-MVP1, and quite possibly much later, so there is no need for detailed review of these at this point, but they are indicated to show some possible future directions, as these might impact the MVP1 design.
JSON Format for Location Sharing
For now, this does not need any changes.
We may make some changes in future to save the cost of computing hashes in Safe Places when they have already been computed by the App, but this is not worthwhile for MVP1, as the App won’t be widespread enough - see:
Future Enhancements to MVP1 JSON Format