Points of Concern Hashing - next steps after MVP1
Current Status
Current PoC Hashing implementation uses:
8 char geohashes (38m x 19m)
A Scrypt hash with cost of 2^12 = 4096
(following the general approach described here: privacy-security-GPS/documents/privacy.md at master · Path-Check/privacy-security-GPS )
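For concreteness, a minimal sketch of this scheme in Python (the salt handling, payload format, scrypt r/p parameters, and 8-byte digest length here are illustrative assumptions, not the actual Safe Places wire format):

```python
# Minimal sketch of the PoC hashing: scrypt over an 8-char geohash plus a
# timestamp, at the current cost of N = 2^12 = 4096. Details such as the
# salt and payload encoding are assumptions for illustration.
import hashlib

SCRYPT_N = 2 ** 12  # current cost = 4096
SCRYPT_R = 8        # assumed; not specified in this doc
SCRYPT_P = 1        # assumed; not specified in this doc

def hash_point(geohash8: str, timestamp: int, salt: bytes = b"") -> bytes:
    payload = f"{geohash8}{timestamp}".encode("utf-8")
    # 8-byte digests, matching the storage estimate later in this doc
    return hashlib.scrypt(payload, salt=salt, n=SCRYPT_N,
                          r=SCRYPT_R, p=SCRYPT_P, dklen=8)

print(hash_point("9q8yyk8y", 1589500800).hex())
```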
Performance data points:
Moto G7 ($200 phone) computes 4 hashes in 200 msecs
Redmi 6A ($100 phone) computes 4 hashes in 400 msecs
On Safe Places:
We are computing 25k hashes in 392 seconds on the GCP Cloud Infra (cost = 4096 = 2^12)
That’s ~16 msecs per hash: about 3x faster than the Moto G7 and 6x faster than the Redmi 6A, which is in the same ballpark as what we had projected previously.
Plan of Record
Plan of record was to stick with 8 char geohashes, increase the cost exponent to ~15 or 16, and to do so in MVP1.
Our goal was ~1 second per hash on a “typical” smartphone, which suggests we want a ~10-fold increase in cost.
Hence a cost of 2^15 or 2^16 (8x or 16x more than today).
That would increase the cost of 100k hashes on Safe Places to 8k or 16k seconds (roughly 2 to 4 hours). That’s still on Jeff’s laptop; let’s assume the Google servers are similar (I’m seeking input from Sherif on this).
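A quick sanity check of those projections (a sketch assuming Scrypt time scales linearly with the cost parameter N, and the ~10 msecs/hash laptop baseline at N = 2^12 implied by the numbers above):

```python
# Back-of-envelope publish-time projection. Assumes scrypt runtime scales
# linearly with N and a ~10 ms/hash baseline at N = 2^12 (illustrative).
BASELINE_MS = 10  # per hash at N = 2^12

for exponent in (12, 15, 16):
    per_hash_ms = BASELINE_MS * 2 ** (exponent - 12)
    total_s = 100_000 * per_hash_ms / 1000
    print(f"N = 2^{exponent}: {per_hash_ms} ms/hash -> "
          f"100k hashes in {total_s:,.0f} s ({total_s / 3600:.1f} h)")
```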
This is in the same ballpark as the 3 hours projected here: How much protection does hashing offer?
However, discussion with the Safe Places dev team has revealed that while there may be no issue with a 3-hour publish step from a requirements point of view, it creates significant issues from an implementation point of view.
Probably we’d want to split out a separate service to do the hashing, and we’d need to consider what happens when a second request to hash data comes in while the first is still being processed, etc.
For these reasons (and because we are not yet very concerned about security for the early MVP pilots), we are leaving the cost at 4096 = 2^12 for now, and tackling the cost increase as a post-MVP1 activity.
Options for post-MVP1
We could move ahead with a cost increase.
Another alternative would be to move to 9 character geohashes.
We ruled this out previously on the basis that
GPS accuracy was too wayward for 9 char geohashes (5m x 5m) to be meaningful.
There is no improved security benefit. While 9 char geohashes increase the entropy of the data space (by a factor of ~30), to cover a 20m radius match area the number of hashes the app has to compute increases by about the same factor, so the hash cost has to be reduced to compensate.
Hence we ruled this out as more complexity in implementation for little extra benefit.
Looking at this again, though, the argument is not so clear-cut:
You can approximate a 20m radius circle much more accurately with 9 char geohashes than with 8 char geohashes, giving more accurate matching. With 8 char geohashes, we have to talk about people having been within 100ft of each other. With 9 char geohashes, we could change to talking about 60ft or 70ft. Effectively we’d move from the yellow line (MVP1) to the orange line.
(note that reducing the match radius from 100 ft to 60 ft reduces the match area nearly 3-fold, so the benefit is substantial in terms of avoidance of false positives)
With accuracy improvements made in MVP1 (on Android at least, and iOS to follow soon), we have GPS accuracy to match this (i.e. the improved accuracy would not be spurious).
There are no savings in hashing costs on the phone, but there are major savings for Safe Places, which still only has to hash each point of concern once.
Since we have already implemented a function that matches multiple “nearby” geohashes, it doesn’t make much difference to complexity whether this number is 2-3 or 50-60. And determining the set of 5m x 5m 9 char geohashes that make up a 20m radius circle is not very hard: even a simplistic algorithm, with loads of scope for optimization, takes less than 10% of the cost of computing the hashes (see the sketch below).
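A minimal sketch of such a simplistic algorithm (self-contained geohash encoder; the 1.2m sampling grid and the flat-earth metre conversions are assumptions that are fine at 20m scales):

```python
# Enumerate the 9-char geohash cells covering a circle by sampling points
# on a fine grid and encoding each sample. Sampling at 1.2 m (a quarter of
# the ~4.8 m cell size) avoids skipping cells even at high latitudes,
# where cells narrow; edge cells are handled only approximately.
import math

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lon: float, length: int = 9) -> str:
    """Standard geohash encoding: 5 bits per character, longitude bit first."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    out, bits, ch, even = [], 0, 0, True
    while len(out) < length:
        if even:  # longitude bit
            mid = (lon_lo + lon_hi) / 2
            ch = ch * 2 + (lon >= mid)
            if lon >= mid: lon_lo = mid
            else: lon_hi = mid
        else:     # latitude bit
            mid = (lat_lo + lat_hi) / 2
            ch = ch * 2 + (lat >= mid)
            if lat >= mid: lat_lo = mid
            else: lat_hi = mid
        even = not even
        bits += 1
        if bits == 5:
            out.append(_BASE32[ch])
            bits, ch = 0, 0
    return "".join(out)

def cover_circle(lat: float, lon: float, radius_m: float = 20.0) -> set:
    """Set of 9-char geohashes containing a sample point within radius_m."""
    m_per_deg_lat = 111_320.0
    m_per_deg_lon = m_per_deg_lat * math.cos(math.radians(lat))
    step = 1.2  # metres between sample points
    n = int(radius_m / step) + 1
    cells = set()
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            dy, dx = i * step, j * step
            if dx * dx + dy * dy <= radius_m * radius_m:
                cells.add(geohash_encode(lat + dy / m_per_deg_lat,
                                         lon + dx / m_per_deg_lon))
    return cells
```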
What are the downsides?
Increased data storage on the phone. Maybe ~120 hashes stored for each location. But they are only 8 bytes each, and there are only 4000 points (for 2 weeks' data): 4000 x 120 x 8 ≈ 4 MB. That’s not enough data to be a big deal.
Increased compute costs on the phone for intersection calculations. But since this just comes down to comparing hash values, the compute cost should not be too high, and it occurs only once every 12 hours (see the sketch after this list). Needs further assessment, though.
Further ahead, this may increase compute costs for more sophisticated encryption schemes, e.g. FHE - see below.
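A sketch of that intersection check (the data shapes are assumptions: local hashes held in a set, published points of concern arriving as a list of digests):

```python
# On-phone exposure check: set intersection between locally stored hashes
# and the downloaded points-of-concern hashes. With ~4,000 points x ~120
# hashes = ~480k local entries, this runs in well under a second.
def find_exposures(my_hashes: set[bytes], published: list[bytes]) -> set[bytes]:
    return my_hashes & set(published)
```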
Looking at the numbers in detail:
| Degrees north | Number of geohashes in 20m radius circle | Increase in cost vs. current average of 2.3* |
| --- | --- | --- |
| 0 | 58 | 25x |
| 30 | 66 | 28x |
| 45 | 78 | 34x |
| 60 | 107 | 46x |
*Note that this average of 2.3 actually varies by latitude as well, although not by as much.
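Using the cover sketch above, the latitude dependence can be reproduced directly (counts will vary slightly with the exact centre point and sampling step):

```python
# Rough reproduction of the table: 9-char geohash counts for a 20 m radius
# circle at various latitudes, and the multiplier vs. the current average
# of 2.3 hashes per point with 8-char geohashes.
for lat in (0.0, 30.0, 45.0, 60.0):
    n = len(cover_circle(lat, 0.0, radius_m=20.0))
    print(f"lat {lat:>4}: {n} geohashes, ~{n / 2.3:.0f}x current cost")
```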
Assuming we stick with the ~5 second allowance for a smartphone to compute all hashes (including 2 timestamps), then we can afford approx 100 msecs per hash.
To achieve this on lower end phones, we probably need to reduce the hash cost to 2048 = 2^11.
However, given the 30x entropy increase in the location space, this delivers equivalent protection to a hash cost of 65536 = 2^16 with 8 char geohashes.
As per previous analysis, this is roughly net neutral in terms of hashing costs for the app and the attacker. But it massively reduces Safe Places hashing costs, and gives much more accurate proximity detection.
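The equivalence is simple exponent arithmetic, treating the attacker’s brute-force work as (hash cost) x (size of the location space):

```python
# One extra geohash character multiplies the location space by 2^5 = 32,
# so a 9-char scheme at N = 2^11 costs an attacker about as much as an
# 8-char scheme at N = 2^16.
assert 2 ** 11 * 2 ** 5 == 2 ** 16
```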
Looking Ahead - FHE
Long-term, we anticipate moving from Scrypt hashing to an FHE-based solution.
The FHE solution will no longer depend on the entropy of the location space, so the finer resolution does not deliver much benefit (except perhaps against some hypothetical types of brute-force attacks using rooted phones).
However the increased number of data points to be checked for matches will create issues.
The volume of computations that need to be performed by “Server 1” in the FHE design will increase by a factor of ~30 (as per the above), as it will have to check 30x as many data points.
This may be an issue for the FHE design. However, the alternative is that we stick with 8 character geohashes indefinitely, and cannot deliver more accurate exposure notifications than the 100 feet we are delivering in MVP1.
It seems to me that we should make the move to 9 character geohashes, and make it a requirement on the FHE initiative that it cope with this. However I’d value input on this from those who will be developing the FHE code.
The future - Public Data
From our engagement with the Open Security Summit, we have been getting some strong encouragement to drop the idea of Points of Concern being private, and embrace the idea that they are public data.
The principal argument is as follows:
Because the published data is accessible to App users, it is not going to be possible to keep it secret from a sufficiently determined and well-resourced attacker, no matter what we do.
If we believe the data is secret, this may lead to us putting insufficient attention into making sure the data is properly redacted, and sufficiently aggregated to assure anonymity.
The ideal design from a security point of view would have a very clear separation between data that is private and secure, and data that is public. All the “points of concern” should be in the latter category, with anonymization that we have full confidence in, such that we are completely comfortable publishing the data in plain text, and even encouraging researchers to analyze the data set.
Should we move to such a model for our “points of concern” data?
There could be some significant benefits of this data being public:
Shareable URLs for specific points of concern. For non-app users, the ability to share points of concern with others on social media could be an entry point into the PathCheck ecosystem. Once they find these points of concern useful, it should be fairly easy to convert them to using the App as the optimal way to review such data.
Being able to publish more data in the app such as exact place, placename & time for an exposure would be useful for users in assessing their risk.
It opens up the possibility for many more innovative uses of the data.
Probably more…
However, we believe that at least some of the data that is acceptable to share with individual users who may be infected could be problematic if it were made completely public.
Probably “should ‘points of concern’ data be public?” is the wrong question to ask. Better to ask:
Which “points of concern” data should be public, while we continue to develop technology that allows us to share points of concern with individual infected users without making them public?
We can, in parallel, continue to develop such technology, while also exploring procedures by which Health Departments could determine that particular data points can be made completely public, and then developing technology to allow this sub-category of points of concern to be exposed in a plain-text, sharable form.
Proposal for post-MVP1
I propose that post-MVP1, we do not increase the cost of the Scrypt hash (as previously planned).
Instead, we move to 9 character geohashes, with an appropriate Scrypt cost (probably 2048 = 2^11), with the key benefits being:
Increased location match accuracy on exposure notifications
Reduced hashing time required on Safe Places - and hence faster publishing & lower compute costs.
I also propose that:
The FHE project assumes that it has to cope with 60+ time+location points for each single data point recorded by an app user.
We explore potential options for identifying a subset of “points of concern” data that can be published in plain text, with all the various benefits that that may bring, while continuing to develop technology that allows us to share more-sensitive points of concern in a manner that allows for exposure notifications, but limits use for other purposes.