GAEN Scale & Robustness Testing
There are in fact two servers in the GAEN solution:
The Verification Server (VS)
The Exposure Notification Server (ENS)
The role of the Verification Server is to generate (for Health Department staff) and then validate (when submitted by an App User) the verification codes that are used to confirm a COVID-19 diagnosis.
The Exposure Notification Server then receives the keys from a patient with a confirmed diagnosis, and maintains a set of key files containing all keys from diagnosed patients for the last 14 days, including aging out older keys as they expire.
The key data itself is served to app users by a Content Distribution Network. Neither the ENS nor the Verification Server is directly involved in this flow.
Scaling concerns
Suppose we have a population of 10M, with 50% uptake of the App (5M users), and 1% of the population diagnosed with COVID-19 in a given week. We assume that every diagnosed person reports their diagnosis through the App. These estimates are all on the high side, but they allow us to begin modelling the scale.
Verification Server
The verification server has to generate one code, and verify one code, for each diagnosis. With the numbers above, that is 100k diagnoses per week, or approx 15k per day. If those are bunched into 8 working hours, this suggests a traffic load of about 2k/hour (for each of code generation and code verification) - not a very demanding level of load.
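As a sanity check on this arithmetic, a minimal back-of-the-envelope model. The population, uptake, diagnosis-rate and working-hours figures are the assumptions stated above, not measured values:

```python
# Back-of-the-envelope load model for the Verification Server,
# using the assumed figures above (not measured values).
POPULATION = 10_000_000
APP_UPTAKE = 0.50                # 50% of the population install the App
WEEKLY_DIAGNOSIS_RATE = 0.01     # 1% of the population diagnosed per week
WORKING_HOURS_PER_DAY = 8        # codes assumed to be issued within an 8-hour window

app_users = int(POPULATION * APP_UPTAKE)                       # 5,000,000
diagnoses_per_week = int(POPULATION * WEEKLY_DIAGNOSIS_RATE)   # 100,000
diagnoses_per_day = diagnoses_per_week / 7                     # ~14,300

# One code generated and one code verified per diagnosis.
requests_per_hour = diagnoses_per_day / WORKING_HOURS_PER_DAY  # ~1,800 of each, per hour

print(f"{diagnoses_per_week=:,} {diagnoses_per_day=:,.0f} {requests_per_hour=:,.0f}")
```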
The verification server may be subjected to more load if there is also a large number of invalid requests, perhaps due to user error, attempts to brute-force the verification process, or a DOS attack.
These scenarios all merit further consideration and testing.
Exposure Notification Server
The Exposure Notification Server receives at most 14 keys for each diagnosed patient. Note that although the Random IDs exchanged by smartphones change every 15-20 mins, these are derived from keys that only change once a day, and only those daily keys need to be shared with the ENS.
With 100k diagnoses per week, and keys stored for up to 2 weeks, the ENS database may have to store records for 200k patients. On average it will hold about 7 keys per patient (each patient uploads up to 14, but older keys are progressively aged out, so roughly half remain at any time). So the total number of keys is approx 1.4M.
In terms of database size, this is still pretty small, and scale does not seem like a particular risk here.
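A similarly rough sizing of the ENS key database, under the same assumptions (the average of ~7 retained keys per patient is the estimate given above):

```python
# Rough sizing of the ENS key database, using the estimates above.
DIAGNOSES_PER_WEEK = 100_000
RETENTION_WEEKS = 2            # keys are retained for 14 days
AVG_KEYS_PER_PATIENT = 7       # up to 14 uploaded, roughly half aged out on average

patients_stored = DIAGNOSES_PER_WEEK * RETENTION_WEEKS   # 200,000
total_keys = patients_stored * AVG_KEYS_PER_PATIENT      # 1,400,000

print(f"{patients_stored=:,} {total_keys=:,}")
```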
The verification server should protect the ENS from user errors, or brute-force attacks, but DOS attacks are still a risk to be considered.
Content Distribution Network
While a database of 1.4M keys is small, the whole database has to be distributed to every app user in the community.
This creates some challenges for the mobile app, and also for the CDN.
Supposing each key is ~20 bytes (need to check whether the keys are shared in a text or binary format etc.), that’s approx 28MB of data sent to each user.
(We need to understand whether the whole file is shared every time, or just the delta - we suspect there is a mechanism that allows just the deltas to be served.)
To serve this data to 5M users in the community requires a total of 140TB of egress data from the CDN. That's a substantial volume of data.
This is at least 140TB every 2 weeks, and potentially more if the app has to re-download data that it has already downloaded.
With egress data charges typically around $0.05/GB, 140TB works out to roughly $7,000, so on the order of $7,000 per week if the full data set is effectively distributed each week. Not a huge cost to protect a community of 10M people, but still a cost that needs to be planned for.
In terms of actual data volume, 140TB per week, or 20TB per day, is fairly small in terms of CDN capacity - equivalent to about 20k hours of Netflix streaming, which could easily be consumed by 1,000 households.
So there is no reason to be concerned about the CDN network’s ability to scale to this level of load, but the costs do need to be planned for.
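The CDN figures above, expressed as a sketch. The 20-byte key size, the $0.05/GB egress rate, and the assumption that the full data set is effectively distributed about once a week are all estimates still to be confirmed (see the follow-up items at the end of this page):

```python
# Rough CDN payload, egress volume and cost, using the estimates above.
TOTAL_KEYS = 1_400_000
BYTES_PER_KEY = 20            # assumption: text vs binary format still to be confirmed
APP_USERS = 5_000_000
EGRESS_COST_PER_GB = 0.05     # typical published CDN egress rate, USD

payload_per_user_mb = TOTAL_KEYS * BYTES_PER_KEY / 1e6          # ~28 MB
total_egress_tb = payload_per_user_mb * APP_USERS / 1e6         # ~140 TB per full distribution
cost_per_distribution = total_egress_tb * 1_000 * EGRESS_COST_PER_GB   # ~$7,000

print(f"{payload_per_user_mb:.0f} MB per user, "
      f"{total_egress_tb:.0f} TB total, ~${cost_per_distribution:,.0f}")
```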
As with the Verification Server and the ENS, DOS attacks also need to be considered. A small volume of ingress data can trigger a large volume of egress data, so an attacker could flood the network with egress traffic at great cost to whoever is running the service, while incurring little cost or effort themselves.
(To check: is there any authentication on requests to download the keys data?)
App Scaling Concerns
For the App itself, we also need to be concerned with scale.
The App collects Rotating IDs (RIDs) from all devices around it. In a busy public place, it could be exposed to tens, or even hundreds, of these identifiers at a time. It records them once every 5 minutes, and the identifiers themselves change every 15-20 minutes.
In a worst-case scenario, a device collecting 100 RIDs every 5 minutes for 2 weeks could end up with around 400k RIDs recorded (though in most realistic scenarios the number will be a tiny fraction of this).
To compute exposures, the smartphone then has to compare these RIDs against the 1.4M keys that may be downloaded from the ENS.
The complexity of this calculation is proportional to the product of these two numbers: 1.4M x 400k - so in the case where both numbers are large, this could prove a very taxing computation for the smartphone.
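To make the worst case concrete, a rough estimate of the quantities involved, assuming a naive pairwise comparison (the actual matching performed by the OS may well be more efficient than this):

```python
# Worst-case on-device matching estimate, assuming a naive pairwise comparison.
RIDS_PER_INTERVAL = 100        # busy public place, worst case
INTERVALS_PER_DAY = 24 * 12    # one recording every 5 minutes
DAYS_RETAINED = 14

stored_rids = RIDS_PER_INTERVAL * INTERVALS_PER_DAY * DAYS_RETAINED   # ~403,000
downloaded_keys = 1_400_000

naive_comparisons = stored_rids * downloaded_keys                     # ~5.6e11
print(f"{stored_rids=:,} naive_comparisons={naive_comparisons:.1e}")
```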
Such large numbers would require:
A user who spent the majority of the week in close proximity to dozens of other app users
Who lives in an area with a very high COVID-19 case-load.
So such extreme cases will be very rare (and therefore it may be acceptable if they are not handled perfectly) - but it is hard to define significantly lower limits and be confident that the app will not exceed them.
A further note: testing these scenarios is extremely difficult. Access restrictions on the GAEN function in the OS mean that the only way to get a large number of RIDs into a device is to actually expose the device to the Bluetooth broadcasts of those RIDs. We do not yet have a technical proposal for how we would achieve this. Running the app on a set of 100 phones all in close proximity for 2 weeks might not be impossible, but it is certainly logistically complex! (The same test with maybe 20 phones might be feasible, though.)
What about larger communities?
We have modelled a community of 10M people. Some countries are considerably larger than this - over 100 times larger in the case of India or China.
Further, the total volume of data distributed is roughly proportional to the square of the population, as it is determined by the number of infections multiplied by the number of app users, both of which can be expected to scale linearly with the population.
Therefore CDN capacity and costs can be expected to rise with the square of the population, which needs careful consideration for very large deployments.
For the individual app user, the volume of data downloaded is proportional to the community size, not the square. The number of RIDs stored on the user’s phone depends only on that user’s behaviour, and not on the size of the overall population.
However we do need to watch out for the fact that the number of RIDs stored will go up as the adoption level in the community goes up.
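A quick illustration of the N-squared effect on total CDN egress, holding uptake (50%) and the weekly diagnosis rate (1%) fixed; the larger populations are purely illustrative:

```python
# Illustration of how total CDN egress scales with community size,
# keeping uptake and weekly diagnosis rate fixed (assumptions as above).
BYTES_PER_KEY = 20
AVG_KEYS_PER_PATIENT = 7
RETENTION_WEEKS = 2
UPTAKE = 0.50
WEEKLY_DIAGNOSIS_RATE = 0.01

def egress_tb_per_distribution(population: int) -> float:
    app_users = population * UPTAKE
    keys_in_db = population * WEEKLY_DIAGNOSIS_RATE * RETENTION_WEEKS * AVG_KEYS_PER_PATIENT
    return keys_in_db * BYTES_PER_KEY * app_users / 1e12   # terabytes

for pop in (10_000_000, 100_000_000, 1_000_000_000):
    print(f"population {pop:>13,}: ~{egress_tb_per_distribution(pop):>12,.0f} TB per distribution")
```

Each 10x increase in population gives roughly a 100x increase in total data distributed, which is the N-squared effect described above.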
Robustness concerns
As well as confirming that we can handle the required level of scale in benign conditions, we also want to validate that the solution continues to operate well in adverse conditions.
Adverse Conditions - Server
The server implementation runs as a “serverless” deployment within Google’s cloud infrastructure, and is architected to be resilient to a wide range of adverse conditions, for example:
The service is delivered by an automatically scaling pool of multiple instances, spread across multiple servers in multiple Google Data Centers within a single Availability Zone.
In normal operation, load is balanced across all these instances, and when a failure occurs, load is simply redistributed across the remaining instances.
This architecture and the related technology are standard, well-understood cloud practice, and have proven effective at covering fault conditions that may arise on individual servers.
Because the ENS and Verification services are implemented using this “serverless” architecture, we do not have the ability to inject failures and fault conditions on the underlying hardware. This means we are very limited in the level of explicit robustness testing we can perform regarding resilience to faults that occur on Google’s servers. The approach we are taking is simply to trust the architecture, and Google’s track record in delivering “serverless” infrastructure in a seamless way.
Sherif: how true is this? Could we not e.g. kill 50% of our serverless instances at a time and observe that the traffic load continues to be handled correctly without interruption?
Some Health Departments will want to deploy in clouds other than Google, e.g. AWS. Assuming they pick a serverless deployment model in AWS, the same is likely to be true, and there will be little we can do in terms of explicit robustness testing: the focus will need to be on the careful review of the architecture implemented.
Specific adverse conditions for the Server that we can (and should) test include:
Overload. Can we run 50% or 100% above our expected load, and still function correctly? (A minimal load-driver sketch is included below.)
DOS. Can we handle DOS attacks effectively?
For discussion with Sherif & other experts: is there anything else we can / should do here?
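As a starting point for the overload test referenced above, a minimal load-driver sketch. The endpoint URL and request payload are placeholders, not the real Verification Server API, and the rate, duration and concurrency figures are illustrative only:

```python
# Minimal load-driver sketch for the overload test: issue requests against a
# (placeholder) code-verification endpoint at a multiple of the expected rate,
# then report the distribution of response codes.
import json
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET_URL = "https://verification.example.org/api/verify"   # placeholder, not the real API
EXPECTED_RATE_PER_HOUR = 2_000                               # from the load model above
OVERLOAD_FACTOR = 2.0                                        # 100% above expected load
DURATION_SECONDS = 600

def send_one(i: int) -> int:
    # Deliberately invalid codes: we are exercising load handling, not the happy path.
    body = json.dumps({"code": f"00000000{i}"}).encode()
    req = urllib.request.Request(TARGET_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code                                         # 4xx is still a handled response

def run() -> None:
    interval = 3600 / (EXPECTED_RATE_PER_HOUR * OVERLOAD_FACTOR)
    futures = []
    with ThreadPoolExecutor(max_workers=50) as pool:
        start = time.time()
        i = 0
        while time.time() - start < DURATION_SECONDS:
            futures.append(pool.submit(send_one, i))
            i += 1
            time.sleep(interval)
    codes = [f.result() for f in futures]
    print(f"sent {len(codes)} requests; status counts:",
          {c: codes.count(c) for c in set(codes)})

if __name__ == "__main__":
    run()
```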
Adverse Conditions - Mobile App
There are a range of adverse conditions that the Mobile App could be subject to:
Poor or no network connectivity
Low CPU or RAM due to other app activity
Low storage
Low power
Etc.
In general, we don’t need the app to handle maximal scale under such conditions. Our goal should be simply to ensure that when such conditions impact the App’s ability to operate as expected, the conditions are clearly flagged to the user.
This is mostly covered as functional testing - see GAEN-97. Once these tests are completed in low-scale functional test scenarios, it would also be desirable to validate that the behaviour of the app at large scale is reasonable. However, this is not a top priority.
Testing / Monitoring in Production
As we move from test into production, there are a number of concerns that we will want to monitor for:
Is the volume of traffic in line with our expectations, given the level of uptake?
Is resource usage (Number of instances running, Memory, CPU, bandwidth etc.) in line with expectations given the observed traffic volumes?
Are response times to users in line with requirements and expectations? (We might set up and monitor some specific probes for this purpose - see the sketch after this list.)
Potential explicit monitoring for particular adverse conditions we might anticipate (e.g. DOS attacks).
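A minimal sketch of the kind of response-time probe mentioned above; the health-check URL, latency budget and probe interval are placeholders to be agreed:

```python
# Minimal response-time probe: hit a (placeholder) health endpoint periodically
# and flag slow or failed responses.
import time
import urllib.request

PROBE_URL = "https://exposure.example.org/healthz"   # placeholder health endpoint
LATENCY_BUDGET_MS = 500                              # placeholder response-time target
PROBE_INTERVAL_SECONDS = 60

def probe_once() -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(PROBE_URL, timeout=10) as resp:
            latency_ms = (time.monotonic() - start) * 1000
            status = resp.status
    except Exception as exc:                          # network errors count as failures
        print(f"PROBE FAILURE: {exc}")
        return
    flag = "OK" if status == 200 and latency_ms <= LATENCY_BUDGET_MS else "ALERT"
    print(f"{flag}: status={status}, latency={latency_ms:.0f}ms")

if __name__ == "__main__":
    while True:
        probe_once()
        time.sleep(PROBE_INTERVAL_SECONDS)
```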
Beyond this, we may wish to explicitly invoke certain fault or failure conditions in production, to build our confidence that the system can handle them smoothly (Netflix-style “Chaos Engineering”).
Further review & discussion is needed in this area.
Issues to follow up on
All highlighted in the text above - pulled out here for clarity on what needs to be followed up on:
What technology exists in the GAEN solution to protect against DOS attacks?
How many bytes does it take to distribute a key? (depends on many factors, but one key factor will be whether the data is binary or readable text).
Does a phone have to download the full set of infected keys every time, or are there optimizations so that each infected key only needs to be downloaded once? (or somewhere in between?)
How can we test with a large number of RIDs on a device, given that we cannot add them programmatically, but only by running other Bluetooth devices nearby?
How sensible are our estimates…
For 50% adoption of the app in a community
For 1% of population being diagnosed in a single week (seems very high, but we do want to model conservatively)
For the number of RIDs that a user’s device might plausibly pick up in a week.
Robustness: what is the reality of what we can / cannot do in terms of simulating adverse conditions in the Server Network?
Move forward conversation on Testing/Monitoring in Production.