Testing in Production (TiP)

Goals

Testing in production is a core competency: it exposes risks that only appear in the live environment so that they can be mitigated.

This page provides context, use cases, and how to get involved.

Use Cases

 

Safe Paths - Crowd-Testing

Safe Paths - Location-Data

 

Safe Places - Machine-Data / APM

 

Epic / User Story

Target release: MVP

Epic / User Story: https://pathcheck.atlassian.net/browse/TEST-7

Document status: IN PROGRESS

Document owner: @Jonathon Wright

Designer: @Todd DeCapua

Tech lead: @Eran Kinsbruner

Technical writers:

QA: @Diarmid Mackenzie

Task / Work in Progress (WIP)

Task / Goal / WIP

Status

Use Case - Undiagnosed users

https://pathcheck.atlassian.net/browse/TEST-10?atlOrigin=eyJpIjoiMDA2ZGI1NjJiMjhhNGRkNDlmNDJhNjMxZjQxOWQ0NzQiLCJwIjoiaiJ9

Use Case - Diagnosed users

https://pathcheck.atlassian.net/browse/TEST-11?atlOrigin=eyJpIjoiMjExOWVlNWJlNmVlNDk1Zjk1NTE3NDQ4NjViMzMxMDciLCJwIjoiaiJ9

Use Case - Contact tracer

https://pathcheck.atlassian.net/browse/TEST-12?atlOrigin=eyJpIjoiOGNhYTAwNjhiOTRkNDBkZDhkYWE0ZGVhZGJmZjJmMGEiLCJwIjoiaiJ9

Use Case - Health authority (HA)

https://pathcheck.atlassian.net/browse/TEST-13?atlOrigin=eyJpIjoiMTY4MGY4YzcwM2QwNDkzYmIwNjk1YjcxMzkxOWYwNWUiLCJwIjoiaiJ9

Location Data - Diagnostics

https://pathcheck.atlassian.net/browse/TEST-14?atlOrigin=eyJpIjoiMmU4MWZjMThjYTFmNDc1ZGI4NmI2ZjUzOTQwN2QxYWIiLCJwIjoiaiJ9

Location Data - Lifecycle Management

https://pathcheck.atlassian.net/browse/TEST-15?atlOrigin=eyJpIjoiMGU0NDBlYzhlOTc4NDJkODkwNDQ0YjBlOWRmZDIxNGIiLCJwIjoiaiJ9

Location Data - Boston Area (GPX)

https://pathcheck.atlassian.net/browse/TEST-17?atlOrigin=eyJpIjoiYTkzNWJmNjA1ODZlNGQwOWI3YmExMDc5NDAxMDE0NmEiLCJwIjoiaiJ9

Safe Places - Machine Data (APM)

https://pathcheck.atlassian.net/browse/TEST-16?atlOrigin=eyJpIjoiMGEzMzdjNGM5OTQ3NGQ4MGIxYzIxODc5YjdlOWFmMjMiLCJwIjoiaiJ9

Safe Paths - Crowdtesting (Mobile Labs)

https://pathcheck.atlassian.net/browse/TEST-18?atlOrigin=eyJpIjoiMTE0MDJkMmZhNjRmNGM5OGE4MjQzMTJlNGQyNDI2NjMiLCJwIjoiaiJ9

Safe Paths - Crowdtesting (TestFlight)

https://pathcheck.atlassian.net/browse/TEST-19?atlOrigin=eyJpIjoiYzljZDRjYzk5ZjI4NDU0MDhlYWE5MTQzNTU3OTViNDIiLCJwIjoiaiJ9

Safe Paths - Crowdtesting (Google Play)

https://pathcheck.atlassian.net/browse/TEST-20?atlOrigin=eyJpIjoiOWYxY2U4MGE2N2NiNDljOTkzYWJkY2QxYzM1MzRjZjEiLCJwIjoiaiJ9

Provide Access to Local Mobile Devices

https://pathcheck.atlassian.net/browse/TEST-5?atlOrigin=eyJpIjoiMzRmY2MwZjFmZWNkNDRmZjg4NjUwN2U3ODBlYzg1OWUiLCJwIjoiaiJ9

Provide Access to Remote Cloud Devices

https://pathcheck.atlassian.net/browse/TEST-4?atlOrigin=eyJpIjoiZTAwOTFlOTI1NWU2NGRlMmE2ZmU0ODgxOWIzMzQyYmQiLCJwIjoiaiJ9

Provide Access to TestFlight

https://pathcheck.atlassian.net/browse/TEST-3

Measurements / KPIs

  1. How many actually complete set-up?

    No Analytics > Conversions Event to track event cadence for the app 'onboarding process' during startup.

  2. How many turn on location data?

    No Analytics > Unable to capture telemetry / instrumentation metadata.

  3. How many subscribe to an HA?

    Splunk for Good > Track event cadence for the ‘subscribe to HA’ event.

  4. How many open the app after they install it? 

    No Analytics > Retentions to track high-level ‘engagement’.

  5. How many get an alert that they may be infected?

    No Analytics > Push notifications would need to be implemented.

  6. How many location data points does their app log?

    Splunk for Good > Depending on the device this calculation can be made during the export function.

  7. How many infection data points do they have from HAs they are subscribed to?

    No Analytics > Unable to track this information from device (PII / GDPR)

  8. How is performance affected by network location (NV)?

    Splunk for Good: Quality > Performance. We can look at both Network Response Latency (NRL) and Device Performance (Duration Traces), as well as Custom Traces (CPU, Memory, Device Attributes) and Data Aggregation (Network / URL).

  9. How can we A/B test / run a Canary Rollout?

    Splunk for Good: Testers can be defined with a subset of user behaviors that can be flagged for rollout.
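
One way the canary rollout in item 9 could be approached is deterministic bucketing: each install is hashed into a stable bucket from 0 to 99, and a release is enabled for the first N buckets. This is only a sketch and not part of Splunk for Good or the Safe Paths code base; `install_id`, the salt value, and both function names are hypothetical.

```python
import hashlib


def rollout_bucket(install_id: str, salt: str = "tip-canary") -> int:
    """Map an install to a stable bucket in the range 0..99.

    `install_id` is a hypothetical per-install identifier; the salt lets
    different experiments use independent bucket assignments.
    """
    digest = hashlib.sha256(f"{salt}:{install_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def in_canary(install_id: str, percent: int, salt: str = "tip-canary") -> bool:
    """Return True if this install falls inside the first `percent` buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return rollout_bucket(install_id, salt) < percent
```

Because the bucket depends only on the identifier and the salt, widening the rollout from 10% to 50% keeps every install that was already in the canary group, which makes before/after comparisons easier.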

Deliverables

Apple Build (iOS / IPA) - COVID Safe Paths

Check out the pre-requisites section for any dependencies.

Provide Access to Remote Cloud Devices

Check out https://pathcheck.atlassian.net/wiki/spaces/TEST/pages/14221509

https://www.youtube.com/watch?v=RIZpGNRM_4Y

Provide Access to Local Mobile Devices

Check out https://pathcheck.atlassian.net/wiki/spaces/TEST/pages/14287088

https://www.youtube.com/watch?v=is_xD68xHcs

Provide Access to Local Mobile Devices via Appium

https://www.youtube.com/watch?v=uXdfv-d78_A

Pre-requisites

Clone the Git repositories

Download the latest IPA / APK files

https://github.com/tripleblindmarket/covid-safe-paths/releases/tag/v0.9.4

Good Practices

1. Make layers – like a stack of pancakes

The idea of “testing in production” can actually mean different things. Are you testing a bunch of test servers from within your production data center? Or are your test applications running separately on top of your production platform? Or are you truly running live tests against 100% production-deployed code? The answer should be all of these. Layer your production testing to give you the ability to test different aspects of the production environment in different ways. Then match up your test cases so as to minimize the impact that your testing – and maintenance of the test environment – has on production users.

2. Time your tests when usage is light

Non-functional testing can have an impact on your entire user base if you let it. It can make a server environment sluggish, and that’s something no one wants. Study your analytics and determine the best time to schedule your tests. For example, look for the lowest levels of:

  • Number of users on the site

  • Resource-intensive processes within the environment
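
As a rough illustration of picking a test window from analytics, the sketch below scores each hour of the day by combining the two signals listed above and returns the quietest one. The sample format `(hour, active_users, heavy_jobs)` and the `job_weight` default are assumptions made for this example, not values taken from the project's analytics.

```python
from collections import defaultdict
from statistics import mean


def quietest_hour(samples, job_weight=50.0):
    """Return the hour of day with the lowest average load score.

    samples: iterable of (hour, active_users, heavy_jobs) tuples.
    job_weight: assumed cost of one resource-intensive job, expressed
    as an equivalent number of active users.
    """
    scores = defaultdict(list)
    for hour, active_users, heavy_jobs in samples:
        scores[hour].append(active_users + job_weight * heavy_jobs)
    if not scores:
        raise ValueError("no samples provided")
    # Ties are broken in favour of the earlier hour.
    return min(scores, key=lambda h: (mean(scores[h]), h))
```

With samples such as `[(2, 10, 1), (3, 40, 0), (14, 300, 0)]`, hour 2 has the fewest users but a heavy job is running, so hour 3 is chosen under the default weight.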

3. Collect real traffic data and replay it to your test systems

Make sure to use actual traffic data you have been collecting in production (such as user workflows, user behavior and resources) to drive the generation of load for test cases. That way when you exercise your tests within your production environment, you’ll have confidence that the simulated behavior is realistic.
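
A minimal sketch of the replay idea, assuming captured requests are available as records with a timestamp, method, and path (the `RecordedRequest` type and `replay_schedule` function are hypothetical). It only computes when each request should be sent relative to the start of the replay; actually sending the requests, and scrubbing any personal data from them first, is left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedRequest:
    timestamp: float  # capture time in seconds
    method: str
    path: str


def replay_schedule(records, speedup=1.0):
    """Return (delay_from_start_seconds, request) pairs in capture order.

    Relative spacing between requests is preserved and divided by
    `speedup`, so speedup=2.0 replays the same traffic twice as fast.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    ordered = sorted(records, key=lambda r: r.timestamp)
    if not ordered:
        return []
    start = ordered[0].timestamp
    return [((r.timestamp - start) / speedup, r) for r in ordered]
```

Keeping the original spacing matters because bursts and idle gaps are part of what makes recorded traffic more realistic than a synthetic constant-rate load.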

4. Introduce a chaos monkey

According to Netflix engineers Cory Bennett and Ariel Tseitlin, “The best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient.” Netflix built what’s called a Chaos Monkey into their production environment. This code actually introduces failures into the production environment randomly, forcing engineers to design recovery systems and develop a stronger, more adaptive platform. You can put your own chaos monkey in place because Netflix released their code to GitHub.
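
The sketch below illustrates the underlying idea at a much smaller scale. It is not Netflix's Chaos Monkey, which terminates running instances; it is a hypothetical in-process decorator (`chaotic`) that makes a call fail at a configurable rate so that retry and recovery paths get exercised.

```python
import random
from functools import wraps


class InjectedFailure(RuntimeError):
    """Raised deliberately to exercise recovery paths."""


def chaotic(failure_rate, rng=None):
    """Decorator that raises InjectedFailure on a fraction of calls.

    failure_rate: probability in [0.0, 1.0] that a call fails.
    rng: optional random.Random instance, useful for seeded tests.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0.0 and 1.0")
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # rng.random() is in [0.0, 1.0), so a rate of 0.0 never fails
            # and a rate of 1.0 always fails.
            if rng.random() < failure_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

In practice the failure rate would come from configuration so that injection can be switched off instantly, which ties in with the monitoring advice in the next practice.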

5. Monitor like crazy

When you are running a production test, keep your eye on key user performance metrics so that you know if the test is having any kind of unacceptable impact on the user experience. Be prepared to shut the test down if that’s the case.
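
One possible shape for the "be prepared to shut the test down" rule is a simple guard that compares recent latency with a baseline. The function name, the 20% regression threshold, and the minimum sample count are assumptions chosen for illustration; real thresholds would come from the project's own performance data.

```python
from statistics import median


def should_abort(baseline_ms, recent_ms, max_regression=0.20, min_samples=5):
    """Decide whether a production test should be stopped.

    baseline_ms: typical user-facing latency before the test started.
    recent_ms: latencies observed while the test is running.
    Returns True when the median of recent_ms exceeds the baseline by
    more than max_regression (0.20 means 20%).
    """
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    if len(recent_ms) < min_samples:
        # Too little data to judge; keep observing rather than abort.
        return False
    return median(recent_ms) > baseline_ms * (1.0 + max_regression)
```

The median is used rather than the mean so that a single slow request does not stop a test on its own.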

6. Create an “Opt-in” experience for experimental testing

A great way to test how your application performs with real users is to let some of them opt in to new feature releases. This will allow you to monitor and collect data from real users in real time and adjust your testing strategy accordingly, without as much concern about impacting their experience. After all – they’ve already agreed to become test subjects, so a little hiccup here and there won’t come as a surprise.
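
A minimal sketch of opt-in gating, assuming an in-memory record of which users agreed to which experimental feature. The `OptInRegistry` class and the feature names are hypothetical; a real implementation would persist this state and respect the project's privacy constraints.

```python
from collections import defaultdict


class OptInRegistry:
    """Tracks which users have opted in to which experimental features."""

    def __init__(self):
        self._members = defaultdict(set)  # feature name -> set of user ids

    def opt_in(self, user_id, feature):
        self._members[feature].add(user_id)

    def opt_out(self, user_id, feature):
        self._members[feature].discard(user_id)

    def is_enabled(self, user_id, feature):
        # .get avoids creating empty entries for unknown features.
        return user_id in self._members.get(feature, set())
```

Experimental code paths would then be wrapped in a check such as `registry.is_enabled(user_id, "new-export")`, so users who have not opted in always see the stable behaviour.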

Open Questions

Question

Answer

Date Answered

 

Out of Scope