Testing in Production (TiP)
Goals
Testing in production is an important core competency for finding and mitigating the risks that a live environment exposes.
This page provides context, use cases, and information on how to get involved.
Use Cases
Safe Paths - Crowd-Testing
Safe Paths - Location-Data
Safe Places - Machine-Data / APM
Epic / User Story
Target release | MVP |
---|---|
Epic / User Story | |
Document status | IN PROGRESS |
Document owner | @Jonathon Wright |
Designer | @Todd DeCapua |
Tech lead | @Eran Kinsbruner |
Technical writers | |
QA | @Diarmid Mackenzie |
Task / Work in Progress (WIP)
Task / Goal / WIP | Status |
---|---|
Use Case - Undiagnosed users | |
Use Case - Diagnosed users | |
Use Case - Contact tracer | |
Use Case - Health authority (HA) | |
Location Data - Diagnostics | |
Location Data - Lifecycle Management | |
Location Data - Boston Area (GPX) | |
Safe Places - Machine Data (APM) | |
Safe Paths - Crowdtesting (Mobile Labs) | |
Safe Paths - Crowdtesting (TestFlight) | |
Safe Paths - Crowdtesting (Google Play) | |
Provide Access to Local Mobile Devices | |
Provide Access to Remote Cloud Devices | |
Provide Access to TestFlight | |
Measurements / KPIs
How many actually complete set-up?
No Analytics > Conversions Event to track event cadence for the app 'onboarding process' during startup.
How many turn on location data?
No Analytics > Unable to capture telemetry / instrumentation meta data.
How many subscribe to an HA?
Splunk for Good > Track event cadence for the ‘subscribe to HA’ event.
How many open the app after they install it?
No Analytics > Retentions to track high-level ‘engagement’.
How many get an alert that they may be infected?
No Analytics > Push notifications would need to be implemented.
How many location data points does their app log?
Splunk for Good > Depending on the device this calculation can be made during the export function.
How many infection data points do they have from HAs they are subscribed to?
No Analytics > Unable to track this information from device (PII / GDPR)
How is performance affected by network location (NV)?
Splunk for Good > Quality > Performance: we can look at both Network Response Latency (NRL) and Device Performance (Duration Traces), as well as Custom Traces (CPU, Memory, Device Attributes) and Data Aggregation (Network / URL).
How can we A/B test or run a canary rollout?
Splunk for Good > Testers can be defined with a subset of user behaviors that can be flagged for rollout.
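As a rough illustration of how a canary or A/B cohort could be selected, the sketch below buckets users deterministically by hashing an identifier. It is not how Splunk for Good or Safe Paths does it; the function name `in_canary`, the feature label, and the idea of an anonymous per-install ID are all assumptions made for the example.

```python
import hashlib


def in_canary(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this (hypothetical) anonymous ID is in the rollout cohort.

    percent is a value from 0 to 100. The same ID always lands in the same
    bucket for a given feature, so a user does not flip between variants.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < percent * 100


if __name__ == "__main__":
    # Example: check one made-up install ID against a 10% rollout.
    print(in_canary("install-1234", "new-onboarding", 10))
```

Because the bucket depends only on the hash, raising the percentage keeps every existing canary user in the cohort and only adds new ones.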
Deliverables
Apple Build (iOS / IPA) - COVID Safe Paths
Check out the pre-requisites section for any dependencies.
Provide Access to Remote Cloud Devices
Check out https://pathcheck.atlassian.net/wiki/spaces/TEST/pages/14221509
https://www.youtube.com/watch?v=RIZpGNRM_4Y
Provide Access to Local Mobile Devices
Check out https://pathcheck.atlassian.net/wiki/spaces/TEST/pages/14287088
https://www.youtube.com/watch?v=is_xD68xHcs
Provide Access to Local Mobile Devices via Appium
https://www.youtube.com/watch?v=uXdfv-d78_A
Pre-requisites
Clone the Git repositories
Download the latest IPA / APK files
https://github.com/tripleblindmarket/covid-safe-paths/releases/tag/v0.9.4
Good Practices
1. Make layers – like a stack of pancakes
The idea of “testing in production” can actually mean different things. Are you testing a bunch of test servers from within your production data center? Or are your test applications running separately on top of your production platform? Or are you truly running live tests against 100% production-deployed code? The answer should be all of these. Layer your production testing to give you the ability to test different aspects of the production environment in different ways. Then match up your test cases so as to minimize the impact that your testing – and maintenance of the test environment – has on production users.
2. Time your tests when usage is light
Non-functional testing can have an impact on your entire user base if you let it. It can make a server environment sluggish, and that’s something no one wants. Study your analytics and determine the best time to schedule your tests. For example, look for the lowest levels of:
Number of users on the site
Resource-intensive processes within the environment
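One way to pick a test window from analytics is sketched below. It assumes you can export samples as `(hour_of_day, active_users, heavy_jobs)` tuples; that shape, and the function name `quietest_hour`, are hypothetical. Ranking by average users first and heavy jobs second is just one possible choice.

```python
from collections import defaultdict
from statistics import mean


def quietest_hour(samples):
    """Return the hour of day with the lowest average load.

    samples: iterable of (hour_of_day, active_users, heavy_jobs) tuples.
    Hours are compared by average active users, then by average number of
    resource-intensive jobs. Raises ValueError if samples is empty.
    """
    by_hour = defaultdict(list)
    for hour, users, jobs in samples:
        by_hour[hour].append((users, jobs))

    def load(hour):
        readings = by_hour[hour]
        return (mean(u for u, _ in readings), mean(j for _, j in readings))

    return min(by_hour, key=load)


if __name__ == "__main__":
    # Made-up readings: 02:00 and 03:00 have the same users, 02:00 has fewer jobs.
    data = [(2, 10, 0), (2, 20, 1), (3, 15, 5), (14, 500, 3)]
    print(quietest_hour(data))
```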
3. Collect real traffic data and replay it to your test systems
Make sure to use actual traffic data you have been collecting in production (such as user workflows, user behavior and resources) to drive the generation of load for test cases. That way when you exercise your tests within your production environment, you’ll have confidence that the simulated behavior is realistic.
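A minimal sketch of the replay idea follows. It only builds a schedule that preserves the recorded gaps between requests; actually sending the requests is left out. The `RecordedRequest` fields and the `speedup` parameter are assumptions for illustration, not a format used by this project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedRequest:
    timestamp: float  # seconds, as captured in production (hypothetical field)
    method: str
    path: str


def replay_schedule(recorded, speedup=1.0):
    """Return (delay_seconds, request) pairs, relative to the replay start.

    The original ordering and spacing are kept; speedup > 1 compresses the
    gaps to generate a heavier load from the same traffic.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    ordered = sorted(recorded, key=lambda r: r.timestamp)
    if not ordered:
        return []
    start = ordered[0].timestamp
    return [((r.timestamp - start) / speedup, r) for r in ordered]


if __name__ == "__main__":
    traffic = [
        RecordedRequest(100.0, "GET", "/health"),
        RecordedRequest(102.0, "POST", "/export"),
        RecordedRequest(101.0, "GET", "/settings"),
    ]
    for delay, request in replay_schedule(traffic, speedup=2.0):
        print(f"{delay:.1f}s {request.method} {request.path}")
```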
4. Introduce a chaos monkey
According to Netflix engineers Cory Bennett and Ariel Tseitlin, “The best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient.” Netflix built what’s called a Chaos Monkey into their production environment. This code actually introduces failures into the production environment randomly, forcing engineers to design recovery systems and develop a stronger, more adaptive platform. You can put your own chaos monkey in place because Netflix released their code to GitHub.
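To make the idea concrete, here is a small sketch of random failure injection. It is not Netflix's Chaos Monkey code; it only shows the selection step, and the instance names, the `protected` list, and `pick_victims` itself are hypothetical. Terminating or degrading the chosen instances would be done by whatever tooling your platform provides.

```python
import random


def pick_victims(instances, probability, protected=(), rng=None):
    """Randomly choose instances to fail during this chaos run.

    Each unprotected instance is selected independently with the given
    probability. Pass a seeded random.Random as rng for repeatable runs.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if rng is None:
        rng = random.Random()
    shielded = set(protected)
    return [
        name
        for name in instances
        if name not in shielded and rng.random() < probability
    ]


if __name__ == "__main__":
    fleet = ["api-1", "api-2", "worker-1", "db-primary"]
    # Never touch the primary database in this made-up example.
    print(pick_victims(fleet, 0.25, protected=["db-primary"]))
```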
5. Monitor like crazy
When you are running a production test, keep your eye on key user performance metrics so that you know if the test is having any kind of unacceptable impact on the user experience. Be prepared to shut the test down if that’s the case.
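The shut-down decision can be reduced to a simple guard, sketched below. It assumes you already collect user-facing latencies in milliseconds and know a baseline 95th percentile from normal operation; the function names, the nearest-rank percentile method, and the default 1.5x tolerance are illustrative choices, not project settings.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty collection of numbers."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one sample")
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_abort(latencies_ms, baseline_p95_ms, tolerance=1.5):
    """Return True when observed p95 latency exceeds baseline * tolerance."""
    return p95(latencies_ms) > baseline_p95_ms * tolerance


if __name__ == "__main__":
    observed = list(range(1, 101))  # made-up latencies of 1..100 ms
    if should_abort(observed, baseline_p95_ms=60):
        print("User impact too high: stop the production test")
```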
6. Create an “Opt-in” experience for experimental testing
A great way to test how your application performs with real users is to have some of them opt in to new feature releases. This will allow you to monitor and collect data from real users in real time and adjust your testing strategy accordingly, without as much concern about impacting their experience. After all – they’ve already agreed to become test subjects, so a little hiccup here and there won’t come as a surprise.
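An opt-in gate can be as simple as the sketch below: experimental features are only exposed to users who have explicitly agreed. The `User` type and the feature names are invented for the example and do not correspond to actual Safe Paths features.

```python
from dataclasses import dataclass

# Hypothetical feature names, for illustration only.
STABLE = frozenset({"location_log", "export"})
EXPERIMENTAL = frozenset({"new_exposure_screen"})


@dataclass
class User:
    user_id: str
    opted_in: bool = False  # set only after explicit consent


def features_for(user: User) -> frozenset:
    """Return the features this user should see."""
    return (STABLE | EXPERIMENTAL) if user.opted_in else STABLE


if __name__ == "__main__":
    print(sorted(features_for(User("a"))))
    print(sorted(features_for(User("b", opted_in=True))))
```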
Open Questions
Question | Answer | Date Answered |
---|---|---|
|