Problem context
A dashboard has been wrong for an unknown number of sprints.
Nobody is sure exactly when the drift started or which events are affected.
Decisions have already been made against the bad data, and people are nervous.
What breaks if this is not solved
- Trust. Even after the fix, the team distrusts every chart for a while.
- Time. Cleaning up affected historical metrics takes longer than the original implementation.
- Velocity. Future PRs slow down because nobody trusts the analytics flow.
When this playbook applies
- You have noticed at least one chart with values that do not match reality.
- You can run git history queries against the affected files.
- You have access to the raw event data in your analytics tool, not just dashboards.
System approach
Establish ground truth before chasing fixes. Pull the raw event data, not the dashboard's interpretation.
Bisect with the git history of the files that touch the affected events. The drift commit usually stands out once you know what you are looking for.
Document the failure mode that caused this specific drift so the next playbook step (prevention) addresses the actual root cause.
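The git-history step can be sketched in Python as below. This is a minimal sketch, not a prescribed tool: the function names `git_log_command` and `candidate_commits`, the five-day lookback, and the example path `src/analytics/track.ts` are all assumptions made for illustration. Note that `--since` and `--until` filter on commit date, which may differ from the day a PR was merged, so treat the output as a shortlist rather than a verdict.

```python
import subprocess
from datetime import date, timedelta


def git_log_command(paths, break_day, lookback_days=5):
    """Build a `git log` command for commits that touched `paths`
    from `lookback_days` before `break_day` through the end of `break_day`."""
    since = break_day - timedelta(days=lookback_days)
    until = break_day + timedelta(days=1)
    return [
        "git", "log",
        # Explicit midnight timestamps avoid ambiguity about time of day.
        f"--since={since.isoformat()} 00:00:00",
        f"--until={until.isoformat()} 00:00:00",
        "--date=short",
        "--format=%h %ad %an %s",
        "--", *paths,
    ]


def candidate_commits(paths, break_day, lookback_days=5, repo_dir="."):
    """Run the command inside a git checkout and return one line per commit."""
    cmd = git_log_command(paths, break_day, lookback_days)
    result = subprocess.run(
        cmd, cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


# Hypothetical file path and break day, for illustration only.
example_cmd = git_log_command(["src/analytics/track.ts"], date(2024, 3, 11))
print(" ".join(example_cmd))
```

Calling `candidate_commits` with the same arguments from inside the repository returns the shortlist of commits to read, newest first.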
Execution steps
- Pull raw event counts by day for the affected events from your analytics tool. Do not trust the dashboard.
- Identify the day the count broke (sharp drop, sharp rise, or sudden ratio shift).
- `git log` the file(s) that emit the affected event. Look at PRs merged in the few days before the break.
- Identify the offending PR. Read the diff. Classify the drift (removed, renamed, moved, altered payload, conditional change).
- Decide: roll forward (re-add the missing call) or roll backward (revert the PR). Rolling forward is usually right.
- Recover historical data where possible. If the event was renamed, UNION the old and new names in dashboard queries. If it was removed, the data for that window cannot be recovered; write a follow-up post explaining the gap.
- Install Skene now, against the current corrected state. The baseline you set today is the floor; you will not drift past it again silently.
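Two of the steps above, finding the break day and combining a renamed event, can be sketched as follows. This assumes you have already exported raw daily counts from your analytics tool. The function names, the 7-day window, and the 50% threshold are illustrative assumptions, not settings from any particular tool; tune them to how noisy your event normally is.

```python
from collections import Counter
from datetime import date
from statistics import median


def find_break_day(daily_counts, window=7, threshold=0.5):
    """Return the first day whose count differs from the median of the
    preceding `window` days by more than `threshold` (a fraction), or None.

    daily_counts: list of (date, count) tuples sorted by date.
    """
    for i in range(window, len(daily_counts)):
        day, count = daily_counts[i]
        baseline = median(c for _, c in daily_counts[i - window:i])
        if baseline == 0:
            continue  # relative change is undefined against a zero baseline
        if abs(count - baseline) / baseline > threshold:
            return day
    return None


def union_renamed(old_counts, new_counts):
    """Sum per-day counts for the old and new event names into one series."""
    merged = Counter(old_counts)
    merged.update(new_counts)  # Counter.update adds counts, it does not overwrite
    return dict(sorted(merged.items()))


# Made-up data: a steady event that drops sharply on 2024-03-11.
counts = [(date(2024, 3, d), 1000 + d) for d in range(1, 11)]
counts += [(date(2024, 3, 11), 310), (date(2024, 3, 12), 295)]
print(find_break_day(counts))
```

A median baseline is used instead of a mean so that a single odd day inside the window does not shift the comparison. Weekly seasonality can still trigger false positives, so check the flagged day against the raw chart before trusting it.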
Metrics to watch
Days from break to detection
Whatever number this incident shows you is your worst case so far. Once Skene is in CI, this should drop to less than a day.
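Recording this metric is a single date subtraction. A minimal sketch, where the function name `detection_lag_days` and the example dates are made up for illustration:

```python
from datetime import date


def detection_lag_days(break_day, detected_day):
    """Whole days between the day the count broke and the day someone noticed."""
    if detected_day < break_day:
        raise ValueError("detection cannot precede the break")
    return (detected_day - break_day).days


# Hypothetical incident: broke on 2024-03-11, noticed on 2024-04-22.
lag = detection_lag_days(date(2024, 3, 11), date(2024, 4, 22))
print(lag)  # 42
```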
Dashboards rebuilt vs. retired
Some affected dashboards are not worth rebuilding. Make explicit decisions on each.
Failure modes
- Fixing the bug without writing down what failure mode caused it. The same drift will recur in a different file.
- Patching individual dashboards instead of fixing the upstream event.
- Skipping the Skene install because you 'just fixed it'. The next refactor will rediscover this problem.
