Recovery playbook. When you discover the funnel has been wrong for weeks, here is the order of operations to get back to honest data.

Job-to-be-done: Recover from a known instrumentation drift incident. The dashboard is wrong; find out why, fix it, and stop it from happening again. · Updated 2026-05-22

Problem context

A dashboard has been wrong for some unknown number of sprints.

Nobody is sure exactly when the drift started or which events are affected.

Decisions have already been made against the bad data, and people are nervous.

What breaks if this is not solved

Trust. Even after the fix, the team distrusts every chart for a while.
Time. Cleaning up affected historical metrics takes longer than the original implementation.
Velocity. Future PRs slow down because nobody trusts the analytics flow.

When this playbook applies

You have noticed at least one chart with values that do not match reality.
You can run git history queries against the affected files.
You have access to the raw event data in your analytics tool, not just dashboards.

System approach

Establish ground truth before chasing fixes. Pull the raw event data, not the dashboard's interpretation.
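
As a rough illustration of working from raw counts, the sketch below looks for the first day whose count moves sharply against the same weekday one week earlier. It assumes you have exported daily counts for one event into a dict; the function name `find_break_day`, the 30% threshold, and the sample numbers are all hypothetical, and the weekly comparison is only one way to dampen weekday seasonality.

```python
from datetime import date, timedelta


def find_break_day(daily_counts, threshold=0.3):
    """Return the first day whose count differs from the same weekday one
    week earlier by at least `threshold` (relative change), or None."""
    for day in sorted(daily_counts):
        week_ago = day - timedelta(days=7)
        if week_ago not in daily_counts:
            continue  # no baseline to compare against yet
        baseline = daily_counts[week_ago]
        if baseline == 0:
            continue  # avoid dividing by zero; inspect such days by hand
        change = abs(daily_counts[day] - baseline) / baseline
        if change >= threshold:
            return day
    return None


# Hypothetical export: 1000 events per day, dropping to 400 on 2026-04-15.
counts = {date(2026, 4, 1) + timedelta(days=i): 1000 for i in range(14)}
counts.update({date(2026, 4, 15) + timedelta(days=i): 400 for i in range(7)})
print(find_break_day(counts))  # 2026-04-15
```

A flagged day is a candidate, not a verdict: launches, outages, and holidays produce the same shape, so confirm it against the commit history before acting.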

Bisect with the git history of the files that touch the affected events. The drift commit usually stands out once you know what you are looking for.
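
One way to narrow the search is to restrict `git log` to the emitting files and to a short window before the break day. The sketch below builds that command and, optionally, runs it. The helper names, the five-day lookback, and the file path in the usage line are hypothetical; it also assumes commit dates are close to deploy dates, which is not true for every release process.

```python
import subprocess
from datetime import date, timedelta


def git_log_command(paths, break_day, lookback_days=5):
    """Build a `git log` command limited to `paths` and to the window
    from `lookback_days` before the break through the end of the break day."""
    since = break_day - timedelta(days=lookback_days)
    until = break_day + timedelta(days=1)
    return [
        "git", "log",
        # Explicit midnight times keep git from substituting the current time of day.
        f"--since={since.isoformat()} 00:00:00",
        f"--until={until.isoformat()} 00:00:00",
        "--date=short",
        "--pretty=format:%h %ad %an %s",
        "--", *paths,
    ]


def commits_before_break(paths, break_day, lookback_days=5, repo_dir="."):
    """Run the command inside `repo_dir` and return one line per commit."""
    result = subprocess.run(
        git_log_command(paths, break_day, lookback_days),
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


# Hypothetical file path and break day, shown without running git.
print(" ".join(git_log_command(["src/analytics/checkout.ts"], date(2026, 4, 15))))
```

If the file was renamed during the window, run the same query against the old path as well, since the path filter only matches the names you give it.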

Document the failure mode that caused this specific drift so the next playbook step (prevention) addresses the actual root cause.

Execution steps

Pull raw event counts by day for the affected events from your analytics tool. Do not trust the dashboard.
Identify the day the count broke (sharp drop, sharp rise, or sudden ratio shift).
`git log` the file(s) that emit the affected event. Look at PRs merged in the few days before the break.
Identify the offending PR. Read the diff. Classify the drift (removed, renamed, moved, altered payload, conditional change).
Decide: roll forward (re-add the missing call) or roll backward (revert the PR). Rolling forward is usually right.
Recover historical data where possible: if the event was renamed, UNION the two names in dashboards. If it was removed, write a follow-up post explaining the gap.
Install Skene now, against the current corrected state. The baseline you set today is the floor; you will not drift past it again silently.
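
For the renamed-event case in the recovery step above, the combined series can be sanity-checked outside the dashboard before you change any chart definitions. This is a minimal sketch that assumes the raw export is a list of `(day, event_name, count)` rows; the function name and both event names are hypothetical.

```python
from collections import Counter
from datetime import date


def merge_renamed_event(rows, old_name, new_name):
    """Sum daily counts recorded under either name into one series."""
    merged = Counter()
    for day, name, count in rows:
        if name in (old_name, new_name):
            merged[day] += count
    return dict(merged)


# Hypothetical rows: the rename shipped partway through 2026-04-15.
rows = [
    (date(2026, 4, 14), "checkout_completed", 1000),
    (date(2026, 4, 15), "checkout_completed", 350),
    (date(2026, 4, 15), "order_completed", 640),
    (date(2026, 4, 16), "order_completed", 990),
    (date(2026, 4, 16), "page_viewed", 5000),  # unrelated event, ignored
]
series = merge_renamed_event(rows, "checkout_completed", "order_completed")
for day in sorted(series):
    print(day, series[day])
```

If the merged series is continuous across the rename day, the rename was the whole story; a remaining step change suggests the payload or the firing condition changed too.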

Metrics to watch

Days from break to detection. Whatever this incident shows you is your worst case. Once Skene is in CI, you bring this to less than a day.
Dashboards rebuilt vs. retired. Some affected dashboards are not worth rebuilding. Make an explicit decision on each.
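
The first metric is simple date arithmetic once the break day is known. A minimal sketch, with a hypothetical function name and hypothetical dates:

```python
from datetime import date


def days_to_detection(break_day, detected_day):
    """Whole days between the day the event broke and the day it was noticed."""
    return (detected_day - break_day).days


# Hypothetical incident: broke on 2026-04-15, noticed on 2026-05-20.
print(days_to_detection(date(2026, 4, 15), date(2026, 5, 20)))  # 35
```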

Failure modes

Fixing the bug without writing down what failure mode caused it. The same drift will recur in a different file.
Patching individual dashboards instead of fixing the upstream event.
Skipping the Skene install because you 'just fixed it'. The next refactor will rediscover this problem.

Glossary

instrumentation-drift
renamed-event

Adjacent playbooks

Validate analytics in CI as part of code review
