Database outage

Incident Report for NTP Pool System Status

Postmortem

I moved the production database to a new version of MySQL earlier in the week (using the same instance the beta system has been using since sometime last year). There were some minor hiccups in the process, but I left it in what I thought was a stable happy place.

Very early this morning, California time, the database cluster went into read-only mode. When I woke up and saw it, I couldn't get it back in sync. I decided the best course of action was to clear the cluster and restore the most recent backup (from a few hours prior).

I deleted the cluster and pointed to the backup, but the (open source) tool to restore the backup had a bug that made the restore fail. It took me a little while to learn enough about the tool to add debugging, deploy a custom build, and then fix the bug. Ooof.

The database was down for almost 6 hours by the time it was up and working again.

This was a major database upgrade, and it is now complete, so hopefully there won't be anything like it again until maybe converting to Postgres at some point in the future.

Posted Jul 03, 2025 - 23:22 UTC

Resolved

This incident has been resolved.

Posted Jul 03, 2025 - 23:22 UTC

Monitoring

Database has been restored; monitoring the performance.

Posted Jul 03, 2025 - 18:07 UTC

Identified

The MySQL cluster is being reset and a backup from a couple of hours ago is being restored.

Posted Jul 03, 2025 - 14:16 UTC

Investigating

A couple of days ago I upgraded the older MySQL cluster to a newer one. It worked fine and then ... didn't.

The DNS and NTP services continue to operate, but the management website and monitoring are having a complete outage.

Posted Jul 03, 2025 - 14:15 UTC

This incident affected: Management Portal, Public website, DNS updates, GeoDNS servers, Global NTP Service, and Monitoring System.