manage.ntppool.org outage

Incident Report for NTP Pool System Status

Resolved

Everything has been stable since we downgraded `runc`. We think the cause was this issue: https://github.com/opencontainers/runc/issues/2865

Posted May 09, 2021 - 23:10 UTC

Monitoring

The system should be stable again. It took a while (to put it mildly), but we appear to have tracked the trouble down to a recent `runc` upgrade that'd sometimes get stuck when starting containers.

The trouble we had with Ceph was just a side effect of this. In the process of debugging all this, the Ceph configuration has been made much more durable. 🤞🏻

Posted May 09, 2021 - 21:31 UTC

Identified

The system has been up and down through the day. We continue to have trouble with the Ceph system locking up. :-/

Posted May 09, 2021 - 05:25 UTC

Investigating

The Ceph storage system hung this morning; the system is slowly recovering (hopefully -- it is still a bit choppy).

Monitoring isn't working, but all DNS services and the NTP service are unaffected or only minimally impacted for now.

Posted May 08, 2021 - 15:00 UTC

This incident affected: Management Portal, Public website, and DNS updates.