The system should be stable again. It took a while (to put it mildly), but we appear to have tracked the trouble down to a recent `runc` upgrade that'd sometimes get stuck when starting containers.
The trouble we had with Ceph was just a side effect of this. In the process of debugging all this, the Ceph configuration has been made much more durable. 🤞🏻
Posted May 09, 2021 - 21:31 UTC
Identified
The system has been up and down through the day. We continue to have trouble with the Ceph system locking up. :-/
Posted May 09, 2021 - 05:25 UTC
Investigating
The Ceph storage system hung this morning; the system is slowly recovering (hopefully -- it is still a bit choppy).
Monitoring isn't working, but all DNS services and the NTP service are unaffected or only minimally impacted for now.
Posted May 08, 2021 - 15:00 UTC
This incident affected: Management Portal, Public website, and DNS updates.