The system should be stable again. It took a while (to put it mildly), but we appear to have tracked the trouble down to a recent `runc` upgrade that'd sometimes get stuck when starting containers.
The trouble we had with Ceph was just a side effect of this. In the process of debugging all this, the Ceph configuration has been made much more durable. 🤞🏻
May 9, 21:31 UTC
The system has been up and down through the day. We continue to have trouble with the Ceph system locking up. :-/
May 9, 05:25 UTC
The Ceph storage system hung this morning; the system is slowly recovering (hopefully -- it is still a bit choppy).
Monitoring isn't working, but all DNS services and the NTP service are unaffected or only minimally impacted for now.
May 8, 15:00 UTC