Saturday, May 02
Lost A Day
Oops. Looks like the server died just before the daily backups, not just after. (I pasted yesterday's post in from the version I cross-posted over on Ace's blog.)
If you didn't see the notice I put up while the server was dead, well, the server died. It's up and running but all the LXC containers where the actual work happens are completely frozen, and I was worried that if I touched anything it would just get worse, so I grabbed all the backups and moved them to the new server I already had set up for that purpose.
Took about twelve hours from the old server failing to the new one being operational, but the move I've been planning for months finally happened, so there's that.
If you're missing anything major or having any other problems, please comment here.
Update: Found the problem. Disk errors threw the ZFS pool on the second SSD (where the containers lived) into "faulted" state, so the server was responding but the load average was around 600 because anything in those containers that tried to write to disk was hanging indefinitely.
I've recovered it (which was easy) but it's still warning about data corruption. Backups were intact because they were in a partition on the boot SSD - the idea being that a disk failure of either one would leave us with intact data. I also have offsite backups but they weren't as up to date.
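For anyone hitting the same thing, the diagnosis and recovery described above roughly corresponds to the standard `zpool` workflow. This is a sketch, not the exact commands used here; `tank` is a placeholder pool name.

```shell
# Show pool health; a pool knocked into FAULTED state by disk errors
# will list the failing device and any files with detected corruption.
zpool status -v tank

# Clear the error counters so the pool can resume I/O
# (this is the "easy" recovery step, not a repair of bad data).
zpool clear tank

# Scrub to re-verify all checksums and surface remaining corruption.
zpool scrub tank
zpool status tank   # re-run to watch scrub progress and error counts
```

Note that `zpool clear` only resets the error state; if the scrub still reports corrupted files afterwards, those need to come from backup.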
Since we're already on the new server, I'll take a final set of backups and then clear and cancel the old server.
The new server has a single SSD, but it's in a cluster and backups are synced daily to a storage server with RAID-Z3, so we'd have to lose the main server and four drives on the backup server before we lost data. So we're fine unless there's a datacenter fire.
Another datacenter fire. We survived the last one but that server was down for three weeks while they cleaned up.
Outage Message
Sorry, our server decided to stop serving. We have full backups from just a few hours ago and are restoring them on a new server.
Hold tight, it's just a tiny bit fiddly.
Update 1:43 AM UTC: Application container has been restored - there was an issue with the backup file and I needed to hand-edit it, but that's done and it all worked.
Database is restoring and that will likely take a couple more hours.
Update 3:54 UTC: Database container restored. Configuring things now.
Also, the database needs some repairs since it was a live snapshot.
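Restoring from a live snapshot leaves the database in roughly the state it would be in after a power cut, so it needs a consistency pass. Assuming a MySQL/MariaDB backend (not confirmed in the post), the usual check-and-repair sketch looks like this; credentials are placeholders.

```shell
# Check every database for tables left inconsistent by the live snapshot.
mysqlcheck --all-databases --check -u root -p

# Repair any tables flagged as corrupt. This works for MyISAM tables;
# InnoDB instead runs its own crash recovery automatically at startup
# and cannot be repaired through mysqlcheck.
mysqlcheck --all-databases --repair -u root -p
```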
-- Pixy
Posted by: Pixy Misa at
06:25 PM
| Comments (3)
| Add Comment
| Trackbacks (Suck)
Post contains 437 words, total size 3 kb.
51kb generated in CPU 0.0587, elapsed 0.1642 seconds.
58 queries taking 0.1448 seconds, 367 records returned.
Powered by Minx 1.1.6c-pink.