Following some link, I just arrived at LiveJournal, and all I got was this message: LiveJournal is currently down due to a massive power failure at our data center. We'll provide updates at /powerloss/ as they're available.
Sometimes life sucks big time for a netadmin. I sympathize with the LJ guys who are now working their asses off to get LJ back online. Here's their progress report:
Our data center (Internap, the same one we've been at for many years) lost all its power, including redundant backup power, for some unknown reason (unknown to us, at least). We're currently dealing with verifying the correct operation of our 100+ servers. Not fun. We're not happy about this. Sorry... :-/ More details later.
Update #1, 7:35 pm PST: We have power again, and we're working to assess the state of the databases. The worst thing we could do right now is rush the site up in an unreliable state. We're checking all the hardware and data, making sure everything's consistent. Where it's not, we'll be restoring from recent backups and replaying all the changes since that time, to get back to the current point in time, but in good shape. We'll be providing more technical details later, for those curious, on the power failure (when we learn more), the database details, and the recovery process. For now, please be patient. We'll be working all weekend on this if we have to.
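What they describe is classic MySQL point-in-time recovery: restore the last consistent full backup, then feed the binary logs recorded since that backup back into the server. Here's a minimal sketch of the idea, assuming the stock MySQL tools (mysqlbinlog and the mysql command-line client); the file paths and the timestamp are hypothetical, not anything LJ has published:

    import subprocess

    # Hypothetical paths; in practice these come from your backup catalog
    # and the binlog index on the database server.
    BACKUP_DUMP = "/backups/lj-2005-01-14.sql"
    BINLOGS = ["/var/lib/mysql/binlog.000042", "/var/lib/mysql/binlog.000043"]

    # 1. Restore the most recent consistent full backup.
    with open(BACKUP_DUMP, "rb") as dump:
        subprocess.run(["mysql"], stdin=dump, check=True)

    # 2. Replay every change logged since that backup was taken.
    #    mysqlbinlog decodes the binary log back into SQL statements;
    #    --start-datetime skips entries already contained in the backup.
    for binlog in BINLOGS:
        decoded = subprocess.run(
            ["mysqlbinlog", "--start-datetime=2005-01-14 04:00:00", binlog],
            capture_output=True, check=True,
        )
        subprocess.run(["mysql"], input=decoded.stdout, check=True)

In a real recovery you'd use the exact log position recorded at backup time rather than a wall-clock timestamp, but the shape is the same: restore, then replay. With 100+ servers, it's easy to see why this eats a whole weekend.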
Update #2, 10:11 pm: So far so good. Things are checking out, but we're being paranoid. A few annoying issues, but nothing that's not fixable. We're going to be buying a bunch of rack-mount UPS units on Monday so this doesn't happen again. In the past we've always trusted Internap's insanely redundant power and UPS systems, but now that this has happened to us twice, we realize the first time wasn't a total freak coincidence. C'est la vie.
Update #3, 2:42 am: We're starting to get tired, but at least all the hard stuff is done. Unfortunately a couple of machines had lying hardware that didn't commit to disk when asked, so InnoDB's durability wasn't so durable (through no fault of InnoDB). We restored those machines from a recent backup and are replaying the binlogs (database changes) from the point of backup to the present. That will take a couple of hours to run. We'll also be replacing that hardware very shortly, or at least seeing if we can find/fix the reason it misbehaved. The four of us have been at this for almost 12 hours, so we're going to take a bit of a break while the binlogs replay... Again, our apologies for the downtime. This has definitely been an experience.
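That "lying hardware" bit is worth dwelling on: many drives and RAID controllers acknowledge a flush while the data is still sitting in a volatile write cache, so a power cut loses writes the database believed were durable. A rough way to spot this (purely my sketch, not anything the LJ folks have published) is to time fsync() calls: a disk that really commits each write to a spinning platter can't complete many more fsyncs per second than its rotational speed allows, so implausibly high rates suggest something in the stack is lying:

    import os
    import time

    # Hypothetical test file; run this on the filesystem you want to check.
    PATH = "/tmp/fsync_probe.dat"
    ITERATIONS = 200

    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o600)
    block = b"\0" * 4096

    start = time.time()
    for _ in range(ITERATIONS):
        os.pwrite(fd, block, 0)   # rewrite the same 4 KiB block
        os.fsync(fd)              # ask the OS (and the disk) to make it durable
    elapsed = time.time() - start
    os.close(fd)

    print(f"{ITERATIONS / elapsed:.0f} fsyncs/sec")
    # A 7200 rpm disk completes ~120 revolutions per second, so sustained
    # rates far above that (thousands per second) mean a write cache is
    # acknowledging flushes before the data actually hits stable storage.

The legitimate exception is a battery-backed write cache, which can safely acknowledge early because its contents survive a power loss. Plain drive caches without a battery are exactly how you end up restoring from backups at 2:42 am.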