So, #HackerNews went down twice within 24 hours: first the main server's drive failed, then a few hours later the failover server's drive failed as well.
This appears to have been due to a SanDisk SSD issue which causes drives to fail at precisely 40,000 hours of uptime.
The issue had been submitted multiple times to HN, though those submissions never made it out of the queue.
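For a sense of scale, here's a rough sketch of the arithmetic behind a fixed 40,000-hour cutoff. The in-service dates below are made up purely for illustration (I don't know when the real drives were installed), and the helper names are hypothetical. It just shows that two drives powered on a few hours apart would hit the limit a few hours apart:

```python
from datetime import datetime, timedelta

FAILURE_HOURS = 40_000  # uptime threshold mentioned in the post above


def hours_remaining(power_on_hours: int, threshold: int = FAILURE_HOURS) -> int:
    """Hours left before a drive reaches the threshold (0 if already past it)."""
    return max(threshold - power_on_hours, 0)


def projected_failure(in_service: datetime, threshold: int = FAILURE_HOURS) -> datetime:
    """Date a continuously powered drive would reach the threshold."""
    return in_service + timedelta(hours=threshold)


# 40,000 hours expressed in years of continuous uptime.
years = FAILURE_HOURS / (24 * 365.25)
print(f"{FAILURE_HOURS} hours is about {years:.2f} years")

# Hypothetical in-service times, six hours apart (NOT the real HN dates).
primary = projected_failure(datetime(2018, 1, 1, 9, 0))
failover = projected_failure(datetime(2018, 1, 1, 15, 0))
print("gap between projected failures:", failover - primary)
```

In practice you'd feed `hours_remaining` the power-on hours figure your drive reports via SMART, assuming that counter is what the cutoff actually tracks.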
@dredmorbius oh wow, that's really interesting. Speculation seems to center on some kind of floating-point glitch in the firmware related to timing. That's also what I would expect.
@dredmorbius We've been spending quite a bit of time on this and similar problems, mostly last year I think? Lots of vendors had problematic SSDs in their systems. Here's the notice for Cisco UCS hardware, for example: https://www.cisco.com/c/en/us/support/docs/field-notices/705/fn70545.html
It went up and down through the tech press and vendor support channels - you'd think that someone who runs their own hardware in production would have had at least some of that on their radar...
@dredmorbius If I remember correctly, it's peculiar to the chipset that drives the SSD, which SanDisk used across nearly their whole product range until they changed it when this issue arose.