Three of our boxes went down with memory errors within a span of 3 days, two of them within 2 hours of each other. All the boxes showed errors like:

ECC single bit correction warning rate exceeded, ECC single bit correction failure rate exceeded.

which is pretty self-explanatory. My question is: is it just random luck that they all had issues within a few days, or could something environmental be causing it? On reboot, one box is hanging on

Configuring memory ...Done.

The other two boxes came up after a reboot. I want to be scientific about the issue. If there is a bad DIMM, should a stress test show the issue, or can the issue creep up randomly?

I am running some basic tests and so far everything looks clean. Shouldn't a stress test reproduce the issue?

Update: I tested with memtest+ and it came back clean.

If several machines fail at the same time (or report significantly increased error rates), it's either a vast coincidence or something they share: bad power, heat, or radiation.

You'll want to check the power and temperatures, and locate the errors (which DIMM or slot is reporting them). Then swap the DIMMs around a bit and check whether the errors move along with them.
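To keep the swap test honest, it can help to write down the corrected-error counts per slot before and after moving the modules, and compare them mechanically. Below is a minimal sketch of that bookkeeping. The slot names, DIMM labels, counts, and the `diagnose` function are all made up for illustration; the counts would have to come from whatever your hardware logs report.

```python
def diagnose(before, after, threshold=0):
    """Compare per-slot corrected-error counts before and after a DIMM swap.

    before/after map slot name -> (dimm_label, corrected_error_count).
    Returns a verdict per DIMM that showed errors before the swap.
    All names and thresholds here are illustrative assumptions.
    """
    verdicts = {}
    for slot, (dimm, count) in before.items():
        if count <= threshold:
            continue  # this DIMM was clean before the swap
        # Find the slot this DIMM sits in after the swap.
        new_slot = next((s for s, (d, _) in after.items() if d == dimm), None)
        if new_slot == slot:
            verdicts[dimm] = "not moved"
            continue
        dimm_bad = new_slot is not None and after[new_slot][1] > threshold
        slot_bad = slot in after and after[slot][1] > threshold
        if dimm_bad and not slot_bad:
            verdicts[dimm] = "errors follow the DIMM"
        elif slot_bad and not dimm_bad:
            verdicts[dimm] = "errors stay with the slot"
        elif dimm_bad and slot_bad:
            verdicts[dimm] = "errors in both places"
        else:
            verdicts[dimm] = "errors gone"
    return verdicts


# Hypothetical counts: dimm-x reported errors in slot A1, then was moved to A2.
before = {"A1": ("dimm-x", 120), "A2": ("dimm-y", 0)}
after = {"A1": ("dimm-y", 0), "A2": ("dimm-x", 95)}
print(diagnose(before, after))
```

If the errors follow the DIMM, the module is the likely culprit. If they stay with the slot, suspect the board or the memory channel. If they vanish or show up everywhere, that points back toward something environmental such as power or heat.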

