Wednesday, 17 November 2010

HP DL380 G7 Memory failing temperature tests

I thought I'd write about the problems I've been having with two HP DL380 G7 servers because I can't find the problem documented anywhere else.

We purchased two DL380 G7s with 48GB RAM. Unfortunately both arrived with a DIMM reporting an issue, so we had those replaced. After testing them thoroughly we thought they were OK and shipped them to their respective datacentres.

A few weeks later we found that one of the servers kept rebooting itself. The Windows event log contained some entries from WHEA_Logger saying a correctable error had been encountered.

We narrowed this down to memory and ran the offline diagnostic tests from the SmartStart CD. We first ran the complete test, which passed, and then all the memory tests, which also passed. Then we ran the custom test and found two additional tests that weren't included in the complete test.

When we ran these tests, two DIMMs reported errors: one on the temperature test and one on the SPD test (I need to confirm the exact name of this test).

Both DIMMs were replaced and the SPD error disappeared, but the temperature test still failed with:

Error DIMM Temperature out of range.
Device DIMM = PROC 1 DIMM 2D
expected temperature range 20 - 90
actual temperature 00 -
Bus:Address 08:32 Ran on CPU 0
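
For anyone collecting these reports across several test runs, here is a minimal sketch of pulling the fields out of the text above. It assumes the report always has the layout shown here; the function name `parse_report` and the dictionary keys are my own and are not part of any HP tool.

```python
import re

# The report text exactly as shown by the diagnostics in this post.
REPORT = """\
Error DIMM Temperature out of range.
Device DIMM = PROC 1 DIMM 2D
expected temperature range 20 - 90
actual temperature 00 -
Bus:Address 08:32 Ran on CPU 0
"""


def parse_report(text):
    """Extract the device, expected range and actual reading (hypothetical helper)."""
    device = re.search(r"Device DIMM = (.+)", text).group(1).strip()
    low, high = map(
        int, re.search(r"expected temperature range (\d+) - (\d+)", text).groups()
    )
    actual = int(re.search(r"actual temperature (\d+)", text).group(1))
    return {
        "device": device,
        "low": low,
        "high": high,
        "actual": actual,
        "in_range": low <= actual <= high,
    }


result = parse_report(REPORT)
print(result["device"], result["actual"], result["in_range"])
```

The reading of 00 sits below the expected 20 - 90 range, which is why the test fails. Whether a zero means the sensor returned nothing at all is a guess on my part.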

We had the DIMM replaced again but the problem remained. Then the DIMMs were swapped around and the problem moved to DIMM 3A.

Then we had 3A replaced and it moved to DIMM 5? That's when we asked for all the DIMMs to be replaced. The server was finally sorted.

Then on running the same offline test on the other server we had the same problem with the same slot.

Monday they sent out an engineer to replace 1 DIMM. Same problem - same slot.
Tuesday they replaced 2 DIMMs. Same problem.
Wednesday they replaced the motherboard. Still the same problem.
Based on the previous server, I believe that replacing all the DIMMs will fix it. But HP being HP, they are reluctant to do that.

What I believe is happening is that either the memory is faulty or the motherboard is causing it to become faulty. In the second case we have a bad motherboard and potentially 12 faulty DIMMs: they replace the DIMMs, but the new ones become faulty because the motherboard is doing something to them; they replace the motherboard, but we still have 12 faulty DIMMs.

This sounds unlikely, so my other suggestion is that the test only reports the first occupied DIMM slot that has a fault. Maybe the engineer didn't swap the right DIMM (3 times? I'm not sure about this either), but based on the last experience, once one DIMM is 'fixed' the next DIMM slot in sequence reports the problem, until all the faulty DIMMs have been replaced.
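
The "test only reports the first faulty slot" idea can be sketched in a few lines. This is purely a model of my guess, not how the HP diagnostics are known to work: the slot names, the scan order and both function names are hypothetical.

```python
def first_reported_fault(scan_order, faulty):
    """Return the first slot in scan order holding a faulty DIMM, or None."""
    for slot in scan_order:
        if slot in faulty:
            return slot
    return None


def replace_one_at_a_time(scan_order, faulty):
    """Replace only the reported DIMM on each visit; return the slots reported, in order."""
    remaining = set(faulty)
    reported = []
    while True:
        slot = first_reported_fault(scan_order, remaining)
        if slot is None:
            return reported
        reported.append(slot)
        remaining.discard(slot)  # the engineer swaps just this one DIMM


# Hypothetical scan order and fault set, not taken from HP documentation.
SCAN_ORDER = ["slot1", "slot2", "slot3", "slot4"]
print(replace_one_at_a_time(SCAN_ORDER, {"slot2", "slot3", "slot4"}))
```

With three faulty DIMMs this prints `['slot2', 'slot3', 'slot4']`: each visit clears one slot and the fault appears to move to the next one, which matches what we saw on the first server. It also shows why replacing every DIMM in one go is the quickest way out if this guess is right.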

If this helps someone then it won't have been a waste of time writing this. If you are affected by it, good luck; you're going to need it.

Update - 19/11/10
The saga continues. I have been chasing HP for some action, but nothing has happened for nearly 48 hours. The server has been down for over 7 days and it is supposed to be covered by 24x7x4 support. This is a joke. We have to seriously question whether we purchase any HP kit from now on.

Very late update
HP replaced all the memory in both servers, which were purchased at the same time. A few weeks later they advised that the memory was counterfeit and that they were trying to find out the source. We gave them all the details of the company we purchased the memory from but didn't hear anything further.


