So, I have recently been involved in a couple of cases regarding power supplies. Back in October I was asked to come to a site during a maintenance window to see about fixing a problem that would not seem to go away.
This first case had the following symptoms:
- The IOM3-B module appeared quasi-online. It was there, but not quite.
- Firmware updates did not work. Resetting/re-seating did not do much.
- The DS4246 shelf would not allow the shelf ID to be set.
- I am sure there were other undiagnosed issues, but these were the most obvious.
NetApp was baffled. I asked for and received a whole new shelf, two Power Supply modules and two IOM3 modules to basically have everything on hand to fix whatever the problem could be. This had been festering for a few weeks. The customer and NetApp Support simply wanted this fixed.
During our outage, the first thing we did was eliminate the shelf. We moved all disks, Power Supplies and IOMs over to the new shelf and powered it on. The Shelf ID LED would not come on... at all. Mmm? Ok. Swap the IOM3s for the new ones. Still nothing! Swap the Power Supplies. Ah HA! The Shelf ID light came on.
To isolate the fault further, we shuffled the Power Supplies around and found that a single bad Power Supply was causing significant problems. When it was in *any* shelf, problems followed. When we removed it, the problems disappeared.
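The swap-and-observe reasoning above can be written down as a small sketch: keep whatever component appears in every failing configuration and in no healthy one. The component labels and the list of trials here are hypothetical stand-ins for what we tried on site, not output from any NetApp tool.

```python
def suspects(trials):
    """Return components present in every failing trial and absent from every healthy one.

    trials: list of (components, failed) pairs, where components is a set of labels.
    """
    failing = [set(parts) for parts, failed in trials if failed]
    healthy = [set(parts) for parts, failed in trials if not failed]
    if not failing:
        return set()
    # A culprit must have been installed every time the symptom showed up.
    common = set.intersection(*failing)
    # Anything that was also installed in a healthy configuration is cleared.
    for parts in healthy:
        common -= parts
    return common


# Hypothetical labels for the parts that were swapped around.
trials = [
    ({"shelf-old", "iom-old-a", "iom-old-b", "psu-1", "psu-2"}, True),
    ({"shelf-new", "iom-old-a", "iom-old-b", "psu-1", "psu-2"}, True),
    ({"shelf-new", "iom-new-a", "iom-new-b", "psu-1", "psu-2"}, True),
    ({"shelf-new", "iom-new-a", "iom-new-b", "psu-new-1", "psu-new-2"}, False),
    ({"shelf-new", "iom-new-a", "iom-new-b", "psu-1", "psu-new-2"}, True),
    ({"shelf-new", "iom-new-a", "iom-new-b", "psu-2", "psu-new-2"}, False),
]

print(suspects(trials))
```

This only works cleanly when there is a single bad part and the symptom is consistent. With intermittent faults or two bad parts at once, the intersection can come back empty.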
After looking at older ASUPs, it is likely we could have deduced a bad power supply earlier, but the details were in a less commonly used section of the environment output.
This second case had the following symptoms:
- Upon performing A-side / B-side power testing, both power supplies were now reported as unknown by the NetApp environment command!
- Some / most of the drives powered down
- After power-cycling the shelf (both power supplies), NONE of the drives would power up!
Here we tried a few things: power-cycling a few times and resetting the IOM6 modules. In the end, we removed ONE power supply (PSU #4, lower right when looking at the back of the shelf). As soon as that ONE power supply was removed, the drives started powering on.
This was very odd. Fortunately for me, after I got this rectified and that power supply replaced, my NetApp case owner just happened to be an Electrical Engineer! He was able to dive into the many AutoSupport (ASUP) messages and determine that power supply #1 in the same shelf was also on the fritz and should be replaced as well.
He deduced that the voltages and amperages were not quite right and strongly recommended replacing power supply #1, which we did.
Never discount the power supplies. Also, be careful when you pull them out if you suspect them. In the second case, we did the A-side test and everything appeared OK when power was restored. Everything went nuts only after the B-side test, so I figured that was the place to start. In hindsight, I would also use the environment commands to verify amperage and voltage, among other items, before pulling a power supply.
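As a rough illustration of that last check, here is a minimal sketch that flags readings drifting too far from their nominal value. Everything in it is an assumption for the sake of the example: the rail names, the sample readings and the 5% tolerance are made up, and you would copy the real numbers by hand from whatever your environment output reports.

```python
def out_of_tolerance(readings, nominal, tolerance=0.05):
    """Return {rail: percent deviation} for readings outside the allowed tolerance.

    readings and nominal map a rail name to a value in the same unit.
    tolerance is a fraction of nominal (0.05 means 5%), an assumed threshold.
    """
    flagged = {}
    for rail, value in readings.items():
        expected = nominal[rail]
        deviation = abs(value - expected) / expected
        if deviation > tolerance:
            flagged[rail] = round(deviation * 100, 1)
    return flagged


# Hypothetical rails and readings, typed in by hand for illustration.
nominal = {"12V": 12.0, "5V": 5.0}
psu1 = {"12V": 12.1, "5V": 4.6}

print(out_of_tolerance(psu1, nominal))
```

Comparing the two power supplies in the same shelf against each other is just as useful as comparing them to nominal values, since a healthy pair should report similar numbers.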