February | 2018 | Outbrain Techblog

February 22, 2018

Gerardo Laracuente

How Well Do You Know Your Servers?

It’s a sunny day in Secaucus, New Jersey. The Outbrain Facilities team heads home on a high note. They’ve just racked, stacked, and inventoried over 100 new servers for the Delivery team. As they go to sleep later that night, the Delivery team begins their day in Netanya, Israel. They provision these servers for their new Kubernetes cluster, which the Developers are eager to migrate their services to.

Fast forward… microservices are migrated from bare metal to Docker containers running on our shiny new fleet of Kubernetes servers. The developers notice performance issues, but the behavior is not consistent, so they reach out to the engineers in Delivery. Delivery narrows it down to a few problematic servers that are running at ~50% performance. They all run the same services as the well-performing servers. Their CPU throttling configurations are correctly aligned. They all run the exact same Chef recipes. With all other options exhausted, they turn to the hardware.

As they dig deeper, they realize that the problematic hosts don’t stand out at all. All servers in the cluster are the same model, have the same specs, and look 100% healthy by every metric that we collect. They are the latest and greatest cloud platform servers from Dell: the PowerEdge C6320. With not much left to go on, they finally find a single difference: the IPMI system event logs show that all of the problematic hosts are running on only 1 power supply (instead of 2 for redundancy).

#ipmitool sel elist
…
8 | 01/11/2018 | 16:20:52 | Power Supply PSU 2 Status | Power Supply AC lost | Asserted
…
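
Checking this by hand does not scale to a whole fleet, so here is a minimal Python sketch of how the event log text could be scanned for this condition. The function name `psus_without_ac` is hypothetical, the six-column layout is taken from the sample line above, and the sketch assumes that a restored supply shows up as a later entry for the same sensor with a state of "Deasserted".

```python
def psus_without_ac(sel_text):
    """Return sensors whose latest 'Power Supply AC lost' entry is still asserted."""
    lost = {}
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 6:
            continue  # skip "…" and anything that is not a 6-column SEL record
        _id, _date, _time, sensor, event, state = fields
        if event == "Power Supply AC lost":
            # Later entries overwrite earlier ones, so the last state wins.
            # Assumption: a recovered PSU logs the same event as "Deasserted".
            lost[sensor] = (state == "Asserted")
    return sorted(sensor for sensor, is_lost in lost.items() if is_lost)


sample = "8 | 01/11/2018 | 16:20:52 | Power Supply PSU 2 Status | Power Supply AC lost | Asserted"
print(psus_without_ac(sample))  # ['Power Supply PSU 2 Status']
```

In practice the text would come from running `ipmitool sel elist` on each host; that step is left out here so the sketch stays runnable without IPMI access.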

Down the Stack, We Go

Enter Facilities team again: a 2-man team managing thousands of servers across 3 data centers. Of all the issues that could arise, these brand new servers were the least of our worries. It’s a not-so-sunny day in New Jersey when we wake up to an email listing a handful of servers running on one power supply. Luckily, the hardware was fine and they were all just cases of a loose power cable.

Redundancy is fundamental when building out any production environment. A server has 2 power supplies for these exact situations. If one power strip (PDU) goes down, if a power supply fails, or a power cable is knocked loose, a server will still be up and running. So from the Operating System’s point of view, running on one power supply should not matter. But lo and behold, after plugging in those loose cables, the performance immediately goes back to 100% on all of those servers.

We contact Dell Support, which assumes it’s the CPU Frequency Scaling Governor and asks us to change the power management profile to Performance. In a nutshell, the Operating System has control of CPU throttling, and they want us to grant that control to the BIOS to rule out OS issues (Dell does not support Ubuntu 16.04, which is what we currently use).

After making these changes, the issue persists:

The command used in every test is: stress-ng --all 0 --class cpu --timeout 5m --metrics-brief

Power Profile: PerfPerWatt(OS)
PSUs: 2
Result:
stress-ng: info:  [152332] stressor      bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [152332] cpu              48697    300.19    320.79      0.06       162.22       151.77

Power Profile: PerfPerWatt(OS)
PSUs: 1
Result:
stress-ng: info:  [152332] stressor      bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [160954] cpu              21954    300.49    315.72      0.28        73.06        69.47


Power Profile: Performance
PSUs: 2
Result:
stress-ng: info:  [152332] stressor      bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [3687] cpu              48182    300.18    319.02      0.09       160.51       150.99

Power Profile: Performance
PSUs: 1
Result:
stress-ng: info:  [152332] stressor      bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [11623] cpu              21372    300.45    306.56      0.14        71.13        69.68
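
As a rough way to quantify the gap between these runs, the last column of each result line (bogo ops/s) can be compared for one versus two PSUs. The sketch below does that in Python using the four result lines above verbatim; the helper names `bogo_ops_per_sec` and `one_psu_ratio` are hypothetical and not part of stress-ng.

```python
def bogo_ops_per_sec(metrics_line):
    """Return the last column (bogo ops/s) of a stress-ng result line."""
    return float(metrics_line.split()[-1])


# Result lines copied from the four runs above, keyed by (power profile, PSU count).
runs = {
    ("PerfPerWatt(OS)", 2): "stress-ng: info:  [152332] cpu              48697    300.19    320.79      0.06       162.22       151.77",
    ("PerfPerWatt(OS)", 1): "stress-ng: info:  [160954] cpu              21954    300.49    315.72      0.28        73.06        69.47",
    ("Performance", 2): "stress-ng: info:  [3687] cpu              48182    300.18    319.02      0.09       160.51       150.99",
    ("Performance", 1): "stress-ng: info:  [11623] cpu              21372    300.45    306.56      0.14        71.13        69.68",
}


def one_psu_ratio(runs, profile):
    """Throughput on one PSU as a fraction of throughput on two PSUs."""
    return bogo_ops_per_sec(runs[(profile, 1)]) / bogo_ops_per_sec(runs[(profile, 2)])


for profile in ("PerfPerWatt(OS)", "Performance"):
    print(f"{profile}: {one_psu_ratio(runs, profile):.0%} of two-PSU throughput")
```

Both profiles come out at roughly 46% of the two-PSU throughput, which is why changing the power profile alone did not help.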

The number to pay attention to in these tests is the final one on each line (bogo ops/s). When we unplug one power supply, the number of operations that the CPU can handle drops to less than half. Dell involves a senior engineer, who confirms that this is actually expected behavior and references a Dell whitepaper about Power Capping on the C6320 servers. After briefly reviewing the paper, we assume that we can just check the iDRAC web interface for power capping.

We see that no power cap policy is set, so things still aren’t adding up. It’s time to really dig into this whitepaper and (with some help from Dell Support) we manage to find a root cause and a solution. Although we received these servers with power capping disabled, Emergency Power Capping was enabled.

What is the Emergency Power Capping Policy? Glad you asked! It tells the chassis: when a C6320 chassis is running on one power supply, throttle the CPU no matter what the power consumption is. This is a field that is completely hidden in the iDRAC web interface, so how can you tell if this policy is enabled? Another great question! It is enabled if the 8th byte of the following ipmitool output is “01”:

#ipmitool raw 0x30 0xC5
00 00 00 02 00 00 00 01 00

*You can use ipmitool raw 0x30 0xC5 to get the current policy from any node within the chassis.
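
For automation, the check on the 8th byte can be expressed as a small Python sketch. It works on the response text only, so it can be tried without IPMI access. The names `emergency_throttle_byte` and `prochot_throttling_enabled` are hypothetical, and the only value interpreted is the "01" described above; the meaning of other non-zero values is not assumed here.

```python
PROCHOT = 0x01  # value the check above associates with emergency throttling


def emergency_throttle_byte(raw_output):
    """Return the 8th byte of the `ipmitool raw 0x30 0xC5` response as an int."""
    octets = raw_output.split()
    if len(octets) < 8:
        raise ValueError(f"expected at least 8 bytes, got {len(octets)}")
    return int(octets[7], 16)  # 8th byte is index 7 when counting from zero


def prochot_throttling_enabled(raw_output):
    """True if the response matches the '01' case described above."""
    return emergency_throttle_byte(raw_output) == PROCHOT


print(prochot_throttling_enabled("00 00 00 02 00 00 00 01 00"))  # True
```

Counting bytes from one in prose and from zero in code is an easy place to slip, which is why the index is spelled out in the comment.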

Translation?

Let us save you time and your sanity by explaining what you are looking at and how you can automate your way out of this hidden setting.

That 8th byte means that the chassis should throttle the CPU via PROCHOT when running on one power supply. The other options for this byte are to emergency throttle via NM (Node Manager), or to turn off emergency throttling. PROCHOT is short for processor hot, a processor technology that throttles the processor when certain conditions are met. More information on PROCHOT can be found in Intel’s whitepaper on page 7 (written for the Xeon 5500 series, but still relevant). NM is short for Intel’s Node Manager. More information on NM can be found here.

Our testing shows that the CPU does NOT throttle on one PSU when using NM, but we can’t find a clear answer on the logic behind NM, so we decide to avoid it completely and turn off emergency throttling. Instead, we set the Chassis Power Capping Value to 1300W. Each of the power supplies we use is a 1400W unit, so this limit accounts for a situation where the system is running on one power supply: it will throttle to avoid power failure or overheating.

We use a Chef recipe to set our desired policy with this command:

ipmitool raw 0x30 0x17 0x01 0x14 0x05 0x02 0x02 0x00 0x00 0x00 0x00

The first part of the command sets the chassis power capping policy:
#ipmitool raw 0x30 0x17 

The remaining 9 bytes, which are explained in the Dell whitepaper, roughly translate to:
“Enable chassis power capping with a value of 1300W, and disable emergency throttling”.
0x01 0x14 0x05 0x02 0x02 0x00 0x00 0x00 0x00

The 1st byte enables chassis power capping: 0x01
The 2nd and 3rd bytes indicate the power limit value, low byte first: 0x14 0x05 (0x0514 = 1300)
The 8th byte disables emergency throttling: 0x00

Here is a quick reference for other power limit values:
0x2c 0x01 = 300W
0xf4 0x01 = 500W
0x20 0x03 = 800W
0x14 0x05 = 1300W

*This command can be run from any blade within the chassis, and we tested it on Ubuntu 16.04 and CentOS 7.
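
The power limit values in the quick reference are simply the wattage written as a 16-bit number with the low byte first. A minimal Python sketch of that encoding follows, in case a different cap is needed. The function names `power_limit_bytes` and `chassis_cap_command` are hypothetical, and every byte other than the two power limit bytes is copied unchanged from the command above without interpretation.

```python
def power_limit_bytes(watts):
    """Encode a wattage as two hex bytes, low byte first (e.g. 1300 -> 0x14 0x05)."""
    if not 0 <= watts <= 0xFFFF:
        raise ValueError("power limit must fit in two bytes")
    return [f"0x{b:02x}" for b in watts.to_bytes(2, "little")]


def chassis_cap_command(watts):
    """Build the argument list for the policy command with a given cap in watts."""
    return [
        "ipmitool", "raw", "0x30", "0x17",
        "0x01",                      # 1st byte: enable chassis power capping
        *power_limit_bytes(watts),   # 2nd and 3rd bytes: power limit value
        "0x02", "0x02", "0x00", "0x00",  # 4th to 7th bytes: copied as-is from the post
        "0x00",                      # 8th byte: disable emergency throttling
        "0x00",                      # 9th byte: copied as-is from the post
    ]


print(" ".join(chassis_cap_command(1300)))
# ipmitool raw 0x30 0x17 0x01 0x14 0x05 0x02 0x02 0x00 0x00 0x00 0x00
```

The sketch only builds the command as a list of strings; actually running it (for example from a Chef recipe, as we do) is left to whatever tooling is already in place.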

Our policy only caps power at the chassis level. Dell Support strongly discourages enabling capping at both the chassis and sled/blade level, as this could lead to inconsistent results.

Let’s Improve Together

This obscure hardware issue sent ripples all the way up the stack, and thanks to the close communication between our various development and operations teams, we were able to find the root cause and solution. We hope this information saves other engineering teams time and effort in their future buildouts.

With thousands of servers to manage and a lean Facilities team, hardware automation and visibility are crucial. Have similar challenges? Have a hardware-related war story? We’d love to hear about it. Leave a reply below!
