Facebook’s data center in Prineville, OR, has been one of the most energy-efficient data center facilities in the world since it became operational early this year. Among the innovative features of its electrical distribution system are DC backup and high-voltage (480 VAC) distribution, which eliminate the need for a centralized UPS and for 480V-to-208V transformation. A built-in penthouse houses the chiller-less air conditioning system, which uses 100% airside economization and evaporative cooling to maintain the operating environment.
These features have enabled Facebook to reduce the energy consumption of the data center significantly, which is reflected in the facility’s power usage effectiveness (PUE). The PUE of the Prineville data center was 1.07 at full load, as verified during commissioning. Since then, during normal operation, the PUE has varied between 1.06 and 1.10. A histogram of the available PUE trend data for the period of April 14, 2011, to September 30, 2011, is presented in figure 1 below.
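As a reminder, PUE is simply the ratio of total facility energy to IT equipment energy. A minimal sketch of the arithmetic (the function and its inputs are illustrative, not Facebook’s actual telemetry):

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT equipment energy.

    A PUE of 1.0 would mean every watt-hour goes to IT equipment;
    the overhead for cooling and power distribution is PUE - 1.
    """
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh

# Example: 1,070 kWh of total facility energy for 1,000 kWh of IT load
# gives a PUE of 1.07, i.e. only 7% overhead.
print(pue(1070.0, 1000.0))
```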
Challenges in Operations
Although these features have resulted in high efficiency, we have learned some lessons along the way. As part of our commitment to openness via the Open Compute Project, we are sharing our experiences and lessons learned with the community so that everyone might benefit from them.
One challenge we encountered was keeping our air handler lineups from “fighting” with each other as they responded to the rapid day-to-night swings in the temperature and humidity of the outside air. For example, if the outside air dampers of one lineup were at 70%, the adjacent lineups would have their outside air dampers at 20-30%. This alternating modulation, or fighting, often led to stratification of the air streams.
Another, more significant, issue was an error in the sequence-of-operations controls that led to complete closure of the outside air dampers, causing the one-pass airflow system to function like a recirculating system. The problem began to manifest in late June, as outside air conditions started changing rapidly. The economizer demand signal responded to those changes, and the erroneous control sequence drove the economizer demand to 0, completely closing the outside air dampers. The data center was thus recirculating its own hot, dry exhaust air. The evaporative cooling system reacted to this high temperature and low humidity by spraying at 100% in an attempt to hold the maximum allowed supply temperature and dew point. As a result, the cold aisle supply temperature exceeded 80°F and the relative humidity exceeded 95%. The Open Compute servers deployed in the data center reacted to these extreme changes: numerous servers rebooted, and a few shut down automatically due to power supply unit failure.
The high-temperature, high-humidity supply air caused condensation on the concrete slab floor: because of its high thermal mass, the concrete had been in contact with much cooler supply air for a long time and remained below the dew point of the newly humid air. Similarly, upon investigating the failed power supply units (figure 2), we observed that the failures were condensation-related.
We began investigating this failure by subjecting a server to rapidly changing temperature and humidity conditions in a controlled test chamber. The relative humidity was raised to 97% and the temperature was ramped from 15°C to 30°C (59°F to 86°F) over the span of 10 minutes. Under these conditions, condensation was observed on the non-heated components. The server chassis was dripping wet, as you can see in figure 3. The motherboard, however, showed no signs of condensation, because it always ran above the dew-point temperature.
Condensation was also evident on the surfaces of power supply components such as capacitors and inductors, as shown in figure 4.
Figure 5 below shows the surfaces of the inductors in front of capacitor 1 and the forward vertical surface of capacitor 1; water droplets have formed on the surfaces of these non-heated components.
Figure 6 shows the different temperatures monitored during the test interval: both target and actual values of the ambient and dew-point temperatures, along with the surface temperature of capacitor 1 (CAP1).
The plot shows that the surface temperature of CAP1 falls below the dew point about 6 minutes into the temperature ramp, which is exactly when the borescope video starts showing a slight change in the reflectivity of the component surfaces. The condensation then continues for another 9 minutes, until the surface temperature of CAP1 rises back above the dew point. Throughout the test interval, the PCB in the power supply ran above the dew point and showed no signs of condensation.
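The condensation condition described above is straightforward to check numerically: a surface gets wet whenever it is colder than the dew point of the surrounding air. A minimal sketch using the standard Magnus approximation for dew point (the coefficients are published values for water vapor; this is illustrative, not the instrumentation used in the test):

```python
import math

def dew_point_c(air_temp_c: float, rh_percent: float) -> float:
    """Dew point via the Magnus approximation (reasonable for roughly 0-60 °C)."""
    a, b = 17.62, 243.12  # standard Magnus coefficients over a flat water surface
    gamma = math.log(rh_percent / 100.0) + (a * air_temp_c) / (b + air_temp_c)
    return (b * gamma) / (a - gamma)

def will_condense(surface_temp_c: float, air_temp_c: float, rh_percent: float) -> bool:
    """Condensation forms when a surface is colder than the air's dew point."""
    return surface_temp_c < dew_point_c(air_temp_c, rh_percent)

# At 30 °C and 97% RH the dew point is about 29.5 °C, so any component
# surface lagging the 10-minute temperature ramp by more than ~0.5 °C
# ends up below the dew point and collects water.
print(round(dew_point_c(30.0, 97.0), 1))
print(will_condense(25.0, 30.0, 97.0))
```

This is why the heated motherboard stayed dry while the thermally lagging capacitor and inductor surfaces did not: self-heated components sit above the dew point, while high-thermal-mass unpowered parts track the ramp slowly.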
All of these findings suggest that the failures were caused by water droplets being blown onto the PCB of the power supply, rather than by condensation occurring on the PCB itself. As shown in figure 7, water droplets were observed on the AC/DC cables and connectors. It is highly likely that these droplets were blown into the power supply units when the facilities maintenance staff increased the airflow in an effort to mitigate the problem.
The erroneous control sequence was promptly corrected, and additional safeguards were added to prevent a recurrence. These safeguards include a reevaluated minimum economizer demand setting, which prevents complete closure of the outside air dampers. Several monitoring points and alarm settings were also modified to detect rapid changes in outside air conditions and provide advance notice. And even though the supply air humidity, which exceeded 95% at times, was outside the operational range of the power supply units (10-90% RH, non-condensing), conformal coating has been applied locally to selected areas of the PCB to prevent condensation and to harden the power supply units against such corner cases.
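The minimum-demand safeguard amounts to a clamp in the control loop so that an erroneous demand signal can never fully close the outside air dampers. A minimal sketch (the variable names and the 10% floor are illustrative assumptions, not the actual building management system configuration):

```python
MIN_ECONOMIZER_DEMAND = 0.10  # illustrative floor; never fully close the OA dampers

def damper_command(economizer_demand: float) -> float:
    """Clamp economizer demand to [MIN_ECONOMIZER_DEMAND, 1.0].

    With the floor in place, even a demand of 0 (as produced by the
    erroneous control sequence) keeps the outside air dampers partly
    open, so the one-pass airflow system can never silently turn into
    a recirculating one.
    """
    return max(MIN_ECONOMIZER_DEMAND, min(1.0, economizer_demand))

print(damper_command(0.0))  # clamped up to the 10% floor
print(damper_command(0.7))  # normal demand passes through unchanged
```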