Structured testing of data centres
Waiting for a post-occupancy evaluation (PoE) to assess how effectively the services in a data centre perform is too late. Thorough testing before a data centre is handed over will ensure that a new facility lives up to expectations, as Dave Wolfenden of heatload.co.uk explains.
A new data centre is expected to operate continuously for the design life of the facility. Even the shortest outage of the power or cooling systems can be disastrous for the IT services operating in the data centre. Equipment failure and replacement strategies must be included in the design and implementation.
Any form of testing for the power and cooling systems needs to take a structured approach. It doesn't matter whether you are testing software, hardware, a data centre or doing the MOT on a car. Without a structured approach it is easy to miss things that could later turn out to be a major challenge.
ASHRAE (American Society of Heating, Refrigerating & Air-Conditioning Engineers) defines five levels of testing.
• Level 1: Factory acceptance testing
• Level 2: Field component verification
• Level 3: System construction verification
• Level 4: Site acceptance testing
• Level 5: Integrated system testing (IST)
Levels 1 to 3 begin with testing individual components at the factory, such as fully testing a UPS (uninterruptible power supply). As construction progresses, components are tested together to ensure that they operate as a complete system, as designed. Typically, this level of testing uses large load banks located outside the IT space. For example, the load banks test the load capacity of chillers, generators, transformers and UPS, individually and as a system.
Levels 4 and 5 bring together all the power and cooling systems supporting the data centre, as well as life safety, security and so on. This testing requires the load to be located within the IT space, with load banks that are much more granular in size, location and capacity. The load banks should replicate the final layout and capacity of the IT equipment.
There is a temptation to fill the data centre with large space heaters, each of 50 kW or more. This type of load is fine for testing the total capacity of the room. However, it is not suitable for room validation or testing the IT layout. Larger-capacity units are likely to bypass the power distribution, with direct connection to PDUs (power distribution units). This type of load is unsuitable for data centres that have some or all of the IT racks deployed.
For optimum testing, we recommend that the heat load be sized and distributed to replicate the IT layout, and that it be connected to the power distribution system.
It is likely that the design has been tested using CFD (computational fluid dynamics) modelling tools. Level 5 testing should include room validation that proves the model. This will give the end user of the facility full confidence in the model.
Once the data centre is operational, it is unlikely that further heat load testing can be carried out, so changes in the IT infrastructure should be modelled to fully understand their impact on the data centre. The best way of proving the model is to fully replicate the heat load layout. This means using a heat load that replicates that of the model and fully monitoring the testing using temperature sensors. Ideally the data centre should be flooded with temperature sensors, with multiple sensors at the front and rear of every rack location.
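As an illustration of proving the model against sensor data, the comparison can be as simple as checking each sensor reading against the temperature the CFD model predicted for the same location. The sketch below is a minimal example under stated assumptions: the `Reading` structure, the rack names, the sample temperatures and the 2 °C tolerance are all hypothetical and are not taken from any particular CFD or sensor package.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    rack: str           # hypothetical rack identifier
    position: str       # "front" (inlet) or "rear" (exhaust)
    predicted_c: float  # temperature predicted by the CFD model, in deg C
    measured_c: float   # temperature reported by the sensor, in deg C


def find_deviations(readings, tolerance_c=2.0):
    """Return the readings where sensor and model disagree by more than tolerance_c.

    The 2.0 deg C default is an assumed figure for illustration only.
    """
    return [r for r in readings if abs(r.measured_c - r.predicted_c) > tolerance_c]


# Illustrative values only, not real test data.
readings = [
    Reading("A01", "front", 22.0, 22.6),
    Reading("A01", "rear", 34.0, 38.1),
    Reading("A02", "front", 22.0, 25.5),
]

for r in find_deviations(readings):
    print(f"{r.rack} {r.position}: model {r.predicted_c:.1f} C, measured {r.measured_c:.1f} C")
```

Locations that fall outside the tolerance are the ones where either the model or the installation needs a closer look.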
Recent integration between the capture software of sensor manufacturers and CFD modelling software allows real-time modelling, using real data. The original predictive model can then be compared with the real-time one. During 2016 heatload.co.uk is looking to offer this as a service.
The temptation is to test the data centre only at 100% IT capacity, on the assumption that IT has specified the load correctly and that migration of IT equipment will be rapid. This is not always the case. IT may have over-specified the IT load, and the migration plan might be too aggressive for the end users. The result can be a very slow ramp-up of IT load, if indeed it ever reaches the design load.
If the design team has assumed that the load will ramp up rapidly and that the stated IT capacity is correct, they may have designed the data centre in a way that causes significant operational problems. In some circumstances, if the IT load is too low, UPS and generators will not hold the load, or cooling cannot be managed effectively. Massive amounts of energy can be wasted by lightly loaded systems.
It is optimistic to expect a recently commissioned data centre to go from 0 to 100% of capacity at the flick of a switch. We recommend starting with a load of no more than 10%, allowing the facility to stabilise before moving on to the next incremental step increase.
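A stepped test plan of this kind can be sketched in a few lines. The example below is only an illustration: the 10% step size, the 500 kW design load and the stability rule (the last five temperature samples staying within a 0.5 °C band) are assumptions made for the sketch, not recommended figures.

```python
def load_steps(start_pct=10, step_pct=10, target_pct=100):
    """Yield load set-points as a percentage of the design load.

    Starts at start_pct and increases by step_pct, always finishing on target_pct.
    """
    if not 0 < start_pct <= target_pct or step_pct <= 0:
        raise ValueError("start and step must be positive, and start must not exceed target")
    pct = start_pct
    while pct < target_pct:
        yield pct
        pct += step_pct
    yield target_pct


def is_stable(samples_c, window=5, band_c=0.5):
    """Treat the room as stable when the last `window` samples vary by at most band_c.

    Both defaults are assumed values for illustration.
    """
    recent = samples_c[-window:]
    return len(recent) == window and max(recent) - min(recent) <= band_c


design_kw = 500  # hypothetical design IT load
schedule = [(pct, design_kw * pct / 100) for pct in load_steps()]
for pct, kw in schedule:
    print(f"{pct:3d}% of design load = {kw:.0f} kW")
```

In use, the load would only be moved to the next step once a check such as `is_stable` passes for the monitored temperatures.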
Before a new airport terminal or cruise ship is handed over to the customer and opened to the public, exhaustive testing is carried out. This involves armies of volunteers or members of staff mimicking the processes the public will go through. Some of the testing will be destructive. What happens when the lifts all fail? Can the stairs cope with a sudden influx of passengers rushing to their departure gate? If all the passengers rush over to one side of the ship, will it make the ship unstable? Can the fresh- and dirty-water systems cope with every toilet being flushed simultaneously?
The same is true for a data centre. Test everything.
When the power fails and the generators are starting, does the emergency lighting come on to avoid panic and accidents or is there a period of pitch black?
The design may have several minutes of autonomy in the UPS and batteries to maintain the IT systems, but does the cooling continue at a sufficient level to prevent overheating of the IT systems?
Test cause and effect. For example, when the fire alarms are set off, do you want the security doors to unlock or remain locked? Should the cooling be shut down, or should just the fresh-air ventilation be closed off?
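One way to keep cause-and-effect testing structured is to write the intended responses down as a matrix and compare what is observed on site against it. The sketch below assumes a hypothetical design intent; the causes, systems and expected states are placeholders, and the real answers are decisions for the design team and the client.

```python
# Hypothetical design intent: cause -> {system: expected state}.
# These entries are placeholders for illustration, not recommendations.
expected_matrix = {
    "fire_alarm": {
        "security_doors": "unlocked",
        "fresh_air_ventilation": "closed",
        "cooling": "running",
    },
    "mains_failure": {
        "emergency_lighting": "on",
        "generators": "running",
    },
}


def check_cause_and_effect(cause, observed, matrix):
    """Return (system, expected, observed) for every response that does not match.

    A system with no observation recorded is reported as a mismatch (observed is None).
    """
    return [
        (system, want, observed.get(system))
        for system, want in matrix[cause].items()
        if observed.get(system) != want
    ]


# Illustrative site observation after triggering the fire alarm.
observed = {
    "security_doors": "locked",
    "fresh_air_ventilation": "closed",
    "cooling": "running",
}
for system, want, got in check_cause_and_effect("fire_alarm", observed, expected_matrix):
    print(f"{system}: expected {want}, observed {got}")
```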
Over the design life of the data centre, IT will go through several technology refreshes, which could result in hotter or cooler systems being deployed. As space is occupied, the original layout may no longer be valid. The technical limitations of IT systems may dictate a higher-density layout than expected. Several of our customers test their data centres with a number of layouts, as follows.
• Uniform load, with heat load spread evenly throughout the data centre.
• Varied load to try to represent the predicted IT load layout, with different racks having differing heat-load capacities.
• Uneven, worst-case load scenario — with most of the heat load in one or more rows of racks and very little heat load in the rest of the data centre.
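The three layouts above can be expressed as different ways of sharing the same total heat load across the rack positions. The following is a minimal sketch: the rack names, the 40 kW total, the relative weights and the 90% share given to the hot row are all assumptions chosen for illustration.

```python
def uniform_layout(total_kw, racks):
    """Spread the heat load evenly across every rack."""
    per_rack = total_kw / len(racks)
    return {rack: per_rack for rack in racks}


def varied_layout(total_kw, weights):
    """Share the heat load in proportion to each rack's relative predicted density."""
    total_weight = sum(weights.values())
    return {rack: total_kw * w / total_weight for rack, w in weights.items()}


def worst_case_layout(total_kw, racks, hot_racks, hot_share=0.9):
    """Put hot_share of the load in the hot racks and spread the rest over the others.

    The 0.9 default is an assumed figure for illustration.
    """
    hot_set = set(hot_racks)
    hot = [r for r in racks if r in hot_set]
    cold = [r for r in racks if r not in hot_set]
    if not hot or not cold:
        raise ValueError("need at least one hot rack and one other rack")
    layout = {r: total_kw * hot_share / len(hot) for r in hot}
    layout.update({r: total_kw * (1 - hot_share) / len(cold) for r in cold})
    return layout


racks = ["A1", "A2", "B1", "B2"]  # hypothetical rack positions
total_kw = 40                     # hypothetical total heat load

uniform = uniform_layout(total_kw, racks)
varied = varied_layout(total_kw, {"A1": 1, "A2": 1, "B1": 3, "B2": 3})
worst = worst_case_layout(total_kw, racks, hot_racks=["A1", "A2"])

for name, layout in [("uniform", uniform), ("varied", varied), ("worst case", worst)]:
    print(name, {rack: round(kw, 1) for rack, kw in layout.items()})
```

Each layout carries the same total load, so differences in the measured temperatures come from distribution rather than capacity.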
During testing, problems will be discovered that require a fix to resolve the situation. The temptation is to apply the fix, complete the failed test and move on to the next test. The risk is that the ‘fix’ may have inadvertently invalidated previous tests. It is imperative that the fix be fully investigated and any previous tests that could be affected be re-run.
Dave Wolfenden is managing director of heatload.co.uk, the data-testing division of Mafi Mushkila Ltd.