Edarat Group | DATA CENTER UNIVERSE | 12-Part Series | 4th Publication

The Real Impact of AI’s Energy Demands on Your Data Center

May 2024 | 5 min read
Overview

Artificial Intelligence (AI) technologies account for a growing share of data center (DC) workloads, and the global DC community is challenged to make existing and new facilities AI-ready. Before the DC operations team decides on a course of action, it is crucial to understand what AI is about, the types of IT workloads to expect, and the immediate and future market needs.

The key steps are to evaluate infrastructure components such as power, cooling, network capacity, space utilization, and structural elements, and to propose technical solutions. This publication analyzes these subjects and presents technical proposals, points for further investigation, and the resulting conclusions and recommendations.

AI Definition & Global Evolution Costs

Artificial Intelligence (AI) is the theory and development of computer systems capable of perceiving their environment, reasoning about their state, and selecting actions to perform specific tasks, much as humans recognize visual or logical patterns, understand natural language, make decisions, and act accordingly.

Like humans, any AI system can develop intelligence based on experience and learning. This procedure is widely known as AI model training, where algorithms are fed massive data sets and trained to identify patterns, accumulate experience, and develop non-linear models that relate input data patterns to output predictions. AI is a general term that encompasses several technologies such as machine learning (ML), deep learning (DL), and natural language processing (NLP), all of which use the same workflow to produce models.

Today, AI is not just a theoretical concept but a wide range of technologies that power services and products we use daily. Applications like face recognition, online recommendation systems, and real-time communication chatbots are tangible examples of AI systems trained on massive amounts of data to identify face characteristics, gender, age, or tokens of word sequences, create meaning, and respond verbally or by text.

DC and AI Demand

Most of this AI activity takes place inside data centers. Hundreds of servers, storage units, and network switches process and maintain massive amounts of AI (and non-AI) critical data. These modules are arranged into racks and further into clusters to distribute data and perform parallel processing. Two essential measures of IT data flow and equipment intensity define a data center's capacity and operational capabilities:

IT Workload:

Measured in kW (up to MW or GW at facility scale), IT workload indicates the overall computing resources available at the data center for data processing. Since compute resources are servers equipped with CPUs and GPUs, these modules define the data center's maximum available IT workload.

Rack Density:

The amount of processing and storage equipment stacked within a rack defines its rack density, measured in kW per rack. The more intensive and demanding the data being stored, recalled, and processed, the more powerful the CPUs and GPUs required, and the higher the rack density.

The impact of AI on data center infrastructure operation is directly quantified and measured through the elevated demand for IT workload and rack density. DC operators translate this into an equivalent increase in power and cooling demand.
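As a simple illustration of how these two measures combine, the sketch below estimates a facility's total IT workload and the heat load the cooling plant must match. All rack counts and densities are hypothetical assumptions, not figures from this article.

```python
# Hypothetical example: relating rack density to total IT workload.
# Rack counts and densities below are illustrative assumptions only.

def total_it_workload_kw(racks: int, density_kw_per_rack: float) -> float:
    """Total IT workload is simply rack count x rack density (kW per rack)."""
    return racks * density_kw_per_rack

legacy = total_it_workload_kw(racks=200, density_kw_per_rack=5)   # 1,000 kW
ai_pod = total_it_workload_kw(racks=10, density_kw_per_rack=40)   #   400 kW

# Virtually all IT power ends up as heat, so cooling must track the same figure.
print(f"Legacy hall IT load : {legacy:.0f} kW")
print(f"AI pod IT load      : {ai_pod:.0f} kW (only 10 racks)")
print(f"Combined heat load  : {legacy + ai_pod:.0f} kW to be removed by cooling")
```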
The evolution of AI as part of the computer science discipline creates a global market for AI services. Whether through corporate development, research activity, or the much-anticipated as-a-Service model, AI software development is growing continuously and is expected to follow exponential rates.
Hardware development is closely tied to it, as the transition from CPUs to GPU accelerators constantly increases floating-point operations (FLOPS) performance. This is not without consequences: the increased computational capability of servers translates into higher power demand and thermal design power (TDP) in the hardware.
Types of AI Workloads

AI activity mainly involves leveraging data sets and data streams of various sizes through the data center's computational resources. Covering storage, processing, and modeling, the AI workload is:

a) Defined in terms of various data processing procedures like training, inference, data analytics and high-performance computing (HPC).

b) Quantified through the required IT computing power (kW) needed to implement.

The AI workload types managed by a data center facility are the following:

Training

Training new AI algorithms involves utilizing large to massive data sets to create neural networks with millions or billions of parameters. Popular machine learning tools run extended calculations and iterations to adjust the model parameters and fit the algorithm to the patterns and behaviors specific to the provided data set.

After training completion, the outcome is called an AI model, which, under certain conditions, is ready for testing and deployment to production (e.g., large language models (LLMs) such as ChatGPT, which are already publicly available). Training AI models of various scales is expected to be the most demanding and computing power-intensive operation for future data centers.

Inference

Real-time inference is the procedure of deploying an AI model to production. In this case, the trained and tested AI model can receive new input data, provide predictions, generate results, and interact with the end user. Our daily interaction with AI applications such as voice assistants, platform recommendation systems, weather forecasting, and email or bank fraud detection indicates the importance and massive scale of such deployments. Thousands or millions of users simultaneously use a single model or a combination of models.

Behind the scenes, AI models are continuously recalled and run inside DC facilities. Obviously, the more AI services penetrate new markets, the more models are created, and the more demand for real-time deep learning inference will occur in existing and future data centers.

High-Performance Computing (HPC)

HPC is an application area mostly related to high-end research and academia, where either traditional analytical approaches or AI-based algorithms leverage large data sets and perform precise, floating-point-intensive calculations to solve a problem.

In this case, again, the data center infrastructure operator will have to provide stacks of GPUs arranged into high-intensity racks and clusters, which are power-hungry and generate significant heat. Recent advances in medical research, such as DNA sequence alignment and protein structure prediction, illustrate the application of HPC and the consequent data center infrastructure requirements.

Data Analytics

Data analytics is the phase that consumes most of the time in AI training workloads, as it involves scraping, grouping, and labeling massive data sets and creating pipelines to feed further processing and AI algorithms. The size of a data set is defined by the type of data it holds: media such as photos or videos can create enormous amounts of input to be stored and managed. Here, again, the data center workload increases significantly, requiring high-intensity IT equipment.

AI Growth Projections
AI growth over the coming years can be projected by taking the forecast growth of GPU sales, one of the main hardware elements of the AI evolution, as a point of reference. The global graphics processing unit (GPU) market size is expected to reach USD 773.07 billion by 2032, growing at a compound annual growth rate (CAGR) of 33.8% from 2023 to 2032.
Taking the estimate that AI represents 4.3 GW of power demand today and projecting it forward at the same 33.8% CAGR yields a total demand of 18.5 GW by 2028, while the annual CAGR for non-AI data center IT loads is estimated at 11%.
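A minimal sketch of that projection, applying the 4.3 GW baseline and the 33.8% CAGR quoted above over 2023 to 2028, with the 11% non-AI figure included for comparison:

```python
# Projection sketch using the figures quoted above:
# 4.3 GW of AI power demand today growing at a 33.8% CAGR,
# versus non-AI IT load growing at an estimated 11% CAGR.

def project(base_gw: float, cagr: float, years: int) -> float:
    """Compound the base value forward: base * (1 + CAGR)^years."""
    return base_gw * (1 + cagr) ** years

ai_2028 = project(base_gw=4.3, cagr=0.338, years=5)
print(f"AI power demand in 2028: {ai_2028:.1f} GW")   # ~18.4 GW, in line with the ~18.5 GW cited above

# Relative growth of a non-AI load over the same period (illustrative base of 1.0)
non_ai_factor = project(base_gw=1.0, cagr=0.11, years=5)
print(f"Non-AI load growth factor over 5 years: {non_ai_factor:.2f}x")  # ~1.69x
```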
Infrastructure Readiness

Data center infrastructure readiness for AI integration requires a thorough understanding of AI workloads, evaluation of infrastructure components, implementation of technical solutions, continuous investigation, and proactive adaptation.

Practically, the DC operations team shall investigate in detail subjects related to power availability, cooling system capacity, network latency, and structural elements such as space and raised floors. Essential points are discussed below as starting points for further investigation.

Power

Most existing data centers work with average rack densities of 3 to 6 kW per rack. However, the DC operator must consult system records to identify exact power consumption and design references. The operator shall also consider the capacity of the existing overhead power distribution system, along with the available rating of the points of connection (socket, tap-off, etc.).

Rack connection points typically rated 32 A to 63 A, 3P/230 V, are available within existing data centers that follow the IEC 230/400 V voltage system. Depending on the available power distribution system, rack densities of 10 to 20 kW are also possible, but not at scale.

Expected rack densities for AI-ready servers are larger than 20 kW and can reach 50 kW or more. The table below presents indicative rack densities based on the type of AI workload.

AI workloads demand more powerful servers and consequently require GPUs with higher power consumption. A chip's power consumption, measured in watts, is commonly specified as TDP (thermal design power) and is the proper reference value for calculations.

Typical GPU TDP has increased from 300 W to 700 W (roughly a 2.3x rise) within the last three years and is forecast to reach 1 kW per GPU soon.
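A rough way to see how GPU TDP drives rack density is to multiply per-GPU TDP by GPUs per server and servers per rack, plus an allowance for CPUs, fans, and power-supply losses. All counts and the overhead factor in the sketch below are illustrative assumptions, not vendor specifications.

```python
# Illustrative rack-density estimate driven by GPU TDP.
# GPU counts, server counts and the overhead factor are assumptions for this sketch.

def rack_power_kw(gpu_tdp_w: float, gpus_per_server: int,
                  servers_per_rack: int, overhead_factor: float = 1.3) -> float:
    """Estimate rack power: GPU power scaled up for CPUs, fans, PSU losses, etc."""
    gpu_power_w = gpu_tdp_w * gpus_per_server * servers_per_rack
    return gpu_power_w * overhead_factor / 1000.0

# Same hypothetical rack (8 GPUs/server, 4 servers/rack) at three TDP generations.
print(f"300 W GPUs : {rack_power_kw(300, gpus_per_server=8, servers_per_rack=4):.1f} kW/rack")
print(f"700 W GPUs : {rack_power_kw(700, gpus_per_server=8, servers_per_rack=4):.1f} kW/rack")
print(f"1 kW GPUs  : {rack_power_kw(1000, gpus_per_server=8, servers_per_rack=4):.1f} kW/rack")
```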

A typical rack module within an existing DC, arranged with cold- or hot-aisle containment, can host up to 30 racks, always subject to the rack density limitation. For AI, the workload will define the expected capacity per rack and, consequently, the number of racks arranged within the available space. Their arrangement shall, of course, respect the power distribution and cooling available within the existing data halls.

For example, a data hall cold-containment module that can serve 30 racks of 4 kW each will be able to serve just 6 to 10 racks for AI inference, or 2 racks for AI training. The footprint of existing modules will have to be reduced significantly, or their available power capacity re-distributed to achieve higher intensities in specific areas.
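The arithmetic behind that example, sketched with hypothetical per-rack densities consistent with the ranges discussed above:

```python
# Re-distributing a fixed containment-module power budget across denser AI racks.
# The per-rack AI densities below are illustrative assumptions.

module_budget_kw = 30 * 4          # 30 legacy racks at 4 kW each = 120 kW

for label, density_kw in [("AI inference rack", 15), ("AI training rack", 55)]:
    racks_supported = module_budget_kw // density_kw
    print(f"{label:18s} ({density_kw} kW): {racks_supported} racks "
          f"within the same 120 kW budget")
# -> roughly 8 inference racks or 2 training racks instead of 30 legacy racks
```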

As rack power density increases, power distribution installations above the racks or below the raised floor become a challenge. These racks start to have multiple redundant power feeds of 63A, 100A, or more, three-phase or single-phase, and with higher short circuit ratings.

Cooling

The purpose of cooling in any data center is to remove the heat produced by IT equipment inside the data hall and maintain temperature and humidity within the operating range of the IT equipment. Most existing data centers use air cooling for the IT rooms, where cooled air is supplied to the spaces through terminal units such as CRACs, CRAHs, AHUs, or even fan walls.

Air as a means of heat removal has a certain thermodynamic capacity limitation, and air-cooling systems have proven to be a satisfactory and mature solution for rack densities up to 20 kW.
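One way to see that ceiling is to estimate the airflow a single rack needs. The sketch below uses the sensible-heat relation Q = rho x V x cp x dT with assumed air properties and an assumed 12 K air temperature rise across the servers.

```python
# Airflow needed to remove rack heat with air: Q = rho * V_dot * cp * dT
# Air density (~1.2 kg/m^3), cp (~1.005 kJ/kg.K) and the 12 K delta-T are assumptions.

RHO_AIR = 1.2        # kg/m^3
CP_AIR = 1.005       # kJ/(kg.K)

def required_airflow_m3h(rack_kw: float, delta_t_k: float = 12.0) -> float:
    """Volumetric airflow (m^3/h) needed to carry away rack_kw of heat."""
    m3_per_s = rack_kw / (RHO_AIR * CP_AIR * delta_t_k)
    return m3_per_s * 3600

for kw in (5, 20, 50):
    print(f"{kw:>2} kW rack -> ~{required_airflow_m3h(kw):,.0f} m^3/h of supply air")
# Airflow scales linearly with load, which is why densities well above ~20 kW
# become impractical to cool with air alone.
```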

In AI clusters where rack densities exceed this value and can potentially reach 50 or 80 kW/rack, alternative technologies are needed, the most appropriate being liquid cooling (LC). Liquid cooling is represented by three main technologies: the rear-door heat exchanger, direct-to-chip (DTC), and immersive cooling (IC). All can support extreme rack densities.

Rear-door Heat Exchangers

Rear-door heat exchangers are a mature technology that provides a viable solution for managing densities above 20 kW. While these technologies don’t bring liquid directly to the server, they do utilize liquid’s high thermal transfer properties and employ similar data center infrastructure as that required for direct liquid cooling.

Passive or active heat exchangers replace the rear door of the IT equipment rack with a liquid heat exchanger. With a passive design, server fans expel heated air through a liquid-filled coil mounted in place of the rack’s rear door. The coil absorbs the heat before the air passes into the data center. Active heat exchangers include fans to pull air through the coils and remove heat from even higher-density racks.

Rear-door coolers replace the traditional approach to data center cooling, allowing loads of up to 75 kW per cabinet with the added benefit of reclaiming the real estate inherent in hot-aisle/cold-aisle, in-row cooling (IRC), or perimeter cooling (CRAH) designs. The waste heat generated by active equipment within a cabinet is removed at the source using chilled water or another fluid, with or without a CDU and a leak prevention system (LPS).

They also increase the availability of energy-efficient 'free cooling', delivering air at up to the 27°C maximum room temperature recommended by ASHRAE (2013 guidelines).

These systems can form the foundation for a hybrid approach to data center cooling in which liquid and air-cooling systems work together to cool environments with mixed rack densities. The next phase in this evolution is the move to direct liquid cooling.

Direct-to-Chip (DTC)

DTC uses liquid-cooled cold plates in direct contact with the server CPUs/GPUs, together with a network of micro-pipes and manifolds that circulate liquid to a coolant distribution unit (CDU) and heat exchanger.

The CDU is a liquid-to-liquid heat exchanger, with one side connected to the rack cooling manifolds and the other to the data center chilled water system or a refrigerant cooling system. It can be internal to each rack or a separate stand-alone unit within the data hall.
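As a rough intuition for sizing the facility-water side of a CDU, the sensible-heat relation Q = m_dot x cp x dT can be applied; the water properties and the 8 K supply/return temperature difference below are illustrative assumptions, not vendor data.

```python
# Rough CDU facility-water flow estimate: Q = m_dot * cp * dT
# Water cp (~4.186 kJ/kg.K) and the 8 K delta-T are illustrative assumptions.

CP_WATER = 4.186     # kJ/(kg.K)

def water_flow_l_per_min(load_kw: float, delta_t_k: float = 8.0) -> float:
    """Mass flow (roughly litres/min, taking 1 kg ~ 1 litre) to absorb load_kw."""
    kg_per_s = load_kw / (CP_WATER * delta_t_k)
    return kg_per_s * 60

for rack_kw in (50, 80):
    print(f"{rack_kw} kW liquid-cooled rack -> ~{water_flow_l_per_min(rack_kw):.0f} L/min "
          f"of facility water at an 8 K rise")
```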

The DTC cooling system has become one of the first options explored by DC operators. It allows implementation onto existing rack arrangements and connection to the existing chilled water system. This allows the conversion of conventional air-cooled racks into high-density racks with liquid-cooled servers.

DTC cooling technology can be a valuable solution for existing data centers that have to provide an immediate transition to AI workloads with high-density racks, but on a limited scale. Converting part of the racks into an AI cluster can coexist with the remaining traditional air-cooled racks and cooling terminals, while the same chilled water system serves the CDU(s).

Special measures shall be considered for leak detection, especially inside the rack; however, market solutions with negative-pressure CDUs can provide an immediate answer to this concern.

Immersive Cooling

Immersive cooling (IC) is a solution in which servers are completely immersed in a dielectric liquid, and heat is transferred directly from the electronic components to the fluid. This approach takes direct advantage of the fact that liquids' thermal conductivity and heat-carrying capacity are many times greater than those of air, allowing configurations where each tank accommodates 150 to 600 kW of IT capacity over a 6 to 10 sqm footprint.
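For a sense of the floor-space density those figures imply, the quick comparison below uses an assumed typical air-cooled hall value, not a figure from this article.

```python
# Floor-space density implied by the tank figures quoted above.
# The air-cooled hall figure is an illustrative assumption for comparison.

tank_low = 150 / 10     # kW per sqm -> 15 kW/sqm
tank_high = 600 / 6     # kW per sqm -> 100 kW/sqm
air_cooled_hall = 2.0   # kW per sqm, assumed typical air-cooled white space

print(f"Immersion tank : {tank_low:.0f} - {tank_high:.0f} kW/sqm")
print(f"Air-cooled hall: ~{air_cooled_hall:.0f} kW/sqm (assumed)")
```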

The liquid sits in an open-top tank, essentially a rack laid horizontally, with the servers likewise mounted horizontally and pulled out vertically. The IC tank is a stand-alone system with a 19'' server fixation racking system, a heat exchanger unit (CDU), and a dry section for access to the data cabling and PDUs. Each tank is connected through the CDU to the chilled water system and an external heat rejection unit or dry cooler.

Although this solution appears to connect directly to the existing data center chilled water system, operators must take special care: the chilled water temperatures an IC tank normally operates at (35/40°C) are much higher than the chilled water system's normal 10/15°C.

IC cooling can be a proper solution where DC operators must provide a separate, specific, high-intensity AI cluster within a limited footprint, taking into account the installation of separate piping and external cooling units. Nevertheless, operators shall treat the IC tank as a single-capacity component with no redundancy: in case of failure, all the servers in the tank are subject to shutdown.

Immersive cooling is a recent development in the market. It has not yet been widely deployed, and only a limited number of immersion-ready servers are available as market-ready products.

Networking

AI clusters for model training engage a large number of GPUs, each requiring its own network ports. Connecting them with high-speed network cables allows full utilization of the processing speed. Operators shall coordinate with the IT development team to evaluate network cable options.

Using fiber allows larger distances between nodes and clusters, reducing rack density with fewer servers per rack. With copper cables, on the other hand, distance limitations will force the team to stack more servers within the same racks to reduce network latency. In this case, higher rack densities make system deployment more complicated and demanding for the DC infrastructure.

Space Utilization

Beyond the power and cooling provisions required for an AI-ready data center, several other points shall be equally investigated and, where needed, rectified to accommodate the AI upgrade.

Racks sized for AI servers are expected to become larger in width and height to allow enough space for deep learning servers; similarly, larger PDUs designed for increased power ratings and space for liquid-cooling manifolds will push for deeper racks. Operators shall also check the durability of AI-specified racks, as heavier gear is expected to be installed.

For DCs with raised floors, operators shall carefully evaluate the floor's concentrated design load and wheel rolling load. AI racks will be heavier than normal-density racks, and special consideration should be given.
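A simple check of the kind implied here, using hypothetical rack weights, footprint, and floor rating (none of these figures come from this article, and a full assessment would also cover point loads on individual pedestals):

```python
# Simple raised-floor loading check with hypothetical figures.
# Rack weight, footprint and floor rating below are illustrative assumptions.

def floor_load_kg_per_m2(rack_weight_kg: float, width_m: float, depth_m: float) -> float:
    """Distributed load a rack imposes over its own footprint."""
    return rack_weight_kg / (width_m * depth_m)

legacy_rack = floor_load_kg_per_m2(rack_weight_kg=800,  width_m=0.6, depth_m=1.2)
ai_rack     = floor_load_kg_per_m2(rack_weight_kg=1600, width_m=0.6, depth_m=1.2)

floor_rating = 1200  # kg/m^2, assumed raised-floor design rating
for label, load in (("Legacy rack", legacy_rack), ("AI rack", ai_rack)):
    verdict = "OK" if load <= floor_rating else "exceeds assumed floor rating"
    print(f"{label}: ~{load:.0f} kg/m^2 -> {verdict}")
```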

Most operators prefer that the main chilled water pipe runs serving DTC liquid-cooled racks or IC tanks not come from above. The DC operations team shall verify whether the existing chilled water system can be used or identify new routes. Where a raised floor exists, the pipes will normally run underneath it; for facilities without one, different options shall be explored, through adjacent corridors or by providing trenches in the floor.

Strategy & Planning

Data center operators shall investigate their existing and future market needs in coordination with their business targets and budget allocations. Creating a medium- and long-term plan for IT capacity management and rack density distribution will allow them to make critical decisions on all the technical subjects analyzed above. Continuous communication with the market, evaluation of technology trends, and cooperation with existing and future clients are key components of strategy development.

In many existing data centers where a complete new design and development is not part of the business plan, hybrid-type data centers are appropriate for short- to medium-term development. A hybrid DC continues to operate all existing facilities, including IT racks and air cooling, while allocating power and space for a small, dedicated AI cluster with an increased power supply and liquid cooling, in line with client needs.

Hybrid-type solutions create concerns for Tier-certified DCs and the impact on their certification, although discussions for future DC facilities with multiple Tier certificates may arise as a solution.

Similarly, the absence of international standards for certifying liquid cooling systems is a concern requiring further investigation by the DC community, as most market-proposed solutions are developed and tested only by their vendors. Institutional certification for AI hardware and liquid cooling solutions has become a necessity, along with the involvement of international facility certification standards such as the Uptime Institute Tier rating.

Furthermore, the evolution of new technologies, such as Tensor Processing Units (TPUs), challenges GPUs and steers the market even further. Technology trends and different approaches to hybrid cloud models will have an even greater impact.

Developers tend to use 'public data' to develop a foundational model in the cloud, then augment it and make it business-specific using private data in their own computing facilities to maintain control of their data. The model is then deployed, and all inferencing happens at the edge. This workflow leaves colocation DCs out of the loop for much of the workload.

The future is expected to be very interesting, and the evolution of AI as a service within data centers is currently knocking on the door for facility managers.

References
Schneider Electric White Paper 133, Navigating Liquid Cooling Architectures for Data Centers with AI Workloads
Schneider Electric White Paper 265, Liquid Cooling Technologies for Data Centers and Edge Applications
Schneider Electric White Paper 110, The AI Disruption: Challenges and Guidance for Data Center Design
Omdia Analyst Summit Japan presentation, Omdia Data Center Compute Intelligence Service, https://omdia.tech.informa.com/products/data-center-compute-intelligence-service
SemiAnalysis, AI demand vs. current data center capacity analysis, https://www.semianalysis.com/p/ai-datacenter-energy-dilemma-race
Precedence Research, Graphics Processing Unit (GPU) Market, https://www.precedenceresearch.com/graphic-processing-unit-market
Vertiv liquid cooling white paper (vertiv-liquid-cooling-wp-en-na-sl-70807-web)
