Previously, we discussed the differences between AI model building and AI inference activities. We also talked about location sensitivity: latency to the requester vs. scale for long-running jobs. In this post, we will explore some challenges and solutions inside the data centers and why each Data Center will have its own pool.
This will set the stage for deeper discussions on technology advances and opportunities for a more sustainable future as physical and technical efficiencies are applied to this scale challenge!
Data Center Short/Medium Capital: Opportunities and Constraints
Everything about the Data Center matters, starting with the location (cold vs. hot climates, as well as power availability). For now, though, let’s talk just about what goes inside the Data Centers. We will look at the macros around the long-term capital later.
Rack Infrastructure
In high-performance data centers, standard server rack loads typically range from 5 kW to 15 kW, but AI and high-performance computing (HPC) have significantly increased power demands, pushing densities to 30 kW and beyond, with some reaching 100 kW or more per rack. This dramatic increase is driven by the energy-intensive nature of AI workloads and the need for powerful GPUs and accelerators.
Traditional data centers: Often use 5 kW to 15 kW per rack, with some outliers pushing to 20 kW.
High-density racks: Specifically designed for AI and HPC, these racks can consume 15 kW to 30 kW or more.
Extreme high-density racks: Some cutting-edge facilities can exceed 30 kW, even reaching 130 kW and beyond.
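As a rough illustration, the tiers above can be written as a small lookup. This is only a sketch: the function name `classify_rack` is made up for the example, and treating 15 kW and 30 kW as hard cut-offs is an assumption, since in practice the ranges overlap (some traditional racks reach 20 kW).

```python
def classify_rack(power_kw: float) -> str:
    """Map a per-rack power draw (kW) to the tiers described in this post.

    The 15 kW and 30 kW boundaries are illustrative cut-offs taken from the
    ranges quoted above, not an industry standard.
    """
    if power_kw < 0:
        raise ValueError("rack power cannot be negative")
    if power_kw <= 15:
        return "traditional"
    if power_kw <= 30:
        return "high-density"
    return "extreme high-density"


# Example: a legacy rack, an HPC rack, and a rack-scale AI system.
for kw in (8, 25, 132):
    print(f"{kw} kW -> {classify_rack(kw)}")
```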
AI’s Impact on Power Demand
Increased density: AI and HPC workloads require significantly more processing power, leading to higher power densities per rack.
GPU intensive: AI models, particularly large language models, rely heavily on GPUs, which are known for their high power consumption during model development, model fine-tuning and inference operations.
Cooling challenges: As power increases, so does heat. Higher power densities create thermal challenges that require more advanced cooling solutions, such as liquid cooling, to maintain optimal operating temperatures.
Escalating energy consumption: The overall energy consumption of AI-focused data centers is rapidly increasing, with some projections showing AI consuming a substantial portion of global data center energy.
Examples
- Servers built around NVIDIA’s A100 GPUs can consume up to 10 kW each, according to LinkedIn.
- Some AI-enabled data centers are seeing rack densities of 100 kW or more.
- Hyperscale providers are already operating servers at higher power levels, with average power consumption ranging from 12 kW to 20 kW, but upcoming Data Centers for 2025 are being spec’d at more than 100 kW per rack.
- NVIDIA’s DGX systems, such as the NVL72, can unite 72 Blackwell GPUs in a single rack-scale system. This configuration is rated at an astonishing 132 kW per rack.
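To put the 132 kW figure in context, a quick back-of-the-envelope calculation compares it with the 5 kW to 15 kW traditional range quoted earlier. The helper name `equivalent_racks` is hypothetical, and the per-GPU number is simply rack power divided by GPU count, so it also covers CPUs, networking and other overheads rather than being a GPU rating.

```python
NVL72_RACK_KW = 132.0          # rack rating quoted in this post
GPUS_PER_RACK = 72             # Blackwell GPUs in one NVL72 rack
TRADITIONAL_KW = (5.0, 15.0)   # traditional per-rack range from this post


def equivalent_racks(dense_rack_kw: float, traditional_rack_kw: float) -> float:
    """How many traditional racks draw the same power as one dense rack."""
    if traditional_rack_kw <= 0:
        raise ValueError("traditional rack power must be positive")
    return dense_rack_kw / traditional_rack_kw


low_kw, high_kw = TRADITIONAL_KW
fewest = equivalent_racks(NVL72_RACK_KW, high_kw)  # versus 15 kW racks
most = equivalent_racks(NVL72_RACK_KW, low_kw)     # versus 5 kW racks
print(f"One 132 kW rack ~ {fewest:.1f} to {most:.1f} traditional racks")

# Rack-level average only; not a per-GPU power specification.
print(f"Average rack power per GPU: {NVL72_RACK_KW / GPUS_PER_RACK:.2f} kW")
```

Under these assumptions, one such rack draws as much power as roughly 9 to 26 traditional racks.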
Innovation Inside the Data Center
With these huge power demands and the associated heat, modern AI Data Centers are being purpose-built around new power distribution and cooling strategies.
- Direct Liquid Cooling (DLC) and immersion cooling have become essential for managing the extreme heat generated by these powerful systems. DLC involves circulating a chilled liquid directly to the processing chips through cold plates, while immersion cooling involves submerging servers in a non-conductive dielectric fluid. These methods are far more efficient than traditional air cooling and are critical for maintaining the operational stability of such high-density racks.
- The challenges aren’t limited to cooling. These racks also demand significant advancements in power distribution, as a few rows of these systems can consume as much power as an entire data hall filled with traditional racks. This requires new power distribution units (PDUs) and power management systems that can handle the massive current draws and ensure stable power delivery.
- Ultimately, the most powerful racks are not just about the hardware they contain but also the entire ecosystem of power delivery, cooling, and infrastructure designed to support them!
- Video: “Liquid Cooling Technology in Data Centers: How It Supports AI Workloads” explains how liquid cooling technology is being used to support the intense heat and power demands of AI workloads in data centers.
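The cooling and power distribution points above can be made concrete with two textbook formulas: coolant mass flow from Q = ṁ · c_p · ΔT, and balanced three-phase line current from I = P / (√3 · V · PF). The sketch below applies them to the 132 kW rack figure quoted earlier. Everything else is an assumption chosen for illustration: that all heat leaves through the liquid loop, that the coolant behaves like plain water, a 10 K coolant temperature rise, a 415 V three-phase feed, and a 0.95 power factor. The function names are hypothetical.

```python
import math

WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0  # approximate value for liquid water


def coolant_mass_flow_kg_s(heat_w: float, delta_t_k: float,
                           specific_heat: float = WATER_SPECIFIC_HEAT_J_PER_KG_K) -> float:
    """Mass flow needed to carry away heat_w: m_dot = Q / (c_p * dT)."""
    if delta_t_k <= 0:
        raise ValueError("coolant temperature rise must be positive")
    return heat_w / (specific_heat * delta_t_k)


def three_phase_current_a(power_w: float, line_voltage_v: float,
                          power_factor: float = 0.95) -> float:
    """Line current of a balanced three-phase load: I = P / (sqrt(3) * V * PF)."""
    if line_voltage_v <= 0 or not 0 < power_factor <= 1:
        raise ValueError("voltage must be positive and power factor in (0, 1]")
    return power_w / (math.sqrt(3) * line_voltage_v * power_factor)


rack_w = 132_000  # the 132 kW rack rating quoted earlier in this post

flow = coolant_mass_flow_kg_s(rack_w, delta_t_k=10.0)       # assumed 10 K rise
amps = three_phase_current_a(rack_w, line_voltage_v=415.0)  # assumed 415 V feed

# Water is roughly 1 kg per litre, so kg/s * 60 approximates litres per minute.
print(f"Coolant flow: {flow:.2f} kg/s (about {flow * 60:.0f} L/min of water)")
print(f"Line current: {amps:.0f} A per phase")
```

Real designs would add safety margins, use the actual coolant properties (often a water-glycol mix), and account for the share of heat still removed by air, so treat the output as an order-of-magnitude guide only.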
So What’s Coming in the Data Centers of the Future?
The trend towards higher rack densities is expected to continue as AI and HPC workloads become more prevalent [CoreSite].
Data centers will need to adapt by implementing more efficient cooling technologies and power distribution systems.
- Advanced liquid cooling, including rack-scale, server-level and chip-level distribution and quick-connect systems
- Direct Liquid Cooling
- Micro filtration to prevent jet fouling
- Highly managed and more sustainable water lagoons to reduce the impact on public water facilities
New locations and more specialization: training and model-building data centers will sit near power exporters, while inference centers will be smaller and closer to populations. Data Gravity and Latency Sensitivity continue to drive critical decisions, on par with the physical constraints of power, water and space.
And let’s not forget the critical people aspects of these changes! We need more electricians and plumbers, frequently with physics (E-MAG), analytics and even software skills. There is a new control plane arriving, and we need people to build, optimize and operate it!