Compute at Scale

A broad investigation into the data center industry

Konstantin Pilz, Lennart Heim

July 2023 

Abstract

This report characterizes the data center industry and its importance for AI development. Data centers are industrial facilities that efficiently provide compute at scale and thus constitute the engine rooms of today’s digital economy. As large-scale AI training and inference become increasingly computationally expensive, they are predominantly executed in this dedicated infrastructure. Key features of data centers include large-scale compute clusters that require extensive cooling and consume large amounts of power, the need for fast connectivity both within the data center and to the internet, and an emphasis on security and reliability. The global industry is valued at approximately $250B and is expected to double over the next seven years. There are likely about 500 large (>10 MW) data centers globally, with the US, Europe, and China constituting the most important markets. The report further covers important actors, business models, main inputs, and typical locations of data centers.

Preview

Main summary

This report provides an overview of the data center industry and its relationship with AI development. It primarily reviews existing literature, only occasionally drawing conclusions for AI governance (see Scope). A subsequent piece comments on the role data centers may play in mitigating risks from advanced AI systems.

Data centers are purpose-built industrial facilities that host hardware at scale and thus efficiently provide computational resources (compute). They primarily run the various internet services required for banking, web browsing, online gaming, communications, video streaming, and more. In addition, some data centers host high-performance compute clusters that run computationally intensive workloads such as scientific simulations and machine learning (ML). Data centers are an integral part of the (AI) compute supply chain and constitute the link between the semiconductor industry and compute end-users. Right now, the reader is likely interacting with several data centers: by accessing this document, receiving text messages, synchronizing files, or updating newsfeeds. (More in section What are data centers?.)

Key features of a large data center include:
- Tens to hundreds of MW of power consumption, similar to that of a medium-sized city (~100,000 inhabitants).
- Extensive heat production, requiring immense cooling systems that consume water and additional power.
- An emphasis on redundant components and backup systems, such as power generators, to ensure high reliability.
- Physical security measures to prevent unauthorized access.
- Spatial requirements similar to those of other industrial facilities, at 10,000 to 100,000 sqm, the equivalent of several football pitches.
- High-speed data transmission, requiring low-latency, high-bandwidth connections both within the data center and from/to it.
- Complex supply chain management due to the high number of specialized inputs.
(More in section Key characteristics.)

Today, AI development increasingly depends on data centers: training large ML systems requires dedicated compute clusters of thousands of AI accelerators with high-bandwidth interconnect, operated from large data centers. Further, the efficient deployment of ML models at scale (e.g., offering ChatGPT as a service) similarly requires this dedicated infrastructure. Due to economies of scale, ML compute is becoming increasingly centralized in clusters. Understanding the data center industry hence sheds light on the global distribution of (ML) compute and on which actors can, in principle, train large ML systems. The industry’s potential for growth also determines how quickly AI technology can be adopted widely. Further, data centers may present a future target for monitoring and regulating AI development and deployment. (More in sections Data center's relevance for AI governance, Shift to the cloud, and the accompanying comment.)

Data centers can roughly be divided into (i) 60% self-owned, “enterprise” data centers, where the hardware owner also owns and operates the facility, and (ii) 40% shared “colocation” data centers, where a specialized company owns and operates the infrastructure (power, cooling, connectivity, security, backup systems) to host the hardware of other entities. In both configurations, the hosted hardware can be used directly by its owner, called on-premises, or can provide cloud compute that is rented out online, called off-premises. (More in section Types of data centers. See Figure 6 for a visual overview.)

There are an estimated 110-225 data centers with a power capacity above 100 MW and 225-1,100 large data centers with a capacity of 10-100 MW, the size that could currently host an AI compute cluster for a major training run. These large data centers are predominantly constructed by tech giants such as Google, Amazon, Microsoft, Meta, and Apple. Including smaller builds starting at 0.1 MW, there are 10,000-30,000 data centers globally. Although data is limited, roughly a third of them are likely in the US, followed by 25% in Europe and 20% in China. While most data centers are close to major cities to allow for low-latency connections, large data centers are increasingly constructed in more remote places due to their spatial and power requirements. (More in sections How many data centers are there?, Locations of data centers, and Most important companies.)

The data center market is valued at about $250B and projected to more than double in the next seven years. The colocation sector is currently shared by more than a dozen smaller actors, with the biggest company, Equinix, accounting for 11%; however, it appears to be slowly becoming more concentrated. Meanwhile, the cloud market is already dominated by Amazon Web Services (AWS) (34%), Microsoft Azure (21%), and Google Cloud (11%). Due to economies of scale, ML applications are increasingly run on cloud services, leading to significant compute aggregations at cloud companies. (More in sections Market size and growth, Most important companies, and Shift to the cloud.)

A typical investment in the supporting infrastructure for a 20 MW data center is around $100-200M (excluding hardware), and a large data center campus (>100 MW) can cost up to a billion dollars. These costs stem from the specialized equipment required, such as cooling infrastructure, power transformers, high-speed connectivity components, and backup systems. (More in section Key inputs for data center construction.)

A simple estimate suggests that operating a large data center costs at least single-digit millions per MW per year, mainly due to the large quantities of power it consumes, but also due to the expensive maintenance of computer hardware and other components. (More in section Key inputs for data center operation.)
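To make these figures concrete, the minimal back-of-envelope sketch below reproduces the arithmetic behind the market, construction, and operating estimates above. Only the $250B-doubling projection, the $100-200M-per-20-MW construction figure, and the per-MW operating range come from this report; all other parameter values (electricity price, PUE, utilization, non-power cost share) are illustrative assumptions.

```python
# Back-of-envelope data center cost sketch. Report figures: market more
# than doubling in seven years; $100-200M of supporting infrastructure
# for a 20 MW facility; operating costs of "at least single-digit
# millions per MW per year". All other values are assumptions.

HOURS_PER_YEAR = 8760

# Market: "more than double in seven years" implies a compound annual
# growth rate of at least 2**(1/7) - 1, i.e. roughly 10.4% per year.
implied_cagr = 2 ** (1 / 7) - 1

# Construction: $100-200M for 20 MW (excluding hardware) implies
# roughly $5-10M of supporting infrastructure per MW of capacity.
capex_per_mw_low = 100e6 / 20
capex_per_mw_high = 200e6 / 20

def annual_opex_per_mw(
    electricity_price=0.08,  # assumed industrial power price, $/kWh
    pue=1.5,                 # assumed power usage effectiveness (total/IT power)
    utilization=0.8,         # assumed average IT load as a fraction of capacity
    non_power_factor=1.5,    # assumed staffing/maintenance/hardware upkeep,
):                           # expressed as a multiple of the power bill
    """Rough annual operating cost in USD per MW of IT capacity."""
    kwh = 1_000 * utilization * pue * HOURS_PER_YEAR  # kWh drawn per MW of capacity
    power_cost = kwh * electricity_price
    return power_cost * (1 + non_power_factor)

print(f"Implied market CAGR: >{implied_cagr:.1%}")
print(f"Capex per MW: ${capex_per_mw_low / 1e6:.0f}-{capex_per_mw_high / 1e6:.0f}M")
print(f"Opex per MW:  ~${annual_opex_per_mw() / 1e6:.1f}M per year")
# -> roughly $2.1M per MW per year under these assumptions, consistent
#    with "at least single-digit millions per MW per year" once hardware
#    refresh cycles and other components are included.
```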
Even in scenarios of explosive demand for AI, global data center capacity is unlikely to grow by more than 40% per year. This is because the high number of specialized inputs needed for construction leads to supply chain bottlenecks, as happened during the COVID-19 pandemic. Furthermore, spare power grid capacity is already limiting growth in several regions, and large cloud providers struggle to find suitable sites for their large data centers. Additionally, technical limits on power consumption and heat dissipation could make future compute clusters increasingly expensive, even absent fast growth. (More in section Limiting factors for data center growth.)

For an interpretation of the main findings of this report in the context of the governance of advanced AI systems, refer to the subsequent piece, An assessment of data center infrastructure's role in AI governance.