Tech giants to train AI models across multiple data centre campuses

Could happen as soon as 2025.

Photo Credit: Google Maps. Old satellite photo of GDS data centre campus in Johor during construction.

In the quest to go bigger, tech giants will soon be training new AI models across multiple data centre campuses.

I wrote previously about how the AI boom is driving a surge in electricity consumption globally, badly straining power grids in places such as the US.

The boom is delaying the retirement of some coal-fired plants, and growing demand is increasingly being backfilled with fossil-fuel expansion.

While a significant proportion of this power is renewable, it may be a case of one step forward and two steps back where sustainability is concerned.

And according to a report from SemiAnalysis, the data centres built to support AI are set to get even larger.

Plans to go even bigger

Leading AI models are currently being trained on clusters of 100,000 GPUs, and clusters of 300,000 GPUs are now in the works for 2024.

The reason? LLM scaling shows no sign of plateauing, according to OpenAI CTO Mira Murati.

Scaling up further is being held back by physical constraints, however:

  • Permitting and regulations.
  • Construction timelines.
  • Power availability.

This means that training the latest AI models at a single site is increasingly infeasible, even within a campus comprising multiple data centre buildings.
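To see why power in particular bites, here is a rough back-of-envelope sketch in Python. The per-GPU draw, server overhead and PUE figures are assumptions chosen for illustration, not numbers from the SemiAnalysis report.

```python
# Back-of-envelope: facility power needed for a large training cluster.
# All per-GPU figures below are illustrative assumptions, not reported numbers.

GPU_TDP_KW = 0.7       # assumed accelerator power draw, ~700 W per GPU
SERVER_OVERHEAD = 1.5  # assumed CPU, memory and networking overhead per GPU
PUE = 1.2              # assumed data centre power usage effectiveness

def campus_power_mw(num_gpus: int) -> float:
    """Estimate total facility power in megawatts for a GPU cluster."""
    it_load_kw = num_gpus * GPU_TDP_KW * SERVER_OVERHEAD
    return it_load_kw * PUE / 1000  # kW -> MW

for n in (100_000, 300_000):
    print(f"{n:>7,} GPUs -> ~{campus_power_mw(n):,.0f} MW of facility power")

# Under these assumptions, 100,000 GPUs need roughly 126 MW and 300,000 GPUs
# roughly 378 MW, which is why a single campus's grid connection quickly
# becomes the bottleneck.
```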

Multi-campus AI training

The solution? Training them across multiple data centres or multiple data centre campuses.

Google is apparently ahead here: it has been training its AI models across multiple data centres using its in-house TPUs, though it lags behind on model architecture, something set to change with Gemini 2.

AI training across multiple data centres presents additional problems, namely:

  • Communication overhead to synchronise model replicas across sites.
  • Higher latency between sites, which makes synchronous training less practical.

Expect greater network capacity to be built between data centre campuses for cross-data centre training, along with a shift towards asynchronous training.
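To make that trade-off concrete, here is a minimal toy sketch in plain Python. It is not any lab's actual training stack: the one-parameter model, the number of "sites" and the noise levels are all hypothetical. It contrasts a synchronous step, where every site's gradient is averaged before anyone updates, with an asynchronous round, where updates land as they arrive and later gradients are computed on stale parameters.

```python
import random

LEARNING_RATE = 0.1
TARGET = 3.0  # value the toy one-parameter model should learn

def local_gradient(weight: float) -> float:
    """Gradient of a squared-error loss, with noise standing in for the
    different data shards held at different sites."""
    return 2 * (weight - TARGET) + random.gauss(0, 0.1)

def synchronous_step(weight: float, num_sites: int) -> float:
    # Every site computes a gradient on the SAME weights, then an all-reduce
    # averages them before anyone updates; the step is gated by the slowest,
    # most distant site, which is why inter-campus latency hurts here.
    grads = [local_gradient(weight) for _ in range(num_sites)]
    return weight - LEARNING_RATE * sum(grads) / num_sites

def asynchronous_round(weight: float, num_sites: int) -> float:
    # Each site computes its gradient on a snapshot taken at the start of the
    # round and pushes its update whenever it arrives; the shared weights keep
    # moving, so later updates are based on stale parameters, the price of not
    # waiting on inter-campus latency.
    snapshot = weight
    for _ in range(num_sites):
        weight -= LEARNING_RATE * local_gradient(snapshot)
    return weight

w_sync = w_async = 0.0
for _ in range(50):
    w_sync = synchronous_step(w_sync, num_sites=4)
    w_async = asynchronous_round(w_async, num_sites=4)

print(f"synchronous:  {w_sync:.3f}")   # both end up near TARGET = 3.0,
print(f"asynchronous: {w_async:.3f}")  # despite very different coordination costs
```

In this toy both variants converge; the point is that the asynchronous version never waits for the other sites, trading gradient staleness for tolerance of slow, long-haul links.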

Where does this end?

For now, larger data centres continue to be built:

  • SemiAnalysis tracked down a Google campus of three data centres with up to 1 gigawatt (GW) of power capacity. It is expected to go live by end-2025.
  • Microsoft and OpenAI are frantically constructing ultra-dense liquid-cooled data centres to overtake Google in AI training and inference capacity.

Do you think we will run up against the limits of AI training soon?