In the race to dominate the world of generative AI, Meta is determined to outpace its rivals and is pouring billions of dollars into its own AI endeavors. While some of this investment is dedicated to recruiting top AI researchers, the majority is being channeled into the development of hardware, specifically chips designed to power and train Meta’s AI models.
Today, Meta unveiled the latest result of its chip development efforts: the “next-gen” Meta Training and Inference Accelerator (MTIA), the successor to last year’s MTIA v1. The chip runs models, including those used for ranking and recommending display ads on Meta’s platforms (such as Facebook).
Compared to its predecessor, which was built on a 7nm process, the next-gen MTIA is built on a 5nm process. (In chip manufacturing, “process” refers to the manufacturing technology node, which roughly tracks the size of the smallest features that can be fabricated on the chip.) The next-gen MTIA is not only physically larger, with more processing cores, but also packs more internal memory (128MB, up from 64MB) and runs at a higher average clock speed (1.35GHz, up from 800MHz). That added performance comes at a cost, however: the chip draws 90W, compared to the previous 25W.
Meta reports that the next-gen MTIA is currently live in 16 of its data center regions and offers up to 3x better overall performance compared to the MTIA v1. While the company would not disclose specific details about this “3x” improvement, it did mention that it was based on testing the performance of “four key models” across both chips.
“Because we have control over the entire stack, we are able to achieve greater efficiency compared to commercially available GPUs,” Meta stated in a blog post shared with TechCrunch.
Notably, Meta revealed that it isn’t currently using the next-gen MTIA for generative AI training workloads, though the company says it has “several programs underway” exploring that potential. Meta also acknowledges that the next-gen MTIA won’t replace GPUs for running or training models; instead, the two will work together to enhance performance.
Reading between the lines, Meta appears to be moving cautiously, perhaps more slowly than it would like.
Given the considerable costs involved, there is undoubtedly pressure on Meta’s AI teams to bring expenses down. The company is projected to spend an estimated $18 billion on GPUs by the end of 2024 to power and train its generative AI models, and with the training costs for cutting-edge AI models often reaching tens of millions of dollars, in-house hardware presents a tempting alternative.
Meanwhile, as Meta’s hardware efforts drag on, its competitors continue to surge ahead, much to the frustration, I would suspect, of Meta’s leadership.
Just this week, Google announced the general availability of its latest custom chip for training AI models, TPU v5p, to its Google Cloud customers, and also revealed its first dedicated chip for running models, Axion. Amazon has several custom AI chip families under its belt, and Microsoft entered the competition last year with the Azure Maia AI Accelerator and the Azure Cobalt 100 CPU.
In its blog post, Meta claims that it took less than nine months to go from “first silicon to production models” of the next-gen MTIA, which is admittedly shorter than the usual timeline for Google TPUs. But Meta still has a long way to go if it hopes to achieve a degree of independence from third-party GPUs and keep pace with its fierce competitors.