Meta has just announced the latest addition to its Llama series of open source generative AI models: Llama 3. The company has released not just one model but two from the new Llama 3 family, with more promised to come.
By Meta’s own account, the new models – Llama 3 8B and Llama 3 70B – mark a significant jump in performance over their predecessors, Llama 2 7B and Llama 2 70B. Parameters largely determine an AI model’s skill on problems such as analyzing and generating text; generally speaking, models with higher parameter counts are more capable than those with lower ones. Trained on custom-built clusters of 24,000 GPUs, Llama 3 8B and Llama 3 70B are, Meta claims, among the top-performing generative AI models currently available at their respective parameter counts.
Of course, such claims are bound to raise eyebrows. So, how does Meta back this up? It points to the scores of the Llama 3 models on popular AI benchmarks like MMLU (which assesses knowledge), ARC (which evaluates skill acquisition), and DROP (which measures a model’s reasoning over text chunks). These benchmarks have faced criticism regarding their usefulness and validity. But they are still widely used by AI players like Meta to evaluate their models.
Llama 3 8B outperforms other open source models like Mistral’s Mistral 7B and Google’s Gemma 7B, both of which have 7 billion parameters, on at least nine benchmarks: MMLU, ARC, DROP, GPQA (a set of biology, physics, and chemistry questions), HumanEval (a test for code generation), GSM-8K (math word problems), MATH (another mathematical benchmark), AGIEval (a problem-solving test set), and BIG-Bench Hard (an evaluation of commonsense reasoning).
It’s worth noting that Mistral 7B and Gemma 7B are not exactly cutting-edge; Mistral 7B was released back in September of last year. Moreover, Llama 3 8B scores only a few percentage points higher than the other two on several of the benchmarks Meta cites. The company also claims, however, that its larger model, Llama 3 70B, is on par with flagship generative AI models like Gemini 1.5 Pro from Google’s Gemini series.
On MMLU, HumanEval, and GSM-8K, Llama 3 70B outperforms Gemini 1.5 Pro. While it cannot match the performance of Anthropic’s top model, Claude 3 Opus, it does still outscore the weakest model in that series, Claude 3 Sonnet, on five benchmarks (MMLU, GPQA, HumanEval, GSM-8K, and MATH).
Meta has also developed a new test set of its own, covering various use cases such as coding, creative writing, reasoning, and summarization. Unsurprisingly, Llama 3 70B beat Mistral’s Mistral Medium model, OpenAI’s GPT-3.5, and Claude Sonnet. According to Meta, its modeling teams were not allowed access to this test set to maintain objectivity. However, given that it was created by Meta itself, the results should be taken with a grain of salt.
Meta claims that users of the new Llama models can expect more flexibility, fewer refusals to answer questions, and higher accuracy on trivia questions, historical questions, and STEM fields like engineering and science, as well as better coding recommendations. To achieve this, the models have been trained on an incredibly large dataset of 15 trillion tokens, seven times the size of Llama 2’s training set and equivalent to roughly 11 trillion words. In the AI world, tokens are subdivided bits of raw data, such as the syllables “fan,” “tas,” and “tic” in the word “fantastic.”
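As a rough sanity check on that scale, the token count can be converted to an approximate word count. The sketch below assumes about 0.75 English words per token, a common rule of thumb rather than an exact figure; real ratios vary by tokenizer and language.

```python
# Back-of-the-envelope conversion from tokens to words.
# Assumption: one token corresponds to roughly 0.75 English words
# (a rough rule of thumb; actual ratios depend on the tokenizer).
TOKENS = 15_000_000_000_000   # Llama 3's reported training-set size
WORDS_PER_TOKEN = 0.75        # approximate average, not exact

approx_words = TOKENS * WORDS_PER_TOKEN
print(f"~{approx_words / 1e12:.1f} trillion words")  # → ~11.2 trillion words
```

By the same rule of thumb, Llama 2’s roughly 2-trillion-token training set works out to on the order of 1.5 trillion words.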
But where did this massive dataset come from? Unfortunately, Meta is keeping mum on that front. The company has only revealed that it used “publicly available sources” and included four times more code in the training data set compared to Llama 2. This data set also contains 5% non-English data in around 30 languages to improve the model’s performance in languages other than English. Additionally, Meta has admitted to using synthetic data, i.e., AI-generated data, to create longer documents for Llama 3 models to train on. This approach has been somewhat controversial due to potential performance drawbacks.
In a blog post shared with TechCrunch, Meta writes, “While the models we are releasing today are only fine-tuned for English outputs, the increased data diversity helps the models better recognize nuances and patterns, resulting in strong performance across a wide range of tasks.”
Dataset details are closely guarded by most generative AI vendors, as they are considered a competitive advantage. This secrecy is also related to the possibility of facing intellectual property-related lawsuits. Recently, it was reported that, in its quest to keep up with its AI rivals, Meta used copyrighted e-books for training, despite receiving warnings from its own legal team. Meta’s and OpenAI’s alleged unauthorized use of copyrighted data for training has led to an ongoing lawsuit filed by authors such as comedian Sarah Silverman.
One significant concern with generative AI models, including Llama 2, is toxicity and bias. Has Llama 3 improved in these areas? Meta claims it has.
The company has developed new data-filtering procedures to enhance the quality of its model training data. It has also updated its two generative AI safety suites, Llama Guard and CybersecEval, to help detect misuse of and unwanted text generations from Llama 3 models and others. Additionally, Meta is launching a new tool, Code Shield, designed to identify potential security vulnerabilities in code generated by generative AI models.
However, filtering processes are not foolproof, and tools like Llama Guard, CybersecEval, and Code Shield can only go so far. (Remember Llama 2’s tendency to create fake answers to questions and disclose private health and financial information?) We will need to see how the Llama 3 models perform in real-world scenarios, including testing by academics on alternative benchmarks.
According to Meta, the Llama 3 models are available for download now. They are also powering the company’s Meta AI assistant on various platforms like Facebook, Instagram, WhatsApp, Messenger, and the web. Soon the models will be made available across different cloud platforms like AWS, Databricks, Google Cloud, Hugging Face, Kaggle, IBM’s WatsonX, Microsoft Azure, Nvidia’s NIM, and Snowflake. Apart from that, versions of the models optimized for hardware from AMD, AWS, Dell, Intel, Nvidia, and Qualcomm will also be launched in the future.
But more advanced models are already in the works. Meta says it is currently training Llama 3 models with over 400 billion parameters – models that will be able to converse in multiple languages, take in more data, and understand images and other modalities in addition to text. This would bring the Llama 3 series in line with other open releases like Hugging Face’s Idefics2.
Meta concludes in its blog post, “Our goal in the near future is to make Llama 3 multilingual and multimodal, with longer context and continuing to enhance overall performance in crucial large language model capabilities like reasoning and coding. There’s a lot more to come!”