In the fast-evolving AI industry, there is a growing trend toward generative models with longer contexts. These models, however, tend to be extremely compute-intensive.
The CEO of AI startup AI21 Labs, Ori Goshen, believes that this doesn’t have to be the case. In fact, his company is releasing a new generative model that aims to prove it.
“Models with large context windows tend to be very compute-intensive. But with our new model, we are challenging this notion and showing that it is possible to have a powerful generative AI model without sacrificing efficiency.”
The terms “context” and “context window” refer to the input data, such as text, that a model considers before generating an output. Models with small context windows tend to forget even recent conversations, while models with larger contexts can better follow the flow of the data they take in.
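To make the idea concrete, here is a minimal, hypothetical sketch (not AI21’s code) of what a context window does: it caps how much recent history a model can see at once. Real systems count subword tokens produced by a tokenizer; approximating one token per word is enough for illustration.

```python
def fit_to_context(messages: list[str], max_tokens: int) -> list[str]:
    """Keep only the most recent messages that fit inside the token budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = len(message.split())     # crude one-token-per-word estimate
        if used + cost > max_tokens:
            break                       # older history falls out of the window
        kept.append(message)
        used += cost
    return list(reversed(kept))

history = ["Hi, I'm planning a trip to Lisbon.",
           "Great! When are you going?",
           "In June, for five days.",
           "What should I pack?"]
# A small window drops the oldest messages, so the Lisbon detail is forgotten:
print(fit_to_context(history, max_tokens=12))
```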
The latest addition to AI21 Labs’ repertoire, Jamba, is a text-generating and -analyzing model that is comparable to popular models like OpenAI’s ChatGPT and Google’s Gemini. Jamba has been trained on a mix of public and proprietary data and is capable of writing text in English, French, Spanish, and Portuguese.
One of Jamba’s most impressive features is its ability to handle up to 140,000 tokens while running on a single GPU with at least 80GB of memory. That is equivalent to around 105,000 words or 210 pages, roughly the length of a decent-sized novel.
For comparison, Meta’s Llama 2 has a 4,096-token context window, which is small by today’s standards, but it only requires a GPU with approximately 12GB of memory to run. (Context windows are usually measured in tokens, which are small units of raw text and data.)
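A quick back-of-the-envelope check of those figures, assuming the common rules of thumb of roughly 0.75 English words per token and about 500 words per page (both ratios are assumptions, not numbers from AI21):

```python
tokens = 140_000
words = tokens * 0.75   # ~0.75 words per token -> 105,000 words
pages = words / 500     # ~500 words per page  -> 210 pages
print(f"{tokens:,} tokens ~= {words:,.0f} words ~= {pages:.0f} pages")
```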
At first glance, Jamba may seem like just another generative AI model, with various similar models already available for download, such as Databricks’ DBRX and Llama 2. But what sets Jamba apart is its unique architecture, which combines two model architectures: transformers and state space models (SSMs).
- Transformers – known for their strength on complex reasoning tasks, these are the architecture behind popular models like OpenAI’s GPT-4 and Google’s Gemini. Their defining feature is the “attention mechanism”: for every output it generates, the model weighs the relevance of every piece of its input (a minimal sketch follows this list).
- SSMs – these models combine qualities of older architectures, such as recurrent neural networks and convolutional neural networks, into a more computationally efficient design that can handle long sequences of data (sketched second below).
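For the curious, here is a minimal sketch of scaled dot-product self-attention, the mechanism described above. It uses no learned weights, so the raw inputs serve as queries, keys, and values:

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention, no learned projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])          # relevance of every token to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ x                               # relevance-weighted mix of the inputs

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))      # toy sequence: 6 tokens, 8-dim embeddings
print(self_attention(tokens).shape)   # (6, 8): one output vector per input token
```

Note that the `scores` matrix is seq_len × seq_len, so doubling the context quadruples that cost; this is exactly why long-context transformers are so compute-intensive.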
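And here is the basic linear state space recurrence that SSM architectures build on. Mamba adds input-dependent (“selective”) parameters on top of this, so treat the sketch as the textbook backbone rather than Mamba itself:

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """Plain linear SSM run as a recurrence:
    h_t = A h_{t-1} + B x_t,  y_t = C h_t."""
    h = np.zeros(A.shape[0])
    outputs = []
    for x in xs:               # one fixed-size state update per token
        h = A @ h + B @ x
        outputs.append(C @ h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)                 # state transition (slowly decaying memory)
B = rng.normal(size=(4, 2))         # input projection
C = rng.normal(size=(1, 4))         # output readout
xs = rng.normal(size=(10, 2))       # 10-step toy input sequence
print(ssm_scan(A, B, C, xs).shape)  # (10, 1)
```

Because each step only updates a fixed-size state, the work grows linearly with sequence length rather than quadratically.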
Although SSMs have their limitations, early incarnations such as Mamba (developed by researchers at Princeton and Carnegie Mellon) have shown impressive results, handling larger inputs than comparable transformer-based models while outperforming them on language-generation tasks.
In fact, Jamba uses Mamba as part of its core model, and Goshen claims that it delivers three times the throughput of comparably sized transformer-based models when dealing with long contexts.
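The article does not spell out Jamba’s exact layer layout, so the following is purely a hypothetical illustration of the hybrid idea: stack mostly linear-cost SSM blocks and insert an attention block every few layers (the `attention_every` ratio here is an assumption, not Jamba’s published spec):

```python
def hybrid_stack(n_layers, attention_every):
    """Sketch of a hybrid layer schedule: mostly cheap SSM blocks,
    with a periodic attention block (ratio is illustrative only)."""
    return ["attention" if (i + 1) % attention_every == 0 else "ssm"
            for i in range(n_layers)]

print(hybrid_stack(n_layers=8, attention_every=4))
# ['ssm', 'ssm', 'ssm', 'attention', 'ssm', 'ssm', 'ssm', 'attention']
```

Each “ssm” entry stands in for a Mamba-style layer like the recurrence sketched earlier, and each “attention” entry for a transformer layer; the intuition is that a few attention layers preserve transformer-grade reasoning while the SSM layers keep the per-token cost low.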
In an interview with TechCrunch, Goshen stated, “While there have been a few examples of SSM models in academic settings, our model is the first to be commercially available and scalable. This innovative architecture not only has potential for further research in the community, but it also provides great efficiency and throughput possibilities.”
Although Jamba has been released under the open-source Apache 2.0 license, which has relatively few usage restrictions, Goshen clarifies that it is a research release and not intended for commercial use. The model does not have any safeguards to prevent it from generating toxic texts or mitigations to address potential biases. However, a fine-tuned and safer version will be released in the near future.
Goshen firmly believes that Jamba is just the beginning of what SSM models can achieve. “With its size and innovative architecture, it can easily fit onto a single GPU. We are confident that with further tweaks and improvements, the performance of Mamba will only get better,” he added.