Efficiently Produce Code with the Powerful StarCoder 2: Utilizing GPUs for Optimal Performance


It is no surprise that developers are eagerly embracing AI-powered code generators, such as GitHub Copilot and Amazon CodeWhisperer, and open access models like Meta’s CodeLlama. However, these tools still have their flaws – some are not free, while others have licenses that prevent them from being used in common commercial circumstances.

Recognizing the need for alternatives, AI startup Hugging Face joined forces with ServiceNow, a workflow automation platform, several years ago to develop StarCoder – an open source code generator with a less restrictive license than its counterparts. The original version was launched early last year, and work has been ongoing for the release of StarCoder 2.

StarCoder 2 is not a single code-generating model, but a family of three variants. The first two can run on most modern consumer GPUs:

  1. A 3-billion-parameter (3B) model trained by ServiceNow
  2. A 7-billion-parameter (7B) model trained by Hugging Face

The third variant is a 15-billion-parameter (15B) model trained by Nvidia, the newest supporter of the StarCoder project. Note that “parameters” are the parts of a model learned from training data; they essentially define the model’s skill at a task, in this case generating code.
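To make the “consumer GPU” point concrete, here is a back-of-the-envelope sketch of how parameter count translates into memory for the weights alone. It is not a figure from the StarCoder 2 team: the byte sizes per precision are standard, but the function names, the 8 GB card, and the 80% headroom factor are illustrative assumptions, and the estimate ignores activations and other runtime overhead.

```python
# Rough estimate of the memory needed just to hold a model's weights.
# Assumption: weights dominate; activations and caches are ignored.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Nominal sizes of the three StarCoder 2 variants named in the article.
VARIANTS = {"3B": 3e9, "7B": 7e9, "15B": 15e9}


def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Return the weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def fits(num_params: float, vram_gb: float,
         precision: str = "fp16", headroom: float = 0.8) -> bool:
    """Check whether the weights fit in a fraction of the card's memory.

    `headroom` is a hypothetical safety margin, not a measured value.
    """
    return weight_memory_gb(num_params, precision) <= vram_gb * headroom


def report(vram_gb: float) -> dict:
    """Map each variant to the precisions under which its weights fit."""
    return {
        name: [p for p in BYTES_PER_PARAM if fits(n, vram_gb, p)]
        for name, n in VARIANTS.items()
    }
```

Under these assumptions, `report(8)` suggests that on a hypothetical 8 GB card the 3B model fits at 16-bit precision, while the 7B and 15B models would need 8-bit or 4-bit quantization to fit at all.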

Like other code generators, StarCoder 2 can suggest ways to complete unfinished lines of code, and it can summarize and retrieve code snippets when prompted in natural language. Trained on four times more data than its predecessor, StarCoder 2 delivers significantly improved performance at lower operating costs, according to Hugging Face, ServiceNow, and Nvidia.

Developers can fine-tune StarCoder 2 in just a few hours using a GPU like the Nvidia A100 on first- or third-party data, resulting in applications such as chatbots and personal coding assistants. And with training on a larger and more diverse dataset (~619 programming languages), StarCoder 2 can make more precise and context-aware predictions – in theory, at least.

Head of ServiceNow’s StarCoder 2 development team, Harm de Vries, emphasized that StarCoder 2 was designed specifically for developers who need to build applications quickly. He stated, “With StarCoder 2, developers can use its capabilities to make coding more efficient without sacrificing speed or quality.”

However, not all developers may agree with De Vries’ claims regarding speed and quality. Code generators promise to streamline coding tasks, but at a cost.

A recent Stanford study found that engineers who use code-generating systems are more likely to introduce security vulnerabilities into their apps. Additionally, a poll by cybersecurity firm Sonatype showed that the majority of developers are concerned about the lack of transparency in how code generators produce code, as well as the sheer volume of generated code, which becomes difficult to manage.

Furthermore, StarCoder 2’s license may pose a barrier for some. It is licensed under the BigCode OpenRAIL-M v1 license, which imposes “light touch” restrictions on both model licensees and downstream users to promote responsible use. While less restrictive than many other licenses, RAIL-M does not permit the use of StarCoder 2 for all applications (medical advice-giving apps, for example). Some experts suggest that RAIL-M’s requirements may be too vague to comply with and could potentially conflict with AI-related regulations, like the EU AI Act.

Putting these issues aside for a moment, is StarCoder 2 truly superior to its competitors – whether free or paid?

According to certain benchmarks, it appears to outperform one of CodeLlama’s versions (CodeLlama 33B) in efficiency. Hugging Face claims that StarCoder 2 15B matches CodeLlama 33B on a subset of code completion tasks, but at twice the speed. It is unclear which tasks were used for this comparison, as Hugging Face did not specify.

As an open source collection of models, StarCoder 2 also offers developers the advantage of deploying locally and “learning” a developer’s source code or codebase, a tempting proposition for those concerned about exposing their code to cloud-hosted AI. In a 2023 survey by Portal26 and CensusWide, 85% of businesses expressed caution about adopting generative AI tools due to privacy and security risks, such as employees sharing sensitive information or vendors training on proprietary data.

Hugging Face, ServiceNow, and Nvidia also argue that StarCoder 2 is more ethically and legally sound than its competitors.

All GenAI models essentially mimic the data they were trained on – a risky concept that could potentially land a developer in legal trouble. With code generators trained on copyrighted code, there is a chance that they could recommend copyrighted code without labeling it as such, even with filters and extra measures in place.

While a few vendors, such as GitHub, Microsoft (GitHub’s parent company), and Amazon, offer legal protection to customers accused of copyright infringement through the use of a code generator, this coverage varies among vendors and is mostly limited to corporate clients.

Unlike other code generators trained on copyrighted code, StarCoder 2 was trained exclusively on data licensed from Software Heritage, a non-profit organization that provides archival services for code. Prior to training, BigCode, the cross-organizational team behind StarCoder 2’s development, gave code owners the opportunity to opt out of the training set.

As with the original StarCoder, the training data for StarCoder 2 is publicly available for developers to inspect, replicate, or audit at their discretion.

Even so, StarCoder 2 is not without its flaws. Like other code generators, it is susceptible to bias. De Vries acknowledges that it can generate code with elements that reinforce stereotypes related to gender and race. Additionally, as StarCoder 2 was predominantly trained on English-language comments and code in Python and Java, it may not perform as well with other natural languages or with “lower-resource” programming languages like Fortran and Haskell.

Nonetheless, Leandro von Werra, a Hugging Face machine learning engineer and co-lead of BigCode, believes that StarCoder 2 is a step in the right direction.

“We strongly believe that creating trust and accountability with AI models requires transparency and auditability of the entire model pipeline, including the training data and process,” he stated. “StarCoder 2 demonstrates how fully open models can still achieve competitive performance.”

Given that all three companies involved – Hugging Face, ServiceNow, and Nvidia – are businesses, one might wonder what their incentive is to invest in a project like StarCoder 2, as training models is not a cheap endeavor.

So far, it seems that they are following a tried-and-true strategy: establish goodwill and develop paid services based on the open source release.

ServiceNow has already used StarCoder to create Now LLM, a code generation service tailored to ServiceNow workflow patterns, use cases, and processes. Hugging Face, which offers model implementation consultancy services, provides hosted versions of the StarCoder 2 models on its platform. Nvidia does the same through an API and web front-end.

For developers interested in the offline experience at no cost, StarCoder 2 – including its models, source code, and more – can be downloaded from the project’s GitHub page.
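For readers who want to try local code completion, the following is a minimal sketch, assuming the 3B variant is published on the Hugging Face Hub under the checkpoint name `bigcode/starcoder2-3b` and that the `torch` and `transformers` packages are installed. Neither the checkpoint name nor the helper function `complete` comes from this article; treat both as illustrative.

```python
# Minimal local-completion sketch using the Hugging Face transformers library.
# Assumptions: the checkpoint name below, and that torch + transformers are
# installed. The first call downloads several gigabytes of weights.

UNFINISHED = "def fibonacci(n):\n    "  # an unfinished snippet to complete


def complete(prompt: str,
             checkpoint: str = "bigcode/starcoder2-3b",
             max_new_tokens: int = 48) -> str:
    """Return the model's suggested continuation of `prompt`."""
    # Imported lazily so this module can be loaded without the heavy deps.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `print(UNFINISHED + complete(UNFINISHED))` would show the snippet followed by the model's suggestion. The sketch reloads the model on every call for simplicity; a real assistant would load it once and reuse it.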

Max Chen

Max Chen is an AI expert and journalist with a focus on the ethical and societal implications of emerging technologies. He has a background in computer science and is known for his clear and concise writing on complex technical topics. He has also written extensively on the potential risks and benefits of AI, and is a frequent speaker on the subject at industry conferences and events.
