Why did Databricks open source its LLM in the form of Dolly 2.0?
Databricks has released an open source-based iteration of its large language model (LLM), dubbed Dolly 2.0, in response to growing demand for generative AI and related applications. Enterprises can license the new release for both research and commercial use.
Databricks’ decision to release a large language model based on open source data can be attributed to enterprises’ demand to control the model and apply it to targeted use cases, in contrast to closed-loop trained models such as ChatGPT, which put constraints on commercial usage, analysts said.
“Because these models (such as Dolly 2.0) are mostly open and do not require months of training on sizable GPU clusters, they are opening up a number of interesting doors for enterprises anxious to build their own, internal generative AI implementation,” said Bradley Shimmin, chief analyst at Omdia.
“These diminutive (as they are trained on a smaller number of parameters) models make heavy use of prompt/response pairs for training data and that’s why they’re perfect for very targeted use cases for companies wishing to control the entire solution, for example, an existing helpdesk database of question/answer pairings,” Shimmin said.
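To make the helpdesk example concrete, the following is a minimal Python sketch of how existing question/answer pairings might be turned into prompt/response records for fine-tuning. The function names, the `prompt` and `response` field names, and the sample entries are hypothetical and chosen only for illustration; they do not describe Databricks' tooling or any particular training pipeline.

```python
import json


def helpdesk_to_records(qa_pairs):
    """Turn (question, answer) tuples into prompt/response dicts.

    Pairs with an empty question or answer are skipped.
    """
    records = []
    for question, answer in qa_pairs:
        question, answer = question.strip(), answer.strip()
        if question and answer:
            records.append({"prompt": question, "response": answer})
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical helpdesk entries, invented for illustration.
pairs = [
    ("How do I reset my password?",
     "Use the 'Forgot password' link on the sign-in page."),
    ("   ", "An answer without a question is dropped."),
]

records = helpdesk_to_records(pairs)
print(to_jsonl(records))
```

JSON Lines is used here only because it is a common interchange format for such records; any format the training code accepts would do.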
Another reason for the demand for open source-based LLMs, according to Amalgam Insights’ chief analyst Hyoun Park, is that models such as Dolly 2.0 let enterprises better track the data governance, residency, and relevance associated with the use cases being supported.
“This is because usage of other models such as OpenAI’s ChatGPT is dependent on APIs and for some enterprises, this dependence can cause compliance, governance or data security issues associated with the APIs,” Park said, citing the irony of the name OpenAI.
Open-source based LLMs have features that can benefit researchers as well as enterprises, according to David Schubmehl, research vice president at IDC.
“Researchers can review, adjust, and improve on open source LLMs more quickly, thereby increasing the potential for innovation,” Schubmehl said. Enterprises can combine these types of LLMs with their own enterprise applications, providing a more interactive interface to those applications, he said.
Dolly 2.0 and other open-source based LLMs will be beneficial to enterprises that are highly regulated, according to Andy Thurai, principal analyst at Constellation Research.
“This is a good first step in showing enterprises how they create and own their models without the need to pay API access fees or share data with LLM providers which can be a huge issue for certain enterprises in the regulated industries,” Thurai said.
However, some analysts have warned against the pitfalls of using open-source based large language models.
“Because of the smaller nature of the parameters and the training set of Dolly-like models, the responses can be rude, short, toxic, and offensive to some. The model was created based on data from ‘The Pile’, which was not cleaned for data bias, sensitivity, unacceptable behaviors, etc.,” Thurai said, adding that Dolly 2.0’s output is currently limited to English.
An open source-based model might not always be the “preferred or a superior approach to closed sourced models from an enterprise standpoint,” said Gartner Vice President and Analyst Arun Chandrasekaran.
“Deploying these models often requires tremendous know-how, continuous iteration, and large infrastructure to train and operate them on,” Chandrasekaran said, citing the dominance of closed models in the generative AI space.
Difference between open-source based and closed LLMs
In contrast to closed LLMs, open source-based models can be used commercially or customized to suit an enterprise’s needs, because the data used to train them is open to public use, analysts said.
Closed models such as ChatGPT are trained on data controlled by their developer, in this case OpenAI. They are offered through a paid API, and their terms of use restrict some commercial uses, such as building competing models from their output.
“The term ‘open LLMs’ could have multiple connotations. The most visible and significant is the access to the source code and deployment flexibility of these models. Beyond that, openness could also include access to model weights, training data sets and how decisions are made in an open and collaborative way,” Chandrasekaran said.
Dolly 2.0 too follows the open source-based model philosophy, according to IDC’s Schubmehl.
“Dolly 2.0 is an LLM where the model, the training code, the dataset, and model weights that it was trained with are all available as open source from Databricks, such that enterprises can make commercial use of it to create their own customized LLM,” Schubmehl said, adding that this approach contrasts with other LLMs, whose developers have not open sourced the individual components the models were built with.
The other point of difference between closed and “open” LLMs is the number of parameters in the models, analysts said, adding that closed LLMs usually have far more parameters.
OpenAI’s GPT-3, for example, has 175 billion parameters, as opposed to Dolly 2.0’s 12 billion. OpenAI has not disclosed the parameter count of GPT-4.
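To give a rough sense of what a parameter count means in practice, the sketch below estimates how much memory the weights of a model of Dolly 2.0's size would occupy. It assumes each parameter is stored as a 16-bit or 32-bit number and counts the weights only, ignoring everything else needed to run or train a model; the function name is hypothetical and the figures are back-of-the-envelope arithmetic, not numbers published by Databricks.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Approximate size of the weights alone, in decimal gigabytes.

    Assumes a fixed number of bytes per parameter (2 for 16-bit storage).
    """
    return num_parameters * bytes_per_parameter / 1e9


dolly_parameters = 12_000_000_000  # 12 billion, as reported for Dolly 2.0

print(weight_memory_gb(dolly_parameters))      # 16-bit weights: 24.0
print(weight_memory_gb(dolly_parameters, 4))   # 32-bit weights: 48.0
```

Under these assumptions the weights come to roughly 24 GB at 16-bit precision, which helps explain why smaller models are easier for an enterprise to host on its own infrastructure.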
How was Dolly 2.0 trained?
Dolly 2.0 builds upon the company’s first release of Dolly, which was trained for $30 using a data set that the Stanford Alpaca team had created using the OpenAI API.
“The data set used to train Dolly 1.0 contained output from ChatGPT, and as the Stanford team pointed out, the terms of service seek to prevent anyone from creating a model that competes with OpenAI,” Databricks said in a blog post.
To get around the issue and create a model for commercial use, Databricks built Dolly 2.0 on a 12-billion-parameter language model from EleutherAI’s Pythia model family.
The model was fine-tuned exclusively on a new, high-quality, human-generated instruction-following dataset, crowdsourced among 5,000 Databricks employees, the company said.
The company calls this dataset of human-generated prompts and responses databricks-dolly-15k. It is released under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
“Anyone can use, modify, or extend this dataset for any purpose, including commercial applications,” the company said, adding that the dataset can be downloaded from its GitHub page.
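As an illustration of what working with such an instruction-following dataset could look like, here is a minimal Python sketch that parses one record and renders it as a single training text. The field names (`instruction`, `context`, `response`, `category`) are assumed here and should be checked against the downloaded file; the sample record is invented, and the section layout produced by the hypothetical `render_example` function is purely illustrative rather than the template Databricks used.

```python
import json


def render_example(record):
    """Render one instruction record as a single training text.

    The optional context section is included only when it is non-empty.
    """
    sections = [("Instruction", record["instruction"])]
    if record.get("context", "").strip():
        sections.append(("Context", record["context"]))
    sections.append(("Response", record["response"]))
    return "\n\n".join(
        f"### {title}:\n{body.strip()}" for title, body in sections
    )


# An invented record in the assumed shape; not taken from the real dataset.
sample_line = json.dumps({
    "instruction": "What is Dolly 2.0?",
    "context": "",
    "response": "An instruction-following large language model released by Databricks.",
    "category": "open_qa",
})

record = json.loads(sample_line)
text = render_example(record)
print(text)
```

In a real workflow the same function would be applied to each line of the downloaded file instead of a hand-written sample.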
The model weights, according to Databricks, can be downloaded from the Databricks Hugging Face page.
How Dolly 2.0 fits into Databricks’ generative AI strategy
Databricks’ move to launch Dolly 2.0 could be seen as a strategy to gain some share of the generative AI business, according to Constellation Research’s Thurai.
“Essentially, a lot of LLM and foundational model business was going to the hyperscalers, each with their own variation — Microsoft with ChatGPT, Google with Bard, and AWS with providing infrastructure, process, tools, and model sharing and catalog via Huggingface partnership. Databricks wanted to get a piece of that business without losing it all,” Thurai said.
Other analysts said that Dolly’s launch is in line with the company’s strategy of bringing open source products to market.
“Databricks is well known for providing various AI tools and services as open source to help its customers get full use of their data and operations. Dolly is an excellent example of providing organizations with options around the latest in AI, i.e., large language models,” IDC’s Schubmehl said.
However, Databricks’ Dolly 2.0 might not have an immediate impact on rivals such as ChatGPT or Bard, according to analysts.
“Dolly or any of these open and open source-based generative AI LLM implementations will [not] fully disrupt the current set of LLMs like Bard, ChatGPT, and Galactica. These solutions have a ready and long-lasting position within large-scale solutions like Google Workspace, Microsoft Office, etc.,” Omdia’s Shimmin said.
Rather, Dolly will end up as a useful companion to the likes of ChatGPT, which will continue to serve as general-purpose tools, according to Amalgam Insights’ Park.
“People will learn to use and prompt generative AI from general use tools and then use Dolly based models for specific use cases where they are seeking to do more detailed and expert work,” Park said.