Owning your own Large Language Model (LLM) offers many benefits, such as control, privacy, performance, and cost advantages. Training from scratch can be costly, but thanks to open-source foundational models it is no longer a must. Optimisation techniques make fine-tuning affordable and can yield results surpassing GPT-4 on certain tasks. While serving an LLM can be challenging, leveraging ML platforms can streamline the process and keep it affordable.
In the most economical scenario, developing a specialised model based on Falcon 7B, fine-tuned on proprietary data, can cost as little as $9.99 (one month of Google Colab Pro). Deploying the model on demand with just a single-GPU machine can keep expenses under control at $1.006/hr.
Despite the popularity of ChatGPT and OpenAI, concerns like data privacy have led many companies to be cautious about adopting them internally.
A prominent solution to address these issues is a self-hosted Large Language Model (LLM), deployable within a secure infrastructure. Utilising open-source projects like Falcon, you can readily obtain a pre-trained model with performance comparable to ChatGPT, at zero training cost for your business. Furthermore, a private LLM can be fine-tuned with proprietary data, significantly enhancing performance on business-specific tasks.
In this article, we will explore various options for obtaining your LLM and provide estimated costs for each approach.
2. Benefits of hosting your own LLM
Having your own LLM provides numerous advantages across various areas, which can be broadly categorised into data protection, performance, and strategic benefits.
Data protection benefits:
- With increasing regulations like the AI Act and user data protection measures, having an in-house LLM allows companies to maintain control over sensitive data.
- Publicly available LLMs are likely to face more regulations, leading to reduced performance and increased costs due to the inclusion of disclaimers and safety measures.
- By training the LLM on trusted and verified sources, companies can ensure that models used in sensitive areas like healthcare do not exhibit bias based on unreliable data.
Performance benefits:
- Companies can fine-tune the LLM using their proprietary data, tailoring it to specific tasks and improving its performance. For self-hosted models, fine-tuning is much cheaper.
- Owning your model provides control over response time through infrastructure improvements, while third-party models may queue requests during peak hours.
- Processing a large volume of documents or running complex LLMs through third-party APIs can be costly, especially when scaling up to handle thousands of documents per day.
Strategic benefits:
- Owning the data used to train the LLM creates a competitive advantage: a proprietary dataset can serve as a moat against competitors.
- By selecting or designing a model that aligns with their needs, such as adjusting size or context window, companies gain greater control, flexibility, and cost efficiency.
- Having control over your model and its versioning assures stable quality, whereas with a provider the results can change between updates. According to this paper, GPT-4's results are declining over time, possibly to reduce costs.
In the upcoming sections, we will explore training and hosting options for your LLM, along with estimated costs.
3. Cost of training your private LLM
Having understood the advantages of owning your LLM, let’s now delve into the cost considerations and available options.
Hardware costs can be split into two parts: training and serving. Training costs depend strongly on the model size and training parameters. GPUs — the shovels of the AI gold rush — play a significant role in both, particularly given the current scarcity of GPUs on the market.
Larger models come with higher training costs, but their superiority over smaller models is not guaranteed. General-purpose models like LLaMA or GPT are trained on diverse data to handle various tasks. Moveworks published research showing that smaller models customised for a specific task can be as effective as general-purpose models up to 10x their size.
“However, from our experiments, we have found that LLMs fine-tuned on enterprise-specific tasks and a corpus can understand enterprise-specific language and excel in enterprise-specific tasks as well as a GPT model — even when the model size is 10X smaller!”
Moveworks did a great job creating a benchmark to accurately gauge the performance of various LLMs in enterprise applications. They trained their own proprietary LLM that outperformed bigger models on enterprise-specific tasks using internal and external datasets.
Training an LLM from scratch
To estimate model training cost, we first need to determine the model architecture, with a primary focus on model size. Selecting the optimal model size is a challenging task and depends on the specific task the model aims to accomplish. For instance, colossal models like GPT-4 are designed to cater to a vast array of tasks and possess extensive knowledge across numerous subjects. For highly specialised tasks, however, the model size can be reduced significantly.
A good example of when to train a model from scratch is BloombergGPT, where internet data was mixed with Bloomberg's proprietary data to reach SOTA performance on financial tasks.
“The resulting model was validated on existing finance-specific NLP benchmarks, a suite of Bloomberg internal benchmarks, and broad categories of general-purpose NLP tasks from popular benchmarks (e.g., BIG-bench Hard, Knowledge Assessments, Reading Comprehension, and Linguistic Tasks). Notably, the BloombergGPT model outperforms existing open models of a similar size on financial tasks by large margins, while still performing on par or better on general NLP benchmarks.”
Training an LLM from scratch costs
For our analysis, let's take the example of the Falcon 7B model, an average-sized LLM that demonstrates solid performance. Remarkably, this model fits on a single common Nvidia V100 GPU, making it accessible for various applications.
MosaicML (acquired by Databricks for $1.3B) is a platform that enables you to easily train and deploy LLMs. According to its pricing, a GPT-3 clone with 30B parameters can be trained at a significantly lower cost of approximately $450,000. Furthermore, a smaller yet powerful 7B model can be trained for as little as $30,000 while still delivering performance comparable to its more expensive counterparts on specific tasks.
Using Open Source (OSS) model
The availability of open-source (OSS) foundational language models has significantly reduced the cost of custom LLM development. Organisations can now utilise pre-trained models like the famous LLaMA-2 70B or more accessible options like our Falcon 7B, and fine-tune them with proprietary data to align with their specific goals.
The rapidly evolving AI landscape sees new models released regularly, providing a wealth of options for various applications. To stay up to date and select the best option, check the HF leaderboard, where OSS models are listed and evaluated.
Fine-tuning OSS model costs
Achieving a new state-of-the-art (SOTA) on a research dataset by fine-tuning an open-source model can be cost-effective. For instance, CRFM Stanford used a Hugging Face GPT model and PubMed data, leveraging the MosaicML Platform to create a 2.7B-parameter model that surpassed the SOTA on the MedQA-USMLE evaluation. The total cost for this achievement was about $38k.
“Using our software stack, we orchestrated training on top of a cluster with 128 NVIDIA A100–40GB GPUs and 1600 Gb/s networking bandwidth between nodes. The physical GPUs were hosted on a leading cloud provider. The total training time for BioMedLM was ~6.25 days. Using placeholder pricing of $2/A100/hr, the total cost for this training run on the MosaicML platform was ~$38,000.”
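The quoted figure can be sanity-checked with simple arithmetic: GPU count times wall-clock hours times the hourly price.

```python
# Back-of-the-envelope check of the BioMedLM training cost quoted above.
num_gpus = 128            # NVIDIA A100-40GB GPUs in the cluster
days = 6.25               # total wall-clock training time
price_per_gpu_hour = 2.0  # placeholder $2/A100/hr from the quote

gpu_hours = num_gpus * days * 24
total_cost = gpu_hours * price_per_gpu_hour
print(f"{gpu_hours:,.0f} GPU-hours -> ${total_cost:,.0f}")  # 19,200 GPU-hours -> $38,400
```

The result, $38,400, matches the ~$38k figure quoted above.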
Smart (QLoRa) Fine-tuning costs
Fine-tuning larger models, such as a 65-billion-parameter model, can be costly due to the substantial GPU memory requirements (more than 10 A100 80GB GPUs). However, methods like LoRa or QLoRa offer efficient solutions. LoRa demonstrates that adapting the model does not require retraining the foundational model's weights, which reduces computational expenses. QLoRa further enhances the process with computational tricks, maintaining performance while significantly improving efficiency. With QLoRa, fine-tuning can be accomplished on just one A100 GPU. More in this paper.
Employing the QLoRa technique to fine-tune a model like Falcon 7B can be achieved cost-effectively using Google Colab Pro, which costs $9.99 per month and can be cancelled anytime. Alternatively, fine-tuning on a PC equipped with a graphics card with at least 16 GB of VRAM is another viable option. This setup enables efficient and budget-friendly fine-tuning of large language models with results comparable to traditional fine-tuning methods.
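A rough memory estimate illustrates why a single 16 GB card can be enough with QLoRa. The numbers below are my own back-of-the-envelope assumptions (adapter size, optimiser state), not measured values:

```python
# Rough VRAM estimate for QLoRa fine-tuning of a 7B-parameter model.
params = 7e9

fp16_weights_gb = params * 2 / 1e9    # full fine-tuning baseline: 2 bytes/param
nf4_weights_gb = params * 0.5 / 1e9   # 4-bit quantised base weights: 0.5 bytes/param

lora_params = 0.01 * params           # assumption: adapters are ~1% of the base model
# LoRa adapter weights, gradients (fp16) and Adam optimiser state (fp32 moments):
adapter_overhead_gb = lora_params * (2 + 2 + 8) / 1e9

print(f"fp16 weights alone: {fp16_weights_gb:.1f} GB")   # 14.0 GB
print(f"4-bit base weights: {nf4_weights_gb:.1f} GB")    # 3.5 GB
print(f"adapter overhead:   {adapter_overhead_gb:.1f} GB")  # 0.8 GB
```

Even with activations and the attention cache on top, the quantised setup stays well within a 16 GB consumer card, while full fp16 fine-tuning would not fit at all.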
The following section will unveil the actual cost associated with deploying a trained model and utilising it effectively.
4. Cost of serving your own LLM
Having explored the advantages and training-related expenses of owning a self-hosted LLM in the previous sections, it is now time to shed light on the costs associated with deploying and utilising a trained model.
Although fine-tuning results are encouraging, achieving a usable and scalable model involves more considerations. Availability remains challenging, even for well-funded platforms like ChatGPT. Scaling language models for reliable chat agents demands complex engineering and continuous, potentially costly investments in infrastructure to ensure smooth performance during peak usage. Scaling infrastructure along with user traffic can result in significant costs.
Unlike training or fine-tuning, which are one-time or occasional expenses, the ongoing operational costs of serving the model can be significant, directly influenced by the volume of users and interactions with the system. Companies must carefully manage and optimise resources to handle varying levels of traffic while ensuring efficient and cost-effective LLM usage. A well-calculated and optimised approach is essential for such projects.
When estimating the costs of serving an LLM, you have to consider the following:
- Model architecture — ideally opting for the smallest model that fulfils the task and optimisation techniques (quantisation, pruning, distillation), along with effective parallelisation (DataParallel, TensorParallel, ZeRO). These factors impact resource requirements and overall expenses.
- Output length — The average model output length significantly affects serving costs alongside the number of requests. Depending on the model’s training, responses can vary in length, ranging from short answers to more elaborate ones. Models producing more tokens per request have higher costs due to increased computation time required for generating each token.
- Batching — Batching is a critical aspect when serving an LLM. Bigger batch sizes boost throughput but can also introduce higher latency. Employing techniques such as continuous batching can yield remarkable throughput improvements of up to 23 times.
- Autoscaling — In the context of model autoscaling, startup time poses significant challenges, especially due to slow loading times, which can exceed 15 minutes. Lengthy startup times can make autoscaling ineffective. To mitigate this, storing the model locally and utilising fast drives can help expedite the startup process, ensuring more efficient autoscaling.
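The factors above can be folded into a single back-of-the-envelope cost per token: hourly machine price divided by sustained throughput. The throughput and output-length figures below are hypothetical assumptions for illustration only:

```python
# Hypothetical serving-cost estimate from hourly price and throughput.
machine_price_per_hr = 1.006   # on-demand single-GPU instance, $/hr
tokens_per_second = 30         # assumed sustained generation throughput with batching
avg_output_tokens = 250        # assumed average response length

tokens_per_hour = tokens_per_second * 3600
cost_per_1k_tokens = machine_price_per_hr / tokens_per_hour * 1000
cost_per_request = cost_per_1k_tokens * avg_output_tokens / 1000
print(f"${cost_per_1k_tokens:.4f} per 1k tokens, ${cost_per_request:.4f} per request")
```

Under these assumptions the machine produces 108,000 tokens per hour, i.e. roughly $0.009 per 1k tokens; doubling throughput via better batching halves the per-token cost directly.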
Serving machines costs
An LLM typically requires roughly 2 bytes of GPU memory per parameter. For instance, our Falcon model with 7 billion parameters would need approximately 14 GB of GPU memory, assuming one 16-bit float (2 bytes) per parameter. This means it fits on a single Nvidia V100 card.
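The rule of thumb above can be written as a tiny helper. This is a sketch: real memory use adds overhead for the KV cache, activations, and framework buffers.

```python
def gpu_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Rough GPU memory (GB) needed to hold model weights alone."""
    return num_params * bytes_per_param / 1e9

print(gpu_memory_gb(7e9))      # Falcon 7B in fp16 -> 14.0
print(gpu_memory_gb(70e9))     # LLaMA-2 70B in fp16 -> 140.0
print(gpu_memory_gb(7e9, 1))   # 7B model quantised to 8-bit -> 7.0
```

The same helper shows why a 70B model needs multiple 80 GB cards, and why quantisation is the cheapest lever for fitting a model onto smaller hardware.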
However, when prioritizing high availability and efficient parallelization, relying on a single GPU card is not optimal. In such cases, a configuration comprising at least four GPU cards is recommended to fully leverage parallel processing capabilities.
Example AWS pricing — prices may vary between providers.
To achieve real-time responses, continuous machine operation is necessary due to the significant startup time required. In this case an instance can be reserved for a longer period of time which can reduce costs.
- The cheapest machine with 1 GPU costs $0.402/hr with a 3-year reservation
- The cheapest machine with 4 GPUs costs $2.269/hr with a 3-year reservation
In certain use cases where high service availability is not essential, opting for on-demand machine provisioning is a viable approach. While this method incurs a higher hourly cost, it offers significant cost savings as we only pay for the machine when it is required.
- The cheapest machine with 1 GPU costs $1.006/hr on-demand
- The cheapest machine with 4 GPUs costs $5.672/hr on-demand
In cases where the current setup is not able to sustain satisfactory response times, two viable options exist. Firstly, larger instances can be deployed to meet the requirements. Secondly, autoscaling can be implemented to provision new instances dynamically. The approach for serving the model must align with the unique requirements and objectives of the business.
Having your own LLM can lead to cost savings, often making it cheaper than relying on external providers, and it also brings benefits in data privacy, performance, and strategy. While training a model from scratch can be expensive, leveraging foundational models and innovative techniques like QLoRa can reduce costs to just a few dollars. However, serving these models during peak traffic or in real-time scenarios presents challenges, though it remains reasonably affordable depending on the specific use case.
Falcon 7B based model total costs
In the most economical scenario, developing a specialised model based on Falcon 7B, fine-tuned on proprietary data, can cost as little as $9.99 using Google Colab Pro. Deploying the model on demand with just a single-GPU machine can keep expenses under control at $1.006/hr.
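Putting the numbers together, a minimal monthly budget for this scenario might look as follows; the serving hours are an assumption (business hours only), not a figure from the pricing above:

```python
# Minimal monthly budget for a fine-tuned Falcon 7B served on demand.
colab_pro = 9.99           # Google Colab Pro subscription, $/month (fine-tuning)
on_demand_price = 1.006    # cheapest single-GPU on-demand instance, $/hr
serving_hours = 8 * 22     # assumption: 8 h/day on ~22 business days

total = colab_pro + on_demand_price * serving_hours
print(f"~${total:.2f}/month")  # ~$187.05/month
```

Even this conservative setup lands under $200/month, and the serving share scales linearly with the hours the instance actually runs.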
- Attention Is All You Need — https://arxiv.org/abs/1706.03762
- LoRA: Low-Rank Adaptation of Large Language Models — https://arxiv.org/abs/2106.09685