Welcome to Infinite Curiosity, a weekly newsletter that explores the intersection of Artificial Intelligence and Startups. Tech enthusiasts across 200 countries have been reading what I write. Subscribe to this newsletter for free to receive it in your inbox every week:
Hello friends,
LLMs are incredible at generating human-like text, but they can be slow and expensive. That has a direct impact on user experience, and user experience is often what separates good products from great ones.
Given that AI compute is becoming increasingly scarce and expensive, people need to do more with the compute they have. What can we do to make LLMs faster? To get the answer, the llama decided to have a talk with the race car driver. They came up with 8 techniques to speed up LLMs:
1. Model Pruning
Reduce the size of the model by eliminating parameters. Which parameters do we eliminate? The ones that have little impact on the output. By taking out the less critical components, we can speed up inference and reduce the compute required, while still maintaining good performance.
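Here's a minimal sketch of the idea using PyTorch's built-in pruning utilities. The layer size is arbitrary, and note that zeroing weights like this only turns into real speedups once you use structured pruning or a runtime with sparse kernels:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for one projection layer inside a transformer block.
layer = nn.Linear(4096, 4096)

# Zero out the 30% of weights with the smallest absolute values (L1 criterion).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent by removing the reparametrization hooks.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Sparsity after pruning: {sparsity:.0%}")  # ~30% of weights are now zero
```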
2. Quantization
Reduce the precision of the numerical values used within the model. For example, you can switch from float32 to float16 (or even further down to int8). Won't that hurt accuracy? A little, and that's the tradeoff: you save a lot of compute and memory without giving up too much accuracy.
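As a rough sketch (the 4096×4096 layer is just a stand-in for a transformer projection), here's how a float32 layer compares to its float16 counterpart, plus PyTorch's dynamic int8 quantization:

```python
import torch
import torch.nn as nn

# Stand-in for one linear layer in a transformer block.
fp32_layer = nn.Linear(4096, 4096)

# float32 -> float16: halves the memory footprint and is faster on most GPUs.
fp16_layer = nn.Linear(4096, 4096).half()

# float32 -> int8 weights via PyTorch's dynamic quantization (CPU inference).
int8_layer = torch.quantization.quantize_dynamic(
    nn.Linear(4096, 4096), {nn.Linear}, dtype=torch.qint8
)

def size_mb(m):
    return sum(p.numel() * p.element_size() for p in m.parameters()) / 1e6

print(f"fp32: {size_mb(fp32_layer):.1f} MB, fp16: {size_mb(fp16_layer):.1f} MB")
print(int8_layer)  # a DynamicQuantizedLinear module holding int8 weights
```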
3. Model Distillation
Train a smaller model to imitate the behavior of a larger model. You can leverage the knowledge of the larger model to make the smaller one produce similar output much more efficiently.
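The core of it is a loss that pushes the student's output distribution toward the teacher's. Here's a sketch of the classic soft-target (KL divergence) formulation, with random logits standing in for real model outputs:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target loss: the student mimics the teacher's output distribution."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence, scaled by T^2 as in the standard distillation setup.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

# Toy example: a batch of 8 positions over a 100-token vocabulary.
teacher_logits = torch.randn(8, 100)
student_logits = torch.randn(8, 100, requires_grad=True)

loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
print(loss.item())
```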
4. Parallel Processing
Split the workload across multiple GPUs. This allows you to process larger batches of data simultaneously and significantly reduce the overall computation time.
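Production LLM serving typically uses tensor or pipeline parallelism (via frameworks like DeepSpeed or vLLM), but the simplest illustration of splitting a batch across GPUs is PyTorch's DataParallel. The linear layer below is just a stand-in for a real model:

```python
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096)  # stand-in for a real model

if torch.cuda.device_count() > 1:
    # Replicate the model and split each batch across all visible GPUs.
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

batch = torch.randn(64, 4096, device=device)
with torch.no_grad():
    out = model(batch)  # each GPU processes a slice of the batch in parallel
print(out.shape)
```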
5. Subword Tokenization
Break words into smaller units (i.e. subwords). This will allow you to reduce the size of the vocabulary. What does this achieve? It speeds up processing time and reduces memory usage.
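You can see this in action with any BPE tokenizer. Here's a quick look using GPT-2's tokenizer from Hugging Face, chosen purely as an example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization keeps the vocabulary manageable"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(tokens)  # rare words get split into smaller, reusable subword pieces
print(len(ids), "token ids for", len(text.split()), "words")
```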
6. Optimized Libraries
Use highly optimized libraries (like Nvidia's TensorRT) to run your AI workloads. This can significantly boost performance. Why? Because these libraries ship pre-optimized algorithms and model implementations tuned for the hardware.
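One common path is to export your model to ONNX and let ONNX Runtime hand it to the TensorRT execution provider when it's available. This is only a sketch with a toy layer; for transformer workloads specifically, Nvidia also ships TensorRT-LLM:

```python
import torch
import torch.nn as nn
import onnxruntime as ort

# Export a toy module to ONNX (stand-in for a real model export).
model = nn.Linear(512, 512).eval()
dummy = torch.randn(1, 512)
torch.onnx.export(model, dummy, "linear.onnx", input_names=["x"], output_names=["y"])

# Prefer TensorRT, then CUDA, then CPU, depending on what's installed.
preferred = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
providers = [p for p in preferred if p in ort.get_available_providers()]

session = ort.InferenceSession("linear.onnx", providers=providers)
out = session.run(None, {"x": dummy.numpy()})[0]
print(out.shape, "via", session.get_providers()[0])
```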
7. Batch Inference Workloads
A good chunk of the chip's memory bandwidth is consumed by loading model parameters, which makes it difficult for LLMs to take full advantage of the chip's compute. How can we overcome this? Through batching. Instead of loading the model parameters separately for every input sequence, you can batch the sequences together, load the parameters once, and use them to process multiple input sequences at a time.
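Here's what that looks like in practice with Hugging Face's transformers, using GPT-2 as a small stand-in model: three prompts go through the model together, so each decoding step reads the weights once instead of three times.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
tokenizer.padding_side = "left"            # left-pad for decoder-only generation

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompts = [
    "The weather today is",
    "Large language models are",
    "My favorite recipe uses",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    # One batched forward pass per decoding step serves all three prompts.
    outputs = model.generate(
        **inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id
    )

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```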
8. Adapters
What are adapters in this case? They are compact additional layers attached to the model (e.g. LoRA, QLoRA). These layers are tunable, which means you can train them to do what you want. Because they are lightweight, the model can adapt quickly. This is especially useful when you're fine-tuning a model.
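Here's a minimal LoRA sketch using Hugging Face's peft library, again with GPT-2 as a stand-in base model; the rank and target modules are illustrative choices:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the adapter output
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # only the small adapter layers are trainable
```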
If you are getting value from this newsletter, consider subscribing for free and sharing it with 1 friend who’s curious about AI: