Efficient Training of Large Language Models (LLMs) with LoRA and 4-Bit Quantization
In this guide, we will explore an efficient approach to training LLMs using Low-Rank Adaptation (LoRA) and 4-bit quantization with `bitsandbytes`. These techniques significantly reduce GPU memory requirements while maintaining model performance. Additionally, we will leverage data parallelism to accelerate training across multiple GPUs.
Hardware Requirements
For optimal performance, your system should meet the following hardware specifications:
- 2× NVIDIA A100 GPUs (80GB VRAM each)
- 300GB of mounted storage (for model weights and training artifacts)
Storage Configuration
Mounting Storage
To set up storage on your virtual machine, execute the following commands:
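The exact commands depend on your cloud provider and device naming; the sketch below assumes the attached volume shows up as /dev/vdb (check with `lsblk`) and that you want to mount it at /mnt/data:

```bash
# Format the attached volume (skip this step if it already contains data) and mount it.
sudo mkfs.ext4 /dev/vdb
sudo mkdir -p /mnt/data
sudo mount /dev/vdb /mnt/data
sudo chown -R "$USER":"$USER" /mnt/data   # make the mount point writable by your user
```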
Verify that the storage has been mounted correctly and check for sufficient capacity:
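Assuming the mount point from the sketch above:

```bash
# The mounted volume should show roughly 300GB of available space.
df -h /mnt/data
```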
Cloning the Cookbook Repository
Clone the repository containing this guide and navigate to the relevant directory:
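The repository URL is not reproduced here; substitute the URL and directory of the cookbook you are following:

```bash
# Replace the placeholders with the actual repository URL and guide directory.
git clone <cookbook-repo-url>
cd <cookbook-repo>/<guide-directory>
```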
Project Structure
```
src/
├── config.py
├── main.py
├── data_processing/
│   ├── processor.py
```
Install Dependencies
To begin, update your system and install the Python development headers:
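On an Ubuntu-based image this looks roughly as follows (adjust the package manager for other distributions):

```bash
# Refresh system packages and install the Python development headers and venv support.
sudo apt-get update && sudo apt-get upgrade -y
sudo apt-get install -y python3-dev python3-venv
```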
Next, install the necessary Python libraries. If you've just created your environment, use the following command:
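If the repository ships a pinned requirements file, prefer that; otherwise, a sketch of the libraries this guide relies on:

```bash
pip install --upgrade pip
pip install torch transformers datasets accelerate peft trl bitsandbytes
```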
Exporting Environment Variables
To optimize storage and prevent issues during training, export the required environment variables:
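Assuming the storage is mounted at /mnt/data as in the earlier sketch:

```bash
# Redirect the Hugging Face cache to the mounted volume and expose your access token.
export HF_HOME=/mnt/data/huggingface
export HF_TOKEN=<your_hugging_face_token>
```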
The `HF_HOME` variable points to the mounted storage where models and datasets will be stored, and the `HF_TOKEN` ensures that you can access gated datasets from Hugging Face.
Note: You can generate an HF token by following the instructions provided on the Hugging Face website: How to generate a Hugging Face token.
Running the Training Script
The training process is contained in the `main.py` script. Run it with:
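A sketch of the launch command, assuming two GPUs and the `src/main.py` entry point from the project structure above (adjust the path if you launch from inside `src/`):

```bash
accelerate launch --multi_gpu --num_processes 2 --mixed_precision 'bf16' src/main.py
```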
This command launches the training script using the `accelerate` library, which is optimized for multi-GPU and distributed training. The `--mixed_precision 'bf16'` argument ensures that the training uses bfloat16 precision, optimizing memory usage and improving training speed without sacrificing model performance.
By running this command, the script will initialize all components (the model, tokenizer, dataset, and trainer) and start the training process.
Now that we know how to run the training script, let's take a closer look at it. This script orchestrates the entire training pipeline, including model initialization, dataset preparation, and fine-tuning with LoRA and 4-bit quantization. Understanding its structure will help us grasp how these techniques work in practice.
Preparing the Dataset
Loading the Dataset
We will use the TweetSumm dataset, which pairs customer-support conversations with human-written summaries. To load the dataset, run the following code:
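A minimal sketch, assuming the dataset is pulled from the Hugging Face Hub via the Salesforce/dialogstudio collection's TweetSumm configuration (adjust the identifier if the repository loads it from a different source):

```python
from datasets import load_dataset

# Assumption: TweetSumm is loaded from the Salesforce/dialogstudio collection on the Hub.
# This collection uses a loading script, so it may require `datasets<3.0` and trust_remote_code=True.
dataset = load_dataset("Salesforce/dialogstudio", "TweetSumm", trust_remote_code=True)

print(dataset)  # inspect the available splits
```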
Loading the Tokenizer
Before processing the dataset, load the tokenizer from Hugging Face. This ensures that input text is properly tokenized before being passed to the model.
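The base model is configured in config.py and is not reproduced here; the sketch below uses a placeholder Llama-style checkpoint:

```python
from transformers import AutoTokenizer

# Assumption: substitute the base model ID defined in config.py.
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"

# HF_TOKEN from the environment is used automatically for gated checkpoints.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Causal LMs often ship without a pad token; reuse EOS so padded batches work during training.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```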
Preprocessing the Dataset
The core of dataset preprocessing involves transforming raw conversations into a format suitable for training. Each sample will be structured into an instruction, input, and response format, enabling the model to learn from structured data. The `Processor` class handles this by extracting the conversation, removing irrelevant content (like URLs and mentions), and creating training prompts. The conversation is then tokenized.
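The actual implementation lives in data_processing/processor.py; the following is a minimal sketch of the same idea, with hypothetical field names (`conversation`, `summary`) standing in for the dataset's real schema:

```python
import re

class Processor:
    """Sketch: turn a raw conversation/summary pair into a tokenized training prompt."""

    INSTRUCTION = (
        "Below is a conversation between a human and an AI agent. "
        "Write a summary of the conversation."
    )

    def __init__(self, tokenizer, max_length=1024):
        self.tokenizer = tokenizer
        self.max_length = max_length

    def clean(self, text):
        # Remove URLs and @mentions, then collapse repeated spaces (newlines between turns are kept).
        text = re.sub(r"https?://\S+", "", text)
        text = re.sub(r"@\w+", "", text)
        return re.sub(r"[ \t]+", " ", text).strip()

    def build_prompt(self, conversation, summary):
        return (
            f"### Instruction: {self.INSTRUCTION}\n"
            f"### Input:\n{self.clean(conversation)}\n"
            f"### Response:\n{summary}"
        )

    def __call__(self, sample):
        # Assumption: 'conversation' and 'summary' are placeholders for the dataset's actual columns.
        prompt = self.build_prompt(sample["conversation"], sample["summary"])
        return self.tokenizer(prompt, truncation=True, max_length=self.max_length)
```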
This ensures that each conversation is formatted as follows:
```
### Instruction: Below is a conversation between a human and an AI agent. Write a summary of the conversation.
### Input:
user: How do I check in for my flight?
agent: You can check in online through our website or mobile app.
### Response:
The user is asking about the process for checking in for a flight. The agent suggests checking in online.
```
This format makes it easier for the model to learn the relationship between user queries and AI responses, while also maintaining the necessary structure for summarization tasks.
Loading the Model
To load a pre-trained model with 4-bit quantization, we'll use the `AutoModelForCausalLM` class from Hugging Face together with a `BitsAndBytesConfig` that enables 4-bit quantization and configures the relevant settings. Quantizing the weights to 4 bits drastically reduces memory consumption, making it practical to fine-tune large models within the available GPU memory.
You can initialize the model using the following code:
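A minimal sketch, assuming the same MODEL_NAME as above and NF4 quantization with bfloat16 compute (the exact settings live in config.py):

```python
import os

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute, the usual QLoRA-style configuration.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# When launched with accelerate, each process sees its own LOCAL_RANK; pin the model to that GPU.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,                      # assumption: same base model ID as the tokenizer
    quantization_config=bnb_config,
    device_map={"": local_rank},
)
```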
Once the model is loaded, it will be placed on the appropriate GPU using the `device_map` argument. If you're training across multiple GPUs, this will automatically assign the model to the correct device based on the `LOCAL_RANK` environment variable.
Configuring PEFT (Low-Rank Adaptation)
To further optimize the training of LLMs, we use Low-Rank Adaptation (LoRA), a technique that modifies only a subset of model parameters while leaving the rest intact. LoRA helps achieve good performance with less computational overhead and reduced memory requirements. The `LoraConfig` class from the PEFT library configures LoRA parameters for the model.
LoRA Configuration
You can configure LoRA with the following parameters:
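The values below are illustrative defaults rather than the project's exact settings (those live in config.py):

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                    # rank of the low-rank update matrices
    lora_alpha=32,           # scaling factor applied to the LoRA updates
    lora_dropout=0.05,       # dropout on the LoRA layers during training
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: attention projections
)
```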
Once the LoRA configuration is set up, it can be applied to the model during training, allowing for efficient fine-tuning with minimal computational cost.
Training the Model
Now that the dataset is processed, the model is loaded, and PEFT (Low-Rank Adaptation) is configured, we can proceed with training the model. We will use the `TrainingArguments` class to define important training parameters, such as batch sizes, evaluation strategies, gradient checkpointing, and checkpoint saving. These settings are crucial for ensuring the stability of training and efficient management of system resources.
Training Arguments Configuration
The `TrainingArguments` class specifies all the hyperparameters and settings for the training process. This includes optimizer configurations, batch sizes, evaluation strategies, gradient accumulation, and more. Below is an example configuration for our training:
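The values below are illustrative; tune them to your hardware and dataset (the project's exact settings live in config.py):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,      # trade extra compute for lower activation memory
    learning_rate=2e-4,
    bf16=True,                        # matches the --mixed_precision 'bf16' launch flag
    logging_steps=10,
    eval_strategy="steps",            # `evaluation_strategy` on older transformers releases
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    optim="paged_adamw_8bit",         # memory-efficient optimizer from bitsandbytes
    report_to="none",
)
```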
Starting the Training with the Trainer
To begin training, we use the `SFTTrainer` class, which is designed for supervised fine-tuning. We will pass the processed dataset and the model, along with the training arguments and PEFT configuration:
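A sketch, assuming `train_dataset` and `eval_dataset` are the processed splits produced in the preprocessing step:

```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    processing_class=tokenizer,   # `tokenizer=tokenizer` on older trl releases
)
```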
Handling Potential Issues During Training
With the checkpointing system in place, training can resume from the most recent checkpoint if anything unexpected happens. The model will be periodically saved and evaluated, giving us the ability to monitor performance in real-time. If the model starts to overfit or underperform, adjustments to hyperparameters like learning rate, batch size, and gradient accumulation steps can be made dynamically.
Additionally, the use of `gradient_checkpointing` ensures that large models can still be trained on GPUs with limited memory. This, combined with the use of 4-bit quantization, helps manage GPU memory constraints and keeps training manageable even with large datasets and models.
Training the Model with `trainer.train()`
Once the `SFTTrainer` is set up with the model, datasets, training arguments, and PEFT configuration, we can start the actual training process by calling the `train()` method:
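A minimal sketch, which also saves the adapters and tokenizer to the output directory once training completes:

```python
# Kick off fine-tuning; pass resume_from_checkpoint=True to continue from the latest checkpoint.
trainer.train()

# Persist the LoRA adapters and tokenizer when training completes.
trainer.save_model(training_args.output_dir)
tokenizer.save_pretrained(training_args.output_dir)
```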
Saving and Loading the Fine-Tuned Model
After training, the fine-tuned model is saved in the `./results` directory. You can reload it for inference or further fine-tuning as follows:
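A sketch using the PEFT auto class, assuming the adapters were saved to `./results` as shown above:

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the base model together with the LoRA adapters saved in ./results.
model = AutoPeftModelForCausalLM.from_pretrained(
    "./results",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./results")

# Optionally merge the adapters into the base weights for standalone inference.
# model = model.merge_and_unload()
```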
This ensures that your fine-tuned model is properly loaded with all LoRA adapters and ready for inference or further training.
Conclusion
By combining LoRA, 4-bit quantization, and multi-GPU training, we achieved efficient fine-tuning of LLMs while minimizing resource usage. This setup enables scalable, cost-effective training without sacrificing performance.
Next steps? Experiment with hyperparameters and datasets to tailor models to your needs. Happy training! 🚀