Fine-tuning Llama

For Llama 2, refer to Fine-Tuning Llama 2 Using LoRA and QLoRA: A Comprehensive Guide.

For the training phase (a minimal sketch follows this list):

  • Set up the QLoRA and BitsAndBytes configuration
  • Load the base model with the quantization settings
  • Set up the tokenizer and PEFT configuration
  • Run the SFTTrainer
  • Save the weights to a designated folder after training
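
A minimal sketch of these steps is shown below. It assumes transformers, peft, trl, bitsandbytes, and datasets are installed; the model id, dataset, LoRA hyperparameters, and output folder are placeholders, and the exact SFTTrainer arguments vary across trl versions.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder model id

# 1. QLoRA / BitsAndBytes configuration: 4-bit NF4 quantization with bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# 2. Load the base model with the quantization settings.
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)

# 3. Tokenizer and PEFT (LoRA) configuration.
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder choice
    task_type="CAUSAL_LM",
)

# 4. Run the SFTTrainer on an instruction dataset (placeholder dataset with a "text" field).
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,        # newer trl versions rename this argument
    dataset_text_field="text",  # newer trl versions move this into SFTConfig
)
trainer.train()

# 5. Save the LoRA adapter weights to a folder after training.
trainer.model.save_pretrained("./llama2-qlora-adapter")
```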

For the evaluation phase (see the sketch after this list):

  • Load the base model using AutoModelForCausalLM
  • Merge it with the trained adapter via PeftModel
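
A minimal sketch, assuming the adapter folder from the training step above (the base model id and prompt are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder base model id
adapter_dir = "./llama2-qlora-adapter"   # adapter folder saved during training

# Reload the base model in half precision, then attach the LoRA adapter.
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_dir)
model = model.merge_and_unload()  # fold the adapter weights back into the base model

# Quick generation check with the merged model.
tokenizer = AutoTokenizer.from_pretrained(base_model)
inputs = tokenizer("Explain LoRA in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```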

For the detailed code, refer to the aforementioned link.

For Llama 3.1, a similar process can be followed. However, double-check that your fine-tuning data conforms to the corresponding Llama data format.
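
One quick way to verify the format is to render a training sample through the model's chat template and inspect the special tokens. A minimal sketch, assuming access to the gated Llama 3.1 checkpoint on Hugging Face (the model id and sample content are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

sample = [
    {"role": "user", "content": "What is QLoRA?"},
    {"role": "assistant", "content": "QLoRA fine-tunes a 4-bit quantized model with LoRA adapters."},
]

# apply_chat_template inserts the <|begin_of_text|> and header tokens expected by
# Llama 3.x, so you can confirm your data matches the format before training.
print(tokenizer.apply_chat_template(sample, tokenize=False))
```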

For details, refer to the Llama 3.1 fine-tuning guide; the detailed code is in the link above.

Moreover, when benchmark evaluation requires switching between different LLMs, a better option is LLaMA-Factory, which supports fine-tuning a wide range of SOTA language models, such as Llama 3, Phi-3, PaliGemma, and Gemma. Thanks to its well-defined interfaces, adding support for a new language model takes limited effort, and it also supports recent algorithms such as SimPO, KTO, and PiSSA.

Before fine-tuning, you can also use https://huggingface.co/spaces/hf-accelerate/model-memory-usage to estimate the required GPU memory in advance.
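
The linked Space gives a detailed per-dtype breakdown, including training with Adam; for a quick back-of-the-envelope number, the weight footprint alone can be computed directly. A minimal sketch (the 7B parameter count is an assumed example):

```python
def estimate_weights_gib(n_params_billion: float, bytes_per_param: float) -> float:
    """Memory for the model weights alone, in GiB (excludes activations, optimizer states, KV cache)."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

# Rough weight-only footprints for a 7B-parameter model at common precisions.
for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"7B weights in {name}: ~{estimate_weights_gib(7, bytes_per_param):.1f} GiB")
```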

Besides the options above, mLoRA is another option worth considering. It shares a single base model across multiple LoRA adapters, and supports efficient pipeline parallelism as well as multiple reinforcement-learning preference-alignment algorithms.
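
mLoRA has its own training framework and API; as a rough illustration of the shared-base-model idea using plain PEFT (not mLoRA's API), several LoRA adapters can be attached to one loaded base model and switched without reloading it. The adapter paths below are placeholders:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model once; all adapters reuse these weights.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")

# Attach a first adapter, then load a second one onto the same base model.
model = PeftModel.from_pretrained(base, "./adapters/task-a", adapter_name="task_a")  # placeholder path
model.load_adapter("./adapters/task-b", adapter_name="task_b")                       # placeholder path

# Switch the active adapter without reloading the base weights.
model.set_adapter("task_b")
```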

Two reference implementations are listed below:

  1. https://github.com/TUDB-Labs/mLoRA: an efficient "factory" to build multiple LoRA adapters
  2. https://github.com/small-thinking/multi-lora-fine-tune: efficient multi-LoRA LLM fine-tuning