Fine-tuning Llama
For Llama 2, refer to Fine-Tuning Llama 2 Using LoRA and QLoRA: A Comprehensive Guide.
For the training phase (see the sketch after this list):
- Set up the QLoRA and BitsAndBytes configuration
- Load the base model with the quantization settings
- Set up the tokenizer and PEFT configuration
- Start the SFTTrainer
- Save the adapter weights to a designated folder after training
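A minimal sketch of these training steps is shown below. It loosely follows the guide linked above; the base model name, dataset, hyperparameters, and the exact SFTTrainer keyword arguments (which change across trl versions) are assumptions, not a definitive recipe.

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig
from trl import SFTTrainer

base_model = "NousResearch/Llama-2-7b-chat-hf"   # assumption: any causal-LM checkpoint works here
dataset_name = "mlabonne/guanaco-llama2-1k"      # assumption: a small instruction-tuning dataset
adapter_dir = "llama-2-7b-qlora"                 # folder for the trained adapter weights

# 1) QLoRA / BitsAndBytes configuration: 4-bit NF4 quantization with fp16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

# 2) Load the base model with the quantization settings
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)
model.config.use_cache = False

# 3) Tokenizer and PEFT (LoRA) configuration
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=25,
)

# 4) Start the SFTTrainer (signature shown matches older trl releases used in the guide)
dataset = load_dataset(dataset_name, split="train")
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
)
trainer.train()

# 5) Save the adapter weights (and tokenizer) to the output folder
trainer.model.save_pretrained(adapter_dir)
tokenizer.save_pretrained(adapter_dir)
```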
For the evaluation phase (see the sketch below):
- Load base model using AutoModelForCausalLM
- Merge in the saved LoRA adapter using PeftModel
For the detailed code, refer to the aforementioned link.
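A minimal sketch of the evaluation-phase merge, assuming the same base checkpoint and adapter folder as in the training sketch above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "NousResearch/Llama-2-7b-chat-hf"  # assumption: same base checkpoint as during training
adapter_dir = "llama-2-7b-qlora"                # assumption: folder where the adapter was saved

# Reload the base model in half precision (no 4-bit quantization when merging)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Attach the LoRA adapter and fold it into the base weights
model = PeftModel.from_pretrained(model, adapter_dir)
model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

# The merged model can now be evaluated, or saved as a standalone checkpoint:
# model.save_pretrained("llama-2-7b-qlora-merged")
```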
For Llama 3.1, a similar process can be followed. However, double-check that your fine-tuning data conforms to the corresponding Llama data format.
For details, refer to Llama 3.1 fine-tuning; for the detailed code, refer to the link above.
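One quick way to check the format is to render your examples with the tokenizer's own chat template, so the special tokens (e.g. <|start_header_id|>, <|eot_id|>) match what Llama 3.1 expects. The model ID below is an assumption; the Llama 3.1 repositories on the Hugging Face Hub are gated and require accepting the license.

```python
from transformers import AutoTokenizer

# assumption: you have access to a Llama 3.1 Instruct tokenizer on the Hub
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is QLoRA?"},
    {"role": "assistant", "content": "QLoRA fine-tunes a 4-bit quantized base model with LoRA adapters."},
]

# Render one training example with the model's own chat template
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```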
Moreover, when benchmark evaluation requires switching between different LLMs, a better option is LLaMA-Factory, which supports fine-tuning a wide range of SOTA language models, such as Llama 3, Phi-3, PaliGemma, and Gemma. Thanks to its well-defined interfaces, adding a new language model takes limited effort, and it also supports new algorithms such as SimPO, KTO, and PiSSA.
Before fine-tuning, you can also use https://huggingface.co/spaces/hf-accelerate/model-memory-usage to estimate the needed GPU memory in advance.
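As a rough complement to that tool, a back-of-envelope estimate can be computed by hand; the numbers below are illustrative assumptions, and the real footprint also depends on sequence length, batch size, and optimizer state.

```python
# Back-of-envelope GPU memory estimate for 4-bit quantized weights (illustrative only)
num_params = 8e9        # assumption: an 8B-parameter model
bytes_per_param = 0.5   # 4-bit weights ~ 0.5 byte per parameter
weights_gb = num_params * bytes_per_param / 1024**3
print(f"~{weights_gb:.1f} GB for quantized weights, plus activations, LoRA adapters, and optimizer state")
```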
Besides the tools mentioned above, mLoRA is another option that shouldn't be ignored. It shares a single base model across multiple LoRA adapters, and supports efficient pipeline parallelism as well as multiple reinforcement-learning preference-alignment algorithms.
Reference implementations are listed below:
- https://github.com/TUDB-Labs/mLoRA