Fine-Tuning Large Models

Lee · Published 2024-06-17

How to Enhance Model Capabilities

Fine-tuning is one way to do this; there are others, such as attaching an external knowledge base or having an Agent call other API data sources. Below I explain how these approaches differ.

  1. Fine-tuning adapts a pre-trained model to a specific task or dataset at relatively low cost. The model learns from the fine-tuning data you provide and develops some understanding of it.

  2. A knowledge base stores data in a vector database (or another kind of database) and serves as an external information source for the large language model.

  3. An API plays a similar role to a knowledge base, supplying the large language model with an external information source.

Put simply, fine-tuning is like having the model study a new subject and then sit a closed-book exam when it answers, while a knowledge base or an API hands the model the textbook for that subject so it can sit an open-book exam. These approaches are not mutually exclusive; you can combine them to improve the model's output. Below is a brief look at how each one is put into practice.

When putting a large model into production, we first analyze the requirements and then choose how to implement it.

  1. Fine-tuning: prepare the data, fine-tune, validate, serve.

  2. Knowledge base: prepare the data, build the vector store, build the agent, serve (the sketch after this list illustrates the retrieval flow).

  3. API: prepare the data, develop the interface, build the agent, serve.
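
As a rough, self-contained illustration of the "open-book" idea behind the knowledge-base approach, here is a toy sketch. The Python list stands in for a vector database and the word-overlap score stands in for an embedding model; a real system would use both, plus an actual LLM call at the end.

# Toy illustration only: a real system would embed the question, query a vector
# database, and send the retrieved passages together with the question to the LLM.
knowledge_base = [
    "Fine-tuning adapts a pre-trained model to new data at relatively low cost.",
    "A knowledge base supplies external documents to the model at answer time.",
]

def retrieve(question, top_k=1):
    # Rank "documents" by how many words they share with the question.
    def overlap(doc):
        return len(set(question.lower().split()) & set(doc.lower().split()))
    return sorted(knowledge_base, key=overlap, reverse=True)[:top_k]

def build_prompt(question):
    context = "\n".join(retrieve(question))
    return f"Answer using the material below (open-book).\n{context}\nQuestion: {question}"

print(build_prompt("What does fine-tuning adapt?"))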

LoRA in Practice

Below I walk through an example of using LoRA to fine-tune the Whisper ASR (speech recognition) model, to give you a deeper understanding:

We will use several popular Python packages to fine-tune the Whisper model: datasets to download and prepare the training data, transformers to load and train the Whisper model, and librosa to preprocess the audio files, with evaluate and jiwer to measure the model's performance. Finally, we will use peft, bitsandbytes, and accelerate to prepare and fine-tune the LoRA model.
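
A typical environment setup looks like this (run it as a notebook cell; versions are deliberately not pinned here, so adjust as needed):

!pip install datasets transformers librosa evaluate jiwer peft bitsandbytes accelerate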

Loading the Dataset

With 🤗 Datasets, downloading and preparing data is very simple: we can download and prepare a Common Voice split in a single line of code. First, make sure you have accepted the terms of use on the Hugging Face Hub: mozilla-foundation/common_voice_13_0.

Since Hindi is very low-resource, we merge the train and validation splits to get roughly 12 hours of training data, and use the 6 hours of test data as our held-out test set:

from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

dataset_name = "mozilla-foundation/common_voice_13_0"
language_abbr = "hi"  # Replace with the language ID of your choice here!

common_voice["train"] = load_dataset(dataset_name, language_abbr, split="train+validation", use_auth_token=True)
common_voice["test"] = load_dataset(dataset_name, language_abbr, split="test", use_auth_token=True)

print(common_voice)

Most ASR datasets only provide input audio samples (audio) and the corresponding transcribed text (sentence). Common Voice contains additional metadata, such as accent and locale, which we can ignore. To keep the notebook as general as possible, we only use the input audio and transcribed text for fine-tuning and discard the extra metadata:

common_voice = common_voice.remove_columns(
    ["accent", "age", "client_id", "down_votes", "gender", "locale", "path", "segment", "up_votes", "variant"]
)

print(common_voice)

Since our input audio is sampled at 48 kHz, we need to downsample it to 16 kHz before passing it to the Whisper feature extractor, as 16 kHz is the sampling rate the Whisper model expects.

from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

Preparing the Feature Extractor, Tokenizer, and Data

An ASR pipeline can be decomposed into three stages:

  1. A feature extractor which pre-processes the raw audio inputs

  2. The model which performs the sequence-to-sequence mapping

  3. A tokenizer which post-processes the model outputs to text format

In 🤗 Transformers, the Whisper model has an associated feature extractor and tokenizer, called WhisperFeatureExtractor and WhisperTokenizer respectively.

from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor

# The post does not pin a specific checkpoint; whisper-large-v2 is assumed here as an example.
model_name_or_path = "openai/whisper-large-v2"
task = "transcribe"

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name_or_path)

tokenizer = WhisperTokenizer.from_pretrained(model_name_or_path, language=language_abbr, task=task)

# To simplify their use, we can wrap the feature extractor and tokenizer into a single WhisperProcessor class.
processor = WhisperProcessor.from_pretrained(model_name_or_path, language=language_abbr, task=task)

Preparing the Data

We write a prepare_dataset function that resamples the audio, computes the log-Mel input features, and encodes the transcription into label ids, then apply it to the whole dataset with map():

def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch


common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=2)
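
As a quick sanity check (not part of the original walkthrough), you can inspect one processed example: input_features should be an n_mels x 3000 log-Mel matrix (80 x 3000 for whisper-large-v2), and decoding the labels should give back the transcription.

sample = common_voice["train"][0]
print(len(sample["input_features"]), len(sample["input_features"][0]))  # n_mels, 3000 frames
print(tokenizer.decode(sample["labels"], skip_special_tokens=True))     # the original sentence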

Training and Evaluation

Evaluation metric: during evaluation, we want to assess the model using the word error rate (WER) metric.

import evaluate

metric = evaluate.load("wer")
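
A tiny self-contained example of how the metric is used (the strings are made up for illustration; compute() returns the error rate as a fraction):

predictions = ["namaste duniya"]
references = ["namaste dunia"]
print(100 * metric.compute(predictions=predictions, references=references))  # 50.0: one of the two reference words differs

Next, we define a data collator. The audio inputs and the tokenized labels have different lengths and need different padding, so the collator pads them separately and replaces the label padding with -100 so those positions are ignored by the loss:
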
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

To reduce the model's memory footprint, we load the model in 8-bit:

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(model_name_or_path, load_in_8bit=True, device_map="auto")

from peft import prepare_model_for_int8_training

model = prepare_model_for_int8_training(model, output_embedding_layer_name="proj_out")

# Register a forward hook so the encoder's first conv output requires gradients;
# otherwise no gradients flow back to the trainable LoRA layers while the 8-bit base model is frozen.
def make_inputs_require_grad(module, input, output):
    output.requires_grad_(True)

model.model.encoder.conv1.register_forward_hook(make_inputs_require_grad)

Applying Low-Rank Adapters (LoRA) to the Model

from peft import LoraConfig, get_peft_model

# r is the adapter rank and lora_alpha the scaling factor; we adapt the attention query and value projections.
config = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none")

model = get_peft_model(model, config)
model.print_trainable_parameters()
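
For a checkpoint the size of whisper-large-v2, print_trainable_parameters() should report that only on the order of 1% of the weights are trainable, which is what keeps LoRA fine-tuning so cheap.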

Defining the Training Configuration

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="reach-vb/test",  # change to a repo name of your choice
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-3,
    warmup_steps=50,
    num_train_epochs=3,
    evaluation_strategy="steps",
    fp16=True,
    per_device_eval_batch_size=8,
    generation_max_length=128,
    logging_steps=100,
#    max_steps=100, # only for testing purposes, remove this from your final run :)
    remove_unused_columns=False,  # required as the PeftModel forward doesn't have the signature of the wrapped model's forward
    label_names=["labels"],  # same reason as above
)
import os

from transformers import Seq2SeqTrainer, TrainerCallback, TrainingArguments, TrainerState, TrainerControl
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR

# This callback helps to save only the adapter weights and remove the base model weights.
class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control


# Instantiate the data collator defined earlier.
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Fine-tuning the model takes roughly 6-8 hours.
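
After training, inference works by loading the frozen base model and attaching the saved LoRA adapter. The sketch below makes a few assumptions: the checkpoint path is a placeholder for whatever folder SavePeftModelCallback wrote, and decoding is forced to the same language and task used above.

import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration

# Attach the trained adapter to an 8-bit base model; replace the placeholder path with your checkpoint folder.
base_model = WhisperForConditionalGeneration.from_pretrained(model_name_or_path, load_in_8bit=True, device_map="auto")
inference_model = PeftModel.from_pretrained(base_model, "reach-vb/test/checkpoint-XXX/adapter_model")
inference_model.config.use_cache = True  # re-enable the cache for generation
inference_model.eval()

# Transcribe one held-out example.
sample = common_voice["test"][0]
input_features = torch.tensor(sample["input_features"]).unsqueeze(0).to("cuda")
forced_ids = processor.get_decoder_prompt_ids(language=language_abbr, task=task)
with torch.no_grad(), torch.cuda.amp.autocast():
    pred_ids = inference_model.generate(input_features=input_features, forced_decoder_ids=forced_ids, max_new_tokens=128)
print(processor.batch_decode(pred_ids, skip_special_tokens=True)[0])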

With faster-whisper, we can convert the model to CTranslate2 format to speed up inference.

GitHub: https://github.com/SYSTRAN/faster-whisper
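
If you want to serve the fine-tuned weights this way, one possible route (a sketch under assumptions, not verified end-to-end) is to merge the LoRA adapter into a full-precision copy of the base model, save a regular Transformers checkpoint, convert it with ct2-transformers-converter, and load it with faster-whisper:

from peft import PeftModel
from transformers import WhisperForConditionalGeneration

# Merging requires full-precision base weights, so reload the base model without load_in_8bit.
base_fp = WhisperForConditionalGeneration.from_pretrained(model_name_or_path)
merged = PeftModel.from_pretrained(base_fp, "reach-vb/test/checkpoint-XXX/adapter_model").merge_and_unload()
merged.save_pretrained("whisper-hi-merged")
processor.save_pretrained("whisper-hi-merged")

# Convert to CTranslate2 format (shell command from the faster-whisper README):
#   ct2-transformers-converter --model whisper-hi-merged --output_dir whisper-hi-ct2 --quantization float16

from faster_whisper import WhisperModel

fast_model = WhisperModel("whisper-hi-ct2", device="cuda", compute_type="float16")
segments, info = fast_model.transcribe("audio.wav")
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))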

