AI Beginner's Village: Hugging Face

Hugging Face

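Hugging Face was founded in 2016 and started out as a community hub for NLP models. With the rise of LLMs, the pretrained weights and related tooling for most mainstream LLMs can now be found on the platform, along with many computer vision and audio models.

Hugging Face is often described as the GitHub of AI models. It has three core libraries: Transformers (a wrapper around Transformer models that makes them easier to use), Tokenizers (which splits text into the smallest sub-units a model can work with), and Datasets (a tool for loading external data).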
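To make the idea of splitting text into sub-units concrete, here is a minimal sketch of greedy longest-match subword splitting, the scheme used at inference time by WordPiece-style tokenizers such as DistilBERT's. The vocabulary TOY_VOCAB and the function split_word are hypothetical and exist only for illustration; a real tokenizer loads a learned vocabulary with tens of thousands of entries, so this is not the Tokenizers library's implementation.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"hug", "##ging", "face", "un", "##happi", "##ness", "[UNK]"}


def split_word(word: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


print(split_word("hugging"))      # ['hug', '##ging']
print(split_word("unhappiness"))  # ['un', '##happi', '##ness']
print(split_word("xyz"))          # ['[UNK]']
```

Each piece is then mapped to an integer id, which is what the model actually receives.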

The figure below shows the Hugging Face home page; the most commonly used features are the Models and Datasets sections marked in it.

Figure: Datasets page

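Loading Data

The Datasets page on Hugging Face offers a wide range of datasets covering text, audio, and images, and it provides an intuitive viewer for browsing them.

Figure: Hugging Face home page

Using a dataset is straightforward: call load_dataset with the name of the dataset you need. The same function can also load your own custom data files.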

from datasets import load_dataset

# Load a public dataset from the Hugging Face Hub
ds = load_dataset("clapAI/MultiLingualSentiment")

# Load custom training and test data from local JSON files
data_files = {"train": "./day014/datas/train_data.json", "test": "./day014/datas/test_data.json"}
dataset = load_dataset("json", data_files = data_files)
print(dataset)
# Inspect the first training example
print(dataset['train'][0])
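The custom files themselves are not shown above. As a sketch, assume each record has a text field and an integer label field. This is an assumption based on the fine-tuning code later in this article, which reads example['text']; the sample sentences and the write_jsonl helper are made up for illustration. The following writes two small files in JSON Lines form, one JSON object per line, which is a layout the json loader of load_dataset accepts:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records; labels follow 0 = negative, 1 = neutral, 2 = positive.
train_rows = [
    {"text": "I am very happy today", "label": 2},
    {"text": "The meeting is at three", "label": 1},
    {"text": "This is a terrible result", "label": 0},
]
test_rows = [{"text": "What a wonderful surprise", "label": 2}]


def write_jsonl(path: Path, rows: list[dict]) -> None:
    """Write one JSON object per line (the JSON Lines layout)."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


data_dir = Path(tempfile.mkdtemp())
write_jsonl(data_dir / "train_data.json", train_rows)
write_jsonl(data_dir / "test_data.json", test_rows)

# Read the training file back to confirm the layout.
with (data_dir / "train_data.json").open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0])
```

Passing these two paths as data_files to load_dataset("json", ...) should then yield a dataset with a train split and a test split.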

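Using Models

Take text classification (sentiment analysis) as an example. With the pipeline function, specifying the task name is enough to call a model. The default model for this task is distilbert/distilbert-base-uncased-finetuned-sst-2-english; a different one can be selected with the model parameter.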

from transformers import pipeline

# Use the default model
# pipe = pipeline("text-classification")

# Use a specific model, which can be found on the Models page
# (the default model was trained on English data, so a multilingual model is used here instead)
pipe = pipeline("text-classification", model="lxyuan/distilbert-base-multilingual-cased-sentiments-student")

# Chinese test inputs: two song lyrics and the sentence "I am very happy"
string_arr = [
    "从前从前有个人爱你很久，但偏偏风渐渐把距离吹得好远，好不容易又能再多爱一天，但故事的最后你好像还是说了拜拜。",
    "我一路向北，离开有你的季节，你说你好累，已无法再爱上谁。风在山路吹，过往的画面全都是不对，细数惭愧，我伤你几回。",
    "我很开心"]

results = pipe(string_arr)

print(results)
# Output
# [{'label': 'positive', 'score': 0.5694631338119507}, {'label': 'negative', 'score': 0.9576570987701416}, {'label': 'positive', 'score': 0.9572104811668396}]
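The score in each result is a probability. As a rough sketch of that last step: for a single-label model with several classes, a text-classification pipeline applies a softmax to the model's raw outputs (logits) and reports the most likely label. The logits below are made-up numbers, and softmax and top_label are illustrative helpers, not functions from transformers; the label order is copied from the id2label mapping in the fine-tuning code later in this article.

```python
import math

# Label order copied from the id2label mapping in the fine-tuning code.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits: list[float]) -> dict:
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


# Made-up logits for one sentence.
result = top_label([-1.2, 0.3, 2.5])
print(result["label"], round(result["score"], 3))
```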

The first time the pipeline code above runs, the model is downloaded automatically; the default location is ~/.cache/huggingface/hub.
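As a small sketch of where that directory comes from: the Hub cache defaults to ~/.cache/huggingface/hub, and the HF_HOME and HF_HUB_CACHE environment variables can move it. The helper hub_cache_dir below is hypothetical and only approximates the library's lookup order (for instance, it ignores XDG_CACHE_HOME); it does not call huggingface_hub itself.

```python
import os
from pathlib import Path


def hub_cache_dir(env=None) -> Path:
    """Approximate the directory where downloaded models are cached."""
    env = os.environ if env is None else env
    # An explicit cache location takes priority.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    # Otherwise the cache lives in the "hub" folder under HF_HOME.
    hf_home = env.get("HF_HOME") or str(Path.home() / ".cache" / "huggingface")
    return Path(hf_home) / "hub"


print(hub_cache_dir({}))                      # default location
print(hub_cache_dir({"HF_HOME": "/data/hf"}))  # relocated via HF_HOME
```

Setting one of these variables before the first download is a simple way to keep large model files off a small home partition.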

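Besides the pipeline function, a model can also be called through the hosted API, which requires a token requested on the website beforehand. When a model is called through the API it is not downloaded to the local machine, which is one way this approach is more convenient than pipeline.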

import requests

# Project-local config module that holds the API token
from utils.common_config import config


# Despite its name, this function returns classification scores for each input.
def generate_embedding(text: list[str]) -> list:
    # The endpoint URL is missing from the source text.
    embedding_url = ""
    response = requests.post(
        embedding_url,
        headers={"Authorization": f"Bearer {config.hg_token}"},
        json={"inputs": text})

    if response.status_code != 200:
        raise ValueError(f"Request failed with status code {response.status_code}: {response.text}")

    return response.json()

# Chinese test inputs: two song lyrics and the sentence "I am very happy"
string_arr = [
    "从前从前有个人爱你很久，但偏偏风渐渐把距离吹得好远，好不容易又能再多爱一天，但故事的最后你好像还是说了拜拜。",
    "我一路向北，离开有你的季节，你说你好累，已无法再爱上谁。风在山路吹，过往的画面全都是不对，细数惭愧，我伤你几回。",
    "我很开心"]
a = generate_embedding(string_arr)
print(a)

# Output
# [[{'label': 'positive', 'score': 0.5694631934165955}, {'label': 'neutral', 'score': 0.2743554711341858}, {'label': 'negative', 'score': 0.15618135035037994}], [{'label': 'negative', 'score': 0.9576572179794312}, {'label': 'neutral', 'score': 0.0352189838886261}, {'label': 'positive', 'score': 0.007123854476958513}], [{'label': 'positive', 'score': 0.9572104811668396}, {'label': 'neutral', 'score': 0.03854822367429733}, {'label': 'negative', 'score': 0.004241317044943571}]]
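The API response above contains a score for every label, whereas the pipeline call returned only the best label per sentence. Here is a minimal sketch of reducing the former to the latter, using the response printed above with scores rounded to four decimal places (top_labels is an illustrative helper name):

```python
# API response from above, scores rounded to four decimal places.
api_response = [
    [{"label": "positive", "score": 0.5695},
     {"label": "neutral", "score": 0.2744},
     {"label": "negative", "score": 0.1562}],
    [{"label": "negative", "score": 0.9577},
     {"label": "neutral", "score": 0.0352},
     {"label": "positive", "score": 0.0071}],
    [{"label": "positive", "score": 0.9572},
     {"label": "neutral", "score": 0.0385},
     {"label": "negative", "score": 0.0042}],
]


def top_labels(response: list[list[dict]]) -> list[dict]:
    """Keep only the highest-scoring entry for each input sentence."""
    return [max(candidates, key=lambda c: c["score"]) for candidates in response]


for item in top_labels(api_response):
    print(item["label"], item["score"])
# positive 0.5695
# negative 0.9577
# positive 0.9572
```

The labels match the pipeline output shown earlier, so the two ways of calling the model agree on these three sentences.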

Fine-Tuning Models

If the model's results are not good enough, it can be fine-tuned: trained further on custom data so that its parameters are adjusted.

from datasets import load_dataset
import os

# Load the training and test data from files next to this script
data_files = {
    "train": os.path.join(os.path.dirname(__file__), "datas/train_data.json"),
    "test": os.path.join(os.path.dirname(__file__), "datas/test_data.json")
}
dataset = load_dataset("json", data_files = data_files)
print(dataset)
# Inspect the first training example
print(dataset['train'][0])

from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

# Use the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = DistilBertTokenizer.from_pretrained('lxyuan/distilbert-base-multilingual-cased-sentiments-student')
model = (
    DistilBertForSequenceClassification.from_pretrained(
        'lxyuan/distilbert-base-multilingual-cased-sentiments-student',
        num_labels = 3,
        id2label = {0: "negative", 1: "neutral", 2: "positive"},
        label2id = {"negative": 0, "neutral": 1, "positive": 2},
        # ignore_mismatched_sizes=True
    ).to(device)
)
# Output directory for checkpoints
model_name = "sentiment_model"


from transformers import DataCollatorWithPadding
from sklearn.metrics import accuracy_score

# Tokenize the text column of each batch
def preprocess_function(example):
    return tokenizer(example['text'], truncation = True, padding = True)

train_dataset = dataset["train"].map(preprocess_function, batched = True)
test_dataset = dataset["test"].map(preprocess_function, batched = True)

# Pad each training batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

# Accuracy: the share of examples whose highest-scoring class equals the label
def compute_metrics(pred):
    labels = pred.label_ids
    predictions = pred.predictions.argmax(-1)
    accuracy = accuracy_score(labels, predictions)
    return {"accuracy": accuracy}

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir = model_name,
    eval_strategy = "epoch",   # evaluate at the end of every epoch
    learning_rate = 2e-5,
    per_device_train_batch_size = 4,
    per_device_eval_batch_size = 4,
    num_train_epochs = 60,
    weight_decay = 0.01,
)

trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    tokenizer = tokenizer,   # recent transformers releases deprecate this in favor of processing_class
    data_collator = data_collator,
    compute_metrics = compute_metrics,
)

trainer.train()

# Accuracy on the training set
train_results = trainer.evaluate(eval_dataset = train_dataset)
train_accuracy = train_results.get('eval_accuracy')
print(f"Training Accuracy: {train_accuracy}")

# Accuracy on the test set
test_results = trainer.evaluate(eval_dataset = test_dataset)
test_accuracy = test_results.get('eval_accuracy')
print(f"Testing Accuracy: {test_accuracy}")
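To see what compute_metrics does without running a training job, here is a plain-Python sketch of the same calculation: take the index of the largest logit in each row as the prediction, then count how many predictions match the labels. The logits and labels are made-up values, and argmax_rows and accuracy are illustrative stand-ins for predictions.argmax(-1) and accuracy_score.

```python
def argmax_rows(logits: list[list[float]]) -> list[int]:
    """Index of the largest value in each row, like argmax(-1) on a 2-D array."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(labels: list[int], predictions: list[int]) -> float:
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Made-up logits for four examples and three classes.
logits = [
    [2.0, 0.1, -1.0],   # predicts class 0
    [0.2, 1.5, 0.3],    # predicts class 1
    [-0.5, 0.0, 3.0],   # predicts class 2
    [1.0, 0.9, 0.8],    # predicts class 0
]
labels = [0, 1, 2, 2]

predictions = argmax_rows(logits)
print(predictions)                    # [0, 1, 2, 0]
print(accuracy(labels, predictions))  # 0.75
```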

Once training is finished, the new model can be tried out. Because the local training set is small, the resulting model does not perform very well.

from transformers import pipeline

# Load the fine-tuned checkpoint from a local directory
classifier = pipeline(task = 'sentiment-analysis', model = "/Users/shaoyang/.cache/huggingface/hub/sentiment_model/checkpoint-120")
a = classifier([
    "从前从前有个人爱你很久，但偏偏风渐渐把距离吹得好远，好不容易又能再多爱一天，但故事的最后你好像还是说了拜拜。",
    "我很开心"])

print(a)
# Output
# [{'label': 'negative', 'score': 0.532397449016571}, {'label': 'neutral', 'score': 0.9187697768211365}]
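The directory name checkpoint-120 refers to an optimizer step count. As a back-of-the-envelope sketch, assuming a single device and no gradient accumulation (neither is stated in the article), the number of steps is the number of batches per epoch times the number of epochs. The training-set size of 8 below is a guess chosen for illustration; the article does not give it.

```python
import math


def total_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a run with no gradient accumulation on one device."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# batch_size and epochs come from the TrainingArguments above;
# num_examples = 8 is a hypothetical training-set size.
print(total_steps(num_examples=8, batch_size=4, epochs=60))  # 120
```

Under these assumptions, any training set of 5 to 8 examples gives 120 steps, which would be consistent with the remark that the local training data is small.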


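Copyright notice: this article, "AI Beginner's Village: Hugging Face", was contributed by a community member and reflects only the author's views. To republish it, contact the author and cite the source: http://it.en369.cn/jiaocheng/1747737639a2211159.html. This site only provides information storage, holds no ownership of the content, and assumes no related legal liability. Content suspected of plagiarism, infringement, or illegality will be removed once verified.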
