Sentiment Analysis

3.5.2.4. Sentiment Analysis

Goal: Assign a label ("positive", "negative", or "neutral") to a passage, a sentence, or an even finer unit such as an individual aspect or word.

Practical applications:

Analyzing reviews of products, hotels, and restaurants.
Mining opinions on social media (Twitter, Facebook, etc.).
Building automated customer-support systems.
Monitoring and managing brand reputation.

Common approaches to Sentiment Analysis

1. Rule-based / lexicon-based approach

Sentiment lexicon: A list of words that carry sentiment, such as positive words (“tuyệt vời” (wonderful), “đẹp” (beautiful), “xuất sắc” (excellent)) and negative words (“tệ” (bad), “xấu” (ugly), “kinh khủng” (terrible)). Each word usually has a score that reflects how positive or negative it is.
Rules: Hand-written rules combine those scores. For example, if a sentence contains more negative words than positive ones, it is likely to be negative.

Advantages:

Easy to deploy when a sentiment lexicon is already available.
Results are explainable (you can see which words drove the decision).

Disadvantages:

Coverage is limited if the lexicon is incomplete.
Complex context is hard to capture, for example negation or sarcasm.
The lexicon is hard to maintain and to extend to new languages and domains.
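The lexicon-plus-rules idea described above can be sketched in a few lines. This is only an illustration: the `LEXICON` scores, the `NEGATIONS` set, and the `lexicon_sentiment` function are hypothetical names and values invented for this example, and the single negation rule (flip the score of the word right after a negation) is far simpler than what a real system would need.

```python
# Hypothetical mini-lexicon: word -> polarity score (illustrative values only).
LEXICON = {
    "wonderful": 1.0,
    "beautiful": 0.8,
    "excellent": 1.0,
    "bad": -0.8,
    "ugly": -0.8,
    "terrible": -1.0,
}
# Hypothetical negation words handled by the single rule below.
NEGATIONS = {"not", "never", "no"}


def lexicon_sentiment(text, threshold=0.0):
    """Sum lexicon scores; flip the score of a word that directly follows a negation."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    total = 0.0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        score = LEXICON.get(token, 0.0)  # words outside the lexicon contribute 0
        if negate:
            score = -score
        negate = False  # a negation only affects the next token
        total += score
    if total > threshold:
        return "positive"
    if total < -threshold:
        return "negative"
    return "neutral"


print(lexicon_sentiment("The design is beautiful and the quality is excellent."))
print(lexicon_sentiment("This is not wonderful."))
print(lexicon_sentiment("It arrived yesterday."))
```

The third sentence contains no lexicon word, so it falls back to "neutral", which illustrates the coverage limitation listed above.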

2. Transformer-based deep learning approach

Transformer-based models (BERT, RoBERTa, GPT, XLM-R, PhoBERT for Vietnamese, etc.)
The Transformer has become the standard architecture for many NLP tasks, including Sentiment Analysis.
Pre-trained models such as BERT represent natural language very well because they are trained on very large corpora.
When fine-tuned for sentiment analysis, these models can reach very high accuracy.

Advantages:

Captures context and the relationships between words better than classical methods.
Generalizes well and can be extended to other languages and data domains.

Disadvantages:

Requires a large amount of training data and powerful hardware (GPU/TPU).
Results are harder to explain than those of rule-based or traditional machine learning methods.

Workflow for building a Sentiment Analysis system

Data collection: Gather reviews, feedback, comments, etc.
Preprocessing and normalization: Clean the text, split it into sentences/words, remove noise, and handle special characters.
Choosing an approach:

Rule-based / lexicon-based: Requires a rich sentiment lexicon and suitable language-processing rules.
Machine Learning: Requires labeled data, feature extraction, and training.
Deep Learning: Applies embedding models, RNN/CNN/Transformer architectures, and fine-tuning.

Training and evaluation:

Split the data into train/validation/test sets.
Evaluate the model with metrics such as Accuracy, Precision, Recall, and F1-score.

Deployment and monitoring:

Put the model into production.
Collect feedback, evaluate, and keep improving.
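To make the evaluation metrics named in the workflow concrete, here is a minimal pure-Python sketch that computes accuracy plus per-class precision, recall, and F1 from two label lists. The function name `sentiment_metrics` and the toy labels are assumptions made for this example; in practice a library such as scikit-learn would normally be used.

```python
def sentiment_metrics(y_true, y_pred):
    """Compute accuracy, per-class precision/recall/F1, and macro-averaged F1."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    # Macro F1: unweighted mean of the per-class F1 scores
    macro_f1 = sum(m["f1"] for m in per_class.values()) / len(per_class)
    return {"accuracy": accuracy, "per_class": per_class, "macro_f1": macro_f1}


# Toy labels, invented for illustration
y_true = ["positive", "negative", "neutral", "negative"]
y_pred = ["positive", "negative", "positive", "negative"]
report = sentiment_metrics(y_true, y_pred)
print(report["accuracy"])  # 0.75
print(report["per_class"]["positive"])
```

In this toy case accuracy is 0.75, yet the "neutral" class has an F1 of 0 because it is never predicted, which is why per-class metrics are worth reporting alongside accuracy.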

data = [
    {
        "text": "Sản phẩm này quá tuyệt vời, rất đáng đồng tiền.",
        "sentiment": "positive",
    },
    {
        "text": "Thật sự thất vọng, chất lượng kém hơn mong đợi.",
        "sentiment": "negative",
    },
    {
        "text": "Chưa biết xài sao, mới mua thôi, chắc ổn.",
        "sentiment": "neutral",
    },
    {
        "text": "Quá chán, mất thời gian, không đáng tiền chút nào.",
        "sentiment": "negative",
    },
    {
        "text": "Khá tốt, giá hợp lý, nhưng giao hàng hơi chậm.",
        "sentiment": "neutral",
    },
    {
        "text": "Thiết kế đẹp, màu sắc đúng ý, sẽ mua lại nếu cần.",
        "sentiment": "positive",
    },
    {
        "text": "Tuy sản phẩm không đẹp,  nhưng hài lòng chất lượng.",
        "sentiment": "positive",
    },
]

label2id = {"negative": 0, "neutral": 1, "positive": 2}
id2label = {v: k for k, v in label2id.items()}

3.5.2.4.1. Pre-trained model

from transformers import pipeline

model_path = "5CD-AI/Vietnamese-Sentiment-visobert"
sentiment_pipeline = pipeline(
    "sentiment-analysis", model=model_path, tokenizer=model_path
)

for item in data:
    result = sentiment_pipeline(item["text"])[0]
    item["predicted_sentiment"] = result["label"]
    item["score"] = result["score"]

import pandas as pd

pd.DataFrame(data)

	text	sentiment	predicted_sentiment	score
0	Sản phẩm này quá tuyệt vời, rất đáng đồng tiền.	positive	POS	0.998884
1	Thật sự thất vọng, chất lượng kém hơn mong đợi.	negative	NEG	0.998847
2	Chưa biết xài sao, mới mua thôi, chắc ổn.	neutral	POS	0.659783
3	Quá chán, mất thời gian, không đáng tiền chút ...	negative	NEG	0.999171
4	Khá tốt, giá hợp lý, nhưng giao hàng hơi chậm.	neutral	POS	0.726378
5	Thiết kế đẹp, màu sắc đúng ý, sẽ mua lại nếu cần.	positive	POS	0.993708
6	Tuy sản phẩm không đẹp, nhưng hài lòng chất l...	positive	NEU	0.678071
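The table above mixes two label vocabularies: the dataset uses positive/negative/neutral, while the pipeline returns POS/NEG/NEU. The short sketch below maps one onto the other and counts how many predictions agree with the gold labels. The mapping `PIPELINE_TO_DATASET` is an assumption based on the obvious abbreviations, and the `rows` list simply copies the label columns from the table.

```python
# Assumed mapping from the pipeline's label names to the dataset's label names.
PIPELINE_TO_DATASET = {"POS": "positive", "NEG": "negative", "NEU": "neutral"}

# (gold label, pipeline label) pairs copied from the table above, rows 0 to 6.
rows = [
    ("positive", "POS"),
    ("negative", "NEG"),
    ("neutral", "POS"),
    ("negative", "NEG"),
    ("neutral", "POS"),
    ("positive", "POS"),
    ("positive", "NEU"),
]

matches = [gold == PIPELINE_TO_DATASET[pred] for gold, pred in rows]
accuracy = sum(matches) / len(rows)
print(f"{sum(matches)}/{len(rows)} correct, accuracy = {accuracy:.3f}")
# 4/7 correct, accuracy = 0.571
```

The three disagreements (rows 2, 4, and 6) are also the three predictions with the lowest scores in the table (about 0.66 to 0.73).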

3.5.2.4.2. Custom model

Train/test split

from sklearn.model_selection import train_test_split

# Copy data into a list
data_list = list(data)

# Step 1: split into train (about 60%) and a temporary set (about 40%)
train_data, temp_data = train_test_split(
    data_list, test_size=0.4, random_state=42, shuffle=True
)

# Step 2: split temp_data into validation (50%) and test (50%)
val_data, test_data = train_test_split(
    temp_data, test_size=0.5, random_state=42, shuffle=True
)

print("Train size:", len(train_data))
print("Val size:", len(val_data))
print("Test size:", len(test_data))

Train size: 4
Val size: 1
Test size: 2
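The printed sizes (4/1/2) are not an exact 60/20/20 split because seven samples cannot be divided that way. As far as I understand scikit-learn's behavior, a float `test_size` is turned into a sample count by rounding up, and the remainder goes to the other side. The helper `split_sizes` below is a hypothetical function that only reproduces this arithmetic; it does not call scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Mimic the size arithmetic: the held-out count is rounded up, the rest stays."""
    n_held_out = math.ceil(test_size * n_samples)
    n_kept = n_samples - n_held_out
    return n_kept, n_held_out


n_train, n_temp = split_sizes(7, 0.4)     # step 1: train vs. temporary set
n_val, n_test = split_sizes(n_temp, 0.5)  # step 2: validation vs. test
print(n_train, n_val, n_test)  # 4 1 2
```

With a single validation sample and two test samples, any metric computed on these sets is dominated by noise, so the numbers reported later in this section are illustrative only.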

Creating a Dataset for Hugging Face

import torch
from torch.utils.data import Dataset


class SentimentDataset(Dataset):
    def __init__(self, data, tokenizer, label2id, max_len=64):
        """
        data: list of dicts, each with {"text": str, "sentiment": str}
        tokenizer: a Hugging Face tokenizer (AutoTokenizer, etc.)
        label2id: dict mapping sentiment (str) -> id (int)
        max_len: maximum sequence length used when tokenizing
        """
        self.data = data
        self.tokenizer = tokenizer
        self.label2id = label2id
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        text = sample["text"]
        sentiment = sample["sentiment"]
        label_id = self.label2id[sentiment]

        encoding = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt",
        )

        # By default the tokenizer returns:
        #   encoding["input_ids"], encoding["attention_mask"], (possibly token_type_ids)
        # squeeze(0) removes the batch dimension (each call encodes one sample)
        input_ids = encoding["input_ids"].squeeze(0)
        attention_mask = encoding["attention_mask"].squeeze(0)

        # Build the result dict in the format Trainer expects
        item = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": torch.tensor(label_id, dtype=torch.long),
        }

        return item
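The `truncation=True`, `padding="max_length"` combination used in `__getitem__` makes every sample exactly `max_len` tokens long and marks which positions are real tokens. The sketch below imitates that behavior on plain lists. The function `pad_or_truncate`, the token ids, and `pad_id=0` are all hypothetical; a real tokenizer uses its own padding id and also adds special tokens.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = list(token_ids[:max_len])  # truncation: drop tokens beyond max_len
    n_pad = max_len - len(ids)       # padding: how many slots remain empty
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    input_ids = ids + [pad_id] * n_pad
    return input_ids, attention_mask


ids, mask = pad_or_truncate([101, 7, 42, 102], max_len=6)
print(ids)   # [101, 7, 42, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The attention mask is what lets the model ignore the padded positions, which is why the dataset returns it alongside `input_ids`.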

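from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = "5CD-AI/Vietnamese-Sentiment-visobert"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Note: this checkpoint already has a 3-class head with its own label names
# (the pipeline above returned POS/NEG/NEU). Passing id2label/label2id here only
# renames the output indices; it does not reorder the pre-trained weights, so it
# is worth comparing this mapping with the checkpoint's original label order.
model = AutoModelForSequenceClassification.from_pretrained(
    model_path, num_labels=3, id2label=id2label, label2id=label2id
)

train_dataset = SentimentDataset(train_data, tokenizer, label2id, max_len=64)
val_dataset = SentimentDataset(val_data, tokenizer, label2id, max_len=64)
test_dataset = SentimentDataset(test_data, tokenizer, label2id, max_len=64)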

Metrics for evaluation

import evaluate
import numpy as np

# Load the metrics once instead of on every evaluation call
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")


def evaluate_function(eval_pred):
    logits, labels = eval_pred  # both are numpy.ndarray
    # Use np.argmax rather than torch.argmax because the inputs are NumPy arrays
    predictions = np.argmax(logits, axis=-1)

    # Accuracy and F1 come from the Hugging Face evaluate library;
    # any other metric (for example from scikit-learn) could be used instead
    acc = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(
        predictions=predictions, references=labels, average="weighted"
    )

    return {"accuracy": acc["accuracy"], "f1": f1["f1"]}

Training

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./output/visobert-finetuned",
    eval_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    logging_dir="./output/logs",
    logging_steps=1,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,  # used for validation
    processing_class=tokenizer,  # Trainer uses it to batch and pad inputs
    compute_metrics=evaluate_function,
)

trainer.train()

[6/6 00:34, Epoch 3/3]

Epoch	Training Loss	Validation Loss
1	0.001200	5.580416
2	0.000300	8.432305
3	0.000500	8.990547

TrainOutput(global_step=6, training_loss=0.0007009292385191657, metrics={'train_runtime': 35.6154, 'train_samples_per_second': 0.337, 'train_steps_per_second': 0.168, 'total_flos': 394670126592.0, 'train_loss': 0.0007009292385191657, 'epoch': 3.0})

Evaluation

test_metrics = trainer.evaluate(test_dataset)
print("Test metrics:", test_metrics)

Test metrics: {'eval_loss': 3.8895528316497803, 'eval_accuracy': 0.5, 'eval_f1': 0.5, 'eval_runtime': 2.8813, 'eval_samples_per_second': 0.694, 'eval_steps_per_second': 0.347, 'epoch': 3.0}

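test_text = "Hàng giao nhanh, chất lượng vừa ý."
# Move the inputs to the model's device (Trainer may have placed the model on a GPU)
inputs = tokenizer(
    test_text, return_tensors="pt", truncation=True, padding=True
).to(model.device)

model.eval()  # make sure dropout is disabled for inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    pred_id = logits.argmax(dim=-1).item()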

print("Text:", test_text)
print("Predicted sentiment:", id2label[pred_id])

Text: Hàng giao nhanh, chất lượng vừa ý.
Predicted sentiment: neutral
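The inference code above reports only the arg-max label. To see how confident the model is, the logits can be turned into probabilities with a softmax. The sketch below does this in plain Python. The `example_logits` values are hypothetical and are not the actual output for the sentence above; `id2label` repeats the mapping defined earlier in this section.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 (numerically stable)."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Same mapping as defined earlier in this section
id2label = {0: "negative", 1: "neutral", 2: "positive"}

example_logits = [-1.2, 2.3, 0.4]  # hypothetical values, not real model output
probs = softmax(example_logits)
pred_id = max(range(len(probs)), key=probs.__getitem__)

for idx, p in enumerate(probs):
    print(f"{id2label[idx]}: {p:.3f}")
print("Predicted:", id2label[pred_id])
```

With PyTorch tensors the same result comes from `torch.softmax(logits, dim=-1)`. A low top probability is a hint that the prediction deserves a closer look.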
