Large Language Models

3.6.2. Large Language Models#

3.6.2.1. Generate Text#

Autoregression lets a model generate new tokens one at a time: each new token is appended to the existing input sequence, and the updated sequence is fed back in as the input for the next token. Generation continues until a stopping condition is met, such as reaching the maximum length or producing the end-of-sequence marker <STOP>.
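The autoregressive loop can be sketched in a few lines. This is a minimal illustration, not how any library implements it: `toy_next_token` and `TOY_TRANSITIONS` are hypothetical stand-ins for a real model, and `max_new_tokens` and the `<STOP>` marker play the role of the two stopping conditions.

```python
# Hypothetical stand-in for a language model: maps the last token to the next one.
TOY_TRANSITIONS = {
    "I": "want",
    "want": "to",
    "to": "walk",
    "walk": "<STOP>",
}


def toy_next_token(sequence):
    """Return the next token for a sequence (assumed toy rule, not a real model)."""
    return TOY_TRANSITIONS.get(sequence[-1], "<STOP>")


def generate(prompt_tokens, max_new_tokens=10, stop_token="<STOP>"):
    """Autoregressive loop: append one token at a time and feed the sequence back in."""
    sequence = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = toy_next_token(sequence)
        if next_token == stop_token:  # stopping condition 1: end-of-sequence marker
            break
        sequence.append(next_token)  # the updated sequence becomes the next input
    return sequence  # stopping condition 2: max_new_tokens was exhausted


print(generate(["I"]))
print(generate(["I"], max_new_tokens=2))
```

The second call stops early because the length limit is reached before the toy model emits `<STOP>`.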

Many chatbot and "streaming" text applications (such as ChatGPT) emit text token by token or in small chunks. This is implemented by calling the model repeatedly, either reusing the cached states from earlier steps (the key/value cache) or continuing generation from the same, growing input_ids.

Frameworks such as Hugging Face Transformers typically provide a generate() function that produces the whole sequence. To stream tokens, you can call the model step by step or use one of the supported "stream output" mechanisms.
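Streaming can be illustrated with a plain Python generator that yields each token as soon as it is produced. This is only a sketch: `stream_tokens` and `toy_model` are hypothetical names, and the toy model simply replays a canned reply. In Transformers, a comparable effect is available by passing a `TextStreamer` object to `generate()` through its `streamer` argument.

```python
def stream_tokens(prompt_tokens, next_token_fn, max_new_tokens=5, stop_token="<STOP>"):
    """Yield each new token as soon as it is produced instead of returning the full text."""
    sequence = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token_fn(sequence)
        if token == stop_token:
            return
        sequence.append(token)
        yield token


def toy_model(sequence):
    """Hypothetical model: replays a canned reply, then emits the stop token."""
    reply = ["Hello", ",", " world", "!"]
    generated = len(sequence) - 1  # the toy prompt is a single token
    return reply[generated] if generated < len(reply) else "<STOP>"


for piece in stream_tokens(["<user-prompt>"], toy_model, max_new_tokens=10):
    print(piece, end="", flush=True)  # the caller can display text incrementally
print()
```

Because the caller consumes tokens one by one, it can update the user interface before generation has finished.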

3.6.2.1.1. Decoding methods#

Once the model has computed a probability distribution over the whole vocabulary (every possible token), we need a strategy (decoding method) to choose the next token.

Summary

Method	How it works	Pros	Cons	When to use	Diversity	Coherence	Compute cost
Greedy	Pick the highest-probability token at every step; fast, but prone to repetition and low diversity.	- Fast, simple - Reproducible (deterministic) output	- Prone to repetition - Text can be dull	When a quick, predictable draft is enough and speed matters most.	Low	High	Low
Beam Search	Keep several beams and return the best one at the end, scoring each beam by the overall probability of its token sequence.	- Explores several directions - Finds better sequences than greedy in MT	- Resource-hungry - Can still be dull or repetitive on long text	Translation, summarization, NLG that needs high accuracy.	Medium	High	Medium
Sampling (Random)	Draw the next token at random from the probability distribution.	- Diverse, "creative"	- Prone to nonsensical output if left uncontrolled	When random, varied results are wanted: creative writing, content creation.	High	Low	Low
Top-k	Sample only from the k highest-probability tokens.	- More control than pure random sampling - Lowers the risk of very unlikely tokens	- A poorly chosen k still leads to repetition or low diversity	Combine with temperature to raise or lower "confidence"; suits general text generation.	Medium	Medium	Low
Top-p (Nucleus)	Sample from the smallest set of tokens whose cumulative probability reaches p.	- Adapts the number of candidate tokens automatically - Usually natural and fluent	- Needs a sensible p - If p is too large, control is lost	Often used in place of top-k with better results; suits general text generation.	Medium	Medium	Low

Key parameters that affect token selection

Temperature (temp): Rescales the probability distribution before sampling (explained in the Sampling section); the default is usually 1.0.
Top-k: The value k (commonly 50 to 100) that limits sampling to the k most likely tokens. 0 means top-k is not applied.
Top-p (nucleus sampling): The value p (usually 0.8 to 0.95), a cumulative probability threshold.
Max new tokens / max length: The maximum length of the generated sequence.
Repetition penalty: A penalty factor applied when the model repeats a token or n-gram too often.
Beam size (in beam search): The number of beams kept in parallel.
Do_sample: Turns sampling on or off (when off, decoding is usually greedy or beam search).
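To make the repetition penalty concrete, here is a small sketch over a toy list of logits. It assumes one common formulation, in which the logit of every already-generated token is divided by the penalty when positive and multiplied by it when negative; whether a given library uses exactly this rule should be checked against its documentation. The function name `apply_repetition_penalty` is hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Discount the logits of tokens that were already generated.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so a penalty > 1 always makes a seen token less likely.
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty
        else:
            adjusted[token_id] *= penalty
    return adjusted


logits = [2.0, -1.0, 0.5, 3.0]  # toy scores for a 4-token vocabulary
print(apply_repetition_penalty(logits, generated_ids=[0, 1, 0], penalty=2.0))
```

Tokens 0 and 1 have already been generated, so only their scores are lowered; tokens 2 and 3 are left untouched.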

The right parameters depend on the goal of the output:

Creative text (stories, novels, free-form dialogue): Raise temperature, enable top-k or top-p, and accept some randomness.
Accurate text that must be consistent (summaries, translation, technical answers): Turn sampling off (use greedy/beam), or lower temperature and top-k/top-p.
Avoiding repetition: Use a moderate repetition penalty (1.1 to 1.3).

import warnings

# ignore warnings
warnings.filterwarnings("ignore", module=".*")

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Download the GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPT-2 has no default pad token; reuse eos_token so no new token has to be added
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Resize the model's embeddings only if new tokens were added (not needed here)
# model.resize_token_embeddings(len(tokenizer))

# Initial prompt
prompt = "It's such a nice day today, I want to go out and "
input_ids = tokenizer.encode(prompt, return_tensors="pt")

3.6.2.1.1.1. Greedy search#

How it works: Always pick the highest-probability token at each step.

Pros:

Simple and fast.
Deterministic: the same prompt always yields the same output.

Cons:

Prone to loops (repetition) or dull, low-diversity text.
Does not exploit the model's "creative" capacity.

greedy_output = model.generate(
    input_ids,
    max_new_tokens=30,  # Generate at most 30 new tokens
    do_sample=False,  # Sampling off: greedy decoding
    pad_token_id=tokenizer.eos_token_id,  # Use EOS as the padding token id
)

greedy_text = tokenizer.decode(greedy_output[0], skip_special_tokens=True)
print("---- Greedy Search ----")
print(greedy_text)

---- Greedy Search ----
It's such a nice day today, I want to go out and  take a walk with my friends. I'm going to go out and get some coffee. I'm going to go out and get some food.

Because do_sample=False, the model always picks the highest-probability token at each step, so the output tends to repeat itself or trail off on hard prompts and shows little "creativity".

3.6.2.1.1.2. Beam Search#

How it works: Keep several hypothesis "threads" (beams), each of which is a generated token sequence. Extend every beam with its highest-probability tokens, keep only the best candidates, and at the end return the best beam (or the few best beams).

Pros:

Optimizes better than greedy because it explores several branches.
Widely used in machine translation.

Cons:

Resource-hungry and slower.
On long text, beam search can also produce repetitive or bland output.
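The difference between greedy and beam search shows up even on a tiny example. The sketch below uses a hypothetical table of next-token probabilities (`TOY_MODEL`, invented for illustration) in which the most likely first token does not lead to the most likely two-token sequence. Running `beam_search` with one beam is equivalent to greedy search.

```python
import math

# Hypothetical next-token probabilities, keyed by the sequence generated so far.
TOY_MODEL = {
    (): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("nice",): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("dog",): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("car",): {"is": 0.5, "drives": 0.3, "turns": 0.2},
}


def beam_search(num_beams, steps=2):
    """Keep the `num_beams` highest-scoring partial sequences at every step."""
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for sequence, score in beams:
            for token, prob in TOY_MODEL[sequence].items():
                candidates.append((sequence + (token,), score + math.log(prob)))
        # Prune: only the best `num_beams` candidates survive to the next step.
        beams = sorted(candidates, key=lambda item: item[1], reverse=True)[:num_beams]
    return beams[0]


greedy_seq, greedy_score = beam_search(num_beams=1)  # one beam is greedy search
beam_seq, beam_score = beam_search(num_beams=2)
print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

Greedy commits to "nice" and ends with a sequence of probability 0.2, while two beams keep "dog" alive long enough to find the sequence with probability 0.36.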

beam_output = model.generate(
    input_ids,
    max_new_tokens=30,  # Generate at most 30 new tokens
    num_beams=5,  # Beam search with 5 beams
    early_stopping=True,  # Stop as soon as num_beams complete candidates are found
    pad_token_id=tokenizer.eos_token_id,  # Use EOS as the padding token id
)

beam_text = tokenizer.decode(beam_output[0], skip_special_tokens=True)
print("---- Beam Search ----")
print(beam_text)

---- Beam Search ----
It's such a nice day today, I want to go out and  have a good time with my friends and family.  It's such a nice day today, I want to go out and  have

Adding a constraint to avoid repetition

beam_output = model.generate(
    input_ids,
    max_new_tokens=30,  # Generate at most 30 new tokens
    num_beams=5,  # Beam search with 5 beams
    early_stopping=True,  # Stop as soon as num_beams complete candidates are found
    no_repeat_ngram_size=3,  # Never repeat any 3-gram
    pad_token_id=tokenizer.eos_token_id,  # Use EOS as the padding token id
)

beam_text = tokenizer.decode(beam_output[0], skip_special_tokens=True)
print("---- Beam Search with no repeat ----")
print(beam_text)

---- Beam Search with no repeat ----
It's such a nice day today, I want to go out and  have a good time with my friends and family.  I'm so happy to be here and I'm looking forward to the day when I
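The idea behind a no-repeat n-gram constraint can be sketched without any library: before choosing the next token, find every token that would complete an n-gram that already occurs in the sequence, and forbid it. The helper `banned_next_tokens` below is a hypothetical illustration that works on word strings rather than token ids; it is not the Transformers implementation.

```python
def banned_next_tokens(tokens, ngram_size=3):
    """Return tokens that would complete an n-gram already present in `tokens`."""
    if len(tokens) < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - ngram_size + 1:])
    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


history = "I want to go out and I want".split()
print(banned_next_tokens(history, ngram_size=3))
```

Here the sequence ends in "I want", and "I want to" has already appeared, so "to" is banned as the next token.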

3.6.2.1.1.3. Sampling#

Sampling (random, usually combined with temperature, top-k, or top-p):

Basic random sampling
Choose the next token at random according to the probability distribution the model returns. High-probability tokens are still the most likely picks, but low-probability tokens can also appear (with a small chance).
- Pros: Diverse, with an element of surprise.
- Cons: Can produce incoherent or off-topic text.
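A quick experiment shows what "sampling from the distribution" means. The three-token distribution below is invented for illustration; with a fixed seed, the empirical frequencies of many draws approach the stated probabilities, and even the least likely token gets picked from time to time.

```python
import random
from collections import Counter

# Hypothetical next-token distribution returned by a model.
probs = {"walk": 0.6, "run": 0.3, "sing": 0.1}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens = list(probs)
draws = rng.choices(tokens, weights=[probs[t] for t in tokens], k=10_000)

counts = Counter(draws)
for token in tokens:
    # The empirical frequency approaches probs[token] as the number of draws grows.
    print(token, counts[token] / len(draws))
```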

sampling_output = model.generate(
    input_ids,
    max_new_tokens=30,  # Generate at most 30 new tokens
    do_sample=True,  # Sampling on
    pad_token_id=tokenizer.eos_token_id,  # Use EOS as the padding token id
)

sampling_text = tokenizer.decode(sampling_output[0], skip_special_tokens=True)
print("---- Sampling ----")
print(sampling_text)

---- Sampling ----
It's such a nice day today, I want to go out and  go to a coffee shop and to sing. I have a big problem with myself. It will make me feel so tired on the day before.

Temperature

A coefficient T used to rescale the probability distribution before sampling, where p_i is the probability the model assigns to token i:

\[p_i^\prime = \frac{p_i^{\,1/T}}{\sum_j p_j^{\,1/T}}\]
- When T < 1, the model becomes more "confident": high-probability tokens are picked more often, so the text is more coherent but less diverse.
- When T > 1, the token probabilities become more "balanced": randomness increases, so the text is more "creative" but at greater risk of drifting off topic.
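The temperature formula can be checked numerically on a small, made-up distribution. The function `apply_temperature` below is a hypothetical helper that applies the rescaling directly to probabilities.

```python
def apply_temperature(probs, temperature):
    """Rescale a probability distribution: p_i' = p_i^(1/T) / sum_j p_j^(1/T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


probs = [0.6, 0.3, 0.1]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

With T = 0.5 the largest probability grows and the distribution sharpens; with T = 2.0 it flattens; T = 1.0 leaves it unchanged.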

tem_output = model.generate(
    input_ids,
    max_new_tokens=30,  # Generate at most 30 new tokens
    do_sample=True,  # Sampling on
    temperature=0.2,  # Low sampling temperature
    pad_token_id=tokenizer.eos_token_id,  # Use EOS as the padding token id
)

tem_text = tokenizer.decode(tem_output[0], skip_special_tokens=True)
print("---- Sampling with low temperature ----")
print(tem_text)

---- Sampling with low temperature ----
It's such a nice day today, I want to go out and  take a walk with my family. I'm going to go to the park and get some rest. I'm going to go to the park and

Top-k Sampling

Take the k highest-probability tokens (from the whole vocabulary), renormalize them into a probability distribution, and sample from those k tokens.

Meaning: The model may only choose among the k most likely tokens.
- Pros: Avoids very rare tokens, which helps the text repeat less while keeping diversity.
- Cons: With a poorly chosen k, the text can still be repetitive or not diverse enough.
Top-k: The text has moderate diversity and little repetition.
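The filtering step of top-k sampling is short enough to write out. This sketch uses an invented four-token distribution and a hypothetical helper `top_k_filter`; sampling would then draw from the returned, renormalized distribution.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}


probs = {"walk": 0.5, "run": 0.3, "sing": 0.15, "xylophone": 0.05}
print(top_k_filter(probs, k=2))
```

With k = 2 only "walk" and "run" remain, and their probabilities are rescaled so that they again sum to 1.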

topk_output = model.generate(
    input_ids,
    max_new_tokens=30,  # Generate at most 30 new tokens
    do_sample=True,  # Sampling on
    top_k=50,  # Sample only from the 50 most likely tokens
    temperature=0.7,  # Sampling temperature
    pad_token_id=tokenizer.eos_token_id,  # Use EOS as the padding token id
)

topk_text = tokenizer.decode(topk_output[0], skip_special_tokens=True)
print("---- Sampling with top-k ----")
print(topk_text)

---- Sampling with top-k ----
It's such a nice day today, I want to go out and  have a nice day. So I'm going to show you some other pictures I took of my friends today, because my friends are all amazing.

Top-p (Nucleus Sampling)

Take the smallest group of tokens whose total probability is ≥ p. For example, with p = 0.9 we add tokens in order of decreasing probability until their cumulative probability reaches 0.9.

Meaning: The candidate tokens always cover enough probability mass, without fixing the number of tokens as top-k does.
- Pros: Stays "dynamic": the number of candidate tokens changes with how spread out the probability distribution is.
- Cons: Needs a sensible value of p (usually 0.9 to 0.95) to balance diversity and coherence.
Top-p: Usually preferred over top-k because it gives more natural, fluent text.
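The "dynamic" behavior of top-p is easiest to see by filtering two made-up distributions with the same p. The helper `top_p_filter` below is a hypothetical sketch of the nucleus selection step only.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the kept tokens form a valid distribution.
    return {token: prob / cumulative for token, prob in nucleus}


peaked = {"walk": 0.92, "run": 0.05, "sing": 0.02, "xylophone": 0.01}
flat = {"walk": 0.35, "run": 0.3, "sing": 0.2, "xylophone": 0.15}
print(len(top_p_filter(peaked, p=0.9)))
print(len(top_p_filter(flat, p=0.9)))
```

When the model is confident (the peaked case) a single token already covers p, while the flat distribution needs all four tokens; top-k with a fixed k could not adapt in this way.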

topp_output = model.generate(
    input_ids,
    max_new_tokens=30,  # Generate at most 30 new tokens
    do_sample=True,  # Sampling on
    top_p=0.90,  # Sample only from the top-p nucleus of tokens
    temperature=0.5,  # Sampling temperature
    pad_token_id=tokenizer.eos_token_id,  # Use EOS as the padding token id
)

topp_text = tokenizer.decode(topp_output[0], skip_special_tokens=True)
print("---- Sampling with top-p ----")
print(topp_text)

---- Sampling with top-p ----
It's such a nice day today, I want to go out and  do something for the community. I want to go out and make some money and get a job. I want to get out and do something for

3.6.2.2. Overview of LLMs#

3.6.2.2.1. LLM Models#

1. OpenAI’s GPT Models

GPT-3.5:
- Known for generating high-quality text at a lower cost.
- Specs:
  - Can process up to 16,385 tokens.
  - Generates up to 4,096 tokens of text.
GPT-4:
- Processes both text and images.
- Variants like Turbo are optimized for efficiency.
- Specs: The Turbo variant handles larger inputs (up to 128,000 tokens).

2. Other LLMs to Know

Claude 3 (Anthropic): Focuses on safety and multilingual abilities.
Gemini (Google): Works with text, images, and video.
LLaMA 3 (Meta): Excels in language understanding and translation.
Falcon Models (TII): Lightweight models ideal for tasks like text generation and summarization.

3.6.2.3. Risks & Ethics#

3.6.2.3.1. Hallucinations#

Sometimes LLMs create false but believable information.
This is especially risky in fields like law or medicine.

3.6.2.3.2. Bias#

LLMs reflect the inequalities present in the data they are trained on.
Mitigations include better training data and frequent testing.

3.6.2.3.3. Data privacy & Security#

Sensitive data can accidentally be revealed, so strict protection measures are needed.

Companies with strict security requirements usually sign with an enterprise provider such as Azure OpenAI instead of calling the OpenAI API directly.

3.6.2.3.4. Human Oversight#

Critical to ensure AI is used responsibly and decisions are well-informed.

3.6.2.4. Applications#

3.6.2.4.1. Healthcare and medical#

3.6.2.4.2. Finance#

Chatbot
Analyzing Financial Time-series Data
BloombergGPT: Sentiment Analysis, Named Entity Recognition, News Classification, Question Answering
FinGPT: Financial Sentiment Analysis, Stock Trading, Robo-Advisor

3.6.2.4.3. Copywriting#

3.6.2.5. Labs#

%%capture
!pip install openai==1.55.3 httpx==0.27.2 python-dotenv --force-reinstall --quiet

import openai
import os
from dotenv import load_dotenv

# Load OPENAI_API_KEY from the .env file
load_dotenv(r"contents/theory/aiml_algorithms/dl_nlp/llm/.env")

True

from openai import OpenAI

client = OpenAI()

Summarizing paragraphs

# The system prompt is the system message that sets the behavior of the assistant
prompt_system = (
    "You are a helpful assistant whose goal is to help summarize documents."
)

# The user message is the input that you want to provide to the assistant
prompt = """Summarize the following paragraph in no more than 100 words, in Vietnamese.

Hoạt động sản xuất toàn cầu suy yếu trong tháng 9 do nhu cầu giảm và bất ổn kinh tế, theo khảo sát của các tổ chức nghiên cứu quốc tế.
Chỉ số nhà quản trị mua hàng (PMI) tháng 9 của eurozone do S&P Global - hãng thông tin và phân tích của Mỹ - công bố đạt 45 điểm.
Kết quả này nhỉnh hơn ước tính sơ bộ 0,02 điểm, nhưng kém xa ngưỡng 50, đồng nghĩa sản xuất khu vực này vẫn suy giảm.
Hoạt động sản xuất tại khu vực đồng euro đã chậm lại với tốc độ nhanh nhất từ đầu năm nay, do nhu cầu thấp bất chấp các nhà máy phải hạ giá sản phẩm.
Đức - nền kinh tế lớn nhất châu Âu - ghi nhận tình trạng xấu đi rõ rệt trong 12 tháng. (Theo VnExpress)"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": prompt_system},
        {"role": "user", "content": prompt},
    ],
)

response

ChatCompletion(id='chatcmpl-BDqwAQsiJB8DdPPvCn2LgNaNEfaUV', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Hoạt động sản xuất toàn cầu đang gặp khó khăn trong tháng 9 do nhu cầu suy giảm và bất ổn kinh tế. Chỉ số PMI của eurozone đạt 45 điểm, cho thấy sự suy giảm trong sản xuất khu vực này. Mặc dù kết quả tốt hơn một chút so với dự đoán, nhưng vẫn dưới mức 50, chỉ ra rằng sản xuất tiếp tục giảm. Tình hình sản xuất tại khu vực đồng euro đã chậm lại với tốc độ nhanh nhất kể từ đầu năm, đặc biệt tại Đức, nền kinh tế lớn nhất châu Âu.', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None, annotations=[]))], created=1742641454, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier='default', system_fingerprint='fp_e4fa3702df', usage=CompletionUsage(completion_tokens=122, prompt_tokens=230, total_tokens=352, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))

output = response.choices[0].message.content
print(len(output.split()))
print(output)

99
Hoạt động sản xuất toàn cầu đang gặp khó khăn trong tháng 9 do nhu cầu suy giảm và bất ổn kinh tế. Chỉ số PMI của eurozone đạt 45 điểm, cho thấy sự suy giảm trong sản xuất khu vực này. Mặc dù kết quả tốt hơn một chút so với dự đoán, nhưng vẫn dưới mức 50, chỉ ra rằng sản xuất tiếp tục giảm. Tình hình sản xuất tại khu vực đồng euro đã chậm lại với tốc độ nhanh nhất kể từ đầu năm, đặc biệt tại Đức, nền kinh tế lớn nhất châu Âu.

Sentiment Analysis

prompt_system = "You are a helpful assistant whose goal is to perform sentiment analysis for the user."

prompt = """Classify the sentiment of the following paragraph toward the economy as positive, neutral, or negative.

Hoạt động sản xuất toàn cầu suy yếu trong tháng 9 do nhu cầu giảm và bất ổn kinh tế, theo khảo sát của các tổ chức nghiên cứu quốc tế.
Chỉ số nhà quản trị mua hàng (PMI) tháng 9 của eurozone do S&P Global - hãng thông tin và phân tích của Mỹ - công bố đạt 45 điểm.
Kết quả này nhỉnh hơn ước tính sơ bộ 0,02 điểm, nhưng kém xa ngưỡng 50, đồng nghĩa sản xuất khu vực này vẫn suy giảm.
Hoạt động sản xuất tại khu vực đồng euro đã chậm lại với tốc độ nhanh nhất từ đầu năm nay, do nhu cầu thấp bất chấp các nhà máy phải hạ giá sản phẩm.
Đức - nền kinh tế lớn nhất châu Âu - ghi nhận tình trạng xấu đi rõ rệt trong 12 tháng. (Theo VnExpress)

Answer: (return only 'negative', 'neutral', or 'positive')
"""


response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": prompt_system},
        {"role": "user", "content": prompt},
    ],
)

response
output = response.choices[0].message.content
print(output)

negative

Story Generation

prompt_system = (
    "You are a helpful assistant whose goal is to help write stories."
)

prompt = """Continue the following story. Write no more than 50 words.

A Fox one day spied a beautiful bunch of ripe grapes hanging from a vine trained along the branches of a tree.
The grapes seemed ready to burst with juice, and the Fox's mouth watered as he gazed longingly at them."""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": prompt_system},
        {"role": "user", "content": prompt},
    ],
)
response.choices[0].message.content

'Determined to taste their sweetness, the Fox leaped and jumped, but the grapes remained tantalizingly out of reach. Finally, panting and defeated, he huffed, "Those grapes are probably sour anyway," and sauntered away, pretending he never wanted them in the first place.'

prompt = """Continue the following story. Write no more than 50 words.

Vào thời Hùng Vương, có một người đàn bà đã nhiều tuổi nhưng sống một mình và không có con.
Một sáng nọ bà đi thăm nương, bỗng nhìn thấy một vết chân giẫm nát cả mấy luống cà. Bà kêu lên:"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": prompt_system},
        {"role": "user", "content": prompt},
    ],
)
response.choices[0].message.content

'"Who dares to ruin my crops?" Bà cất tiếng gắt gỏng, nhưng không ai đáp lại. Chợt, từ đằng xa, một cậu bé lấm lem bùn xuất hiện, khuôn mặt đầy lo lắng. "Xin lỗi bà, cháu chỉ tìm kiếm thức ăn cho mẹ!" Bà ngỡ ngàng, lòng trào dâng cảm xúc.'

In-Context Learning

Prepend a few example conversation turns so the model can pick up the desired answer style.

prompt_system = (
    "You are a helpful assistant whose goal is to write short poems."
)

prompt = """Write a short poem about {topic}."""

examples = {
    "nature": """
      Birdsong fills the air,
      Mountains high and valleys deep,
      Nature's music sweet.""",
    "winter": """
      Snow blankets the ground,
      Silence is the only sound,
      Winter's beauty found.
    """,
}

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # set the system role to define the assistant's behavior
        {"role": "system", "content": prompt_system},
        # Example 1 of the assistant's behavior
        {"role": "user", "content": prompt.format(topic="nature")},
        {"role": "assistant", "content": examples["nature"]},
        # Example 2 of the assistant's behavior
        {"role": "user", "content": prompt.format(topic="winter")},
        {"role": "assistant", "content": examples["winter"]},
        # User's message
        {"role": "user", "content": prompt.format(topic="summer")},
    ],
)

print(response.choices[0].message.content)

      Sunlight dances bright,  
      Laughter in the warm, still air,  
      Days stretch into night.  

type(response)

openai.types.chat.chat_completion.ChatCompletion