# Retrieval-Augmented Generation (RAG)

## RAG Foundation


### RAG Overview

<p align="center">
<img src = "https://media.datacamp.com/legacy/v1704459771/image_552d84ab56.png" width="700" >
</p>

**Retrieval-Augmented Generation (RAG)** is a technique that combines the strengths of retrieval-based and generation-based models. A retriever finds relevant passages in a large text corpus, and a generator produces the final output. The retrieved passages give the generator relevant context, which can improve the quality of the generated text.

**Pros**:
- Accuracy: Grounds the generated text in relevant context from a large corpus of text.
- Fewer false statements: Reduces incorrect or hallucinated information.
- Source citing: Responses can cite their sources, which increases user trust.
- Integration: Supports frequently updated and domain-specific knowledge.

**Data Source**: Private documents, PDFs, Codebase, SQL Database

**Cons**:
- Increased latency due to additional text processing
- Potential accuracy decline if information is scattered
- Inefficient resource usage with large datasets

**Factors to consider**:
- Application requirements
- Acceptable latency
- Desired accuracy
- Available computational resources

---

**Terms**:

<p align="center">
<img src = "https://github.com/KalyanKS-NLP/rag-zero-to-hero-guide/raw/main/RAG%20Basics/images/RAG_Must_Know_Terms.gif">
</p>

---

**RAG Applications**:
- **Customer support chatbots**: Use RAG to retrieve internal data (FAQs, product documentation, ...) and generate natural answers, shortening response times and handling complex questions.

- **Legal document analysis**: Apply RAG to find and summarize important clauses and precedents in legal texts, helping lawyers research faster and more accurately.

- **Scientific research support**: RAG helps retrieve and synthesize information from papers, data, or scientific experiments, supporting literature review, verification, and exploration of complex topics.

- **Clinical decision support**: Use RAG to retrieve patient records, medical literature, and treatment guidelines, providing accurate, up-to-date recommendations while keeping information secure.

- **Personalized education**: Use RAG to find learning content that matches the learner's level and to give explanations at the right difficulty, effectively filling knowledge "gaps".

- **Technical documentation search**: RAG helps review and extract solutions from specialist documentation, source code, and guides, so developers and engineers can solve problems quickly and in detail.

---

**Why is RAG needed?**

LLMs are trained on vast amounts of text from books, Wikipedia, websites, and code from GitHub repositories. However, their training data is limited to information available up to a specific date. This means their knowledge is cut off at that date.

Without RAG:
- LLMs cannot answer queries about events or facts that occurred after their training cutoff.
- They may generate incorrect or hallucinated responses, making them unreliable for up-to-date information.

With RAG, the system:
- Retrieves relevant content from an external knowledge source (e.g., databases, APIs, or private documents).
- Provides the retrieved content as context to the LLM along with the query, enabling it to generate factually accurate answers.
- Ensures the response is grounded in retrieved information, reducing hallucination.

Thus, RAG enhances LLMs by keeping them updated without requiring frequent retraining.

---

**RAG flow**

<p align="center">
<img src = "https://www.dailydoseofds.com/content/images/size/w1000/2024/11/image-1.png">
</p>

**1. Indexing**

Indexing in RAG involves processing raw documents by first extracting their content (parsing) and then splitting them into smaller, meaningful chunks. These chunks are then converted into vector embeddings using an embedding model and stored in a vector database, enabling efficient retrieval during query-time.

- **Parse:** Extract raw text from documents (PDFs, web pages, etc.).
- **Chunk:** Split text into smaller, meaningful segments for retrieval.
- **Encode:** Convert chunks into dense vector embeddings using an embedding model.
- **Store:** Save embeddings in a vector database for fast similarity search.
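
The four indexing steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production pipeline: `toy_embed` is a hypothetical stand-in for a real embedding model (it hashes words into a fixed-size, normalized bag-of-words vector), the documents are assumed to be already parsed into text, and the "vector database" is just a Python list.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def chunk_text(text: str, chunk_size: int = 200) -> list[str]:
    """Chunk: split on blank lines, then cap each piece at chunk_size characters."""
    chunks = []
    for para in text.split("\n\n"):
        para = para.strip()
        for start in range(0, len(para), chunk_size):
            chunks.append(para[start:start + chunk_size])
    return chunks


def build_index(parsed_documents: list[str]) -> list[dict]:
    """Chunk -> Encode -> Store, for documents that were already parsed to text."""
    index = []
    for doc_id, text in enumerate(parsed_documents):
        for chunk in chunk_text(text):
            index.append({"doc_id": doc_id, "text": chunk, "embedding": toy_embed(chunk)})
    return index


index = build_index(["RAG retrieves context.\n\nThen the LLM generates an answer."])
print(len(index), "chunks indexed")
```

Keeping the chunk text next to its embedding matters: the vector is only used for search, while the text is what later gets passed to the LLM.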

![](_images/how_rag_work_1.gif)


**2. Retrieval**

The user asks a **query**, which is converted into a dense vector (embedding) using the same embedding model that was used in the indexing step. This vector representation is then used in **semantic search** to find the most relevant chunks of information in the vector database.

- **Query:** The user inputs a question or prompt.
- **Encode:** The query is converted into a dense vector representation using an embedding model.
- **Semantic Search:** The encoded query is compared against the embeddings in the vector database to find the most relevant embeddings.
- **Relevant Chunks:** The retrieved chunks of text are returned as context for generating a response.
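
The semantic search step boils down to ranking stored embeddings by their similarity to the query embedding. The sketch below uses exact cosine similarity over a tiny in-memory store. The 3-dimensional vectors are made-up illustrative values, and a real vector database would typically use an approximate index instead of a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, store, k=2):
    """Return the k (score, text) pairs whose embeddings are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, emb), text) for text, emb in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical embeddings; a real model would produce hundreds of dimensions.
store = [
    ("Chunk about vector databases", [0.9, 0.1, 0.0]),
    ("Chunk about cooking pasta", [0.0, 0.2, 0.9]),
    ("Chunk about embeddings", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]

results = semantic_search(query_vec, store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```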

![](_images/How_RAG_works-2.gif)

**3. Augmentation**

In this step, retrieved relevant chunks are combined to form a context. Then the query is merged with this context to construct a prompt for the LLM.

- **Combine:** Relevant chunks are combined to form the context.
- **Augment:** The query is merged with the context to create a prompt for the LLM.
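
A minimal sketch of the combine and augment steps follows. The prompt wording and the `[n]` numbering of chunks are illustrative choices, not a required format.

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Combine retrieved chunks into a context block and merge it with the query."""
    # Combine: number each chunk so the answer can refer back to its source.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    # Augment: wrap context and query in a simple instruction template.
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What does a retriever do?",
    ["A retriever finds relevant passages.", "A generator writes the final answer."],
)
print(prompt)
```

Numbering the chunks is one simple way to support source citing, since the model can be asked to mention the chunk numbers it relied on.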

![](_images/How_RAG_works-3.gif)

**4. Generation**

In this step, the prompt is fed to the LLM. The LLM processes the prompt and then generates a response based on both the query and the context.

- **Feed:** The prompt, which contains the query, the context, and instructions, is passed to the LLM.
- **Generate:** The LLM processes the prompt and generates a response based on both the query and the provided context.
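
Putting the steps together, the whole flow can be written as one small function. In this sketch the retriever and the LLM are injected as plain callables. `fake_retrieve` and `fake_llm` are hypothetical stand-ins so the example runs without any external service, and in practice they would wrap a vector database query and a model API call.

```python
from typing import Callable


def rag_answer(
    query: str,
    retrieve: Callable[[str], list[str]],
    llm: Callable[[str], str],
) -> str:
    """Retrieve -> augment -> generate, with retriever and LLM passed in as callables."""
    chunks = retrieve(query)
    context = "\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)


def fake_retrieve(query: str) -> list[str]:
    """Hypothetical retriever that always returns the same chunk."""
    return ["RAG combines a retriever with a generator."]


def fake_llm(prompt: str) -> str:
    """Hypothetical LLM that only reports how long the prompt was."""
    return f"(model output for a prompt of {len(prompt)} characters)"


answer = rag_answer("What is RAG?", fake_retrieve, fake_llm)
print(answer)
```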

![](_images/How_RAG_works-4.gif)

### Vector Database

A **vector database** stores unstructured data (text, images, audio, video, etc.) in the form of vector embeddings.

![](_images/vector_db_meaning.png)

- Each data point (whether a **word**, a **document**, an **image**, or any other entity) is transformed into a **numerical vector (embedding)** using ML techniques (which we shall see ahead). The embedding model is trained so that these vectors capture the **essential features and characteristics** of the underlying data.

- Once the embeddings are stored in a **vector database**, we can retrieve the original objects that are similar to the query we wish to run on our unstructured data. Encoding unstructured data this way enables sophisticated operations such as **similarity search**, **classification**, and more.

- Vector databases allow LLMs to look up new information they were not trained on and use it in text generation without retraining the model, which is crucial for real-world applications.

<p align="center">
<img src="https://www.dailydoseofds.com/content/images/size/w1000/2024/02/image-248.png" width="500">
</p>

- The available information is maintained in a **vector database** by encoding it into vectors with an embedding model, so that the retriever can query it to find relevant passages for the generator. When the LLM needs to access the information, the vector database is queried with an approximate similarity search, using the prompt vector to find content that is similar to the input query.

<p align="center">
<img src="https://www.dailydoseofds.com/content/images/size/w1000/2024/02/image-253.png" width="500">
</p>

- With **RAG**, the language model can use the retrieved information (which is expected to be reliable) from the vector database to ensure that its responses are grounded in **real-world knowledge and context**, reducing the likelihood of **hallucinations**.
    - makes the model's responses more accurate, reliable, and contextually relevant
    - ensuring that we don't have to train the LLM repeatedly on new data


**Vector Database for Embedding systems**

![](https://assets.datacamp.com/production/repositories/6385/datasets/2eac066affd938de32dc6b55437560d15ea886ab/vdb_landscape.png)

- **Pinecone**: Cloud-based semantic search.
- **Chroma**: Lightweight local database.
- **Faiss**: Open-source library for nearest-neighbor search.

### RAG Roadmap

Here is the RAG beginner's roadmap. It provides a structured learning path to mastering RAG, from basics to deployment. 🚀

![](_images/RAG_Roadmap.gif)


**1. Python Programming Language** : Python is the main language for RAG development thanks to its rich AI ecosystem. It offers libraries such as LangChain, LlamaIndex, and sentence-transformers for a smooth implementation.

**2. Generative AI Basics** : Understanding how generative AI models work, including text generation, image generation, and multimodal AI, is essential for building RAG applications.

**3. LLM Basics** : Large Language Models (LLMs) are trained on massive amounts of data to produce human-like text. A RAG system uses an LLM to interpret user requests and generate responses based on the retrieved context.

**4. Prompt Techniques** : Prompting is the process of providing input to an LLM to guide its response. Techniques such as few-shot prompting, zero-shot prompting, and chain-of-thought prompting help improve the quality of the generated text.

**5. LLM Frameworks (LangChain or LlamaIndex)** : These frameworks provide built-in functionality for developing RAG applications.

**6. Chunking** : Chunking splits documents into short segments so that the relevant ones can be passed to the LLM. Chunking strategies include fixed-size, recursive, agentic, and semantic chunking, among others.

**7. Data Extraction** : Extracting structured data from unstructured documents (PDF, HTML, plain text, etc.) is an important step in building the knowledge base for RAG.

**8. Embeddings** : Embeddings convert text into high-dimensional numerical vectors that capture semantic meaning. They are used for similarity search, retrieval, and document clustering in RAG systems.

**9. Vector Databases** : Vector databases such as FAISS, ChromaDB, and Weaviate store and retrieve embeddings efficiently. They enable fast semantic search to identify the chunks that are relevant for the LLM.

**10. RAG Basics** : Retrieval-Augmented Generation (RAG) enhances LLMs by retrieving relevant knowledge before generating content. This improves accuracy, reduces hallucinations, and allows real-time updates.

**11. Implement RAG from Scratch** : Building a RAG system from scratch involves designing the retrieval pipeline, chunking, indexing, embedding storage, and query mechanism without relying on prebuilt frameworks.

**12. Implement RAG with LangChain or LlamaIndex** : These frameworks simplify RAG implementation by providing built-in tools for document loading, embedding generation, retrieval, and LLM integration.

**13. Agent Basics** : Agents use reasoning, memory, and tools to interact with external systems and automate complex workflows. LLM-powered agents can autonomously retrieve and process dynamic data.

**14. Agentic RAG** : Agentic RAG combines retrieved knowledge with autonomous agent capabilities. It lets the LLM issue iterative queries, refine its answers, and take actions based on the retrieved information.

**15. Advanced RAG Techniques** : Advanced techniques include hybrid retrieval (semantic + keyword search), query rewriting, re-ranking, and more.

**16. Build RAG Apps** : Building real-world RAG applications involves integrating a user interface (UI), backend logic, and databases. Use Streamlit, FastAPI, or Flask to create interactive RAG systems.

**17. RAG Evaluation & Monitoring** : Evaluating a RAG system requires metrics such as retrieval accuracy, hallucination rate, and response relevance. Monitoring tools such as LangSmith help analyze system performance.

**18. Deploy RAG Apps** : Deploying RAG applications requires hosting the model, the vector database, and the retrieval pipeline on cloud platforms such as AWS, Azure, or Google Cloud to ensure scalability.

### RAG Variants

![](_images/rag_architecture_level.jpg)

1️⃣ **Naive RAG: The Classic Approach**: Naive RAG is the standard implementation with a relatively straightforward process:
- Query comes in from the user
- System retrieves relevant documents from a vector database
- Retrieved documents are combined with the query as context
- LLM generates a response based on both query and context

This works well for many simple applications, like basic Q&A systems or document assistants.

2️⃣ **Retrieve and Rerank RAG**: This one adds a reranking step after the retrieval to improve response quality:
- Initial retrieval returns a larger set of potentially relevant documents
- A reranking model evaluates and scores these documents based on relevance
- Only the highest-scoring documents are sent to the LLM
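
The two-stage idea can be sketched as follows. Both scoring functions are toy assumptions made for illustration: stage 1 uses plain word overlap as a cheap retriever, and `toy_rerank_score` is a hypothetical stand-in for a real reranking model such as a cross-encoder.

```python
def word_overlap(query: str, doc: str) -> int:
    """Stage-1 score: number of distinct words shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def retrieve_candidates(query: str, corpus: list[str], k: int) -> list[str]:
    """Stage 1: cheap, recall-oriented retrieval of a larger candidate set."""
    return sorted(corpus, key=lambda doc: word_overlap(query, doc), reverse=True)[:k]


def toy_rerank_score(query: str, doc: str) -> float:
    """Hypothetical reranker: fraction of the document's words that occur in the query."""
    query_words = set(query.lower().split())
    doc_words = doc.lower().split()
    if not doc_words:
        return 0.0
    return sum(word in query_words for word in doc_words) / len(doc_words)


def rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    """Stage 2: rescore the candidates and keep only the best top_n."""
    return sorted(candidates, key=lambda doc: toy_rerank_score(query, doc), reverse=True)[:top_n]


corpus = [
    "a vector database stores embeddings and many other unrelated things too",
    "vector database stores embeddings",
    "pasta recipe with tomato sauce",
]
query = "vector database embeddings"

candidates = retrieve_candidates(query, corpus, k=2)  # larger candidate set
best = rerank(query, candidates, top_n=1)             # only the top document goes to the LLM
print(best)
```

The first stage cannot tell the two matching documents apart, while the second stage prefers the more focused one. That is the reason for spending a more expensive model on a short candidate list only.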

3️⃣ **Multimodal RAG**: The architecture leverages models that can process and retrieve from text, images, audio, video, and other data types.

4️⃣ **Graph RAG**: Graph RAG uses a graph database to incorporate relationship information between documents:
- Documents/chunks are nodes in a graph
- Relationships between documents are edges
- Can follow relationship paths to find contextually relevant information
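
Following relationship paths can be illustrated with a breadth-first expansion over an adjacency list. The node names and edges below are invented for the example, and a real system would query a graph database instead of a Python dict.

```python
from collections import deque


def expand_via_graph(seed_ids: list[str], edges: dict[str, list[str]], max_hops: int = 1) -> list[str]:
    """Start from retrieved chunk ids and add neighbors reachable within max_hops edges."""
    seen = set(seed_ids)
    order = list(seed_ids)
    frontier = deque((node, 0) for node in seed_ids)
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                frontier.append((neighbor, hops + 1))
    return order


# Hypothetical document graph: nodes are chunks, edges are explicit relationships.
edges = {
    "contract": ["amendment", "party_a"],
    "amendment": ["court_ruling"],
}

one_hop = expand_via_graph(["contract"], edges, max_hops=1)
two_hops = expand_via_graph(["contract"], edges, max_hops=2)
print(one_hop)
print(two_hops)
```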

5️⃣ **Hybrid RAG: Vector DB with Graph DB**: This architecture combines both vector search and a graph database:
- Vector search identifies semantically similar content
- Graph database provides structured relationship data
- Queries can leverage both similarity and explicit relationships
- Results can include information discovered through relationship traversal

6️⃣ **Agentic RAG with Router Agent**: A single agent makes decisions about retrieval:
- Analyzes the query to determine the best knowledge sources
- Makes strategic decisions about how to retrieve information
- Coordinates the retrieval process based on query understanding
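
As a rough illustration of routing, the sketch below picks a knowledge source by keyword overlap. This is a deliberate simplification: in an agentic system the routing decision would normally be made by an LLM, and the source names and keyword sets here are hypothetical.

```python
def route_query(query: str, routes: dict[str, set[str]], default: str) -> str:
    """Pick the knowledge source whose keywords overlap most with the query."""
    words = set(query.lower().split())
    best_source, best_hits = default, 0
    for source, keywords in routes.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_source, best_hits = source, hits
    return best_source


# Hypothetical knowledge sources and the keywords that suggest each one.
routes = {
    "sql_database": {"revenue", "sales", "orders"},
    "product_docs": {"install", "configure", "error"},
}

print(route_query("how do i configure the plugin", routes, default="vector_store"))
print(route_query("total sales and revenue last month", routes, default="vector_store"))
print(route_query("tell me a joke", routes, default="vector_store"))
```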

7️⃣ **Multi-Agent RAG**: This one employs multiple specialized agents:
- Master agent coordinates the overall process
- Specialized retrieval agents focus on different tasks
- Agents can communicate and collaborate to solve complex problems

For example, one agent might retrieve from various sources, another might transform the data, and a third might personalize the results for the user, all coordinated by a master agent that assembles the final response.

## RAG Basics

### RAG Workflow

<img src="https://www.dailydoseofds.com/content/images/size/w1000/2024/11/image-2.png" width="500">

- **Retrieval**: Accessing and retrieving information from a knowledge source, such as a database or memory.
- **Augmented**: Enhancing or enriching something, in this case, the text generation process, with additional information or context.
- **Generation**: The process of creating or producing something, in this context, generating text or language.

**Step 1: Data Collection and Indexing**

<img src="https://www.dailydoseofds.com/content/images/2024/11/image-5.png">

- Documents
- Financial Statements
- Product metadata
- FAQ list

Note: Each user group has access only to specific types of documents.

**Indexing** in RAG involves processing raw documents by first extracting their content (parsing) and then splitting them into smaller, meaningful chunks. These chunks are then converted into vector embeddings using an embedding model and stored in a vector database, enabling efficient retrieval during query-time.

![](_images/How_RAG_works_1.gif)

- **Parse:** Extract raw text from documents (PDFs, web pages, etc.).
- **Chunk:** Split text into smaller, meaningful segments for retrieval.
- **Encode:** Convert chunks into dense vector embeddings using an embedding model.
- **Store:** Save embeddings in a vector database for fast similarity search.

**Step 2: Data Chunking**

<img src="https://www.dailydoseofds.com/content/images/size/w1000/2024/11/image-6.png">

- **Data Chunking**: Breaking down the data into smaller, more manageable pieces or chunks before embedding and storing it in the vector database.
- If we don't chunk, the entire document will have **a single embedding**, which is of little practical use for retrieving relevant context. In practice, only a part of the document is needed as context for a given query.
- Benefits:
    - **Efficient retrieval**: Retrieving only the relevant chunks of data instead of the entire document keeps the focus on specific information.
    - **Scalability**: Handling large volumes of data more efficiently.
    - **Improved accuracy**: Ensuring that the retrieved information is contextually relevant to the query.

**Chunking Strategy**: [5 strategies](https://blog.dailydoseofds.com/p/5-chunking-strategies-for-rag?ref=dailydoseofds.com)

![](_images/chunking-rag.gif)

- **Fixed-size chunking**
    - The document is split into chunks of equal size, for example a fixed number of words or characters.
    - Pros: Simple and easy to implement.
    - Cons: May cut important content into disconnected pieces.

- **Semantic chunking**
    - Based on the semantic similarity between consecutive segments. When the similarity drops sharply, a new chunk begins.
    - Pros: Preserves the logical flow of the content.
    - Cons: Requires computing similarity (e.g., cosine similarity), so it is more complex.

- **Recursive chunking**
    - Split a large document into segments based on a criterion (such as topic or length). If a segment is still too long, keep splitting it recursively.
    - Pros: Better control over chunk size, avoiding overly long chunks.
    - Cons: The splitting process is more complex and requires multiple passes.

- **Document structure-based chunking**
    - Split the document along its existing structure, for example titles, sections, chapters, and parts.
    - Pros: Reuses the logical divisions of the original document.
    - Cons: Depends on how consistent and fine-grained the document's structure is.

- **LLM-based chunking**
    - The data is passed to an LLM, which proposes splits that fit the context.
    - Pros: Leverages the LLM's language understanding.
    - Cons: Can be resource-intensive and depends on the model's performance and quality.
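
To make the semantic strategy concrete, here is a minimal sketch that starts a new chunk whenever the similarity between consecutive sentences falls below a threshold. The Jaccard word overlap used here is an assumption chosen to keep the example dependency-free. A real implementation would compare sentence embeddings, for example with cosine similarity, and the threshold value is arbitrary.

```python
def jaccard(a: str, b: str) -> float:
    """Toy similarity: shared words divided by all distinct words of both sentences."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Group consecutive sentences; open a new chunk when similarity drops below threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if jaccard(previous, current) < threshold:
            groups.append([current])    # similarity dropped: start a new chunk
        else:
            groups[-1].append(current)  # still on topic: extend the current chunk
    return [" ".join(group) for group in groups]


sentences = [
    "gold prices rose this morning",
    "gold prices rose again this afternoon",
    "the football match ended in a draw",
]
chunks = semantic_chunks(sentences)
for chunk in chunks:
    print(repr(chunk))
```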

**Preferred method**

<img src="https://blog.premai.io/content/images/2024/10/document-based-.png" width="500">

- **Document-based chunking**: Split the document along its structure into small segments, each containing one piece of related content. For example: split by headings, table of contents, chapters, sections, etc.
- **Summary by LLM**: Summarize each segment with an LLM, extracting the key information.
- **Embedding**: Convert the summaries into vector embeddings and store them in the vector database, where they are used to search for information matching the user query. The raw data is stored as well and is used to fetch the information needed to generate the final answer.
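
The "search on summaries, answer from raw text" idea can be sketched like this. Everything model-related is a stand-in: `summarize` is a hypothetical placeholder for an LLM summary (it keeps the first sentence), and matching is done by word overlap instead of embedding similarity.

```python
def summarize(section: str) -> str:
    """Hypothetical stand-in for an LLM summary: keep only the first sentence."""
    return section.split(".")[0].strip()


def build_summary_index(sections: list[str]) -> list[dict]:
    """Store each section's summary (used for search) next to its raw text (used for answers)."""
    return [{"summary": summarize(section), "raw": section} for section in sections]


def search_raw(query: str, index: list[dict]) -> str:
    """Match the query against the summaries, but return the raw section."""
    query_words = set(query.lower().split())
    best = max(index, key=lambda entry: len(query_words & set(entry["summary"].lower().split())))
    return best["raw"]


sections = [
    "Refund policy overview. Customers may return items within 30 days.",
    "Shipping times overview. Orders ship within two business days.",
]
index = build_summary_index(sections)
print(search_raw("what is the refund policy", index))
```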

**Step 3: Generate embeddings**

<img src="https://www.dailydoseofds.com/content/images/2024/11/image-9.png" width="500">

- **Document Embedding**: Transforming the chunks of data into numerical vectors using an embedding model.
> Since these are **context embedding models** (not **word embedding models**), models like [**bi-encoders**](https://www.dailydoseofds.com/bi-encoders-and-cross-encoders-for-sentence-pair-similarity-scoring-part-1/) are highly relevant here.
- **Function**: Matches user queries with relevant chunks of data in the vector database.
- **Outcome**: Ensures that the retrieved information is contextually relevant to the query.

**Step 4: Store Embedding**

These embeddings are then stored in the vector database:

<img src="https://www.dailydoseofds.com/content/images/size/w1000/2024/11/image-10.png" width="500">

- The vector database acts as the memory of the RAG system. It holds the vector embeddings of the processed chunks, which are used to retrieve information for a user query.
- The vector database stores the **original information (raw data, possibly including LLM-generated summaries)** together with **metadata**, **indexes**, and **vector embeddings**.

**Step 5: Handle User Queries**

**Process**:
1. Embed the user query using the same embedding model used for the data chunks.

<img src="https://www.dailydoseofds.com/content/images/size/w1000/2024/11/image-12.png" width="500">

2. Retrieve the relevant chunks of data from the vector database using the query embedding and a **similarity measure**.

<img src="https://www.dailydoseofds.com/content/images/size/w1000/2024/11/image-14.png" width="500">

- The vectorized query is then compared against our existing vectors in the database to find the most similar information.
- The vector database returns **the k** (a pre-defined parameter) **most similar documents/chunks** (using approximate nearest neighbor search).

3. Re-rank the retrieved chunks based on their relevance to the query.

<img src="https://www.dailydoseofds.com/content/images/size/w1000/2024/11/image-15.png" width="500">

- The selected chunks might need further refinement to ensure the most relevant information is **prioritized**. A re-ranking model evaluates the initial list of retrieved chunks alongside the query and assigns a **relevance score to each chunk**.

**Step 6: Generating Response with LLMs**

<img src="https://www.dailydoseofds.com/content/images/size/w1000/2024/11/image-16.png" width="500">

- The LLM receives **the user's original query and the retrieved chunks, combined in a prompt template**, and generates a response that synthesizes information from the selected documents.
- Feed the retrieved chunks and the user query to the LLM to generate the final response.

### Data Processing

#### Data Ingestion

**Data Ingestion**: The process of collecting and importing data from various sources into a system for further processing and analysis. This can include data from databases, APIs, files, or other sources. The goal is to make the data available for analysis, storage, or other purposes.
- `TextLoader`: Handle text files, CSV, JSON, and other formats.
- `URLLoader`: Scrape data from websites.
- `PDFLoader`: Load data from PDF files.
- `DocxLoader`: Load data from Word documents.
- `UnstructuredFileLoader`: Load data from various unstructured file formats.
- `Google Drive Loader`: Load data (individual Docs or entire folders) from Google Drive.

- **PDF Loader**

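```shell
pip install -q pypdf
```

```python
# Note: in recent LangChain releases this loader lives in `langchain_community.document_loaders`.
from langchain.document_loaders import PyPDFLoader

# Load the PDF and split it with LangChain's default text splitter.
loader = PyPDFLoader(
    r"contents\theory\aiml_algorithms\dl_nlp\llm\data\The One Page Linux Manual.pdf"
)
pages = loader.load_and_split()
```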


```python
pages
```

Output (truncated):

```text
[Document(metadata={'producer': 'Acrobat PDFWriter 3.0 for Windows', 'creator': 'Microsoft Word', 'creationdate': 'D:00000101000000Z', 'title': 'The One Page Linux Manual', 'author': 'downloaded from The Quick Reference Site (http://www.digilife.be/quickreferences)', 'keywords': 'Linux Unix Redhat Caldera', 'subject': '(c) 2003 The Quick Reference Site (Tim Sinaeve)', 'moddate': '2003-02-16T13:58:05+01:00', 'source': 'contents\\theory\\aiml_algorithms\\dl_nlp\\llm\\data\\The One Page Linux Manual.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}, page_content='THE ONE    PAGE LINUX MANUALA summary of useful Linux commands\nVersion 3.0 May 1999 squadron@powerup.com.au\nStarting & Stopping\nshutdown -h now Shutdown the system now and do not\nreboot\nhalt Stop all processes - same as above\nshutdown -r 5 Shutdown the system in 5 minutes and\nreboot\nshutdown -r now Shutdown the system now and reboot\nreboot Stop all processes and then reboot - same\nas above\nstartx Start the X system
```

```python
from langchain.text_splitter import CharacterTextSplitter

# Target chunks of about 700 characters, overlapping by up to 100 characters.
text_splitter = CharacterTextSplitter(chunk_size=700, chunk_overlap=100)
texts = text_splitter.split_documents(pages)
print("Preview:")
print(texts[0].page_content)
```


Output (truncated):

```text
Preview:
THE ONE    PAGE LINUX MANUALA summary of useful Linux commands
Version 3.0 May 1999 squadron@powerup.com.au
Starting & Stopping
shutdown -h now Shutdown the system now and do not
reboot
halt Stop all processes - same as above
shutdown -r 5 Shutdown the system in 5 minutes and
reboot
shutdown -r now Shutdown the system now and reboot
reboot Stop all processes and then reboot - same
as above
startx Start the X system
Accessing & mounting file systems
mount -t iso9660 /dev/cdrom
/mnt/cdrom
Mount the device cdrom
and call it cdrom under the
/mnt directory
mount -t msdos /dev/hdd
/mnt/ddrive
Mount hard disk “d” as a
msdos file system and call
it ddrive under the /mnt
directory
mount -t vfat /dev/hda1
/mnt/cdrive
Mount hard disk “a” as a
VFAT file system and call it
cdrive under the /mnt
directory
umount /mnt/cdrom Unmount the cdrom
Finding files and text within files
find / -name  fname Starting with the root directory, look
for the file called fname
find / -name ”*fname*
```

#### Text Splitters - Chunking

**Text Splitters**: The process of breaking down large text documents into smaller, more manageable chunks or segments. This is important for various applications, such as natural language processing, machine learning, and information retrieval. By splitting text into smaller pieces, it becomes easier to analyze, process, and retrieve relevant information.

<img src="https://www.dailydoseofds.com/content/images/size/w1000/2024/11/image-6.png">

- The chunking method used in RAG depends on the type of text:
    - **Fixed-size chunking** for **large documents** such as PDF, Word, etc.
    - **Recursive chunking** suits **ordinary text**.
    - **Document structure-based chunking**, driven by headings, for **structured documents such as Markdown, HTML, etc.**
    - **Semantic chunking** for text that **needs semantic coherence**.
    - **LLM-based chunking**

- The overlap is typically **100-1000 tokens** or **10-20%**, chosen to optimize RAG performance.

**Process**:
- **Breaking** text into smaller, semantically meaningful units, such as sentences, paragraphs, or sections. (These units must preserve the meaning and context of the original document.)
- **Aggregating** these units into larger segments until they reach a certain size. (Choose a size that suits the use case.)
- **Isolating** each completed segment as a distinct piece once the target size is reached.
- **Repeating** the process with some overlap between segments to maintain contextual continuity. (Consecutive chunks **should overlap slightly, sharing a little of the text before and after**.)
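
The four steps above can be sketched as a small sentence-based chunker. The regular expression used for sentence splitting, the character budget, and the one-sentence overlap are illustrative assumptions, not what any particular library does.

```python
import re


def sentence_chunks(text: str, max_chars: int = 120, overlap_sentences: int = 1) -> list[str]:
    """Aggregate sentences into chunks of roughly max_chars, overlapping by whole sentences."""
    # Breaking: split after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            # Isolating: the segment is full, so emit it as a chunk.
            chunks.append(" ".join(current))
            # Repeating with overlap: carry the last sentence(s) into the next segment.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        # Aggregating: keep adding sentences to the current segment.
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Alpha one. Beta two. Gamma three. Delta four."
chunks = sentence_chunks(text, max_chars=22, overlap_sentences=1)
for chunk in chunks:
    print(repr(chunk))
```

Because the overlap is carried over in whole sentences, a chunk can slightly exceed `max_chars` when the carried sentence and the next one are both long. A stricter version would need to trim the overlap in that case.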

```python
# Sample Vietnamese news text (gold prices) used by the splitter examples in this section.
document = """
Mở cửa phiên giao dịch sáng nay, lúc 8h15', nhà vàng Mi Hồng đã điều chỉnh giá vàng nhẫn lên mức 96,2 - 97,8 triệu đồng/lượng, tăng 200.000 đồng mỗi lượng so với đầu giờ sáng qua.

Lúc 13h25', giá vàng nhẫn tại Bảo Tín Minh Châu đã tăng lên mức 97,4 - 99,5 triệu đồng. So với đầu giờ sáng, giá vàng nhẫn tăng thêm 600.000 đồng/lượng ở chiều mua vào và 400.000 đồng/lượng ở chiều bán ra.

Giá vàng miếng cũng tăng thêm 500.000 đồng/lượng ở chiều mua vào và 300.000 đồng/lượng ở chiều bán ra, hiện giữ ở mức 97,3 - 98,7 triệu đồng/lượng.

Công ty SJC cũng nâng giá vàng nhẫn lên mức 97 - 98,5 triệu đồng/lượng, tăng 400.000 đồng/lượng ở cả hai chiều mua vào và bán ra.

Cùng mức tăng, giá vàng nhẫn tại DOJI cũng nâng lên mức mức 97,1 - 98,4 triệu đồng/lượng. Giá vàng miếng tại Công ty SJC và DOJI đồng loạt niêm yết ở mức 97,2 - 98,7 triệu đồng/lượng.
"""
```


##### Fixed-size Chunking

- **Token Text Splitter**: T√°ch t√†i li·ªáu th√†nh c√°c ƒëo·∫°n nh·ªè h∆°n d·ª±a tr√™n s·ªë l∆∞·ª£ng token (t·ª´) trong t√†i li·ªáu. (T∆∞∆°ng t·ª± nh∆∞ t√°ch theo k√≠ch th∆∞·ªõc k√Ω t·ª±, nh∆∞ng s·ª≠ d·ª•ng s·ªë l∆∞·ª£ng token thay v√¨ k√Ω t·ª±). Convert text th√†nh BPE token, chia th√†nh nh·ªØng smaller chunks, sau ƒë√≥ recontruct l·∫°i th√†nh vƒÉn b·∫£n.



```python
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(
    chunk_size=100,  # Maximum size of each chunk, in tokens
    chunk_overlap=20,  # Number of tokens shared between consecutive chunks
)
chunks = text_splitter.split_text(document)
chunks
```


["\nM·ªü c·ª≠a phi√™n giao d·ªãch s√°ng nay, l√∫c 8h15', nh√† v√†ng Mi H·ªìng ƒë√£ ƒëi·ªÅu ch·ªânh gi√° v√†ng nh·∫´n l√™n m·ª©c 96,2 - 97,8 tri·ªáu ƒë·ªìng",
 "ÔøΩc 96,2 - 97,8 tri·ªáu ƒë·ªìng/l∆∞·ª£ng, tƒÉng 200.000 ƒë·ªìng m·ªói l∆∞·ª£ng so v·ªõi ƒë·∫ßu gi·ªù s√°ng qua.\n\nL√∫c 13h25', gi√° v√†ng nh·∫´n tÔøΩ",
 "√∫c 13h25', gi√° v√†ng nh·∫´n t·∫°i B·∫£o T√≠n Minh Ch√¢u ƒë√£ tƒÉng l√™n m·ª©c 97,4 - 99,5 tri·ªáu ƒë·ªìng. So v·ªõi ƒë·∫ßu gi·ªù s√°ng, gi√° v√†ng nhÔøΩ",
 'u gi·ªù s√°ng, gi√° v√†ng nh·∫´n tƒÉng th√™m 600.000 ƒë·ªìng/l∆∞·ª£ng ·ªü chi·ªÅu mua v√†o v√† 400.000 ƒë·ªìng/l∆∞·ª£ng ·ªü chi·ªÅu b√°n ra.\n\nGi√° v√†ng miÔøΩ',
 'ÔøΩ chi·ªÅu b√°n ra.\n\nGi√° v√†ng mi·∫øng c≈©ng tƒÉng th√™m 500.000 ƒë·ªìng/l∆∞·ª£ng ·ªü chi·ªÅu mua v√†o v√† 300.000 ƒë·ªìng/l∆∞·ª£ng ·ªü chi·ªÅu b√°n ra, hi·ªán g',
 'ÔøΩng ·ªü chi·ªÅu b√°n ra, hi·ªán gi·ªØ ·ªü m·ª©c 97,3 - 98,7 tri·ªáu ƒë·ªìng/l∆∞·ª£ng.\n\nC√¥ng ty SJC c≈©ng n√¢ng gi√° v√†ng nh·∫´n l√™n m·ª©c 97 - 98,5 tri',
 ' nh·∫´n l√™n m·ª©c 97 - 98,


##### Recursive Chunking

- **Recursive Character Text Splitter**: Splits on a set of separator characters (e.g., periods, commas, spaces) to divide the document into smaller pieces. It then checks whether any piece is still too long and, if so, keeps splitting it into smaller pieces until the desired size is reached (starting with large pieces and shrinking them step by step until they fit).
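
The recursion itself can be shown in a few lines of plain Python. This is a simplified sketch of the idea, not LangChain's implementation: the separator list is an assumption, separators are dropped from the output, and small pieces are not merged back together or overlapped.

```python
def recursive_split(text: str, max_chars: int, separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones on pieces still too long."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut every max_chars characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    head, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(head):
        pieces.extend(recursive_split(part, max_chars, rest))
    return pieces


text = "Para one is short.\n\nPara two has a first sentence. And a second sentence."
pieces = recursive_split(text, max_chars=30)
for piece in pieces:
    print(len(piece), repr(piece))
```

The first paragraph already fits and is kept whole, while the second is too long for the paragraph-level separator and is split again at the sentence level.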



In [23]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    length_function=len,
)
chunks = text_splitter.split_text(document)
chunks


["Mở cửa phiên giao dịch sáng nay, lúc 8h15', nhà vàng Mi Hồng đã điều chỉnh giá vàng nhẫn lên mức 96,2 - 97,8 triệu đồng/lượng, tăng 200.000 đồng mỗi lượng so với đầu giờ sáng qua.",
 "Lúc 13h25', giá vàng nhẫn tại Bảo Tín Minh Châu đã tăng lên mức 97,4 - 99,5 triệu đồng. So với đầu giờ sáng, giá vàng nhẫn tăng thêm 600.000 đồng/lượng ở chiều mua vào và 400.000 đồng/lượng ở chiều bán ra.",
 'Giá vàng miếng cũng tăng thêm 500.000 đồng/lượng ở chiều mua vào và 300.000 đồng/lượng ở chiều bán ra, hiện giữ ở mức 97,3 - 98,7 triệu đồng/lượng.\n\nCông ty SJC cũng nâng giá vàng nhẫn lên mức 97 - 98,5 triệu đồng/lượng, tăng 400.000 đồng/lượng ở cả hai chiều mua vào và bán ra.',
 'Cùng mức tăng, giá vàng nhẫn tại DOJI cũng nâng lên mức mức 97,1 - 98,4 triệu đồng/lượng. Giá vàng mi
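
To make the recursion concrete, here is a simplified sketch of the idea. It is not LangChain's implementation: it ignores `chunk_overlap`, and `recursive_split` is a hypothetical name used only for illustration.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split `text` so that every piece has at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts into one chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # Still too long: retry this part with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer than the limit."
pieces = recursive_split(sample, chunk_size=30)
print(pieces)
```

The first paragraph fits and is kept whole; the second exceeds 30 characters, so it falls through to the space separator and is split between words.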

- **Using spaCy to split by sentence**

In [None]:
from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter(
    chunk_size=300,  # Maximum size of each chunk, in characters
    chunk_overlap=50,  # Number of characters shared by consecutive chunks
)
texts = text_splitter.split_text(document)
texts


["Mở cửa phiên giao dịch sáng nay, lúc 8h15', nhà vàng Mi Hồng đã điều chỉnh giá vàng nhẫn lên mức 96,2 - 97,8 triệu đồng/lượng, tăng 200.000 đồng mỗi lượng so với đầu giờ sáng qua.\n\n\n\nLúc 13h25', giá vàng nhẫn tại Bảo Tín Minh Châu đã tăng lên mức 97,4 - 99,5 triệu đồng.",
 'So với đầu giờ sáng, giá vàng nhẫn tăng thêm 600.000 đồng/lượng ở chiều mua vào và 400.000 đồng/lượng ở chiều bán ra.\n\n\n\nGiá vàng miếng cũng tăng thêm 500.000 đồng/lượng ở chiều mua vào và 300.000 đồng/lượng ở chiều bán ra, hiện giữ ở mức 97,3 - 98,7 triệu đồng/lượng.',
 'Công ty SJC cũng nâng giá vàng nhẫn lên mức 97 - 98,5 triệu đồng/lượng, tăng 400.000 đồng/lượng ở cả hai chiều mua vào và bán ra.\n\n\n\nCùng mức tăng, giá vàng nhẫn tại DOJI cũng nâng lên mức mức 97,1 - 98,4 triệu đồng/lượng.


##### Structured Chunking

- **Markdown splitter**: Splits a Markdown document along its own structure, using **headers**, **code blocks**, or **dividers** as boundaries, so each chunk corresponds to a section of the document.


In [None]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_text = "# Heading 1\n\nSome text\n\n## Heading 2\n\nMore text"
headers_to_split_on = [
    ("#", "Heading 1"),  # (header marker, metadata key)
    ("##", "Heading 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_chunks = markdown_splitter.split_text(markdown_text)
md_chunks


[Document(metadata={'Heading 1': 'Heading 1'}, page_content='Some text'),
 Document(metadata={'Heading 1': 'Heading 1', 'Heading 2': 'Heading 2'}, page_content='More text')]

- **Example for source code:**

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

python_code = "def ham1():\n    pass\n\ndef ham2():\n    pass"
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=30, chunk_overlap=0
)
code_chunks = python_splitter.split_text(python_code)
code_chunks


['def ham1():\n    pass', 'def ham2():\n    pass']

##### Semantic Chunking

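- **Semantic splitting**: Embeds the sentences and starts a new chunk wherever the embedding distance between neighbouring sentences exceeds a threshold.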

In [32]:
!pip install --quiet langchain_experimental langchain_openai

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
from dotenv import find_dotenv, load_dotenv
from langchain.embeddings import HuggingFaceEmbeddings

# load_dotenv(r"contents\theory\aiml_algorithms\dl_nlp\llm\.env")


# Use any open-source embedding model from Hugging Face
# 'sentence-transformers/all-MiniLM-L6-v2' is a lightweight, popular choice
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

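semantic_splitter = SemanticChunker(
    embeddings,
    buffer_size=1,  # Neighbouring sentences grouped with each sentence before embedding
    breakpoint_threshold_type="percentile",  # Split where the distance exceeds a percentile
    breakpoint_threshold_amount=70,  # Use the 70th percentile of the distances
)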
chunks = semantic_splitter.split_text(document)
chunks



["\nMở cửa phiên giao dịch sáng nay, lúc 8h15', nhà vàng Mi Hồng đã điều chỉnh giá vàng nhẫn lên mức 96,2 - 97,8 triệu đồng/lượng, tăng 200.000 đồng mỗi lượng so với đầu giờ sáng qua. Lúc 13h25', giá vàng nhẫn tại Bảo Tín Minh Châu đã tăng lên mức 97,4 - 99,5 triệu đồng. So với đầu giờ sáng, giá vàng nhẫn tăng thêm 600.000 đồng/lượng ở chiều mua vào và 400.000 đồng/lượng ở chiều bán ra. Giá vàng miếng cũng tăng thêm 500.000 đồng/lượng ở chiều mua vào và 300.000 đồng/lượng ở chiều bán ra, hiện giữ ở mức 97,3 - 98,7 triệu đồng/lượng. Công ty SJC cũng nâng giá vàng nhẫn lên mức 97 - 98,5 triệu đồng/lượng, tăng 400.000 đồng/lượng ở cả hai chiều mua vào và bán ra.",
 'Cùng mức tăng, giá vàng nhẫn tại DOJI cũng nâng lên mức mức 97,1 - 98,4 triệu đồng/lượng. Giá vàng miếng t
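
The percentile rule can be illustrated with plain Python. This is a minimal sketch of the idea rather than the `SemanticChunker` implementation: the sentences and their 2-dimensional vectors are made up, and `semantic_chunks` is a hypothetical helper.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(sentences, vectors, threshold_percentile=70):
    # Distance between each sentence and the one that follows it.
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    if not distances:
        return [" ".join(sentences)] if sentences else []
    threshold = percentile(distances, threshold_percentile)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # large semantic jump: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Made-up sentences and toy embeddings (assumptions for illustration only).
sentences = ["Gold prices rose today.", "Gold rings gained too.",
             "The match ended 2-1.", "The home team scored late."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
topic_chunks = semantic_chunks(sentences, vectors)
print(topic_chunks)
```

Only the jump between the second and third sentence exceeds the 70th-percentile distance, so the text is split into one chunk per topic.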

##### LLM-based Chunking

**LLM-based chunking**: Uses a language model to decide how to split a document into smaller chunks, based on the meaning and context of the text.

**1. LLMChunkizer**

This method first splits the document into blocks by token count (for example, 200 tokens per block), then uses an LLM (such as GPT-4o) to split each block into smaller chunks that each contain one complete idea, with overlap between blocks to preserve context.

- **Overlap handling**: To preserve context, the last two chunks of the previous block are prepended to the next block before it is analyzed, so related ideas stay together.
- **Pros**: Preserves context through overlap; block size is easy to adjust.
- **Cons**: Depends on the prompt; can be more expensive.
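
The overlap handling can be sketched as follows. The LLM call is replaced by a hypothetical `split_into_ideas` callable (here a naive sentence splitter), so this only shows the control flow under that assumption, not a real LLMChunkizer implementation.

```python
def llm_chunk(blocks, split_into_ideas, carry=2):
    """Chunk each block, carrying the last `carry` chunks into the next block."""
    finished, pending = [], []
    for block in blocks:
        # Prepend the tail of the previous block so related ideas stay together.
        ideas = split_into_ideas(" ".join(pending + [block]))
        cut = max(len(ideas) - carry, 0)
        finished.extend(ideas[:cut])  # these chunks are final
        pending = ideas[cut:]         # these are re-analyzed with the next block
    return finished + pending


def fake_llm(text):
    """Stand-in for a model call: one 'idea' per sentence (illustrative only)."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]


idea_chunks = llm_chunk(["A. B. C.", "D. E."], fake_llm)
print(idea_chunks)
```

In a real pipeline `fake_llm` would be a prompt to a chat model that returns the chunk boundaries, which is where the prompt dependence and extra cost come from.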

**2. Proposition-based chunking**

This method uses an LLM to transform the document into a list of self-contained propositions, each expressing one independent piece of meaning, with references resolved along the way. It uses the Propositionizer model (Flan-T5-large). The approach improves performance on question-answering (QA) tasks by supplying relevant information at high density, and is reported to outperform passage- or sentence-based chunking.
- **Pros**: Produces self-contained chunks, well suited to QA, with high information density.
- **Cons**: Requires a fine-tuned model; more complex to deploy.

### Data Retrieval

**Data Retrieval**: The process of accessing and extracting relevant information from a database or knowledge source.

#### Vector Embedding

**Vector embeddings** represent data (typically text, images, or audio) as numeric vectors in a high-dimensional space. These vectors let a computer compare data by meaning rather than by surface features such as character strings.
- **Semantic search**: Embeddings make it possible to search by the meaning of the content. In RAG, the retriever finds passages that are semantically related to the query and passes them to the model.
- **Better accuracy**: Because candidate passages are ranked by semantic similarity, a RAG system can return more accurate results when selecting relevant passages.
- **Lower computational cost**: Creating embeddings and comparing vectors can be more computationally efficient than many conventional string-matching methods, especially on large datasets.
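
As a small illustration of ranking by semantic similarity, the sketch below scores passages against a query with cosine similarity. The 3-dimensional vectors are invented for the example; a real embedding model would produce them and would use far more dimensions.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Toy "embeddings" (assumed values, not produced by any model).
passages = {
    "gold price rises": [0.9, 0.1, 0.0],
    "jewellery shop raises rates": [0.8, 0.3, 0.1],
    "football match result": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # pretend embedding of "how much is gold today?"

ranked = sorted(passages,
                key=lambda p: cosine_similarity(query_vector, passages[p]),
                reverse=True)
print(ranked)
```

The second passage shares no words with "gold price rises", yet it still ranks above the unrelated one because its vector points in a similar direction.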

**1. Choose a suitable embedding model**

- Use BERT-based models, Sentence Transformers (for example, Sentence-BERT), or LLMs that can produce high-quality embeddings.
- Weigh small models (faster) against large models (more accurate) according to your response-time and accuracy requirements.

**2. Normalize and clean the data**

- Remove noisy, duplicated, or irrelevant content before creating embeddings.
- Apply tokenization, lowercasing, and stopword removal (where appropriate) to keep embedding quality high.

**3. Decide on the embedding size**

- Choose the embedding dimension to balance representation quality against storage cost and retrieval speed.
- Embeddings that are too large cost resources and time, while embeddings that are too small can lose much of the semantic information.
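
The storage side of this trade-off is simple arithmetic. The sketch below assumes vectors are stored as 32-bit floats and ignores any index overhead; `index_size_gib` is a hypothetical helper.

```python
def index_size_gib(num_vectors, dim, bytes_per_value=4):
    """Raw size in GiB of `num_vectors` float32 vectors, without index overhead."""
    return num_vectors * dim * bytes_per_value / 1024**3


# 384 is the output size of all-MiniLM-L6-v2; 768 and 1536 are other common sizes.
for dim in (384, 768, 1536):
    print(dim, round(index_size_gib(1_000_000, dim), 2))
```

For one million chunks, going from 384 to 1536 dimensions quadruples the raw storage, and every distance computation also touches four times as many values.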

**4. Use a vector index**

- For fast, efficient semantic search, use a vector indexing tool such as **Faiss**, **Annoy**, or **Milvus**.
- Use **Approximate Nearest Neighbor (ANN)** search to reduce complexity and speed up queries.
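
To show why an approximate index is faster, here is a toy random-hyperplane hashing sketch: vectors on the same side of every random plane share a bucket, and a query is compared only with its own bucket. This is an illustration of the general idea under simplifying assumptions, not how Faiss, Annoy, or Milvus are implemented; all function names are hypothetical.

```python
import random


def make_hasher(dim, num_planes, seed=0):
    """Return a function mapping a vector to a tuple of plane-side bits."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]

    def signature(vec):
        return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in planes)

    return signature


def build_index(vectors, signature):
    buckets = {}
    for idx, vec in enumerate(vectors):
        buckets.setdefault(signature(vec), []).append(idx)
    return buckets


def ann_candidates(query, buckets, signature):
    """Only vectors in the query's bucket are candidates; others are never scanned."""
    return buckets.get(signature(query), [])


vectors = [[1.0, 0.1], [0.9, 0.2], [-1.0, 0.3], [-0.8, -0.5]]
signature = make_hasher(dim=2, num_planes=3)
buckets = build_index(vectors, signature)
candidates = ann_candidates([1.0, 0.1], buckets, signature)
print(candidates)  # indices to rank exactly in a second step
```

The approximation comes from the bucketing: a true neighbour that falls just across a plane lands in another bucket and is missed, which is the accuracy given up in exchange for scanning fewer vectors.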

**5. Manage and update embeddings**

- Update embeddings regularly as the data changes, especially in applications that need fresh information (news, social media, and so on).
- Design a staged process for regenerating embeddings and re-indexing efficiently, so that updates do not conflict with searches in progress.
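
One common way to keep re-indexing cheap is to re-embed only documents whose content changed. The sketch below detects changes with content hashes; `plan_reindex` and the document ids are hypothetical, and the actual embedding and index update steps are left out.

```python
import hashlib


def plan_reindex(documents, stored_hashes):
    """Return (ids to embed again, ids to delete, new hash table)."""
    new_hashes = {doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
                  for doc_id, text in documents.items()}
    to_embed = [d for d, h in new_hashes.items() if stored_hashes.get(d) != h]
    to_delete = [d for d in stored_hashes if d not in new_hashes]
    return to_embed, to_delete, new_hashes


old_docs = {"a": "gold rose", "b": "silver fell", "c": "oil flat"}
_, _, stored = plan_reindex(old_docs, {})

new_docs = {"a": "gold rose", "b": "silver rose", "d": "copper flat"}
to_embed, to_delete, stored = plan_reindex(new_docs, stored)
print(to_embed, to_delete)
```

Document `a` is untouched and skipped, `b` and `d` are embedded again, and `c` is removed from the index.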

**6. Measure quality**

- Use metrics such as `Recall@K`, `MRR`, and `nDCG` to evaluate the relevance of search results.
- Experiment with several embedding models and preprocessing strategies to find the best combination.
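
These metrics are short enough to write out directly. The sketch below uses their standard definitions on a single made-up query; MRR is the mean of `reciprocal_rank` over all evaluation queries.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevance, k):
    """`relevance` maps document id to a graded relevance (missing ids count as 0)."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [relevance.get(d, 0) for d in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0


# Made-up retrieval result for one query.
ranked_ids = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked_ids, relevant, k=3))
print(reciprocal_rank(ranked_ids, relevant))
print(round(ndcg_at_k(ranked_ids, {"d1": 1, "d2": 1}, k=3), 3))
```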

#### Similarity Search

**Similarity Search**: Computing how similar two vectors (query and document) are in order to determine how closely they are related. Similarity is measured with a metric such as cosine similarity, dot product, or Euclidean distance, and the closest vectors are found with a k-nearest-neighbor (KNN) or approximate-nearest-neighbor (ANN) search algorithm.
- **KNN**: K-nearest neighbors, an exact algorithm that compares the query with every stored item and returns the k closest ones under the chosen metric.
- **ANN**: Approximate nearest neighbor search, which trades a small amount of accuracy for much faster lookups in high-dimensional spaces and is the usual choice in vector databases.

| **Criterion** | **KNN (Exact)** | **ANN (Approximate)** |
|---------------|-----------------|-----------------------|
| **Method** | Exact nearest-neighbor search | Approximate nearest-neighbor search |
| **Computational complexity** | - Typically $O(N)$ or $O(N \log N)$ per query<br/>- Depends strongly on dataset size | - Can be sublinear in $N$ (for example, around $O(\log N)$) thanks to specialized index structures<br/>- Accepts a small error in exchange for lower search complexity |
| **Accuracy** | - Exact results | - Approximate results; some error is possible but it is usually very small |
| **Speed** | - Slower as the dataset grows<br/>- Each query takes a long time | - Very fast queries<br/>- Suited to large datasets and low-latency requirements |
| **Memory** | - Simple KNN (no index) is easy to deploy but costly at query time<br/>- With an advanced index structure, memory use can grow | - Many different ANN solutions, most of which require building an index<br/>- Usually consumes extra memory to keep queries fast |
| **Pros** | - High accuracy<br/>- Easy to understand and implement | - Much faster when the dataset is very large<br/>- Suited to real-world problems with short response-time requirements |
| **Cons** | - Long query times without a good index<br/>- Does not scale well as the data grows | - Results are not perfectly exact (even if the error is small)<br/>- Complex index structures that need considerable tuning |
| **Common uses** | - Search over small or medium datasets<br/>- Cases that require exact results<br/>- Classification and traditional supervised KNN problems | - Recommendation systems, large-scale text and image search<br/>- RAG (Retrieval-Augmented Generation) over very large corpora |
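
For comparison with the approximate approach, exact KNN is a linear scan: one distance computation per stored vector. A minimal sketch using Euclidean distance (the vectors are made up):

```python
import heapq
import math


def knn(query, vectors, k):
    """Exact k-nearest neighbours: returns (distance, index) pairs, closest first."""
    scored = ((math.dist(query, vec), idx) for idx, vec in enumerate(vectors))
    return heapq.nsmallest(k, scored)


vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
nearest = knn([1.0, 1.0], vectors, k=2)
print([idx for _, idx in nearest])
```

The cost grows linearly with the number of stored vectors, which is why this approach is listed for small or medium datasets in the table.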


**Use-cases of vector similarity search:**
- Deduplication
- Recommendation systems
- Anomaly detection
- Reverse image search
- Search engines

### Data Generation



## Agentic RAG

[Reference link](https://blog.dailydoseofds.com/p/rag-vs-agentic-rag)

## RAG Evaluation

## RAG Advanced

## RAG Toolkits

RAG All-in-one is a guide to building Retrieval-Augmented Generation (RAG) applications. It offers a collection of tools, libraries, and frameworks for RAG systems, with explanations of key components and recommendations for effective implementation.

![](_images/rag_diagram.png)



### Data Processing

Tools and libraries for ingesting various document formats, extracting text, and preparing data for further processing.

**📌Document Ingestor**

Document ingestion collects data from sources such as databases, APIs, and files and imports it into a system for further processing and analysis. In a RAG pipeline, ingestion tools extract content from web pages, PDFs, Word documents, images, PowerPoint presentations, and similar formats. Once extracted, the data is chunked, encoded, and stored as embeddings in a vector database.

| Library | Description | Link | GitHub Stars 🌟 |
|---------|-------------|------|-------------|
| LangChain Document Loaders | Comprehensive set of document loaders for various file types | [GitHub](https://github.com/langchain-ai/langchain) | ![GitHub stars](https://img.shields.io/github/stars/langchain-ai/langchain) |
| LlamaIndex Parser | Flexible document parsing and chunking capabilities for various file formats | [GitHub](https://github.com/jerryjliu/llama_index) | ![GitHub stars](https://img.shields.io/github/stars/jerryjliu/llama_index) |
| Docling | Document processing tool that parses diverse formats with advanced PDF understanding and AI integrations | [GitHub](https://github.com/docling-project/docling) | ![GitHub stars](https://img.shields.io/github/stars/docling-project/docling) |
| Unstructured | Library for pre-processing and extracting content from raw documents | [GitHub](https://github.com/Unstructured-IO/unstructured) | ![GitHub stars](https://img.shields.io/github/stars/Unstructured-IO/unstructured) |
| PyPDF | Library for reading and manipulating PDF files | [GitHub](https://github.com/py-pdf/pypdf) | ![GitHub stars](https://img.shields.io/github/stars/py-pdf/pypdf) |
| PyMuPDF | A Python binding for MuPDF, offering fast PDF processing capabilities | [GitHub](https://github.com/pymupdf/PyMuPDF) | ![GitHub stars](https://img.shields.io/github/stars/pymupdf/PyMuPDF) |
| MegaParse | Versatile parser for text, PDFs, PowerPoint, and Word documents with lossless information extraction | [GitHub](https://github.com/QuivrHQ/MegaParse) | ![GitHub stars](https://img.shields.io/github/stars/QuivrHQ/MegaParse) |
| Adobe PDF Extract | A service provided by Adobe for extracting content from PDF documents | [Link](https://developer.adobe.com/document-services/docs/overview/legacy-documentation/pdf-extract-api/quickstarts/python/) |  |
| Azure AI Document Intelligence | A service provided by Azure for extracting content, including text, tables, and images, from PDF documents | [Link](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/) |  |

**📌Data Extraction - Web Scraping**

| Library | Description | Link |
|---------|-------------|------|
| Crawl4AI (Web Scraping) | Open-source LLM-friendly web crawler and scraper | [Link](https://github.com/unclecode/crawl4ai) |
| ScrapeGraphAI (Web & Document) | A web scraping Python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). | [Link](https://github.com/ScrapeGraphAI/Scrapegraph-ai) |
| Crawlee (Web Scraping) | A web scraping and browser automation library | [Link](https://github.com/apify/crawlee-python) |

**📌Data Extraction - Documents**

| Library | Description | Link |
|---------|-------------|------|
| Docling (Document) | Docling parses documents and exports them to the desired format with ease and speed. | [Link](https://github.com/docling-project/docling) |
| Llama Parse (Document) | GenAI-native document parser that can parse complex document data for any downstream LLM use case (RAG, agents). | [Link](https://github.com/run-llama/llama_cloud_services) |
| PyMuPDF4LLM (Document) | PyMuPDF4LLM library makes it easier to extract PDF content in the format you need for LLM & RAG environments. | [Link](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/) |
| MegaParse (Document) | Parser for every type of documents | [Link](https://github.com/quivrhq/megaparse) |
| ExtractThinker (Document) | Document Intelligence library for LLMs | [Link](https://github.com/enoch3712/ExtractThinker) |

**📌Chunking**

The process of breaking down large text documents into smaller, more manageable chunks or segments. This is important for various applications, such as natural language processing, machine learning, and information retrieval. By splitting text into smaller pieces, it becomes easier to analyze, process, and retrieve relevant information.

| Library | Description | Link |
|---------|-------------|------|
| Chonkie | Lightweight, fast, and easy-to-use RAG chunking library that supports seven different chunking strategies. | [Link](https://github.com/chonkie-ai/chonkie) |

**📌Rerankers**

Reranking reorders the retrieved chunks by their relevance to the query, because the initial retrieval may not put the most relevant information first. A reranking model evaluates each retrieved chunk together with the query and assigns it a relevance score.

| Library | Description | Link |
|---------|-------------|------|
| Rerankers | A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models. Any new reranking models can be added with very little knowledge of the codebase. | [Link](https://github.com/AnswerDotAI/rerankers) |
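
The reranking step itself is a sort by a query-aware score. The sketch below uses word overlap as a stand-in scorer purely for illustration; in practice the `score` argument would be a cross-encoder or another reranking model, and both function names are hypothetical.

```python
def overlap_score(query, chunk):
    """Stand-in relevance score: fraction of query words that appear in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


def rerank(query, chunks, score=overlap_score, top_n=2):
    """Score every retrieved chunk against the query and keep the best `top_n`."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_n]


retrieved = ["football results today", "gold ring price rises again", "price of silver"]
top_chunks = rerank("gold ring price", retrieved)
print(top_chunks)
```

Only the reranked top chunks are then passed to the generator, which keeps the prompt short and focused.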

### RAG Frameworks

**📌Research**

| Library | Description | Link |
|---------|-------------|------|
| FlashRAG | A Python Toolkit for Efficient RAG Research. This toolkit includes 36 pre-processed benchmark RAG datasets and 16 state-of-the-art RAG algorithms. | [Link](https://github.com/RUC-NLPIR/FlashRAG) |

**📌RAG Framework**

End-to-end frameworks that provide integrated solutions and built-in tools for building RAG applications. They avoid writing code from scratch and speed up LLM application development.

| Library | Description | Link | GitHub Stars 🌟 |
|-----------|-------------|------|-------------|
| **LangChain** | Framework for building applications with LLMs and integrating with various data sources | [GitHub](https://github.com/langchain-ai/langchain) | ![GitHub stars](https://img.shields.io/github/stars/langchain-ai/langchain) |
| **LlamaIndex** | Data framework for building RAG systems with structured data | [GitHub](https://github.com/jerryjliu/llama_index) | ![GitHub stars](https://img.shields.io/github/stars/jerryjliu/llama_index) |
| **Haystack** | End-to-end framework for building NLP pipelines | [GitHub](https://github.com/deepset-ai/haystack) | ![GitHub stars](https://img.shields.io/github/stars/deepset-ai/haystack) |
| **fastRAG** | Research framework for efficient and optimized retrieval augmented generative pipelines, incorporating state-of-the-art LLMs and Information Retrieval. | [GitHub](https://github.com/IntelLabs/fastRAG) | ![GitHub stars](https://img.shields.io/github/stars/IntelLabs/fastRAG) |
| **Llmware** | Unified framework for building enterprise RAG pipelines with small, specialized models | [GitHub](https://github.com/llmware-ai/llmware) |![GitHub stars](https://img.shields.io/github/stars/llmware-ai/llmware) |
| **SmolAgents** | A barebones library for agents | [GitHub](https://github.com/huggingface/smolagents) | ![GitHub stars](https://img.shields.io/github/stars/huggingface/smolagents) |
| **txtai** | Open-source embeddings database for semantic search and LLM workflows | [GitHub](https://github.com/neuml/txtai) | ![GitHub stars](https://img.shields.io/github/stars/neuml/txtai) |
| **Pydantic AI** | Agent Framework / shim to use Pydantic with LLMs | [GitHub](https://github.com/pydantic/pydantic-ai) | ![GitHub stars](https://img.shields.io/github/stars/pydantic/pydantic-ai) |
| **OpenAI Agent** | A lightweight, powerful framework for multi-agent workflows | [GitHub](https://github.com/openai/openai-agents-python) | ![GitHub stars](https://img.shields.io/github/stars/openai/openai-agents-python) |

**📌Agentic RAG**

| Library | Description | Link |
|---------|-------------|------|
| CrewAI | Framework for orchestrating role-playing, autonomous AI agents. | [Link](https://github.com/crewAIInc/crewAI) |
| Agno | Build AI Agents with memory, knowledge, tools and reasoning. Chat with them using a beautiful Agent UI. | [Link](https://github.com/agno-agi/agno) |
| LangGraph | Build resilient language agents as graphs. | [Link](https://github.com/langchain-ai/langgraph) |
| AutoGen | An open-source framework for building AI agent systems. | [Link](https://github.com/microsoft/autogen) |
| R2R | Agentic Retrieval-Augmented Generation (RAG) with a RESTful API. R2R offers multimodal content ingestion, hybrid search functionality, knowledge graphs, and comprehensive user and document management.| [Link](https://github.com/SciPhi-AI/R2R) |
| Vectara | Build Agentic RAG applications. | [Link](https://vectara.github.io/py-vectara-agentic/latest/) |

**📌Graph RAG**

| Library | Description | Link |
|---------|-------------|------|
| GraphRAG | A modular graph-based Retrieval-Augmented Generation (RAG) system. | [Link](https://github.com/microsoft/graphrag) |
| Nano GraphRAG | A simple, easy-to-hack GraphRAG implementation. | [Link](https://github.com/gusye1234/nano-graphrag) |
| FastGraph RAG | Streamlined and promptable Fast GraphRAG framework designed for interpretable, high-precision, agent-driven retrieval workflows. | [Link](https://github.com/circlemind-ai/fast-graphrag) |



### Vector Database

📌Specialized databases that store and index text embeddings as high-dimensional vectors, enabling fast, efficient retrieval of semantically similar content for RAG applications.

| Database | Description | Link | GitHub Stars 🌟 |
|----------|-------------|------|-------------|
| LanceDB | Developer-friendly, embedded retrieval engine for multimodal AI | [GitHub](https://github.com/lancedb/lancedb) | ![GitHub stars](https://img.shields.io/github/stars/lancedb/lancedb) |
| Pinecone | Managed vector database for semantic search | [Link](https://www.pinecone.io/) |  |
| MongoDB | General-purpose document database | [Link](https://www.mongodb.com/) |  |
| Elasticsearch | Search and analytics engine that can store documents | [Link](https://www.elastic.co/) |  |
| SQLite-Vec | A vector search SQLite extension that runs anywhere! | [Link](https://github.com/asg017/sqlite-vec) |![GitHub stars](https://img.shields.io/github/stars/asg017/sqlite-vec) |
| FAISS | A library for efficient similarity search and clustering of dense vectors. | [Link](https://github.com/facebookresearch/faiss) |![GitHub stars](https://img.shields.io/github/stars/facebookresearch/faiss) |
| PGVector | Open-source vector similarity search for Postgres | [Link](https://github.com/pgvector/pgvector) |![GitHub stars](https://img.shields.io/github/stars/pgvector/pgvector) |
| Chroma | The AI-native open-source embedding database. The fastest way to build Python or JavaScript LLM apps with memory! | [Link](https://github.com/chroma-core/chroma) |![GitHub stars](https://img.shields.io/github/stars/chroma-core/chroma) |
| Qdrant | High-performance, massive-scale Vector Database and Vector Search Engine for the next generation of AI. | [Link](https://github.com/qdrant/qdrant) |![GitHub stars](https://img.shields.io/github/stars/qdrant/qdrant) |
| Pinecone | The vector database for machine learning applications. | [Link](https://github.com/pinecone-io) |![GitHub stars](https://img.shields.io/github/stars/pinecone-io) |
| Weaviate | Weaviate is a cloud-native, open source vector database that is robust, fast, and scalable. | [Link](https://github.com/weaviate/weaviate) |![GitHub stars](https://img.shields.io/github/stars/weaviate/weaviate) |
| Milvus | Milvus is a high-performance, cloud-native vector database built for scalable vector ANN search | [Link](https://github.com/milvus-io/milvus) |![GitHub stars](https://img.shields.io/github/stars/milvus-io/milvus) |



### Model LLMs and Embedding 

**📌LLM Models**

Large language models and platforms for generating responses based on retrieved context. These models are trained on vast amounts of text to generate human-like responses. LLMs form the core of RAG, enabling natural language understanding and generation.

| LLM | Description | Link |
|-----|-------------|------|
| OpenAI API | Access to GPT models through API | [Link](https://platform.openai.com/) |
| Claude | Anthropic's Claude series of LLMs | [Link](https://www.anthropic.com/claude) |
| Hugging Face LLM Models| Platform for open-source NLP models | [Link](https://huggingface.co/collections/open-llm-leaderboard/open-llm-leaderboard-best-models-652d6c7965a4619fb5c27a03) |
| LLaMA | Meta's open-source large language model | [Link](https://github.com/facebookresearch/llama) |
| Mistral | Open-source and commercial models | [Link](https://mistral.ai/) |
| Cohere | API access to generative and embedding models | [Link](https://cohere.com/) |
| DeepSeek | Advanced large language models for various applications | [Link](https://www.deepseek.com/) |
| Qwen | Alibaba Cloud's large language model accessible via API | [Link](https://www.alibabacloud.com/help/en/model-studio/developer-reference/use-qwen-by-calling-api) |
| Ollama | Run open-source LLMs locally | [Link](https://github.com/ollama/ollama) |


**📌Embedding**

Models and services for creating vector representations of text. These convert text into numerical vectors, capturing semantic meaning for similarity comparisons. They're crucial for retrieving relevant documents or chunks in RAG's retrieval step.

| Embedding Solution | Description | Link |
|-------------------|-------------|------|
| OpenAI Embeddings | API for text-embedding-ada-002 and newer models | [Link](https://platform.openai.com/docs/guides/embeddings) |
| Sentence Transformers | Python framework for state-of-the-art sentence embeddings | [Link](https://github.com/UKPLab/sentence-transformers) |
| Cohere Embed | Specialized embedding models API | [Link](https://cohere.com/embed) |
| Hugging Face Embeddings | Various embedding models | [Link](https://huggingface.co/models?pipeline_tag=feature-extraction) |
| E5 Embeddings | Microsoft's text embeddings | [Link](https://huggingface.co/intfloat/e5-large-v2) |
| BGE Embeddings | BAAI general embeddings | [Link](https://huggingface.co/BAAI/bge-large-en-v1.5) |



### Observability Monitoring

📌Tools for monitoring, analyzing, and improving LLM applications.

| Library | Description | Link | GitHub Stars 🌟 |
|------|-------------|------|-------------|
| Langfuse | Open source LLM engineering platform | [GitHub](https://github.com/langfuse/langfuse) | ![GitHub stars](https://img.shields.io/github/stars/langfuse/langfuse) |
| Opik/Comet | Debug, evaluate, and monitor LLM applications with tracing, evaluations, and dashboards | [GitHub](https://github.com/comet-ml/opik) | ![GitHub stars](https://img.shields.io/github/stars/comet-ml/opik) |
| Phoenix/Arize | Open-source observability for LLM applications | [GitHub](https://github.com/Arize-ai/phoenix) | ![GitHub stars](https://img.shields.io/github/stars/Arize-ai/phoenix) |
| Helicone | Open source LLM observability platform. One line of code to monitor, evaluate, and experiment | [GitHub](https://github.com/helicone/helicone) | ![GitHub stars](https://img.shields.io/github/stars/helicone/helicone) |
| Openlit | Open source platform for AI Engineering: OpenTelemetry-native LLM Observability, GPU Monitoring, Guardrails, Evaluations, Prompt Management, Vault, Playground | [GitHub](https://github.com/openlit/openlit) | ![GitHub stars](https://img.shields.io/github/stars/openlit/openlit) |
| Lunary | The production toolkit for LLMs. Observability, prompt management and evaluations. | [GitHub](https://github.com/lunary-ai/lunary) | ![GitHub stars](https://img.shields.io/github/stars/lunary-ai/lunary) |
| Langtrace | OpenTelemetry-based observability tool for LLM applications with real-time tracing and metrics | [GitHub](https://github.com/Scale3-Labs/langtrace) | ![GitHub stars](https://img.shields.io/github/stars/Scale3-Labs/langtrace) |



### Prompt Techniques

Methods and frameworks for effective prompt engineering in RAG systems.



**📌Open Source Prompt Engineering Tools**

| Library | Description | Link | GitHub Stars 🌟 |
|------|-------------|------|-------|
| Prompt Engineering Guide | Comprehensive guide to prompt engineering | [GitHub](https://github.com/dair-ai/Prompt-Engineering-Guide) | ![GitHub stars](https://img.shields.io/github/stars/dair-ai/Prompt-Engineering-Guide) |
| DSPy | Framework for programming language models instead of prompting | [GitHub](https://github.com/stanfordnlp/dspy) | ![GitHub stars](https://img.shields.io/github/stars/stanfordnlp/dspy) |
| Guidance | Language for controlling LLMs | [GitHub](https://github.com/guidance-ai/guidance) | ![GitHub stars](https://img.shields.io/github/stars/guidance-ai/guidance) |
| LLMLingua | Prompt compression library for faster LLM inference | [GitHub](https://github.com/microsoft/LLMLingua) | ![GitHub stars](https://img.shields.io/github/stars/microsoft/LLMLingua) |
| Promptify | NLP task prompt generator for GPT, PaLM and other models | [GitHub](https://github.com/promptslab/Promptify) | ![GitHub stars](https://img.shields.io/github/stars/promptslab/Promptify) |
| PromptSource | Toolkit for creating and sharing natural language prompts | [GitHub](https://github.com/bigscience-workshop/promptsource) | ![GitHub stars](https://img.shields.io/github/stars/bigscience-workshop/promptsource) |
| Promptimizer | Library for optimizing prompts | [GitHub](https://github.com/hinthornw/promptimizer) | ![GitHub stars](https://img.shields.io/github/stars/hinthornw/promptimizer) |
| Selective Context | Context compression tool that lets an LLM process roughly twice as much content | [GitHub](https://github.com/liyucheng09/Selective_Context) | ![GitHub stars](https://img.shields.io/github/stars/liyucheng09/Selective_Context) |
| betterprompt | Testing suite for LLM prompts before production | [GitHub](https://github.com/stjordanis/betterprompt) | ![GitHub stars](https://img.shields.io/github/stars/stjordanis/betterprompt) |



**📌Documentation & Services**

| Resource | Description | Link |
|----------|-------------|------|
| OpenAI Prompt Engineering | Official guide to prompt engineering from OpenAI | [Link](https://platform.openai.com/docs/guides/prompt-engineering) |
| LangChain Prompts | Templates and composition tools for prompts | [Link](https://python.langchain.com/docs/how_to/) |
| PromptPerfect | Tool for optimizing prompts | [Link](https://promptperfect.jina.ai/) |



### Evaluation

📌Tools and frameworks for assessing and improving RAG system performance. Evaluating a RAG application reveals its strengths and weaknesses; libraries such as RAGAS, Giskard, and TruLens support this.

| Library | Description | Link | GitHub Stars 🌟 |
|------|-------------|------|-------|
| FastChat | Open platform for training, serving, and evaluating LLM-based chatbots | [GitHub](https://github.com/lm-sys/fastchat) | ![GitHub stars](https://img.shields.io/github/stars/lm-sys/fastchat) |
| OpenAI Evals | Framework for evaluating LLMs and LLM systems | [GitHub](https://github.com/openai/evals) | ![GitHub stars](https://img.shields.io/github/stars/openai/evals) |
| RAGAS | Ultimate toolkit for evaluating and optimizing RAG systems | [GitHub](https://github.com/explodinggradients/ragas) | ![GitHub stars](https://img.shields.io/github/stars/explodinggradients/ragas) |
| Promptfoo | Open-source tool for testing and evaluating prompts | [GitHub](https://github.com/promptfoo/promptfoo) | ![GitHub stars](https://img.shields.io/github/stars/promptfoo/promptfoo) |
| DeepEval | Comprehensive evaluation library for LLM applications | [GitHub](https://github.com/confident-ai/deepeval) | ![GitHub stars](https://img.shields.io/github/stars/confident-ai/deepeval) |
| Giskard | Open-source evaluation and testing for ML & LLM systems | [GitHub](https://github.com/giskard-ai/giskard) | ![GitHub stars](https://img.shields.io/github/stars/giskard-ai/giskard) |
| PromptBench | Unified evaluation framework for large language models | [GitHub](https://github.com/microsoft/promptbench) | ![GitHub stars](https://img.shields.io/github/stars/microsoft/promptbench) |
| TruLens | Evaluation and tracking for LLM experiments with RAG-specific metrics | [GitHub](https://github.com/truera/trulens) | ![GitHub stars](https://img.shields.io/github/stars/truera/trulens) |
| EvalPlus | Rigorous evaluation framework for LLM4Code | [GitHub](https://github.com/evalplus/evalplus) | ![GitHub stars](https://img.shields.io/github/stars/evalplus/evalplus) |
| LightEval | All-in-one toolkit for evaluating LLMs | [GitHub](https://github.com/huggingface/lighteval) | ![GitHub stars](https://img.shields.io/github/stars/huggingface/lighteval) |
| LangTest | Test suite for comparing LLM models on accuracy, bias, fairness and robustness | [GitHub](https://github.com/JohnSnowLabs/langtest) | ![GitHub stars](https://img.shields.io/github/stars/JohnSnowLabs/langtest) |
| AgentEvals | Evaluators and utilities for measuring agent performance | [GitHub](https://github.com/langchain-ai/agentevals) | ![GitHub stars](https://img.shields.io/github/stars/langchain-ai/agentevals) |
| RAGChecker | Fine-grained framework for diagnosing RAG systems | [GitHub](https://github.com/amazon-science/RAGChecker) | ![GitHub stars](https://img.shields.io/github/stars/amazon-science/RAGChecker) |
| BeyondLLM | All-in-one toolkit for experimentation, evaluation, and deployment of RAG systems | [GitHub](https://github.com/aiplanethub/beyondllm) | ![GitHub stars](https://img.shields.io/github/stars/aiplanethub/beyondllm) |
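
Before adopting one of the frameworks above, it can help to see what a basic retrieval metric looks like. The sketch below computes hit rate at k and mean reciprocal rank (MRR) over a small, made-up set of queries. The function names and document IDs are hypothetical, and it assumes one relevant document per query; it is a generic illustration, not the API of RAGAS, TruLens, or any other listed library.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[str], k: int) -> float:
    """Fraction of queries whose relevant document ID appears in the top k results."""
    hits = sum(1 for ids, gold in zip(retrieved, relevant) if gold in ids[:k])
    return hits / len(relevant)


def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[str]) -> float:
    """Average of 1/rank of the relevant document; a miss contributes 0."""
    total = 0.0
    for ids, gold in zip(retrieved, relevant):
        if gold in ids:
            total += 1.0 / (ids.index(gold) + 1)
    return total / len(relevant)


# Hypothetical retriever output: ranked document IDs for three queries.
retrieved = [
    ["d1", "d4", "d2"],
    ["d7", "d3", "d9"],
    ["d5", "d6", "d8"],
]
# Hypothetical ground truth: the single relevant document for each query.
relevant = ["d1", "d3", "d2"]

hit_rate = hit_rate_at_k(retrieved, relevant, k=2)
mrr = mean_reciprocal_rank(retrieved, relevant)
print(f"hit rate@2 = {hit_rate:.3f}, MRR = {mrr:.3f}")
```

These metrics only cover the retrieval half of a RAG pipeline. Generation quality (for example, whether the answer is faithful to the retrieved context) needs separate checks, which is where the dedicated evaluation libraries come in.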

## RAG Survey Papers



**Paper**

| Paper | Category | Link |
|--------------------------------|------------------|------|
| Retrieval-Augmented Generation for Large Language Models: A Survey | General | [Link](https://arxiv.org/abs/2312.10997) |
| Retrieval-Augmented Generation for Natural Language Processing: A Survey | General | [Link](https://arxiv.org/abs/2407.13193) |
| A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions | General | [Link](https://arxiv.org/abs/2410.12837) |
| Retrieval-Augmented Generation for AI-Generated Content: A Survey | General | [Link](https://arxiv.org/abs/2402.19473) |
| A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models | General | [Link](https://arxiv.org/abs/2405.06211) |
| A Survey on Retrieval-Augmented Text Generation for Large Language Models | General | [Link](https://arxiv.org/abs/2404.10981) |
| Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely | General | [Link](https://arxiv.org/abs/2409.14924) |
| Graph Retrieval-Augmented Generation: A Survey | Graph RAG | [Link](https://arxiv.org/abs/2408.08921) |
| Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG | Agentic RAG | [Link](https://arxiv.org/abs/2501.09136) |
| Evaluation of Retrieval-Augmented Generation: A Survey | Evaluation | [Link](https://arxiv.org/abs/2405.07437) |
| Searching for Best Practices in Retrieval-Augmented Generation | RAG Best Practices | [Link](https://arxiv.org/abs/2407.01219) |
