Web Page RAG

3.6.4.11. Web Page RAG

Authored by Kalyan KS. For updates on LLMs, RAG, and agents, you can follow me on Twitter.

Step-1 : Extract the web page text
Step-2 : Chunk the extracted web page text
Step-3 : Create a vector store with the extracted web page text chunks
Step-4 : Create a retriever which will return the relevant chunks
Step-5 : Build context from the relevant chunk texts
Step-6 : Build the RAG chain using the RAG prompt, the LLM, and a string output parser.
Step-7 : Run the RAG chain to get the answer.

3.6.4.11.1. Install and import libraries

!pip install -qU langchain langchain-community langchain-text-splitters
!pip install -qU langchain-openai langchain-chroma

from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda

3.6.4.11.2. Set up LLM API Key

Save your OpenAI API key under the name OPENAI_API_KEY in Google Colab Secrets, then load it into the environment.

from google.colab import userdata
import os
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
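
The cell above only works inside Google Colab. Below is a minimal sketch for running the notebook elsewhere, assuming the key is either already exported as an environment variable or can be typed in interactively. The helper name `ensure_openai_key` is hypothetical and not part of any library.

```python
import os
from getpass import getpass


def ensure_openai_key(env_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key, storing it in os.environ if it is not there yet.

    Hypothetical helper: tries the environment first, then Colab Secrets,
    and finally falls back to a hidden interactive prompt.
    """
    key = os.environ.get(env_name)
    if key:
        return key

    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
        key = userdata.get(env_name)
    except Exception:
        # Not running in Colab, or the secret is missing or not accessible.
        key = None

    if not key:
        # Hidden prompt, so the key never shows up in the notebook output.
        key = getpass(f"Enter {env_name}: ")

    os.environ[env_name] = key
    return key
```

Calling `ensure_openai_key()` once, before the embedding model or the LLM is created, should be enough, since both read the key from the environment.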

3.6.4.11.3. Extract web page text

from typing import List
from langchain_core.documents import Document

def wp_text(page_url: str) -> List[Document]:
    """
    Extracts text from the web page using WebBaseLoader.

    Parameters:
    page_url (str): The URL of the web page.

    Returns:
    List[Document]: A list of Document objects containing the text.
    """

    print("Web page text is extracted...")

    loader = WebBaseLoader(page_url)
    webpage_text = loader.load()

    return webpage_text

page_url = "https://x.ai/blog/grok-2"
webpage_text = wp_text(page_url)

Web page text is extracted...

print(webpage_text)

[Document(metadata={'source': 'https://x.ai/blog/grok-2', 'title': 'Grok-2 Beta Release', 'description': 'We announce our new Grok-2 and Grok-2 mini models.', 'language': 'en'}, page_content='Grok-2 Beta ReleaseGrokAPIBlogAboutCareersMenuAugust 13, 2024Grok-2 Beta ReleaseGrok-2 is our frontier language model with state-of-the-art reasoning capabilities. This release includes two members of the Grok family: Grok-2 and Grok-2 mini. Both models are now being released to Grok users on the 𝕏 platform.We are excited to release an early preview of Grok-2, a significant step forward from our previous model Grok-1.5, featuring frontier capabilities in chat, coding, and reasoning. At the same time, we are introducing Grok-2 mini, a small but capable sibling of Grok-2. An early version of Grok-2 has been tested on the LMSYS leaderboard under the name "sus-column-r." At the time of this blog post, it is outperforming both Claude 3.5 Sonnet and GPT-4-Turbo.Grok-2 and Grok-2 mini are currently in beta on 𝕏, and we are also making both models available through our enterprise API later this month.Grok-2 language model and chat capabilitiesWe introduced an early version of Grok-2 under the name "sus-column-r" into the LMSYS chatbot arena, a popular competitive language model benchmark. It outperforms both Claude and GPT-4 on the LMSYS leaderboard in terms of its overall Elo score.\n\nInternally, we employ a comparable process to evaluate our models. Our AI Tutors engage with our models across a variety of tasks that reflect real-world interactions with Grok. During each interaction, the AI Tutors are presented with two responses generated by Grok. They select the superior response based on specific criteria outlined in our guidelines. We focused on evaluating model capabilities in two key areas: following instructions and providing accurate, factual information. 
Grok-2 has shown significant improvements in reasoning with retrieved content and in its tool use capabilities, such as correctly identifying missing information, reasoning through sequences of events, and discarding irrelevant posts.BenchmarksWe evaluated the Grok-2 models across a series of academic benchmarks that included reasoning, reading comprehension, math, science, and coding. Both Grok-2 and Grok-2 mini demonstrate significant improvements over our previous Grok-1.5 model. They achieve performance levels competitive to other frontier models in areas such as graduate-level science knowledge (GPQA), general knowledge (MMLU, MMLU-Pro), and math competition problems (MATH). Additionally, Grok-2 excels in vision-based tasks, delivering state-of-the-art performance in visual math reasoning (MathVista) and in document-based question answering (DocVQA).\nBenchmarkGrok-1.5Grok-2 mini‡Grok-2‡GPT-4 Turbo*Claude 3 Opus†Gemini Pro 1.5Llama 3 405BGPT-4o*Claude 3.5 Sonnet†GPQA35.9%51.0%56.0%48.0%50.4%46.2%51.1%53.6%59.6%MMLU81.3%86.2%87.5%86.5%85.7%85.9%88.6%88.7%88.3%MMLU-Pro51.0%72.0%75.5%63.7%68.5%69.0%73.3%72.6%76.1%MATH§50.6%73.0%76.1%72.6%60.1%67.7%73.8%76.6%71.1%HumanEval¶74.1%85.7%88.4%87.1%84.9%71.9%89.0%90.2%92.0%MMMU53.6%63.2%66.1%63.1%59.4%62.2%64.5%69.1%68.3%MathVista52.8%68.1%69.0%58.1%50.5%63.9%—63.8%67.7%DocVQA85.6%93.2%93.6%87.2%89.3%93.1%92.2%92.8%95.2%\n* GPT-4-Turbo and GPT-4o scores are from the May 2024 release.\n† Claude 3 Opus and Claude 3.5 Sonnet scores are from the June 2024 release.\n‡ Grok-2 MMLU, MMLU-Pro, MMMU and MathVista were evaluated using 0-shot CoT.\n§ For MATH, we present maj@1 results.\n¶ For HumanEval, we report pass@1 benchmark scores.Experience Grok with real-time information on 𝕏Over the past few months, we\'ve been continuously improving Grok on the 𝕏 platform. 
Today, we\'re introducing the next evolution of the Grok experience, featuring a redesigned interface and new features.\n\n𝕏 Premium and Premium+ users will have access to two new models: Grok-2 and Grok-2 mini. Grok-2 is our state-of-the-art AI assistant with advanced capabilities in both text and vision understanding, integrating real-time information from the 𝕏 platform, accessible through the Grok tab in the 𝕏 app. Grok-2 mini is our small but capable model that offers a balance between speed and answer quality. Compared to its predecessor, Grok-2 is more intuitive, steerable, and versatile across a wide range of tasks, whether you\'re seeking answers, collaborating on writing, or solving coding tasks. In collaboration with Black Forest Labs, we are experimenting with their FLUX.1 model to expand Grok’s capabilities on 𝕏. If you are a Premium or Premium+ subscriber, make sure to update to the latest version of the 𝕏 app in order to beta test Grok-2.Build with Grok using the Enterprise APIWe are also releasing Grok-2 and Grok-2 mini to developers through our new enterprise API platform later this month. Our upcoming API is built on a new bespoke tech stack that allows multi-region inference deployments for low-latency access across the world. We offer enhanced security features such as mandatory multi-factor authentication (e.g. using a Yubikey, Apple TouchID, or TOTP), rich traffic statistics, and advanced billing analytics (incl. detailed data exports). We further offer a management API that allows you to integrate team, user, and billing management into your existing in-house tools and services. Join our newsletter to get notified when we launch later this month.What is next?Grok-2 and Grok-2 mini are being rolled out on 𝕏. 
We are very excited about their applications to a range of AI-driven features, such as enhanced search capabilities, gaining deeper insights on 𝕏 posts, and improved reply functions, all powered by Grok. Soon, we will release a preview of multimodal understanding as a core part of the Grok experience on 𝕏 and API.Since announcing Grok-1 in November 2023, xAI has been moving at an extraordinary pace, driven by a small team with the highest talent density. We have introduced Grok-2, positioning us at the forefront of AI development. Our focus is on advancing core reasoning capabilities with our new compute cluster. We will have many more developments to share in the coming months. We are looking for individuals to join our small, focused team dedicated to building the most impactful innovations for the future of humanity. Apply to our positions here.GrokAPILegalPrivacy Policy\n')]

print(f"Number of documents = {len(webpage_text)}")

Number of documents = 1

3.6.4.11.4. Chunk web page text

def wp_chunk(webpage_text: List[Document]) -> List[Document]:
    """
    Splits extracted web page text into smaller chunks using RecursiveCharacterTextSplitter.

    Parameters:
    webpage_text (List[Document]): A list of Document objects containing extracted web page text.

    Returns:
    List[Document]: A list of chunked Document objects.
    """

    print("Web page text is chunked....")

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = text_splitter.split_documents(webpage_text)

    return chunks

chunks = wp_chunk(webpage_text)

Web page text is chunked....

print(f"Number of chunks = {len(chunks)}")

Number of chunks = 9

print(chunks[0])

page_content='Grok-2 Beta ReleaseGrokAPIBlogAboutCareersMenuAugust 13, 2024Grok-2 Beta ReleaseGrok-2 is our frontier language model with state-of-the-art reasoning capabilities. This release includes two members of the Grok family: Grok-2 and Grok-2 mini. Both models are now being released to Grok users on the 𝕏 platform.We are excited to release an early preview of Grok-2, a significant step forward from our previous model Grok-1.5, featuring frontier capabilities in chat, coding, and reasoning. At the same time, we are introducing Grok-2 mini, a small but capable sibling of Grok-2. An early version of Grok-2 has been tested on the LMSYS leaderboard under the name "sus-column-r." At the time of this blog post, it is outperforming both Claude 3.5 Sonnet and GPT-4-Turbo.Grok-2 and Grok-2 mini are currently in beta on 𝕏, and we are also making both models available through our enterprise API later this month.Grok-2 language model and chat capabilitiesWe introduced an early version of Grok-2' metadata={'source': 'https://x.ai/blog/grok-2', 'title': 'Grok-2 Beta Release', 'description': 'We announce our new Grok-2 and Grok-2 mini models.', 'language': 'en'}
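
To get a feel for what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sliding-window splitter. It is only an illustration and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph breaks, newlines, and spaces, so its chunks are often shorter than `chunk_size`. The function name `sliding_window_chunks` and the sample string are made up for this sketch.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(25))  # 25 digits: 0123456789012...
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one. The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk.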

3.6.4.11.5. Create Vector Store

# Set the chroma DB path
current_dir = "/content/rag"
persistent_directory = os.path.join(current_dir, "db", "chroma_db_wp")

def create_vector_store(chunks: List[Document], db_path: str) -> Chroma:
    """
    Creates a Chroma vector store from chunked documents.

    Parameters:
    chunks (List[Document]): A list of chunked Document objects.
    db_path (str): The directory path to persist the vector store.

    Returns:
    Chroma: A Chroma vector store containing the embedded documents.
    """

    print("Chroma vector store is created...\n")

    embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
    db = Chroma.from_documents(documents=chunks, embedding=embedding_model, persist_directory=db_path)

    return db

db = create_vector_store(chunks, persistent_directory)

Chroma vector store is created...

3.6.4.11.6. Retrieve relevant chunks

def retrieve_context(db: Chroma, query: str) -> List[Document]:
    """
    Retrieves relevant document chunks from the Chroma vector store based on a query.

    Parameters:
    db (Chroma): The Chroma vector store containing embedded documents.
    query (str): The query string to search for relevant document chunks.

    Returns:
    List[Document]: A list of retrieved relevant document chunks.
    """

    retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 2})
    print("Relevant chunks are retrieved...\n")
    relevant_chunks = retriever.invoke(query)

    return relevant_chunks

query = 'What is Grok 2?'

relevant_chunks = retrieve_context(db, query)

Relevant chunks are retrieved...

print(f"Number of relevant chunks = {len(relevant_chunks)}")

Number of relevant chunks = 2

for i, chunk in enumerate(relevant_chunks):
  print(f"Chunk-{i}")
  print(chunk)
  print("\n")

Chunk-0
page_content='Grok-2 Beta ReleaseGrokAPIBlogAboutCareersMenuAugust 13, 2024Grok-2 Beta ReleaseGrok-2 is our frontier language model with state-of-the-art reasoning capabilities. This release includes two members of the Grok family: Grok-2 and Grok-2 mini. Both models are now being released to Grok users on the 𝕏 platform.We are excited to release an early preview of Grok-2, a significant step forward from our previous model Grok-1.5, featuring frontier capabilities in chat, coding, and reasoning. At the same time, we are introducing Grok-2 mini, a small but capable sibling of Grok-2. An early version of Grok-2 has been tested on the LMSYS leaderboard under the name "sus-column-r." At the time of this blog post, it is outperforming both Claude 3.5 Sonnet and GPT-4-Turbo.Grok-2 and Grok-2 mini are currently in beta on 𝕏, and we are also making both models available through our enterprise API later this month.Grok-2 language model and chat capabilitiesWe introduced an early version of Grok-2' metadata={'description': 'We announce our new Grok-2 and Grok-2 mini models.', 'language': 'en', 'source': 'https://x.ai/blog/grok-2', 'title': 'Grok-2 Beta Release'}


Chunk-1
page_content='and Grok-2 mini demonstrate significant improvements over our previous Grok-1.5 model. They achieve performance levels competitive to other frontier models in areas such as graduate-level science knowledge (GPQA), general knowledge (MMLU, MMLU-Pro), and math competition problems (MATH). Additionally, Grok-2 excels in vision-based tasks, delivering state-of-the-art performance in visual math reasoning (MathVista) and in document-based question answering (DocVQA).' metadata={'description': 'We announce our new Grok-2 and Grok-2 mini models.', 'language': 'en', 'source': 'https://x.ai/blog/grok-2', 'title': 'Grok-2 Beta Release'}
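
With `search_type="similarity"` and `k=2`, the retriever returns the two chunks whose embeddings are closest to the query embedding. The sketch below shows the idea with hand-made 3-dimensional vectors instead of real embeddings. The vector values and the `top_k` helper are invented for illustration. It uses cosine similarity, a common choice; the metric Chroma actually applies depends on how the collection is configured.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunk_vecs: List[Sequence[float]],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings": one vector per chunk, values chosen by hand.
toy_chunk_vectors = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.6, 0.1],
    [0.0, 0.1, 0.9],
]
toy_query_vector = [1.0, 0.2, 0.0]

top_hits = top_k(toy_query_vector, toy_chunk_vectors, k=2)
print([index for index, _ in top_hits])  # [0, 2]
```

Raising `k` gives the LLM more context to work with, at the cost of a longer prompt and a higher chance of including loosely related chunks.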

3.6.4.11.7. Build context

def build_context(relevant_chunks: List[Document]) -> str:
    """
    Builds a context string from retrieved relevant document chunks.

    Parameters:
    relevant_chunks (List[Document]): A list of retrieved relevant document chunks.

    Returns:
    str: A concatenated string containing the content of the relevant chunks.
    """

    print("Context is built from relevant chunks")
    context = "\n\n".join([chunk.page_content for chunk in relevant_chunks])

    return context

context = build_context(relevant_chunks)

Context is built from relevant chunks

print(context)

Grok-2 Beta ReleaseGrokAPIBlogAboutCareersMenuAugust 13, 2024Grok-2 Beta ReleaseGrok-2 is our frontier language model with state-of-the-art reasoning capabilities. This release includes two members of the Grok family: Grok-2 and Grok-2 mini. Both models are now being released to Grok users on the 𝕏 platform.We are excited to release an early preview of Grok-2, a significant step forward from our previous model Grok-1.5, featuring frontier capabilities in chat, coding, and reasoning. At the same time, we are introducing Grok-2 mini, a small but capable sibling of Grok-2. An early version of Grok-2 has been tested on the LMSYS leaderboard under the name "sus-column-r." At the time of this blog post, it is outperforming both Claude 3.5 Sonnet and GPT-4-Turbo.Grok-2 and Grok-2 mini are currently in beta on 𝕏, and we are also making both models available through our enterprise API later this month.Grok-2 language model and chat capabilitiesWe introduced an early version of Grok-2

and Grok-2 mini demonstrate significant improvements over our previous Grok-1.5 model. They achieve performance levels competitive to other frontier models in areas such as graduate-level science knowledge (GPQA), general knowledge (MMLU, MMLU-Pro), and math competition problems (MATH). Additionally, Grok-2 excels in vision-based tasks, delivering state-of-the-art performance in visual math reasoning (MathVista) and in document-based question answering (DocVQA).

3.6.4.11.8. Combine all the steps into one function

import os
from typing import Dict
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

def get_context(inputs: Dict[str, str]) -> Dict[str, str]:
    """
    Creates or loads a vector store for the web page text and retrieves relevant chunks based on a query.

    Args:
        inputs (Dict[str, str]): A dictionary containing the following keys:
            - 'page_url' (str): Web page URL
            - 'query' (str): The user query.
            - 'db_path' (str): Path to the vector database.

    Returns:
        Dict[str, str]: A dictionary containing:
            - 'context' (str): Extracted relevant context.
            - 'query' (str): The user query.
    """
    page_url, query, db_path = inputs['page_url'], inputs['query'], inputs['db_path']

    # Create new vector store if it does not exist
    if not os.path.exists(db_path):
        print("Creating a new vector store...\n")
        webpage_text = wp_text(page_url)
        chunks = wp_chunk(webpage_text)
        db = create_vector_store(chunks, db_path)

    # Load the existing vector store
    else:
        print("Loading the existing vector store\n")
        embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
        db = Chroma(persist_directory=db_path, embedding_function=embedding_model)

    relevant_chunks = retrieve_context(db, query)
    context = build_context(relevant_chunks)

    return {'context': context, 'query': query}
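
Note that `get_context` decides between "create" and "load" only by checking whether `db_path` exists. If it is called with a different `page_url` but the same `db_path`, it reuses the store built from the first page. One possible way around this, sketched under the assumption that one directory per URL is acceptable, is to derive the path from the URL. The helper name `db_path_for_url` and the second URL are hypothetical.

```python
import hashlib
import os


def db_path_for_url(base_dir: str, page_url: str) -> str:
    """Build a store directory name that is stable for a URL and unique per URL."""
    # A short prefix of the SHA-256 digest keeps the directory name readable.
    url_hash = hashlib.sha256(page_url.encode("utf-8")).hexdigest()[:16]
    return os.path.join(base_dir, f"chroma_db_{url_hash}")


path_a = db_path_for_url("/content/rag/db", "https://x.ai/blog/grok-2")
path_b = db_path_for_url("/content/rag/db", "https://example.com/other-page")
print(path_a != path_b)  # True: each page gets its own store directory
```

The resulting path can be passed as `db_path` in the input dictionary, so each page is embedded once and then loaded from disk on later queries.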

3.6.4.11.9. Build RAG chain

template = """ You are an AI model trained for question answering. You should answer the
  given question based on the given context only.
  Question : {query}
  \n
  Context : {context}
  \n
  If the answer is not present in the given context, respond as: The answer to this question is not available
  in the provided content.
  """

rag_prompt = ChatPromptTemplate.from_template(template)

llm = ChatOpenAI(model='gpt-4o-mini')

str_parser = StrOutputParser()

rag_chain = (
    RunnableLambda(get_context)
    | rag_prompt
    | llm
    | str_parser
)
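
The `|` operator feeds each step's output into the next step. `get_context` turns the input dictionary into a dictionary with `context` and `query`, the prompt fills its placeholders from that dictionary, the LLM produces a message, and the parser extracts the text. The plain-Python sketch below mimics this data flow with stand-ins. `fake_get_context`, `fake_llm`, and `pipe` are hypothetical stubs, and no LangChain or OpenAI call is involved.

```python
from functools import reduce
from typing import Any, Callable, Dict

PROMPT = "Question: {query}\nContext: {context}"


def fake_get_context(inputs: Dict[str, str]) -> Dict[str, str]:
    # Stand-in for get_context: returns a fixed context instead of querying Chroma.
    return {"context": "Grok-2 is a frontier language model.", "query": inputs["query"]}


def fill_prompt(values: Dict[str, str]) -> str:
    # Stand-in for the prompt template: fills {query} and {context}.
    return PROMPT.format(**values)


def fake_llm(prompt_text: str) -> Dict[str, str]:
    # Stand-in for the chat model: echoes the context back as a message-like dict.
    context_line = prompt_text.split("Context: ", 1)[1]
    return {"content": context_line}


def parse_output(message: Dict[str, str]) -> str:
    # Stand-in for the string output parser: keeps only the message text.
    return message["content"]


def pipe(*steps: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Compose steps left to right, passing each output to the next step."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


toy_chain = pipe(fake_get_context, fill_prompt, fake_llm, parse_output)
toy_answer = toy_chain(
    {"page_url": "https://x.ai/blog/grok-2", "query": "What is Grok 2?", "db_path": "unused"}
)
print(toy_answer)  # Grok-2 is a frontier language model.
```

Because every step takes a single input and returns a single output, a step can be swapped (for example, a different retriever or model) without touching the rest of the chain.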

3.6.4.11.10. Run RAG chain

# Set the chroma DB path
current_dir = "/content/rag"
persistent_directory = os.path.join(current_dir, "db", "chroma_db_wp")

# Web page URL
page_url = "https://x.ai/blog/grok-2"

# Write the query
query = 'What is Grok 2?'

answer = rag_chain.invoke({'page_url':page_url, 'query':query, 'db_path':persistent_directory})

Loading the existing vector store

Relevant chunks are retrieved...

Context is built from relevant chunks

print(f"Query:{query}\n")
print(f"Generated answer:{answer}")

Query:What is Grok 2?

Generated answer:Grok-2 is a frontier language model that features state-of-the-art reasoning capabilities, representing a significant advancement over its predecessor, Grok-1.5. It is designed for chat, coding, and reasoning tasks, and it is currently available in beta on the 𝕏 platform. Alongside Grok-2, a smaller version named Grok-2 mini is also released. Grok-2 has been tested and shown to outperform other models like Claude 3.5 Sonnet and GPT-4-Turbo in various performance metrics, including graduate-level science knowledge, general knowledge, and math problems, as well as excelling in vision-based tasks.