Introduction to KDB.AI (5) - Sample code (LangChain and RAG)

Introduction

This is Jung from the Lakehouse Department of the GLB Division.

In this article, I walk through one of the sample codes for KDB.AI, which I introduced in the previous article.

KDB.AI is a vector database powered by kdb+, the world's fastest time series database and analysis engine, and you connect to it with an endpoint and an API key.

The Early Access Program was recently launched and you can try it out with a free account.

Instructions on how to create an account and Endpoints & API Keys can be found in the previous article. techblog.ap-com.co.jp

Please create an account, Endpoints, and API Keys to practice the sample code.

This time, I cover the "LangChain and RAG" sample.

The article proceeds in this order: an overview of the LangChain and RAG sample, how to create an OpenAI API Key and a Hugging Face API Token, and a walkthrough of the sample code.

Contents
Introduction
Introduction of sample code (LangChain and RAG)
Advance preparation matters and preparation method
- How to create an OpenAI API Key
- How to create a Hugging Face API Token
Sample code practice
Summary

Introduction of sample code (LangChain and RAG)

"LangChain and RAG" is a sample that practices two things: similarity search using LangChain's text splitter together with KDB.AI, and RAG, which answers user questions using external knowledge.

We load the TXT file that the LLM will draw on and use LangChain's text splitter to split the loaded document into chunks of at most 500 characters. After making a list of the chunks and embedding them, we perform a distance-based similarity search using KDB.AI's similarity_search. Then we build RAG pipelines with load_qa_chain and RetrievalQA, ask them the same question we used for similarity_search, and compare the results.

In the practical training, you will go through the following process.

➀ Load the TXT document using TextLoader and use LangChain's text splitter to split the loaded document into chunks of at most 500 characters.
➁ Make a list of the chunks created in ➀ and embed them with an OpenAI model.
③ Create a table in KDB.AI and save the embeddings.
④ Search for the content you are interested in using KDB.AI's similarity_search.
➝ The chunks most similar to the query are returned.
⑤ Create LLMs using OpenAI and HuggingFaceHub.
⑥ Pass the most similar chunks found in ④, together with the question from ④, to the LLMs created in ⑤.
➝ The OpenAI and HuggingFaceHub LLMs produce answers based on external knowledge.
⑦ Create a QA bot using RetrievalQA with GPT-3.5. Pass KDB.AI's vector store through as_retriever, the method that turns a vector store into a retriever.
⑧ Input the question from ④ into the QA bot created in ⑦.
➝ The answer is based on KDB.AI's vector store as external knowledge.

Advance preparation matters and preparation method

For this exercise, you will need an OpenAI API Key and a Hugging Face API Token.
The steps for creating each of them are described below.

How to create an OpenAI API Key

Create an account

➀ Create an account at the site below. platform.openai.com

Press the Sign up button on the homepage. (If you already have an account, please log in and skip ahead to the API key creation steps below.)

➁ Enter the email address and password you want to use.

③ Verify your email. (Open the email with the subject "OpenAI - Verify your email" and press the Verify email address button.)

④ Return to the OpenAI homepage, log in again, and enter your personal information.

⑤ Once your phone number has been verified, your account will be created.

Creating an API key

➀ Press your profile and then View API keys.

➁ Create an API key. Press the Create new secret key button.

Create a name for your API key and press the Create new secret key button.

An API key will be created. (You will not be able to view it again, so please save it separately.)

③ You can manage your API keys on the API Keys page. You can rename or delete existing keys and create new keys.

How to create a Hugging Face API Token

Create an account

➀ Create an account at the site below. huggingface.co

Press the Sign up button on the homepage.
(If you already have an account, please log in and skip ahead to the token creation steps below.)

➁ Enter the email address and password you want to use.

③ Create your own profile.

④ Verify your email. Open the email with the subject line "[Hugging Face] Click this link to confirm your email address" and press the link to create an account.

Creating an API token

➀ After logging into Hugging Face, press the menu button.

➁ Press Settings.

③ Press Access Tokens and then New token.

④ Create a token.
Enter a token name and select write for Role.

Press Generate a token to create the token.

Sample code practice

This exercise uses KDB.AI. kdb.ai

Let's try out the sample code on Databricks Workspace.

0. Databricks Cluster

The cluster uses Databricks Runtime 14.0 ML (includes Apache Spark 3.5.0, Scala 2.12), which runs Python 3.10.12.

1. Installing Dependencies

Install some libraries.

%pip install pandas pyarrow openai pypdf tiktoken kdbai-client git+https://github.com/KxSystems/langchain.git@KDB.AI#subdirectory=libs/langchain -q

Import the libraries you want to use.

# vector DB
from getpass import getpass
import os
import pandas as pd
import kdbai_client as kdbai

# langchain packages
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import KDBAI
from langchain import HuggingFaceHub
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

2. Setting API keys

Set the OpenAI API Key and Hugging Face API Token created above.

# Set the OpenAI API Key and the Hugging Face API Token
os.environ['OPENAI_API_KEY'] = getpass('OpenAI API Key: ')
os.environ['HUGGINGFACEHUB_API_TOKEN'] = getpass('Hugging Face API Token: ')

When the cell runs, an input field appears. Click it and enter your OpenAI API Key.
Once the OpenAI API Key has been entered, a second input field appears for the Hugging Face API Token.
Click the Hugging Face API Token input field and enter your token.
The OpenAI API Key and the Hugging Face API Token are now set.
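
If you rerun the notebook often, being prompted every time can get tedious. The following is a minimal sketch, not part of the original sample, of a helper that prompts only when the variable is not already set. The name ensure_secret is hypothetical.

```python
import os
from getpass import getpass


def ensure_secret(name: str, prompt: str) -> str:
    """Return os.environ[name], prompting for it only if it is missing or empty.

    ensure_secret is a hypothetical helper for illustration; it is not part of
    the KDB.AI sample or of LangChain.
    """
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed value so the key does not appear in the notebook output.
        value = getpass(prompt)
        os.environ[name] = value
    return value
```

For example, ensure_secret("OPENAI_API_KEY", "OpenAI API Key: ") would ask for the key on the first run only and reuse the stored value afterwards within the same session.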

3. Preparation of data

This time, we will use the text of a national policy message sent by the US President to the US Congress.

Load the TXT document using TextLoader.
This is the document the LLM will be asked about.
Use LangChain's text splitter, RecursiveCharacterTextSplitter, to split the loaded document into chunks of at most 500 characters. Note that chunk_size is measured in characters by default, not in sentences.
Make a list of the chunk texts.
Replace "My path where I uploaded the TXT" with the path to your TXT file.

# Load TXT documents using TextLoader
# This is the document the LLM will be asked about
loader = TextLoader("My path where I uploaded the TXT")
doc = loader.load()

# Use LangChain's text splitter "RecursiveCharacterTextSplitter"
# chunk_size is measured in characters: each chunk holds at most 500 characters
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 0
)

# Split the loaded document into chunks of at most 500 characters
chunks = text_splitter.split_documents(doc)
# Make a list of the chunk texts
texts = [p.page_content for p in chunks]
# Check the data
texts[0]
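
To make the meaning of chunk_size concrete, here is a simplified stand-in written in plain Python. It is not how RecursiveCharacterTextSplitter works internally (the real splitter prefers to break at separators such as paragraph and line breaks); it only illustrates that the limit is counted in characters. The function name split_by_characters and the sample sentence are made up for this sketch.

```python
def split_by_characters(text: str, chunk_size: int = 500, chunk_overlap: int = 0) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters.

    Simplified illustration only: the real RecursiveCharacterTextSplitter tries
    to break at natural separators before falling back to a hard cut.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Made-up sample text, repeated so that it spans several chunks.
sample = "KDB.AI stores one embedding per chunk. " * 40
sample_chunks = split_by_characters(sample, chunk_size=500, chunk_overlap=0)

# Every chunk is at most 500 characters long, regardless of sentence boundaries.
print([len(c) for c in sample_chunks])
```

With chunk_overlap = 0, as in the sample, joining the chunks back together reproduces the original text exactly.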

Create the embeddings.

An embedding represents data such as text as a vector of numbers, that is, a point in a high-dimensional space. These vectors can be used for similar-text search, clustering, recommendation services, and so on.
Create the embeddings using OpenAIEmbeddings.
Select the "text-embedding-ada-002" model, which is recommended by OpenAI and is easy to use and inexpensive.

embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')

Please check the link below for a detailed explanation of embeddings and "text-embedding-ada-002". platform.openai.com

4. Save embedding to KDB.AI

Connect to a KDB.AI session using your endpoint and API key.

How to create the endpoint and API key is explained in the previous blog post.

techblog.ap-com.co.jp

Replace "My KDB.AI ENDPOINT" with your KDB.AI endpoint.
Replace "My KDB.AI API KEY" with your KDB.AI API key.

import kdbai_client as kdbai

KDBAI_ENDPOINT = "My KDB.AI ENDPOINT"
KDBAI_API_KEY = "My KDB.AI API KEY"
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

Define the schema of the KDB.AI table that stores the embeddings.

Select the index and metric to use for search.
In this sample, we use a Flat index.
A Flat index compares the query with every stored vector, so it can be slow on large datasets, but its results are exact.
The similarity metric is L2, the Euclidean distance.
L2 is the straight-line distance between two points in the high-dimensional vector space; the smaller the distance, the more similar the texts.
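
As a rough illustration of what a Flat index with the L2 metric computes, the sketch below does a brute-force nearest-neighbour search in plain Python. It is an assumption-laden toy, not KDB.AI's implementation: the function names and the two-dimensional vectors are invented for the example, whereas the real embeddings have 1536 dimensions.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean (L2) distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query: list[float], vectors: list[list[float]], k: int = 3) -> list[tuple[int, float]]:
    """Compare the query with every stored vector and return the k closest.

    Returns (row index, distance) pairs, nearest first. Exhaustive comparison
    is what makes a flat search exact but linear in the number of vectors.
    """
    scored = [(l2_distance(query, v), i) for i, v in enumerate(vectors)]
    scored.sort()
    return [(i, d) for d, i in scored[:k]]


# Toy 2-dimensional "embeddings", invented for this example.
stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [0.9, 1.2]]
hits = flat_search([1.0, 1.0], stored, k=2)
print(hits)
```

The trade-off mentioned above is visible in flat_search: every query touches every stored vector, which is why approximate index types exist for larger tables.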

Please check the following page for details on the index handled by KDB.AI.

code.kx.com

Please check the following page for details on the metrics handled by KDB.AI.

code.kx.com

schema_rag = {
    "columns": [
        {"name": "id", "pytype": "str"},
        {"name": "text", "pytype": "bytes"},
        {
            "name": "embeddings",
            "pytype": "float32",
            "vectorIndex": {"dims": 1536, "metric": "L2", "type": "flat"},
        },
    ]
}
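
The dims value in the schema has to match the length of the vectors produced by the embedding model (1536 for "text-embedding-ada-002"). The following sketch shows one way to check this before inserting data. The helper names vector_dims and check_dims are hypothetical and are not part of kdbai_client; the schema is copied from above so that the block runs on its own.

```python
# Same schema as in the article, repeated so this block is self-contained.
schema_rag = {
    "columns": [
        {"name": "id", "pytype": "str"},
        {"name": "text", "pytype": "bytes"},
        {
            "name": "embeddings",
            "pytype": "float32",
            "vectorIndex": {"dims": 1536, "metric": "L2", "type": "flat"},
        },
    ]
}


def vector_dims(schema: dict) -> int:
    """Return the dims of the first column that declares a vectorIndex."""
    for column in schema["columns"]:
        index = column.get("vectorIndex")
        if index is not None:
            return index["dims"]
    raise ValueError("schema has no column with a vectorIndex")


def check_dims(schema: dict, vectors: list[list[float]]) -> None:
    """Raise ValueError if any vector's length differs from the schema's dims."""
    expected = vector_dims(schema)
    for i, vector in enumerate(vectors):
        if len(vector) != expected:
            raise ValueError(f"vector {i} has {len(vector)} dims, expected {expected}")


# A placeholder vector of the right length passes silently.
check_dims(schema_rag, [[0.0] * 1536])
```

If you switch to an embedding model with a different output size, the dims value in the schema would need to change accordingly.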

Create a table.

Create the rag_langchain table with the session.create_table() function.

table = session.create_table('rag_langchain', schema_rag)

Insert the embeddings into the table using KDBAI.from_texts.

vecdb_kdbai = KDBAI.from_texts(session, 'rag_langchain', texts=texts, embedding=embeddings)

table.query()

The embeddings are now saved in KDB.AI, so you can perform semantic search.

5. Similarity search

Let's search KDB.AI's vector store for the content we are interested in.

The search uses Euclidean distance to measure similarity.
The chunks most similar to the query are returned.

query = "what are the nations strengths?"

query_sim = vecdb_kdbai.similarity_search(query) 
print(query_sim)

6. Retrieval-augmented Generation (RAG)

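RAG is a technique in which passages retrieved from an external knowledge source are given to the LLM together with the question, so that the answer can draw on information beyond the model's own training data. In this sample, we try two LangChain approaches to RAG: load_qa_chain and RetrievalQA.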

We asked the QA chatbot on the LangChain official website about the characteristics of the two methods.

python.langchain.com

load_qa_chain: A function in the LangChain library that loads a preconfigured question-answering chain on top of an LLM such as OpenAI's. It takes care of the prompt, LLM, and other settings the QA chain requires.
RetrievalQA: A class in LangChain that implements a retrieval-augmented QA architecture, combining a retriever and a reader LLM to answer questions.

➀ load_qa_chain with OpenAI and HuggingFaceHub

Create LLMs using OpenAI and HuggingFaceHub.

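# Note: OpenAI has since retired text-davinci-003; gpt-3.5-turbo-instruct is its suggested replacement
llm_openai = OpenAI(model="text-davinci-003", max_tokens=512)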
llm_flan = HuggingFaceHub(repo_id="google/flan-t5-xxl", model_kwargs={"temperature":0.5, "max_length":512})

Create a chain by applying load_qa_chain to the two models.

chain_openAI = load_qa_chain(llm_openai, chain_type="stuff")
chain_HuggingFaceHub = load_qa_chain(llm_flan, chain_type="stuff")

query = "what are the nations strengths?"

# Find the chunks most similar to the query
query_sim = vecdb_kdbai.similarity_search(query) 

# Run the chain for each model
print("OpenAI Response: ")
print(chain_openAI.run(input_documents=query_sim, question=query),'\n')
print("HuggingFaceHub Response: ")
print(chain_HuggingFaceHub.run(input_documents=query_sim, question=query))
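
The chain_type="stuff" setting means that all of the supplied chunks are placed ("stuffed") into a single prompt together with the question. The sketch below illustrates that idea in plain Python. The function build_stuff_prompt, the prompt wording, and the sample chunks are assumptions made for illustration; LangChain's actual prompt template differs.

```python
def build_stuff_prompt(chunks: list[str], question: str) -> str:
    """Concatenate every chunk into one context block and append the question.

    Illustrative only: the wording is not LangChain's real template.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for the results of similarity_search.
demo_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
demo_prompt = build_stuff_prompt(demo_chunks, "what are the nations strengths?")
print(demo_prompt)
```

Because everything goes into one prompt, this approach is simple, but the combined chunks and question must fit within the model's context window.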

➁ RetrievalQA using GPT-3.5

Create a QA bot using RetrievalQA.

The query is again "what are the nations strengths?".
Pass KDB.AI's vector store through as_retriever, the method that turns a vector store into a retriever.

K = 10

qabot = RetrievalQA.from_chain_type(chain_type='stuff',
                                    llm=ChatOpenAI(model='gpt-3.5-turbo-16k', temperature=0.0),
                                    retriever=vecdb_kdbai.as_retriever(search_kwargs=dict(k=K)),
                                    return_source_documents=True)
print(query)
print("-----")
print(qabot(dict(query=query))["result"])

Through RetrievalQA, we obtained an answer about the nation's strengths that is based on the contents of the national policy message sent by the US President to the US Congress, retrieved from KDB.AI as external knowledge.
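
Conceptually, RetrievalQA performs two steps: retrieve the top K chunks for the question, then ask the LLM to answer from those chunks. The sketch below mimics that flow with stand-ins so that it runs without any API key. The keyword-overlap retriever, the stub LLM, the tiny corpus, and the function names are all invented for illustration. The returned keys mirror those of RetrievalQA with return_source_documents=True, although the real chain returns Document objects rather than plain strings.

```python
from typing import Callable


def answer_with_retrieval(
    question: str,
    retrieve: Callable[[str, int], list[str]],
    llm: Callable[[str], str],
    k: int = 10,
) -> dict:
    """Retrieve up to k chunks, put them into one prompt, and ask the LLM."""
    docs = retrieve(question, k)
    context = "\n\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return {"query": question, "result": llm(prompt), "source_documents": docs}


# Stand-in corpus and retriever: ranks chunks by the number of shared words.
corpus = ["Our economy is growing.", "Our alliances are strong.", "The weather was mild."]


def keyword_retrieve(question: str, k: int) -> list[str]:
    words = set(question.lower().split())

    def overlap(text: str) -> int:
        return len(words & set(text.lower().replace(".", "").split()))

    return sorted(corpus, key=lambda text: -overlap(text))[:k]


# Stand-in LLM: reports the prompt size instead of generating text.
def stub_llm(prompt: str) -> str:
    return f"stub answer built from a prompt of {len(prompt)} characters"


out = answer_with_retrieval("is the economy strong", keyword_retrieve, stub_llm, k=2)
print(out["result"])
print(out["source_documents"])
```

In the actual sample, vecdb_kdbai.as_retriever(search_kwargs=dict(k=K)) plays the role of the retriever and ChatOpenAI plays the role of the LLM.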

7. Delete table

Delete the table.

table.drop()

Summary

In this post, I introduced KDB.AI's "LangChain and RAG" sample code and walked through it on Databricks.

Through the practical training, I learned the following:

How to use LangChain's text splitter "RecursiveCharacterTextSplitter" to split a loaded document into chunks of text
How to build QA chains on OpenAI and HuggingFaceHub LLMs with load_qa_chain
How to create a QA bot using RetrievalQA

In the next post, I would like to try other KDB.AI sample codes on Databricks and introduce them in the same way.

Thank you for reading until the end. Thank you for your continued support!

We provide a wide range of support, from the introduction of data analysis platforms using Databricks to support for in-house production. If you are interested, please feel free to contact us.

www.ap-com.co.jp

We are also looking for people to work with us! We look forward to hearing from anyone interested in APC.