Langchain – Tool that helps you build AI apps with ease

June 9, 2023
Reading time: 17 mins

This is a transcript of a talk I recently presented at my company to an audience drawn from all departments. I went over general definitions, LangChain, some of its modules, and two applications I had built recently: a question-answering application and a data interpreter.

Definitions

Let’s get started with some definitions, just to get them out of the way. ChatGPT is an OpenAI language model that employs deep learning methods to produce human-like responses in conversations. Most, if not all, of us are familiar with ChatGPT. LLM, on the other hand, stands for Large Language Model, which encompasses all those models extensively trained on textual data, enabling contextually appropriate and human-like responses from these AI systems. Now, let’s move on to prompt engineering.

Prompt Engineering

Prompt engineering has emerged as a crucial skill, with some considering it a game-changing skill you need to learn in 2023. As engineers, prompt engineering may sound like an over-engineered field at first. However, it is an interesting area to explore. So, what makes it complicated, and why is there an engineering aspect involved?

Prompt Template

Essentially, prompts are the questions we pose to AI models, expecting answers in return. Prompt templates play a significant role in prompt engineering: they provide structured instructions to the model. These templates typically include several sections. First come the instructions, where we often cast the LLM as an expert in a particular field and guide it on how to formulate answers. Next is a section for context and examples, where we present the model with sample questions and their corresponding answers; this helps the model generate responses that align with the user’s expectations. Finally, we have the input, a crucial part of the template, where the user’s actual question or query is placed, usually by interpolating an input variable into the template. Many of the tools discussed later rely on prompt templates like this.
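As a concrete illustration, here is a minimal sketch of such a template in plain Python. The veterinarian role, the example Q&A pair, and the `{input}` variable are all made up for this example; they are not from the talk's slides.

```python
# A hypothetical prompt template with the three sections described above:
# instructions, context/examples, and the user's input.
TEMPLATE = """You are an expert veterinarian. Answer questions concisely.
If you do not know the answer, say so instead of guessing.

Example:
Q: How often should adult cats be fed?
A: Most adult cats do well with two measured meals per day.

Q: {input}
A:"""

def build_prompt(question: str) -> str:
    # Interpolate the user's question into the template's input slot.
    return TEMPLATE.format(input=question)

prompt = build_prompt("What protein level suits kittens?")
```

The instructions and examples stay fixed, so the same template can be reused for every incoming question.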

🦜🔗 Langchain

Now, let’s move on to the main topic of discussion today, which is LangChain. LangChain is a Python library that simplifies the prompt engineering process. As shown on the slides, it provides various tools to create customized prompt templates and other aids to enhance prompt engineering. In today’s discussion, I will highlight a few of these tools and provide some code examples. Please note that there are additional tools offered by LangChain that we won’t cover today.

  • Document Loaders
  • Text Splitters
  • Vectorstores
  • ————————
  • Prompts
  • Chains
  • Agents

The first three tools listed focus on preparing the data for constructing the question prompts. They ensure that the input for the prompt is formatted correctly and aligned with the desired outcome. For example, you might need to gather data from your own database or from text files stored in your file system. In such cases, you need loaders to fetch the data, text splitters to split the text into manageable chunks, and vector stores, which we’ll discuss shortly. The remaining tools listed below those three are modules that leverage the power of an LLM in the background. We’ll explore these modules later in today’s session.

Loaders

Let’s start with the document loader, which, as the name suggests, loads documents in various formats such as plain text files, CSV, JSON, or any other unstructured text format. It collects the relevant text from these files and stores it in memory, typically referred to as documents. Text splitters come into play by dividing these documents into smaller chunks. This step is necessary for several purposes, such as searching for specific contexts within a document, enabling efficient retrieval of relevant information. Additionally, text splitting is beneficial for tasks like clustering, which we won’t delve into today.
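The splitting step can be sketched in a few lines of plain Python. This is only the general idea, not LangChain's actual splitter: fixed-size chunks with a small overlap, so context that falls on a chunk boundary is not lost.

```python
def split_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly,
    so a sentence cut at a boundary still appears whole in one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = "LangChain loaders read files into documents; splitters chop them up. " * 5
chunks = split_text(doc, chunk_size=80, overlap=10)
```

Each chunk is later embedded separately, so the chunk size trades off retrieval precision against how much context one chunk carries.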

Embedding

Finally, we’ll look at the concept of word embedding, which I inadvertently missed mentioning on the slides. Essentially, embedding involves representing words, tokens, sentences, or paragraphs by assigning feature values to different dimensions. Although the numbers themselves may not hold specific meanings to us, embedding models establish connections, proximity, and categorizations among words in multidimensional spaces. To illustrate, we could have dimensions like animal type (e.g., dog), animal class, color, etc.

Embedding accomplishes this task but in a vast number of dimensions, such as 4,000. It becomes challenging to visualize this in a graph. However, the embedding model itself can embed a given query, compare it with other embeddings in the database, and identify similarities. This enables searching for phrases in a book, for instance, and finding the most relevant page or document based on context and relevance.

The key difference from lexical, word-based searches is that we go beyond simply finding occurrences of words in a sentence; semantic search plays a more significant role here. To facilitate this, we’ll be using Pinecone, a vector database service that acts as storage for our embeddings. These embeddings are essentially the numerical values assigned to each chunk we create.
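Under the hood, this similarity comparison usually boils down to comparing vectors, for example with cosine similarity. Here is a toy sketch with made-up 3-dimensional "embeddings"; real models use hundreds or thousands of dimensions, and a real vector store does this at scale.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # 1.0 means the vectors point the same way; near 0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings for three stored chunks (values are invented).
store = {
    "dog food": [0.9, 0.1, 0.0],
    "cat food": [0.8, 0.3, 0.0],
    "weather":  [0.0, 0.1, 0.9],
}
query = [0.9, 0.15, 0.0]  # pretend this is the embedded user query
best = max(store, key=lambda k: cosine_similarity(query, store[k]))
```

The query about dog food lands closest to the "dog food" chunk even though no words were compared, which is exactly what makes the search semantic rather than lexical.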

Prompts

Moving on, we have modules that utilize these chunks, vectors, and other elements to generate appropriate prompts. Depending on the application, this could be building a conversational AI chatbot, a question-answering AI, or even a classification model for data categorization. Prompts serve as the foundation in LangChain. Each prompt is defined as a prompt template, which includes input variables, as shown in this example. By using curly braces and a designated input variable, you can reuse the template multiple times. This is a common practice across various applications.

Chains

Now, let’s discuss the next module, called chains. Chains represent a sequence of executable steps that perform multiple actions. In a basic example, a formatting stage prepares the prompt, then an LLM stage sends the question to the LLM API, and finally the response is formatted in a specific way. Such a sequence of steps is referred to as an LLM chain. There are many other kinds of chains, which will be covered after this demo. For instance, one application we have been working on has knowledge about our database, specifically our data engineering guidelines: the relevant repositories were scraped, and the steps mentioned earlier, like embedding, were performed on them. Let’s move on to the demo.
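The format → ask → post-process sequence can be sketched like this, with a stub function standing in for the real LLM API call:

```python
def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM API call; returns a canned response.
    return "ANSWER: Berlin"

def llm_chain(template: str, llm, **inputs) -> str:
    prompt = template.format(**inputs)           # step 1: format the prompt
    raw = llm(prompt)                            # step 2: ask the LLM
    return raw.removeprefix("ANSWER:").strip()   # step 3: format the response

result = llm_chain("What is the capital of {country}?", fake_llm, country="Germany")
```

Swapping `fake_llm` for a real API client is all it would take to make this a working chain; the surrounding steps stay the same.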

Langchain QA Chain

Fine-tuning a model vs. Vector Search

To work with your own data, there are two approaches. One is to start with a base model provided by OpenAI, or any model trained on generic text, which can understand the English language and hold a conversation, and then train the model further on your own data. This process requires a significant amount of data, typically thousands or tens of thousands of chunks, each consisting of around 8,000 words. Training your own model is also time-consuming and resource-intensive.

The alternative approach is to use a basic text model like text-davinci-003, which we are currently using, and keep a separate vector database for your data. In this case, you create chunks of your data, store them as vectors in the database, and perform similarity searches against it. If we have enough time later on, we can explore this further.

When querying the vector database, you retrieve a chunk of text from your own data, not a pre-prepared answer. You then provide this chunk of text to the LLM model using a template with context and a question prompt. The LLM model reads and understands the text before generating an answer. We ensure it doesn’t hallucinate or produce inaccurate information by explicitly instructing the LLM not to make up answers or go beyond the given context.
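This retrieve-then-answer flow can be sketched as follows. The retriever is a stub returning a canned chunk (a real one would embed the question and search the vector store), and the template carries the anti-hallucination instruction described above.

```python
QA_TEMPLATE = """Answer the question using ONLY the context below.
If the answer is not in the context, say "I don't know." Do not make up answers.

Context: {context}
Question: {question}
Answer:"""

def retrieve(question: str) -> str:
    # Stand-in for a vector-store similarity search; a real implementation
    # would embed the question and return the most similar stored chunk.
    return "Meow Delight is a premium cat food with 30% protein."

def answer(question: str, llm) -> str:
    context = retrieve(question)
    return llm(QA_TEMPLATE.format(context=context, question=question))
```

Note that the model never answers from its own memory; the retrieved chunk is the only source material the prompt allows.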

While the LLM model can sometimes generate relevant answers, it can also generate non-relevant or factually incorrect content. AI-generated content often lacks proper context and may be based on sentences it has encountered before. To prevent hallucination, we explicitly instruct the LLM not to answer questions it doesn’t know the answer to.

Types of Chains

Different chains, such as LLM chains, sequential chains, and routing chains, can be utilized in the LangChain framework. LLM chains were explained earlier and involve a sequence of steps performed around the LLM model. Sequential chains allow multiple chains to run consecutively, where the output of one chain becomes the input of the next. Routing chains involve conditional execution based on specific conditions.
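Both ideas fit in a few lines. This sketch uses plain functions as stand-in "chains" (the country/capital lookup is invented for illustration), not LangChain's actual classes:

```python
def sequential_chain(steps, value):
    # Run each chain in order; the output of one becomes the input of the next.
    for step in steps:
        value = step(value)
    return value

# Two toy "chains": one maps a country to its capital, the next builds a question.
capital_of = lambda country: {"Germany": "Berlin", "France": "Paris"}[country]
weather_question = lambda city: f"What is the weather in {city}?"

question = sequential_chain([capital_of, weather_question], "Germany")

def route(query: str):
    # A routing chain: choose which chain to run based on a condition.
    return capital_of if "capital" in query else weather_question
```

The sequential chain threads "Germany" → "Berlin" → a weather question, while `route` picks a chain conditionally, mirroring the three chain types listed above.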

Agents

Agents, as the name suggests, are pieces of code that orchestrate tasks to answer complex questions. They are not the models themselves but facilitate the interaction between different tasks. Agents iterate until they find the final answer, performing various tasks and utilizing different tools within their toolbox. They automate the execution of multiple tasks within a single command.

Langchain Agent Thought process

For example, if you ask an agent the question, “When was the founder of Craigslist born?” the agent would identify the intermediate questions required to answer the main question. It might ask, “Who was the founder of Craigslist?” and then ask the LLM model for the answer. It would then ask, “When was he born?” and obtain the answer from the LLM model. By going through these intermediate steps, the agent arrives at the final answer.
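The iteration described above can be sketched as a toy agent loop. The lookup table stands in for what an LLM might answer; these are illustrative values, not real API calls.

```python
# Illustrative "knowledge" standing in for LLM answers.
FACTS = {
    "Who was the founder of Craigslist?": "Craig Newmark",
    "When was Craig Newmark born?": "December 6, 1952",
}

def ask_llm(question: str) -> str:
    return FACTS.get(question, "I don't know")

def agent(question: str) -> str:
    # Decompose the compound question into the intermediate steps described above.
    if question == "When was the founder of Craigslist born?":
        founder = ask_llm("Who was the founder of Craigslist?")  # intermediate step 1
        return ask_llm(f"When was {founder} born?")              # intermediate step 2
    return ask_llm(question)
```

A real agent decides the decomposition itself by prompting the LLM for a "thought" and "action" at each step, rather than hard-coding it as done here.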

Langchain agent thought process: Weather tool

Another example involves querying the weather in the capital of Germany. The agent would first determine the capital of Germany, which is Berlin, and then proceed to check the weather in Berlin. Since the LLM model doesn’t have direct knowledge of the weather in Berlin, the agent would use a tool, such as a Weather API, to retrieve the desired information. The agent utilizes the tool by making a request with the input gathered from previous questions and returns the answer to the user.

These are some of the concepts and functionalities within the LangChain framework.

Pandas Dataframe Agent

We’re running over time, but here is one more agent. It’s called the Pandas DataFrame Agent, and it’s even included in the library. It uses the Python library pandas to work with data frames. It takes in natural-language questions and answers them in natural language, or with charts and the like.

So, I created this dataset using ChatGPT, some pet food products with prices, and all this data.

[
  {
    "Name": "Meow Delight",
    "Pet Type": "Cat",
    "Nutritional Facts": "Protein - 30%, Fat - 12%, Fiber - 4%",
    "Expiry Date": "November 15, 2023",
    "Price": "£9.99",
    "Quantity in Stock": "100 units",
    "Description": "Meow Delight is a premium cat food formulated to provide complete and balanced nutrition ...."
  },
  {
    "Name": "Woofy's Choice",
    "Pet Type": "Dog",
    "Nutritional Facts": "Protein - 24%, Fat - 18%, Fiber - 3%",
    "Expiry Date": "December 10, 2023",
    "Price": "£12.49",
    "Quantity in Stock": "75 units",
    "Description": "Woofy's Choice is a high-quality dog food crafted with real beef, vegetables,..."
  }
]

And I loaded that and then simply used this agent and started asking questions like, “How many rows are there?” It’s quite simple. And then, “What are the three most expensive products?” For example, it knows what queries to run, and this is all inside the agent’s brain.

> Entering new AgentExecutor chain...
Thought: I need to sort the dataframe by price in descending order.
Action: python_repl_ast
Action Input: df.sort_values(by='Price', ascending=False)
Observation:                Name Pet Type                       Nutritional Facts  \
8      Puppy Palate      Dog  Protein - 26%, Fat - 14%, Fiber - 3.5%   
1    Woofy's Choice      Dog    Protein - 24%, Fat - 18%, Fiber - 3%   
7  Purrfectly Plump      Cat    Protein - 32%, Fat - 16%, Fiber - 6%   
3       Barky Bites      Dog    Protein - 22%, Fat - 16%, Fiber - 4%   
0      Meow Delight      Cat    Protein - 30%, Fat - 12%, Fiber - 4%   
2   Purrrfect Blend      Cat    Protein - 28%, Fat - 14%, Fiber - 5%   
9   Whisker Lickers      Cat  Protein - 25%, Fat - 10%, Fiber - 3.5%   
6       Hoppity Hay   Rabbit    Protein - 14%, Fat - 2%, Fiber - 28%   
4    Feathery Feast     Bird     Protein - 18%, Fat - 6%, Fiber - 8%   
5      Fishy Frenzy     Fish     Protein - 40%, Fat - 8%, Fiber - 2%   

  Expiry Date  Price Quantity in Stock  \
8  2023-11-28  14.99         100 units   
1  2023-12-10  12.49          75 units   
7  2023-10-31  11.99          60 units   
3  2023-11-25  10.99          90 units   
0  2023-11-15   9.99         100 units   
2  2023-10-30   8.99         120 units   
9  2023-12-12   7.99         110 units   
6  2023-12-15   7.49          80 units   
4  2023-12-05   6.99         150 units   
5  2023-11-20   4.99         200 units   

Thought: I now know the three most expensive products.
Final Answer: The three most expensive products are Puppy Palate, Woofy's Choice, and Purrfectly Plump.

> Finished chain.

This is console output, which nobody will actually see; it’s what’s going on inside the agent’s head. It inspects the table, filters it, and so on, and then prepares an answer for you: “The three most expensive products are…” followed by the list.

There was one issue with this, which we had discussed, and since we knew what the problem was, we thought: how about we just tell the LLM to do it in a different way? So we asked it to form its queries differently, and it actually worked.

--- a/prompt.py
+++ b/prompt.py
You are working with a pandas dataframe in Python. The name of the dataframe is `df`.
+ Try to form single command if the question is in multiple parts.
+ If you are asked to get three results from sorted dataframes, you need to create a chained 
+ command that does sorting and head() in a single python command.
You should use the tools below to answer the question posed of you:
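The single chained command the patched prompt nudges the agent toward looks like this. The frame below uses a few made-up rows mirroring the sample dataset, with prices as plain numbers:

```python
import pandas as pd

# Toy rows mirroring the pet-food dataset above.
df = pd.DataFrame([
    {"Name": "Meow Delight", "Price": 9.99},
    {"Name": "Woofy's Choice", "Price": 12.49},
    {"Name": "Puppy Palate", "Price": 14.99},
    {"Name": "Fishy Frenzy", "Price": 4.99},
])

# Sorting and head() chained into a single command, so the agent gets
# the top three in one python_repl_ast step instead of several.
top3 = df.sort_values(by="Price", ascending=False).head(3)
```

Without the chaining hint, the agent tended to sort in one step and pick rows in another, which is where its mistakes crept in.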

And now, this is the working version. Before the prompt change, the answers were full of mistakes. The question was: “What are the three products that are going to expire soonest, and what quantity is left in stock? When are they expiring? Plot it on an appropriate chart.” This time it answered everything perfectly: every number is correct, and it responds in human language.

'The three products that are going to expire soonest are Purrrfect Blend, Purrfectly Plump, and Meow Delight. They are expiring on October 30, October 31, and November 15, respectively. The quantity in stock for each product is 120 units, 60 units, and 100 units, respectively. The data can be plotted on a bar chart.'

It lists the three products that will expire soonest: Purrrfect Blend, Purrfectly Plump, and Meow Delight, and it also mentions their expiry dates and the quantities left in stock. It even showed the result in a chart, though the chart was a bit strange around the quantities. But it’s okay.

But if you ask simpler questions like “Draw a pie chart,” it does it very accurately. Here, I asked it to group the products by category and then draw a pie chart, and it grouped all the products into Cat, Dog, and Bird and created the pie chart.

Yeah, and some actions take a lot of tries.

"Draw a bar graph for Cat products ordered by protein content, labeled by the product name"

But yes, this is a fairly complex task: the dataset has a field containing the nutritional facts as a single string, and I asked it something about that, like “order by protein content.”

> Entering new AgentExecutor chain...
Thought: I need to get the data for the cat products and their protein content
Action: python_repl_ast
Action Input: df[df['Pet Type'] == 'Cat'][['Name', 'Nutritional Facts']]
Observation:                Name                       Nutritional Facts
0      Meow Delight    Protein - 30%, Fat - 12%, Fiber - 4%
2   Purrrfect Blend    Protein - 28%, Fat - 14%, Fiber - 5%
7  Purrfectly Plump    Protein - 32%, Fat - 16%, Fiber - 6%
9   Whisker Lickers  Protein - 25%, Fat - 10%, Fiber - 3.5%


Thought: I need to extract the protein content from the Nutritional Facts column
Action: python_repl_ast
Action Input: df[df['Pet Type'] == 'Cat']['Nutritional Facts'].str.extract('Protein - (\d+\.?\d*)')
Observation:     0
0  30
2  28
7  32
9  25


Thought: I need to create a bar graph with the extracted data
Action: python_repl_ast
Action Input: df[df['Pet Type'] == 'Cat'][['Name', 'Nutritional Facts']].str.extract('Protein - (\d+\.?\d*)').plot.bar(x='Name', y='Nutritional Facts')
Observation: AttributeError("'DataFrame' object has no attribute 'str'")


Thought: I need to convert the extracted data to a dataframe
Action: python_repl_ast
Action Input: df_cat = pd.DataFrame(df[df['Pet Type'] == 'Cat']['Nutritional Facts'].str.extract('Protein - (\d+\.?\d*)'), columns=['Protein'])
Observation: NameError("name 'pd' is not defined")


Thought: I need to import pandas
Action: python_repl_ast
Action Input: import pandas as pd
Observation: 


Thought: I need to convert the extracted data to a dataframe
Action: python_repl_ast
Action Input: df_cat = pd.DataFrame(df[df['Pet Type'] == 'Cat']['Nutritional Facts'].str.extract('Protein - (\d+\.?\d*)'), columns=['Protein'])
Observation: 


Thought: I need to merge the extracted data with the original dataframe
Action: python_repl_ast
Action Input: df_cat = df_cat.merge(df[df['Pet Type'] == 'Cat'][['Name']], left_index=True, right_index=True)
Observation: 

Thought: I now have the data I need to create the bar graph
Action: python_repl_ast
Action Input: df_cat.plot.bar(x='Name', y='Protein')
Observation: Axes(0.125,0.11;0.775x0.77)

Thought: I now know the final answer
Final Answer: Use the python_repl_ast tool to execute the command `df_cat.plot.bar(x='Name', y='Protein')` to draw a bar graph for Cat products sorted by their protein content, labeled by the product name.

> Finished chain.

That’s enough AI stuff…

© 2024 Anil Maharjan