Retrieval-Augmented Generation With Claude Sonnet 3.5 and Pgvector
The AI chatbot race, initiated by OpenAI with the release of ChatGPT, now includes competitors such as Gemini, Cohere, and many others. OpenAI’s release of GPT-4o amazed users with its advanced visual processing, auditory recognition, and conversational abilities. While OpenAI, led by Sam Altman, appears to lead the AI race, Anthropic is providing tough competition. A prominent AI research company dedicated to developing safe and ethical AI systems, Anthropic is making waves in the field with its Claude family of large language models (LLMs).
The new Claude launch is a notable upgrade from its predecessor. Anthropic claims it can surpass OpenAI's GPT-4o model on key benchmarks like GPQA (Graduate-Level Google-Proof Q&A), multilingual math (MGSM), and more.
In this article, we’ll discuss Claude’s new top-tier model, its strengths and usefulness, and compare it to other models in the Claude family. We will also use Sonnet 3.5 and pgvector to build a retrieval-augmented generation (RAG) application.
What Is RAG?
RAG (retrieval-augmented generation) is a natural language processing (NLP) technique that combines generative large language models (LLMs) with traditional information retrieval systems (databases). RAG systems can process and consolidate knowledge to create context-aware answers, explanations, and instructions in human-like language.
Everything About Claude Sonnet 3.5
Claude Sonnet 3.5 is a new model from Anthropic, a U.S.-based company. On various evaluations, it outperforms competitor models and Claude 3 Opus while matching the speed and cost of Claude 3 Sonnet.
This new version is available for free on Claude.ai and the Claude iOS app, with Claude Pro and Team plan subscribers receiving higher rate limits. It can also be accessed through the Anthropic API, which is the focus of this article, as well as through Amazon Bedrock and Google Cloud’s Vertex AI. (If you want to build RAG applications using Amazon Bedrock, here's a beginner-friendly guide.)
Metrics | Comments |
Speed | Claude 3.5 Sonnet is twice as fast as its predecessor, Claude 3 Opus. Figure 1 shows its elevated speed. |
Cost | The model costs one-fifth as much as Opus: $3 per million input tokens and $15 per million output tokens, versus $15 and $75 for Opus. |
Performance | Claude 3.5 Sonnet excels in performance across multiple evaluations (for more, see Figure 2), including graduate-level reasoning (GPQA), undergraduate-level knowledge (MMLU), and coding proficiency (HumanEval). It surpasses competitors like OpenAI's GPT-4o and Google's Gemini 1.5 Pro. |
User Reviews | Many developers reported 3-4x productivity increases with Claude 3.5 Sonnet for coding tasks. Users praised its natural, pleasant interaction style, finding it more intuitive and less frustrating than other AI assistants. Early adopters described it as "absolutely phenomenal" and superior to GPT-4 for automation and troubleshooting. |
Claude Sonnet 3.5 vs. Anthropic family of AI models
The Claude family contains multiple models, including Claude Sonnet 3.5, and each model offers a different balance of performance and cost. Below is a side-by-side comparison of the new Sonnet model and the other members of the family.
If you’re unfamiliar with any of the terms in the table below, here is a description of each feature:
- Input context window: the number of tokens the input context window supports
- Maximum output tokens: the number of tokens the model can generate in a single request
- Input token pricing: the cost of input data provided to the model
- Output token pricing: the cost of output tokens generated by the model
- MMLU: evaluates LLM knowledge acquisition in zero-shot and few-shot settings
- MMMU: a wide-ranging multi-discipline and multimodal benchmark
Features | Claude Sonnet 3.5 | Sonnet | Opus | Haiku |
Input Context Window | 200K | 200K | 200K | 200K |
Maximum Output Tokens | 4096 | 4096 | 4096 | 4096 |
Input Token Pricing | $3 per million tokens | $3 per million tokens | $15 per million tokens | $0.25 per million tokens |
Output Token Pricing | $15 per million tokens | $15 per million tokens | $75 per million tokens | $1.25 per million tokens |
MMLU Benchmark | 90.4 | 81.5 | 88.2 | 76.7 |
MMMU Benchmark | 68.3 | 53.1 | 59.4 | 50.2 |
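To make the pricing rows concrete, here is a small sketch that estimates the cost of a single request from its token counts. The prices are copied from the table above; the dictionary keys are labels chosen for this example (not API model IDs), and the token counts are hypothetical.

```python
# Prices in USD per million tokens, copied from the comparison table above.
# The keys are illustrative labels, not official API model identifiers.
PRICES = {
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-opus": {"input": 15.00, "output": 75.00},
    "claude-3-haiku": {"input": 0.25, "output": 1.25},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # A hypothetical RAG request: 2,000 prompt tokens and 500 generated tokens.
    for name in PRICES:
        print(f"{name}: ${estimate_cost(name, 2_000, 500):.4f}")
```

Because RAG prompts carry retrieved context, input tokens usually dominate the request size, so the input price is the number to watch when sizing the context you retrieve.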
RAG Implementation With Sonnet 3.5 and Pgvector
Now that you know more about Claude Sonnet 3.5, let’s use it along with pgvector to implement a retrieval-augmented generation engine.
Before we do that, however, we’ll look at the architecture and necessary concepts, starting with the schematic diagram.
The diagram above represents a RAG pipeline with the following steps:
- Documents: the process begins with collecting documents that need to be indexed.
- Data indexing: these documents are indexed and stored in a vector database, in this case PostgreSQL with pgvector.
- Query: a user query is input into the system, which retrieves relevant information from the vector database.
- Vector database: this database stores the documents' embeddings using pgvector. When a query is made, it retrieves the top results based on relevance.
- LLM (Claude Sonnet 3.5): the retrieved results and the original query are fed into a language model for further processing and understanding.
- Results: the final output is generated, giving the user a response incorporating the retrieved data and the language model's processing.
This pipeline combines document retrieval with language understanding to generate relevant responses.
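The data flow described in these steps can be sketched in a few lines before any real components are wired in. This is only an illustration: the keyword matcher stands in for pgvector, the echo function stands in for Claude, and every name here (answer_with_rag, keyword_retrieve, echo_generate, CORPUS) is made up for the example.

```python
import re
from typing import Callable, List


def answer_with_rag(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 2,
) -> str:
    """Retrieve context for the query, then ask the generator to answer with it."""
    documents = retrieve(query, top_k)       # query -> vector database
    context = "\n".join(documents)           # top results become the context
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)                  # the LLM produces the final result


# Stand-ins for illustration only.
CORPUS = [
    "pgvector adds vector similarity search to PostgreSQL.",
    "Claude is a family of large language models from Anthropic.",
    "Carbon monoxide is a colorless, odorless gas.",
]


def tokens(text: str) -> set:
    """Lowercase the text and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_retrieve(query: str, top_k: int) -> List[str]:
    """Rank documents by how many words they share with the query."""
    words = tokens(query)
    ranked = sorted(CORPUS, key=lambda doc: -len(words & tokens(doc)))
    return ranked[:top_k]


def echo_generate(prompt: str) -> str:
    """Return the prompt instead of calling a model."""
    return f"[model would answer using]\n{prompt}"


if __name__ == "__main__":
    print(answer_with_rag("How does pgvector search PostgreSQL?",
                          keyword_retrieve, echo_generate, top_k=1))
```

Passing the retriever and generator as functions keeps the flow testable; the rest of the article replaces them with a pgvector query and a Claude API call.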
Setup and Imports
To begin, we will install and import the libraries needed for RAG.
%pip install psycopg2 pgvector anthropic sentence-transformers opendatasets pandas datasets pillow matplotlib
We then need to import them into our environment.
import anthropic
import psycopg2
import itertools
import opendatasets as od
from PIL import Image
import os
import random
import shutil
import base64
import httpx
import matplotlib.pyplot as plt
To set it up, each API call needs a valid API key. The SDKs can retrieve this key from the ANTHROPIC_API_KEY environment variable, or you can provide it when initializing the Anthropic client. On Windows, we can simply add an environment variable like this:
setx ANTHROPIC_API_KEY "your-api-key-here"
After this, we can set our client.
client = anthropic.Anthropic()
Let's ensure it’s working by sending a test query to the Anthropic API and printing the response:
message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1000,
    temperature=0,
    system="Answer the questions provided",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is Pgvector?"
                }
            ]
        }
    ]
)
print(message.content)
The code sends a message request to the model (claude-3-5-sonnet-20240620) with specific settings:
- It asks the model to respond to the user query, "What is Pgvector?", using a text content block. The model also supports multimodal input, so image content can be sent alongside text.
- The max_tokens=1000 parameter caps the length of the response in tokens (units of text that are typically shorter than a word).
- temperature=0 makes the response as consistent as possible across runs by minimizing sampling randomness.
- The system parameter sets the instruction that guides how the model answers.
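Printing message.content shows a list of content blocks rather than plain text. A small helper like the one below, a sketch with a hypothetical name (response_text), pulls out just the text. The stand-in object only mimics the shape of the SDK response so the helper can be tried without an API key; its field values are made up.

```python
from types import SimpleNamespace


def response_text(message) -> str:
    """Join the text of every text block in a Messages API response."""
    return "".join(block.text for block in message.content if block.type == "text")


# A stand-in shaped like the SDK response (illustrative values only).
fake_message = SimpleNamespace(
    content=[SimpleNamespace(type="text", text="pgvector is a PostgreSQL extension.")],
    usage=SimpleNamespace(input_tokens=18, output_tokens=9),
)

print(response_text(fake_message))
# Token usage is reported on the response and is what billing is based on.
print(fake_message.usage.input_tokens, fake_message.usage.output_tokens)
```

With a real response, response_text(message) would take the place of message.content in the print call.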
Next, we will set up the embedding model.
Setting Up the Embedding Model
To convert the text chunks to embeddings, we will use Sentence Transformers (SBERT). SBERT is a Python module for text and image embeddings, ideal for applications like semantic search and similarity scoring. It offers over 5,000 pre-trained models on Hugging Face🤗 and supports easy training or fine-tuning for custom use cases.
Let's load the model paraphrase-MiniLM-L6-v2 and generate embeddings on an example sentence. Here’s the code:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
# Sentences we want to encode. Example:
sentence = ['This framework generates embeddings for each input sentence']
# Sentences are encoded by calling model.encode()
embedding = embedding_model.encode(sentence)
Connecting to PostgreSQL Using Timescale Cloud
A key component for a RAG application is a vector database, which is essential for querying indexed documents and retrieving relevant context for searches. In this tutorial, we'll use PostgreSQL and pgvector as our vector database. The pgvector extension turns PostgreSQL into a fully featured vector database, with support for similarity search, as well as sparse embeddings for keyword searches.
1. To get started, sign up, create a new database, and follow the provided instructions. For detailed guidance, refer to the Timescale guide.
2. After signing up, connect to the Timescale database using the service URI from the dashboard, which looks like this:
postgres://tsdbadmin:<password>@<host>.tsdb.cloud.timescale.com:<port>/tsdb?sslmode=require
3. Create a password in the project settings by clicking on Create credentials.
4. Connect to the database as shown below. The snippet then runs a basic query to confirm database access, ensuring it’s ready for use.
CONNECTION = "<Your Connection String>"
conn = psycopg2.connect(CONNECTION)
cursor = conn.cursor()
# use the cursor to interact with your database
cursor.execute("SELECT 'hello world'")
print(cursor.fetchone())
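Since the connection string contains a password, one option is to read it from an environment variable, as with the Anthropic key, and mask it before printing. This is a sketch under assumptions: TIMESCALE_SERVICE_URL is a variable name invented for this example, and the fallback URI uses a made-up host, port, and password.

```python
import os
from urllib.parse import urlsplit


def masked_uri(uri: str) -> str:
    """Return the URI with the password replaced, so it is safe to print."""
    parts = urlsplit(uri)
    if parts.password is None:
        return uri
    return uri.replace(f":{parts.password}@", ":****@", 1)


# TIMESCALE_SERVICE_URL is a hypothetical variable name; the fallback value
# is a placeholder, not a real service.
CONNECTION = os.environ.get(
    "TIMESCALE_SERVICE_URL",
    "postgres://tsdbadmin:secret@example.tsdb.cloud.timescale.com:5432/tsdb?sslmode=require",
)
print(masked_uri(CONNECTION))
```

The resulting CONNECTION value can be passed to psycopg2.connect exactly as in the snippet above.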
Basic RAG
In this implementation, we'll construct a RAG pipeline using relevant news articles to enhance the accuracy of the Claude model. This example serves as a foundational demonstration for our advanced RAG implementation.
Dataset overview
The CNN/DailyMail Dataset comprises over 300,000 English-language news articles from CNN and the Daily Mail. It supports both extractive and abstractive summarization and was originally designed for machine reading comprehension and abstractive question answering.
Data fields
- id: SHA1 hash of the URL where the story was retrieved.
- article: Body of the news article.
- highlights: Highlights of the article, authored by the article's writer.
For our search engine, the code below uses the highlights field, a short summary of each article, which keeps the embedding step fast. Given the dataset's size, we shuffle the training split, select a subset of 1,000 articles, and embed the highlights of the first ten for this demonstration.
from datasets import load_dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")
content = dataset["train"]
content = content.shuffle(seed=42).select(range(0,1000))
sample_dataset = content["highlights"][:10]
Let's generate embeddings for the sample_dataset. The size of the embedding array will be required later when creating the table.
embeddings = embedding_model.encode(sample_dataset)
Convert the numpy.ndarray to a Python list, a format that can be passed to pgvector through psycopg2.
embeddings = embeddings.tolist()
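Two small checks can help at this point. The sketch below (helper names are hypothetical) verifies that every embedding has the dimension the table will expect and shows pgvector's text form for a vector, '[x1,x2,...]', which can be handy when casting a string with ::vector or debugging in psql.

```python
from typing import Sequence


def check_dimensions(embeddings: Sequence[Sequence[float]], expected: int) -> None:
    """Raise if any embedding does not have the expected number of values."""
    for i, emb in enumerate(embeddings):
        if len(emb) != expected:
            raise ValueError(f"embedding {i} has {len(emb)} values, expected {expected}")


def to_vector_literal(values: Sequence[float]) -> str:
    """Format numbers in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


# Tiny illustrative vectors; the real ones in this article have 384 values.
sample = [[0.5, -0.25, 1.0], [0.0, 0.125, 2.0]]
check_dimensions(sample, 3)
print(to_vector_literal(sample[0]))
```

As an alternative to plain lists, the pgvector Python package installed earlier provides register_vector in pgvector.psycopg2, which lets psycopg2 send numpy arrays directly.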
Table creation and ingesting documents
We'll create a documents table to store our documents and their embeddings. This table will have the following columns:
- Id: Serves as the primary key to identify each row uniquely.
- Contents: Stores the content of the articles as TEXT.
- Embedding: Stores the embeddings of the articles as VECTOR. The VECTOR size is set to 384, matching the embedding dimension of the paraphrase-MiniLM-L6-v2 model.
1. To enable operations on embeddings, we will install the vector extension in PostgreSQL.
extension = """CREATE EXTENSION IF NOT EXISTS vector"""
cursor.execute(extension)
conn.commit()
2. Now, it’s time to create the table given the details above:
document_table = """CREATE TABLE IF NOT EXISTS documents (
id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
contents TEXT,
embedding VECTOR(384)
)"""
cursor.execute(document_table)
conn.commit()
3. Next, we need to insert the rows into the table. The code below inserts the sample texts and their corresponding embeddings into the documents table in PostgreSQL. It constructs a single SQL INSERT statement, pairing each of the ten sample texts with its embedding, and executes the insertion using the database cursor. Finally, it commits the transaction to save the changes.
sql = 'INSERT INTO documents (contents, embedding) VALUES ' + ', '.join(['(%s, %s)' for _ in embeddings])
params = list(itertools.chain(*zip(sample_dataset, embeddings)))
cursor.execute(sql, params)
conn.commit()
4. Let's check the documents inserted into the table in the DB.
cursor.execute("SELECT * From documents")
cursor.fetchone()
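The INSERT in step 3 relies on interleaving texts and embeddings so they line up with the (%s, %s) placeholders. The sketch below isolates that logic in a hypothetical helper (build_insert) and runs it on two toy rows, so the generated SQL and parameter order can be inspected without a database.

```python
import itertools
from typing import List, Sequence, Tuple


def build_insert(
    contents: Sequence[str], embeddings: Sequence[List[float]]
) -> Tuple[str, list]:
    """Build one multi-row INSERT and its flat parameter list."""
    if len(contents) != len(embeddings):
        raise ValueError("contents and embeddings must have the same length")
    placeholders = ", ".join(["(%s, %s)"] * len(contents))
    sql = f"INSERT INTO documents (contents, embedding) VALUES {placeholders}"
    # zip pairs each text with its vector; chain flattens the pairs into
    # [text1, vec1, text2, vec2, ...] to match the placeholder order.
    params = list(itertools.chain.from_iterable(zip(contents, embeddings)))
    return sql, params


sql, params = build_insert(["first", "second"], [[0.1, 0.2], [0.3, 0.4]])
print(sql)
print(params)
```

Keeping the values in params, rather than formatting them into the SQL string, lets the driver handle quoting and avoids SQL injection.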
Relevant search
Now we need to create a function to retrieve relevant documents based on vector similarity. This function will leverage the index to efficiently search for the K nearest vectors, significantly reducing computation time.
Indexing
pgvector provides two indexing algorithms, IVFFlat and HNSW, to keep searches fast.
- IVFFlat: This algorithm divides vectors into clusters, creating lists for each centroid. Only a subset of lists (those whose centroids are closest to the search vector) are examined during a search. This reduces the number of distance calculations.
- HNSW (hierarchical navigable small world): This algorithm builds a graph where nodes represent vectors and edges connect nearby vectors. It uses hierarchical layers to navigate the graph, efficiently finding the nearest neighbors. HNSW is known for its high recall and speed, making it suitable for large-scale searches.
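To illustrate the IVFFlat idea described above, here is a toy version in plain Python. It is a teaching sketch, not pgvector's implementation: the centroids are fixed by hand (an index would normally learn them by clustering the data), distances are Euclidean, and all names are invented for the example.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def l2(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_lists(vectors: List[Vector], centroids: List[Vector]) -> Dict[int, List[int]]:
    """Assign each vector index to the list of its nearest centroid."""
    lists: Dict[int, List[int]] = {i: [] for i in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(idx)
    return lists


def ivf_search(query: Vector, vectors: List[Vector], centroids: List[Vector],
               lists: Dict[int, List[int]], k: int = 1, probes: int = 1) -> List[int]:
    """Scan only the `probes` lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [idx for c in order[:probes] for idx in lists[c]]
    candidates.sort(key=lambda idx: l2(query, vectors[idx]))
    return candidates[:k]


vectors = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (2.4, 2.4)]
centroids = [(0.0, 0.0), (5.0, 5.0)]  # fixed by hand for the example
lists = build_lists(vectors, centroids)
print(lists)
print(ivf_search((2.6, 2.6), vectors, centroids, lists, k=1, probes=1))
print(ivf_search((2.6, 2.6), vectors, centroids, lists, k=1, probes=2))
```

With one probe the search misses the true nearest neighbor (index 4), because that vector sits in the other list; probing both lists finds it. This is the speed-versus-recall trade-off that makes IVFFlat an approximate index.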
We will create an index on the embedding column using IVFFlat to optimize search performance. Timescale has introduced another indexing technique via the open-source PostgreSQL extension pgvectorscale (GitHub stars welcome!), readily available on Timescale Cloud. This new index, StreamingDiskANN, along with Statistical Binary Quantization (also a Timescale innovation), unlocks high-performance AI use cases, which Timescale reports makes PostgreSQL as fast as Pinecone.
ivfflat = """CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)"""
cursor.execute(ivfflat)
conn.commit()
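The index above uses pgvector's default settings. For larger tables, IVFFlat has two tuning knobs: lists (set when the index is built) and ivfflat.probes (set at query time). The sketch below generates starting values using the rule of thumb from the pgvector documentation (rows / 1000 lists up to one million rows, the square root of the row count beyond that, and probes near the square root of lists). The helper names are hypothetical, and the values are starting points to tune against your own recall measurements.

```python
import math
from typing import List, Tuple


def suggest_ivfflat_params(row_count: int) -> Tuple[int, int]:
    """Return (lists, probes) starting values for a table of row_count rows."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = math.isqrt(row_count)
    probes = max(1, math.isqrt(lists))
    return lists, probes


def ivfflat_statements(row_count: int) -> List[str]:
    """Build the CREATE INDEX and SET statements for the documents table."""
    lists, probes = suggest_ivfflat_params(row_count)
    return [
        "CREATE INDEX ON documents USING ivfflat "
        f"(embedding vector_cosine_ops) WITH (lists = {lists})",
        f"SET ivfflat.probes = {probes}",
    ]


for statement in ivfflat_statements(300_000):
    print(statement)
```

For a table as small as the ten-row sample in this tutorial, an exact scan without any index is already fast, so the index mainly matters once the full dataset is loaded.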
The function below retrieves the contents of the two documents closest to the query embedding from the documents table in PostgreSQL:
def relevant_search(conn, query):
    # query is a one-element list, so encode() returns one row per string; take row 0
    query_embedding = embedding_model.encode(query)[0].tolist()
    with conn.cursor() as cur:
        cur.execute(
            'SELECT contents FROM documents ORDER BY embedding <=> %s::vector LIMIT 2',
            (query_embedding,),
        )
        return cur.fetchall()

query = ["News related to people who died due to Carbon Monoxide"]
relevant_search(conn, query)
Here’s the breakdown of the code:
- embedding_model.encode(query)[0].tolist(): Encodes the query and converts its embedding to a Python list.
- with conn.cursor() as cur: Opens a cursor to interact with the PostgreSQL database.
- cur.execute(...): Executes an SQL query.
- SELECT contents FROM documents ORDER BY embedding <=> %s::vector LIMIT 2: Selects the contents column from the documents table, ordering the results by the distance between the embedding column and the provided query embedding. The <=> operator computes cosine distance, matching the vector_cosine_ops index.
- (query_embedding,): Supplies the query embedding as a one-element tuple of parameters to the SQL query.
- return cur.fetchall(): Fetches and returns the query results (the contents of the two closest documents).
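To see what the ORDER BY ... <=> ... LIMIT 2 clause computes, here is a brute-force equivalent in plain Python. Cosine distance is one minus cosine similarity, so smaller values mean closer vectors. The rows and two-dimensional vectors are toy data invented for the example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity: 0 for same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          rows: List[Tuple[str, Sequence[float]]], k: int = 2) -> List[str]:
    """Return the contents of the k rows whose embeddings are closest to query."""
    ranked = sorted(rows, key=lambda row: cosine_distance(row[1], query))
    return [contents for contents, _ in ranked[:k]]


rows = [
    ("about cats", [1.0, 0.0]),
    ("about dogs", [0.8, 0.6]),
    ("about cars", [0.0, 1.0]),
]
print(top_k([1.0, 0.1], rows, k=2))
```

The database does the same ranking, but an index lets it avoid comparing the query against every row.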
Combining LLM reasoning with semantic search
This final step of the RAG application integrates language model generation with semantic search. Once the relevant documents are retrieved, they are passed alongside the query to the Claude Sonnet 3.5 model for response generation.
def rag_function(conn, client, model_name, query):
    # Step 1: Retrieve relevant documents
    relevant_docs = relevant_search(conn, query)
    # Combine the contents of the retrieved documents
    relevant_text = " ".join([doc[0] for doc in relevant_docs])

    # Step 2: Use the retrieved documents to augment the query for the Claude model
    full_query = (f"Context: The following are relevant news articles related to the query.\n"
                  f"{relevant_text}\n\n"
                  f"Based on the above context, please answer the following question:\n"
                  f"Question: {query[0]}")

    # Step 3: Query the Claude model with the augmented query
    message = client.messages.create(
        model=model_name,
        max_tokens=1000,
        temperature=0,
        system="Given a query and context, provide accurate information. If the context does not contain relevant information, say so instead of guessing.",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": full_query
                    }
                ]
            }
        ]
    )
    return message.content

# Example usage:
query = ["News related to people who died due to Carbon Monoxide"]
response = rag_function(conn, client, "claude-3-5-sonnet-20240620", query)
print(response)
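One possible refinement of the prompt-building step is sketched below: it numbers each retrieved document, caps its length, and states explicitly when nothing was retrieved. The helper name (build_rag_prompt), the max_chars cap, and the fallback note are choices made for this example, not part of the pipeline above.

```python
from typing import Sequence, Tuple

NO_CONTEXT_NOTE = "No relevant articles were retrieved."


def build_rag_prompt(question: str, rows: Sequence[Tuple[str]],
                     max_chars: int = 2000) -> str:
    """Number each retrieved document and cap its length before building the prompt.

    rows has the shape returned by cur.fetchall(): a list of one-element tuples.
    """
    if rows:
        context = "\n".join(
            f"[{i}] {row[0][:max_chars]}" for i, row in enumerate(rows, start=1)
        )
    else:
        context = NO_CONTEXT_NOTE
    return (
        "Context: The following are relevant news articles related to the query.\n"
        f"{context}\n\n"
        "Based on the above context, please answer the following question:\n"
        f"Question: {question}"
    )


print(build_rag_prompt("Who was affected?", [("doc a",), ("doc b",)]))
```

Numbered sources make it easier to ask the model to cite which document supports each statement, and the explicit fallback gives the system prompt something concrete to react to when retrieval returns nothing.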
Alright! We have completed a basic RAG example using Claude Sonnet 3.5.
Conclusion
This article covered everything you need to know about Claude Sonnet 3.5. We discussed its superior intelligence, multimodality, and reduced cost compared to other family members. These superpowers were later used to implement RAG using pgvector on Timescale Cloud.
We started with a basic RAG example to retrieve relevant news content. Stay tuned for the advanced version involving an AI image gallery! You, too, can start using Timescale to build your AI application with Claude and pgvector. Timescale Cloud has a complete open-source stack for your AI applications, with pgvector, pgai, and pgvectorscale. If you’re implementing RAG, pgai makes it easier to build search and RAG applications by bringing more AI workflows into PostgreSQL.
To make your AI application more scalable and performant, try pgvectorscale. This extension adds a third approximate nearest-neighbor (ANN) search algorithm to pgvector (StreamingDiskANN) and utilizes a streaming model that allows the index to continuously retrieve the “next closest” item for a given query, revving your application’s performance.
Both pgai and pgvectorscale are open source under the PostgreSQL license. To install them, check out the pgai and pgvectorscale GitHub repos (stars are always welcome!). To get started more quickly, sign up for Timescale Cloud and create a free cloud PostgreSQL database for your RAG application.