A brief analysis on RAG with Pinecone Serverless and Unstructured.io
Athos Georgiou
Welcome to the latest installment of the series on building an AI chat assistant from scratch. This time around, however, I'd like to change the format a bit. Instead of guiding you through the process and showing snippets of code, I'll provide a high-level overview of the process and the tools used, followed by a brief analysis of the RAG model using Pinecone Serverless and Unstructured.io.
For reference, here are the previous articles in the series:
- Part 1 - Integrating Markdown in Streaming Chat for AI Assistants
- Part 2 - Creating a Customized Input Component in a Streaming AI Chat Assistant Using Material-UI
- Part 3 - Integrating Next-Auth in a Streaming AI Chat Assistant Using Material-UI
- Part 5 - Integrating the OpenAI Assistants API in a Streaming AI Chat Assistant Using Material-UI
- Part 6 - Integrating Vision using the latest OpenAI API
The model is pretty much a prototype, and it may contain bugs or issues, so please let me know if you find any ground-breaking issues. And as always, if you'd prefer to skip the article and get the code yourself, you can find it on GitHub.
Overview
RAG (Retrieval-Augmented Generation) is a technique commonly used in Generative AI to enhance the quality of an AI response by providing context to an LLM (Large Language Model) such as GPT-4. The context is retrieved from a Vector Database using semantic search and supplements the user's message before it is sent. As a result, the LLM can generate responses that are not only more accurate and relevant, but also mitigate the infamous "hallucination" phenomenon that is so common in LLMs these days.
Although I've been building RAG-based open source applications for a while now, I've mostly stuck to basic techniques and tools that provided average results and performance. However, I recently came across two new tools that have completely changed the game for me: Pinecone Serverless and Unstructured.io.
Motivations
Pinecone Serverless is a newly released Vector Database offering, used in Generative AI applications that leverage concepts such as RAG, semantic search, and classification. I've used Pinecone before with decent results, but with the recent release of a serverless option, I was intrigued to try it again.
In past efforts, the primary challenge in utilizing a Vector Database wasn't the database itself, the embeddings, or even the earlier versions of GPT-4 models, which were often reluctant to incorporate provided context. The real difficulty lay in parsing unstructured data effectively and devising a solid chunking strategy. This was especially problematic for complex, unstructured documents containing images and tables, leading to decreased precision during the retrieval phase. While this might have been acceptable for small-scale applications or personal projects, it certainly would not suffice for a production-grade application.
This is when I started looking for document parsers and came across Unstructured.io, an application for extracting and analyzing information from unstructured data, including tables and images. I was eager to see how it could be integrated into a RAG-based model and what kind of precision and performance it could provide.
These two tools also came with a free tier, or free credits, which was a huge plus for me. So I thought, why not build a new RAG model using these two tools and see how it performs?
The Process
I implemented the project using my existing Next.js-based template named Titanium, which already incorporates several advanced Generative AI features.
The RAG process follows these steps:
- A document uploaded by the user is parsed using `Unstructured.io`.
- The parsed chunks of the document are embedded with OpenAI's `ada-003` model.
- `Pinecone Serverless` indexes the embedded data within a user-specific namespace, including any additional metadata.
- When the user sends a message to the AI Assistant, the retrieved context augments the message, which is then processed by the `gpt-4-0125-preview` model.
- GPT-4 generates a response using the enriched message, ensuring relevance and accuracy.
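To make these steps concrete, here's a minimal TypeScript sketch of the retrieval-augmentation path, assuming the official `openai` and `@pinecone-database/pinecone` clients. The index name, namespace scheme, metadata shape, and the `text-embedding-3-small` model name are assumptions for illustration, not the exact code used in Titanium.

```ts
// rag.ts — a minimal sketch of the retrieval-augmentation path, not Titanium's
// actual code. Index name, namespace scheme, metadata shape, and the embedding
// model name are assumptions for illustration.
import OpenAI from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const pinecone = new Pinecone(); // reads PINECONE_API_KEY from the environment

export async function answerWithContext(userId: string, question: string) {
  // Embed the user's question with the same model used for the documents.
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-3-small', // assumption; the article refers to "ada-003"
    input: question,
  });

  // Retrieve the Top K most similar chunks from the user's namespace.
  const results = await pinecone
    .index('titanium-rag') // hypothetical index name
    .namespace(userId)
    .query({ vector: data[0].embedding, topK: 10, includeMetadata: true });

  // Augment the user's message with the retrieved context.
  const context = results.matches
    .map((m) => m.metadata?.text ?? '')
    .join('\n---\n');

  // Stream the response from GPT-4 using the enriched message.
  return openai.chat.completions.create({
    model: 'gpt-4-0125-preview',
    stream: true,
    messages: [
      { role: 'system', content: 'Use the provided context when relevant.' },
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });
}
```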
Analysis
I tested this RAG model using two types of documents in PDF format: a 15-page sample offer letter and an 88-page book.
Some of the parameters that I adjusted and tested included:
- `Top K` elements to retrieve from Pinecone
- `Batch Size` to prevent swamping the Pinecone API with requests
- `Parsing strategy` for Unstructured.io, which included several options such as `auto`, `fast`, and `hi_res`
Metrics
- The response timing for each process.
- The precision of the responses in the streaming chat experience.
- The costs incurred for each process.
Parsing
Document parsing performance varied with the document's size and the chosen parsing strategy. The `fast` strategy was the quickest but least accurate, while the `hi_res` strategy was the most accurate but slowest. The `auto` strategy offered a balance between speed and accuracy.
For a `15-page` text document, the `hi_res` strategy took an average of `45-50 seconds` to parse, whereas the `fast` strategy took about `3-4 seconds`. The `auto` strategy consistently took around `3 seconds`.
Parsing an `88-page` document containing both text and images with the `hi_res` strategy took between `5.3 to 5.5 minutes`, whereas the `fast` and `auto` strategies took between `25-30 seconds`, indicating a roughly `10x` increase in time for the larger document.
Adjusting the `combine_under_n_chars` parameter to `1500` did not impact parsing times. This is currently not configurable in the app, but I'll be adding it soon.
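For reference, here's a rough sketch of what a parsing call with a configurable strategy could look like against the hosted Unstructured API. The endpoint, header, and form-field names reflect my reading of their partition API and should be verified against the current docs; the `combine_under_n_chars` value mirrors the `1500` used in these tests.

```ts
// parse.ts — hypothetical sketch of calling the hosted Unstructured API.
// Endpoint, header, and form-field names follow my understanding of the
// partition API; treat them as assumptions and check the current docs.
import { readFile } from 'node:fs/promises';

export async function parseDocument(path: string, strategy: 'auto' | 'fast' | 'hi_res') {
  const form = new FormData();
  form.append('files', new Blob([await readFile(path)]), path);
  form.append('strategy', strategy);
  // Chunking options; combine_under_n_chars applies with by_title chunking.
  form.append('chunking_strategy', 'by_title');
  form.append('combine_under_n_chars', '1500');

  const res = await fetch('https://api.unstructured.io/general/v0/general', {
    method: 'POST',
    headers: { 'unstructured-api-key': process.env.UNSTRUCTURED_API_KEY! },
    body: form,
  });
  if (!res.ok) throw new Error(`Unstructured API error: ${res.status}`);
  return res.json(); // array of JSON elements (chunks) with text and metadata
}
```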
Embedding
- I only embedded the parsed `JSON based chunks` at the default settings. `text/csv` based chunking is also supported, but I have not yet run tests on that.
- I found the `ada-003` model to be quite fast, completing the embedding of both documents in under `2.5 seconds` each, which is great.
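As a sketch, batch embedding with the OpenAI client can look something like this; the model name is an assumption on my part, and the embeddings endpoint accepts an array of inputs so one request covers a whole batch:

```ts
// embed.ts — a minimal sketch of batch-embedding parsed JSON chunks.
// The model name is an assumption; the article refers to an "ada-003" model.
import OpenAI from 'openai';

const openai = new OpenAI();

export async function embedChunks(chunks: { text: string }[]) {
  // One request embeds the whole batch of chunk texts.
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunks.map((c) => c.text),
  });
  // data[i].embedding lines up with chunks[i].
  return data.map((d, i) => ({ text: chunks[i].text, values: d.embedding }));
}
```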
Indexing/Upserting
- `Indexing/Upserting` into `Pinecone Serverless` was remarkably fast, with both documents being indexed in under `4 seconds` on average.
- To avoid swamping the Pinecone API with requests, I used a `batch size` of `250` for the `UpsertMany` process, which worked well. Anything over `250` risks being refused if your chunk size is too large.
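A minimal sketch of that batching logic, assuming the official TypeScript client (where `UpsertMany` maps onto the client's `upsert` method) and a hypothetical index name:

```ts
// upsert.ts — a sketch of batched upserts into a user-specific namespace,
// using the batch size of 250 described above. Index name is hypothetical.
import { Pinecone, type PineconeRecord } from '@pinecone-database/pinecone';

const pinecone = new Pinecone();
const BATCH_SIZE = 250; // larger batches risk being refused with big chunks

export async function upsertEmbeddings(userId: string, records: PineconeRecord[]) {
  const ns = pinecone.index('titanium-rag').namespace(userId);
  for (let i = 0; i < records.length; i += BATCH_SIZE) {
    await ns.upsert(records.slice(i, i + BATCH_SIZE));
  }
}
```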
Retrieval
- Although a smaller `Top K` value typically resulted in faster retrieval times, there were occasions where a `Top K of 50` would resolve as quickly as `2 to 3 seconds`.
- More testing is needed to determine the optimal `Top K` value for the RAG model. Additionally, something to note is that server stability and load can also affect the retrieval times.
- As of now, retrieval via `metadata` is not supported with the Serverless version. Pinecone, how could you do this to me? I thought we had something special!
Deletion
- Using `DeleteMany` for Pinecone with a `batch size of 250`, the process completed effortlessly in `under 2 seconds` for the majority of the tests.
- The `DeleteAll` process was also quite fast, taking around `1 second` on average to complete.
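For illustration, both deletion paths map onto the TypeScript client roughly like this, with the index name and namespace scheme again being assumptions:

```ts
// delete.ts — a sketch of both deletion paths against a user's namespace.
// Index name and namespace scheme are assumptions, as before.
import { Pinecone } from '@pinecone-database/pinecone';

const pinecone = new Pinecone();

export async function deleteChunks(userId: string, ids: string[]) {
  const ns = pinecone.index('titanium-rag').namespace(userId);
  // Delete specific records in batches of 250, mirroring the upsert batch size.
  for (let i = 0; i < ids.length; i += 250) {
    await ns.deleteMany(ids.slice(i, i + 250));
  }
}

export async function deleteAllChunks(userId: string) {
  // Wipe every record in this user's namespace.
  await pinecone.index('titanium-rag').namespace(userId).deleteAll();
}
```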
Streaming Chat Experience
- The `streaming chat` experience was quite smooth, with the user receiving a response between `1 to 4 seconds` on average after sending a message. However, this was heavily dependent on the Top K value, the size of the overall enhanced message, and server load from OpenAI.
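For completeness, here's a minimal sketch of consuming such a stream with the OpenAI Node SDK; in the actual app the tokens would be forwarded to the chat UI rather than stdout:

```ts
// stream.ts — consuming a streamed chat completion token by token.
import OpenAI from 'openai';

const openai = new OpenAI();

const stream = await openai.chat.completions.create({
  model: 'gpt-4-0125-preview',
  stream: true,
  messages: [{ role: 'user', content: 'Hello!' }],
});

for await (const part of stream) {
  // Each chunk carries an incremental delta of the assistant's reply.
  process.stdout.write(part.choices[0]?.delta?.content ?? '');
}
```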
Costs
- Parsing: `Unstructured.io` provides a free tier of `1000 pages per month`, so I didn't incur any costs for the tests. If anyone has any experience with the paid tier, I'd love to hear about it.
- Embedding: The entire cost of running these tests was around `$0.02`, so I was quite pleased with that.
- Indexing/Upserting: `Pinecone Serverless` proved true to its marketed cost reduction over the pod-based model. The entire experiment, which consisted of roughly `2000 pages` worth of documents, cost me `$0.13` of the `$100` free credits. It's worth noting, however, that this experiment mostly involved `writes`, as I was focused on the `indexing` and `upserting` aspects. Further `reads` and `storage units` would naturally increase the costs.
- Conversation: The cost of asking a question to the `gpt-4-0125-preview` model and getting a response varied dramatically, with the `Top K` value and `chunk size` being the primary factors. But this has always been the case with OpenAI's API, so nothing surprising there. Overall, the entire experiment cost around `$4.50`, but there was some bug fixing and retesting involved, so the actual cost would be lower.
Conclusion
Overall, I've had a blast building this RAG model using `Pinecone Serverless` and `Unstructured.io`. I'm quite pleased with the performance and, of course, with being able to develop this model for free. I'm looking forward to further testing and optimizing the model, and I'm excited to see how these tools will evolve in the future.
`Unstructured.io` is perhaps geared to becoming a major player in the field of unstructured data parsing. The `hi_res` parsing strategy's runtime appears to scale `linearly` with document size, yet the accuracy it delivers is notably high. Combined with a fast and efficient Vector Database like `Pinecone Serverless` and a powerful LLM like `GPT-4`, we can confidently build a production-grade RAG model that can handle complex, unstructured documents at really affordable costs.
So, that's it for now!
I hope this brief analysis has been helpful to you. If you have any questions or suggestions, feel free to reach out to me on GitHub, LinkedIn, or via email.
Oh, and I'm not in any way affiliated with `Pinecone` or `Unstructured.io`, I just really like their products and most importantly, their free tiers. Wink, wink!
See ya around and happy coding!