- Athos Georgiou
Welcome to the latest installment of the series on building an AI chat assistant from scratch. This time around, however, I'd like to change the format a bit. Instead of guiding you through the process and showing snippets of code, I will provide a high-level overview of the process and the tools used, followed by a brief analysis of the RAG model using Pinecone Serverless and Unstructured.io.
For reference, here are the previous articles in the series:
- Part 1 - Integrating Markdown in Streaming Chat for AI Assistants
- Part 2 - Creating a Customized Input Component in a Streaming AI Chat Assistant Using Material-UI
- Part 3 - Integrating Next-Auth in a Streaming AI Chat Assistant Using Material-UI
- Part 5 - Integrating the OpenAI Assistants API in a Streaming AI Chat Assistant Using Material-UI
- Part 6 - Integrating Vision using the latest OpenAI API
The model is pretty much a prototype, and it may contain bugs or issues, so please let me know if you find any ground-breaking issues. And as always, if you'd prefer to skip the article and get the code yourself, you can find it on GitHub.
RAG (Retrieval-Augmented Generation) is a technique commonly used in Generative AI to enhance the quality of an AI response by providing context to an LLM (Large Language Model) such as GPT-4. The context is retrieved from a vector database using semantic search and supplements the user message to be sent. As a result, the LLM can generate responses that are more accurate and relevant, while also mitigating the infamous "hallucination" phenomenon that is so common in LLMs these days.
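To make the semantic search step a bit more concrete, here is a minimal in-memory sketch of a Top K lookup by cosine similarity. This is only an illustration of the idea, not how Pinecone works internally, and the names `StoredVector`, `cosineSimilarity`, and `topK` are hypothetical helpers I made up for this example.

```typescript
// Hypothetical shape of an embedded chunk held in memory.
interface StoredVector {
  id: string;
  values: number[];
}

// Cosine similarity between two vectors; returns 0 if either has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return normA === 0 || normB === 0 ? 0 : dot / (normA * normB);
}

// Score every stored vector against the query and keep the k best matches.
function topK(
  query: number[],
  vectors: StoredVector[],
  k: number
): { id: string; score: number }[] {
  return vectors
    .map((v) => ({ id: v.id, score: cosineSimilarity(query, v.values) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

A real vector database avoids scoring every vector on each query, but the contract is the same: a query embedding goes in, and the k most similar chunks come out.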
Although I've been building RAG-based open-source applications for a while now, I've mostly stuck to basic techniques and tools that provided average results and performance. However, I recently came across two new tools that have completely changed the game for me: Pinecone Serverless and Unstructured.io.
Pinecone Serverless is a newly released vector database, used for Generative AI applications that leverage concepts such as RAG, semantic search, and classification. I've used Pinecone before with decent results, but with the recent release of a serverless option, I was intrigued to try again.
In past efforts, the primary challenge in utilizing a vector database wasn't about the database itself, the embeddings, or even the earlier versions of GPT-4 models, which were often reluctant to incorporate provided context. The real difficulty lay in parsing unstructured data effectively and devising a solid chunking strategy. This was especially problematic for complex, unstructured documents containing images and tables, leading to decreased precision during the retrieval phase. While this might have been acceptable for small-scale applications or personal projects, it certainly would not suffice for a production-grade application.
This is when I started looking for document parsers and came across Unstructured.io, an application used for extracting and analyzing information from unstructured data, including tables and images. I was eager to see how it could be integrated into a RAG-based model and what kind of precision and performance it could provide.
These two tools also came with a free tier, or free credits, which was a huge plus for me. So I thought, why not build a new RAG model using these two tools and see how it performs?
I implemented the project using my existing Next.js-based template named Titanium, which already incorporates several advanced Generative AI features.
The RAG process follows these steps:
- A document uploaded by the user is parsed using Unstructured.io.
- The parsed chunks of the document are embedded with an OpenAI embedding model.
- Pinecone Serverless indexes the embedded data within a user-specific namespace, including any additional metadata.
- When the user sends a message to the AI Assistant, context retrieved from Pinecone augments the message, which is then processed by the LLM.
- GPT-4 generates a response using the enriched message, ensuring relevance and accuracy.
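The augmentation step above can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the exact code from the project: the `RetrievedChunk` shape, the `buildAugmentedMessage` helper, and the prompt wording are all hypothetical.

```typescript
// Hypothetical shape of a chunk returned by the vector search.
interface RetrievedChunk {
  text: string;
  score: number; // similarity score, higher means more relevant
}

// Build the enriched message: retrieved context first (best match first),
// followed by the user's original question.
function buildAugmentedMessage(
  userMessage: string,
  chunks: RetrievedChunk[]
): string {
  if (chunks.length === 0) {
    return userMessage; // nothing retrieved, send the message unchanged
  }
  const context = [...chunks]
    .sort((a, b) => b.score - a.score)
    .map((chunk, i) => `[${i + 1}] ${chunk.text}`)
    .join("\n");
  return `Use the following context to answer.\n\nContext:\n${context}\n\nQuestion: ${userMessage}`;
}
```

Note that every retrieved chunk is pasted into the prompt, so the size of the enriched message grows with both the Top K value and the chunk size.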
I tested this RAG model using two types of documents in PDF format: a 15-page sample offer letter and an 88-page book.
Some of the parameters that I adjusted and tested included:
- Top K elements to retrieve from Pinecone
- Batch size to prevent swamping the Pinecone API with requests
- Parsing strategy for Unstructured.io, which included several options such as `fast`, `hi_res`, and `auto`

For each test, I looked at:
- The response timing for each process.
- The precision of the responses in the streaming chat experience.
- The costs incurred for each process.
Document parsing performance varied with the document's size and the chosen parsing strategy. The `fast` strategy was the quickest but least accurate, while the `hi_res` strategy was the most accurate but slowest. The `auto` strategy offered a balance between speed and accuracy.
- For the 15-page text document, the `hi_res` strategy took an average of 45-50 seconds to parse, whereas the `fast` strategy took about 3-4 seconds. The `auto` strategy's timings were consistent across runs.
- Parsing the 88-page document containing both text and images with the `hi_res` strategy took between 5.3 and 5.5 minutes, whereas the `fast` and `auto` strategies took between 25 and 30 seconds, indicating a roughly 10x increase in time for the larger document.
- A setting of 1500 did not impact parsing times. This is currently not configurable in the app, but I'll be adding it soon.
- I only embedded the parsed JSON-based chunks at the default settings. text/csv-based chunking is also supported, but I have not yet run tests on that.
- I found the ada-003 model to be quite fast, completing the embedding of both documents in under 2.5 seconds each, which is great.
- Pinecone Serverless was remarkably fast, with both documents being indexed in under 4 seconds on average.
- To avoid swamping the Pinecone API with requests, I used a batched `UpsertMany` process, which worked well. Anything over 250 per batch has a risk of getting refused if your chunk size is too large.
- Although a smaller Top K value typically resulted in faster retrieval times, there were occasions where a Top K of 50 would resolve in as little as 2 to 3 seconds.
- More testing is needed to determine the optimal Top K value for the RAG model. Note that server stability and load can also affect retrieval times.
- As of now, retrieval via metadata is not supported with the Serverless version. Pinecone, how could you do this to me? I thought we had something special!
- Using `DeleteMany` for Pinecone with a batch size of 250, the process completed effortlessly in under 2 seconds for the majority of the tests.
- The `DeleteAll` process was also quite fast, taking around 1 second on average to complete.
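The batching used for the upsert and delete operations above can be sketched like this. It's a minimal illustration under my own assumptions: `toBatches` and `runInBatches` are hypothetical helpers, and `sendBatch` stands in for whatever call actually talks to the vector database (no real SDK call is shown here).

```typescript
// Split a list into consecutive batches of at most `batchSize` items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Send the batches one at a time so the API is never hit with
// everything at once. Returns the number of batches sent.
async function runInBatches<T>(
  items: T[],
  batchSize: number,
  sendBatch: (batch: T[]) => Promise<void> // hypothetical upsert/delete call
): Promise<number> {
  const batches = toBatches(items, batchSize);
  for (const batch of batches) {
    await sendBatch(batch);
  }
  return batches.length;
}
```

With a batch size of 250, a document that produces 600 chunks would go out as three requests of 250, 250, and 100 records.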
Streaming Chat Experience
The streaming chat experience was quite smooth, with the user receiving a response within 1 to 4 seconds on average after sending a message. However, this was heavily dependent on the Top K value, the size of the overall enhanced message, and server load from OpenAI.
- Unstructured.io provides a free tier of 1000 pages per month, so I didn't incur any costs for the tests. If anyone has any experience with the paid tier, I'd love to hear about it.
- Embedding: The entire cost of running these tests was around $0.02, so I was quite pleased with that.
- Pinecone Serverless proved true to its marketed cost reduction over the pod-based model. The entire experiment, which consisted of roughly 2000 pages worth of documents, was covered by my $100 in free credits. It's worth noting, however, that this experiment involved mostly writes; more storage units would naturally increase the costs.
- Conversation: The cost of asking a question to the gpt-4-0125-preview model and getting a response varied dramatically, with the Top K value and chunk size being the primary factors. This has always been the case with OpenAI's API, so nothing surprising there. Overall, the entire experiment cost around $4.50, but there was some bug fixing and retesting involved, so the actual cost would be lower.
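To get a feel for why Top K and chunk size dominate the conversation cost, here is a rough back-of-the-envelope estimator. It rests on several assumptions of mine: chunk size is measured in characters, one token is about four characters of English text (a common rule of thumb, not a tokenizer), and the price is something you pass in yourself. `estimatePromptCost` is a hypothetical helper.

```typescript
// Rule-of-thumb approximation; use a real tokenizer for exact counts.
const CHARS_PER_TOKEN = 4;

interface PromptCostInput {
  topK: number; // chunks retrieved per question
  chunkSizeChars: number; // assumed: characters per chunk
  questionChars: number; // length of the user's own message
  pricePer1kInputTokens: number; // USD, supply the current price for your model
}

// Estimate the input tokens and cost of one enriched message.
function estimatePromptCost(input: PromptCostInput): { tokens: number; usd: number } {
  const totalChars = input.topK * input.chunkSizeChars + input.questionChars;
  const tokens = Math.ceil(totalChars / CHARS_PER_TOKEN);
  return { tokens, usd: (tokens / 1000) * input.pricePer1kInputTokens };
}

// Illustrative numbers only; the price here is a placeholder.
const estimate = estimatePromptCost({
  topK: 50,
  chunkSizeChars: 1500,
  questionChars: 100,
  pricePer1kInputTokens: 0.01,
});
console.log(`${estimate.tokens} tokens, about $${estimate.usd.toFixed(2)} per question`);
```

The retrieved context scales as Top K times chunk size, so doubling either roughly doubles the input cost of every question. Output tokens are billed separately and are not covered by this sketch.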
Overall, I've had a blast building this RAG model using Pinecone Serverless and Unstructured.io. I'm quite pleased with the performance and, of course, with being able to develop this model for free. I'm looking forward to further testing and optimizing the model, and I'm excited to see how these tools will evolve in the future.
Unstructured.io is perhaps poised to become a major player in the field of unstructured data parsing. The `hi_res` parsing strategy's processing time appears to scale linearly with document size, yet the accuracy it delivers is notably high. Combined with a fast and efficient vector database like Pinecone Serverless and a powerful LLM like GPT-4, we can confidently build a production-grade RAG model that can handle complex, unstructured documents at really affordable costs.
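As a quick sanity check on the linear scaling observation, the `hi_res` timings reported earlier can be converted to seconds per page. The `perPageRange` helper is just something I wrote for this calculation, and two documents are of course too few data points to prove anything.

```typescript
// Convert a min/max total parse time into a seconds-per-page range.
function perPageRange(
  pages: number,
  minSeconds: number,
  maxSeconds: number
): [string, string] {
  return [(minSeconds / pages).toFixed(2), (maxSeconds / pages).toFixed(2)];
}

// hi_res timings from the tests: 45-50 seconds for the 15-page document,
// 5.3-5.5 minutes for the 88-page document.
const smallDoc = perPageRange(15, 45, 50);
const largeDoc = perPageRange(88, 5.3 * 60, 5.5 * 60);

console.log(`15 pages: ${smallDoc[0]} to ${smallDoc[1]} s/page`); // 3.00 to 3.33
console.log(`88 pages: ${largeDoc[0]} to ${largeDoc[1]} s/page`); // 3.61 to 3.75
```

The per-page cost stays in the same ballpark for both documents (a little higher for the larger one, which also contains images), which is what you would expect if parsing time grows roughly in proportion to page count.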
So, that's it for now!
Oh, and I'm not in any way affiliated with Unstructured.io. I just really like their products and, most importantly, their free tiers. Wink, wink!
See ya around and happy coding!