A brief analysis on RAG with Pinecone Serverless and Unstructured.io
- Author: Athos Georgiou

Welcome to the latest installment of the series on building an AI chat assistant from scratch. This time around, however, I'd like to change the format a bit. Instead of guiding you through the process and showing snippets of code, I'll provide a high-level overview of the process and the tools used, followed by a brief analysis of the RAG model built with Pinecone Serverless and Unstructured.io.
For reference, here are the previous articles in the series:
- Part 1 - Integrating Markdown in Streaming Chat for AI Assistants
- Part 2 - Creating a Customized Input Component in a Streaming AI Chat Assistant Using Material-UI
- Part 3 - Integrating Next-Auth in a Streaming AI Chat Assistant Using Material-UI
- Part 5 - Integrating the OpenAI Assistants API in a Streaming AI Chat Assistant Using Material-UI
- Part 6 - Integrating Vision using the latest OpenAI API
The model is very much a prototype and may contain bugs, so please let me know if you find any ground-breaking issues. And as always, if you'd prefer to skip the article and get the code yourself, you can find it on GitHub.
Overview
RAG (Retrieval-Augmented Generation) is a technique commonly used in Generative AI to improve the quality of an AI response by providing context to an LLM (Large Language Model) such as GPT-4. The context is retrieved from a vector database using semantic search and supplements the user message before it is sent. As a result, the LLM can generate responses that are not only more accurate and relevant, but also mitigate the infamous "hallucination" phenomenon so common in LLMs these days.
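To make the idea concrete, here's a minimal TypeScript sketch (matching the Next.js stack this project uses) of how retrieved context supplements the user message before it reaches the LLM. The `search` callback is a hypothetical stand-in for the vector database query covered later in the article.

```typescript
// Minimal sketch: fold retrieved chunks into the user message before sending it
// to the LLM. `search` is a hypothetical stand-in for the semantic search step.
async function buildAugmentedMessage(
  userMessage: string,
  search: (query: string) => Promise<string[]>
): Promise<string> {
  const chunks = await search(userMessage);
  return [
    'Answer using the context below. If the context is insufficient, say so.',
    '--- Context ---',
    ...chunks,
    '--- Question ---',
    userMessage,
  ].join('\n');
}
```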
Although I've been building RAG-based open-source applications for a while now, I've mostly stuck to basic techniques and tools that provided average results and performance. However, I recently came across two new tools that have completely changed the game for me: Pinecone Serverless and Unstructured.io.
Motivations
Pinecone Serverless is a newly released vector database offering, used in Generative AI applications that rely on concepts such as RAG, semantic search, and classification. I've used Pinecone before with decent results, but the recent release of a serverless option intrigued me enough to try it again.
In past efforts, the primary challenge in utilizing a vector database wasn't the database itself, the embeddings, or even the earlier versions of GPT-4 models, which were often reluctant to incorporate provided context. The real difficulty lay in parsing unstructured data effectively and devising a solid chunking strategy. This was especially problematic for complex, unstructured documents containing images and tables, leading to decreased precision during the retrieval phase. While this might have been acceptable for small-scale applications or personal projects, it certainly would not suffice for a production-grade application.
This is when I started looking for document parsers and came across Unstructured.io, an application used for extracting and analyzing information from unstructured data, including tables and images. I was eager to see how it could be integrated into a RAG-based model and what kind of precision and performance it could provide.
These two tools also came with a free tier, or free credits, which was a huge plus for me. So I thought, why not build a new RAG model using these two tools and see how it performs?
The Process
I implemented the project using my existing Next.js-based template named Titanium, which already incorporates several advanced Generative AI features.
The RAG process follows these steps:
- A document uploaded by the user is parsed using Unstructured.io.
- The parsed chunks of the document are embedded with OpenAI's ada-003 model.
- Pinecone Serverless indexes the embedded data within a user-specific namespace, including any additional metadata.
- When the user sends a message to the AI Assistant, relevant context is retrieved from the index and augments the message, which is then processed by the gpt-4-0125-preview model.
- GPT-4 generates a response using the enriched message, ensuring relevance and accuracy.
Analysis
I tested this RAG model using two types of documents in PDF format: a 15-page sample offer letter and an 88-page book.
Some of the parameters that I adjusted and tested included:
- `Top K` elements to retrieve from Pinecone
- `Batch size`, to prevent swamping the Pinecone API with requests
- Parsing strategy for Unstructured.io, which included several options such as `auto`, `fast`, and `hi_res`
Metrics
- The response timing for each process.
- The precision of the responses in the streaming chat experience.
- The costs incurred for each process.
Parsing
- Document parsing performance varied with the document's size and the chosen parsing strategy. The `fast` strategy was the quickest but least accurate, while the `hi_res` strategy was the most accurate but slowest. The `auto` strategy offered a balance between speed and accuracy.
- For a 15-page text document, the `hi_res` strategy took an average of 45-50 seconds to parse, whereas the `fast` strategy took about 3-4 seconds. The `auto` strategy consistently took around 3 seconds.
- Parsing the 88-page document containing both text and images with the `hi_res` strategy took between 5.3 and 5.5 minutes, whereas the `fast` and `auto` strategies took 25-30 seconds, indicating a roughly 10x increase in time for the larger document.
- Adjusting the `combine_under_n_chars` parameter to 1500 did not impact parsing times. This is currently not configurable in the app, but I'll be adding it soon. A sketch of the parsing call follows below.
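For reference, here's a hedged sketch of what the parsing call can look like against Unstructured.io's hosted partition endpoint, with the strategy and `combine_under_n_chars` parameters from the tests above. The endpoint URL, form fields, and environment variable name are assumptions based on Unstructured's hosted API; verify them against the current docs.

```typescript
// Sketch of a call to Unstructured.io's hosted partition endpoint (Node 18+,
// which ships fetch/FormData/Blob). Endpoint URL and form fields are assumptions;
// check them against the current Unstructured API docs before relying on this.
async function parseDocument(
  fileBytes: Uint8Array,
  fileName: string,
  strategy: 'auto' | 'fast' | 'hi_res' = 'auto'
) {
  const form = new FormData();
  form.append('files', new Blob([fileBytes]), fileName);
  form.append('strategy', strategy);            // auto | fast | hi_res
  form.append('chunking_strategy', 'by_title'); // group elements into chunks
  form.append('combine_under_n_chars', '1500'); // merge small chunks together

  const res = await fetch('https://api.unstructured.io/general/v0/general', {
    method: 'POST',
    headers: { 'unstructured-api-key': process.env.UNSTRUCTURED_API_KEY! },
    body: form,
  });
  if (!res.ok) throw new Error(`Unstructured API error: ${res.status}`);
  // Each element carries the extracted text plus metadata (page number, element type, etc.)
  return (await res.json()) as Array<{ text: string; metadata: Record<string, unknown> }>;
}
```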
Embedding
- I only embedded the parsed JSON-based chunks at the default settings. text/csv-based chunking is also supported, but I have not yet run tests on that.
- I found the ada-003 model to be quite fast, completing the embedding of both documents in under 2.5 seconds each, which is great. A sketch of the embedding call is shown below.
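Here's roughly what the embedding step looks like with the OpenAI Node SDK. The model name below is a stand-in (`text-embedding-3-small`); swap in whichever embedding model you actually use.

```typescript
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Embed a batch of parsed chunk texts in a single request.
// The model name is a stand-in; substitute the embedding model you use.
async function embedChunks(texts: string[]): Promise<number[][]> {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: texts,
  });
  return res.data.map((d) => d.embedding);
}
```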
Indexing/Upserting
- Indexing/upserting into Pinecone Serverless was remarkably fast, with both documents being indexed in under 4 seconds on average.
- To avoid swamping the Pinecone API with requests, I used a batch size of 250 for the UpsertMany process, which worked well. Anything over 250 risks being refused if your chunk size is too large. See the sketch after this list.
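A hedged sketch of the batched upsert, assuming the official Pinecone TypeScript SDK; the index name `titanium-rag` is illustrative.

```typescript
import { Pinecone } from '@pinecone-database/pinecone';

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });

type EmbeddedChunk = { id: string; values: number[]; metadata?: Record<string, string> };

// Upsert embedded chunks into a user-specific namespace, 250 records at a time,
// to avoid oversized requests. The index name is illustrative.
async function upsertChunks(chunks: EmbeddedChunk[], userId: string, batchSize = 250) {
  const index = pc.index('titanium-rag').namespace(userId);
  for (let i = 0; i < chunks.length; i += batchSize) {
    await index.upsert(chunks.slice(i, i + batchSize));
  }
}
```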
Retrieval
- Although a smaller Top K value typically resulted in faster retrieval times, there were occasions where a Top K of 50 would resolve as quickly as 2 to 3 seconds.
- More testing is needed to determine the optimal Top K value for the RAG model (a query sketch follows after this list). Additionally, something to note is that server stability and load can also affect the retrieval times.
- As of now, retrieval via metadata is not supported with the Serverless version. Pinecone, how could you do this to me? I thought we had something special!
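The retrieval step itself is a straightforward top-K query against the user's namespace. A sketch under the same assumptions as above (Pinecone TypeScript SDK, illustrative index name):

```typescript
import { Pinecone } from '@pinecone-database/pinecone';

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });

// Query the user's namespace for the chunks closest to the message embedding.
// A topK of 50 mirrors the upper end of the values tested above.
async function retrieveContext(queryEmbedding: number[], userId: string, topK = 50) {
  const index = pc.index('titanium-rag').namespace(userId);
  const res = await index.query({
    vector: queryEmbedding,
    topK,
    includeMetadata: true, // return stored chunk text/metadata alongside scores
  });
  return res.matches ?? [];
}
```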
Deletion
- Using DeleteMany for Pinecone with a batch size of 250, the process completed effortlessly in under 2 seconds for the majority of the tests.
- The DeleteAll process was also quite fast, taking around 1 second on average to complete. A sketch of both calls is shown below.
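For completeness, a sketch of the two deletion paths under the same assumptions (Pinecone TypeScript SDK, illustrative index name):

```typescript
import { Pinecone } from '@pinecone-database/pinecone';

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });

// Delete a document's vectors from the user's namespace in batches of 250.
async function deleteDocument(ids: string[], userId: string, batchSize = 250) {
  const index = pc.index('titanium-rag').namespace(userId);
  for (let i = 0; i < ids.length; i += batchSize) {
    await index.deleteMany(ids.slice(i, i + batchSize));
  }
}

// Clear every vector in the user's namespace in one call.
async function deleteAllForUser(userId: string) {
  await pc.index('titanium-rag').namespace(userId).deleteAll();
}
```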
Streaming Chat Experience
- The streaming chat experience was quite smooth, with the user receiving a response within 1 to 4 seconds on average after sending a message. However, this was heavily dependent on the Top K value, the size of the overall enhanced message, and server load on OpenAI's side. A streaming sketch is shown below.
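Below is a simplified sketch of the streaming step with the OpenAI Node SDK. The prompt wiring is reduced to a single user message; the actual app presumably adds system instructions and chat history on top.

```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

// Stream a completion for the context-augmented message and hand tokens to the UI.
async function streamAnswer(augmentedMessage: string, onToken: (t: string) => void) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4-0125-preview',
    stream: true,
    messages: [{ role: 'user', content: augmentedMessage }],
  });
  for await (const part of stream) {
    onToken(part.choices[0]?.delta?.content ?? '');
  }
}
```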
Costs
- Parsing: Unstructured.io provides a free tier of 1000 pages per month, so I didn't incur any costs for the tests. If anyone has experience with the paid tier, I'd love to hear about it.
- Embedding: The entire cost of running these tests was around $0.02, so I was quite pleased with that.
- Indexing/Upserting: Pinecone Serverless proved true to its marketed cost reduction over the pod-based model. The entire experiment, which consisted of roughly 2000 pages' worth of documents, cost me $0.13 of the $100 in free credits. It's worth noting, however, that this experiment mostly involved writes, as I was focused on the indexing and upserting aspects. Further reads and storage units would naturally increase the costs.
- Conversation: The cost of asking a question to the gpt-4-0125-preview model and getting a response varied dramatically, with the Top K value and chunk size being the primary factors. But this has always been the case with OpenAI's API, so nothing surprising there. Overall, the entire experiment cost around $4.50, but there was some bug fixing and retesting involved, so the actual cost would be lower.
Conclusion
Overall, I've had a blast building this RAG model using Pinecone Serverless and Unstructured.io. I'm quite pleased with the performance and, of course, with being able to develop this model for free. I'm looking forward to further testing and optimizing the model, and I'm excited to see how these tools will evolve in the future.
Unstructured.io seems geared toward becoming a major player in the field of unstructured data parsing. The hi_res parsing strategy's runtime appears to scale linearly with document size, yet the accuracy it delivers is notably high. Combined with a fast and efficient vector database like Pinecone Serverless and a powerful LLM like GPT-4, we can confidently build a production-grade RAG model that can handle complex, unstructured documents at a really affordable cost.
So, that's it for now!
I hope this brief analysis has been helpful to you. If you have any questions or suggestions, feel free to reach out to me on GitHub, LinkedIn, or via email.
Oh, and I'm not in any way affiliated with Pinecone or Unstructured.io; I just really like their products and, most importantly, their free tiers. Wink, wink!
See ya around and happy coding!
