
A brief analysis on RAG with Pinecone Serverless and Unstructured.io

    Athos Georgiou

Welcome to the latest installment of the series on building an AI chat assistant from scratch. This time around, however, I'd like to change the format a bit. Instead of guiding you through the process step by step, I'll give a high-level overview of the process and the tools used, followed by a brief analysis of the RAG model using Pinecone Serverless and Unstructured.io.

For reference, here are the previous articles in the series:

  • Part 1 - Integrating Markdown in Streaming Chat for AI Assistants
  • Part 2 - Creating a Customized Input Component in a Streaming AI Chat Assistant Using Material-UI
  • Part 3 - Integrating Next-Auth in a Streaming AI Chat Assistant Using Material-UI
  • Part 5 - Integrating the OpenAI Assistants API in a Streaming AI Chat Assistant Using Material-UI
  • Part 6 - Integrating Vision using the latest OpenAI API

The model is pretty much a prototype and may contain bugs, so please let me know if you find any breaking issues. And as always, if you'd prefer to skip the article and get the code yourself, you can find it on GitHub.


RAG (Retrieval-Augmented Generation) is a technique commonly used in Generative AI to improve the quality of an AI response by providing context to an LLM (Large Language Model) such as GPT-4. The context is retrieved from a Vector Database using Semantic Search and supplements the user message before it is sent. As a result, the LLM can generate responses that are more accurate and relevant, which also helps mitigate the infamous "hallucination" phenomenon that is so common in LLMs these days.
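
To make the retrieval idea concrete, here is a minimal in-memory sketch in TypeScript. It is not the code used in the project: the `Chunk` type, the helper names, the sample texts and the three-dimensional vectors are all made up for illustration, and a real setup would get its vectors from an embedding model and its matches from a Vector Database.

```typescript
// Minimal in-memory illustration of semantic retrieval plus message augmentation.
// The vectors are tiny hand-written stand-ins for real embeddings (assumption).

interface Chunk {
  id: string;
  text: string;
  vector: number[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks most similar to the query vector.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}

// Prepend the retrieved context to the user's message.
function augmentMessage(userMessage: string, context: Chunk[]): string {
  const contextText = context.map((c) => c.text).join("\n---\n");
  return `Context:\n${contextText}\n\nQuestion: ${userMessage}`;
}

// Hypothetical sample data.
const store: Chunk[] = [
  { id: "a", text: "The offer includes 25 days of annual leave.", vector: [0.9, 0.1, 0.0] },
  { id: "b", text: "The office is located in Dublin.", vector: [0.1, 0.9, 0.0] },
  { id: "c", text: "Salary is paid monthly.", vector: [0.7, 0.0, 0.7] },
];

const queryVector = [1, 0, 0]; // stand-in for the embedded user question
const retrieved = topK(queryVector, store, 2);
const augmentedPrompt = augmentMessage("How much annual leave do I get?", retrieved);
```

Only the two closest chunks end up in the message sent to the LLM, which is the whole point: the model sees a small, relevant slice of the document rather than all of it.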

Although I've been building RAG-based open-source applications for a while now, I've mostly stuck to basic techniques and tools that delivered average results and performance. However, I recently came across two new tools that have completely changed the game for me: Pinecone Serverless and Unstructured.io.


Pinecone Serverless is a newly released Vector Database for Generative AI applications that rely on concepts such as RAG, semantic search and classification. I've used Pinecone before with decent results, but with the recent release of a serverless option, I was intrigued to try again.

In past efforts, the primary challenge in utilizing a Vector Database wasn't about the database itself, the embeddings, or even the earlier versions of GPT-4 models, which were often reluctant to incorporate provided context. The real difficulty lay in parsing unstructured data effectively and devising a solid chunking strategy. This was especially problematic for complex, unstructured documents containing images and tables, leading to decreased precision during the retrieval phase. While this might have been acceptable for small-scale applications or personal projects, it certainly would not suffice for a production-grade application.

This is when I started looking for document parsers and came across Unstructured.io, an application for extracting and analyzing information from unstructured data, including tables and images. I was eager to see how it could be integrated into a RAG-based model and what kind of precision and performance it could provide.

These two tools also came with a free tier, or free credits, which was a huge plus for me. So I thought, why not build a new RAG model using these two tools and see how it performs?

The Process

I implemented the project using my existing Next.js-based template named Titanium, which already incorporates several advanced Generative AI features.

The RAG process follows these steps:

  1. A document uploaded by the user is parsed using Unstructured.io.
  2. The parsed chunks of the document are embedded with OpenAI's ada-003 model.
  3. Pinecone Serverless indexes the embedded data within a user-specific namespace, including any additional metadata.
  4. When the user sends a message to the AI Assistant, the most relevant chunks are retrieved from Pinecone and added to the message as context, which is then processed by the gpt-4-0125-preview model.
  5. GPT-4 generates a response using the enriched message, ensuring relevance and accuracy.
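
The steps above can be sketched as two small functions. This is an illustration under stated assumptions, not the Titanium implementation: every external service (parser, embedder, vector store, chat model) is represented by a hypothetical function type, so none of the names below belong to the Unstructured.io, OpenAI or Pinecone SDKs.

```typescript
// Each external service is a plain function type so the flow can be read
// (and tested) without any real SDK. All names here are hypothetical.
interface VectorRecord {
  id: string;
  values: number[];
  metadata: { text: string };
}

interface Services {
  parse: (file: string) => Promise<string[]>; // step 1: document -> text chunks
  embed: (texts: string[]) => Promise<number[][]>; // step 2: chunks -> vectors
  upsert: (namespace: string, records: VectorRecord[]) => Promise<void>; // step 3
  query: (namespace: string, vector: number[], topK: number) => Promise<VectorRecord[]>;
  chat: (prompt: string) => Promise<string>; // steps 4 and 5
}

// Steps 1 to 3: parse, embed and index a document in the user's namespace.
// Returns the number of records written.
async function ingest(s: Services, userId: string, file: string): Promise<number> {
  const chunks = await s.parse(file);
  const vectors = await s.embed(chunks);
  const records: VectorRecord[] = chunks.map((text, i) => ({
    id: `${file}#${i}`,
    values: vectors[i],
    metadata: { text }, // keep the original text so it can be used as context later
  }));
  await s.upsert(userId, records);
  return records.length;
}

// Steps 4 and 5: embed the question, fetch the closest chunks from the same
// namespace, and send the enriched message to the chat model.
async function ask(s: Services, userId: string, message: string, topK: number): Promise<string> {
  const [queryVector] = await s.embed([message]);
  const matches = await s.query(userId, queryVector, topK);
  const context = matches.map((m) => m.metadata.text).join("\n");
  return s.chat(`Context:\n${context}\n\nUser: ${message}`);
}
```

Passing the services in as functions keeps the flow independent of any particular vendor and makes it easy to swap in fakes when testing.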


I tested this RAG model using two documents in PDF format: a 15-page sample offer letter and an 88-page book.

Some of the parameters that I adjusted and tested included:

  • Top K elements to retrieve from Pinecone
  • Batch Size to prevent swamping the Pinecone API with requests
  • Parsing strategy for Unstructured.io, which included several options such as auto, fast, and hi_res


For each run, I measured:

  • The response timing for each process.
  • The precision of the responses in the streaming chat experience.
  • The costs incurred for each process.
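
For the timing part, a small wrapper is usually enough. The helper below is a hypothetical sketch (the name `timed` and its return shape are mine, not something from the project) that measures the wall-clock time of any async step.

```typescript
interface Timed<T> {
  result: T;
  ms: number;
}

// Run an async step, log how long it took, and return both the result and the duration.
async function timed<T>(label: string, fn: () => Promise<T>): Promise<Timed<T>> {
  const start = Date.now();
  const result = await fn();
  const ms = Date.now() - start;
  console.log(`${label}: ${ms} ms`);
  return { result, ms };
}
```

Each step (parsing, embedding, upserting, retrieval, deletion) can then be wrapped as `await timed("parse", () => someParseFunction(file))`, where `someParseFunction` stands in for whatever performs that step.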


Parsing

  • Document parsing performance varied with the document's size and the chosen parsing strategy. The fast strategy was the quickest but least accurate, while the hi_res strategy was the most accurate but slowest. The auto strategy offered a balance between speed and accuracy.

  • For a 15-page text document, the hi_res strategy took an average of 45-50 seconds to parse, whereas the fast strategy took about 3-4 seconds. The auto strategy consistently took around 3 seconds.

  • Parsing an 88-page document containing both text and images with the hi_res strategy took between 5.3 and 5.5 minutes, whereas the fast and auto strategies took between 25 and 30 seconds. Compared with the 15-page document, that is a roughly 6x to 10x increase in time for a document about 6x as long.

  • Adjusting the combine_under_n_chars parameter to 1500 did not impact parsing times. This is currently not configurable in the app, but I'll be adding it soon.
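
As a quick sanity check on how parsing time scales, the timings reported above can be turned into slowdown ratios. The snippet only uses the numbers from this section; the helper name `slowdownRange` is made up for the example.

```typescript
// Parse times reported above, in seconds, as [low, high] ranges.
type Range = [number, number];

const pagesSmall = 15;
const pagesLarge = 88;
const hiResSmall: Range = [45, 50];
const hiResLarge: Range = [5.3 * 60, 5.5 * 60]; // 318 to 330 seconds
const fastSmall: Range = [3, 4];
const fastLarge: Range = [25, 30];

// Smallest and largest possible slowdown given two [low, high] ranges.
function slowdownRange(small: Range, large: Range): Range {
  return [large[0] / small[1], large[1] / small[0]];
}

const pageRatio = pagesLarge / pagesSmall; // about 5.9
const hiResSlowdown = slowdownRange(hiResSmall, hiResLarge); // about 6.4 to 7.3
const fastSlowdown = slowdownRange(fastSmall, fastLarge); // 6.25 to 10
```

Both strategies slow down by a factor in the same ballpark as the growth in page count, which suggests roughly linear scaling, though two documents are far too few to say so with confidence.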


Embedding

  • I only embedded the parsed JSON-based chunks at the default settings. Chunking based on text/csv is also supported, but I have not yet run tests on that.
  • I found the ada-003 model to be quite fast, completing the embedding of both documents in under 2.5 seconds each, which is great.


Indexing/Upserting

  • Indexing/Upserting into Pinecone Serverless was remarkably fast, with both documents being indexed in under 4 seconds on average.
  • To avoid swamping the Pinecone API with requests, I used a batch size of 250 for the UpsertMany process, which worked well. Anything over 250 risks being refused if your chunks are too large.
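
The batching itself is simple. Below is a minimal sketch, assuming the actual write is done by some function that accepts one batch at a time; `upsertBatch` is a hypothetical stand-in for that call, not a Pinecone API.

```typescript
// Split an array into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Send batches one at a time so the API never receives more than
// `batchSize` records per request. Returns the number of requests made.
async function upsertInBatches<T>(
  records: T[],
  batchSize: number,
  upsertBatch: (batch: T[]) => Promise<void> // hypothetical write function
): Promise<number> {
  const batches = toBatches(records, batchSize);
  for (const batch of batches) {
    await upsertBatch(batch);
  }
  return batches.length;
}
```

With a batch size of 250, 600 records would go out as three requests of 250, 250 and 100 records. Sending them sequentially is the conservative choice; running a few batches in parallel would be faster but puts more load on the API.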


Retrieval

  • Although a smaller Top K value typically resulted in faster retrieval times, there were occasions where a Top K of 50 would resolve in as little as 2 to 3 seconds.
  • More testing is needed to determine the optimal Top K value for the RAG model. Note that server stability and load can also affect retrieval times.
  • As of now, retrieval via metadata is not supported with the Serverless version. Pinecone, how could you do this to me? I thought we had something special!


Deletion

  • Using DeleteMany for Pinecone with a batch size of 250, the process completed effortlessly in under 2 seconds for the majority of the tests.
  • The DeleteAll process was also quite fast, taking around 1 second on average to complete.

Streaming Chat Experience

  • The streaming chat experience was quite smooth, with the user receiving a response between 1 and 4 seconds on average after sending a message. However, this was heavily dependent on the Top K value, the size of the overall enhanced message, and server load from OpenAI.


Costs

  • Parsing: Unstructured.io provides a free tier of 1000 pages per month, so I didn't incur any costs for the tests. If anyone has any experience with the paid tier, I'd love to hear about it.
  • Embedding: The total cost of embedding for these tests was around $0.02, so I was quite pleased with that.
  • Indexing/Upserting: Pinecone Serverless proved true to its marketed cost reduction over the pod-based model. The entire experiment, which consisted of roughly 2000 pages' worth of documents, cost me $0.13 from the $100 free credits. It's worth noting, however, that this experiment involved mostly writes, as I was focused on the indexing and upserting aspects. Additional reads and storage units would naturally increase the costs.
  • Conversation: The cost of asking the gpt-4-0125-preview model a question and getting a response varied dramatically, with the Top K value and chunk size being the primary factors. But this has always been the case with OpenAI's API, so nothing surprising there. Overall, the entire experiment cost around $4.50, but there was some bug fixing and retesting involved, so the actual cost would be lower.
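
To see why Top K and chunk size dominate the conversation cost, here is a back-of-the-envelope estimate. Everything in it is an assumption for illustration: the per-token prices are what I believe the list prices for GPT-4 Turbo preview models were (check current pricing), the 4-characters-per-token rule is only a rough heuristic for English text, and the chunk, question and answer sizes are invented.

```typescript
interface PriceUsdPer1kTokens {
  input: number;
  output: number;
}

// Assumed prices; verify against current OpenAI pricing before relying on them.
const assumedPrice: PriceUsdPer1kTokens = { input: 0.01, output: 0.03 };

// Rough heuristic: about 4 characters per token for English text.
function estimateTokens(chars: number): number {
  return Math.ceil(chars / 4);
}

// Estimated cost in USD of one question, given how much context is attached.
function estimateQuestionCostUsd(opts: {
  topK: number;
  avgChunkChars: number;
  questionChars: number;
  answerTokens: number;
  price: PriceUsdPer1kTokens;
}): number {
  const promptChars = opts.topK * opts.avgChunkChars + opts.questionChars;
  const promptTokens = estimateTokens(promptChars);
  return (
    (promptTokens / 1000) * opts.price.input +
    (opts.answerTokens / 1000) * opts.price.output
  );
}

// Illustrative inputs: 1500-character chunks, a 100-character question, a 300-token answer.
const base = { avgChunkChars: 1500, questionChars: 100, answerTokens: 300, price: assumedPrice };
const topK5Cost = estimateQuestionCostUsd({ ...base, topK: 5 }); // about $0.03
const topK50Cost = estimateQuestionCostUsd({ ...base, topK: 50 }); // about $0.20
```

Under these assumptions, going from a Top K of 5 to 50 makes each question about seven times more expensive, almost entirely because of the extra context in the prompt.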


Overall, I've had a blast building this RAG model using Pinecone Serverless and Unstructured.io. I'm quite pleased with the performance and, of course, with being able to develop this model for free. I'm looking forward to further testing and optimizing the model, and I'm excited to see how these tools will evolve in the future.

Unstructured.io is perhaps on its way to becoming a major player in the field of unstructured data parsing. The hi_res strategy's parsing time appears to scale roughly linearly with document size, and the accuracy it delivers is notably high. Combined with a fast and efficient Vector Database like Pinecone Serverless and a powerful LLM like GPT-4, we can confidently build a production-grade RAG model that can handle complex, unstructured documents at a really affordable cost.

So, that's it for now!

I hope this brief analysis has been helpful to you. If you have any questions or suggestions, feel free to reach out to me on GitHub, LinkedIn, or via email.

Oh, and I'm not in any way affiliated with Pinecone or Unstructured.io, I just really like their products and most importantly, their free tiers. Wink, wink!

See ya around and happy coding!
