The RAG blueprint for our AI assistant

So we are building an AI assistant... Easy enough...
In my previous post, I described our overall objective. To recap: we are building an AI assistant with reliable expert domain knowledge behind an email interface.
MVP mentality
A great idea, with potential real users. Let's not overengineer it from the get-go, though. At least not too much. 😉
Oh, over-engineering, the siren of the analytical mind. How sweet your singing is, and how quickly you draw innocent engineers into the waters of never-shipped (get it?) projects.
Anyway, for our intents and purposes, let's identify where real scale benefits our solution and where we can get away with not designing for scalability, redundancy, and other concerns. In our case, at least during the development and testing phase, the user base will be small. Even after launch, we are not expecting a lot of users. By a lot, I mean thousands of daily active users or more, the scale at which certain assumptions can no longer be made.
The size of our knowledge dataset is also limited: it is considerably larger than what we can fit into context, but not so large that it needs special consideration from a storage and retrieval perspective. The same applies to ingestion and pre-processing, since the volume and frequency of new documents will not be too high (especially given our requirement of a manually curated catalog).
These assumptions can still be challenged later on. However, with a proper separation of concerns and an assessment of how future growth would impact certain elements, we can plan for expansion while keeping current complexity to a minimum.
Let's start simple
Before we even build anything, let's test the out-of-the-box capabilities of models with our ground truth dataset. Run it through a couple of models to get a baseline.
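To make that concrete, here is a minimal sketch of such a baseline run. It assumes a JSONL file of question/answer pairs and uses the OpenAI SDK purely as one example client; the file name and model names are placeholders, not the providers or models we actually tested.

```python
import json
from pathlib import Path

from openai import OpenAI  # one possible provider client; swap in whichever SDKs you evaluate

client = OpenAI()
GROUND_TRUTH = Path("ground_truth.jsonl")  # one {"question": ..., "expected": ...} per line
MODELS = ["model-a", "model-b"]            # placeholder model names

def ask(model: str, question: str) -> str:
    """Send a single question to a model and return its answer text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

def run_baseline() -> None:
    cases = [json.loads(line) for line in GROUND_TRUTH.read_text().splitlines() if line.strip()]
    results = [
        {
            "model": model,
            "question": case["question"],
            "expected": case["expected"],
            "answer": ask(model, case["question"]),
        }
        for model in MODELS
        for case in cases
    ]
    # Dump everything for manual review or a later LLM-as-judge scoring pass.
    Path("baseline_results.json").write_text(json.dumps(results, indent=2))

if __name__ == "__main__":
    run_baseline()
```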
What we found was that many models from different providers already perform quite well on our benchmark. However, we need to improve in certain areas, both to raise answer quality and to add functionality that isn't built into the LLMs.
Remember, we want to reference our knowledge base as closely as possible for compliance and safety reasons.
Adding RAG
Since we have more contextual data than fits into the models' context window, and we also want to filter which documents we attach for both performance and cost optimization, we opt for RAG.
As our documents are often hundreds of pages long and contain information beyond what an individual query needs, we apply pre-processing steps before using them in the generation step.
This is done for a variety of reasons:
- Improve context quality. Provide only the relevant part of the document.
- Minimize input token count. This reduces cost, improves latency, and makes better use of the LLM API quotas.
- Gain the ability to provide the relevant documentation not just to the LLM, but to the user too.
- Extract relevant metadata for improved retrieval during generation.
With this reasoning in place, I will now describe our RAG pre-processing pipeline.
Pre-processing pipeline
Our pipeline is built on a serverless, event-driven architecture using Google Cloud services. This approach ensures scalability and decouples the different stages of the process, making the system resilient and easier to maintain. The entire workflow is triggered by file uploads to a Google Cloud Storage (GCS) bucket.
Here's a breakdown of the steps involved:
1. Event Triggering and Initial Handling
The process kicks off the moment a PDF is uploaded to our source GCS bucket. This event triggers a Cloud Function. The first thing this function does is calculate an MD5 hash of the PDF's content. This hash serves as a unique identifier for the document, which is crucial for preventing the system from processing the same file multiple times. In all honesty, a batch pipeline would suffice, but making it event-driven doesn't increase complexity in this case. If anything, we are minimizing the complexity of triggering the flow.
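For illustration, here is a minimal sketch of what such a trigger handler could look like as a 2nd-gen Cloud Function in Python. The bucket handling, the "documents" collection name, and the stored fields are assumptions for the example, not a description of our exact implementation.

```python
import hashlib

import functions_framework
from google.cloud import firestore, storage

db = firestore.Client()
gcs = storage.Client()

@functions_framework.cloud_event
def on_pdf_uploaded(cloud_event):
    """Triggered when an object is finalized in the source GCS bucket."""
    data = cloud_event.data
    bucket_name, blob_name = data["bucket"], data["name"]

    if not blob_name.lower().endswith(".pdf"):
        return  # ignore non-PDF uploads

    # Compute an MD5 of the file content to use as a stable document ID.
    blob = gcs.bucket(bucket_name).blob(blob_name)
    content = blob.download_as_bytes()
    doc_id = hashlib.md5(content).hexdigest()

    # Skip files we have already processed.
    doc_ref = db.collection("documents").document(doc_id)
    if doc_ref.get().exists:
        print(f"{blob_name} already processed as {doc_id}, skipping")
        return

    doc_ref.set({"source": f"gs://{bucket_name}/{blob_name}", "md5": doc_id})
    # ...hand the file off to the sectioning step from here.
```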
2. Intelligent Document Sectioning
This is where the magic really begins. Instead of naively splitting the document into arbitrary, fixed-size chunks, we use the PDF's own internal structure. Our service inspects the document's outline (the bookmarks you see in a PDF reader), if available, to break it down into logical sections. A 200-page user manual might be split into sections like "Introduction," "Installation," "Troubleshooting," and so on.
This has a massive impact on retrieval quality. When a user asks a question about troubleshooting, the RAG system can retrieve the entire troubleshooting section, providing the LLM with complete, focused context rather than a fragmented piece of text. Furthermore, the processor is smart enough to distinguish and separate tables from plain text, preserving the structured data within the document.
Since these documents also contain useful images and infographics, we opted to use LLMs that support multi-modal inputs. For the time being, we only use the text content for embedding, as it provides sufficient retrieval accuracy, especially given how we split the documents. This approach works well because we have the input token budget to provide the full relevant section(s) as context. Each of these extracted sections is then saved as a new, smaller PDF in a separate destination GCS bucket.
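As a rough sketch of the outline-based splitting, the snippet below uses pypdf, which is an assumption (the post doesn't name a library), and it only handles top-level bookmarks; table extraction is left out.

```python
from pypdf import PdfReader, PdfWriter

def split_by_outline(pdf_path: str, out_dir: str) -> list[dict]:
    """Split a PDF into one smaller PDF per top-level bookmark."""
    reader = PdfReader(pdf_path)

    # Collect (title, start_page) pairs from the top-level outline entries.
    # Nested lists inside reader.outline hold sub-bookmarks, which we skip here.
    tops = [item for item in reader.outline if not isinstance(item, list)]
    starts = [(item.title, reader.get_destination_page_number(item)) for item in tops]

    sections = []
    for i, (title, start) in enumerate(starts):
        end = starts[i + 1][1] if i + 1 < len(starts) else len(reader.pages)
        writer = PdfWriter()
        for page_index in range(start, end):
            writer.add_page(reader.pages[page_index])

        out_path = f"{out_dir}/{i:03d}_{title[:40].replace('/', '_')}.pdf"
        with open(out_path, "wb") as fh:
            writer.write(fh)
        sections.append({"title": title, "pages": (start, end), "path": out_path})
    return sections
```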
3. Metadata and Version Control
As we process the document, we store a rich set of metadata in a Firestore database. This includes the parent document's title, category, issue number, and issue date, which are extracted from the PDF content itself. We also store metadata for every section we created, linking it back to the original parent document.
Our pipeline also handles document updates gracefully. If a new version of a manual is uploaded, the system queries Firestore for documents with the same title and category. By comparing the issue date or issue number, it can determine if the newly uploaded file is an update. If it is, the older version and all of its associated sections are flagged as is_deprecated, ensuring our AI assistant always references the latest information. We can use the information about version and issue changes to inform the user if a certain piece of information they have was initially correct but is now outdated.
When a document is re-processed, the system intelligently calculates the difference between the old and new sections. Stale sections are deleted, new ones are added, and only sections whose content has actually changed are marked for reprocessing.
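A minimal sketch of the deprecation step could look like the following; the collection names ("documents", "sections") and field names other than is_deprecated are assumptions for the example.

```python
from google.cloud import firestore
from google.cloud.firestore_v1.base_query import FieldFilter

db = firestore.Client()

def deprecate_older_versions(title: str, category: str, new_issue_date: str) -> None:
    """Flag earlier versions of a document, and their sections, as deprecated."""
    docs = (
        db.collection("documents")
        .where(filter=FieldFilter("title", "==", title))
        .where(filter=FieldFilter("category", "==", category))
        .stream()
    )
    for doc in docs:
        data = doc.to_dict()
        if data.get("issue_date", "") < new_issue_date:  # ISO dates compare lexicographically
            doc.reference.update({"is_deprecated": True})
            # Propagate the flag to every section that belongs to this older version.
            sections = (
                db.collection("sections")
                .where(filter=FieldFilter("parent_id", "==", doc.id))
                .stream()
            )
            for section in sections:
                section.reference.update({"is_deprecated": True})
```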
4. Chunking for Embedding
While we now have logically separated sections, some might still be too large for an embedding model's context window. The next step is therefore to break the text content of each section into smaller, overlapping chunks, with the overlap preserving contextual continuity between adjacent chunks.
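For reference, a bare-bones overlapping chunker; it splits by characters for simplicity, and the size and overlap values are illustrative, not the ones we use.

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks that overlap so context carries over."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so adjacent chunks share some text
    return chunks
```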
5. Decoupling with Pub/Sub
Once the text chunks are ready, instead of immediately sending them to be embedded, we publish them as individual tasks to a Pub/Sub topic. Each message contains the text chunk and a reference to its parent section in Firestore. One reason for this is the considerably higher quota limits of batch embedding APIs. It also creates a clean separation of concerns: the ingestion pipeline's only job is to process and prepare the data, while a separate, independently scalable service (e.g., another Cloud Function or a Cloud Run job) can subscribe to this topic and handle the computationally intensive task of generating embeddings, should we decide to do the embeddings ourselves.
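Publishing those tasks is only a handful of lines with the Pub/Sub client library; the project ID, topic name, and message fields below are assumptions for the sketch.

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "embedding-tasks")  # assumed names

def publish_chunk_tasks(section_id: str, chunks: list[str]) -> None:
    """Publish one embedding task per chunk, tagged with its parent section."""
    futures = []
    for index, chunk in enumerate(chunks):
        payload = json.dumps({"section_id": section_id, "chunk_index": index, "text": chunk})
        futures.append(
            publisher.publish(topic_path, data=payload.encode("utf-8"), section_id=section_id)
        )
    # Block until every message has been accepted by the Pub/Sub service.
    for future in futures:
        future.result()
```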
6. Handling Deletions
Finally, the system is designed to maintain integrity when source files are removed. A second Cloud Function is triggered when a file is deleted from the source GCS bucket. This function finds the corresponding document in Firestore and performs a "soft delete." This ensures we don't permanently lose data and can easily restore it if needed.
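The deletion handler mirrors the upload trigger. A rough sketch, again with assumed collection and field names (the post only specifies that the delete is "soft"):

```python
import functions_framework
from google.cloud import firestore
from google.cloud.firestore_v1.base_query import FieldFilter

db = firestore.Client()

@functions_framework.cloud_event
def on_pdf_deleted(cloud_event):
    """Triggered when an object is deleted from the source GCS bucket."""
    data = cloud_event.data
    source_uri = f"gs://{data['bucket']}/{data['name']}"

    # Soft-delete rather than remove, so the record can be restored later.
    docs = (
        db.collection("documents")
        .where(filter=FieldFilter("source", "==", source_uri))
        .stream()
    )
    for doc in docs:
        doc.reference.update({"is_deleted": True})
```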
This robust, content-aware pre-processing pipeline is the foundation of our AI assistant. It transforms raw, unstructured PDFs into a highly structured, version-controlled, and easily searchable knowledge base, paving the way for the accurate and contextually relevant responses we need.
Next step
With the RAG pre-processing pipeline in place, in the next post, I will focus on the flow of the AI assistant itself. Stay tuned.