The most common request in our introductory AI conversations with SMB executives has been virtually identical for three years: "We want a chatbot that knows our own documents — our contracts, our technical data sheets, our service manuals." This reflects a sound intuition: a generic language model knows the internet, but not your in-house order processing, the internal compliance handbook, or the accumulated knowledge from ten years of support tickets. That is precisely the gap that Retrieval-Augmented Generation — RAG for short — closes. In 2026 it is by far the most important AI architecture pattern for SMBs. This article explains what RAG delivers technically, when it is the right choice, how a production-ready system is built, which vector databases and embedding models are worth considering, and how to make answer quality measurable. For the strategic context, see our AI Guide for SMBs.
What is RAG — Definition and Distinction from Fine-Tuning
Retrieval-Augmented Generation is an architecture pattern that supplies a language model at runtime with additional context that is not part of its training dataset. Instead of training the model on new content, the system searches its own document base for relevant passages on every user query and delivers those passages alongside the question to the model. The model then generates its answer not from pure prior knowledge, but from the specifically provided sources. The difference from classical prompt engineering is scalability: RAG works even when the document base spans hundreds of thousands of pages, because only the currently relevant excerpts are loaded into the context.
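The flow described above can be sketched in a few lines. This is a minimal illustration rather than a production retriever: relevance is approximated by plain word overlap instead of embeddings, and the function names `retrieve` and `build_prompt` as well as the sample passages are hypothetical.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase word set; a toy stand-in for a real embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, passages: dict[str, str], top_k: int = 2) -> list[str]:
    """Return the ids of the top_k passages sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(passages, key=lambda pid: len(q & tokenize(passages[pid])), reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, passages: dict[str, str], ids: list[str]) -> str:
    """Place only the retrieved passages, tagged with their source id, ahead of the question."""
    context = "\n".join(f"[{pid}] {passages[pid]}" for pid in ids)
    return (
        "Answer using only the sources below and cite their ids.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

# Hypothetical document base
PASSAGES = {
    "manual-p12": "Error code E42 indicates a blocked coolant pump.",
    "contract-7": "The notice period for the framework agreement is three months.",
    "handbook-3": "Travel expenses must be submitted within 30 days.",
}

question = "What does error code E42 mean?"
top_ids = retrieve(question, PASSAGES, top_k=1)
prompt = build_prompt(question, PASSAGES, top_ids)
```

The model only ever sees the passages selected for this one question, which is why the size of the overall document base does not affect the prompt length.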
The distinction from fine-tuning matters to decision-makers because both concepts are frequently confused. Fine-tuning means further training the model itself on your own data — the internal weights change, and the model "knows" more afterward. This sounds more elegant at first, but has three practical drawbacks: first, every training run takes hours to days and costs four-figure sums; second, the sources of any given answer can no longer be traced; third, the hard-coded content becomes stale the moment the underlying material changes. For pure factual knowledge, fine-tuning is therefore rarely the right choice. It pays off primarily where style, format, or domain language needs to be changed — for example when a model should consistently answer in a legal writing style, or when it needs to master an internal classification schema.
RAG, by contrast, keeps the model unchanged and instead dynamically swaps out the document base. A new contract version is part of the system within minutes, a withdrawn data sheet disappears immediately, and every answer includes source citations with page numbers. These properties are critical for SMBs because they are not only technical requirements but legal ones — anyone giving an employee an answer must be able to say where it came from.
When RAG Makes Sense — Typical Use Cases
Not every AI application needs RAG. It is useful wherever a model needs to access your own, frequently updated, often extensive content and the answers must remain traceable. Four use cases dominate our SMB practice:
- Technical Documentation and Service Knowledge: Mechanical and plant engineering companies and software vendors store thousands of pages of operating manuals, maintenance plans, and error codes. A RAG system makes this knowledge base directly searchable for service technicians, with typical response times under five seconds and references to specific manual pages.
- Contracts, Proposals, and Legal Texts: Sales and legal departments work with hundreds of contracts, framework agreements, and standard clauses. RAG finds comparable clauses from older contracts, answers questions about terms and special conditions, and detects deviations from standard templates.
- Internal Knowledge Base and Onboarding: HR, IT helpdesk, and compliance teams maintain Confluence, SharePoint, or Notion knowledge bases that gather dust unused because the search is poor. RAG turns these resources into a genuine assistant: new employees ask in natural language instead of clicking through nested folders.
- Support FAQ and Customer Service: Customer support teams answer the same questions day after day. A RAG system draws on old tickets and FAQ articles to suggest an answer that the support agent only needs to review and send. Realistic reduction in handling time: 35 to 60 percent.
Where RAG is weak: mathematical calculations, pure logic tasks without reference to source texts, and creative tasks without a documented basis. Anyone building a copywriting assistant for marketing slogans does not need RAG but a good prompt template. More on tooling selection in the AI Tools Comparison 2026.
RAG Architecture — The Five Components
A production-ready RAG system consists of five clearly delineated components. Omitting or shortchanging any of them results in either poor answers or a non-scalable system. The components interlock in this order:
| Component | Function | Typical Choice 2026 |
|---|---|---|
| Embedding model | Converts text chunks and user questions into numerical vectors | OpenAI text-embedding-3-large, Voyage-3, BGE-M3 |
| Vector database | Stores chunk vectors, returns similar vectors in milliseconds | pgvector, Qdrant, Pinecone, Weaviate |
| Retrieval logic | Translates question into vector, queries DB, filters by metadata | Hybrid search from BM25 + vector similarity |
| Reranking | Re-sorts the top-N results using a more precise model | Cohere Rerank, Voyage rerank-2, BGE-Reranker |
| LLM generator | Generates answer from question and top chunks with source references | Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro |
The real value lever sits at a point that demos often skip: reranking. A vector search typically returns 20 to 50 roughly matching chunks. Without reranking, all of them end up in the model context — wasteful and quality-degrading. With a dedicated reranker model, this list is reduced to the five to eight genuinely most relevant chunks before the LLM responds. In client projects we consistently measure 15 to 25 percent better Faithfulness scores through clean reranking, with the same model and the same document base.
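A rough sketch of the reranking stage, under stated assumptions: `toy_relevance` is a hypothetical placeholder for a dedicated reranker model, and the candidate chunks are invented for illustration. The point is only the shape of the step, in which many candidates go in and a handful come out.

```python
from typing import Callable

def rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Re-score every candidate against the question and keep only the best `keep`."""
    ranked = sorted(candidates, key=lambda chunk: score(question, chunk), reverse=True)
    return ranked[:keep]

def toy_relevance(question: str, chunk: str) -> float:
    """Placeholder scorer: fraction of question words present in the chunk.
    A real system would call a dedicated reranker model here."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0

# Hypothetical output of the first-stage vector search
candidates = [
    "pump maintenance interval is 500 operating hours",
    "the cafeteria opens at 11:30",
    "replace the pump seal during maintenance",
    "maintenance contracts renew annually",
]
top = rerank("pump maintenance interval", candidates, toy_relevance, keep=2)
```

Because the scorer is passed in as a function, the same pipeline code can be evaluated with and without a real reranker, which makes the quality difference measurable.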
Free RAG Architecture Consultation
Planning a RAG system for internal documentation, contracts, or support? We offer a free 30-minute introductory call — we assess your data inventory, propose a suitable stack, and provide a realistic effort estimate for the first prototype.
Request a free RAG consultation
Vector Databases Compared
The vector database is the central infrastructure decision in building a RAG system because it determines operating costs, data residency, and scaling behavior. Four options dominate the market in 2026 — the choice depends less on vector performance than on the surrounding requirements.
| Solution | Deployment model | Strength | Weakness |
|---|---|---|---|
| pgvector | PostgreSQL extension, self-hosted | No additional component, transactional, GDPR-simple | Index tuning needed beyond several million chunks |
| Qdrant | Self-hosted or cloud, EU region available | Very fast filter logic, mature metadata support, open source | Second DB to operate, own backup required |
| Weaviate | Self-hosted or cloud | Module system with built-in vectorizer, GraphQL API | Complex to operate, steeper learning curve |
| Pinecone | Fully managed cloud (US provider) | Zero operational overhead, very fast scaling | Data residency requires GDPR argumentation |
For SMB first projects, our standard recommendation is pgvector — the majority of our clients already run a PostgreSQL database, the overhead for the vector extension is one hour of configuration, and the GDPR discussion is entirely avoided. At around five million chunks or when complex metadata filters with high write loads are required, we switch to Qdrant. We recommend Pinecone only when managed-service convenience explicitly outweighs data residency — rarely the case in the SMB market.
Embedding Models — Selection and Costs
The embedding model determines the quality of every search in the RAG system, because the vector database can only be as precise as the embedding captures the semantic content of a text. Four model families are relevant in 2026:
- OpenAI text-embedding-3-large: 3072 dimensions, excellent quality on English and multilingual text, $0.13 per million tokens. EU data residency available via Azure OpenAI with locations in Frankfurt and Sweden.
- Voyage voyage-3-large: 1024 dimensions, top-ranked in many independent benchmarks for specialist texts, $0.18 per million tokens. Recommended in Anthropic's documentation but a separate company (part of MongoDB since 2025), US hosting.
- Cohere embed-multilingual-v3: 1024 dimensions, excellent for mixed-language corpora with more than two languages, $0.10 per million tokens. EU hosting available.
- BGE-M3 / Multilingual-E5-large: open source, runs self-hosted on a T4 or L4 GPU, free to license. Quality is roughly five to ten percent below the US providers, but data never leaves your own network.
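To put the per-token prices into perspective, here is a small back-of-the-envelope calculation. The prices are the ones listed above; the corpus size of 100,000 pages and the figure of roughly 500 tokens per page are assumptions chosen purely for illustration.

```python
# Prices in USD per million tokens, taken from the list above.
PRICE_PER_MILLION = {
    "text-embedding-3-large": 0.13,
    "voyage-3-large": 0.18,
    "embed-multilingual-v3": 0.10,
}

def indexing_cost_usd(pages: int, tokens_per_page: int, model: str) -> float:
    """One-off cost of embedding the whole document base once."""
    total_tokens = pages * tokens_per_page
    return total_tokens / 1_000_000 * PRICE_PER_MILLION[model]

# Assumption: 100,000 pages at roughly 500 tokens each (50 million tokens).
costs = {m: round(indexing_cost_usd(100_000, 500, m), 2) for m in PRICE_PER_MILLION}
```

Under these assumptions the one-off indexing bill stays in the single-digit dollar range for every hosted model, so quality and data residency usually weigh more heavily than the token price.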
A pragmatic rule of thumb: as long as strictly confidential content is not being embedded, OpenAI text-embedding-3-large on Azure is the most cost-effective choice — good quality, EU location, low price. For GDPR-strict setups involving contracts, personnel files, or patient data, self-hosted BGE-M3 is the recommendation. Important: the chosen model must be applied consistently, both during the initial database population and for every subsequent search query. Switching the embedding model requires a complete re-indexing of the entire document base.
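One way to enforce this consistency is to store the model name alongside the index and to reject any write or query that uses a different one. The sketch below is a toy in-memory illustration; the `VectorIndex` class and its methods are hypothetical and not part of any vector database API.

```python
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    """Toy index that remembers which embedding model produced its vectors."""
    model_name: str
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def add(self, chunk_id: str, vector: list[float], model_name: str) -> None:
        """Store a vector, but only if it comes from the index's own model."""
        self._check(model_name)
        self.vectors[chunk_id] = vector

    def query_allowed(self, model_name: str) -> bool:
        """True only if the query vector comes from the same model as the index."""
        return model_name == self.model_name

    def _check(self, model_name: str) -> None:
        if model_name != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, got {model_name!r}: re-index required"
            )

index = VectorIndex(model_name="bge-m3")
index.add("chunk-1", [0.1, 0.2], model_name="bge-m3")
```

Vectors from different models live in unrelated spaces, so a mismatch does not produce an error by itself, only silently poor results. An explicit guard turns that into a visible failure.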
Chunking Strategies
Chunking — breaking long documents into individually vectorizable pieces — is the most inconspicuous yet quality-critical step in RAG setup. Poor chunking cannot be compensated for by any embedding model or reranker. Three strategies dominate practice:
Fixed size with overlap. The simplest approach: each document is split into pieces of fixed token length, typically 400 to 800 tokens, with 10 to 15 percent overlap between consecutive chunks. Advantage: deterministic, fast, implementable in two hours. Disadvantage: tears logical units such as tables, contract clauses, and code blocks at arbitrary points.
Structure-aware chunking. Chunks follow the document structure — heading hierarchy, paragraph boundaries, contract clauses, or Markdown sections. For technical documentation and legal texts, this approach consistently delivers 15 to 30 percent better retrieval results. Implementation takes one to two days per document type.
Semantic chunking. An additional model decides paragraph by paragraph whether to merge it thematically with the previous one or start a new chunk. Highest quality, but the most expensive indexing run and hardest to maintain. Recommended only for highly heterogeneous collections where neither token boundaries nor structural markers work well.
For 80 percent of SMB projects, structure-aware chunking with a moderate 600-token maximum size and 80-token overlap is the right choice. A complementary best practice is enriching each chunk with metadata — document type, version, date, tenant, language — so that the retrieval logic can later filter by tenant or exclude outdated versions.
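A minimal sketch of structure-aware chunking with metadata enrichment, assuming the documents are available as Markdown with `#` headings. The `Chunk` class, the function name, and the sample document are hypothetical; the sketch also leaves out the maximum-size cap, which a real pipeline would apply to oversized sections.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    metadata: dict[str, str]

def chunk_by_headings(doc: str, base_metadata: dict[str, str]) -> list[Chunk]:
    """Start a new chunk at every Markdown heading and tag it with the section title."""
    chunks: list[Chunk] = []
    title, lines = "preamble", []

    def flush() -> None:
        # Emit the section collected so far, skipping empty ones.
        body = "\n".join(lines).strip()
        if body:
            chunks.append(Chunk(body, {**base_metadata, "section": title}))

    for line in doc.splitlines():
        if line.startswith("#"):
            flush()
            title, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks

# Hypothetical sample document
DOC = """# Safety
Wear gloves.

# Maintenance
Check the pump every 500 hours.
Replace the seal yearly.
"""
chunks = chunk_by_headings(DOC, {"doc_type": "manual", "version": "3", "language": "en"})
```

Every chunk carries the document-level metadata plus its section title, so the retrieval logic can later filter on fields such as `version` or `language`.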
Evaluating Answer Quality — Recall, Precision, Faithfulness, RAGAS
A RAG system without a measurement setup is flying blind. The following four metrics belong in every RAG project from the first prototype — they provide objective numbers instead of gut feeling and make improvements comparable.
| Metric | What it measures | Target value (production) |
|---|---|---|
| Context Recall | Share of relevant documents that the retrieval actually found | above 70 percent |
| Context Precision | Share of genuinely relevant results among those retrieved | above 60 percent |
| Faithfulness | Share of answer statements that are backed by sources | above 80 percent |
| Answer Relevancy | How directly the answer addresses the original question | above 75 percent |
The RAGAS framework — an open-source project for RAG evaluation — automates these measurements by deploying a second language model as a judge. In practice this means: you maintain a test dataset of 50 to 200 typical questions with reference answers, RAGAS runs them through the system, and calculates the four metrics per question. This measurement pipeline ideally runs automatically on every system change — different embedding model, different chunking, different prompt — and immediately delivers a comparison figure.
From practice: the most common weakness in fresh RAG systems is low Context Recall — the retrieval simply does not find the right documents. The second most common is low Faithfulness — the model invents content despite available sources. Both problems have different solutions, and without measurement there would be no way to decide which to address.
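The definitions from the table can be written down as simple ratios. This is a simplified, set-based sketch for intuition only: RAGAS estimates these values with an LLM judge, so its numbers will not match this arithmetic exactly. The chunk ids and statement counts are invented.

```python
def context_recall(retrieved: set[str], relevant: set[str]) -> float:
    """Share of the relevant chunks that retrieval actually returned."""
    return len(retrieved & relevant) / len(relevant) if relevant else 1.0

def context_precision(retrieved: set[str], relevant: set[str]) -> float:
    """Share of the returned chunks that are actually relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def faithfulness(supported_statements: int, total_statements: int) -> float:
    """Share of answer statements backed by a retrieved source."""
    return supported_statements / total_statements if total_statements else 0.0

# Hypothetical evaluation of a single test question
retrieved = {"c1", "c2", "c3", "c4", "c5"}
relevant = {"c1", "c2", "c9", "c10"}
recall = context_recall(retrieved, relevant)        # 2 of 4 relevant chunks found
precision = context_precision(retrieved, relevant)  # 2 of 5 retrieved chunks relevant
faith = faithfulness(supported_statements=4, total_statements=5)
```

In this example the recall of 0.5 falls below the 70 percent target while faithfulness just reaches 80 percent, which points to the retrieval side rather than the generator as the place to start.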
Security and GDPR
RAG systems typically process an organization's internal knowledge assets — and therefore the most sensitive material in the business. Three data-protection questions should be answered before the architecture decision is made.
First, embedding data residency. Sending confidential documents to a US-hosted embedding model means transferring personal or business-critical data to the United States, and the vectors that come back are sensitive too. Embeddings are numeric but not anonymous: modern inversion techniques can reconstruct the original text with high accuracy. Consequence: for GDPR-strict content, embedding must run either at an EU-hosted provider such as Azure OpenAI in Frankfurt, or on a self-hosted model.
Second, tenant separation in retrieval. When a RAG system serves multiple departments, subsidiaries, or clients, the search logic must filter by permissions without exception. It is not sufficient to enforce permissions only in the UI — the LLM must never see the other tenant's chunks in the first place. The clean solution is metadata filters applied directly in the vector database query, with permission tags maintained per user.
Third, logging. Every user query potentially contains sensitive information — a question about "termination of employee Smith" is itself a personal data processing event. Logs should be encrypted, tenant-separated, and kept with short retention periods. For GDPR compliance, a logging concept is a mandatory component of the data protection impact assessment. For a deeper dive into GDPR architecture, see our AI and GDPR cluster, and for the general hosting question see the comparison LLM On-Premise vs. Cloud.
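The tenant separation described in the second point can be sketched as a query builder that puts the permission filter into the database query itself. The table name `chunks` and the columns `permission_tag`, `content`, and `embedding` are hypothetical; `<=>` is pgvector's cosine distance operator, and the named `%(...)s` placeholders assume a psycopg-style driver.

```python
SEARCH_SQL = """
SELECT id, content
FROM chunks                                   -- hypothetical table name
WHERE permission_tag = ANY(%(tags)s)          -- filter runs inside the database
ORDER BY embedding <=> %(query_vec)s::vector  -- pgvector cosine distance
LIMIT %(top_k)s
"""

def build_search(
    user_tags: list[str], query_vec: list[float], top_k: int = 20
) -> tuple[str, dict[str, object]]:
    """Return SQL text and parameters for a permission-filtered vector search.

    Fails closed: a user without any permission tag gets an error,
    never an unfiltered search.
    """
    if not user_tags:
        raise PermissionError("user has no permission tags; refusing to search")
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    params = {"tags": user_tags, "query_vec": vec_literal, "top_k": top_k}
    return SEARCH_SQL, params

sql, params = build_search(["sales-emea"], [0.1, 0.2])
```

Because the filter is part of the `WHERE` clause, chunks belonging to other tenants never leave the database and therefore cannot reach the reranker or the LLM.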
Reepa RAG Setup with Claude and pgvector
For SMB first projects we deliberately recommend a lean stack that can be made production-ready in two to four weeks. The following setup proposal covers around 80 percent of our client requirements without customization:
- Data storage: PostgreSQL with the pgvector extension, hosted in an EU region (Hetzner, OVH, AWS Frankfurt). A single database holds structured data and vector embeddings, so no second system is required.
- Embedding model: OpenAI text-embedding-3-large via Azure OpenAI Frankfurt for non-sensitive content, BGE-M3 self-hosted for strictly confidential collections. Consistency between index build and query time is mandatory.
- Retrieval and reranking: Hybrid search from pgvector cosine similarity and PostgreSQL full-text search, followed by Cohere Rerank for the top 20 results. Reduction to 5 to 8 final chunks for the model context.
- LLM generator: Claude Sonnet 4.5 for most answers, Claude Opus 4 for complex contract analyses. Both via the Anthropic API with an EU data processing addendum in the DPA.
- Orchestration: Vercel AI SDK in a Next.js application, deployed to an EU region. Streaming responses, source citations per answer, audit log in the same PostgreSQL DB.
- Evaluation: RAGAS pipeline with 100 curated test questions, automatic run on every configuration change, dashboard with trend figures for Context Recall, Precision, and Faithfulness.
This stack is deliberately undramatic — no exotic components, no vendor lock-in beyond two suppliers, clear GDPR argumentation. The first prototype typically runs within three weeks; we reach production-grade quality with RAGAS validation within eight to twelve weeks. Those wanting to go deeper on the prompt side will find the most important patterns in our Prompt Engineering cluster.
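The stack above combines a vector search with a full-text search but does not prescribe how the two result lists are merged. One common option is reciprocal rank fusion (RRF), shown here as an assumption rather than as the method used in the setup described. The chunk ids are invented, and `k = 60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked id lists; each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

vector_hits = ["c3", "c1", "c7"]    # hypothetical result of the cosine-similarity search
keyword_hits = ["c1", "c9", "c3"]   # hypothetical result of the full-text search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF works on ranks only, so the differently scaled scores of the two searches never need to be normalized against each other. The fused list would then be handed to the reranker.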
Frequently Asked Questions
What is the difference between RAG and fine-tuning?
RAG augments an existing language model at runtime with relevant document excerpts from a vector database — the model itself remains unchanged, the knowledge base is interchangeable and can be updated instantly. Fine-tuning, by contrast, trains the model on additional data and permanently alters its weights. For knowledge applications in SMBs, RAG is almost always the right approach because content can be updated daily, sources remain traceable, and no expensive training runs are required. Fine-tuning pays off primarily for style or format adjustments, not for pure factual knowledge.
Which vector database is the right fit for an SMB RAG project?
For most SMBs we recommend pgvector as an extension to an existing PostgreSQL database — no additional component, no extra costs, GDPR-compliant self-hosted in your own data center or an EU cloud. Qdrant is the right choice at higher volumes of several million chunks or when sophisticated metadata filter logic is required. Pinecone and Weaviate make sense for very large setups or when managed-service convenience outweighs data residency concerns. Introducing a standalone vector database solely for RAG is overkill in 80 percent of cases.
Which embedding models make sense for English content in 2026?
In 2026, three model families lead the field. Voyage voyage-3-large delivers the best quality on specialist texts but is tied to a US provider. OpenAI text-embedding-3-large offers excellent quality, EU data residency via Azure OpenAI, and is attractively priced. For GDPR-strict setups we recommend BGE-M3 or Multilingual-E5-large as a self-hosted solution on a small GPU — quality is only slightly below the US providers, but embeddings never leave your own data center. Cohere embed-multilingual-v3 is the recommendation for mixed-language corpora spanning many languages.
How large should chunks be in a RAG system?
A well-proven size is 400 to 800 tokens per chunk with 10 to 15 percent overlap between consecutive chunks. Smaller chunks increase precision because retrieved passages are more targeted, but they destroy context. Larger chunks preserve context but dilute the embedding and lead to unspecific matches. For technical documentation and contracts, structure-aware chunking strategies — chunks aligned to headings, paragraphs, or contract clauses — almost always outperform fixed token boundaries.
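For reference, fixed-size chunking with overlap is only a few lines of code. The sketch below uses 600 tokens with an 80-token overlap as example values inside the range given above, and a generated token list stands in for the output of a real tokenizer.

```python
def sliding_window_chunks(
    tokens: list[str], size: int = 600, overlap: int = 80
) -> list[list[str]]:
    """Split a token list into windows of `size`, each sharing `overlap` tokens with its predecessor."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

tokens = [f"t{i}" for i in range(1500)]   # stand-in for real tokenizer output
chunks = sliding_window_chunks(tokens)
```

The early `break` avoids emitting a final chunk that consists only of tokens already contained in the previous window.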
How do you objectively measure the quality of a RAG system?
The most important metrics are Recall, Precision, and Faithfulness. Recall measures whether the system finds the relevant documents at all. Precision measures what fraction of the retrieved documents are actually relevant. Faithfulness measures whether the generated answer stays true to the retrieved sources or invents content. The RAGAS framework automates this evaluation — it uses a second language model as a judge, compares answers to sources, and delivers a per-question score. A production RAG system should achieve at least 80 percent Faithfulness, 70 percent Context Recall, and under 5 percent hallucination rate before real users see it.
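The minimum values named above lend themselves to an automated release gate. A minimal sketch, assuming the evaluation run yields one aggregate score per metric on a 0 to 1 scale; the metric keys and the sample scores are hypothetical.

```python
# Minimums from the answer above, expressed as fractions.
THRESHOLDS = {"faithfulness": 0.80, "context_recall": 0.70}

def failed_metrics(
    scores: dict[str, float], thresholds: dict[str, float] = THRESHOLDS
) -> list[str]:
    """Names of metrics that are missing or below their minimum, sorted for stable output."""
    return sorted(
        name for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    )

run = {"faithfulness": 0.84, "context_recall": 0.62}  # hypothetical evaluation result
blocked_by = failed_metrics(run)
```

An empty list means the configuration may ship. A missing metric counts as a failure, so an incomplete evaluation run cannot pass by accident.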
Ready to make your internal knowledge AI-ready?
Let's talk for 30 minutes, no strings attached. We assess your document inventory, propose a suitable RAG stack, and deliver a realistic roadmap for the first 90 days — including GDPR argumentation, RAGAS test setup, and an estimate for licensing and infrastructure costs.
Schedule a 30-minute call