In 2026, "on-premise or cloud?" is the architecture question we hear most often in our consultations with mid-market companies. It is also the one most often answered incorrectly: either too quickly in favor of cloud because the setup looks deceptively simple, or too quickly in favor of on-premise because a blanket data-protection concern overrides every other argument. Both mistakes are expensive: in the first case, recurring compliance corrections; in the second, a six-figure hardware investment that never pays off. This guide shows how to make the decision rigorously, based on four criteria: data residency, token volume, operational maturity, and actual 3-year TCO. For the strategic context, see our AI Guide for SMBs, and for the data protection angle, the cluster on AI and GDPR.
What this is really about — and why the decision moves six-figure sums
Behind the apparently binary question "cloud or on-premise?" lie three distinct architecture decisions that are frequently conflated. First: does the data physically leave the organization, and if so, where does it go? Second: who operates the model infrastructure (the cloud provider, a service partner, or the in-house IT team)? Third: who selects and updates the model? A well-considered decision separates these three layers, because doing so reveals hybrid architectures, which in practice are almost always the best solution.
Clarity of terminology also matters. "Cloud LLM" in the strict sense means a provider-hosted API service — OpenAI, Anthropic, Google, Mistral La Plateforme. "On-premise" strictly means a GPU server operated in your own data center running an open model such as Llama 3.x, Mistral, Mixtral, or Qwen. Between these poles lie additional variants: a cloud LLM in an EU region with a data processing agreement, a dedicated cloud GPU server on AWS, Azure, or Hetzner running your own model, or a managed service in a German data center. These intermediate options are often overlooked in the debate, yet they are frequently the most cost-effective choice for mid-market companies.
Cloud LLMs — pros and cons in plain terms
Cloud LLMs are the default choice for mid-market companies, and for good reasons. They deliver the strongest available models, they require no upfront capital, they scale with usage volume, and they relieve the organization entirely of operational burden. At the same time, they have three structural disadvantages you need to understand and plan for.
Advantages. First, model quality: the leading cloud providers (Anthropic with Claude, OpenAI with GPT, Google with Gemini) deliver models that typically lead open alternatives by six to eighteen months. Anyone who needs the strongest available model cannot avoid the cloud. Second, pay-per-use: costs scale with actual consumption, with no upfront investment, no utilization worries, and no hardware depreciation. Third, no ops overhead: the organization does not deal with GPU drivers, model updates, quantization, inference servers, load balancing, or disaster recovery. For mid-market IT teams that are already overstretched, this is a major advantage.
Disadvantages. First, data residency and confidentiality: even with an EU region and a data processing agreement, the model remains in third-party hands, logs are potentially retained for security analysis, and with US providers a residual risk of extraterritorial data demands remains — more on this in the Schrems II section. Second, vendor lock-in: anyone who tailors all prompts, RAG pipelines, and tool calls to a proprietary provider API ends up with a stack that is hard to unwind. Model switching and negotiating leverage are correspondingly limited. Third, latency and availability: cloud LLMs have typical response latencies of 800 milliseconds to several seconds on long contexts, and they depend on internet connectivity and provider uptime. For interactive real-time applications, this is often a problem.
On-premise LLMs — what is possible in 2026
The on-premise landscape developed dramatically between 2023 and 2026. Open models have largely closed the gap to closed cloud models across many benchmark disciplines. Three families dominate the market in 2026:
Llama 3.x by Meta. With Llama 3.3 70B, a model is available that matches GPT-4-class performance on many tasks and is published under a license usable by businesses. Llama 3.3 70B in 4-bit quantization mode runs on a single NVIDIA H100 80 GB or two A100 80 GB cards. This makes it the standard choice for mid-market on-premise projects.
Mistral and Mixtral. Mistral 7B and the Mixtral 8x22B mixture-of-experts model from the French provider Mistral are particularly popular in Europe, because the provider itself is EU-based. Mixtral 8x22B delivers quality close to Llama 3.3 70B on many tasks and is more compute-efficient at inference thanks to its MoE architecture, because only a fraction of its roughly 141 billion parameters is active per token. All experts must still be held in GPU memory, so in 4-bit mode a single H100 80 GB is a tight fit that leaves little room for long contexts. Mistral Small and Mistral Medium are the lean variants for resource-constrained setups.
Qwen by Alibaba. Qwen 2.5 emerged in 2025 as an unexpectedly strong open model, particularly for multilingual tasks and code. Qwen 2.5-72B is competitive with Llama 3.3 70B in German-language benchmarks. For many mid-market companies, however, it carries political risk because the provider is Chinese — even without any data transfer, contractual concerns arise in supplier audits.
Hardware reality in 2026. The viable GPUs for production inference are essentially three: the NVIDIA H100 with 80 GB as the premium choice at around €28,000 to €35,000 per card, the A100 with 80 GB as the proven standard choice at around €12,000 to €18,000 on the used and refurbished market, and the L40S with 48 GB as an affordable inference card at around €9,000 to €12,000. A production server with one L40S, suitable for Mistral 7B, Llama 3 8B, or smaller Qwen variants, starts in 2026 at around €15,000 including server chassis, CPU, RAM, and NVMe storage. A production H100 server for Llama 3.3 70B or Mixtral 8x22B runs €50,000 to €75,000.
Inference latency. On-premise has an often underestimated advantage here: time-to-first-token latency is typically 80 to 200 milliseconds versus 600 to 1,500 milliseconds in the cloud, because no internet hop is involved. For interactive applications — customer-service chatbots, code assistants, real-time translation — this is perceptible. Tokens-per-second throughput depends on the model and GPU; with a single H100 running Llama 3.3 70B at 4-bit, it typically ranges from 30 to 60 tokens per second in single-user mode.
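To see how time-to-first-token and throughput combine, total response time can be estimated as the wait for the first token plus the output length divided by tokens per second. The sketch below uses the on-premise figures from this section; the 300-token answer length and the function name are assumptions chosen for illustration.

```python
def response_time_s(ttft_ms: float, output_tokens: int, tokens_per_s: float) -> float:
    """Rough end-to-end time: wait for the first token, then stream the rest."""
    return ttft_ms / 1000 + output_tokens / tokens_per_s


# Assumed answer length: 300 tokens. Single H100, Llama 3.3 70B at 4-bit,
# single-user mode (30 to 60 tokens per second, 80 to 200 ms to first token).
slow = response_time_s(ttft_ms=200, output_tokens=300, tokens_per_s=30)
fast = response_time_s(ttft_ms=80, output_tokens=300, tokens_per_s=60)
print(f"on-premise, 300 tokens: {fast:.1f} s to {slow:.1f} s")
```

For long answers, throughput dominates the total; the low time-to-first-token matters most for short replies and for how quickly the interface starts to respond.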
Request a free architecture consultation
Facing the choice between cloud, on-premise, or hybrid? We offer a 45-minute initial session at no cost — we assess your data sensitivity and realistic token volume and propose an architecture with a concrete hardware or licensing plan.
Hybrid architectures: the middle path that is usually right
In over 70 percent of our consulting projects, the answer to the cloud-versus-on-premise question is "both, but in different roles." A hybrid architecture separates sensitive from generic workloads and routes each to the appropriate infrastructure. This gives the organization three advantages simultaneously: strictly confidential data stays in-house, generic tasks benefit from the model strength of the cloud, and the overall budget remains manageable because the on-premise model can be smaller in scale.
A proven split in practice looks like this: an on-premise Llama or Mistral processes everything involving contracts, HR records, R&D documents, customer data, or source code. A cloud model such as Claude Enterprise or GPT-4 Enterprise handles the generic tasks — general content creation, brainstorming, translation, publicly available research, marketing content. A routing module upstream decides per request which endpoint is responsible.
Routing can be implemented in three ways. First, manually via the application — the user chooses "internal" or "public" themselves. This is the simplest but most error-prone variant. Second, rule-based — the application classifies documents, keywords, and data sources automatically. Third, model-based — a small local classification model evaluates the sensitivity of each request. The rule-based variant is the pragmatic middle ground for most mid-market companies, because it remains maintainable and is cleanly documentable in audits.
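A rule-based router of the kind described above can be sketched in a few lines. Everything here is hypothetical: the source tags, the keyword list, and the names `Request` and `route` are placeholders, and a real rule set would be agreed with the data protection officer. The one design choice worth copying is that unknown cases fall back to on-premise.

```python
import re
from dataclasses import dataclass

# Hypothetical rule set for illustration only.
SENSITIVE_SOURCES = {"hr", "contracts", "rnd", "crm", "source_code"}
SENSITIVE_PATTERN = re.compile(r"\b(salary|termination|nda|patent)\b", re.IGNORECASE)


@dataclass
class Request:
    text: str
    source: str  # metadata tag set by the calling application


def route(req: Request) -> str:
    """Return 'on_premise' or 'cloud'; anything unclear stays on-premise."""
    if req.source in SENSITIVE_SOURCES:
        return "on_premise"
    if SENSITIVE_PATTERN.search(req.text):
        return "on_premise"
    if req.source == "public":
        return "cloud"
    return "on_premise"  # unknown source: fail closed


print(route(Request("Summarize this NDA draft", source="public")))
print(route(Request("Write a post about our trade fair", source="public")))
```

Because every decision follows from an explicit rule, the router's behavior can be listed in an audit document line by line, which is the main reason to prefer it over a model-based classifier at first.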
Data residency: EU region, Schrems II, and the pragmatic state of play in 2026
Since 2020, Schrems II has been the dominant argument in many data-protection discussions, and it is typically voiced louder and more broadly than the actual legal situation in 2026 warrants. With the EU-US Data Privacy Framework of 2023, the legal basis for data transfers to certified US providers has been restored, and most major cloud LLM providers (OpenAI, Anthropic, Google, Microsoft) are listed under the framework. This does not mean every application is permissible, but it does mean that a blanket "US cloud is prohibited" is not legally tenable.
For a clean argument before data protection officers, works councils, and regulators, a cloud LLM deployment for mid-market companies in 2026 typically requires four components: first, an explicit choice of an EU region with the provider; second, a data processing agreement under Article 28 GDPR; third, a documented data protection impact assessment with risk evaluation; fourth, organizational safeguards such as prohibited data categories and logging restrictions. Anyone who documents these four elements thoroughly is audit-ready in the vast majority of cases. See the cluster on AI and GDPR for full detail.
On-premise only becomes mandatory when one of three constellations applies: certain KRITIS and regulatory requirements explicitly demand that data remain on-site, contractual confidentiality obligations to customers prohibit external processing, or the data categories are so sensitive (health data, criminal proceedings data, R&D secrets) that the risk assessment allows no defensible cloud scenario. Outside these three constellations, on-premise is an economic and strategic decision, not a legal obligation.
3-year TCO — concrete figures
The economic breakeven between cloud and on-premise is most honestly visible in a three-year comparison. The following table shows a realistic TCO breakdown for a mid-market company with around 150 active AI users and a productive RAG system.
| Cost item | Cloud-only (€ 3 years) | On-premise (€ 3 years) | Hybrid (€ 3 years) |
|---|---|---|---|
| Hardware (server, GPU, storage) | 0 | 55,000–75,000 | 20,000–28,000 |
| Cloud licenses and API consumption | 180,000–260,000 | 0 | 70,000–110,000 |
| Power, cooling, rack hosting | 0 | 12,000–18,000 | 4,500–6,500 |
| Software, maintenance, updates | 0 | 9,000–15,000 | 5,000–8,000 |
| Internal operations (staff) | 15,000–25,000 | 60,000–90,000 | 40,000–55,000 |
| One-time setup and integration | 10,000–20,000 | 25,000–45,000 | 30,000–50,000 |
| Total over 3 years | 205,000–305,000 | 161,000–243,000 | 169,500–257,500 |
The table reveals three important findings. First: on-premise does pay off over three years, but the margin over hybrid is slim and the gap versus cloud-only is not nearly as dramatic as often claimed. Second: the largest on-premise cost item is not hardware but internal staff time. Anyone who honestly accounts for this burden (one to one-and-a-half person-days per week for operations, updates, monitoring, and incident escalation) arrives at very different conclusions than superficial comparisons that only weigh hardware against license costs. Third: although hybrid is not the cheapest column in this example, it is the economically strongest option in most scenarios, because it keeps hardware costs contained and limits cloud license spend to genuinely generic tasks.
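The totals and the staff share can be rechecked with a few lines of arithmetic. The sketch below copies the low and high figures from the table above; the dictionary layout and the function name `total` are illustrative choices, not part of any tool we ship.

```python
# (low, high) in EUR over three years, in table order:
# hardware, cloud/API, power and hosting, software, internal staff, setup.
COSTS = {
    "cloud_only": [(0, 0), (180_000, 260_000), (0, 0), (0, 0),
                   (15_000, 25_000), (10_000, 20_000)],
    "on_premise": [(55_000, 75_000), (0, 0), (12_000, 18_000), (9_000, 15_000),
                   (60_000, 90_000), (25_000, 45_000)],
    "hybrid": [(20_000, 28_000), (70_000, 110_000), (4_500, 6_500), (5_000, 8_000),
               (40_000, 55_000), (30_000, 50_000)],
}
STAFF = 4  # index of the internal-operations row


def total(items):
    """Sum the low ends and the high ends of all cost items separately."""
    return sum(low for low, _ in items), sum(high for _, high in items)


for name, items in COSTS.items():
    low, high = total(items)
    staff_low, staff_high = items[STAFF]
    print(f"{name}: {low:,} to {high:,} EUR, "
          f"staff share {staff_low / low:.0%} to {staff_high / high:.0%}")
```

Swapping in your own figures for each row is usually the fastest way to see which cost item actually drives the comparison in your case.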
For a deeper economic analysis, see the cluster on AI costs and ROI calculation.
Security and audit argumentation
A sound security argument for either model rests on the same building blocks, just weighted differently. For cloud LLMs the emphasis is on: EU region, data processing agreement, provider certifications such as ISO 27001 and SOC 2 Type 2, documented logging and retention rules, and a contingency plan for provider outages. For on-premise the argument shifts to internal responsibilities: patch management of inference servers, network segmentation, access control on model endpoints, audit logging of requests, and regular security testing.
When each option makes sense — the decision matrix
The following decision matrix distills our consulting practice into five criteria. It does not replace an individual architecture consultation, but provides a solid first orientation.
| Criterion | Cloud-only makes sense | Hybrid makes sense | On-premise makes sense |
|---|---|---|---|
| Data sensitivity | predominantly public or low-confidentiality | mixed — some areas strictly confidential | predominantly strictly confidential or regulated |
| Monthly token volume (as cloud spend) | under €4,000 cloud costs | €4,000–€10,000 cloud costs | over €10,000 cloud costs or constant 24/7 load |
| IT maturity and capacity | small IT team, limited ops experience | mid-sized IT team with Linux experience | in-house server operations and GPU expertise |
| Model recency | state-of-the-art strictly required | state-of-the-art for some tasks, adequate for others | modern open models are sufficient |
| Regulatory environment | GDPR satisfiable with EU region and DPA | mixed requirements | KRITIS, BaFin, high BSI baseline, or professional secrecy obligations |
The matrix works by majority logic: if three or more criteria clearly fall into one column, that is your recommendation. If criteria are spread across columns, hybrid is almost always the right choice — even if that tends to be the most uncomfortable answer, because it requires a somewhat more sophisticated architecture.
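The majority logic can be written down as a small function. This is a minimal sketch: the criterion keys and the function name `recommend` are placeholders, and the rating of each criterion still has to come from a human assessment.

```python
from collections import Counter


def recommend(ratings: dict[str, str]) -> str:
    """ratings maps each criterion to 'cloud', 'hybrid', or 'on_premise'."""
    column, votes = Counter(ratings.values()).most_common(1)[0]
    # Three or more criteria in one column decide; otherwise fall back to hybrid.
    return column if votes >= 3 else "hybrid"


clear_case = {
    "data_sensitivity": "hybrid",
    "monthly_spend": "cloud",
    "it_maturity": "cloud",
    "model_recency": "cloud",
    "regulation": "hybrid",
}
spread_case = {
    "data_sensitivity": "on_premise",
    "monthly_spend": "cloud",
    "it_maturity": "cloud",
    "model_recency": "hybrid",
    "regulation": "on_premise",
}
print(recommend(clear_case))   # three criteria point to cloud
print(recommend(spread_case))  # no column reaches three, so hybrid
```

With five rows in the matrix, at most one column can collect three or more votes, so the rule never produces a tie.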
Reepa's experience with both models
Reepa operates a hybrid architecture itself in 2026 and has guided around two dozen mid-market projects in both directions over the past twelve months. A brief, honest assessment from that practice.
Cloud experience. We use Claude Enterprise as the primary model for generic tasks and the Anthropic API for embedded product features in our own tools. Model quality in 2026 is unmatched, the EU region and data processing agreement are cleanly documented, and ops overhead is essentially zero. The biggest pain point: provider pricing rounds are unpredictable, and vendor lock-in is real — we have therefore introduced an abstraction layer that lets us switch models with minimal effort.
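The idea behind such an abstraction layer can be illustrated as follows. This is not our production code: the interface `ChatBackend`, the two stand-in backends, and the function `ask` are hypothetical, and the stand-ins only echo the prompt where a real backend would call a vendor SDK or a local inference server.

```python
from typing import Protocol


class ChatBackend(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class StubCloudBackend:
    """Stand-in for a cloud provider client (hypothetical)."""

    def complete(self, prompt: str) -> str:
        return f"[cloud] {prompt}"


class StubLocalBackend:
    """Stand-in for an on-premise inference server client (hypothetical)."""

    def complete(self, prompt: str) -> str:
        return f"[local] {prompt}"


BACKENDS: dict[str, ChatBackend] = {
    "cloud": StubCloudBackend(),
    "local": StubLocalBackend(),
}


def ask(prompt: str, backend: str = "cloud") -> str:
    """Application code names a role, never a vendor."""
    return BACKENDS[backend].complete(prompt)


print(ask("Draft a product announcement"))
print(ask("Summarize the audit report", backend="local"))
```

Switching providers then means registering a different class under the same key, while prompts and calling code stay untouched.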
On-premise experience. For client projects involving strictly confidential data — audit reports, R&D documents, contract analysis — we operate a GPU server with two L40S cards hosting Mixtral 8x22B and Mistral Medium. The economics versus cloud tip at a volume we calculate individually per client. The underestimated advantage is latency: code reviews and RAG queries feel noticeably more responsive on-premise, which increases user acceptance. The underestimated burden is model-update hygiene: every three to six months a meaningful new model version arrives, and anyone who does not actively update quickly falls behind the cloud standard.
Hybrid recommendation. For German mid-market companies, hybrid is in our experience the right choice in roughly three out of four projects. Pure cloud-only works well for small companies with manageable volumes and non-critical data. Pure on-premise makes sense with strict regulatory requirements or very high constant volumes. Everything in between benefits from separating sensitive and generic workloads. For tool selection within each layer, our cluster on AI tools comparison 2026 is worth a look.
Frequently asked questions
At what company size does running your own LLM server make sense?
There is no universal headcount threshold, because the economics depend on token volume rather than staff size. In our practice, a dedicated GPU server typically pays off when monthly cloud inference spend reaches around €4,000 to €6,000 — equivalent, depending on the model, to roughly 80 to 150 active AI users or a productive RAG system with high document throughput. Below that threshold, cloud is almost always more cost-effective. Above it, data residency considerations also come into play: many mid-market companies choose on-premise even at lower volumes when strictly confidential data is involved.
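The threshold can be cross-checked against the 3-year TCO table in this guide. The sketch below asks at which monthly API spend a cloud-only setup costs the same as the on-premise column, after setting aside the cloud-only costs that are not API consumption. Pairing the low figures with each other and the high figures with each other is a simplifying assumption, as is the function name.

```python
MONTHS = 36


def breakeven_monthly_api_spend(on_prem_total: float, cloud_fixed: float) -> float:
    """Monthly cloud API spend at which both options cost the same over 3 years."""
    return (on_prem_total - cloud_fixed) / MONTHS


# On-premise totals from the TCO table; cloud_fixed is the cloud-only
# internal operations plus one-time setup from the same table.
low = breakeven_monthly_api_spend(161_000, 15_000 + 10_000)
high = breakeven_monthly_api_spend(243_000, 25_000 + 20_000)
print(f"break-even at roughly {low:,.0f} to {high:,.0f} EUR per month")
```

The result lands close to the €4,000 to €6,000 range quoted above, which is where that rule of thumb comes from.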
Which GPU do I need for Llama 3.3 70B or Mixtral 8x22B?
For Llama 3.3 70B in 4-bit quantization, a single NVIDIA H100 with 80 GB or two A100s with 80 GB each is sufficient. In FP16, you need two H100s or four A100s. Mixtral 8x22B is a tighter fit than its active parameter count suggests: all of its roughly 141 billion parameters must sit in GPU memory, which is about 70 GB of weights in 4-bit, so a single H100 80 GB leaves little room for context and two cards are the safer plan; in FP16, plan for at least four H100s. The NVIDIA L40S with 48 GB is a more affordable alternative for smaller models up to around 30 billion parameters and for inference-heavy workloads without training. Rule of thumb for memory sizing: model size in GB ≈ parameter count in billions × bytes per parameter (0.5 for 4-bit, 1 for 8-bit, 2 for FP16), plus 20 to 40 percent headroom for KV cache and context window.
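The rule of thumb translates directly into a small calculator. The sketch reads the quantization factor as bytes per parameter (0.5 for 4-bit, 1 for 8-bit, 2 for FP16), which is an interpretation of the rule rather than a vendor specification; the function name `vram_gb` is made up for this example.

```python
BYTES_PER_PARAM = {"4bit": 0.5, "8bit": 1.0, "fp16": 2.0}


def vram_gb(params_billion: float, precision: str) -> tuple[float, float]:
    """Estimated GPU memory in GB: weights plus 20 to 40 percent headroom."""
    weights = params_billion * BYTES_PER_PARAM[precision]
    return weights * 1.2, weights * 1.4


low, high = vram_gb(70, "4bit")
print(f"Llama 3.3 70B at 4-bit: about {low:.0f} to {high:.0f} GB")
```

Comparing the upper end of the range with the card's memory (80 GB for an H100 or A100, 48 GB for an L40S) gives a quick first answer before any benchmark is run; long context windows push the real requirement toward or beyond the upper end.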
Is Schrems II a mandatory reason to go on-premise?
No. Schrems II concerns the transfer of personal data to third countries, and the EU-US Data Privacy Framework of 2023 has defused the issue for many US providers, as long as they are certified under the framework and offer EU data residency. For most mid-market applications, a cloud LLM with an EU region, a data processing agreement, and documented data handling is sufficient. On-premise only becomes mandatory when sector regulators such as BaFin, BSI, or specific KRITIS rules explicitly require data to remain on-site, or when contractual confidentiality obligations toward customers preclude the use of any external provider.
What does a production-grade GPU server for LLM inference cost for mid-market companies?
The entry-level configuration for a production inference server in 2026 sits at around €15,000 to €25,000 in hardware costs — a server with one NVIDIA L40S 48 GB, 128 GB RAM, and fast NVMe storage. A mid-range configuration with one H100 80 GB or two L40S cards runs €35,000 to €55,000. A full configuration with two H100s or four A100s reaches €80,000 to €130,000. Add power and cooling costs of typically €2,000 to €5,000 per year per GPU, maintenance and software licenses, and internal staff time for operation and updates. A 3-year TCO analysis is needed to reveal the true economic breakeven versus cloud.
What does a sensible hybrid architecture look like in practice?
A proven split for mid-market companies: sensitive data — contracts, HR records, R&D documents, customer data — runs through an on-premise model such as Llama 3.3 or Mistral in an internal RAG system. Generic tasks — content writing, brainstorming, translation, general research — run through a cloud LLM with EU region, typically Claude or GPT-4 via an enterprise plan. Routing is handled by a component that decides per request whether data must stay on-premise or whether cloud is permissible. This router can be a simple classifier based on keywords and metadata, or a small local model that evaluates data sensitivity upfront.
Ready to make a clean architecture decision?
Let's talk for 45 minutes with no obligation. We assess your data sensitivity, realistic token volume, and regulatory requirements — and deliver a concrete architecture recommendation with a hardware list, licensing plan, or hybrid routing concept.
Schedule a 45-minute architecture call