Choosing the Right LLM Hosting Strategy
A Practical Comparison of APIs, Cloud Platforms, and Self-Hosted Models

Large language models are moving from experimentation to production across industries. As teams begin to rely on LLMs for customer-facing and mission-critical workflows, a previously secondary question becomes central: where and how should these models be hosted?
The choice of LLM hosting impacts not only cost, but also latency, scalability, data control, and compliance. While managed APIs offer speed and simplicity, cloud platforms and self-hosted deployments provide greater control at the expense of operational complexity. Understanding these trade-offs early can prevent costly re-architecting later.
For early MVPs with a small user base and unpredictable traffic, managed API providers – both proprietary (such as OpenAI, Gemini, and Anthropic) and open-weight platforms (including Groq and Baseten) – offer the fastest and most cost-efficient path to production. These options also meet common compliance requirements such as GDPR and SOC 2, making them suitable for many regulated use cases.
Managed cloud platforms such as Amazon Bedrock, Google Vertex AI, and Azure Foundry provide more flexibility for teams that need regional deployment or private networking. However, more specific needs (e.g., deployment of unsupported models, high auditability, or extensive customization) might require self-hosting an LLM. In that case, serverless GPU platforms offer a lower-cost way to access GPU resources than traditional cloud virtual machines.
This article provides a practical overview of modern LLM hosting options, comparing proprietary and open-weight APIs with managed cloud platforms and self-hosted deployments. The goal is to help teams select configurations that strike a balance between cost, security, performance, and operational complexity.
Proprietary APIs
It's common in the MVP phase to use serverless API providers with per-token pricing, because they offer standard security and compliance programs, including SOC 2 and GDPR-aligned data processing terms. At this stage, we typically expect few users and unpredictable request patterns, so usage-based (per-token) pricing is usually the best option. Leading proprietary providers such as OpenAI, Gemini, Anthropic, and Grok are widely regarded as top-tier in real-world performance and also offer enterprise-grade security and compliance options.
| Provider | GDPR | Data Protection Act 2018 | No training | Encryption | No sharing of data* |
|---|---|---|---|---|---|
| OpenAI | | | | | |
| Gemini | | | | | |
| Anthropic | | | | | |
| Grok | | | | | |
*“No sharing of data” (with third parties) is not strictly accurate for OpenAI, Anthropic, Gemini, or Grok, because all of them rely on subprocessors (cloud infrastructure, monitoring), which are third parties that may process customer data to deliver the service. These proprietary providers don't share user data with third parties except vetted subprocessors under contractual restrictions (a DPA), and only to provide the service.
Open-Weight API Providers
There are several API providers of token-based access to open-weight models. The key advantages of these providers are cost and latency: open models are typically more cost-effective than proprietary ones, and they tend to run significantly faster. However, proprietary models normally outperform open-weight counterparts in complex, multistep, or long-context reasoning.
Services such as DeepInfra, Together AI, Novita, Groq, and Baseten provide inference endpoints: fully managed APIs for running open-weight models (e.g., gpt-oss, Llama, Qwen) without managing GPUs or scaling infrastructure. They offer low latency and high throughput, as demonstrated by recent OpenRouter benchmarks. These platforms also meet enterprise compliance standards such as SOC 2 and GDPR, ensuring data security and privacy for production workloads.
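Most of these providers expose OpenAI-compatible endpoints, so switching from a proprietary API is often just a matter of changing the base URL and model name. Below is a minimal sketch using the OpenAI Python SDK; the base URL and model identifier are placeholders, so check your provider's documentation for the exact values:

```python
# Minimal sketch: calling an open-weight model through an OpenAI-compatible
# endpoint. Base URL and model ID below are placeholders, not real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.your-provider.example/v1",  # provider's OpenAI-compatible endpoint
    api_key="YOUR_PROVIDER_API_KEY",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # model ID format varies by provider
    messages=[{"role": "user", "content": "Summarize GDPR in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```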
| Provider | GDPR | Data Protection Act 2018 | VPC | Encryption | No sharing of data*** |
|---|---|---|---|---|---|
| DeepInfra | | | | | |
| Together AI | | | | | |
| Novita | | | | | |
| Groq | | | ** | | |
| Baseten | | | | | |
*Not mentioned in the privacy policy
**You can set up a PSC (Private Service Connect) endpoint on GCP so that traffic is privately forwarded to Groq, but this adds extra cost and requires your architecture to run on GCP.
***All platforms mentioned above don’t share data with model providers; however, subprocessors may exist under an internal DPA.
Here is a latency comparison of open-weight API providers hosting gpt-oss-20b. gpt-oss-20b was chosen because it's one of the smallest models that is still positioned as strong for tool use and reasoning; OpenAI highlights that gpt-oss-20b delivers results similar to o3-mini on common benchmarks.
| Platform | Model | Latency | TPS | Input price (1M tokens) | Output price (1M tokens) |
|---|---|---|---|---|---|
| | gpt-oss-20b | 0.92s | 215 | $0.032 | $0.12 |
| | gpt-oss-20b | 0.28s | 163 | $0.005 | $0.2 |
| | gpt-oss-20b | 0.23s | 150.6 | $0.03 | $0.14 |
| Groq | gpt-oss-20b | 0.27s | 1356 | $0.075 | $0.3 |
| Baseten | gpt-oss-120b* | 0.09s | 257.3 | $0.1 | $0.5 |
*gpt-oss-120b was used here for comparison because it's the only OpenAI open-weight model Baseten provides.
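If you want to reproduce this kind of comparison for your own workload, a rough approach is to stream a response and time it. The sketch below probes an OpenAI-compatible endpoint; the base URL and model ID are placeholders, chunk counting only approximates token counts, and a real benchmark should average many requests:

```python
# Rough latency / throughput probe for an OpenAI-compatible endpoint.
# Placeholders: base_url, api_key, model. Chunks are treated as ~1 token each.
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.your-provider.example/v1",
                api_key="YOUR_PROVIDER_API_KEY")

def probe(model: str, prompt: str) -> tuple[float, float]:
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # time to first token
            n_tokens += 1
    total = time.perf_counter() - start
    ttft = (first_token_at or time.perf_counter()) - start
    tps = n_tokens / (total - ttft) if total > ttft else 0.0
    return ttft, tps

ttft, tps = probe("openai/gpt-oss-20b", "Explain KV caching in one paragraph.")
print(f"time to first token: {ttft:.2f}s, ~{tps:.0f} tokens/s")
```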
Among the providers offering API access to open-weight LLMs, we recommend Groq. Although it has some of the highest per-token prices, for a relatively small number of requests it's still more cost-effective than hosting your own model, and it delivers significantly lower latency. Groq is also GDPR-compliant.
We would not consider the availability of VPC networking a key decision factor, as it's not a core requirement under GDPR and always adds configuration and cost. For example, on AWS, Google Cloud (Vertex AI), and Azure, VPC connectivity is not enabled by default; it's an optional feature that routes traffic privately within the cloud provider instead of over the public internet, which is useful for strict network-isolation policies but unnecessary for most use cases.
LLM Cloud Platforms
To reduce the risk of data leakage when using proprietary LLMs, you can access them through major cloud providers. For example, AWS explicitly states that with AWS Bedrock, prompts and responses are not used for training and aren't shared with third parties, and that model providers don't have access to the accounts where models are deployed. Azure Foundry takes a similar approach: Microsoft hosts the OpenAI models inside Azure, and the service doesn't interact with OpenAI-operated services. Google Vertex AI offers the same type of managed enterprise access for Gemini.
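For illustration, here is a minimal sketch of calling Claude through Amazon Bedrock's Converse API with boto3; the model ID and region are examples, and your AWS account must have access to that model enabled:

```python
# Minimal sketch: invoking a proprietary model through Amazon Bedrock so
# prompts stay within your AWS account. Model ID and region are examples.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-2")

response = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",  # example ID; check the Bedrock console
    messages=[{
        "role": "user",
        "content": [{"text": "Classify this ticket: 'My card was charged twice.'"}],
    }],
    inferenceConfig={"maxTokens": 256},
)
print(response["output"]["message"]["content"][0]["text"])
```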
| Platform | Model | Region | Input price (1M tokens) | Output price (1M tokens) | Latency | TPS |
|---|---|---|---|---|---|---|
| AWS Bedrock | Claude 4.5 Sonnet | eu-west-2 | $3 | $15 | 3.89s | 85.86 |
| Google Vertex AI | Gemini 2.5 Flash | europe-west2 | $0.30 | $2.5 | 0.53s | 94.44 |
| Azure Foundry | GPT-5.1 | spain-central | $1.38 | $11 | N/A | N/A |
| Platform | GDPR | Data Protection Act 2018 | Encryption at rest / in transit | Private network | No training | No sharing of data |
|---|---|---|---|---|---|---|
| AWS Bedrock | | | | | | Not shared with model providers; subprocessors may exist under AWS terms. |
| Google Vertex AI | | | | | | Subprocessors are possible under Google Cloud terms. |
| Azure Foundry | | | | | | Operated by Microsoft; subprocessors may exist under Microsoft DPA. |
Self-Hosted LLM
Of course, there is the option to host an LLM independently. To run an LLM in real time, we need a GPU because the model performs billions of math operations (matrix multiplications) per request, and GPUs are built to handle that in parallel. The most common GPUs are NVIDIA T4 (16GB), A10G (24GB), L4 (24GB), and A100 (80GB).
We shouldn’t choose the NVIDIA T4 (the cheapest option) because it’s based on the older Turing generation, and many modern inference optimizations are designed for newer GPUs. In particular, FlashAttention and bf16 support require Ampere or newer, so with a T4 you lose key speed features.
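If you want to verify what you've been allocated, here is a minimal sketch using PyTorch (assuming a CUDA build is installed) that checks the compute capability and bf16 support of the current GPU:

```python
# Quick check that the allocated GPU is Ampere or newer (compute capability >= 8.0),
# which FlashAttention and native bf16 rely on. Requires PyTorch with CUDA.
import torch

assert torch.cuda.is_available(), "No CUDA device visible"
major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)

print(f"{name}: compute capability {major}.{minor}")
print("Ampere or newer:", major >= 8)              # T4 (Turing) reports 7.5
print("bf16 supported:", torch.cuda.is_bf16_supported())
```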
Another issue arises when deploying models that require more memory than a single GPU provides. Even if you provision a 4 × T4 (4 × 16 GB) instance to increase total memory, deploying a single LLM across multiple GPUs requires model parallelism, which is complex to configure. In contrast, a single A100 (80 GB) can host models that would otherwise require several T4, A10G, or L4 GPUs, and it also provides much higher memory bandwidth and throughput. Here you can review a benchmark of GPU performance running LLaMA 3.1 (8B parameters).
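To get a feel for whether a model fits on a given GPU, a useful rule of thumb is weights ≈ parameter count × bytes per parameter, plus headroom for the KV cache and activations. A minimal sketch, where the 30% overhead factor is a rough assumption rather than a precise figure:

```python
# Back-of-the-envelope GPU memory estimate: weights + rough overhead for the
# KV cache and activations. The 1.3x overhead factor is an assumption.
def estimate_vram_gb(params_billion: float, bytes_per_param: float = 2.0,
                     overhead: float = 1.3) -> float:
    weights_gb = params_billion * bytes_per_param  # 1B params * 2 bytes ~= 2 GB
    return weights_gb * overhead

for size_b in (8, 20, 70):
    fp16 = estimate_vram_gb(size_b, bytes_per_param=2.0)   # fp16 / bf16 weights
    int4 = estimate_vram_gb(size_b, bytes_per_param=0.5)   # 4-bit quantized weights
    print(f"{size_b}B model: ~{fp16:.0f} GB in fp16, ~{int4:.0f} GB with 4-bit quantization")
```

Under this rough estimate, an 8B model in fp16 already exceeds a single T4's 16 GB, while an A100 (80 GB) handles it comfortably.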
Cloud Hosting
When self-hosting LLMs, a fully self-managed setup typically means running the infrastructure within a cloud provider’s environment. This approach benefits from the provider’s existing security and monitoring tools. However, it also comes with drawbacks: significantly higher operational costs, complex GPU instance management, and VM configuration. In practice, maintaining such infrastructure often requires dedicated DevOps resources and continuous optimization of compute.
You can host the model on a dedicated VM (e.g., AWS EC2) and keep it running 24/7, but then you pay for the GPU even when there are zero users (+ storage and networking). Here is an approximate cost in AWS EU (London) for on-demand instances:
| Example EC2 instance | GPUs | $ / hour | $ / month |
|---|---|---|---|
| g6.12xlarge | 4 × L4 | $5.84 | $4,264 |
| g5.12xlarge | 4 × A10G | $7.20 | $5,256 |
| p4d.24xlarge | 8 × A100* | $28.55 | $20,838 |
*AWS does not offer A100 instances with fewer than 8 GPUs.
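The monthly figures above are essentially the hourly on-demand price multiplied by a full month of uptime (roughly 730 hours); small rounding differences aside, this reproduces the table:

```python
# Always-on monthly cost ~= hourly on-demand price x ~730 hours per month.
hourly_prices = {"g6.12xlarge": 5.84, "g5.12xlarge": 7.20, "p4d.24xlarge": 28.55}
for instance, price in hourly_prices.items():
    print(f"{instance}: ~${price * 730:,.0f} per month")
```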
Serverless GPU Platforms
As you can see, the cost of running dedicated GPU infrastructure 24/7 can become very high, especially when usage is unpredictable and the GPUs sit idle for long periods. A more flexible alternative is to use serverless GPU platforms that let you run GPU workloads without managing servers. The platform scales up with traffic and scales to zero when idle, and you’re billed only for active instance time rather than 24/7 uptime. These platforms provide GPU orchestration (queuing, concurrency, health checks) and snapshot/warm-pool optimizations to cut cold starts. Auto-scaling is one of their crucial benefits: on AWS or GCP you’d need to manage scaling groups and Kubernetes capacity, configure startup settings, and ensure requests finish cleanly during scale-down, whereas on serverless GPU platforms auto-scaling is enabled by default.
The biggest savings come from zero-scaling, when the platform can fully turn off the GPU when there’s no traffic. A key setting here is the idle timeout: how long the system waits after the last request before shutting down the container. If the idle timeout is set too short, the system will shut down too aggressively. That leads to more frequent restarts, and the first request after each pause will be slower because the model needs time to start back up (cold start).
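To get an intuition for this trade-off, here is a toy simulation; the request pattern, serve time, cold-start duration, and hourly price are all illustrative assumptions rather than real platform numbers:

```python
# Toy simulation: how the idle timeout affects billed GPU time and cold starts.
# Model: after each request the container stays warm for serve_time_s + idle_timeout_s,
# then scales to zero. All numbers are assumptions.
def simulate(request_times_s, idle_timeout_s, serve_time_s=10, cold_start_s=20,
             price_per_hour=2.5):
    cold_starts, billed_s, warm_until = 0, 0.0, None
    for t in sorted(request_times_s):
        end = t + serve_time_s + idle_timeout_s   # container stays up until here
        if warm_until is None or t > warm_until:  # container was scaled to zero
            cold_starts += 1
            billed_s += cold_start_s              # assume the cold start itself is billed
            billed_s += end - t
        else:                                     # container was still warm
            billed_s += max(0.0, end - warm_until)
        warm_until = max(warm_until or end, end)
    return cold_starts, billed_s / 3600 * price_per_hour

# 10 bursts per day, 10 requests per burst arriving 60 s apart (illustrative).
requests = [burst * 8640 + i * 60 for burst in range(10) for i in range(10)]
for timeout in (30, 300, 1800):
    starts, cost = simulate(requests, idle_timeout_s=timeout)
    print(f"idle timeout {timeout:>4}s -> {starts} cold starts, ~${cost:.2f}/day")
```

A short timeout minimizes billed idle time but forces a cold start on almost every burst; a long timeout removes most cold starts but bills hours of idle GPU time.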
| Platform | Avg cold start time | Worst-case cold start time | Auto-scaling | GDPR | Billing for the cold start |
|---|---|---|---|---|---|
| | 10 | 60 | | | paid |
| | 4 | 40 | | | paid |
| | 2 | 10 | | | free |
| | 5 | 30 | | | free |
| | 5 | 30 | | | paid |
| | 15 | 60 | | | paid |
*not specified on the website, can be requested from the sales team
**DeepInfra supports zero-scaling, but its own autoscaler makes the decision. It monitors traffic to the endpoint, and when there are no active requests or queues within an internal idle window, it reduces the number of instances to the min_instances value you configure. This makes it difficult to estimate the budget and to predict the number of restarts per day.
Total monthly cost can vary significantly depending on whether your traffic is steady (GPU always on) or bursty (GPU sleeps between sessions). Therefore, you must estimate pricing under different activity scenarios to understand the real cost trade-off between latency, uptime, and budget.
Here are several scenarios and how their costs will be calculated.
| № | Scenario | Description | Formula | Use Case |
|---|---|---|---|---|
| 1 | Always-On GPU | The model receives steady traffic, and the GPU never scales to zero. | Monthly cost = hourly GPU price × 24 h × 30 days | This represents the upper bound: maximum cost, but no cold-start latency. |
| 2 | Batch activity | Traffic comes in batches. For example, 10K requests are split into 20 sessions, with complete silence between sessions. | One session handles 10000 / 20 = 500 requests, so monthly cost = hourly GPU price × 20 sessions × session duration | This is a case where the GPU remains active only for brief periods. |
| 3 | Mixed Pattern | The GPU is active for about 12 hours per day and scaled down for the remaining 12 hours. | Monthly cost = hourly GPU price × 12 h × 30 days | This represents the typical use case, with some cold-start latency. |
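The sketch below shows how these formulas translate into code; the $2.50/hour GPU price and the 18-second per-request time in Scenario 2 are illustrative assumptions, not a specific provider's rates:

```python
# Monthly cost under the three activity scenarios, given an hourly GPU price.
# The price and per-request time below are illustrative assumptions.
HOURLY_PRICE = 2.50        # $/hour for a single GPU (assumption)

def always_on() -> float:                                   # Scenario 1
    return HOURLY_PRICE * 24 * 30

def batch(requests=10_000, sessions=20, seconds_per_request=18) -> float:  # Scenario 2
    session_hours = (requests / sessions) * seconds_per_request / 3600
    return HOURLY_PRICE * sessions * session_hours

def mixed(active_hours_per_day=12) -> float:                # Scenario 3
    return HOURLY_PRICE * active_hours_per_day * 30

print(f"Scenario 1 (always-on): ${always_on():,.0f}/month")
print(f"Scenario 2 (batch):     ${batch():,.0f}/month")
print(f"Scenario 3 (mixed):     ${mixed():,.0f}/month")
```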
| GPU: A100 | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|
| | $2880 | $206 | $1440 |
| Cerebrium | $1482 | $106 | $741 |
| Modal | $1798 | $129 | $899 |
| RunPod | $1244 | $89 | $622 |
| | $1439 | $103 | $719 |
| | $640 | | |
| GPU: L4 | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|
| | $611 | $44 | $305 |
| Cerebrium | $575 | $41 | $288 |
| Modal | $578 | $42 | $289 |
| RunPod | $311 | $22 | $155 |
| | $503 | $36 | $251 |
| GPU: A10G | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|
| | $869 | $62 | $434 |
| Cerebrium | $793 | $57 | $396 |
| Modal | $795 | $58 | $398 |
We recommend Modal as the primary option. Although it's one of the most expensive serverless GPU solutions, it remains significantly cheaper than fully self-hosting on AWS or GCP. Modal is also widely adopted in the industry, offers strong community and enterprise support, and provides well-documented developer guides, making it a reliable choice.
As a second option, we suggest RunPod. It complies with security and privacy standards, offers lower pricing, and includes built-in support for vLLM, which simplifies deployment. However, RunPod is still relatively new to the market, so we can't yet predict how responsive and consistent their support will be in case of operational issues.
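For reference, serving an open-weight model on a serverless GPU platform or your own VM typically boils down to a few lines of vLLM. The sketch below uses offline batch inference with an example model ID; recent vLLM versions can also expose the same model behind an OpenAI-compatible server with `vllm serve`.

```python
# Minimal sketch: running an open-weight model with vLLM (offline batch inference).
# The model ID is an example; pick one that fits your GPU's memory.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")   # downloads weights from Hugging Face
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Draft a polite reply to a refund request."], params)
print(outputs[0].outputs[0].text)
```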
Pricing Comparison
API providers charge by the number of tokens processed per request. To build an intuition for monthly cost, we assume average usage of (10K input + 10K output tokens) × 20K requests per month:
| Provider | Model | Input (1M tokens) | Output (1M tokens) | Cost per month |
|---|---|---|---|---|
| AWS Bedrock | Sonnet 4.5 | $3 | $15 | $3600 |
| Anthropic API | Sonnet 4.5 | $3 | $15 | $3600 |
| AWS Bedrock | gpt-oss-120b | $0.23 | $0.93 | $232 |
| AWS Bedrock | gpt-oss-20b | $0.11 | $0.47 | $116 |
| Google Vertex AI | Gemini-2.5-Flash | $0.3 | $2.5 | $560 |
| Gemini API | Gemini-2.5-Flash | $0.3 | $2.5 | $560 |
| Azure Foundry | GPT-5.1 | $1.38 | $11 | $2476 |
| OpenAI API | GPT-5.1 | $1.25 | $10 | $2250 |
| Groq | gpt-oss-20b | $0.075 | $0.3 | $75 |
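As a sanity check, these monthly figures follow directly from the assumed volume: 20K requests × 10K tokens on each side is 200M input and 200M output tokens per month.

```python
# Monthly API cost = (input tokens / 1M) * input price + (output tokens / 1M) * output price.
# Assumed volume: 20K requests/month, 10K input + 10K output tokens per request.
REQUESTS = 20_000
TOKENS_PER_SIDE = 10_000
millions = REQUESTS * TOKENS_PER_SIDE / 1e6   # 200M tokens in, 200M tokens out

def monthly_cost(input_price_per_1m: float, output_price_per_1m: float) -> float:
    return millions * input_price_per_1m + millions * output_price_per_1m

print(f"Claude Sonnet 4.5: ${monthly_cost(3, 15):,.0f}")      # -> $3,600
print(f"Gemini 2.5 Flash:  ${monthly_cost(0.30, 2.5):,.0f}")  # -> $560
print(f"Groq gpt-oss-20b:  ${monthly_cost(0.075, 0.3):,.0f}") # -> $75
```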
For self-hosted setups, let's assume that we have a mixed pattern, and the GPU is active for about 12 hours per day. Then the minimal monthly cost (only one active GPU, one concurrent user request) would be:
| Provider | GPUs | Provider type | Instance type | Cost per month |
|---|---|---|---|---|
| AWS EC2 | 8 × A100 | VM Instance | p4d.24xlarge | $20838 |
| GCP VMs | 1 × A100 | VM Instance | a2-highgpu-1g | $4765 |
| Modal | 1 × A100 | Serverless GPU | | $899 |
| RunPod | 1 × A100 | Serverless GPU | | $622 |
| Cerebrium | 1 × A100 | Serverless GPU | | $741 |
| AWS EC2 | 4 × A10G | VM Instance | g5.12xlarge | $5256 |
| Modal | 1 × A10G | Serverless GPU | | $398 |
| Cerebrium | 1 × A10G | Serverless GPU | | $396 |
| AWS EC2 | 4 × L4 | VM Instance | g6.12xlarge | $4264 |
| GCP VMs | 1 × L4 | VM Instance | g2-standard-4 | $825 |
| Modal | 1 × L4 | Serverless GPU | | $289 |
| RunPod | 1 × L4 | Serverless GPU | | $155 |
| Cerebrium | 1 × L4 | Serverless GPU | | $288 |
Conclusion
There's no single “best” way to host LLMs: each team chooses its own approach based on the current project phase, constraints, and long-term goals. Managed APIs enable immediate testing and development with accelerated launch timelines, while cloud platforms strike a balance between control and ease of operation. By contrast, self-hosted deployment provides the greatest freedom but brings increased responsibility for infrastructure and maintenance.
Importantly, hosting LLMs is rarely a one-time decision for product teams. As models move from prototype to production, compliance and customization requirements change, and usage patterns shift. By selecting a hosting model with full knowledge of the trade-offs, product teams can minimize future migration expenses and operational risk.
Because hosting LLMs is fundamentally an architectural decision rather than merely an implementation detail, treating it as such lets organizations design systems that scale sustainably, comply with applicable regulations, and deliver dependable, consistent performance across the full lifecycle of their AI workloads, from prototyping through production.


