Choosing the Right LLM Hosting Strategy
A Practical Comparison of APIs, Cloud Platforms, and Self-Hosted Models

Large language models are moving from experimentation to production across industries. As teams begin to rely on LLMs for customer-facing and mission-critical workflows, a previously secondary question becomes central: where and how should these models be hosted?
The choice of LLM hosting impacts not only cost, but also latency, scalability, data control, and compliance. While managed APIs offer speed and simplicity, cloud platforms and self-hosted deployments provide greater control at the expense of operational complexity. Understanding these trade-offs early can prevent costly re-architecting later.
For early MVPs with a small user base and unpredictable traffic, managed API providers – both proprietary (such as OpenAI, Gemini, and Anthropic) and open-weight platforms (including Groq and Baseten) – offer the fastest and most cost-efficient path to production. These options also meet common compliance requirements such as GDPR and SOC 2, making them suitable for many regulated use cases.
Managed cloud platforms such as Amazon Bedrock, Google Vertex AI, and Azure Foundry provide more flexibility for teams that need regional deployment or private networking. However, more specific needs (e.g., deployment of unsupported models, high auditability, or extensive customization) might require self-hosting an LLM. In that case, serverless GPU platforms offer a lower-cost way to access GPU resources than traditional cloud virtual machines.
This article provides a practical overview of modern LLM hosting options, comparing proprietary and open-weight APIs with managed cloud platforms and self-hosted deployments. The goal is to help teams select configurations that strike a balance between cost, security, performance, and operational complexity.
Proprietary APIs
It's common in the MVP phase to use serverless API providers with per-token pricing, because they offer standard security and compliance programs, including SOC 2 and GDPR-aligned data processing terms. At this stage, we typically expect few users and unpredictable request patterns, so usage-based (per-token) pricing is usually the best option. Leading proprietary providers such as OpenAI, Gemini, Anthropic, and Grok are widely regarded as top-tier in real-world performance and also offer enterprise-grade security and compliance options.
| Provider | GDPR | Data Protection Act 2018 | No training | Encryption | No sharing of data* |
|---|---|---|---|---|---|
| OpenAI | | | | | |
| Gemini | | | | | |
| Anthropic | | | | | |
| Grok | | | | | |
*“No sharing of data” (with third parties) is not strictly accurate for OpenAI, Anthropic, Gemini, or Grok, because all of them rely on subprocessors (cloud infrastructure, monitoring), which are third parties that may process customer data to deliver the service. These proprietary providers don't share user data with third parties except vetted subprocessors under contractual restrictions (a DPA), and only to provide the service.
Open-Weight API Providers
There are several API providers of token-based access to open-weight models. The key advantages of these providers are cost and latency: open models are typically more cost-effective than proprietary ones, and they tend to run significantly faster. However, proprietary models normally outperform open-weight counterparts in complex, multistep, or long-context reasoning.
Services such as DeepInfra, Together AI, Novita, Groq, and Baseten provide inference endpoints: fully managed APIs for running open-weight models (e.g., gpt-oss, Llama, Qwen) without managing GPUs or scaling infrastructure. They offer low latency and high throughput, as demonstrated by recent OpenRouter benchmarks. These platforms also meet enterprise compliance standards such as SOC 2 and GDPR, ensuring data security and privacy for production workloads.
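Most of these providers expose OpenAI-compatible endpoints, so switching from a proprietary API is often just a matter of changing the base URL and model name. Below is a minimal sketch using the OpenAI Python SDK; the base URL and model identifier are placeholders, so check your provider's documentation for the exact values:

```python
# Minimal sketch: calling an open-weight model through an OpenAI-compatible
# endpoint. Base URL and model ID below are placeholders, not real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.your-provider.example/v1",  # provider's OpenAI-compatible endpoint
    api_key="YOUR_PROVIDER_API_KEY",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # model ID format varies by provider
    messages=[{"role": "user", "content": "Summarize GDPR in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```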
| Provider | GDPR | Data Protection Act 2018 | VPC | Encryption | No sharing of data*** |
|---|---|---|---|---|---|
| DeepInfra | | | | | |
| Together AI | | | | | |
| Novita | | | | | |
| Groq | | | ** | | |
| Baseten | | | | | |
*Not mentioned in the privacy policy
**You can set up a PSC (Private Service Connect) endpoint on GCP so that traffic is privately forwarded to Groq, but this adds extra cost and requires your architecture to run on GCP.
***All platforms mentioned above don’t share data with model providers; however, subprocessors may exist under an internal DPA.
Here is a latency comparison of open-weight API providers hosting gpt-oss-20b. gpt-oss-20b was chosen because it's one of the smallest models that is still positioned as strong for tool use and reasoning; OpenAI highlights that gpt-oss-20b delivers results similar to o3-mini on common benchmarks.
| Platform | Model | Latency | TPS | Input price (1M tokens) | Output price (1M tokens) |
|---|---|---|---|---|---|
| | gpt-oss-20b | 0.92s | 215 | $0.032 | $0.12 |
| | gpt-oss-20b | 0.28s | 163 | $0.005 | $0.2 |
| | gpt-oss-20b | 0.23s | 150.6 | $0.03 | $0.14 |
| Groq | gpt-oss-20b | 0.27s | 1356 | $0.075 | $0.3 |
| Baseten | gpt-oss-120b* | 0.09s | 257.3 | $0.1 | $0.5 |
*gpt-oss-120b was used here for comparison because it's the only OpenAI open-weight model Baseten provides.
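If you want to reproduce this kind of comparison for your own workload, a rough approach is to stream a response and time it. The sketch below probes an OpenAI-compatible endpoint; the base URL and model ID are placeholders, chunk counting only approximates token counts, and a real benchmark should average many requests:

```python
# Rough latency / throughput probe for an OpenAI-compatible endpoint.
# Placeholders: base_url, api_key, model. Chunks are treated as ~1 token each.
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.your-provider.example/v1",
                api_key="YOUR_PROVIDER_API_KEY")

def probe(model: str, prompt: str) -> tuple[float, float]:
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # time to first token
            n_tokens += 1
    total = time.perf_counter() - start
    ttft = (first_token_at or time.perf_counter()) - start
    tps = n_tokens / (total - ttft) if total > ttft else 0.0
    return ttft, tps

ttft, tps = probe("openai/gpt-oss-20b", "Explain KV caching in one paragraph.")
print(f"time to first token: {ttft:.2f}s, ~{tps:.0f} tokens/s")
```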
Among the providers offering API access to open-weight LLMs, we recommend Groq. Although it has some of the highest per-token prices, for a relatively small number of requests it's still more cost-effective than hosting your own model, and it delivers significantly lower latency. Groq is also GDPR-compliant.
We would not consider the availability of VPC networking a key decision factor, as it's not a core requirement under GDPR and always adds configuration and cost. For example, on AWS, Google Cloud (Vertex AI), and Azure, VPC connectivity is not enabled by default; it's an optional feature that routes traffic privately within the cloud provider instead of over the public internet, which is useful for strict network-isolation policies but unnecessary for most use cases.
LLM Cloud Platforms
To reduce the risk of data leakage when using proprietary LLMs, you can access them through major cloud providers. For example, AWS explicitly states that with AWS Bedrock, prompts and responses are not used for training and aren't shared with third parties, and that model providers don't have access to the accounts where models are deployed. Azure Foundry takes a similar approach: Microsoft hosts the OpenAI models inside Azure, and the service doesn't interact with OpenAI-operated services. Google Vertex AI offers the same type of managed enterprise access for Gemini.
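For illustration, here is a minimal sketch of calling Claude through Amazon Bedrock's Converse API with boto3; the model ID and region are examples, and your AWS account must have access to that model enabled:

```python
# Minimal sketch: invoking a proprietary model through Amazon Bedrock so
# prompts stay within your AWS account. Model ID and region are examples.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-2")

response = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",  # example ID; check the Bedrock console
    messages=[{
        "role": "user",
        "content": [{"text": "Classify this ticket: 'My card was charged twice.'"}],
    }],
    inferenceConfig={"maxTokens": 256},
)
print(response["output"]["message"]["content"][0]["text"])
```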
| Platform | Model | Region | Input price (1M tokens) | Output price (1M tokens) | Latency | TPS |
|---|---|---|---|---|---|---|
| AWS Bedrock | Claude 4.5 Sonnet | eu-west-2 | $3 | $15 | 3.89s | 85.86 |
| Google Vertex AI | Gemini 2.5 Flash | europe-west2 | $0.30 | $2.5 | 0.53s | 94.44 |
| Azure Foundry | GPT-5.1 | spain-central | $1.38 | $11 | N/A | N/A |
| Platform | GDPR | Data Protection Act 2018 | Encryption at rest / in transit | Private network | No training | No sharing of data |
|---|---|---|---|---|---|---|
| AWS Bedrock | | | | | | Not shared with model providers; subprocessors may exist under AWS terms. |
| Google Vertex AI | | | | | | Subprocessors are possible under Google Cloud terms. |
| Azure Foundry | | | | | | Operated by Microsoft; subprocessors may exist under Microsoft DPA. |
Self-Hosted LLM
Of course, there is the option to host an LLM independently. To run an LLM in real time, we need a GPU because the model performs billions of math operations (matrix multiplications) per request, and GPUs are built to handle that in parallel. The most common GPUs are NVIDIA T4 (16GB), A10G (24GB), L4 (24GB), and A100 (80GB).
We shouldn’t choose the NVIDIA T4 (the cheapest option) because it’s based on the older Turing generation, and many modern inference optimizations are designed for newer GPUs. In particular, FlashAttention and bf16 support require Ampere or newer, so with a T4 you lose key speed features.
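If you want to verify what you've been allocated, here is a minimal sketch using PyTorch (assuming a CUDA build is installed) that checks the compute capability and bf16 support of the current GPU:

```python
# Quick check that the allocated GPU is Ampere or newer (compute capability >= 8.0),
# which FlashAttention and native bf16 rely on. Requires PyTorch with CUDA.
import torch

assert torch.cuda.is_available(), "No CUDA device visible"
major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)

print(f"{name}: compute capability {major}.{minor}")
print("Ampere or newer:", major >= 8)              # T4 (Turing) reports 7.5
print("bf16 supported:", torch.cuda.is_bf16_supported())
```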
Another issue arises when deploying models that require more memory than a single GPU provides. Even if you provision a 4 × T4 (4 × 16 GB) instance to increase total memory, deploying a single LLM across multiple GPUs requires model parallelism, which is complex to configure. In contrast, a single A100 (80 GB) can host models that would otherwise require several T4, A10G, or L4 GPUs, and it also provides much higher memory bandwidth and throughput. Here you can review a benchmark of GPU performance running LLaMA 3.1 (8B parameters).
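To get a feel for whether a model fits on a given GPU, a useful rule of thumb is weights ≈ parameter count × bytes per parameter, plus headroom for the KV cache and activations. A minimal sketch, where the 30% overhead factor is a rough assumption rather than a precise figure:

```python
# Back-of-the-envelope GPU memory estimate: weights + rough overhead for the
# KV cache and activations. The 1.3x overhead factor is an assumption.
def estimate_vram_gb(params_billion: float, bytes_per_param: float = 2.0,
                     overhead: float = 1.3) -> float:
    weights_gb = params_billion * bytes_per_param  # 1B params * 2 bytes ~= 2 GB
    return weights_gb * overhead

for size_b in (8, 20, 70):
    fp16 = estimate_vram_gb(size_b, bytes_per_param=2.0)   # fp16 / bf16 weights
    int4 = estimate_vram_gb(size_b, bytes_per_param=0.5)   # 4-bit quantized weights
    print(f"{size_b}B model: ~{fp16:.0f} GB in fp16, ~{int4:.0f} GB with 4-bit quantization")
```

Under this rough estimate, an 8B model in fp16 already exceeds a single T4's 16 GB, while an A100 (80 GB) handles it comfortably.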
Cloud Hosting
When self-hosting LLMs, a fully self-managed setup typically means running the infrastructure within a cloud provider’s environment. This approach benefits from the provider’s existing security and monitoring tools. However, it also comes with drawbacks: significantly higher operational costs, complex GPU instance management, and VM configuration. In practice, maintaining such infrastructure often requires dedicated DevOps resources and continuous optimization of compute.
You can host the model on a dedicated VM (e.g., AWS EC2) and keep it running 24/7, but then you pay for the GPU even when there are zero users (+ storage and networking). Here is an approximate cost in AWS EU (London) for on-demand instances:
| Example EC2 instance | GPUs | $ / hour | $ / month |
|---|---|---|---|
| g6.12xlarge | 4 × L4 | $5.84 | $4,264 |
| g5.12xlarge | 4 × A10G | $7.20 | $5,256 |
| p4d.24xlarge | 8 × A100* | $28.55 | $20,838 |
*AWS does not offer A100 instances with fewer than 8 GPUs.
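The monthly figures above are essentially the hourly on-demand price multiplied by a full month of uptime (roughly 730 hours); small rounding differences aside, this reproduces the table:

```python
# Always-on monthly cost ~= hourly on-demand price x ~730 hours per month.
hourly_prices = {"g6.12xlarge": 5.84, "g5.12xlarge": 7.20, "p4d.24xlarge": 28.55}
for instance, price in hourly_prices.items():
    print(f"{instance}: ~${price * 730:,.0f} per month")
```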
Serverless GPU Platforms
As you can see, the cost of running dedicated GPU infrastructure 24/7 can become very high, especially when usage is unpredictable and the GPUs sit idle for long periods. A more flexible alternative is to use serverless GPU platforms that let you run GPU workloads without managing servers. The platform scales up with traffic and scales to zero when idle, and you’re billed only for active instance time rather than 24/7 uptime. These platforms provide GPU orchestration (queuing, concurrency, health checks) and snapshot/warm-pool optimizations to cut cold starts. Auto-scaling is one of their crucial benefits: on AWS or GCP you’d need to manage scaling groups and Kubernetes capacity, configure startup settings, and ensure requests finish cleanly during scale-down, whereas on serverless GPU platforms auto-scaling is enabled by default.
The biggest savings come from zero-scaling, when the platform can fully turn off the GPU when there’s no traffic. A key setting here is the idle timeout: how long the system waits after the last request before shutting down the container. If the idle timeout is set too short, the system will shut down too aggressively. That leads to more frequent restarts, and the first request after each pause will be slower because the model needs time to start back up (cold start).
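To get an intuition for this trade-off, here is a toy simulation; the request pattern, serve time, cold-start duration, and hourly price are all illustrative assumptions rather than real platform numbers:

```python
# Toy simulation: how the idle timeout affects billed GPU time and cold starts.
# Model: after each request the container stays warm for serve_time_s + idle_timeout_s,
# then scales to zero. All numbers are assumptions.
def simulate(request_times_s, idle_timeout_s, serve_time_s=10, cold_start_s=20,
             price_per_hour=2.5):
    cold_starts, billed_s, warm_until = 0, 0.0, None
    for t in sorted(request_times_s):
        end = t + serve_time_s + idle_timeout_s   # container stays up until here
        if warm_until is None or t > warm_until:  # container was scaled to zero
            cold_starts += 1
            billed_s += cold_start_s              # assume the cold start itself is billed
            billed_s += end - t
        else:                                     # container was still warm
            billed_s += max(0.0, end - warm_until)
        warm_until = max(warm_until or end, end)
    return cold_starts, billed_s / 3600 * price_per_hour

# 10 bursts per day, 10 requests per burst arriving 60 s apart (illustrative).
requests = [burst * 8640 + i * 60 for burst in range(10) for i in range(10)]
for timeout in (30, 300, 1800):
    starts, cost = simulate(requests, idle_timeout_s=timeout)
    print(f"idle timeout {timeout:>4}s -> {starts} cold starts, ~${cost:.2f}/day")
```

A short timeout minimizes billed idle time but forces a cold start on almost every burst; a long timeout removes most cold starts but bills hours of idle GPU time.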
| Platform | Avg cold start time | Worst-case cold start time | Auto-scaling | GDPR | Billing for the cold start |
|---|---|---|---|---|---|
| | 10 | 60 | | | paid |
| | 4 | 40 | | | paid |
| | 2 | 10 | | | free |
| | 5 | 30 | | | free |
| | 5 | 30 | | | paid |
| | 15 | 60 | | | paid |
*not specified on the website, can be requested from the sales team
**DeepInfra supports zero-scaling, but its own autoscaler makes the decision. It monitors traffic to the endpoint, and when there are no active requests or queues within an internal idle window, it reduces the number of instances to the min_instances value you configure. This makes it difficult to estimate the budget and to predict the number of restarts per day.
Total monthly cost can vary significantly depending on whether your traffic is steady (GPU always on) or bursty (GPU sleeps between sessions). Therefore, you must estimate pricing under different activity scenarios to understand the real cost trade-off between latency, uptime, and budget.
Here are several scenarios and how their costs will be calculated.
| № | Scenario | Description | Formula | Use Case |
|---|---|---|---|---|
| 1 | Always-On GPU | The model receives steady traffic, and the GPU never scales to zero. | Monthly cost = hourly GPU price × 24 h × 30 days | This represents the upper bound: maximum cost, but no cold-start latency. |
| 2 | Batch activity | Traffic comes in batches. For example, 10K requests are split into 20 sessions, with complete silence between sessions. | One session handles 10000 / 20 = 500 requests, so monthly cost = hourly GPU price × 20 sessions × session duration | This is a case where the GPU remains active only for brief periods. |
| 3 | Mixed Pattern | The GPU is active for about 12 hours per day and scaled down for the remaining 12 hours. | Monthly cost = hourly GPU price × 12 h × 30 days | This represents the typical use case, with some cold-start latency. |
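The sketch below shows how these formulas translate into code; the $2.50/hour GPU price and the 18-second per-request time in Scenario 2 are illustrative assumptions, not a specific provider's rates:

```python
# Monthly cost under the three activity scenarios, given an hourly GPU price.
# The price and per-request time below are illustrative assumptions.
HOURLY_PRICE = 2.50        # $/hour for a single GPU (assumption)

def always_on() -> float:                                   # Scenario 1
    return HOURLY_PRICE * 24 * 30

def batch(requests=10_000, sessions=20, seconds_per_request=18) -> float:  # Scenario 2
    session_hours = (requests / sessions) * seconds_per_request / 3600
    return HOURLY_PRICE * sessions * session_hours

def mixed(active_hours_per_day=12) -> float:                # Scenario 3
    return HOURLY_PRICE * active_hours_per_day * 30

print(f"Scenario 1 (always-on): ${always_on():,.0f}/month")
print(f"Scenario 2 (batch):     ${batch():,.0f}/month")
print(f"Scenario 3 (mixed):     ${mixed():,.0f}/month")
```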
| GPU: A100 | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|
| | $2880 | $206 | $1440 |
| Cerebrium | $1482 | $106 | $741 |
| Modal | $1798 | $129 | $899 |
| RunPod | $1244 | $89 | $622 |
| | $1439 | $103 | $719 |
| | $640 | | |
| GPU: L4 | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|
| | $611 | $44 | $305 |
| Cerebrium | $575 | $41 | $288 |
| Modal | $578 | $42 | $289 |
| RunPod | $311 | $22 | $155 |
| | $503 | $36 | $251 |
| GPU: A10G | Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|---|
| | $869 | $62 | $434 |
| Cerebrium | $793 | $57 | $396 |
| Modal | $795 | $58 | $398 |
We recommend Modal as the primary option. Although it's one of the most expensive serverless GPU solutions, it remains significantly cheaper than fully self-hosting on AWS or GCP. Modal is also widely adopted in the industry, offers strong community and enterprise support, and provides well-documented developer guides, making it a reliable choice.
As a second option, we suggest RunPod. It complies with security and privacy standards, offers lower pricing, and includes built-in support for vLLM, which simplifies deployment. However, RunPod is still relatively new to the market, so we can't yet predict how responsive and consistent their support will be in case of operational issues.
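For reference, serving an open-weight model on a serverless GPU platform or your own VM typically boils down to a few lines of vLLM. The sketch below uses offline batch inference with an example model ID; recent vLLM versions can also expose the same model behind an OpenAI-compatible server with `vllm serve`.

```python
# Minimal sketch: running an open-weight model with vLLM (offline batch inference).
# The model ID is an example; pick one that fits your GPU's memory.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")   # downloads weights from Hugging Face
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Draft a polite reply to a refund request."], params)
print(outputs[0].outputs[0].text)
```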
Pricing Comparison
API providers charge by the number of tokens processed per request. To build an intuition for monthly cost, we assume average usage of (10K input + 10K output tokens) × 20K requests per month:
| Provider | Model | Input (1M tokens) | Output (1M tokens) | Cost per month |
|---|---|---|---|---|
| AWS Bedrock | Sonnet 4.5 | $3 | $15 | $3600 |
| Anthropic API | Sonnet 4.5 | $3 | $15 | $3600 |
| AWS Bedrock | gpt-oss-120b | $0.23 | $0.93 | $232 |
| AWS Bedrock | gpt-oss-20b | $0.11 | $0.47 | $116 |
| Google Vertex AI | Gemini-2.5-Flash | $0.3 | $2.5 | $560 |
| Gemini API | Gemini-2.5-Flash | $0.3 | $2.5 | $560 |
| Azure Foundry | GPT-5.1 | $1.38 | $11 | $2476 |
| OpenAI API | GPT-5.1 | $1.25 | $10 | $2250 |
| Groq | gpt-oss-20b | $0.075 | $0.3 | $75 |
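As a sanity check, these monthly figures follow directly from the assumed volume: 20K requests × 10K tokens on each side is 200M input and 200M output tokens per month.

```python
# Monthly API cost = (input tokens / 1M) * input price + (output tokens / 1M) * output price.
# Assumed volume: 20K requests/month, 10K input + 10K output tokens per request.
REQUESTS = 20_000
TOKENS_PER_SIDE = 10_000
millions = REQUESTS * TOKENS_PER_SIDE / 1e6   # 200M tokens in, 200M tokens out

def monthly_cost(input_price_per_1m: float, output_price_per_1m: float) -> float:
    return millions * input_price_per_1m + millions * output_price_per_1m

print(f"Claude Sonnet 4.5: ${monthly_cost(3, 15):,.0f}")      # -> $3,600
print(f"Gemini 2.5 Flash:  ${monthly_cost(0.30, 2.5):,.0f}")  # -> $560
print(f"Groq gpt-oss-20b:  ${monthly_cost(0.075, 0.3):,.0f}") # -> $75
```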
For self-hosted setups, let's assume that we have a mixed pattern, and the GPU is active for about 12 hours per day. Then the minimal monthly cost (only one active GPU, one concurrent user request) would be:
| Provider | GPUs | Provider type | Instance type | Cost per month |
|---|---|---|---|---|
| AWS EC2 | 8 × A100 | VM Instance | p4d.24xlarge | $20838 |
| GCP VMs | 1 × A100 | VM Instance | a2-highgpu-1g | $4765 |
| Modal | 1 × A100 | Serverless GPU | | $899 |
| RunPod | 1 × A100 | Serverless GPU | | $622 |
| Cerebrium | 1 × A100 | Serverless GPU | | $741 |
| AWS EC2 | 4 × A10G | VM Instance | g5.12xlarge | $5256 |
| Modal | 1 × A10G | Serverless GPU | | $398 |
| Cerebrium | 1 × A10G | Serverless GPU | | $396 |
| AWS EC2 | 4 × L4 | VM Instance | g6.12xlarge | $4264 |
| GCP VMs | 1 × L4 | VM Instance | g2-standard-4 | $825 |
| Modal | 1 × L4 | Serverless GPU | | $289 |
| RunPod | 1 × L4 | Serverless GPU | | $155 |
| Cerebrium | 1 × L4 | Serverless GPU | | $288 |
Conclusion
There's no single “best” way to host LLMs: each team chooses its own approach based on the current project phase, constraints, and long-term goals. Managed APIs enable immediate testing and development with accelerated launch timelines, while cloud platforms strike a balance between control and ease of operation. By contrast, self-hosted deployment provides the greatest freedom but brings increased responsibility for infrastructure and maintenance.
Importantly, hosting LLMs is rarely a one-time decision for product teams. As models move from prototype to production, compliance and customization requirements change, and usage patterns shift. By selecting a hosting model with full knowledge of the trade-offs, product teams can minimize future migration expenses and operational risk.
Because hosting LLMs is fundamentally an architectural decision rather than merely an implementation detail, treating it as such lets organizations design systems that scale sustainably, comply with applicable regulations, and deliver dependable, consistent performance across the full lifecycle of their AI workloads, from prototyping through production.


