Designing Parallel Processing with Bedrock TPM in Mind
Introduction
Hi, I'm @satococoa from the Acsim development team.
At Acsim, we use Amazon Bedrock as our LLM platform. When operating Bedrock in production, one constraint you cannot avoid is the TPM (Tokens per Minute) limit.
The goal is simple: process LLM calls as quickly as possible. But if you raise concurrency too much, you quickly hit the TPM limit and the whole pipeline stalls.
In this article, I'll share how we designed our parallel processing with TPM limits in mind, and what we learned through implementation.
Understanding Bedrock's rate limit model
Bedrock has several limits, but the most important one here is TPM (Tokens per Minute). This is the maximum number of tokens you can consume per minute, and it directly determines how much concurrency you can run.
TPM varies by model. You can check the exact values in Service Quotas.
How token consumption works
Before implementing anything, you need to understand how tokens are actually consumed.
When you call Bedrock, you specify the maximum number of output tokens (maxTokens). This controls the maximum response length.
According to the official docs, token consumption happens in three phases:
1. At request start
Total input tokens + maxTokens
As soon as you send the request, the sum of input tokens and maxTokens is deducted from TPM. If you're using prompt caching, CacheReadInputTokens and CacheWriteInputTokens are also included. If this deduction exceeds TPM, the request is throttled.
2. During processing
As the model generates output, consumption is periodically adjusted (the exact formula isn't publicly disclosed).
3. At request completion
InputTokenCount + CacheWriteInputTokens + (OutputTokenCount × burndown rate)
The final consumption is fixed, and any unused tokens are returned to TPM. Note that CacheReadInputTokens are not included in this final calculation.
Here's the flow as a diagram (for Claude models, output tokens are multiplied by the burndown rate; the example below assumes a burndown rate of 5, explained later):
What is the burndown rate?
Output tokens consume TPM with a multiplier called the burndown rate. As of January 2026, for recent Claude models (Opus 4, Sonnet 4, etc.), the multiplier is 5x. That means 100 output tokens count as 500 TPM.
This 5x multiplier only applies to TPM consumption. Billing is based on the actual input/output token counts, so you are not charged 5x just because the TPM consumption is 5x.
- Claude Opus 4 / 4.1: 5x
- Claude Sonnet 4 / 4.5: 5x
- Claude 3.7 Sonnet: 5x
- Claude Haiku 4.5: 5x
- Other models: 1x
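Putting the start-of-request deduction, the completion formula, and the burndown rate together, here is a minimal sketch. The function names and example numbers are illustrative, not part of any AWS API:

```python
# Sketch of Bedrock's TPM accounting based on the formulas above.
# Function names and example numbers are illustrative, not an AWS API.

BURNDOWN_RATE = 5  # recent Claude models; 1 for other models

def tpm_deducted_at_start(input_tokens, max_tokens, cache_read=0, cache_write=0):
    """Phase 1: deducted from TPM as soon as the request is sent."""
    return input_tokens + cache_read + cache_write + max_tokens

def tpm_final_consumption(input_tokens, output_tokens, cache_write=0):
    """Phase 3: finalized at completion; cache reads are not counted here."""
    return input_tokens + cache_write + output_tokens * BURNDOWN_RATE

# Example: 1,000 input tokens, maxTokens = 2,000, actual output 300 tokens
start = tpm_deducted_at_start(1000, 2000)   # 3,000 deducted up front
final = tpm_final_consumption(1000, 300)    # 1,000 + 300 * 5 = 2,500
print(start - final)  # → 500 tokens returned to the TPM budget
```

Note how a generous maxTokens inflates the up-front deduction even when the actual output is small.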
maxTokens directly impacts concurrency
The key takeaway here is that maxTokens directly affects how many concurrent requests you can run.
If you set maxTokens too large, the deduction at request start grows, reducing concurrency. If you keep it close to actual output size, you can run more requests within the same TPM budget.
Optimizing maxTokens is the key to maximizing throughput.
Designing parallel processing
How to decide concurrency
The first question when designing concurrency is: “How many requests can we run at once?” A simple estimate is:
Concurrent requests ≈ TPM ÷ (input tokens + maxTokens)
This is only a rough approximation. In production, we estimate maxTokens from historical output sizes and sum across parallel workflow phases (details below).
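As a minimal sketch of this estimate (the function name and numbers are illustrative):

```python
def estimate_concurrency(tpm: int, input_tokens: int, max_tokens: int) -> int:
    """Rough upper bound on simultaneous requests within the TPM budget,
    using the start-of-request deduction (input tokens + maxTokens)."""
    return tpm // (input_tokens + max_tokens)

# e.g. 200,000 TPM, ~1,500 input tokens per request, maxTokens = 2,500
print(estimate_concurrency(200_000, 1_500, 2_500))  # → 50
```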
Real workflows run multiple agents in parallel
We use Mastra to build workflows composed of multiple agents. Some steps run in parallel.
For example, suppose a workflow runs Step 1 first, then Step 2 (Agent B) and Step 3 (Agent C) in parallel, then a final step.
Here, Step 2 and Step 3 run concurrently. That means the effective maximum consumption in this phase is the sum of maxTokens for Agent B and Agent C.
Identify parallel “phases”
To estimate concurrency accurately, we need to identify where parallel execution occurs in the workflow.
We do this as follows:
- Identify the parallel phases in the workflow structure
- Sum maxTokens for agents running concurrently in each phase
- Use the maximum sum across all phases as the worst-case budget
By planning against the worst case, we ensure we never exceed TPM.
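The three steps above can be sketched as follows; the agent names and token counts are made up, with Agent B and Agent C mirroring the parallel step in the earlier example:

```python
# Worst-case phase budget for a workflow (illustrative names and numbers).
# Each phase maps the agents that run concurrently to their maxTokens.

phases = [
    {"agent_a": 1000},                    # Step 1 runs alone
    {"agent_b": 2000, "agent_c": 1500},   # Steps 2 and 3 run in parallel
    {"agent_d": 1200},                    # Final step runs alone
]

# Sum maxTokens within each phase, then take the max across phases
phase_budgets = [sum(p.values()) for p in phases]
worst_case = max(phase_budgets)
print(worst_case)  # → 3500 (the parallel phase dominates)
```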
Dynamically adjust maxTokens
If maxTokens is fixed, it tends to be set too high for safety, which reduces concurrency and throughput.
We dynamically adjust maxTokens based on historical output sizes:
- Record output token counts
- Remove outliers
- Use (max of remaining) × 1.5 as the next maxTokens
Example: if the last 10 outputs are [800, 850, 900, 820, 3000, 870, 810, 890, 840, 860]:
- Remove the outlier (3000)
- Max of remaining is 900
- 900 × 1.5 = 1350 for the next maxTokens
This keeps safety margins while maximizing concurrency.
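A minimal sketch of this adjustment. The article does not specify the exact outlier rule, so dropping samples above 2× the median is an assumption here:

```python
from statistics import median

def next_max_tokens(history: list[int], margin: float = 1.5,
                    outlier_factor: float = 2.0) -> int:
    """Estimate the next maxTokens from recent output token counts.

    Outlier rule is an assumption: drop samples above outlier_factor x median.
    """
    med = median(history)
    kept = [n for n in history if n <= med * outlier_factor]
    return int(max(kept) * margin)

history = [800, 850, 900, 820, 3000, 870, 810, 890, 840, 860]
print(next_max_tokens(history))  # → 1350 (3000 dropped, max 900 x 1.5)
```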
Retry when output is truncated
If maxTokens is too small, responses may be cut off. Bedrock responses include a stopReason field, and truncation is indicated by "max_tokens".
Our retry strategy:
- If stopReason is "max_tokens", retry with maxTokens doubled
- Repeat until reaching the model's maximum output limit
- If it still truncates at the max, treat it as an error
This lets us start with conservative maxTokens to maximize concurrency, while automatically increasing it only when necessary.
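The retry strategy can be sketched as below. This assumes a `call_model(max_tokens)` wrapper that returns the response's stopReason and body; the wrapper and the cap value are illustrative, not the actual Bedrock client API:

```python
MODEL_MAX_OUTPUT = 8192  # model-specific output cap; illustrative value

def call_with_retry(call_model, initial_max_tokens: int):
    """call_model(max_tokens) -> (stop_reason, result); names are assumptions."""
    max_tokens = initial_max_tokens
    while True:
        stop_reason, result = call_model(max_tokens)
        if stop_reason != "max_tokens":
            return result  # finished normally
        if max_tokens >= MODEL_MAX_OUTPUT:
            # Still truncated at the model's limit: treat as an error
            raise RuntimeError("output truncated even at the model's max")
        max_tokens = min(max_tokens * 2, MODEL_MAX_OUTPUT)  # double and retry
```

Starting conservative keeps the up-front TPM deduction small; the doubling path is only paid on the rare truncated request.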
Flow overview:
Scaling concurrency with SQS
In our setup, LLM calls are processed through an SQS queue.
With SQS, the ReceiveMessage API can fetch up to 10 messages at a time. That means to increase concurrency, you need more consumers.
Effective concurrency = Number of consumers × Batch size
We scale the number of consumers dynamically based on the concurrency derived from TPM.
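A small sketch of the consumer-count calculation. The 10-message limit is ReceiveMessage's maximum batch size; the function name is illustrative:

```python
import math

def consumers_needed(target_concurrency: int, batch_size: int = 10) -> int:
    """SQS ReceiveMessage returns at most 10 messages per call, so the
    number of consumers bounds effective concurrency."""
    return math.ceil(target_concurrency / batch_size)

# e.g. a TPM budget that allows ~50 concurrent requests
print(consumers_needed(50))  # → 5 consumers
```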
Summary
When running Bedrock in production, designing parallel processing around TPM limits is essential. Here are the key points we learned:
Token consumption
- At request start, input tokens + maxTokens are deducted from TPM (including cache tokens)
- At request completion, actual consumption is finalized and unused tokens are returned
- Optimizing maxTokens directly controls concurrency
Parallel processing design
- Estimate concurrency from TPM and maximum token consumption
- Identify parallel phases and plan for the worst-case sum
- Dynamically adjust maxTokens based on historical outputs
- Retry with doubled maxTokens when outputs are truncated
- For SQS, scale consumers to increase concurrency
With these practices, you can sustain stable performance without constantly fighting TPM limits.
Acknowledgements
The research, design, and implementation described here were led by @technote-space from the Acsim development team. Thank you!
