Engineering

Designing Parallel Processing with Bedrock TPM in Mind

#LLM#AWS#Bedrock#SQS


Introduction

Hi, I'm @satococoa from the Acsim development team.

At Acsim, we use Amazon Bedrock as our LLM platform. When operating Bedrock in production, one constraint you cannot avoid is the TPM (Tokens per Minute) limit.

The goal is simple: process LLM calls as quickly as possible. But if you raise concurrency too much, you quickly hit the TPM limit and the whole pipeline stalls.

In this article, I'll share how we designed our parallel processing with TPM limits in mind, and what we learned through implementation.


Understanding Bedrock's rate limit model

Bedrock has several limits, but the most important one here is TPM (Tokens per Minute). This is the maximum number of tokens you can consume per minute, and it directly determines how much concurrency you can run.

TPM varies by model. You can check the exact values in Service Quotas.

How token consumption works

Before implementing anything, you need to understand how tokens are actually consumed.

When you call Bedrock, you specify the maximum number of output tokens (maxTokens). This controls the maximum response length.

According to the official docs, token consumption happens in three phases:

1. At request start

Total input tokens + maxTokens

As soon as you send the request, the sum of input tokens and maxTokens is deducted from TPM. If you're using prompt caching, CacheReadInputTokens and CacheWriteInputTokens are also included. If this deduction exceeds TPM, the request is throttled.

2. During processing

As the model generates output, consumption is periodically adjusted (the exact formula isn't publicly disclosed).

3. At request completion

InputTokenCount + CacheWriteInputTokens + (OutputTokenCount × burndown rate)

The final consumption is then settled, and any unused tokens are returned to the TPM budget. Note that CacheReadInputTokens are not included in this final calculation.

Here's the flow as a diagram (for Claude models, output tokens are multiplied by the burndown rate; the example below assumes a burndown rate of 5, explained later):

What is the burndown rate?

Output tokens consume TPM with a multiplier called the burndown rate. As of January 2026, for recent Claude models (Opus 4, Sonnet 4, etc.), the multiplier is 5x. That means 100 output tokens count as 500 TPM.

This 5x multiplier only applies to TPM consumption. Billing is based on the actual input/output token counts, so you are not charged 5x just because the TPM consumption is 5x.

  • Claude Opus 4 / 4.1: 5x
  • Claude Sonnet 4 / 4.5: 5x
  • Claude 3.7 Sonnet: 5x
  • Claude Haiku 4.5: 5x
  • Other models: 1x
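The accounting above can be sketched in a few lines. The field and function names are illustrative, not the actual Bedrock API shape:

```typescript
// Phase 1 and phase 3 of Bedrock's TPM accounting, as described above.
interface TokenUsage {
  inputTokens: number;
  cacheReadInputTokens: number;
  cacheWriteInputTokens: number;
  outputTokens: number;
}

// At request start: input tokens (including cache tokens) + maxTokens
// are deducted from the TPM budget up front.
function startDeduction(u: TokenUsage, maxTokens: number): number {
  return u.inputTokens + u.cacheReadInputTokens + u.cacheWriteInputTokens + maxTokens;
}

// At completion: the final charge. Cache *read* tokens are excluded, and
// output tokens are multiplied by the burndown rate (5x for recent Claude models).
function finalConsumption(u: TokenUsage, burndownRate: number): number {
  return u.inputTokens + u.cacheWriteInputTokens + u.outputTokens * burndownRate;
}

const usage: TokenUsage = {
  inputTokens: 1000,
  cacheReadInputTokens: 0,
  cacheWriteInputTokens: 0,
  outputTokens: 200,
};

console.log(startDeduction(usage, 1500)); // 2500 reserved up front
console.log(finalConsumption(usage, 5)); // 2000 charged in the end
```

With a burndown rate of 5, the 200 output tokens count as 1000 TPM, so 2000 of the 2500 tokens reserved at request start are actually consumed and 500 are returned to the budget.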

maxTokens directly impacts concurrency

The key takeaway here is that maxTokens directly affects how many concurrent requests you can run.

If you set maxTokens too large, the deduction at request start grows, reducing concurrency. If you keep it close to actual output size, you can run more requests within the same TPM budget.

Optimizing maxTokens is the key to maximizing throughput.


Designing parallel processing

How to decide concurrency

The first question when designing concurrency is: “How many requests can we run at once?” A simple estimate is:

Concurrent requests ≈ TPM ÷ (input tokens + maxTokens)

This is only a rough approximation. In production, we estimate maxTokens from historical output sizes and sum across parallel workflow phases (details below).
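As a sketch (the quota and token numbers below are made up for illustration, not real Service Quotas values):

```typescript
// Rough concurrency estimate: TPM ÷ (input tokens + maxTokens).
function estimateConcurrency(tpm: number, inputTokens: number, maxTokens: number): number {
  return Math.floor(tpm / (inputTokens + maxTokens));
}

// e.g. a 200,000 TPM quota with ~2,000 input tokens per request
console.log(estimateConcurrency(200_000, 2_000, 2_000)); // 50
console.log(estimateConcurrency(200_000, 2_000, 6_000)); // 25: larger maxTokens cuts concurrency
```

The second call shows the point of the next section: tripling maxTokens halves the number of requests you can keep in flight.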

Real workflows run multiple agents in parallel

We use Mastra to build workflows composed of multiple agents. Some steps run in parallel.

For example, suppose a phase in which Step 2 (running Agent B) and Step 3 (running Agent C) execute concurrently. The effective maximum consumption in that phase is the sum of maxTokens for Agent B and Agent C.

Identify parallel “phases”

To estimate concurrency accurately, we need to identify where parallel execution occurs in the workflow.

We do this as follows:

  1. Identify the parallel phases in the workflow structure
  2. Sum maxTokens for agents running concurrently in each phase
  3. Use the maximum sum across all phases as the worst-case budget

By planning against the worst case, we ensure we never exceed TPM.
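The three steps above can be sketched as a small helper; the workflow shape and numbers are illustrative:

```typescript
// Worst-case TPM budget across the parallel phases of a workflow.
type Phase = { name: string; agentMaxTokens: number[] };

function worstCaseBudget(phases: Phase[]): number {
  // Sum maxTokens within each phase, then take the largest sum.
  return Math.max(...phases.map((p) => p.agentMaxTokens.reduce((sum, t) => sum + t, 0)));
}

// Step 1 runs alone; Step 2 and Step 3 (Agent B and Agent C) run in parallel.
const phases: Phase[] = [
  { name: "step-1", agentMaxTokens: [2000] },
  { name: "steps-2-and-3", agentMaxTokens: [1500, 1500] },
];

console.log(worstCaseBudget(phases)); // 3000: the parallel phase dominates
```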

Dynamically adjust maxTokens

If maxTokens is fixed, it tends to be set too high for safety, which reduces concurrency and throughput.

We dynamically adjust maxTokens based on historical output sizes:

  • Record output token counts
  • Remove outliers
  • Use (max of remaining) × 1.5 as the next maxTokens

Example: if the last 10 outputs are [800, 850, 900, 820, 3000, 870, 810, 890, 840, 860]:

  1. Remove the outlier (3000)
  2. Max of remaining is 900
  3. 900 × 1.5 = 1350 for next maxTokens

This keeps safety margins while maximizing concurrency.
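Here's a minimal sketch of the adjustment. The outlier rule (dropping values above twice the median) is an illustrative choice; the article does not specify the exact filter:

```typescript
// Next maxTokens = 1.5x the largest recent output, after dropping outliers.
function nextMaxTokens(recentOutputs: number[], margin = 1.5): number {
  const sorted = [...recentOutputs].sort((a, b) => a - b);
  const median = sorted[Math.floor(sorted.length / 2)];
  // Treat anything above twice the median as an outlier (illustrative rule).
  const kept = recentOutputs.filter((n) => n <= median * 2);
  return Math.ceil(Math.max(...kept) * margin);
}

const history = [800, 850, 900, 820, 3000, 870, 810, 890, 840, 860];
console.log(nextMaxTokens(history)); // 1350: 3000 is dropped, then 900 × 1.5
```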

Retry when output is truncated

If maxTokens is too small, responses may be cut off. Bedrock responses include a stopReason field, and truncation is indicated by "max_tokens".

Our retry strategy:

  1. If stopReason is "max_tokens", retry with maxTokens doubled
  2. Repeat until reaching the model's maximum output limit
  3. If it still truncates at the max, treat it as an error

This lets us start with conservative maxTokens to maximize concurrency, while automatically increasing it only when necessary.
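The strategy can be sketched like this. A real Bedrock call is asynchronous; callModel is kept synchronous here, and its name and shape are hypothetical, to focus on the doubling logic:

```typescript
interface ModelResult { text: string; stopReason: string; }

function callWithRetry(
  callModel: (maxTokens: number) => ModelResult,
  initialMaxTokens: number,
  modelMaxOutput: number,
): ModelResult {
  let maxTokens = initialMaxTokens;
  for (;;) {
    const result = callModel(maxTokens);
    // stopReason "max_tokens" means the output was truncated.
    if (result.stopReason !== "max_tokens") return result;
    if (maxTokens >= modelMaxOutput) {
      throw new Error("Output truncated even at the model's maximum");
    }
    // Double maxTokens, capped at the model's output limit.
    maxTokens = Math.min(maxTokens * 2, modelMaxOutput);
  }
}

// Simulated model: truncates until maxTokens reaches 4000.
const attempts: number[] = [];
const result = callWithRetry(
  (m) => {
    attempts.push(m);
    return { text: "ok", stopReason: m >= 4000 ? "end_turn" : "max_tokens" };
  },
  1000,
  8192,
);
console.log(attempts); // [1000, 2000, 4000]
```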

Flow overview:

Scaling concurrency with SQS

In our setup, LLM calls are processed through an SQS queue.

With SQS, the ReceiveMessage API can fetch up to 10 messages at a time. That means to increase concurrency, you need more consumers.

Effective concurrency = Number of consumers × Batch size

We scale the number of consumers dynamically based on the concurrency derived from TPM.
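The relationship can be sketched as follows (ReceiveMessage caps MaxNumberOfMessages at 10):

```typescript
// SQS ReceiveMessage returns at most 10 messages per call.
const SQS_MAX_BATCH_SIZE = 10;

// Consumers needed to reach the concurrency target derived from TPM.
function requiredConsumers(targetConcurrency: number, batchSize = SQS_MAX_BATCH_SIZE): number {
  return Math.ceil(targetConcurrency / batchSize);
}

console.log(requiredConsumers(50)); // 5 consumers × batches of 10 = 50 in flight
console.log(requiredConsumers(55)); // 6: round up so the target is always covered
```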


Summary

When running Bedrock in production, designing parallel processing around TPM limits is essential. Here are the key points we learned:

Token consumption

  • At request start, input tokens + maxTokens are deducted from TPM (including cache tokens)
  • At request completion, actual consumption is finalized and unused tokens are returned
  • Optimizing maxTokens directly controls concurrency

Parallel processing design

  • Estimate concurrency from TPM and maximum token consumption
  • Identify parallel phases and plan for the worst-case sum
  • Dynamically adjust maxTokens based on historical outputs
  • Retry with doubled maxTokens when outputs are truncated
  • For SQS, scale consumers to increase concurrency

With these practices, you can sustain stable performance without constantly fighting TPM limits.

Acknowledgements

The research, design, and implementation described here were led by @technote-space from the Acsim development team. Thank you!
