Amazon Web Services has launched Global cross-Region inference for Anthropic's Claude Sonnet 4 in Amazon Bedrock, which automatically routes AI inference requests across supported AWS Regions. The new capability lets developers absorb unexpected traffic surges and draw on capacity in other Regions to improve throughput and performance. It eases the service-quota constraints and peak-usage bottlenecks that previously required sophisticated manual routing workarounds.
Cross-region inference enhances scalability
Anthropic’s Claude Sonnet 4 is now available with Global cross-Region inference in Amazon Bedrock, so you can now use the Global Claude Sonnet 4 inference profile to route your inference requests to any supported commercial AWS Region for processing, optimizing use of available resources and enabling higher model throughput, according to AWS. Amazon Bedrock is a comprehensive, secure, and flexible service for building generative AI applications and agents.
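To make this concrete, here is a minimal sketch of invoking Claude Sonnet 4 through the global inference profile using boto3’s Converse API. The profile ID shown follows Bedrock’s naming convention for cross-Region profiles but is an assumption; verify the exact ID available in your account.

```python
import boto3

# Bedrock Runtime client in your source Region; with a global inference
# profile, the request may be served from any supported commercial Region.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Assumed global inference profile ID -- confirm it in your account
# (for example, via the Bedrock console or list_inference_profiles).
GLOBAL_PROFILE_ID = "global.anthropic.claude-sonnet-4-20250514-v1:0"

response = client.converse(
    modelId=GLOBAL_PROFILE_ID,  # an inference profile ID in place of a model ID
    messages=[{
        "role": "user",
        "content": [{"text": "Summarize cross-Region inference in one sentence."}],
    }],
    inferenceConfig={"maxTokens": 256, "temperature": 0.5},
)

print(response["output"]["message"]["content"][0]["text"])
```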
When using on-demand and batch inference in Amazon Bedrock, your requests may be restricted by service quotas or during peak usage times. Cross-Region inference enables you to seamlessly manage unplanned traffic bursts by distributing traffic across multiple AWS Regions, utilizing compute capacity where it is available and enabling higher throughput.
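For context, without cross-Region inference a client has to absorb single-Region throttling itself. A sketch of that manual retry pattern, assuming boto3 and illustrative backoff parameters:

```python
import time

import boto3
from botocore.exceptions import ClientError

client = boto3.client("bedrock-runtime", region_name="us-east-1")

def converse_with_retry(model_id, messages, max_attempts=5):
    """Call Bedrock, backing off when a single Region's quota throttles us."""
    for attempt in range(max_attempts):
        try:
            return client.converse(modelId=model_id, messages=messages)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise
            # Exponential backoff, then retry in the same Region -- the
            # kind of manual work that cross-Region routing makes unnecessary.
            time.sleep(2 ** attempt)
    raise RuntimeError("Exhausted retries against a single Region")
```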
Automatic traffic management system
Previously, you could choose cross-Region inference profiles tied to a specific geography, such as the US, EU, or APAC, which automatically selected the optimal commercial AWS Region within that geography to process your inference requests. For generative AI use cases that don’t need to stay within a specific geography, you can now use the Global cross-Region inference profile to further increase model throughput.
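You can enumerate the geography-scoped and global profiles your account exposes through the Bedrock control-plane API. A short sketch using list_inference_profiles; the us., eu., apac., and global. ID prefixes follow Bedrock’s naming convention:

```python
import boto3

# Note the "bedrock" control-plane client, distinct from "bedrock-runtime".
bedrock = boto3.client("bedrock", region_name="us-east-1")

# System-defined profiles include the geography-scoped (us., eu., apac.)
# and global cross-Region inference profiles available in this account.
resp = bedrock.list_inference_profiles(typeEquals="SYSTEM_DEFINED")

for profile in resp["inferenceProfileSummaries"]:
    print(profile["inferenceProfileId"], "-", profile["inferenceProfileName"])
```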
According to the AWS Machine Learning blog, with the advent of generative AI solutions, a paradigm shift is underway across industries, driven by organizations embracing foundation models to unlock unprecedented opportunities. Amazon Bedrock has emerged as the preferred choice for numerous customers seeking to innovate and launch generative AI applications, leading to an exponential surge in demand for model inference capabilities.
Technical implementation details
Today, we are happy to announce the general availability of cross-region inference, a powerful feature allowing automatic cross-region inference routing for requests coming to Amazon Bedrock, according to AWS. This offers developers using on-demand inference mode a seamless solution for achieving higher throughput and performance while managing incoming traffic spikes for applications powered by Amazon Bedrock.
By opting in, developers no longer have to spend time and effort predicting demand fluctuations. Instead, cross-region inference dynamically routes traffic across multiple Regions. Moreover, this capability prioritizes the connected Amazon Bedrock API source Region when possible, helping to minimize latency and improve responsiveness. As a result, customers can enhance their applications’ reliability, performance, and efficiency.
Key features and benefits
Some of the key features of cross-Region inference include:

- Utilizing capacity from multiple AWS Regions, allowing generative AI workloads to scale with demand.
- Compatibility with the existing Amazon Bedrock API.
- No additional routing or data transfer cost; you pay the same price per token for models as in your source Region (the Region where you made the request).
- The ability to choose from a range of pre-configured AWS Region sets tailored to your needs.
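Because cross-Region inference is compatible with the existing API, adopting it is typically a one-line change: swap the foundation model ID for an inference profile ID. A sketch, with profile IDs that follow Bedrock’s naming convention but should be verified in your account:

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Before: invoking the foundation model directly in a single Region.
# model_id = "anthropic.claude-sonnet-4-20250514-v1:0"

# After: the same call, routed through a cross-Region inference profile --
# geography-scoped (us./eu./apac.) or global, depending on your needs.
model_id = "global.anthropic.claude-sonnet-4-20250514-v1:0"

response = client.converse(
    modelId=model_id,  # the only change versus a direct model invocation
    messages=[{"role": "user", "content": [{"text": "Hello"}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```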
AWS’s Global cross-Region inference is a significant step forward in AI infrastructure management: it fully automates the difficult work of traffic routing at no additional cost. This frees developers to concentrate on building innovative AI applications rather than wrestling with infrastructure scaling. As AI workloads continue to grow rapidly, automated capabilities like this will be essential to keeping global AI deployments performant and reliable.
