Optimizing latency for Azure OpenAI Service
Introduction #
In this post we’ll be looking into measuring and optimizing Azure OpenAI Service response latency by evaluating deployed Azure OpenAI endpoints on a global scale. By optimizing latency, we can enable more real-time use cases, as well as maximize throughput for batch workloads. Our main goal in this exercise is to reduce latency peaks that might show up here and there if one of the regions experiences significant load (noisy neighbors) or if we’re running into API rate limits.
This can be used to optimize latency for gpt-35-turbo, but can also be applied to the gpt-4 model series.
A word of caution: the solution discussed here won’t be perfect and won’t be able to avoid latency peaks completely. If you want to run latency-sensitive use cases on Azure OpenAI where you can’t tolerate any peaks, I’d suggest talking to your Microsoft sales representative about the Provisioned Throughput Model, which offers dedicated Azure OpenAI throughput capacity.
Ideas for optimizing latency #
If we want to build a “latency-optimized” app using Azure OpenAI, we could take the following approach:
- Measure latency against a range of worldwide regions using a short test prompt
- Based on the call’s status code, latency, and rolling average latency (for instance, with a decay rate of 0.8), select the fastest regions for the actual API call (see the selection sketch after this list)
- Execute the API calls
- Repeat this check at intervals between 10 and 60 seconds
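As a rough sketch of the selection step (the pick_best_endpoint name is mine, and it assumes per-region stats kept in a dict shaped like the stat structure built later in this post), the choice could look like this:

def pick_best_endpoint(stat):
    # Only consider regions whose last test call succeeded
    healthy = {ep: s for ep, s in stat.items() if s['status'] == 200}
    if not healthy:
        return None  # let the caller decide on a fallback
    # Pick the region with the lowest rolling-average latency
    return min(healthy, key=lambda ep: healthy[ep]['latency_ra'])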
But what about the latency added by using an Azure region far from our application? Yes, this can cause additional latency. However, the main goal here is to prevent abrupt latency spikes. To give you some idea, here are a few quick tests:
- Latency from central Germany to canadaeast: <110ms
- Latency from central Germany to uksouth: <20ms
- Latency from central Germany to japaneast: <250ms
Even considering a long distance, such as from the US East Coast to Singapore, the worst-case scenario is ~300ms of latency. However, if your app runs on Azure, you should experience significantly lower latency due to the use of the Microsoft backbone, as opposed to the public internet.
For context, running a prompt with 1000 input tokens and 200 completion tokens likely takes between half a second and two seconds to complete, so adding 100ms, 200ms, or 300ms doesn’t significantly impact our goal of preventing spikes.
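If you want a rough feel for the raw network round trip from your own location, one quick (and admittedly unscientific) way is to time a plain HTTPS request against a regional endpoint. This includes DNS, TLS, and HTTP overhead, so treat the result as an upper bound rather than a pure ping:

import time
import requests

# Time a single HTTPS round trip to one regional endpoint; any response is fine,
# we only care about how long it takes to come back
endpoint = "https://uksouth.api.cognitive.microsoft.com/"
t_start = time.time()
requests.get(endpoint, timeout=10)
print(f"{endpoint}: {(time.time() - t_start) * 1000:.0f}ms")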
Access configuration #
First, let’s create an accounts.json file that holds the endpoints and access keys for all the regions we want to test. In this case, I’ve just created Azure OpenAI resources in all regions where I still had capacity left:
[
{
"endpoint": "https://canadaeast.api.cognitive.microsoft.com/",
"key": "..."
},
{
"endpoint": "https://eastus2.api.cognitive.microsoft.com/",
"key": "..."
},
{
"endpoint": "https://francecentral.api.cognitive.microsoft.com/",
"key": "..."
},
{
"endpoint": "https://japaneast.api.cognitive.microsoft.com/",
"key": "..."
},
{
"endpoint": "https://northcentralus.api.cognitive.microsoft.com/",
"key": "..."
},
{
"endpoint": "https://uksouth.api.cognitive.microsoft.com/",
"key": "..."
}
]
Testing latency via Python #
To begin, install requests:
pip install requests
We opt for simple HTTP requests over the openai SDK, as this makes it easier to manage call timeouts and status codes.
Here’s a sample script for the job:
import json
import time

import requests

decay_rate = 0.8       # weight given to the newest measurement in the rolling average
http_timeout = 10      # seconds before a test request is considered failed
test_interval = 15     # seconds between test rounds

with open('accounts.json', 'r') as f:
    accounts = json.load(f)

def get_latency_for_endpoint(endpoint, key):
    # Strip any trailing slash so the path doesn't end up with a double '/'
    url = f"{endpoint.rstrip('/')}/openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-05-15"
    headers = {
        "Content-Type": "application/json",
        "api-key": key
    }
    # Tiny prompt, capped at a single completion token to keep cost and latency low
    data = {"max_tokens": 1, "messages": [{"role": "system", "content": ""}, {"role": "user", "content": "Hi"}]}
    try:
        t_start = time.time()
        response = requests.post(url, headers=headers, json=data, timeout=http_timeout)
        latency = time.time() - t_start
        status = response.status_code
    except Exception:
        # Treat timeouts and connection errors as failures with worst-case latency
        status = 500
        latency = http_timeout
    # print(response.json())  # uncomment to inspect the full response (only when the call succeeded)
    print(f"Endpoint: {endpoint}, Status: {status}, Latency: {latency}s")
    return {
        "ts": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),
        "status": status,
        "latency": latency,
    }

stat = {}
for account in accounts:
    stat[account['endpoint']] = {
        'last_updated': None,
        'status': None,
        'latency': None,
        'latency_ra': 0
    }

while True:
    for account in accounts:
        endpoint = account['endpoint']
        key = account['key']
        result = get_latency_for_endpoint(endpoint, key)
        stat[endpoint]['last_updated'] = result['ts']
        stat[endpoint]['status'] = result['status']
        stat[endpoint]['latency'] = result['latency']
        # Exponentially weighted rolling average of the latency
        stat[endpoint]['latency_ra'] = decay_rate * result['latency'] + (1 - decay_rate) * stat[endpoint]['latency_ra']
    print(json.dumps(stat, indent=4))
    time.sleep(test_interval)
In this script, endpoints are checked every 15 seconds, with timeouts set at 10 seconds. A rolling average with a decay rate of 0.8 is calculated.
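For illustration, here is how this exponentially weighted rolling average reacts to a single simulated latency spike (the values below are made up):

decay_rate = 0.8
latency_ra = 0.2
for latency in [0.2, 0.2, 2.0, 0.2, 0.2]:   # one simulated spike to 2s
    latency_ra = decay_rate * latency + (1 - decay_rate) * latency_ra
    print(round(latency_ra, 3))
# prints approximately: 0.2, 0.2, 1.64, 0.488, 0.258

With a decay rate of 0.8 the newest measurement dominates, so the average reacts to a spike immediately and recovers within a few test rounds.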
A response from a single prompt will look like this:
{
"id":"chatcmpl-.....",
"object":"chat.completion",
"created":1690872556,
"model":"gpt-35-turbo",
"choices":[
{
"index":0,
"finish_reason":"length",
"message":{
"role":"assistant",
"content":"Hello"
}
}
],
"usage":{
"completion_tokens":1,
"prompt_tokens":14,
"total_tokens":15
}
}
The overall cost per call is 15 tokens, which works out to 30 days * 24 hours * 60 minutes * 4 requests/minute * 15 tokens * $0.002 / 1000 tokens = ~$5.2 per region per month. I’m not sure we need to test every 15 seconds, or if every minute would be sufficient. In terms of requests per minute, Azure OpenAI allows 1440 requests/minute per region and subscription, so sacrificing 4 calls is less than 0.3%.
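As a quick sanity check on that estimate (using the gpt-35-turbo price of $0.002 per 1K tokens quoted above):

# 4 test calls per minute, 15 tokens each
tokens_per_month = 30 * 24 * 60 * 4 * 15        # = 2,592,000 tokens
cost_per_month = tokens_per_month * 0.002 / 1000
print(f"${cost_per_month:.2f} per region per month")   # ~$5.18, matching the ~$5.2 above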
Running the script for a while yields data such as:
{
"https://canadaeast.api.cognitive.microsoft.com/": {
"last_updated": "2023-08-01 09:36:25",
"status": 200,
"latency": 0.5866355895996094,
"latency_ra": 0.5867746781616211
},
"https://eastus2.api.cognitive.microsoft.com/": {
"last_updated": "2023-08-01 09:36:25",
"status": 200,
"latency": 0.5309584140777588,
"latency_ra": 0.5271010751342773
},
"https://francecentral.api.cognitive.microsoft.com/": {
"last_updated": "2023-08-01 09:36:26",
"status": 200,
"latency": 0.725212812423706,
"latency_ra": 0.6279167041015624
},
"https://japaneast.api.cognitive.microsoft.com/": {
"last_updated": "2023-08-01 09:36:27",
"status": 200,
"latency": 1.0203375816345215,
"latency_ra": 1.0150870689697267
},
"https://northcentralus.api.cognitive.microsoft.com/": {
"last_updated": "2023-08-01 09:36:28",
"status": 200,
"latency": 0.7335877418518066,
"latency_ra": 0.7090948748168945
},
"https://uksouth.api.cognitive.microsoft.com/": {
"last_updated": "2023-08-01 09:36:28",
"status": 200,
"latency": 0.2238612174987793,
"latency_ra": 0.22408719714355468
}
}
We can clearly see that japaneast is the slowest, but as discussed before, latency from my machine to this region is already ~250ms, which probably explains it.
Moving forward #
While the above script is functional, a practical application should account for:
- Execution: The script could be executed with a timer in an Azure Function, persisting results into Azure Blob Storage or Azure Cosmos DB. The app would then query the current status periodically, caching responses and making regional choices based on current latency and the rolling average.
- Rate-limiting: Azure OpenAI Service defaults to 240k tokens per minute (TPM) for gpt-35-turbo per region and subscription (as of 08/01/2023). If the test prompt hits the limit for a region, that region will be marked with status 429, and the app should then pick the next best option.
- Fallback measures: If limits appear to be reached across most regions, make sure it isn’t just the http timeout set for the POST request. In such unlikely scenarios, temporarily increase the http timeout value to identify regions that are still responding (see the sketch after this list).
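As one possible way to implement that fallback (the find_responding_endpoint name, the 30-second timeout, and the overall retry strategy are my own assumptions, not a prescribed pattern), you could re-probe all regions with a relaxed timeout whenever every region looks unhealthy:

import requests

def find_responding_endpoint(accounts, relaxed_timeout=30):
    # Last resort: re-probe each region with a more generous timeout to check whether
    # "unavailable" regions are really down or just slower than the usual 10s limit
    for account in accounts:
        url = f"{account['endpoint'].rstrip('/')}/openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-05-15"
        headers = {"Content-Type": "application/json", "api-key": account['key']}
        data = {"max_tokens": 1, "messages": [{"role": "user", "content": "Hi"}]}
        try:
            response = requests.post(url, headers=headers, json=data, timeout=relaxed_timeout)
            if response.status_code == 200:
                return account['endpoint']
        except requests.RequestException:
            continue
    return None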
Summary #
This post has presented a simple approach to measuring Azure OpenAI response latency across the globe. By sending a tiny prompt, waiting for its completion, and then choosing the best-performing region, we can optimize our actual API calls and hopefully minimize latency spikes. While this will likely reduce the latency spikes you see, it won’t fully eliminate them. If your workload can’t tolerate any spikes, I’d suggest talking to your Microsoft sales representative about the Provisioned Throughput Model.