How to Handle LLM API Rate Limits in Python (Retry Logic, Backoff, Fallback)
Production-ready Python patterns for handling 429 errors from any LLM API — exponential backoff, tenacity, key rotation, and model fallback strategies.
Why Rate Limits Exist
Every LLM API enforces rate limits to ensure fair usage across all users. You will encounter three types:
- RPM (Requests Per Minute): How many API calls you can make per minute. Free tiers commonly allow only 3–20 RPM.
- TPM (Tokens Per Minute): Total tokens (input + output) processed per minute. Less commonly hit than RPM.
- RPD (Requests Per Day): Some providers cap daily usage regardless of per-minute limits.
When you exceed any of these limits, the API returns an HTTP 429 Too Many Requests error. Left unhandled, the resulting exception can crash your application. This guide gives you the patterns to handle it gracefully.
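One way to avoid many 429s in the first place is to throttle requests on the client side. The following is a minimal sketch of a sliding-window limiter for the RPM case. The class name `SlidingWindowLimiter`, its parameters, and the injectable `clock` and `sleep` arguments are assumptions introduced here for illustration; they are not part of any provider's SDK, and the actual limit must come from your provider's documentation.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` per `period` seconds (hypothetical helper)."""

    def __init__(self, max_calls, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()  # timestamps of recent calls, oldest first

    def wait_time(self):
        """Seconds until the next call is allowed (0.0 if allowed now)."""
        now = self._clock()
        # Drop timestamps that have left the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            return 0.0
        return self.period - (now - self._stamps[0])

    def acquire(self):
        """Block until a call is allowed, then record it."""
        while True:
            delay = self.wait_time()
            if delay <= 0:
                break
            self._sleep(delay)
        self._stamps.append(self._clock())
```

Calling `limiter.acquire()` before each request keeps the request rate under the configured ceiling. This sketch covers RPM only, does not count tokens, and is not thread-safe.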
Understanding the 429 Error
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://aiapiv2.pekpik.com/v1",
    api_key="sk-your-key-here"
)

try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError as e:
    print(f"Rate limit hit: {e}")
    # e.response.headers may contain 'retry-after' header
except Exception as e:
    print(f"Other error: {e}")
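When a `retry-after` header is present, it tells you how long the server wants you to wait. In HTTP this value can be either a number of seconds or an HTTP date. The helper below is a sketch that handles both forms; `retry_after_seconds` and its `default` and `now` parameters are names introduced here for illustration, and whether a given provider sends the header at all is something to verify yourself.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, default=1.0, now=None):
    """Return the wait suggested by a Retry-After header, in seconds.

    `headers` is any mapping with .get(). A plain dict is case-sensitive,
    so this sketch looks up the lowercase name only.
    """
    value = headers.get("retry-after")
    if value is None:
        return default
    value = value.strip()
    # Form 1: delay in seconds, e.g. "7".
    try:
        return max(0.0, float(value))
    except ValueError:
        pass
    # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable: fall back to the default wait
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Inside the `except RateLimitError as e:` branch, this could be used as `time.sleep(retry_after_seconds(e.response.headers))`, assuming the exception exposes the HTTP response as in the example above.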
Pattern 1 — Simple Retry with Exponential Backoff
The simplest approach: if you hit a 429, wait and try again. Double the wait time on each failure (exponential backoff).
import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://aiapiv2.pekpik.com/v1",
    api_key="sk-your-key-here"
)

def call_with_backoff(messages, model="gpt-4o", max_retries=5):
    wait = 1  # start with 1 second
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after max retries
            print(f"Rate limited. Waiting {wait}s... (attempt {attempt+1}/{max_retries})")
            time.sleep(wait)
            wait = min(wait * 2, 60)  # cap at 60 seconds

response = call_with_backoff([{"role": "user", "content": "Hello!"}])
print(response.choices[0].message.content)
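A common refinement of plain doubling is to add random jitter, so that several workers that were rate limited at the same moment do not all retry in lockstep. The sketch below computes a "full jitter" delay, drawn uniformly between zero and the capped exponential ceiling. The function name `backoff_delay` and its parameters are illustrative assumptions, not part of Pattern 1 as written.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Full-jitter delay for a zero-based attempt number.

    Picks uniformly from [0, min(cap, base * 2**attempt)].
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


# Seeded generator only so the demo is repeatable from run to run.
demo_rng = random.Random(42)
for attempt in range(6):
    delay = backoff_delay(attempt, rng=demo_rng)
    print(f"attempt {attempt}: would sleep {delay:.2f}s")
```

To use it in `call_with_backoff`, the `time.sleep(wait)` call could become `time.sleep(backoff_delay(attempt))`. The trade-off is that individual waits become less predictable in exchange for retries that are spread out.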
Pattern 2 — Using Tenacity for Cleaner Retry Logic
The tenacity library makes retry logic declarative and much cleaner:
pip install tenacity
from tenacity import (
    retry, stop_after_attempt, wait_exponential, retry_if_exception_type
)
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://aiapiv2.pekpik.com/v1",
    api_key="sk-your-key-here"
)

@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_exponential(multiplier=1, min=1, max=60),
    stop=stop_after_attempt(6)
)
def call_llm(prompt: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Now just call it; retries happen automatically
result = call_llm("What is the capital of France?")
print(result)
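To make it clearer what a declarative retry decorator does, here is a standard-library sketch of a similar one. It is not tenacity's implementation; `retry_on`, its parameters, and `FakeRateLimitError` (a stand-in for `RateLimitError` so the example runs offline) are all hypothetical names introduced for illustration.

```python
import functools
import time


class FakeRateLimitError(Exception):
    """Stand-in for openai.RateLimitError so the sketch runs offline."""


def retry_on(exc_type, attempts=6, min_wait=1.0, max_wait=60.0, sleep=time.sleep):
    """Retry the wrapped function on `exc_type` with capped exponential waits."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = min_wait
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exc_type:
                    if attempt == attempts:
                        raise  # out of attempts: surface the original error
                    sleep(wait)
                    wait = min(wait * 2, max_wait)
        return wrapper
    return decorator


@retry_on(FakeRateLimitError, attempts=3, sleep=lambda seconds: None)
def always_limited():
    raise FakeRateLimitError("429")
```

One behavioral difference worth knowing: this sketch re-raises the original exception once attempts run out, whereas tenacity by default raises its own `RetryError` unless `reraise=True` is passed to `@retry`.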
Pattern 3 — Multi-Key Rotation
If you have multiple FreeLLMKeys, rotate through them when one hits a rate limit:
from openai import OpenAI, RateLimitError
import itertools

# Add multiple keys from FreeLLMKeys
API_KEYS = [
    "sk-key-one-here",
    "sk-key-two-here",
    "sk-key-three-here",
]
BASE_URL = "https://aiapiv2.pekpik.com/v1"

key_cycle = itertools.cycle(API_KEYS)
current_key = next(key_cycle)

def get_client():
    return OpenAI(base_url=BASE_URL, api_key=current_key)

def call_with_rotation(prompt: str, model: str = "gpt-4o") -> str:
    global current_key
    for _ in range(len(API_KEYS)):
        try:
            client = get_client()
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.choices[0].message.content
        except RateLimitError:
            print(f"Key {current_key[:12]}... rate limited. Rotating.")
            current_key = next(key_cycle)
    raise RuntimeError("All keys rate limited")

result = call_with_rotation("Explain recursion in one sentence.")
print(result)
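The rotation above moves to the next key but does not remember which keys were recently limited, so a later call may go straight back to a key that is still blocked. A possible refinement is to track a cooldown per key. The sketch below assumes a fixed 60-second cooldown unless a server-provided delay is supplied; `KeyPool`, `acquire`, and `mark_limited` are hypothetical names introduced here.

```python
import time


class KeyPool:
    """Round-robin over API keys, skipping keys that are cooling down."""

    def __init__(self, keys, cooldown=60.0, clock=time.monotonic):
        if not keys:
            raise ValueError("at least one key is required")
        self._keys = list(keys)
        self._cooldown = cooldown
        self._clock = clock
        self._blocked_until = {key: 0.0 for key in self._keys}
        self._next = 0  # index where the next search starts

    def acquire(self):
        """Return the next usable key, or None if every key is cooling down."""
        now = self._clock()
        for offset in range(len(self._keys)):
            idx = (self._next + offset) % len(self._keys)
            key = self._keys[idx]
            if self._blocked_until[key] <= now:
                self._next = (idx + 1) % len(self._keys)
                return key
        return None

    def mark_limited(self, key, retry_after=None):
        """Block `key` for `retry_after` seconds, or the default cooldown."""
        delay = self._cooldown if retry_after is None else retry_after
        self._blocked_until[key] = self._clock() + delay
```

In `call_with_rotation`, the `except RateLimitError` branch would call `pool.mark_limited(key)`, and a `None` result from `pool.acquire()` signals that it is time to sleep or fall back rather than retry immediately.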
Pattern 4 — Fallback to a Different Model
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://aiapiv2.pekpik.com/v1",
    api_key="sk-your-key-here"
)

# Models are tried in order; the first one that is not rate limited answers.
FALLBACK_MODELS = ["gpt-4o", "deepseek-chat", "gemini-2.5-flash"]

def call_with_model_fallback(prompt: str) -> str:
    for model in FALLBACK_MODELS:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.choices[0].message.content
        except RateLimitError:
            print(f"Model {model} rate limited. Trying next.")
    raise RuntimeError("All models rate limited")
Logging Rate Limit Hits for Debugging
import logging
from datetime import datetime
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://aiapiv2.pekpik.com/v1",
    api_key="sk-your-key-here"
)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def logged_call(prompt: str, model: str = "gpt-4o") -> str:
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content
    except RateLimitError as e:
        # Log the hit with a timestamp, then re-raise so callers can retry.
        logger.warning(f"[{datetime.now().isoformat()}] Rate limit hit on model={model}: {e}")
        raise
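Individual log lines are easier to act on when they are paired with a running count per model. The sketch below keeps such a count alongside the warning; `RateLimitStats`, `record`, and `summary` are hypothetical names, and the logger name `llm.ratelimit` is an arbitrary choice for illustration.

```python
import logging
from collections import Counter

logger = logging.getLogger("llm.ratelimit")


class RateLimitStats:
    """Count 429s per model so spikes are visible at a glance."""

    def __init__(self):
        self.hits = Counter()

    def record(self, model, error):
        """Increment the counter for `model` and log the hit."""
        self.hits[model] += 1
        # %-style arguments are only formatted if WARNING is enabled.
        logger.warning(
            "rate limit hit model=%s count=%d error=%s",
            model, self.hits[model], error,
        )

    def summary(self):
        """Return (model, hits) pairs, most frequently limited first."""
        return self.hits.most_common()
```

Calling `stats.record(model, e)` from the `except RateLimitError as e:` branch and printing `stats.summary()` at the end of a batch run shows which models hit their limits most often.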
Combine these patterns based on your needs. For most development workflows with FreeLLMKeys, Pattern 1 or 2 is sufficient. Pattern 3 becomes valuable when you are doing batch processing and have multiple keys available.
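As one possible combination, the sketch below retries each model with capped exponential backoff and only then falls back to the next model. It is written against a caller-supplied `send(model)` function so it runs without network access. `call_with_retry_and_fallback`, `send`, and `FakeRateLimitError` are names introduced here for illustration; in real use you would pass `limit_error=RateLimitError` and a `send` that wraps `client.chat.completions.create`.

```python
import time


class FakeRateLimitError(Exception):
    """Stand-in for openai.RateLimitError so the sketch runs offline."""


def call_with_retry_and_fallback(send, models, retries_per_model=3,
                                 base_wait=1.0, max_wait=60.0,
                                 limit_error=FakeRateLimitError,
                                 sleep=time.sleep):
    """Try each model in order, retrying with backoff before falling back.

    `send(model)` performs one request and returns its result.
    """
    for model in models:
        wait = base_wait
        for attempt in range(retries_per_model):
            try:
                return send(model)
            except limit_error:
                # Sleep only if another attempt on this model remains.
                if attempt < retries_per_model - 1:
                    sleep(wait)
                    wait = min(wait * 2, max_wait)
        # All attempts for this model were rate limited; try the next one.
    raise RuntimeError("All models rate limited")
```

The ordering is a design choice: retrying the preferred model first favors answer quality, while falling back immediately (as in Pattern 4) favors latency.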