Tutorials | May 30, 2025 | 8 min read

How to Handle LLM API Rate Limits in Python (Retry Logic, Backoff, Fallback)

Production-ready Python patterns for handling 429 errors from any LLM API — exponential backoff, tenacity, key rotation, and model fallback strategies.

Why Rate Limits Exist

Every LLM API enforces rate limits to ensure fair usage across all users. You will encounter three types:

RPM (Requests Per Minute): How many API calls you can make per minute. Most free tiers allow 3–20 RPM.
TPM (Tokens Per Minute): Total tokens (input + output) processed per minute. Usually hit less often than RPM, though long prompts or long completions can exhaust it first.
RPD (Requests Per Day): Some providers cap daily usage regardless of per-minute limits.
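
Which of these limits you hit first depends on how large your requests are. The sketch below is a rough estimate only: the function name, the limits (20 RPM, 40,000 TPM), and the per-request token counts are illustrative assumptions, not figures from any provider.

```python
def effective_requests_per_minute(rpm_limit: int, tpm_limit: int,
                                  avg_tokens_per_request: int) -> int:
    """Estimate how many requests per minute fit under both the RPM and TPM caps."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    # Requests the token budget alone would allow in one minute
    by_tokens = tpm_limit // avg_tokens_per_request
    # The tighter of the two caps is the one you actually hit
    return min(rpm_limit, by_tokens)


# Hypothetical limits: 20 RPM and 40,000 TPM
print(effective_requests_per_minute(20, 40_000, 500))    # 20 (RPM is the binding limit)
print(effective_requests_per_minute(20, 40_000, 4_000))  # 10 (TPM is the binding limit)
```

With small requests the RPM cap binds; once each request averages a few thousand tokens, the TPM cap becomes the one that triggers 429s.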

When you exceed any of these limits, the API returns an HTTP 429 Too Many Requests error. If the resulting exception goes unhandled, your application crashes. This guide gives you the patterns to handle it gracefully.

Understanding the 429 Error

With the OpenAI Python SDK, a 429 response surfaces as a RateLimitError exception that you can catch separately from other errors:

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://aiapiv2.pekpik.com/v1",
    api_key="sk-your-key-here"
)

try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}]
    )
except RateLimitError as e:
    print(f"Rate limit hit: {e}")
    # e.response.headers may contain 'retry-after' header
except Exception as e:
    print(f"Other error: {e}")
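
When a provider does send a retry-after header, its value is normally either a number of seconds or an HTTP date. Below is a minimal sketch of a parser for both forms. The helper name retry_after_seconds is hypothetical and not part of the openai SDK, and it assumes the headers object is a mapping that answers to the lowercase key "retry-after".

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_after_seconds(headers: Mapping[str, str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Return the server-suggested wait in seconds, or None if absent or unparseable."""
    value = headers.get("retry-after")
    if value is None:
        return None
    value = value.strip()
    try:
        return max(0.0, float(value))  # delta-seconds form, e.g. "12"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without an offset as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(retry_after_seconds({"retry-after": "12"}))  # 12.0
print(retry_after_seconds({}))                     # None
```

Inside the except RateLimitError block you could pass e.response.headers to this helper, sleep for the returned value when it is not None, and fall back to your own backoff schedule otherwise.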

Pattern 1 — Simple Retry with Exponential Backoff

The simplest approach: if you hit a 429, wait and try again. Double the wait time on each failure (exponential backoff).

import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://aiapiv2.pekpik.com/v1",
    api_key="sk-your-key-here"
)

def call_with_backoff(messages, model="gpt-4o", max_retries=5):
    wait = 1  # start with 1 second
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after max retries
            print(f"Rate limited. Waiting {wait}s... (attempt {attempt+1}/{max_retries})")
            time.sleep(wait)
            wait = min(wait * 2, 60)  # cap at 60 seconds

response = call_with_backoff([{"role": "user", "content": "Hello!"}])
print(response.choices[0].message.content)
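
To see the wait schedule that Pattern 1 produces without calling an API, the same doubling-and-capping logic can be pulled into a pure function. This is a sketch: backoff_delays is a hypothetical helper, and the optional jitter flag (a random wait between 0 and the current delay, often called "full jitter") is an addition that Pattern 1 itself does not use.

```python
import random
from typing import List, Optional


def backoff_delays(max_retries: int = 5, base: float = 1.0, cap: float = 60.0,
                   jitter: bool = False,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return the waits between attempts: one fewer than the number of attempts."""
    rng = rng or random.Random()
    delays = []
    wait = base
    for _ in range(max_retries - 1):
        # With jitter, pick a random wait up to the current delay so that
        # many clients do not all retry at the same instant
        delays.append(rng.uniform(0, wait) if jitter else wait)
        wait = min(wait * 2, cap)  # double, but never exceed the cap
    return delays


print(backoff_delays())               # [1.0, 2.0, 4.0, 8.0]
print(backoff_delays(max_retries=9))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With the defaults from Pattern 1, a request that keeps failing spends 15 seconds waiting in total before the final error is raised.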

Pattern 2 — Using Tenacity for Cleaner Retry Logic

The tenacity library makes retry logic declarative and much cleaner:

pip install tenacity

from tenacity import (
    retry, stop_after_attempt, wait_exponential, retry_if_exception_type
)
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://aiapiv2.pekpik.com/v1",
    api_key="sk-your-key-here"
)

@retry(
    retry=retry_if_exception_type(RateLimitError),
    wait=wait_exponential(multiplier=1, min=1, max=60),
    stop=stop_after_attempt(6)
)
def call_llm(prompt: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Now just call it — retries happen automatically
result = call_llm("What is the capital of France?")
print(result)
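
If the decorator feels like magic, it can help to see roughly what a declarative retry does under the hood. The following is a standard-library sketch, not tenacity's actual implementation: retry_on, its parameters, and FakeRateLimit are hypothetical names used only for illustration.

```python
import functools
import time


def retry_on(exc_type, max_attempts=6, min_wait=1.0, max_wait=60.0, sleep=time.sleep):
    """Decorator factory: retry the wrapped function whenever exc_type is raised."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = min_wait
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exc_type:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the original error
                    sleep(wait)
                    wait = min(wait * 2, max_wait)
        return wrapper
    return decorator


class FakeRateLimit(Exception):
    """Stand-in for openai.RateLimitError so the demo runs offline."""


calls = {"n": 0}


@retry_on(FakeRateLimit, max_attempts=4, sleep=lambda seconds: None)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeRateLimit()  # fail on the first two calls
    return "ok"


print(flaky())  # ok
```

One difference worth knowing: this sketch re-raises the original exception once attempts run out, whereas tenacity by default raises its own RetryError unless you pass reraise=True to @retry.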

Pattern 3 — Multi-Key Rotation

If you have multiple FreeLLMKeys, rotate through them when one hits a rate limit:

from openai import OpenAI, RateLimitError
import itertools

# Add multiple keys from FreeLLMKeys
API_KEYS = [
    "sk-key-one-here",
    "sk-key-two-here",
    "sk-key-three-here",
]

BASE_URL = "https://aiapiv2.pekpik.com/v1"
key_cycle = itertools.cycle(API_KEYS)
current_key = next(key_cycle)

def get_client():
    return OpenAI(base_url=BASE_URL, api_key=current_key)

def call_with_rotation(prompt: str, model: str = "gpt-4o") -> str:
    global current_key
    for _ in range(len(API_KEYS)):
        try:
            client = get_client()
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.choices[0].message.content
        except RateLimitError:
            print(f"Key {current_key[:12]}... rate limited. Rotating.")
            current_key = next(key_cycle)
    raise RuntimeError("All keys rate limited")

result = call_with_rotation("Explain recursion in one sentence.")
print(result)
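
The version above keeps the active key in a module-level global. If you would rather avoid global state, the same rotation can be wrapped in a small class. This is a sketch under stated assumptions: KeyPool, call_with_pool, and the send callable are hypothetical names, and FakeRateLimit stands in for RateLimitError so the example runs without network access.

```python
import itertools
from typing import Callable, Sequence, Type


class KeyPool:
    """Cycle through API keys, remembering which one is currently active."""

    def __init__(self, keys: Sequence[str]):
        if not keys:
            raise ValueError("at least one key is required")
        self._keys = list(keys)
        self._cycle = itertools.cycle(self._keys)
        self.current = next(self._cycle)

    def rotate(self) -> str:
        """Advance to the next key and return it."""
        self.current = next(self._cycle)
        return self.current

    def __len__(self) -> int:
        return len(self._keys)


def call_with_pool(pool: KeyPool, send: Callable[[str], str],
                   rate_limit_error: Type[Exception]) -> str:
    """Try each key at most once, starting from the pool's current key."""
    for _ in range(len(pool)):
        try:
            return send(pool.current)
        except rate_limit_error:
            pool.rotate()
    raise RuntimeError("All keys rate limited")


class FakeRateLimit(Exception):
    """Stand-in for openai.RateLimitError so the demo runs offline."""


def fake_send(key: str) -> str:
    # Pretend only the third key still has quota
    if key != "sk-key-three-here":
        raise FakeRateLimit(key)
    return f"answered with {key}"


pool = KeyPool(["sk-key-one-here", "sk-key-two-here", "sk-key-three-here"])
print(call_with_pool(pool, fake_send, FakeRateLimit))  # answered with sk-key-three-here
```

In real use, send would build an OpenAI client with the given key and make the chat completion call, and rate_limit_error would be RateLimitError. The pool stays on the last working key, so later calls do not retry keys that were just rejected.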

Pattern 4 — Fallback to a Different Model

When one model is rate limited, another model may still have quota available. Try each model in order of preference and return the first successful response:

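from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://aiapiv2.pekpik.com/v1",
    api_key="sk-your-key-here"
)

# Models to try, in order of preference
FALLBACK_MODELS = ["gpt-4o", "deepseek-chat", "gemini-2.5-flash"]

def call_with_model_fallback(prompt: str) -> str:
    for model in FALLBACK_MODELS:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.choices[0].message.content
        except RateLimitError:
            print(f"Model {model} rate limited. Trying next.")
    # Every model in the list returned a 429
    raise RuntimeError("All models rate limited")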

Logging Rate Limit Hits for Debugging

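import logging
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://aiapiv2.pekpik.com/v1",
    api_key="sk-your-key-here"
)

# Let the logging module add the timestamp to every record
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s"
)
logger = logging.getLogger(__name__)

def logged_call(prompt: str, model: str = "gpt-4o") -> str:
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content
    except RateLimitError as e:
        logger.warning("Rate limit hit on model=%s: %s", model, e)
        raise  # re-raise so callers (or a retry wrapper) can react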

Combine these patterns based on your needs. For most development workflows with FreeLLMKeys, Pattern 1 or 2 is sufficient. Pattern 3 becomes valuable when you are doing batch processing and have multiple keys available.
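
As one example of combining patterns, the sketch below retries each model with exponential backoff before falling back to the next one. It is an illustration under assumptions rather than a drop-in implementation: call_with_retry_and_fallback, the send callable, and FakeRateLimit are hypothetical names, and the fake sender exists only so the example runs offline.

```python
import time
from typing import Callable, Sequence, Type


def call_with_retry_and_fallback(send: Callable[[str], str],
                                 models: Sequence[str],
                                 rate_limit_error: Type[Exception],
                                 retries_per_model: int = 3,
                                 base_wait: float = 1.0,
                                 max_wait: float = 60.0,
                                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry each model with backoff; move to the next model when retries run out."""
    for model in models:
        wait = base_wait  # the backoff restarts for every model
        for attempt in range(retries_per_model):
            try:
                return send(model)
            except rate_limit_error:
                # Sleep only if this model will get another attempt
                if attempt < retries_per_model - 1:
                    sleep(wait)
                    wait = min(wait * 2, max_wait)
    raise RuntimeError("All models rate limited")


class FakeRateLimit(Exception):
    """Stand-in for openai.RateLimitError so the demo runs offline."""


attempts = {"gpt-4o": 0, "deepseek-chat": 0}


def fake_send(model: str) -> str:
    attempts[model] += 1
    # Pretend gpt-4o is always throttled and deepseek-chat recovers after one failure
    if model == "gpt-4o" or attempts[model] < 2:
        raise FakeRateLimit(model)
    return f"{model}-ok"


waits = []
result = call_with_retry_and_fallback(
    fake_send, ["gpt-4o", "deepseek-chat"], FakeRateLimit, sleep=waits.append
)
print(result)  # deepseek-chat-ok
print(waits)   # [1.0, 2.0, 1.0]
```

With the real client, send would be a function that calls client.chat.completions.create for the given model and returns the message content, and rate_limit_error would be RateLimitError.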

FreeLLMKeys Team

Building tools for the AI developer community
