OAuth at Scale: The Architecture of Reliable Token Refreshes
OAuth token management is more than just storage. Learn how Truto handles concurrency, proactive refreshes, and race conditions for 100+ APIs at scale.
If you have ever built a Salesforce integration in a weekend, you know the feeling of triumph when that first 200 OK comes back. You store the access token, maybe toss the refresh token into a database column, and call it a day.
Then, three months later, your error logs light up.
Tokens are expiring mid-sync. Refresh requests are hitting race conditions because two background jobs tried to refresh the same token simultaneously. A customer changed their password, revoking all tokens, and your app is still hammering the API until you get rate-limited.
At Truto, we maintain connections to over 100 different SaaS platforms—from HRIS systems like Workday to CRMs like HubSpot. We process millions of API requests, and every single one relies on a valid, fresh credential.
We learned the hard way that managing OAuth token lifecycles is not a storage problem; it is a distributed systems problem.
Here is the engineering deep dive into how we architected a token refresh system that handles concurrency, proactive renewal, and graceful failure at scale.
The "Happy Path" is a Lie
The OAuth 2.0 spec provides a framework, but every provider implements it with their own chaotic flair.
- Expiry Times: Some tokens last 1 hour, some 24 hours, some never expire until used.
- Refresh Behavior: Some providers rotate the refresh token every time you use it (refresh token rotation). If you fail to capture the new one due to a network blip, you are locked out forever.
- Concurrency: If two processes try to refresh the same token at the exact same second, many providers will invalidate both requests, assuming a replay attack.
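The rotation pitfall is avoidable if the new refresh token is durably persisted before the refresh is treated as successful. A minimal sketch of that ordering — the helper names (callTokenEndpoint, persist) are illustrative, not Truto's actual API:

```typescript
// Sketch: rotation-safe refresh. The new credentials are written to
// durable storage BEFORE the caller is handed the fresh token, so a
// rotated refresh token can never be lost to a crash between steps.
interface TokenSet {
  accessToken: string;
  refreshToken: string;
  expiresAt: number; // Unix ms
}

async function rotateRefresh(
  old: TokenSet,
  callTokenEndpoint: (refreshToken: string) => Promise<TokenSet>,
  persist: (t: TokenSet) => Promise<void>
): Promise<TokenSet> {
  const fresh = await callTokenEndpoint(old.refreshToken);
  // Persist first: if this write fails, surface the error rather than
  // handing out a token whose rotated refresh token was never saved.
  await persist(fresh);
  return fresh;
}
```

If the persist step throws, the safe behavior is to fail the whole refresh loudly rather than continue with credentials that only exist in memory.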
To handle this, we moved away from simple "check and refresh" logic to a multi-layered architecture involving proactive alarms, mutex locks, and self-healing state machines.
Layer 1: The Two-Pronged Refresh Strategy
Waiting for a token to expire before refreshing it is a recipe for latency and failed user requests. Conversely, refreshing it too aggressively wastes API quota. We use a hybrid approach: Proactive Alarms and Just-in-Time (JIT) Checks.
1. Proactive Refresh (The Background Worker)
Whenever a token is created or updated, we schedule a background job (using Cloudflare Durable Objects) to wake up 60 to 180 seconds before the token expires.
Why the randomization? To prevent thundering herds. If 10,000 accounts were connected at 9:00 AM, we don't want 10,000 refresh requests firing exactly at 9:59 AM. Spreading the load ensures stability.
When the alarm fires, it forces a refresh, updates the database, and re-encrypts the new credentials.
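The jittered scheduling can be sketched as a small function; the alarm plumbing is omitted and refreshAlarmTime is an illustrative name, not Truto's actual code:

```typescript
// Sketch: pick a refresh time 60–180 seconds before expiry, with random
// jitter so thousands of tokens created at the same moment do not all
// refresh in the same second. `expiresAtMs` is a Unix timestamp in ms.
function refreshAlarmTime(expiresAtMs: number): number {
  const jitterMs = 60_000 + Math.floor(Math.random() * 120_000); // 60–180s
  return expiresAtMs - jitterMs;
}

// Example: a token expiring in 1 hour gets an alarm ~57–59 minutes out.
const expiresAt = Date.now() + 3_600_000;
const alarmAt = refreshAlarmTime(expiresAt);
```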
2. Just-in-Time Safety Net
We cannot rely solely on background jobs. Clocks drift, alarms fail, and sometimes a token expires faster than the provider claimed.
Before every single API request—whether it's a proxy call or a sync job—our infrastructure checks the token's validity. We use a 30-second buffer logic:
```typescript
// Simplified logic: treat the token as expired if it expires in the next 30s
const BUFFER_MS = 30_000;

if (token.expiresAt < Date.now() + BUFFER_MS) {
  await refreshCredentials();
}
```

This buffer is critical. Without it, a token might be valid when the check runs but expire 100ms later while the request is in flight to Salesforce.
Layer 2: Solving Concurrency with Mutex Locks
This is where most in-house integrations fail.
Imagine a scenario:
- A scheduled sync job starts for Customer A.
- Simultaneously, Customer A triggers a manual "Test Connection" in your UI.
- Both processes see the token is about to expire.
- Both processes fire a request to POST /oauth/token.
The Result: The provider receives two refresh requests. It processes the first, invalidates the old refresh token, and issues a new one. Then it processes the second request (using the now-invalid old refresh token), throws an error, and potentially revokes the entire grant for security reasons. Your customer is now disconnected.
The Solution: Durable Object Mutexes
To prevent this, we wrap the refresh operation in a Mutex (Mutual Exclusion) Lock.
We use a dedicated Durable Object that acts as a serializer for each integrated account. When a refresh is requested:
- The request attempts to acquire a lock for that specific integrated_account_id.
- If no operation is in progress, it proceeds, sets a timeout alarm (e.g., 30s), and executes the refresh.
- If a refresh is already running, the second request simply awaits the promise of the first request.
This ensures that no matter how many concurrent threads need a fresh token, we only send one HTTP request to the provider. Everyone else gets the result of that single successful call.
Why Durable Objects? Unlike Redis locks or database row locking, Durable Objects provide strictly serialized access to in-memory state at the edge, making them incredibly fast and reliable for this specific pattern.
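Outside of Durable Objects, the same single-flight idea can be sketched in plain TypeScript with one shared in-flight promise per account. This is a simplified in-process version, not Truto's implementation; doRefresh stands in for the real token-endpoint call:

```typescript
// Sketch: single-flight refresh. Concurrent callers for the same account
// share one in-flight promise, so only one request ever reaches the
// provider; everyone else awaits that same result.
const inFlight = new Map<string, Promise<string>>();

async function refreshOnce(
  accountId: string,
  doRefresh: () => Promise<string>
): Promise<string> {
  const existing = inFlight.get(accountId);
  if (existing) return existing; // a refresh is already running: join it

  const p = doRefresh().finally(() => inFlight.delete(accountId));
  inFlight.set(accountId, p);
  return p;
}
```

Note that this only serializes within a single process; coordinating across machines is exactly why a strictly serialized actor like a Durable Object (or an equivalent distributed lock) is needed.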
Layer 3: Handling "Invalid Grant" and Re-Auth
Sometimes, refresh fails. The user might have uninstalled the app, changed their password, or an admin might have revoked the token.
When we receive a fatal error (like HTTP 400 invalid_grant or HTTP 401 Unauthorized), retrying is futile. We need to involve the human.
The needs_reauth State Machine
- Detection: We catch
invalid_granterrors specifically. Transient errors (HTTP 500s) trigger a retry with exponential backoff. Auth errors trigger a state change. - State Update: The account status is flipped from
activetoneeds_reauth. - Notification: We fire a webhook event:
integrated_account:authentication_error.
This allows our customers to listen for this event and immediately show a "Reconnect" banner in their UI.
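The detection step boils down to classifying each failure as fatal or transient. A sketch of that decision as a small pure function (the names and return shape are illustrative):

```typescript
// Sketch: classify a failed refresh. Fatal auth errors flip the account
// to needs_reauth and must not be retried; 5xx errors are transient and
// should be retried with exponential backoff.
type AccountStatus = "active" | "needs_reauth";

function classifyRefreshFailure(
  httpStatus: number,
  errorCode?: string
): { status: AccountStatus; retry: boolean } {
  const fatal =
    httpStatus === 401 ||
    (httpStatus === 400 && errorCode === "invalid_grant");
  if (fatal) return { status: "needs_reauth", retry: false };
  if (httpStatus >= 500) return { status: "active", retry: true };
  return { status: "active", retry: false };
}
```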
Auto-Reactivation
We also support auto-reactivation. If an account is in needs_reauth but a subsequent API call succeeds (perhaps the user fixed the issue on the provider side, or it was a false positive from the API), we automatically flip the status back to active and fire a integrated_account:reactivated event.
Security at Rest
Storing thousands of access and refresh tokens requires paranoia. As part of our strict security standards, we never store tokens in plain text.
- Encryption: All sensitive fields (access_token, refresh_token, client_secret) are encrypted using AES-GCM before hitting the database.
- Masking: When developers list accounts via our API, these fields are masked. They are only decrypted inside the secure enclave of the refresh service just before being used.
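For illustration, here is what AES-256-GCM encryption at rest can look like using Node's built-in crypto module. This is a sketch, not Truto's implementation: key management (KMS, rotation) is out of scope, and the helper names are ours.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Sketch: AES-256-GCM with a random 96-bit IV per encryption. The IV and
// auth tag are stored alongside the ciphertext (iv ‖ tag ‖ ct), base64-encoded.
function encryptToken(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // must be unique per encryption under one key
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag(); // 16-byte integrity tag
  return Buffer.concat([iv, tag, ct]).toString("base64");
}

function decryptToken(encoded: string, key: Buffer): string {
  const buf = Buffer.from(encoded, "base64");
  const iv = buf.subarray(0, 12);
  const tag = buf.subarray(12, 28);
  const ct = buf.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag); // decryption throws if ciphertext was tampered with
  return Buffer.concat([decipher.update(ct), decipher.final()]).toString("utf8");
}
```

GCM's auth tag is what makes this mode a good fit for credentials: a flipped bit in the stored ciphertext fails decryption outright instead of yielding a corrupted token.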
Summary: The Checklist for Reliable Auth
If you are building this in-house, ensure your architecture covers these bases:
- Buffer your expiry checks: Don't wait for 0 seconds remaining.
- Serialize your refreshes: Never let two threads refresh the same token.
- Handle revocation gracefully: Distinguish between "API is down" (retry) and "Token is dead" (alert user).
- Secure your storage: Encrypt at rest, always.
Or, you can offload this entirely. At Truto, we treat authentication infrastructure as a core product, so you can focus on the data, not the handshake.
FAQ
- How often should I refresh an OAuth token?
- Ideally, refresh tokens proactively 60-180 seconds before they expire. This provides a safety buffer for network latency and retries without wasting API limits.
- What is the 'thundering herd' problem in OAuth?
- It happens when many refresh requests fire at the same moment — for example, thousands of tokens connected at the same time all expiring together, or multiple processes refreshing the same token simultaneously. Randomizing refresh timing and serializing per-account refreshes prevents the provider from being flooded or invalidating tokens.
- How do I handle 'invalid_grant' errors?
- An 'invalid_grant' usually means the refresh token is revoked or expired. You cannot retry this; you must mark the account as disconnected and ask the user to re-authenticate.
- Should I encrypt OAuth tokens in the database?
- Yes. Always encrypt access and refresh tokens at rest (e.g., using AES-GCM) to prevent credential leakage if your database is compromised.