OAuth at Scale: The Architecture of Reliable Token Refreshes
OAuth token management is more than just storage. Learn how Truto handles concurrency, proactive refreshes, and race conditions for 100+ APIs at scale.
If you have ever built a Salesforce integration in a weekend, you know the feeling of triumph when that first 200 OK comes back. You store the access token, maybe toss the refresh token into a database column, and call it a day.
Then, three months later, your error logs light up.
Tokens are expiring mid-sync. Refresh requests are hitting race conditions because two background jobs tried to refresh the same token simultaneously. A customer changed their password, revoking all tokens, and your app is still hammering the API until you get rate-limited.
At Truto, we maintain connections to over 100 different SaaS platforms—from HRIS systems like Workday to CRMs like HubSpot. We process millions of API requests, and every single one relies on a valid, fresh credential.
We learned the hard way that managing OAuth token lifecycles is not a storage problem; it is a distributed systems problem.
Here is the engineering deep dive into how we architected a token refresh system that handles concurrency, proactive renewal, and graceful failure at scale.
The "Happy Path" is a Lie
The OAuth 2.0 spec provides a framework, but every provider implements it with their own chaotic flair.
- Expiry Times: Some tokens last 1 hour, some 24 hours, some never expire until used.
- Refresh Behavior: Some providers rotate the refresh token every time you use it (refresh token rotation). If you fail to capture the new one due to a network blip, you are locked out forever.
- Concurrency: If two processes try to refresh the same token at the exact same second, many providers will invalidate both requests, assuming a replay attack.
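The rotation pitfall is avoidable if the new refresh token is durably persisted before the refresh is treated as successful. A minimal sketch of that ordering — the helper names (callTokenEndpoint, persist) are illustrative, not Truto's actual API:

```typescript
// Sketch: rotation-safe refresh. The new credentials are written to
// durable storage BEFORE the caller is handed the fresh token, so a
// rotated refresh token can never be lost to a crash between steps.
interface TokenSet {
  accessToken: string;
  refreshToken: string;
  expiresAt: number; // Unix ms
}

async function rotateRefresh(
  old: TokenSet,
  callTokenEndpoint: (refreshToken: string) => Promise<TokenSet>,
  persist: (t: TokenSet) => Promise<void>
): Promise<TokenSet> {
  const fresh = await callTokenEndpoint(old.refreshToken);
  // Persist first: if this write fails, surface the error rather than
  // handing out a token whose rotated refresh token was never saved.
  await persist(fresh);
  return fresh;
}
```

If the persist step throws, the safe behavior is to fail the whole refresh loudly rather than continue with credentials that only exist in memory.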
To handle this, we moved away from simple "check and refresh" logic to a multi-layered architecture involving proactive alarms, mutex locks, and self-healing state machines.
Layer 1: The Two-Pronged Refresh Strategy
Waiting for a token to expire before refreshing it is a recipe for latency and failed user requests. Conversely, refreshing it too aggressively wastes API quota. We use a hybrid approach: Proactive Alarms and Just-in-Time (JIT) Checks.
1. Proactive Refresh (The Background Worker)
Whenever a token is created or updated, we schedule a background job (using Cloudflare Durable Objects) to wake up 60 to 180 seconds before the token expires.
Why the randomization? To prevent thundering herds. If 10,000 accounts were connected at 9:00 AM, we don't want 10,000 refresh requests firing exactly at 9:59 AM. Spreading the load ensures stability.
When the alarm fires, it forces a refresh, updates the database, and re-encrypts the new credentials.
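The jittered scheduling can be sketched as a small function; the alarm plumbing is omitted and refreshAlarmTime is an illustrative name, not Truto's actual code:

```typescript
// Sketch: pick a refresh time 60–180 seconds before expiry, with random
// jitter so thousands of tokens created at the same moment do not all
// refresh in the same second. `expiresAtMs` is a Unix timestamp in ms.
function refreshAlarmTime(expiresAtMs: number): number {
  const jitterMs = 60_000 + Math.floor(Math.random() * 120_000); // 60–180s
  return expiresAtMs - jitterMs;
}

// Example: a token expiring in 1 hour gets an alarm ~57–59 minutes out.
const expiresAt = Date.now() + 3_600_000;
const alarmAt = refreshAlarmTime(expiresAt);
```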
2. Just-in-Time Safety Net
We cannot rely solely on background jobs. Clocks drift, alarms fail, and sometimes a token expires faster than the provider claimed.
Before every single API request—whether it's a proxy call or a sync job—our infrastructure checks the token's validity. We use a 30-second buffer logic:
```typescript
// Simplified logic: treat the token as expired if it expires in the next 30s
const BUFFER_MS = 30_000;

if (token.expiresAt < Date.now() + BUFFER_MS) {
  await refreshCredentials();
}
```

This buffer is critical. Without it, a token might be valid when the check runs but expire 100ms later while the request is in flight to Salesforce.
Layer 2: Solving Concurrency with Mutex Locks
This is where most in-house integrations fail.
Imagine a scenario:
- A scheduled sync job starts for Customer A.
- Simultaneously, Customer A triggers a manual "Test Connection" in your UI.
- Both processes see the token is about to expire.
- Both processes fire a request to POST /oauth/token.
The Result: The provider receives two refresh requests. It processes the first, invalidates the old refresh token, and issues a new one. Then it processes the second request (using the now-invalid old refresh token), throws an error, and potentially revokes the entire grant for security reasons. Your customer is now disconnected.
The Solution: Durable Object Mutexes
To prevent this, we wrap the refresh operation in a Mutex (Mutual Exclusion) Lock.
We use a dedicated Durable Object that acts as a serializer for each integrated account. When a refresh is requested:
- The request attempts to acquire a lock for that specific integrated_account_id.
- If no operation is in progress, it proceeds, sets a timeout alarm (e.g., 30s), and executes the refresh.
- If a refresh is already running, the second request simply awaits the promise of the first request.
This ensures that no matter how many concurrent threads need a fresh token, we only send one HTTP request to the provider. Everyone else gets the result of that single successful call.
Why Durable Objects? Unlike Redis locks or database row locking, Durable Objects provide strictly serialized access to in-memory state at the edge, making them incredibly fast and reliable for this specific pattern.
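Outside of Durable Objects, the same single-flight idea can be sketched in plain TypeScript with one shared in-flight promise per account. This is a simplified in-process version, not Truto's implementation; doRefresh stands in for the real token-endpoint call:

```typescript
// Sketch: single-flight refresh. Concurrent callers for the same account
// share one in-flight promise, so only one request ever reaches the
// provider; everyone else awaits that same result.
const inFlight = new Map<string, Promise<string>>();

async function refreshOnce(
  accountId: string,
  doRefresh: () => Promise<string>
): Promise<string> {
  const existing = inFlight.get(accountId);
  if (existing) return existing; // a refresh is already running: join it

  const p = doRefresh().finally(() => inFlight.delete(accountId));
  inFlight.set(accountId, p);
  return p;
}
```

Note that this only serializes within a single process; coordinating across machines is exactly why a strictly serialized actor like a Durable Object (or an equivalent distributed lock) is needed.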
Layer 3: Handling "Invalid Grant" and Re-Auth
Sometimes, refresh fails. The user might have uninstalled the app, changed their password, or an admin might have revoked the token.
When we receive a fatal error (like HTTP 400 invalid_grant or HTTP 401 Unauthorized), retrying is futile. We need to involve the human.
The needs_reauth State Machine
- Detection: We catch
invalid_granterrors specifically. Transient errors (HTTP 500s) trigger a retry with exponential backoff. Auth errors trigger a state change. - State Update: The account status is flipped from
activetoneeds_reauth. - Notification: We fire a webhook event:
integrated_account:authentication_error.
This allows our customers to listen for this event and immediately show a "Reconnect" banner in their UI.
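The detection step boils down to classifying each failure as fatal or transient. A sketch of that decision as a small pure function (the names and return shape are illustrative):

```typescript
// Sketch: classify a failed refresh. Fatal auth errors flip the account
// to needs_reauth and must not be retried; 5xx errors are transient and
// should be retried with exponential backoff.
type AccountStatus = "active" | "needs_reauth";

function classifyRefreshFailure(
  httpStatus: number,
  errorCode?: string
): { status: AccountStatus; retry: boolean } {
  const fatal =
    httpStatus === 401 ||
    (httpStatus === 400 && errorCode === "invalid_grant");
  if (fatal) return { status: "needs_reauth", retry: false };
  if (httpStatus >= 500) return { status: "active", retry: true };
  return { status: "active", retry: false };
}
```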
Auto-Reactivation
We also support auto-reactivation. If an account is in needs_reauth but a subsequent API call succeeds (perhaps the user fixed the issue on the provider side, or it was a false positive from the API), we automatically flip the status back to active and fire a integrated_account:reactivated event.
Security at Rest
Storing thousands of access and refresh tokens requires paranoia. As part of our strict security standards, we never store tokens in plain text.
- Encryption: All sensitive fields (access_token, refresh_token, client_secret) are encrypted using AES-GCM before hitting the database.
- Masking: When developers list accounts via our API, these fields are masked. They are only decrypted inside the secure enclave of the refresh service just before being used.
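For illustration, here is what AES-256-GCM encryption at rest can look like using Node's built-in crypto module. This is a sketch, not Truto's implementation: key management (KMS, rotation) is out of scope, and the helper names are ours.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Sketch: AES-256-GCM with a random 96-bit IV per encryption. The IV and
// auth tag are stored alongside the ciphertext (iv ‖ tag ‖ ct), base64-encoded.
function encryptToken(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // must be unique per encryption under one key
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag(); // 16-byte integrity tag
  return Buffer.concat([iv, tag, ct]).toString("base64");
}

function decryptToken(encoded: string, key: Buffer): string {
  const buf = Buffer.from(encoded, "base64");
  const iv = buf.subarray(0, 12);
  const tag = buf.subarray(12, 28);
  const ct = buf.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag); // decryption throws if ciphertext was tampered with
  return Buffer.concat([decipher.update(ct), decipher.final()]).toString("utf8");
}
```

GCM's auth tag is what makes this mode a good fit for credentials: a flipped bit in the stored ciphertext fails decryption outright instead of yielding a corrupted token.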
Summary: The Checklist for Reliable Auth
If you are building this in-house, ensure your architecture covers these bases:
- Buffer your expiry checks: Don't wait for 0 seconds remaining.
- Serialize your refreshes: Never let two threads refresh the same token.
- Handle revocation gracefully: Distinguish between "API is down" (retry) and "Token is dead" (alert user).
- Secure your storage: Encrypt at rest, always.
Or, you can offload this entirely. At Truto, we treat authentication infrastructure as a core product, so you can focus on the data, not the handshake.
FAQ
- How often should I refresh an OAuth token?
- Ideally, refresh tokens proactively 60-180 seconds before they expire. This provides a safety buffer for network latency and retries without wasting API limits.
- What is the 'thundering herd' problem in OAuth?
- It happens when many refresh requests fire at the same moment — for example, thousands of tokens connected at the same time all expiring together, or multiple processes refreshing the same token simultaneously. Randomizing refresh timing and serializing per-account refreshes prevents the provider from being flooded or invalidating tokens.
- How do I handle 'invalid_grant' errors?
- An 'invalid_grant' usually means the refresh token is revoked or expired. You cannot retry this; you must mark the account as disconnected and ask the user to re-authenticate.
- Should I encrypt OAuth tokens in the database?
- Yes. Always encrypt access and refresh tokens at rest (e.g., using AES-GCM) to prevent credential leakage if your database is compromised.