Connect Lobstr to AI Agents: Orchestrate Tasks and Scraped Exports
Learn how to connect Lobstr to AI agents using Truto's /tools endpoint. Build autonomous workflows to orchestrate web crawlers, execute runs, and extract data.
You want to connect Lobstr to an AI agent so your system can autonomously deploy web crawlers, orchestrate scraping tasks, and extract structured data exports without human intervention. Here is exactly how to do it using Truto's /tools endpoint and SDK, bypassing the need to hand-roll and maintain massive API wrappers for web automation platforms.
Giving a Large Language Model (LLM) read and write access to a web scraping infrastructure is an engineering headache. You either spend weeks building, hosting, and maintaining custom connectors that translate JSON tool calls into complex scraping pipeline configurations, or you use a managed infrastructure layer that handles the boilerplate for you. If your team uses ChatGPT, check out our guide on connecting Lobstr to ChatGPT, or if you are building on Anthropic's models, read our guide on connecting Lobstr to Claude. For developers building custom autonomous workflows, you need a programmatic way to fetch these tools and bind them to your agent framework.
This guide breaks down exactly how to fetch AI-ready tools for Lobstr, bind them natively to an LLM using LangChain (or any framework like LangGraph, CrewAI, or Vercel AI SDK), and execute complex data extraction workflows. For a deeper look at the architecture behind this approach, refer to our research on architecting AI agents and the SaaS integration bottleneck.
The Engineering Reality of Custom Lobstr Connectors
Building AI agents is easy. Connecting them to external SaaS APIs that operate asynchronously is hard. If you are evaluating the best integration platforms for LangChain and LlamaIndex, you'll find that giving an LLM access to external data sounds simple in a prototype: you write a Node.js function that makes a fetch request and wrap it with LangChain's tool() helper. In production, this approach breaks down.
If you decide to build a custom integration for Lobstr, you own the entire API lifecycle. Lobstr's platform introduces several specific integration challenges that break standard LLM assumptions.
The Crawler-Squid-Task Hierarchy
Unlike standard CRUD applications where an agent can simply call POST /scrape-url, Lobstr enforces a strict architectural hierarchy. You cannot just command the system to scrape a URL. An AI agent must understand that it must first discover a Crawler (the script logic), instantiate a Squid (the configuration container linked to that crawler), inject Tasks (the target URLs or search queries) into the Squid, and finally trigger a Run.
If you hand-code these tools, you have to write incredibly complex system prompts to force the LLM to follow this exact sequential dependency. The LLM will frequently hallucinate and try to add tasks to a crawler directly, bypassing the squid entirely. Truto normalizes this by exposing precisely scoped tool descriptions that guide the LLM through the correct sequential pipeline.
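Tool descriptions and prompts can still be backed up by a check in your own executor. The following is a minimal sketch, not part of Truto's SDK: createHierarchyGuard is a hypothetical helper that rejects a pipeline tool call when the previous step has not completed. It uses the tool names listed later in this guide and tracks a single pipeline per guard instance, which is a simplifying assumption.

```javascript
// Order in which the Lobstr pipeline tools must be called.
const PIPELINE_ORDER = [
  "list_all_lobstr_crawlers",
  "create_a_lobstr_squid",
  "create_a_lobstr_task",
  "create_a_lobstr_run",
];

// Hypothetical helper: tracks completed pipeline steps and rejects a tool
// call whose prerequisite step has not run yet.
function createHierarchyGuard() {
  const completed = new Set();
  return {
    check(toolName) {
      const index = PIPELINE_ORDER.indexOf(toolName);
      // The first step and tools outside the pipeline are always allowed.
      if (index <= 0) return { ok: true };
      const prerequisite = PIPELINE_ORDER[index - 1];
      if (completed.has(prerequisite)) return { ok: true };
      return { ok: false, error: `Call ${prerequisite} before ${toolName}.` };
    },
    markDone(toolName) {
      if (PIPELINE_ORDER.includes(toolName)) completed.add(toolName);
    },
  };
}
```

Returning the error string to the model as the tool result, instead of throwing, gives it a chance to correct the order on its next turn.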
Asynchronous Execution and Polling Loops
Web scraping takes time. When an agent triggers a run in Lobstr, it does not get the scraped data back in the immediate HTTP response. It receives a run ID and a status indicating the run has started.
Naive agent implementations fail here. An LLM might assume the run is finished and fetch results immediately, getting back empty arrays. Or worse, it might loop indefinitely, calling the results endpoint hundreds of times per minute, burning tokens and exhausting your API rate limit. To solve this, your tools must expose explicit polling mechanisms that let the LLM check the percent_done or eta fields, and you must build guardrails that force the agent to sleep or yield execution until the async job completes (a pattern we cover in our guide on handling long-running SaaS API tasks).
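One way to implement such a guardrail is a bounded polling helper that lives in your application rather than in the prompt. This is a minimal sketch under stated assumptions: pollUntilDone and getRunStats are hypothetical names, getRunStats is any async function you supply that resolves to an object with is_done and eta, and eta is assumed to be in seconds.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Polls a run until it reports is_done, waiting between checks and giving up
// after maxAttempts so the loop can never run forever.
async function pollUntilDone(getRunStats, options = {}) {
  const { maxAttempts = 30, minDelayMs = 2000, maxDelayMs = 60000, wait = sleep } = options;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const stats = await getRunStats();
    if (stats.is_done) return stats;
    // Assumption: eta is in seconds. Wait about half the ETA, clamped so the
    // loop neither spins nor stalls for too long.
    const etaMs = Number(stats.eta) * 1000;
    const delay = Number.isFinite(etaMs) && etaMs > 0 ? etaMs / 2 : minDelayMs;
    await wait(Math.min(Math.max(delay, minDelayMs), maxDelayMs));
  }
  throw new Error(`Run not finished after ${maxAttempts} polls`);
}
```

Keeping the wait inside the helper means the model spends no tokens on intermediate status checks; it only sees the final statistics or a timeout error.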
Strict Credit Tracking and HTTP 429 Handling
Every row scraped in Lobstr consumes credits, and different crawlers have different credits_per_row costs. If you give an autonomous agent unconstrained access to queue tasks, it can easily exhaust your credit balance on a poorly scoped search query.
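A simple pre-flight estimate can catch the worst cases before any task is queued. The sketch below is illustrative only: checkCreditBudget is a hypothetical helper, and expectedRows is your own estimate, since the real row count is not known until the run finishes.

```javascript
// Estimates the credit cost of a job and checks it against a budget.
// creditsPerRow comes from the crawler listing; expectedRows is the caller's
// estimate; maxSpend is an optional per-job cap chosen by the caller.
function checkCreditBudget({ creditsPerRow, expectedRows, availableCredits, maxSpend }) {
  const estimatedCost = creditsPerRow * expectedRows;
  const limit = Math.min(availableCredits, maxSpend ?? Infinity);
  return { estimatedCost, allowed: estimatedCost <= limit };
}
```

For example, 300 expected rows at 2 credits per row is an estimated 600 credits, which would be refused against an available balance of 500.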
Furthermore, when polling or executing heavy workloads, you will eventually hit API rate limits. Truto does not retry, throttle, or apply backoff on rate limit errors. When the upstream Lobstr API returns an HTTP 429, Truto passes that error directly to the caller. What Truto does do is normalize the upstream rate limit information into standardized headers (ratelimit-limit, ratelimit-remaining, ratelimit-reset) per the IETF specification.
Your agent framework is entirely responsible for reading these headers, pausing execution, and retrying. If you do not surface these 429 errors cleanly to the LLM or handle them at the framework level, the agent will hallucinate success and break the workflow.
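At the framework level, this can be a small retry wrapper around each tool invocation. The following is a sketch under assumptions, not Truto's API: resetDelayMs and withRateLimitRetry are hypothetical helpers, the thrown error is assumed to expose status and a Headers-like headers object, and ratelimit-reset is assumed to hold a number of seconds.

```javascript
// Converts a ratelimit-reset header value into a wait time in milliseconds.
// Assumption: the value is the number of seconds until the limit resets.
function resetDelayMs(headerValue, fallbackSeconds = 5) {
  const seconds = Number.parseFloat(headerValue);
  const safe = Number.isFinite(seconds) && seconds >= 0 ? seconds : fallbackSeconds;
  return Math.ceil(safe * 1000);
}

// Retries an async call when it fails with status 429, up to maxRetries times.
// Any other error, or a 429 after the last retry, is rethrown to the caller.
async function withRateLimitRetry(call, { maxRetries = 3, wait } = {}) {
  const pause = wait ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (error) {
      if (error?.status !== 429 || attempt >= maxRetries) throw error;
      await pause(resetDelayMs(error.headers?.get?.("ratelimit-reset")));
    }
  }
}
```

Capping the retries matters: once the cap is reached the error surfaces to the agent loop, which can then report the failure to the LLM instead of silently stalling.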
Hero Tools for Lobstr
To safely orchestrate Lobstr, your agent does not need access to every single endpoint in the platform. It needs a curated set of high-leverage operations. Below are the core proxy tools Truto provides for Lobstr that enable end-to-end autonomous scraping.
Discover Available Crawlers
Before an agent can scrape anything, it needs to know what scripts are available in the platform.
Tool Name: list_all_lobstr_crawlers
Description: List all available crawlers on the lobstr platform. Returns standard IDs, names, availability status, and the credits_per_row cost for each crawler.
Context: Use this to allow the agent to map a user's natural language request (e.g., "scrape Google Maps") to a specific Lobstr crawler ID.
"Fetch the list of all available crawlers in our Lobstr account and find the ID for the LinkedIn company profile scraper. Note the cost per row."
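If the crawler list is long, you may want to narrow it in code before handing it to the model. This is a rough sketch with a hypothetical helper, findCrawlers, assuming each crawler object carries the id, name, and credits_per_row fields named in the tool description; the actual response shape may differ.

```javascript
// Returns the crawlers whose name contains every word of the search phrase,
// reduced to the three fields the agent needs for the next step.
function findCrawlers(crawlers, phrase) {
  const words = phrase.toLowerCase().split(/\s+/).filter(Boolean);
  return crawlers
    .filter((c) => words.every((w) => String(c.name).toLowerCase().includes(w)))
    .map(({ id, name, credits_per_row }) => ({ id, name, credits_per_row }));
}
```

Plain substring matching is deliberately naive here; it only trims the candidate list, and the LLM still makes the final choice.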
Instantiate a Configuration Squid
Once the crawler is identified, the agent must create an isolated container for the upcoming job.
Tool Name: create_a_lobstr_squid
Description: Create a new squid (scraping job configuration) in lobstr. Returns the created squid object including its required id.
Context: The agent must call this to prepare a workspace before adding any target URLs.
"Create a new Squid named 'Competitor Landscape Q3' using the Google Search crawler ID you just found."
Queue Target Tasks
With a Squid created, the agent injects the actual work to be done.
Tool Name: create_a_lobstr_task
Description: Create lobstr tasks by adding URLs or search queries to a specific squid for scraping. Returns a list of queued tasks and a duplicated_count.
Context: The LLM must pass the squid ID and an array of tasks. This is where the agent maps its internal intelligence (e.g., a list of target companies) into actionable scraping inputs.
"Add these five competitor website URLs as tasks to the 'Competitor Landscape Q3' squid."
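Because the response reports a duplicated_count, it can be worth cleaning the input list before the call so the agent does not waste a round trip on blanks and repeats. A minimal sketch, assuming task inputs are plain strings; prepareTaskInputs is a hypothetical helper and the argument shape the tool itself expects is not shown here.

```javascript
// Removes blank entries and duplicates from a list of task inputs.
// Two inputs count as duplicates if they differ only by surrounding
// whitespace or trailing slashes; the first occurrence is kept as written.
function prepareTaskInputs(inputs) {
  const seen = new Set();
  const unique = [];
  for (const raw of inputs) {
    const value = String(raw).trim();
    if (!value) continue;
    const key = value.replace(/\/+$/, "");
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(value);
  }
  return unique;
}
```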
Trigger the Scraping Run
Queueing tasks does not start them. The agent must explicitly trigger execution.
Tool Name: create_a_lobstr_run
Description: Create a new run in lobstr for a specific squid. Returns the run ID, initial status, and boolean is_done flag.
Context: Instruct the agent to save the resulting run ID, as it will be required to poll for completion.
"Start a new run for the 'Competitor Landscape Q3' squid and give me the run ID."
Poll Run Statistics
Because scraping is asynchronous, the agent needs a way to check progress without guessing.
Tool Name: get_single_lobstr_run_stat_by_id
Description: Get detailed real-time statistics for a specific lobstr run using its hash. Returns percent_done, total_tasks_done, duration, eta, and is_done.
Context: This tool is critical for preventing endless loops. Instruct your LLM: "Do not fetch results until you call this tool and is_done is true."
"Check the status of run ID 'abc-123'. If it is not done, tell me the ETA and current percentage."
Fetch Scraped Results
Once the polling tool confirms completion, the agent retrieves the payload.
Tool Name: list_all_lobstr_results
Description: List scraped data results from lobstr runs, queryable per squid. Returns an array of scraped data rows where fields are specific to the chosen crawler.
Context: The agent can use this data to perform subsequent analysis, summarize findings, or pipe it into another SaaS tool via Truto.
"The run is complete. Fetch the results from the squid and summarize the pricing tiers found on the competitor pages."
Monitor Account Balance
To prevent runaway costs, the agent should audit available credits.
Tool Name: get_single_lobstr_balance_by_id
Description: Get the current lobstr account balance and credit information. Returns available, consumed, used_slots, and has_unpaid_bill.
Context: A safety tool. You can build agent guardrails that force the LLM to check this balance before triggering any run that contains more than 100 tasks.
"Check our Lobstr credit balance. Do we have at least 500 available credits before I execute this bulk Amazon product scrape?"
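The 100-task guardrail described above could be enforced in the executor rather than left to the prompt. The sketch below is one possible shape, with a hypothetical createRunGate helper; how you count the queued tasks for a squid is left to your application.

```javascript
// Hypothetical guard: runs above the task threshold are blocked until the
// balance tool has been called at least once in the current session.
function createRunGate({ taskThreshold = 100 } = {}) {
  let balanceChecked = false;
  return {
    // Call this for every tool the agent invokes.
    recordToolCall(toolName) {
      if (toolName === "get_single_lobstr_balance_by_id") balanceChecked = true;
    },
    // Call this before executing create_a_lobstr_run.
    canStartRun(taskCount) {
      return taskCount <= taskThreshold || balanceChecked;
    },
  };
}
```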
For the complete inventory of available methods, schemas, and parameter definitions, visit the Lobstr integration page.
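To keep the agent limited to the curated set above, you can filter the fetched tools by name before binding them. A minimal sketch, assuming each tool object exposes a name property (as the execution loop later in this guide also assumes); selectHeroTools and HERO_TOOL_NAMES are hypothetical names.

```javascript
// The seven tools described in this section.
const HERO_TOOL_NAMES = new Set([
  "list_all_lobstr_crawlers",
  "create_a_lobstr_squid",
  "create_a_lobstr_task",
  "create_a_lobstr_run",
  "get_single_lobstr_run_stat_by_id",
  "list_all_lobstr_results",
  "get_single_lobstr_balance_by_id",
]);

// Keeps only the tools whose name is on the allowlist.
function selectHeroTools(allTools, allowed = HERO_TOOL_NAMES) {
  return allTools.filter((tool) => allowed.has(tool.name));
}
```

A smaller toolset also shortens the schema payload sent to the model on every turn.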
Workflows in Action
Exposing individual tools to an LLM is only useful if the model can chain them together to solve complex problems. Here are two real-world operational workflows an AI agent can execute autonomously using the Lobstr toolset.
Scenario 1: Autonomous Competitor Pricing Scrape
Market research teams often need to pull fresh pricing data from competitor websites, but doing so manually requires navigating the Lobstr UI, configuring jobs, and downloading CSVs. An AI agent can orchestrate this entire pipeline from a single natural language command.
"Find a crawler suitable for extracting generic website text. Create a squid for it, add these three URLs (competitorA.com/pricing, competitorB.com/pricing, competitorC.com/pricing), run the scraper, and once it is finished, summarize their pricing plans for me."
Execution Steps:
- The agent calls list_all_lobstr_crawlers and filters the response to find a crawler ID matching "website text extraction" or "generic scraper".
- The agent calls create_a_lobstr_squid using the discovered crawler ID, generating a new squid_hash.
- The agent calls create_a_lobstr_task, passing the squid_hash and the array of three target URLs.
- The agent calls create_a_lobstr_run to execute the queue, receiving a run_hash.
- The agent enters a loop, calling get_single_lobstr_run_stat_by_id. It observes is_done: false and eta: 45. It waits, then polls again until is_done: true.
- The agent calls list_all_lobstr_results using the squid_hash, reads the scraped HTML/text data, and generates a clean pricing summary for the user.
Result: The user receives a structured, comparative analysis of competitor pricing directly in their chat interface, while the agent seamlessly handled the asynchronous infrastructure orchestration in the background.
Scenario 2: Credit Audit and Run Abort
When multiple teams use a shared Lobstr environment, long-running jobs can unexpectedly drain shared API credits. A DevOps or Ops agent can monitor ongoing jobs and kill runaway processes.
"Check our Lobstr balance. If we have consumed more than 80% of our monthly credits, find any active runs that are under 50% complete and abort them immediately to save credits."
Execution Steps:
- The agent calls get_single_lobstr_balance_by_id to retrieve the available and consumed credit metrics.
- It calculates the consumption ratio. If the threshold is breached, it calls list_all_lobstr_squids to get active squids, then list_all_lobstr_runs for those squids to find active runs.
- For any active run, it calls get_single_lobstr_run_stat_by_id to inspect percent_done.
- If a run is large and less than 50% done, the agent calls create_a_lobstr_run_abort using the run_hash, immediately stopping the data collection.
Result: The agent acts as an autonomous cost-control mechanism. The user gets a confirmation message listing the exact runs that were terminated and how many credits were salvaged.
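The decision logic in this scenario is simple enough to run deterministically in code instead of trusting the model's arithmetic. This is a sketch under assumptions: selectRunsToAbort is a hypothetical helper, available plus consumed is taken as the total allowance, percent_done is assumed to be on a 0 to 100 scale, and each run object is assumed to carry an id alongside its statistics.

```javascript
// Returns the ids of runs to abort. Nothing is returned unless credit
// consumption exceeds the threshold, and only unfinished runs below the
// completion cutoff are selected.
function selectRunsToAbort(balance, runs, { maxConsumedRatio = 0.8, minPercentDone = 50 } = {}) {
  const total = balance.available + balance.consumed;
  if (total <= 0) return [];
  const consumedRatio = balance.consumed / total;
  if (consumedRatio <= maxConsumedRatio) return [];
  return runs
    .filter((run) => !run.is_done && run.percent_done < minPercentDone)
    .map((run) => run.id);
}
```

The agent can then call the abort tool once per returned id and report the list back to the user.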
Building Multi-Step Workflows
To build these multi-step workflows, you need to connect Truto's proxy APIs to your agent framework. Below is the architectural approach for binding these tools using LangChain.js, though the exact same pattern applies to LangGraph, CrewAI, or the Vercel AI SDK.
When architecting this execution loop, account for rate limits. Truto passes upstream HTTP 429s directly to your application, so your agent loop must catch these errors, read the standard ratelimit-reset header, and wait before retrying.
sequenceDiagram
    participant App as Agent Application
    participant LLM as LLM (OpenAI/Anthropic)
    participant Truto as Truto Proxy
    participant Lobstr as Lobstr API
    App->>LLM: System Prompt + User Query + Tool Schemas
    LLM-->>App: Tool Call (create_a_lobstr_run)
    App->>Truto: POST /proxy/lobstr/runs
    Truto->>Lobstr: POST /v1/runs
    Lobstr-->>Truto: 429 Too Many Requests
    Truto-->>App: 429 Error (ratelimit-reset: 10)
    Note over App: App catches error, reads header, sleeps 10s
    App->>Truto: POST /proxy/lobstr/runs (Retry)
    Truto->>Lobstr: POST /v1/runs
    Lobstr-->>Truto: 200 OK (Run ID)
    Truto-->>App: Normalized JSON Response
    App->>LLM: Tool Result (Run ID)
    LLM-->>App: Final summarized response

Here is how you implement this in code. First, initialize the Truto toolset and bind it to your LLM.
import { ChatOpenAI } from "@langchain/openai";
import { TrutoToolManager } from "truto-langchainjs-toolset";

// 1. Initialize the LLM
const llm = new ChatOpenAI({
  modelName: "gpt-4o",
  temperature: 0,
});

// 2. Initialize the Truto Tool Manager with your Truto API key
const trutoManager = new TrutoToolManager({
  apiKey: process.env.TRUTO_API_KEY,
});

// 3. Fetch all Lobstr proxy tools dynamically for your Lobstr integrated account
const lobstrTools = await trutoManager.getTools("your_lobstr_integrated_account_id");

// 4. Bind the tools to the LLM natively
const llmWithTools = llm.bindTools(lobstrTools);

Next, you build the execution loop. Because Lobstr operations are complex, your system prompt must provide explicit instructions on the hierarchy and how to poll. Furthermore, your framework executor must handle the 429 errors.
import { SystemMessage, HumanMessage } from "@langchain/core/messages";
const systemPrompt = new SystemMessage(`
You are an autonomous scraping orchestrator managing Lobstr.
CRITICAL RULES:
1. HIERARCHY: You must find a crawler, create a squid, add tasks to the squid, and then start a run. You cannot skip steps.
2. ASYNC POLLING: After starting a run, you MUST use get_single_lobstr_run_stat_by_id to check the status.
3. Do NOT attempt to fetch results until 'is_done' is true.
`);
const messages = [systemPrompt, new HumanMessage("Scrape these URLs...")];
// Simplified execution loop demonstrating rate limit handling concept
async function executeAgentLoop(messages) {
  while (true) {
    const response = await llmWithTools.invoke(messages);
    messages.push(response);

    if (!response.tool_calls || response.tool_calls.length === 0) {
      // The LLM has finished its task and provided a final answer
      return response.content;
    }

    for (const toolCall of response.tool_calls) {
      const selectedTool = lobstrTools.find((t) => t.name === toolCall.name);
      try {
        // Guard against the model requesting a tool that was never bound
        if (!selectedTool) throw new Error(`Unknown tool: ${toolCall.name}`);
        // Execute the tool against Truto's proxy API
        const toolResult = await selectedTool.invoke(toolCall.args);
        messages.push({
          role: "tool",
          name: toolCall.name,
          tool_call_id: toolCall.id,
          content: JSON.stringify(toolResult),
        });
      } catch (error) {
        // Truto passes upstream 429s directly to you. Handle them here.
        if (error.status === 429) {
          // Header values are strings; fall back to 5 seconds if missing or invalid
          const resetTime = Number(error.headers?.get?.("ratelimit-reset")) || 5;
          console.log(`Rate limit hit. Sleeping for ${resetTime} seconds...`);
          await new Promise((resolve) => setTimeout(resolve, resetTime * 1000));
          // Inform the LLM that a rate limit occurred and it should yield/retry
          messages.push({
            role: "tool",
            name: toolCall.name,
            tool_call_id: toolCall.id,
            content: JSON.stringify({ error: "Rate limited. Please wait and try again." }),
          });
        } else {
          // Handle other standard API errors
          messages.push({
            role: "tool",
            name: toolCall.name,
            tool_call_id: toolCall.id,
            content: JSON.stringify({ error: error.message }),
          });
        }
      }
    }
  }
}

By combining the dynamic tool schemas from Truto with a strict system prompt and a resilient framework executor, you transform your LLM from a static text generator into an autonomous agent capable of deploying and managing complex data extraction pipelines.
Elevate Your AI Agent Architecture
Connecting an AI agent to a complex, asynchronous platform like Lobstr requires more than a simple API key. It demands structured tool schemas, explicit dependency mapping, and rigorous error handling. By leveraging Truto's /tools endpoint, you bypass the massive engineering burden of hand-coding JSON schemas, maintaining API wrappers, and normalizing edge cases.
Whether you are building market research agents, autonomous competitive intelligence pipelines, or automated IT cost management tools, Truto provides the abstraction layer necessary to let your agents focus on reasoning, while we handle the execution.
FAQ
- How do AI agents handle Lobstr's asynchronous scraping runs?
- Agents must be provided with tools to start a run and a separate tool to poll for run statistics. You instruct the LLM to check the 'is_done' boolean or 'eta' field before attempting to fetch the final results.
- Does Truto automatically retry Lobstr API calls if the rate limit is hit?
- No. Truto normalizes upstream rate limit information into the standard IETF headers (ratelimit-limit, ratelimit-remaining, ratelimit-reset) and passes HTTP 429 errors directly to the caller. Your agent framework is responsible for handling the backoff and retry logic.
- Do I need to hardcode Lobstr crawler IDs in my agent prompts?
- No. By providing the 'list_all_lobstr_crawlers' tool, your agent can dynamically search for the correct crawler based on the user's intent, retrieve its ID, and use that ID to initialize the scraping Squid.
- Can I use Truto's Lobstr tools with LangChain?
- Yes. Truto's tools are framework-agnostic. You can fetch the OpenAPI definitions from the /tools endpoint and bind them directly using framework-native methods like .bindTools() in LangChain or LangGraph.