Site crawler & knowledge base
A Playwright-based crawler runs against your site to build the knowledge base that lets the agent answer factual questions inline without navigating the visitor anywhere.
Trigger
- Manual — Dashboard → Site detail → Start crawl (recommended after content updates).
- On install — the first crawl is auto-triggered when you add a site.
Scheduled recurring crawls are on the roadmap; today you re-run manually when your content changes.
What it does
BFS walk starting from the site root. For each page:
- Loads the page in a headless Chromium with `domcontentloaded` + a 1s settle wait so JS-rendered content finishes.
- Extracts:
  - `<title>`
  - `<meta name="description">`
  - All headings (h1–h3) with text + level
  - Visible text body (de-dup’d, scripts/styles stripped)
  - All in-domain links (for the BFS frontier)
  - Typography signals (heading-font, body-font, button-radius detected via computed styles)
  - Dominant accent color sampled from the page palette
- Stores per-page in Firestore under `sites/<siteId>/pages/<docId>`.
- Aggregates site-level signals: detected primary color, button radius, font stack.
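The walk described above can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: `crawl_bfs` and `get_links` are hypothetical names, `get_links` stands in for the headless Chromium load-and-extract step, and applying the page cap to visited pages (rather than discovered ones) is an assumption.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlsplit


def crawl_bfs(
    root: str,
    get_links: Callable[[str], Iterable[str]],
    page_cap: int,
) -> list[str]:
    """Return URLs in the order a breadth-first walk from `root` visits them."""
    root_host = urlsplit(root).netloc
    queue = deque([root])   # BFS frontier: discovered but not yet visited
    seen = {root}           # every URL ever enqueued, to avoid revisits
    visited: list[str] = []

    while queue and len(visited) < page_cap:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):  # stand-in for the page load + extraction
            # Resolve relative links and drop #fragments before comparing.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlsplit(absolute).netloc != root_host:
                continue  # out-of-domain links never join the frontier
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Because there is no depth cap, the only thing that stops the loop is an empty frontier or the page cap, which matches the Limits table.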
Limits
| Setting | Default |
|---|---|
| Page cap (Free) | 50 pages |
| Page cap (Pro) | 500 pages |
| Page cap (Enterprise) | 5,000 pages |
| Depth cap | None — bounded only by the page cap above. |
| Per-page text limit | Truncated to fit within Firestore’s 1 MiB document size limit. |
| User-Agent | Mozilla/5.0 (compatible; LeFluxCrawler/1.0; +https://leflux.xrlabs.app) |
| Concurrency | 1 (sequential). Polite — we don’t burn your host’s bandwidth. |
| Heartbeat | Every 8s. Stale jobs (no heartbeat > 90s) are auto-reaped. |
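To make the heartbeat row concrete, here is a minimal sketch of stale-job detection using the 8s interval and 90s threshold from the table. The function names and the shape of `jobs` (job id mapped to last heartbeat timestamp, in seconds) are assumptions for illustration, not the service's real data model.

```python
HEARTBEAT_INTERVAL_S = 8   # how often a running crawl reports in
STALE_AFTER_S = 90         # silence longer than this marks a job as stale


def is_stale(last_heartbeat_s: float, now_s: float,
             stale_after_s: float = STALE_AFTER_S) -> bool:
    """True when the job has been silent for longer than the threshold."""
    return (now_s - last_heartbeat_s) > stale_after_s


def reap_stale(jobs: dict[str, float], now_s: float) -> list[str]:
    """Return the ids of jobs that should be reaped, in sorted order."""
    return sorted(job_id for job_id, beat in jobs.items()
                  if is_stale(beat, now_s))
```

With these numbers a healthy job can miss several consecutive heartbeats before it is considered dead, so a brief stall does not kill a crawl.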
What gets ignored
- URLs that 4xx / 5xx.
- Non-HTML responses (PDFs, images, JSON, video).
- URLs outside your registered domain (cross-origin links).
Note: today’s crawler does NOT consult robots.txt or noindex meta tags. Pages reachable from your root via in-domain links are eligible to be crawled up to the plan cap. Per-path exclusions (an ignored-paths list in Settings → Crawler) are planned but not yet available; track this issue for status.
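The three ignore rules can be expressed as one predicate. This is a sketch under assumptions: `should_store` is a hypothetical name, only `text/html` is treated as HTML, and "registered domain" is read as an exact host match (the real crawler may treat subdomains differently).

```python
from urllib.parse import urlsplit


def should_store(url: str, status: int, content_type: str,
                 registered_host: str) -> bool:
    """Apply the ignore rules: error status, non-HTML, out-of-domain."""
    if status >= 400:
        return False  # 4xx / 5xx responses are ignored
    # Strip parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "text/html":
        return False  # PDFs, images, JSON, video, ...
    # Assumption: exact host match against the registered domain.
    return urlsplit(url).hostname == registered_host
```

Note that nothing here reads robots.txt or noindex tags, consistent with the note above.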
Live progress
While a crawl runs, the dashboard shows:
- Pages discovered (BFS frontier size)
- Pages walked (completed)
- Current URL
- Recent text excerpts
The crawler emits a heartbeat every 8s, so a stalled run is detected and reaped once it has gone more than 90s without one (no zombie “scanning…” status forever).
Re-crawl
Re-running replaces the previous crawl. Pages no longer reachable from the root are pruned. URLs whose hash changed are re-extracted. URLs whose hash is identical are skipped (no wasted work).
To force a full re-extraction (e.g. after a major redesign), delete the site and re-create it; this gives you the cleanest state.
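The prune / re-extract / skip rules for a normal re-crawl can be sketched as a diff between two URL-to-hash maps. What exactly is hashed and which algorithm is used are not stated here, so SHA-256 over the page content is an assumption, as are the names `content_hash` and `plan_recrawl`.

```python
import hashlib


def content_hash(text: str) -> str:
    """Assumed hashing scheme: SHA-256 of the page content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_recrawl(previous: dict[str, str],
                 current: dict[str, str]) -> dict[str, list[str]]:
    """Compare URL -> hash maps from the last crawl and the new walk."""
    return {
        # Stored last time but no longer reachable from the root.
        "prune": sorted(u for u in previous if u not in current),
        # New URLs, or URLs whose hash changed.
        "extract": sorted(u for u, h in current.items()
                          if previous.get(u) != h),
        # Hash identical to last time: no work needed.
        "skip": sorted(u for u, h in current.items()
                       if previous.get(u) == h),
    }
```

Deleting and re-creating the site amounts to running this with an empty `previous` map, so every URL lands in `extract`.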
How the LLM uses it
On every visitor turn the server:
- Ranks crawled pages by relevance to the visitor’s latest message (field-weighted token scoring against title + headings + body text).
- Injects the top-3 most relevant pages’ summaries, headings, notes, AND raw body text excerpts (up to 2500 chars each) as a `# Cross-page knowledge` block at the top of the prompt.
- Extracts emails + phone numbers from those pages and prepends a `## Site contacts` block.
- Always emits the full Sitemap (every crawled path + title) so the agent can reference any page by name.
For info questions (“what’s your phone number?”, “what does plan X include?”, “show me your team”) the agent answers from this block — no navigation. For navigational requests it pulls the URL verbatim from the Sitemap.
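As an illustration of the ranking and contact steps above, here is a minimal sketch of field-weighted token scoring with a top-3 cut and email extraction. The weights, the tokenizer, the email pattern, and all names (`Page`, `score`, `top_pages`, `extract_emails`) are assumptions; the server's actual scoring is not documented here.

```python
import re
from dataclasses import dataclass

# Assumed weights: a match in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "headings": 2.0, "body": 1.0}
TOP_K = 3

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass
class Page:
    path: str
    title: str
    headings: str
    body: str


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def score(page: Page, query: str) -> float:
    """Sum weighted counts of query tokens found in each field."""
    query_tokens = set(tokenize(query))
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        hits = sum(1 for tok in tokenize(getattr(page, field))
                   if tok in query_tokens)
        total += weight * hits
    return total


def top_pages(pages: list[Page], query: str, k: int = TOP_K) -> list[Page]:
    """Highest-scoring pages first; ties keep their original order."""
    return sorted(pages, key=lambda p: score(p, query), reverse=True)[:k]


def extract_emails(pages: list[Page]) -> list[str]:
    """Unique email addresses from page bodies, in first-seen order."""
    found: list[str] = []
    for page in pages:
        for email in EMAIL_RE.findall(page.body):
            if email not in found:
                found.append(email)
    return found
```

A real implementation would likely also drop stop words such as "the" and extract phone numbers; both are left out to keep the sketch short.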
Inspecting the crawl
Site detail → Site Scan lists every crawled page with its title, URL, and a preview of the extracted text. Open one to see the full body text + headings + links — exactly what the LLM sees.
If a question isn’t being answered well, check whether the source page was crawled and contains the relevant content. Often the fix is a re-crawl after a content update, or a higher page cap if your site has more useful content than your plan allows (50 pages on Free).
Privacy
The crawler only fetches pages reachable from your site root, the same way a search engine does. It doesn’t access auth-gated URLs (no cookie store, no login flow) and doesn’t bypass any access control. Content is stored in Firestore under your tenant’s site doc; there is no cross-tenant data sharing.
If your site has a robots.txt you want enforced, the recommended approach today is to keep sensitive content behind auth (the crawler can’t reach those pages). Per-path exclusions in the dashboard are on the roadmap.