GroundedResearchEngine — reference
When a chat prompt asks for biographical or factual content about real people, organizations, or current events, Midcore routes the request through a real public-source web search. Every claim carries a citation; insufficient evidence produces a refusal, not a fabrication; ambiguous names produce a disambiguation list, not a guess. This page documents the engine, its backends, the routing gate, and the anti-fabrication invariants we publicly commit to.
Why this exists
Free-generated biographical content from training-data memory is a category of model failure that produces defamation and misinformation. We treat regressions here as security incidents — see /security. The defense runs at four layers:
- TopicShiftDetector classifies every new chat prompt; biographical / external prompts get the requires_grounding flag.
- GroundedResearchEngine performs the real web search and returns a structured outcome with per-claim citations or a refusal.
- Anti-fabrication preamble is injected at every LLM entry point as defense-in-depth.
- Workspace-context gating drops codebase facts from the system prompt when the topic is external — so research answers don't get mixed with workspace data.
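The third and fourth layers can be pictured as a small prompt-assembly step. This is a minimal sketch, not the Midcore implementation: `PromptContext`, `build_system_prompt`, and the preamble wording are hypothetical names and text introduced for illustration. Only the requires_grounding flag and the two behaviors (preamble always injected, workspace facts dropped for external topics) come from the list above.

```python
from dataclasses import dataclass, field
from typing import List

# Placeholder wording: the real preamble text is not shown on this page.
ANTI_FABRICATION_PREAMBLE = (
    "Do not free-generate facts about real people; cite sources or refuse."
)


@dataclass
class PromptContext:
    # Set by the topic classifier (layer 1) for biographical / external prompts.
    requires_grounding: bool
    # Codebase facts that would normally be added to the system prompt.
    workspace_facts: List[str] = field(default_factory=list)


def build_system_prompt(base: str, ctx: PromptContext) -> str:
    """Assemble a system prompt, applying layers 3 and 4 (hypothetical helper)."""
    # Layer 3: the preamble is always prepended, whatever the topic.
    parts = [ANTI_FABRICATION_PREAMBLE, base]
    # Layer 4: workspace facts are included only when the topic is not external.
    if not ctx.requires_grounding:
        parts.extend(ctx.workspace_facts)
    return "\n\n".join(parts)
```

With `requires_grounding=True`, the workspace facts never reach the prompt, so a research answer cannot pick them up by accident.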
Search backends (priority order)
| Provider | Env var | Notes |
|---|---|---|
| Brave Search | BRAVE_SEARCH_API_KEY | Recommended. Fast, well-cited, generous free tier. |
| Tavily | TAVILY_API_KEY | Good fallback. Search-depth advanced; result snippets aligned to query. |
| SearXNG | SEARXNG_INSTANCE_URL | Self-hosted, no key. Use for air-gapped or compliance-bound deployments. |
| DuckDuckGo HTML | MIDCORE_ALLOW_DDG_HTML=1 | Last-resort dev fallback. Brittle (HTML scrape). |
The engine tries the first configured backend; on a hard failure (network error or 5xx response), it falls back to the next configured backend. If none are configured, the engine refuses rather than falling back to LLM memory.
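The priority-and-fallback behavior could look roughly like the sketch below. The env var names and their order come from the table; everything else is an assumption made for illustration: the `HardFailure` exception, the `searchers` mapping of provider name to callable, the simplified dict return value, and the choice to count DuckDuckGo as configured only when its variable is exactly "1".

```python
import os
from typing import Callable, Dict, List, Mapping, Optional


class HardFailure(Exception):
    """Network error or 5xx from a provider (hypothetical exception type)."""


# (provider name, enabling env var), in the documented priority order.
PRIORITY = [
    ("brave", "BRAVE_SEARCH_API_KEY"),
    ("tavily", "TAVILY_API_KEY"),
    ("searxng", "SEARXNG_INSTANCE_URL"),
    ("ddg_html", "MIDCORE_ALLOW_DDG_HTML"),
]


def configured_providers(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the enabled providers, highest priority first."""
    env = os.environ if env is None else env
    names = []
    for name, var in PRIORITY:
        value = env.get(var, "")
        # Assumption: DDG is opt-in via the literal "1"; others need any non-empty value.
        enabled = (value == "1") if name == "ddg_html" else bool(value)
        if enabled:
            names.append(name)
    return names


def search_with_fallback(
    query: str,
    searchers: Dict[str, Callable[[str], list]],
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Try each configured provider in order; refuse if none is usable."""
    chain = configured_providers(env)
    if not chain:
        # No backend: refuse instead of answering from model memory.
        return {"ok": False, "provider_used": None,
                "refusal_reason": "No search backend configured."}
    last_error = None
    for name in chain:
        try:
            return {"ok": True, "provider_used": name,
                    "results": searchers[name](query)}
        except HardFailure as exc:
            last_error = exc  # hard failure: move on to the next provider
    return {"ok": False, "provider_used": None,
            "refusal_reason": f"All providers failed ({' -> '.join(chain)}); "
                              f"last error: {last_error}"}
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.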
Outcome shape
Every call returns a ResearchOutcome:
{
"ok": true | false,
"query": "report on Marie Curie",
"summary": "Public-source summary for Marie Curie. Drew from 3 distinct domain(s) and 7 snippet(s). Provider: brave.",
"sections": [
{ "heading": "en.wikipedia.org", "claims": [ { "text": "...", "citation_index": 0 }, ... ] }
],
"citations": [
{
"url": "https://en.wikipedia.org/wiki/Marie_Curie",
"title": "Wikipedia — Marie Curie",
"snippet": "Marie Curie discovered radium and polonium...",
"retrieved_at": "2026-06-01T16:42:11+00:00",
"snippet_hash": "a3f0c2b1d4e5f607",
"rank": 1,
"domain": "en.wikipedia.org"
}
],
"unverified_notes": [],
"disambiguation_candidates": [],
"refusal_reason": null,
"coverage_score": 0.78,
"provider_used": "brave"
}
snippet_hash is a SHA-256 of the snippet text (first 16 hex chars). Audits use it to detect silent rewrites of a quoted citation: same text → same hash; different text → different hash.
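An audit over a ResearchOutcome might recompute each hash and confirm that every claim points at a real citation. The sketch below assumes the snippet is hashed as UTF-8 with no normalization, which this page does not specify; `audit_outcome` is a hypothetical helper, not part of the engine.

```python
import hashlib
from typing import List


def snippet_hash(snippet: str) -> str:
    """First 16 hex chars of the SHA-256 of the snippet (UTF-8 is an assumption)."""
    return hashlib.sha256(snippet.encode("utf-8")).hexdigest()[:16]


def audit_outcome(outcome: dict) -> List[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    citations = outcome.get("citations", [])
    # Check 1: the stored hash still matches the quoted snippet text.
    for i, citation in enumerate(citations):
        if snippet_hash(citation["snippet"]) != citation["snippet_hash"]:
            problems.append(f"citation {i}: snippet does not match snippet_hash")
    # Check 2: every claim carries an index into the citations list.
    for section in outcome.get("sections", []):
        for claim in section["claims"]:
            idx = claim.get("citation_index")
            if not isinstance(idx, int) or not 0 <= idx < len(citations):
                problems.append(f"claim {claim['text']!r}: no valid citation")
    return problems
```

Truncating to 16 hex characters keeps 64 bits of the digest, which is enough to notice an edited snippet but is not meant as a collision-resistant identifier.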
When the engine refuses
Refusal returns ok: false with a refusal_reason the chat surfaces verbatim. The four documented refusal modes:
- No backend configured: none of BRAVE_SEARCH_API_KEY / TAVILY_API_KEY / SEARXNG_INSTANCE_URL is set. The agent tells the user how to fix it.
- Insufficient evidence: coverage_score < 0.15. The agent asks for more context (employer, role, location) rather than guess.
- Name is ambiguous: top hits describe distinct people sharing a name. The agent returns a candidate list with URL + title + domain so the user can pick.
- Provider hard failure: all configured providers returned 5xx or network errors. The agent surfaces the provider chain and the last error.
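A caller consuming the outcome might branch on ok and pass the refusal through unchanged. This is a sketch under assumptions: the field names match the outcome shape on this page, but the candidate keys (`title`, `domain`, `url`) and both helper functions are hypothetical.

```python
COVERAGE_THRESHOLD = 0.15  # documented cutoff for "insufficient evidence"


def has_sufficient_evidence(coverage_score: float) -> bool:
    """True when the score is at or above the documented cutoff."""
    return coverage_score >= COVERAGE_THRESHOLD


def render_outcome(outcome: dict) -> str:
    """Turn a ResearchOutcome-like dict into the text shown to the user."""
    if outcome["ok"]:
        return outcome["summary"]
    # Refusal: the reason is surfaced verbatim, never paraphrased.
    lines = [outcome["refusal_reason"]]
    # Ambiguous names: list the candidates so the user can pick one.
    for cand in outcome.get("disambiguation_candidates", []):
        lines.append(f"- {cand['title']} ({cand['domain']}): {cand['url']}")
    return "\n".join(lines)
```

Keeping the refusal text untouched means the user sees the same string the engine produced, which makes refusals easy to trace in logs.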
HTTP surface
Two paths touch the engine: the chat agent's grounded_research tool (most common), and direct invocation through the research route. The research route's grounding gate fires automatically when the prompt classifies as biographical:
- POST /api/v1/autonomy/research/complete: any research-grade prompt is routed through the engine before any LLM call.
- POST /api/v1/agent/stream: the LLM agent calls the grounded_research tool; the dispatcher resolves the call and returns the outcome inside a tool_result chunk.
Anti-fabrication guarantee
We commit publicly to the following invariants. Any reproducer that breaks one is treated as a security incident:
- No path in grounded_research_engine.py calls an LLM provider. Composition is deterministic: claims come from search-engine and page-extract snippets, never free-generated.
- The grounded_research tool description in TOOL_SCHEMAS tells the LLM to never free-generate biographical content and to use this tool instead.
- The _ANTI_FABRICATION_PREAMBLE is prepended at three LLM entry points (research route, generic complete endpoint, agent runtime).
- Workspace-context gating drops codebase facts when the topic is external, so a research answer never gets workspace data mixed in.