GroundedResearchEngine — reference

When a chat prompt asks for biographical or factual content about real people, organizations, or current events, Midcore routes the request through a real public-source web search. Every claim carries a citation; insufficient evidence produces a refusal, not a fabrication; ambiguous names produce a disambiguation list, not a guess. This page documents the engine, its backends, the routing gate, and the anti-fabrication invariants we publicly commit to.

Why this exists

Free-generated biographical content from training-data memory is a category of model failure that produces defamation and misinformation. We treat regressions here as security incidents — see /security. The defense runs at four layers:

TopicShiftDetector classifies every new chat prompt; biographical / external prompts get the requires_grounding flag.
GroundedResearchEngine performs the real web search and returns a structured outcome with per-claim citations or a refusal.
An anti-fabrication preamble is injected at every LLM entry point as defense in depth.
Workspace-context gating drops codebase facts from the system prompt when the topic is external — so research answers don't get mixed with workspace data.

Search backends (priority order)

Provider	Env var	Notes
Brave Search	`BRAVE_SEARCH_API_KEY`	Recommended. Fast, well-cited, generous free tier.
Tavily	`TAVILY_API_KEY`	Good fallback. Search-depth advanced; result snippets aligned to query.
SearXNG	`SEARXNG_INSTANCE_URL`	Self-hosted, no key. Use for air-gapped or compliance-bound deployments.
DuckDuckGo HTML	`MIDCORE_ALLOW_DDG_HTML=1`	Last-resort dev fallback. Brittle (HTML scrape).

The engine tries the first configured backend in the order above; on a hard failure (network error or 5xx response), it falls back to the next configured backend. If none are configured, the engine refuses rather than falling back to LLM memory.
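The fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the engine's code: the `HardFailure` exception, the `search_fns` mapping, the returned dict shape, and the provider names other than `brave` are assumptions made for the example. Only the env var names and the priority order come from the table.

```python
from typing import Callable, Dict, List, Mapping, Optional, Tuple

# (provider name, env var, required value or None for "any non-empty value"),
# in the priority order of the table above. Names other than "brave" are assumed.
BACKENDS: List[Tuple[str, str, Optional[str]]] = [
    ("brave", "BRAVE_SEARCH_API_KEY", None),
    ("tavily", "TAVILY_API_KEY", None),
    ("searxng", "SEARXNG_INSTANCE_URL", None),
    ("ddg_html", "MIDCORE_ALLOW_DDG_HTML", "1"),
]


class HardFailure(Exception):
    """Hypothetical marker for a network error or a 5xx response."""


def configured_backends(env: Mapping[str, str]) -> List[str]:
    """Return the enabled provider names in priority order."""
    enabled = []
    for name, var, required in BACKENDS:
        value = env.get(var, "")
        if (value == required) if required is not None else bool(value):
            enabled.append(name)
    return enabled


def search_with_fallback(
    query: str,
    env: Mapping[str, str],
    search_fns: Dict[str, Callable[[str], list]],
) -> dict:
    """Try each configured backend in order; refuse instead of guessing."""
    chain = configured_backends(env)
    if not chain:
        return {"ok": False, "refusal_reason": "No backend configured", "provider_used": None}
    last_error = ""
    for name in chain:
        try:
            hits = search_fns[name](query)
        except HardFailure as exc:
            # Hard failure: remember it and move on to the next backend.
            last_error = f"{name}: {exc}"
            continue
        return {"ok": True, "hits": hits, "provider_used": name}
    return {
        "ok": False,
        "refusal_reason": f"Provider hard failure (tried {', '.join(chain)}; last error: {last_error})",
        "provider_used": None,
    }
```

In a real deployment `env` would typically be `os.environ`. The key property is that neither failure branch produces content: both return a refusal.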

Outcome shape

Every call returns a ResearchOutcome:

{
  "ok": true | false,
  "query": "report on Marie Curie",
  "summary": "Public-source summary for Marie Curie. Drew from 3 distinct domain(s) and 7 snippet(s). Provider: brave.",
  "sections": [
    { "heading": "en.wikipedia.org", "claims": [ { "text": "...", "citation_index": 0 }, ... ] }
  ],
  "citations": [
    {
      "url": "https://en.wikipedia.org/wiki/Marie_Curie",
      "title": "Wikipedia — Marie Curie",
      "snippet": "Marie Curie discovered radium and polonium...",
      "retrieved_at": "2026-06-01T16:42:11+00:00",
      "snippet_hash": "a3f0c2b1d4e5f607",
      "rank": 1,
      "domain": "en.wikipedia.org"
    }
  ],
  "unverified_notes": [],
  "disambiguation_candidates": [],
  "refusal_reason": null,
  "coverage_score": 0.78,
  "provider_used": "brave"
}

snippet_hash is the first 16 hex characters of the SHA-256 digest of the snippet text. Audits use it to detect silent rewrites of a quoted citation: identical text always produces the same hash, and any change to the text produces a different one.
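A minimal sketch of how an auditor might recompute and check the hash. It assumes the snippet is encoded as UTF-8 with no normalization before hashing; the page does not specify the encoding, so treat that as an assumption. The function names are illustrative.

```python
import hashlib


def snippet_hash(snippet: str) -> str:
    """First 16 hex characters of the SHA-256 digest (UTF-8 encoding assumed)."""
    return hashlib.sha256(snippet.encode("utf-8")).hexdigest()[:16]


def citation_is_intact(citation: dict) -> bool:
    """True if the stored snippet still matches its recorded snippet_hash."""
    return snippet_hash(citation["snippet"]) == citation["snippet_hash"]
```

An audit would call `citation_is_intact` on each entry of `citations` and flag any entry that returns `False`.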

When the engine refuses

Refusal returns ok: false with a refusal_reason the chat surfaces verbatim. The four documented refusal modes:

No backend configured: none of BRAVE_SEARCH_API_KEY / TAVILY_API_KEY / SEARXNG_INSTANCE_URL is set. The agent tells the user how to fix it.
Insufficient evidence: coverage_score < 0.15. The agent asks for more context (employer, role, location) rather than guessing.
Name is ambiguous: the top hits describe distinct people who share a name. The agent returns a candidate list with URL, title, and domain so the user can pick.
Provider hard failure: all configured providers returned 5xx or network errors. The agent surfaces the provider chain and the last error.
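On the consuming side, the refusal modes above might be rendered along these lines. This is a sketch only: `render_outcome` is a hypothetical helper, and the candidate keys `url`, `title`, and `domain` are assumed from the description of the candidate list rather than taken from a documented schema.

```python
def render_outcome(outcome: dict) -> str:
    """Turn a ResearchOutcome-like dict into the text shown to the user."""
    if outcome["ok"]:
        return outcome["summary"]
    # Refusals are surfaced verbatim, never paraphrased or filled in.
    lines = [outcome["refusal_reason"]]
    # For ambiguous names, list the candidates so the user can choose one.
    for candidate in outcome.get("disambiguation_candidates", []):
        lines.append(
            f"- {candidate['title']} ({candidate['domain']}): {candidate['url']}"
        )
    return "\n".join(lines)
```

The refusal branch never consults anything beyond the outcome itself, which keeps the caller from filling the gap with generated text.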

HTTP surface

Two paths touch the engine: the chat agent's grounded_research tool (the most common path) and direct invocation through the research route, whose grounding gate fires automatically when the prompt classifies as biographical. The corresponding endpoints:

POST /api/v1/autonomy/research/complete — any research-grade prompt is routed through the engine before any LLM call.
POST /api/v1/agent/stream — the LLM agent calls the grounded_research tool; the dispatcher resolves and returns the outcome inside a tool_result chunk.

Anti-fabrication guarantee

We commit publicly to the following invariants. Any reproducer that breaks one is treated as a security incident:

No path in grounded_research_engine.py calls an LLM provider. Composition is deterministic — claims come from search-engine and page-extract snippets, never free-generated.
The grounded_research tool description in TOOL_SCHEMAS tells the LLM to never free-generate biographical content and to use this tool instead.
The _ANTI_FABRICATION_PREAMBLE is prepended at three LLM entry points (research route, generic complete endpoint, agent runtime).
Workspace-context gating drops codebase facts when the topic is external — so a research answer never gets workspace data mixed in.
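Because every claim is meant to trace back to a retrieved snippet, a consumer can spot-check an outcome for claims that lack a valid citation. The sketch below assumes that `citation_index` is a zero-based index into `citations`, which matches the sample outcome but is not stated explicitly on this page; `uncited_claims` is a hypothetical helper name.

```python
from typing import List


def uncited_claims(outcome: dict) -> List[str]:
    """Return the text of every claim whose citation_index does not resolve."""
    citation_count = len(outcome.get("citations", []))
    missing = []
    for section in outcome.get("sections", []):
        for claim in section.get("claims", []):
            index = claim.get("citation_index")
            # bool is a subclass of int in Python, so exclude it explicitly.
            is_int = isinstance(index, int) and not isinstance(index, bool)
            if not is_int or not 0 <= index < citation_count:
                missing.append(claim.get("text", ""))
    return missing
```

An empty list means every claim in the outcome points at an existing citation; anything else is a candidate reproducer to report.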