Core concepts
This page defines the ideas the rest of the documentation builds on: the Knowledge Node, the knowledge graph, provenance and confidence, the deterministic-first principle, trust tiers, and the exporter-module pattern.
The Knowledge Node
A Knowledge Node is the published, AI-consumable representation of one website. It is a set of static files (the node files) served from a CDN at a stable, versioned location. AI agents, LLMs, crawlers, and MCP clients read the node directly instead of scraping HTML. The node is regenerated and republished whenever the site's content changes.
The node is not the source of truth. It is a projection of the internal knowledge graph, which is the durable asset.
The knowledge graph
Every ingested content item is normalized into a typed knowledge graph stored in PostgreSQL. The graph is what makes bAInquet more than a file converter: it deduplicates, merges, resolves relationships, and tracks provenance and confidence across all of a site's content. Output formats come and go as exporter modules; the graph persists.
The graph is made of these record types. Each is a PostgreSQL table, and each row carries a website_id for tenant scoping.
| Concept | Table | What it is |
|---|---|---|
| Source | source | A provenance record for one ingested URL: url, checksum, last-seen, content type, and a derived trust score. Every exported record points back to a source. |
| Entity | entity | A thing the site is about: a Product, Person, Place, Organization, FAQ, and so on. Carries an open attributes JSONB bag, confidence, and language. |
| Fact | fact | A typed subject-predicate-value statement, for example "Single-Origin Ethiopia / price / 16.99 EUR". The value stays typed (number, string, bool, object) with optional unit and currency and a pre-rendered display string. |
| Relationship | relationship_edge | A typed edge between two entities, for example PRODUCED_BY or LOCATED_IN. Carries a resolutionStatus of resolved or dangling. |
| AIChunk | ai_chunk | A retrieval-sized text chunk with a stable id, referenced entity ids, a token estimate, and (when enabled) a pgvector embedding. |
| SyntheticQA | synthetic_qa | A generated question/answer pair derived from facts, with derivedFromFactIds provenance and a stale flag. |
The graph is traversed with recursive SQL (CTE) behind a GraphStore interface. There is no separate graph database.
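To make the Fact row in the table above more concrete, here is a minimal sketch of a fact that keeps its value typed while also carrying a display string. The field names (`subjectEntityId`, `display`, and so on), the example ids, and the `renderDisplay` helper are assumptions made for illustration; they are not the actual column names or API.

```typescript
// Illustrative only: these field names are assumptions, not the real schema.
type FactValue = number | string | boolean | Record<string, unknown>;

interface FactSketch {
  id: string; // stable exported id with the fact_ prefix
  subjectEntityId: string;
  predicate: string;
  value: FactValue; // stays typed; never collapsed to a string
  unit?: string;
  currency?: string;
  display: string; // pre-rendered for consumers that only need text
}

// Hypothetical renderer: joins the value with its unit and currency, if any.
function renderDisplay(value: FactValue, unit?: string, currency?: string): string {
  const base = typeof value === "object" ? JSON.stringify(value) : String(value);
  return [base, unit, currency].filter((part) => part !== undefined).join(" ");
}

// The price example from the table, with made-up ids.
const price: FactSketch = {
  id: "fact_example_price",
  subjectEntityId: "entity_example_coffee",
  predicate: "price",
  value: 16.99,
  currency: "EUR",
  display: renderDisplay(16.99, undefined, "EUR"),
};
```

Keeping `value` numeric lets a consumer compare or sum prices, while `display` covers the case where only text is needed.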
Provenance and the confidence ladder
Every derived entity, fact, and relationship records how it was extracted and how much to trust it. The deterministic extractors run in a fixed priority order, each stamping a provenance_method and a confidence score.
| Provenance method | Confidence | Source of the value |
|---|---|---|
| cms_field | 0.98 | Structured CMS fields supplied in the item json bag (priority 1) |
| schema_org | 0.95 | schema.org JSON-LD on the page (priority 2) |
| seo_meta | 0.85 | OpenGraph / SEO meta tags (priority 3) |
| text_extraction | 0.75 | Deterministic extraction from plain text (priority 4) |
| ai_enhanced | 0.60 to 0.90 | Model-scored, advanced tier only; never on the free tier |
INFO
The lower-confidence model-scored method is recorded as ai_enhanced in the published provenance so a consumer can separate AI-derived values from deterministic ones. See Free vs AI-enhanced.
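The deterministic part of the ladder can be sketched as an ordered list of extractors, each stamping its method and confidence on whatever it finds. This is a toy model: the `PageInput` shape, the `read` functions, and the key-value text parser are all assumptions for illustration, and real extractors are certainly more involved. Only the method names, confidences, and priority order come from the table above.

```typescript
// Hypothetical input: one already-fetched item, split by where values came from.
interface PageInput {
  cmsFields?: Record<string, string>;
  jsonLd?: Record<string, string>;
  meta?: Record<string, string>;
  text?: string;
}

interface StampedValue {
  predicate: string;
  value: string;
  provenance_method: string;
  confidence: number;
}

interface LadderStep {
  method: string;
  confidence: number;
  read(page: PageInput): Record<string, string> | undefined;
}

// Toy text extractor (an assumption): picks up "key: value" lines only.
function parseKeyValueLines(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const line of text.split("\n")) {
    const m = /^(\w+):\s*(.+)$/.exec(line.trim());
    if (m && m[1] !== undefined && m[2] !== undefined) out[m[1]] = m[2];
  }
  return out;
}

// Fixed priority order, highest confidence first, as in the table.
const LADDER: LadderStep[] = [
  { method: "cms_field", confidence: 0.98, read: (p) => p.cmsFields },
  { method: "schema_org", confidence: 0.95, read: (p) => p.jsonLd },
  { method: "seo_meta", confidence: 0.85, read: (p) => p.meta },
  {
    method: "text_extraction",
    confidence: 0.75,
    read: (p) => (p.text === undefined ? undefined : parseKeyValueLines(p.text)),
  },
];

// Runs every step in order and stamps each value it yields.
function extractAll(page: PageInput): StampedValue[] {
  const out: StampedValue[] = [];
  for (const step of LADDER) {
    const fields = step.read(page) ?? {};
    for (const predicate of Object.keys(fields)) {
      const value = fields[predicate];
      if (value === undefined) continue;
      out.push({ predicate, value, provenance_method: step.method, confidence: step.confidence });
    }
  }
  return out;
}
```

No step calls a model, so the same input always produces the same stamped values in the same order.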
The merge policy
When the same (subject, predicate) arrives more than once, the merge policy keeps the fact with the highest-precedence provenance method. The rank is cms_field > schema_org > seo_meta > text_extraction > model-inferred (recorded as ai_enhanced). Ties break on higher confidence. The superseded fact is recorded in history. A pinned fact is exempt from auto-merge demotion, and a hidden fact is never emitted by any exporter.
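The merge rule can be sketched as a comparison of two candidates for the same (subject, predicate). This is a simplified illustration, not the actual implementation: the `CandidateFact` shape and the `mergeFacts` name are hypothetical, `ai_enhanced` stands in for the model-inferred method, and keeping the current fact on an exact tie is an assumption the text does not spell out.

```typescript
// Precedence from the text; a lower index wins.
const PRECEDENCE = ["cms_field", "schema_org", "seo_meta", "text_extraction", "ai_enhanced"] as const;
type ProvenanceMethod = (typeof PRECEDENCE)[number];

// Hypothetical candidate shape for one (subject, predicate).
interface CandidateFact {
  subject: string;
  predicate: string;
  value: unknown;
  method: ProvenanceMethod;
  confidence: number;
  pinned?: boolean;
}

interface MergeResult {
  winner: CandidateFact;
  superseded: CandidateFact | null; // what would be written to history
}

function mergeFacts(current: CandidateFact, incoming: CandidateFact): MergeResult {
  // A pinned fact is exempt from auto-merge demotion.
  if (current.pinned) return { winner: current, superseded: null };

  const rankCurrent = PRECEDENCE.indexOf(current.method);
  const rankIncoming = PRECEDENCE.indexOf(incoming.method);

  // Precedence decides first; confidence only breaks ties within one method.
  const incomingWins =
    rankIncoming < rankCurrent ||
    (rankIncoming === rankCurrent && incoming.confidence > current.confidence);

  return incomingWins
    ? { winner: incoming, superseded: current }
    : { winner: current, superseded: null };
}
```

Note that precedence outranks confidence: an `ai_enhanced` value scored 0.90 still loses to a `seo_meta` value at 0.85.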
Deterministic-first, zero-LLM free tier
Extraction is deterministic by default. The four deterministic extractors use no LLM. The free tier triggers zero LLM work anywhere in the pipeline, and a free-tier job that attempts LLM work is a contract violation rejected at enqueue. This makes free-tier output fully reproducible and auditable: given identical graph content, exporter output is byte-identical.
LLM enrichment, where it exists, is opt-in, paid (advanced tier), budgeted, and provenance-tagged with the lower confidence. It never blocks the deterministic path.
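The enqueue-time check described above could look roughly like the following. It is a sketch under stated assumptions: the `JobRequest` shape, the `usesLlm` flag, and the `enqueue` function are hypothetical names chosen for illustration, and the real queue is not an in-memory array.

```typescript
type Tier = "free" | "advanced";

// Hypothetical job description: each step declares whether it needs a model.
interface JobRequest {
  websiteId: string;
  tier: Tier;
  steps: { name: string; usesLlm: boolean }[];
}

// Rejects a free-tier job that asks for any LLM work before it is queued.
function enqueue(queue: JobRequest[], job: JobRequest): void {
  const llmSteps = job.steps.filter((step) => step.usesLlm);
  if (job.tier === "free" && llmSteps.length > 0) {
    const names = llmSteps.map((step) => step.name).join(", ");
    throw new Error("contract violation: free-tier job requests LLM steps: " + names);
  }
  queue.push(job);
}
```

Rejecting at enqueue, rather than failing inside a worker, means a free-tier job can never reach a code path that calls a model.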
A prompt-injection guard neutralizes hostile instructions found in source content and tags them as untrusted without ever writing that tag into a persisted provenance column.
Trust tiers
A website must prove ownership before its node is trusted. Verification has a state machine (pending, verified, failed, grace, revoked) and a method that determines a trust tier. The tier ranking, highest to lowest, is:
- dns_txt (highest)
- plugin_signed
- well_known
- meta_tag
- manual (reviewed)
The tier feeds the Source.trust score, is published in trust.json, and is mirrored into manifest.json.trust_tier, so an AI consumer can weight a node by how its domain ownership was proven. See Verifying ownership.
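On the consumer side, the tier ranking can be used to order nodes by how strongly ownership was proven. The sketch below assumes a consumer has already read `trust_tier` from each node's manifest.json; the `ManifestSketch` shape, its `site` field, and `sortByTrust` are hypothetical and not part of any published client.

```typescript
// Tier ranking from the text, highest trust first.
const TRUST_TIERS = ["dns_txt", "plugin_signed", "well_known", "meta_tag", "manual"] as const;
type TrustTier = (typeof TRUST_TIERS)[number];

// Hypothetical slice of a manifest; only trust_tier is named in the text.
interface ManifestSketch {
  site: string;
  trust_tier: TrustTier;
}

// Returns a new array ordered from most to least trusted.
function sortByTrust(manifests: ManifestSketch[]): ManifestSketch[] {
  return [...manifests].sort(
    (a, b) => TRUST_TIERS.indexOf(a.trust_tier) - TRUST_TIERS.indexOf(b.trust_tier),
  );
}
```

How much weight each tier deserves is left to the consumer; this only fixes the relative order.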
The exporter-module pattern
Every output format is a self-registering module implementing the Exporter interface. Each module declares its name, version, tier (free or advanced), the output file paths it produces, a pure supports(ctx) gate, and a streaming generate(ctx) that yields NDJSON lines or file segments.
- Streaming only. No module materializes the whole graph; it reads via a paging GraphReader.
- Self-registration. Modules register into a registry at import; the publish pipeline runs registry.all().filter(e => e.supports(ctx)).
- Deterministic order. The registry emits in a fixed (phase, name) order so manifest.json is written last and references only files already produced.
- Adding a format equals adding a module. The pipeline core never changes.
See Node files for which exporter modules are built today.
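The exporter-module pattern described above can be sketched in a few dozen lines. Treat this as an illustration under assumptions rather than the real interface: the exact field names, the numeric `phase`, the `readEntities` stand-in for the paging GraphReader, and the three example modules are all hypothetical. The demo collects lines into arrays so the result is easy to inspect; a real pipeline would stream them to storage instead.

```typescript
type ExporterTier = "free" | "advanced";
type EntityRow = { id: string; name: string };

interface ExportContext {
  tier: ExporterTier;
  // Stand-in for the paging GraphReader: yields one page of rows at a time.
  readEntities(pageSize: number): Iterable<EntityRow[]>;
}

interface Exporter {
  name: string;
  version: string;
  tier: ExporterTier;
  phase: number; // assumed numeric; later phases run last
  outputPaths: string[];
  supports(ctx: ExportContext): boolean; // pure gate
  generate(ctx: ExportContext): Iterable<string>; // streams lines
}

class ExporterRegistry {
  private exporters: Exporter[] = [];
  register(exporter: Exporter): void {
    this.exporters.push(exporter);
  }
  // Fixed (phase, name) order, independent of registration order.
  all(): Exporter[] {
    return [...this.exporters].sort(
      (a, b) => a.phase - b.phase || (a.name < b.name ? -1 : a.name > b.name ? 1 : 0),
    );
  }
}

const registry = new ExporterRegistry();

// In a real codebase each module would register itself when imported.
registry.register({
  name: "manifest", version: "1", tier: "free", phase: 9, outputPaths: ["manifest.json"],
  supports: () => true,
  *generate() { yield JSON.stringify({ files: ["entities.jsonl"] }); },
});
registry.register({
  name: "entities", version: "1", tier: "free", phase: 1, outputPaths: ["entities.jsonl"],
  supports: () => true,
  *generate(ctx: ExportContext) {
    for (const page of ctx.readEntities(2)) for (const row of page) yield JSON.stringify(row);
  },
});
registry.register({
  name: "enriched", version: "1", tier: "advanced", phase: 2, outputPaths: ["enriched.jsonl"],
  supports: (ctx: ExportContext) => ctx.tier === "advanced",
  *generate() { yield "{}"; },
});

function publish(ctx: ExportContext): { name: string; paths: string[]; lines: string[] }[] {
  return registry
    .all()
    .filter((e) => e.supports(ctx))
    .map((e) => ({ name: e.name, paths: e.outputPaths, lines: Array.from(e.generate(ctx)) }));
}

const demoRows: EntityRow[] = [
  { id: "entity_a", name: "A" }, { id: "entity_b", name: "B" }, { id: "entity_c", name: "C" },
];
const freeCtx: ExportContext = {
  tier: "free",
  *readEntities(pageSize: number) {
    for (let i = 0; i < demoRows.length; i += pageSize) yield demoRows.slice(i, i + pageSize);
  },
};
```

Calling `publish(freeCtx)` runs the entities module first and the manifest module last, and skips the advanced-tier module because its `supports` gate returns false. Adding a fourth format would mean one more `register` call and no change to `publish`.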
Stable ids vs uuids
Internally every row has a PostgreSQL uuid primary key. uuids never appear in node files. Exported records instead carry a namespaced string id that stays the same across versions for the same logical record:
| Prefix | Record |
|---|---|
| entity_ | entity |
| fact_ | fact |
| rel_ | relationship |
| chunk_ | AIChunk |
| qa_ | synthetic QA |
| src_ | source |
Cross-references inside the node use the stable id, so a reference like facts.jsonl.subject_entity_id to entities.jsonl.id stays valid across versions.
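A consumer can use the prefixes to classify ids and to check that cross-references resolve. The sketch below is a hypothetical consumer-side helper, not part of the product: it assumes each line of facts.jsonl has an `id` and a `subject_entity_id`, and each line of entities.jsonl has an `id`. The example ids are made up.

```typescript
// Prefix table from the text.
const ID_PREFIXES = {
  entity: "entity_",
  fact: "fact_",
  relationship: "rel_",
  chunk: "chunk_",
  qa: "qa_",
  source: "src_",
} as const;
type RecordKind = keyof typeof ID_PREFIXES;

// Classifies a stable id by prefix; returns undefined for anything else.
function kindOf(id: string): RecordKind | undefined {
  return (Object.keys(ID_PREFIXES) as RecordKind[]).find((kind) =>
    id.startsWith(ID_PREFIXES[kind]),
  );
}

// Parses NDJSON text into objects, skipping blank lines.
function parseNdjson<T>(text: string): T[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as T);
}

// Returns the ids of facts whose subject does not resolve to a known entity.
function danglingFactIds(entitiesJsonl: string, factsJsonl: string): string[] {
  const entityIds = new Set(parseNdjson<{ id: string }>(entitiesJsonl).map((e) => e.id));
  return parseNdjson<{ id: string; subject_entity_id: string }>(factsJsonl)
    .filter((fact) => !entityIds.has(fact.subject_entity_id))
    .map((fact) => fact.id);
}
```

Because the ids are stable across versions, the same check can also be run between an older facts.jsonl and a newer entities.jsonl.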
Idempotency and versioning
- Ingestion is idempotent. A (website, item.id, checksum) already seen is acked as skipped without reprocessing; a repeated Idempotency-Key replays the cached response. See Ingestion and signing.
- Publishing is versioned. Each publish writes an immutable versions/{ulid}/ directory and atomically flips a mutable latest/ pointer last. The ULID is the monotonic ordering key; the human label is display-only.
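The two ingestion rules above, skipping an already-seen (website, item.id, checksum) and replaying the response for a repeated Idempotency-Key, can be modeled with two in-memory lookups. This is a sketch with assumed names: `IngestGate`, the `accepted` status, and the item field names are hypothetical, and a real service would persist both lookups rather than hold them in memory.

```typescript
interface IngestItem {
  websiteId: string;
  itemId: string;
  checksum: string;
}

// "skipped" comes from the text; "accepted" is an assumed name for the other case.
interface IngestAck {
  status: "accepted" | "skipped";
}

class IngestGate {
  private seen = new Set<string>();
  private responses = new Map<string, IngestAck>();

  ingest(item: IngestItem, idempotencyKey?: string): IngestAck {
    // A repeated Idempotency-Key replays the cached response unchanged.
    if (idempotencyKey !== undefined) {
      const cached = this.responses.get(idempotencyKey);
      if (cached !== undefined) return cached;
    }

    // The triple identifies content that was already processed.
    const fingerprint = JSON.stringify([item.websiteId, item.itemId, item.checksum]);
    const ack: IngestAck = this.seen.has(fingerprint) ? { status: "skipped" } : { status: "accepted" };
    this.seen.add(fingerprint);

    if (idempotencyKey !== undefined) this.responses.set(idempotencyKey, ack);
    return ack;
  }
}
```

The two mechanisms answer different questions: the fingerprint asks whether this content was seen before, and the key asks whether this exact request was seen before. A retried request with the same key therefore gets its original answer back, even though its content is by then already known.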
Multi-tenancy
The organization is the tenant root. Every operational and knowledge row scopes back to one organization, directly or via its website. Tenant scoping is enforced in services and queries: every knowledge query filters with WHERE website_id = $1. A cross-tenant read returns 404, so existence is not leaked, and a cross-tenant mutation returns 403.
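The scoping and status-code rules above can be summarized in a small sketch. It simplifies the tenant model to a single website id per caller, and the names (`KnowledgeRow`, `scopedRows`, `statusFor`) are hypothetical. Returning 404 when a mutation targets a row that does not exist at all is an assumption; the text only specifies the cross-tenant cases.

```typescript
interface KnowledgeRow {
  id: string;
  websiteId: string;
}

type Action = "read" | "mutate";

// In-memory equivalent of filtering every query by website_id.
function scopedRows(rows: KnowledgeRow[], websiteId: string): KnowledgeRow[] {
  return rows.filter((row) => row.websiteId === websiteId);
}

// Maps an access attempt to the HTTP status described in the text.
function statusFor(row: KnowledgeRow | undefined, callerWebsiteId: string, action: Action): number {
  if (row === undefined) return 404; // assumption: a missing row is 404 for both actions
  if (row.websiteId !== callerWebsiteId) {
    // A cross-tenant read hides existence; a cross-tenant mutation is forbidden.
    return action === "read" ? 404 : 403;
  }
  return 200;
}
```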