Core concepts

This page defines the ideas the rest of the documentation builds on: the Knowledge Node, the knowledge graph, provenance and confidence, the deterministic-first principle, trust tiers, and the exporter-module pattern.

The Knowledge Node

A Knowledge Node is the published, AI-consumable representation of one website. It is a set of static files (the node files) served from a CDN at a stable, versioned location. AI agents, LLMs, crawlers, and MCP clients read the node directly instead of scraping HTML. The node is regenerated and republished whenever the site's content changes.

The node is not the source of truth. It is a projection of the internal knowledge graph, which is the durable asset.

The knowledge graph

Every ingested content item is normalized into a typed knowledge graph stored in PostgreSQL. The graph is what makes bAInquet more than a file converter: it deduplicates, merges, resolves relationships, and tracks provenance and confidence across all of a site's content. Output formats come and go as exporter modules; the graph persists.

The graph is made of these record types. Each is a PostgreSQL table, and each row carries a website_id for tenant scoping.

Concept	Table	What it is
Source	`source`	A provenance record for one ingested URL: url, checksum, last-seen, content type, and a derived `trust` score. Every exported record points back to a source.
Entity	`entity`	A thing the site is about: a Product, Person, Place, Organization, FAQ, and so on. Carries an open `attributes` JSONB bag, `confidence`, and `language`.
Fact	`fact`	A typed subject-predicate-value statement, for example "Single-Origin Ethiopia / price / 16.99 EUR". The value stays typed (number, string, bool, object) with optional `unit` and `currency` and a pre-rendered `display` string.
Relationship	`relationship_edge`	A typed edge between two entities, for example `PRODUCED_BY` or `LOCATED_IN`. Carries a `resolutionStatus` of `resolved` or `dangling`.
AIChunk	`ai_chunk`	A retrieval-sized text chunk with a stable id, referenced entity ids, a token estimate, and (when enabled) a pgvector embedding.
SyntheticQA	`synthetic_qa`	A generated question/answer pair derived from facts, with `derivedFromFactIds` provenance and a `stale` flag.

The graph is traversed with recursive SQL (CTE) behind a GraphStore interface. There is no separate graph database.

Provenance and the confidence ladder

Every derived entity, fact, and relationship records how it was extracted and how much to trust it. The deterministic extractors run in a fixed priority order, each stamping a provenance_method and a confidence score.

Provenance method	Confidence	Source of the value
`cms_field`	0.98	Structured CMS fields supplied in the item `json` bag (priority 1)
`schema_org`	0.95	schema.org JSON-LD on the page (priority 2)
`seo_meta`	0.85	OpenGraph / SEO meta tags (priority 3)
`text_extraction`	0.75	Deterministic extraction from plain text (priority 4)
`ai_enhanced`	0.60 to 0.90	Model-scored, advanced tier only; never on the free tier

INFO

The lower-confidence model-scored method is recorded as ai_enhanced in the published provenance so a consumer can separate AI-derived values from deterministic ones. See Free vs AI-enhanced.

The merge policy

When the same (subject, predicate) arrives more than once, the merge policy keeps the fact with the highest-precedence provenance method. The precedence is cms_field > schema_org > seo_meta > text_extraction > ai_enhanced (model-inferred). Ties break on higher confidence. The superseded fact is recorded in history. A pinned fact is exempt from auto-merge demotion, and a hidden fact is never emitted by any exporter.
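The merge rule can be illustrated with a minimal sketch. The type and function names (Fact, mergeFacts, exportable) are hypothetical and do not come from the bAInquet codebase. The sketch also assumes that the model-inferred rank corresponds to the published ai_enhanced method, and that an exact tie on both method and confidence keeps the existing fact.

```typescript
// Hypothetical shapes for illustration only; not the real schema.
type ProvenanceMethod =
  | "cms_field"
  | "schema_org"
  | "seo_meta"
  | "text_extraction"
  | "ai_enhanced";

interface Fact {
  subject: string;
  predicate: string;
  value: unknown;
  method: ProvenanceMethod;
  confidence: number;
  pinned?: boolean;
  hidden?: boolean;
}

// Precedence from the merge policy: a lower number wins.
const PRECEDENCE: Record<ProvenanceMethod, number> = {
  cms_field: 1,
  schema_org: 2,
  seo_meta: 3,
  text_extraction: 4,
  ai_enhanced: 5,
};

interface MergeResult {
  kept: Fact;
  superseded: Fact; // the caller would record this one in history
}

function mergeFacts(current: Fact, incoming: Fact): MergeResult {
  // A pinned fact is exempt from auto-merge demotion.
  if (current.pinned) return { kept: current, superseded: incoming };

  const rankDiff = PRECEDENCE[incoming.method] - PRECEDENCE[current.method];
  const incomingWins =
    rankDiff < 0 || (rankDiff === 0 && incoming.confidence > current.confidence);

  // Assumption: on an exact tie the existing fact stays.
  return incomingWins
    ? { kept: incoming, superseded: current }
    : { kept: current, superseded: incoming };
}

// Hidden facts are filtered out before any exporter sees them.
function exportable(facts: Fact[]): Fact[] {
  return facts.filter((fact) => !fact.hidden);
}
```

Under these assumptions, a cms_field price replaces an seo_meta price for the same subject and predicate regardless of arrival order, unless the existing fact is pinned.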

Deterministic-first, zero-LLM free tier

Extraction is deterministic by default. The four deterministic extractors use no LLM. The free tier triggers zero LLM work anywhere in the pipeline, and a free-tier job that attempts LLM work is a contract violation rejected at enqueue. This makes free-tier output fully reproducible and auditable: given identical graph content, exporter output is byte-identical.

LLM enrichment, where it exists, is opt-in, paid (advanced tier), budgeted, and provenance-tagged with the lower confidence. It never blocks the deterministic path.
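As a rough illustration of the enqueue-time check, the sketch below rejects a free-tier job that asks for LLM work before it reaches the queue. The Job shape, the usesLlm flag, and the ContractViolationError name are assumptions made for this example, not the actual job schema.

```typescript
// Hypothetical job shape; the real queue payload is not documented here.
type Tier = "free" | "advanced";

interface Job {
  websiteId: string;
  tier: Tier;
  usesLlm: boolean;
}

class ContractViolationError extends Error {}

function enqueue(queue: Job[], job: Job): void {
  // The free tier triggers zero LLM work, so the job is refused up front
  // instead of failing somewhere later in the pipeline.
  if (job.tier === "free" && job.usesLlm) {
    throw new ContractViolationError("free-tier jobs must not request LLM work");
  }
  queue.push(job);
}
```

Rejecting at enqueue keeps the deterministic path free of LLM calls by construction, so no later stage needs its own tier check.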

A prompt-injection guard neutralizes hostile instructions found in source content and tags them as untrusted. That tag is never written into a persisted provenance column.

Trust tiers

A website must prove ownership before its node is trusted. Verification has a state machine (pending, verified, failed, grace, revoked) and a method that determines a trust tier. The tier ranking, highest to lowest, is:

dns_txt (highest)
plugin_signed
well_known
meta_tag
manual (reviewed)

The tier feeds the Source.trust score. It is published in trust.json and mirrored into manifest.json.trust_tier, so an AI consumer can weight a node by how its domain ownership was proven. See Verifying ownership.
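A consumer might use the published tier along these lines. This is a sketch: the NodeManifest shape and the helper names are assumptions, and only the tier names and their order come from the list above.

```typescript
// Tier names in ranked order, highest first, as listed above.
const TIER_ORDER = ["dns_txt", "plugin_signed", "well_known", "meta_tag", "manual"] as const;
type TrustTier = (typeof TIER_ORDER)[number];

// Hypothetical minimal view of manifest.json; only trust_tier is taken from the text.
interface NodeManifest {
  url: string;
  trust_tier: TrustTier;
}

// Lower index means a stronger ownership proof.
function tierRank(tier: TrustTier): number {
  return TIER_ORDER.indexOf(tier);
}

function outranks(a: TrustTier, b: TrustTier): boolean {
  return tierRank(a) < tierRank(b);
}

// Returns a new array ordered from strongest to weakest proof.
function sortByTrust(nodes: NodeManifest[]): NodeManifest[] {
  return [...nodes].sort((a, b) => tierRank(a.trust_tier) - tierRank(b.trust_tier));
}
```

The sketch only orders nodes. How much numeric weight each tier deserves is left to the consumer.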

The exporter-module pattern

Every output format is a self-registering module implementing the Exporter interface. Each module declares its name, version, tier (free or advanced), the output file paths it produces, a pure supports(ctx) gate, and a streaming generate(ctx) that yields NDJSON lines or file segments.

Streaming only. No module materializes the whole graph; it reads via a paging GraphReader.
Self-registration. Modules register into a registry at import; the publish pipeline runs registry.all().filter(e => e.supports(ctx)).
Deterministic order. The registry emits in a fixed (phase, name) order so manifest.json is written last and references only files already produced.
Adding a format equals adding a module. The pipeline core never changes.
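The pattern above can be sketched in a few lines. Everything here is illustrative: the interface fields mirror the description, but the numeric phase field, the produced list on the context, the EntityRow shape, and the advanced_demo module are assumptions for this example, and the real Exporter interface may differ.

```typescript
type Tier = "free" | "advanced";

interface EntityRow { id: string; name: string; }

// Paging reader: hands out one page at a time so no module holds the whole graph.
interface GraphReader {
  entityPages(pageSize: number): Iterable<EntityRow[]>;
}

interface ExportContext {
  tier: Tier;
  reader: GraphReader;
  produced: string[]; // assumption: paths already written in this publish
}

interface Exporter {
  name: string;
  version: string;
  tier: Tier;
  phase: number; // assumption: lower phases run first
  outputPaths: string[];
  supports(ctx: ExportContext): boolean; // pure gate
  generate(ctx: ExportContext): Iterable<string>; // streams lines or segments
}

class ExporterRegistry {
  private modules: Exporter[] = [];
  register(exporter: Exporter): void { this.modules.push(exporter); }
  // Fixed (phase, name) order, independent of registration order.
  all(): Exporter[] {
    return [...this.modules].sort(
      (a, b) => a.phase - b.phase || (a.name < b.name ? -1 : a.name > b.name ? 1 : 0),
    );
  }
}

const registry = new ExporterRegistry();

// Registered first on purpose: ordering must not depend on registration order.
registry.register({
  name: "manifest", version: "1.0.0", tier: "free", phase: 9,
  outputPaths: ["manifest.json"],
  supports: () => true,
  *generate(ctx: ExportContext) { yield JSON.stringify({ files: ctx.produced }); },
});

registry.register({
  name: "entities", version: "1.0.0", tier: "free", phase: 1,
  outputPaths: ["entities.jsonl"],
  supports: () => true,
  *generate(ctx: ExportContext) {
    for (const page of ctx.reader.entityPages(100)) {
      for (const entity of page) yield JSON.stringify(entity);
    }
  },
});

// Hypothetical advanced-tier module, gated by supports(ctx).
registry.register({
  name: "advanced_demo", version: "0.1.0", tier: "advanced", phase: 1,
  outputPaths: ["advanced_demo.jsonl"],
  supports: (ctx: ExportContext) => ctx.tier === "advanced",
  *generate() { yield JSON.stringify({ note: "placeholder" }); },
});

function publish(reader: GraphReader, tier: Tier): Record<string, string[]> {
  const ctx: ExportContext = { tier, reader, produced: [] };
  const files: Record<string, string[]> = {};
  for (const exporter of registry.all().filter((e) => e.supports(ctx))) {
    // Collected in memory for the demo only; a real pipeline would stream to storage.
    const lines: string[] = [];
    for (const line of exporter.generate(ctx)) lines.push(line);
    for (const path of exporter.outputPaths) {
      files[path] = lines;
      ctx.produced.push(path);
    }
  }
  return files;
}
```

Because the manifest module sits in the last phase, it only ever lists paths that earlier modules have already produced, and adding a new format means adding one more register call.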

See Node files for which exporter modules are built today.

Stable ids vs uuids

Internally every row has a PostgreSQL uuid primary key. These uuids never appear in node files. Exported records instead carry a namespaced string id that stays the same across versions for the same logical record:

Prefix	Record
`entity_`	entity
`fact_`	fact
`rel_`	relationship
`chunk_`	AIChunk
`qa_`	synthetic QA
`src_`	source

Cross-references inside the node use the stable id, so a reference from facts.jsonl.subject_entity_id to entities.jsonl.id stays valid across versions.
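A consumer could check these cross-references with something like the sketch below. The helper names are hypothetical, and the sketch assumes that each line of entities.jsonl has an id field and each line of facts.jsonl has a subject_entity_id field, as the reference above suggests.

```typescript
// Prefix table copied from above.
const ID_PREFIXES: Record<string, string> = {
  entity_: "entity",
  fact_: "fact",
  rel_: "relationship",
  chunk_: "AIChunk",
  qa_: "synthetic QA",
  src_: "source",
};

// Maps a stable id to its record kind, or undefined for an unknown prefix.
function recordKind(id: string): string | undefined {
  const prefix = Object.keys(ID_PREFIXES).find((p) => id.startsWith(p));
  return prefix === undefined ? undefined : ID_PREFIXES[prefix];
}

function parseNdjson(text: string): Array<Record<string, unknown>> {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as Record<string, unknown>);
}

// Returns subject ids in facts.jsonl that do not resolve to a line in entities.jsonl.
function danglingSubjects(entitiesJsonl: string, factsJsonl: string): string[] {
  const known = new Set(parseNdjson(entitiesJsonl).map((e) => String(e["id"])));
  return parseNdjson(factsJsonl)
    .map((f) => String(f["subject_entity_id"]))
    .filter((id) => recordKind(id) !== "entity" || !known.has(id));
}
```

Since the ids are stable, the same check can be run against two different published versions and the results compared directly.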

Idempotency and versioning

Ingestion is idempotent. A (website, item.id, checksum) already seen is acked as skipped without reprocessing; a repeated Idempotency-Key replays the cached response. See Ingestion and signing.
Publishing is versioned. Each publish writes an immutable versions/{ulid}/ directory and atomically flips a mutable latest/ pointer last. The ULID is the monotonic ordering key; the human label is display-only.
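The ingestion half of this can be sketched in memory as follows. The IngestLog class and the Ack values are hypothetical names, a real service would persist both lookups, and treating the Idempotency-Key as globally unique (instead of scoped per website) is a simplifying assumption.

```typescript
interface Item {
  id: string;
  checksum: string;
}

type Ack = "processed" | "skipped";

class IngestLog {
  private seen = new Set<string>(); // (website, item.id, checksum) triples
  private replies = new Map<string, Ack[]>(); // cached responses by Idempotency-Key

  ingest(websiteId: string, items: Item[], idempotencyKey?: string): Ack[] {
    // A repeated Idempotency-Key replays the cached response untouched.
    if (idempotencyKey !== undefined) {
      const cached = this.replies.get(idempotencyKey);
      if (cached !== undefined) return cached;
    }

    const acks = items.map((item): Ack => {
      const key = JSON.stringify([websiteId, item.id, item.checksum]);
      if (this.seen.has(key)) return "skipped"; // already seen: ack without reprocessing
      this.seen.add(key);
      return "processed";
    });

    if (idempotencyKey !== undefined) this.replies.set(idempotencyKey, acks);
    return acks;
  }
}
```

The two mechanisms cover different retries: the triple catches unchanged content sent in a new request, while the key catches the same request sent twice.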

Multi-tenancy

The organization is the tenant root. Every operational and knowledge row scopes back to one organization, directly or via its website. Tenant scoping is enforced in services and queries (every knowledge query is WHERE website_id = $1). A cross-tenant read returns 404 (so existence is not leaked) and a cross-tenant mutation returns 403.
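The status-code rule can be expressed as a small decision function. This is a sketch: the KnowledgeRow shape and the authorize helper are assumptions, and returning 404 for a mutation of a row that does not exist at all is a guess that the text does not confirm.

```typescript
type Status = 200 | 403 | 404;
type Operation = "read" | "mutate";

// Hypothetical row shape; real rows scope to a website via website_id.
interface KnowledgeRow {
  id: string;
  websiteId: string;
}

function authorize(
  op: Operation,
  row: KnowledgeRow | undefined,
  callerWebsiteIds: string[],
): Status {
  // Assumption: a row that does not exist is a 404 for both operations.
  if (row === undefined) return 404;
  if (callerWebsiteIds.indexOf(row.websiteId) !== -1) return 200;
  // Cross-tenant read hides existence; cross-tenant mutation is refused outright.
  return op === "read" ? 404 : 403;
}
```

In the service itself the read case falls out of the query: with WHERE website_id = $1 in place, another tenant's row is simply not found.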
