The knowledge graph
Every piece of ingested content is normalized into a typed knowledge graph stored in PostgreSQL; the published node files are a projection of that graph, not the source of truth.
The graph is what makes bAInquet more than a file converter: it deduplicates, merges, resolves relationships, and tracks provenance and confidence across all of a site's content. Output formats come and go as exporter modules; the graph is the durable asset.
What the graph contains
The graph is made of six record types. Each is a Postgres table, each carries a website_id for tenant scoping, and each is the canonical row behind one or more node files.
| Concept | Table | Node file | What it is |
|---|---|---|---|
| Source | source | sources.jsonl | A provenance record for one ingested URL: url, checksum, last-seen, content type, and a derived trust score. Every exported record points back to a source. |
| Entity | entity | entities.jsonl | A thing the site is about: a Product, Person, Place, Organization, FAQ, and so on. Carries an open attributes JSONB bag, a mandatory confidence, and a language. |
| Fact | fact | facts.jsonl | A typed subject-predicate-value statement, for example "Single-Origin Ethiopia / price / 16.99 EUR". The value stays typed (number, string, boolean, object) with optional unit/currency and a pre-rendered display. |
| Relationship | relationship_edge | relationships.jsonl | A typed edge between two entities, for example offers or located_at. Carries a resolutionStatus of resolved or dangling. |
| AIChunk | ai_chunk | chunks.jsonl | A retrieval-sized text chunk with a stable id, referenced entity ids, a token estimate, and, when enabled, a pgvector embedding. |
| SyntheticQA | synthetic_qa | qa.jsonl | A generated question/answer pair derived from facts, with derivedFromFactIds provenance and a stale flag. |
The graph is traversed with recursive SQL (common table expressions, WITH RECURSIVE) in PostgreSQL behind a GraphStore interface. There is no separate graph database, and embedding vectors live in-engine via pgvector, not in an external vector store.
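To illustrate the recursive-CTE traversal described above, the sketch below walks outgoing edges from one entity, scoped to a single website_id. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a server (both engines support WITH RECURSIVE). The simplified relationship_edge columns, the offered_by predicate, and the reachable helper are assumptions for illustration, not the actual schema or the GraphStore API.

```python
import sqlite3

# Hypothetical, simplified edge table; the real relationship_edge schema may differ.
SCHEMA = """
CREATE TABLE relationship_edge (
    website_id TEXT NOT NULL,
    source_id  TEXT NOT NULL,
    target_id  TEXT NOT NULL,
    predicate  TEXT NOT NULL
);
"""

# Walk outgoing edges from a start entity, bounded by depth so cycles terminate.
TRAVERSE = """
WITH RECURSIVE reachable(entity_id, depth) AS (
    SELECT ?, 0
    UNION
    SELECT e.target_id, r.depth + 1
    FROM relationship_edge AS e
    JOIN reachable AS r ON e.source_id = r.entity_id
    WHERE e.website_id = ? AND r.depth < ?
)
SELECT entity_id, MIN(depth)
FROM reachable
GROUP BY entity_id
ORDER BY MIN(depth), entity_id;
"""


def reachable(conn, website_id, start_id, max_depth=3):
    """Return (entity_id, shortest_depth) pairs reachable from start_id."""
    rows = conn.execute(TRAVERSE, (start_id, website_id, max_depth)).fetchall()
    return [(entity_id, depth) for entity_id, depth in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO relationship_edge VALUES (?, ?, ?, ?)",
    [
        ("site_a", "entity_roastery", "entity_ethiopia", "offers"),
        ("site_a", "entity_roastery", "entity_berlin", "located_at"),
        ("site_a", "entity_ethiopia", "entity_roastery", "offered_by"),  # cycle
        ("site_b", "entity_roastery", "entity_other", "offers"),  # other tenant
    ],
)
result = reachable(conn, "site_a", "entity_roastery")
```

The depth bound is one simple way to keep a cyclic graph from recursing forever; the edge that belongs to site_b is never visited because the recursive step filters on website_id.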
How the graph is built
Ingested content lands as a raw_content_item. A deterministic normalization pipeline extracts entities, facts, relationships, and chunks from it, then a merge step folds new facts into the existing graph and a resolution step connects relationship edges. Each step runs as a worker job; without the worker running, content is accepted and persisted but the graph is not built.
The four extractors run in a fixed priority order, each stamping a provenance_method and a confidence score. None of them uses an LLM; the llm_inferred row in the table is the advanced-tier enrichment path, listed for comparison.
| Provenance method | Confidence | Extractor priority | Source of the value |
|---|---|---|---|
| cms_field | 0.98 | 1 | Structured CMS fields supplied in the item json bag |
| schema_org | 0.95 | 2 | schema.org JSON-LD on the page |
| seo_meta | 0.85 | 3 | OpenGraph and SEO meta tags |
| text_extraction | 0.75 | 4 | Deterministic extraction from plain text |
| llm_inferred | 0.60-0.90 | advanced tier only | Model-scored; never on the free tier |
A prompt-injection guard neutralizes hostile instructions found in source content and tags them as untrusted, without ever writing that tag into a persisted provenance column.
Provenance and the confidence ladder
Every derived entity, fact, and relationship records how it was extracted (provenance_method) and how much to trust it (confidence, a number in [0,1]). The confidence values above form a ladder used by both merge and trust scoring.
When the same (subject, predicate) arrives more than once, the merge policy keeps the highest-precedence provenance method: cms_field > schema_org > seo_meta > text_extraction > llm_inferred. Ties break on higher confidence. The superseded fact is recorded in history, not deleted. Two editorial overrides modify this:
- A pinned fact is exempt from auto-merge demotion.
- A hidden fact is never emitted by any exporter.
Only the surviving fact per (subject, predicate) reaches the node. Relationship edges that are still dangling (their target entity not yet present) are re-resolved on the next graph build and are suppressed from export until then. Q&A whose source fact was edited or hidden is marked stale and excluded from export until regenerated.
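The merge policy above can be sketched as a small in-memory function. This is a minimal illustration, not the actual implementation: the Fact and FactSlot classes, the merge, beats, and exportable helpers, and the sample prices 15.99 and 17.50 are hypothetical. Two behaviours are assumptions the text does not specify: a losing candidate is also kept in history, and an exact tie on both method and confidence leaves the incumbent in place.

```python
from dataclasses import dataclass, field

# Precedence from the merge policy, highest first.
PRECEDENCE = ["cms_field", "schema_org", "seo_meta", "text_extraction", "llm_inferred"]
RANK = {method: index for index, method in enumerate(PRECEDENCE)}


@dataclass
class Fact:
    subject: str
    predicate: str
    value: object
    provenance_method: str
    confidence: float
    pinned: bool = False  # editorial override: exempt from auto-merge demotion
    hidden: bool = False  # editorial override: never exported


@dataclass
class FactSlot:
    """Hypothetical holder for the surviving fact plus superseded history."""
    current: Fact
    history: list = field(default_factory=list)


def beats(candidate, incumbent):
    """True if candidate should replace incumbent under the precedence rules."""
    if incumbent.pinned:
        return False
    c = RANK[candidate.provenance_method]
    i = RANK[incumbent.provenance_method]
    if c != i:
        return c < i  # lower index means higher precedence
    return candidate.confidence > incumbent.confidence  # tie-break


def merge(graph, candidate):
    """Fold one candidate fact into graph, keyed by (subject, predicate)."""
    key = (candidate.subject, candidate.predicate)
    slot = graph.get(key)
    if slot is None:
        graph[key] = FactSlot(current=candidate)
    elif beats(candidate, slot.current):
        slot.history.append(slot.current)  # superseded, not deleted
        slot.current = candidate
    else:
        slot.history.append(candidate)  # assumption: losers are kept too


def exportable(graph):
    """Surviving facts that an exporter may emit (hidden facts excluded)."""
    return [slot.current for slot in graph.values() if not slot.current.hidden]


graph = {}
subject = "Single-Origin Ethiopia"
merge(graph, Fact(subject, "price", 15.99, "seo_meta", 0.85))
merge(graph, Fact(subject, "price", 16.99, "cms_field", 0.98))
merge(graph, Fact(subject, "price", 17.50, "text_extraction", 0.75))
survivor = graph[(subject, "price")]
```

After the three merges, the cms_field value is the surviving fact and the other two sit in history.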
Deterministic versus AI-enhanced
Extraction is deterministic by default. The free tier triggers zero LLM work anywhere in the pipeline; a free-tier job that attempts LLM work is a contract violation rejected at enqueue. Free-tier output is fully reproducible and auditable: every fact traces to a cms_field, schema_org, seo_meta, or text_extraction source.
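One way to picture the enqueue-time rejection is a guard in front of the job queue. The sketch below is an assumption-laden illustration: the Job fields, the tier labels, the job kinds, and the ContractViolation exception are all hypothetical names, not the worker's real API.

```python
from dataclasses import dataclass


class ContractViolation(Exception):
    """Raised when a job breaks the tier contract (hypothetical name)."""


@dataclass(frozen=True)
class Job:
    website_id: str
    tier: str       # assumed labels: "free" or "advanced"
    kind: str       # assumed job kinds, e.g. "normalize" or "llm_gap_fill"
    uses_llm: bool


def enqueue(queue, job):
    """Append job to queue, rejecting free-tier jobs that request LLM work."""
    if job.tier == "free" and job.uses_llm:
        raise ContractViolation(
            f"free-tier job {job.kind!r} may not request LLM work"
        )
    queue.append(job)
    return job


queue = []
enqueue(queue, Job("site_a", "free", "normalize", uses_llm=False))
try:
    enqueue(queue, Job("site_a", "free", "llm_gap_fill", uses_llm=True))
    rejected = False
except ContractViolation:
    rejected = True
```

Checking at enqueue rather than inside the worker means a violating job never occupies a queue slot, which matches the "rejected at enqueue" wording above.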
Paid (advanced) tiers add LLM enrichment: gap-fill, inferred relationships, grounded Q&A, and tuned chunks. That enrichment is opt-in, budgeted, and never blocks the deterministic path. Every enriched record is tagged provenance_method: "llm_inferred" with the lower confidence band, so a consumer can separate machine-inferred data from owner-supplied data at read time.
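A consumer could perform that read-time separation with a few lines. The sketch below assumes each exported JSONL record carries a provenance_method key as described above; the id and confidence keys, the sample records, and the split_by_provenance helper are illustrative assumptions.

```python
import json

# The four deterministic methods named in the confidence ladder.
DETERMINISTIC = {"cms_field", "schema_org", "seo_meta", "text_extraction"}


def split_by_provenance(jsonl_text):
    """Partition JSONL records into (deterministic, inferred) lists.

    Records with an unknown or missing provenance_method are treated as
    inferred, which is the conservative choice for a consumer.
    """
    deterministic, inferred = [], []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("provenance_method") in DETERMINISTIC:
            deterministic.append(record)
        else:
            inferred.append(record)
    return deterministic, inferred


# Hypothetical sample standing in for a facts.jsonl payload.
sample = "\n".join(
    json.dumps(record)
    for record in [
        {"id": "fact_price", "provenance_method": "cms_field", "confidence": 0.98},
        {"id": "fact_origin", "provenance_method": "llm_inferred", "confidence": 0.7},
        {"id": "fact_title", "provenance_method": "seo_meta", "confidence": 0.85},
    ]
)
owner_supplied, machine_inferred = split_by_provenance(sample)
```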
Stable ids
Internally every row has a PostgreSQL uuid primary key, but UUIDs never appear in node files. Exported records instead carry a namespaced string id (entity_*, fact_*, rel_*, chunk_*, qa_*, src_*) that stays the same across versions for the same logical record, so cross-references inside a node remain valid between publishes. See Node files for the namespace table.
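Because the namespace prefix is part of every exported id, a consumer can tell what kind of record a reference points at without a lookup. The sketch below assumes the obvious prefix-to-concept mapping (the authoritative table lives in Node files); the record_type helper and the id suffixes are hypothetical.

```python
# Assumed mapping from id prefix to record concept; see Node files for the
# authoritative namespace table.
NAMESPACES = {
    "entity_": "Entity",
    "fact_": "Fact",
    "rel_": "Relationship",
    "chunk_": "AIChunk",
    "qa_": "SyntheticQA",
    "src_": "Source",
}


def record_type(stable_id):
    """Return the record concept for a namespaced id, or raise ValueError."""
    for prefix, concept in NAMESPACES.items():
        if stable_id.startswith(prefix):
            return concept
    raise ValueError(f"not a namespaced node id: {stable_id!r}")


kinds = [record_type(i) for i in ("entity_ethiopia", "fact_price", "src_home")]

# A raw UUID has no namespace prefix, so it is rejected.
try:
    record_type("3f2b8c1e-9a4d-4e7b-8f10-2c5d6e7a8b9c")
    uuid_rejected = False
except ValueError:
    uuid_rejected = True
```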
The trust tier
A website must prove ownership before its node is trusted. Verification has a state machine (pending, verified, failed, grace, revoked) and a method that determines a trust tier. The tier ranking, highest to lowest, is:
- dns_txt (highest)
- plugin_signed
- well_known
- meta_tag (medium-low)
- manual (reviewed)
The tier feeds the Source.trust score and is published in trust.json, then mirrored into manifest.json.trust_tier, so a consumer can weight a node by how its domain ownership was proven without fetching the advanced file.
Two trust-tier vocabularies
The internal WebsiteVerification.method enum uses well_known_file and plugin; the published trust.json and manifest.json.trust_tier use well_known and plugin_signed. When reading a node, use the published-file values.
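The two vocabularies and the ranking can be captured in a small lookup. This is a sketch under stated assumptions: it takes the ranking order exactly as listed above, assumes dns_txt, meta_tag, and manual are spelled the same in both vocabularies, and uses hypothetical helper names and a hypothetical domain key on the sample manifests.

```python
# Published tier names, highest trust first, in the order listed above.
PUBLISHED_RANKING = ["dns_txt", "plugin_signed", "well_known", "meta_tag", "manual"]

# The only two internal enum values the text says differ from published ones.
INTERNAL_TO_PUBLISHED = {
    "well_known_file": "well_known",
    "plugin": "plugin_signed",
}


def to_published(method):
    """Translate an internal verification method to its published tier name."""
    return INTERNAL_TO_PUBLISHED.get(method, method)


def tier_rank(published_tier):
    """Position in the ranking; 0 is the highest tier."""
    return PUBLISHED_RANKING.index(published_tier)


def most_trusted(manifests):
    """Pick the manifest whose trust_tier ranks highest."""
    return min(manifests, key=lambda m: tier_rank(m["trust_tier"]))


# Hypothetical manifest excerpts; only trust_tier comes from the text above.
manifests = [
    {"domain": "a.example", "trust_tier": "meta_tag"},
    {"domain": "b.example", "trust_tier": "plugin_signed"},
    {"domain": "c.example", "trust_tier": "well_known"},
]
best = most_trusted(manifests)
```

A consumer only ever sees the published names, so to_published matters mainly for code that bridges internal records and exported files.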
Related
- Node files: the published projection of this graph.
- Consuming a node: how an AI consumer reads it.
- Status and scope: which exporters and capabilities are shipped.