← All Case Studies
Data Infrastructure

Knowledge Graph + Entity Intelligence

by Aryan Arora

12 entity types, auto-enrichment pipeline, 315+ nodes and 597+ edges. Every project artifact mapped and traversable.

PostgreSQL asyncpg FastAPI Mermaid.js
Impact

Knowledge graph with 12 entity types and auto-enrichment tracking 315+ nodes and 597+ edges across all project artifacts

The Problem

Project data was scattered across tables with no connective tissue. Tasks existed in one table, sprints in another, research documents elsewhere, revenue entries isolated. You could query individual entities but couldn't answer relational questions: "What research informed this sprint?" or "Which tasks generated this revenue?" The relationships existed implicitly in people's heads but not in the data.

For AI agents operating autonomously, this was fatal. Agents can't reason about project context when the context graph doesn't exist.

What I Built

A knowledge graph layer on top of PostgreSQL that tracks 12 entity types and their relationships. The graph builds itself — every entity creation automatically triggers relation enrichment, and a boot-time backfill catches anything missed. No manual graph maintenance required.

The key architectural decision: build the graph inside PostgreSQL rather than introducing a dedicated graph database. The relations table with entity_type + entity_id pairs handles our traversal patterns through recursive CTEs, and we avoid the operational overhead of a separate database.

Architecture

Why PostgreSQL over Neo4j: At 315+ nodes and 597+ edges, our graph is well within PostgreSQL's comfort zone. Neo4j would require a separate database process, separate backup strategy, separate monitoring, and Cypher query language — all for a graph that PostgreSQL handles with a single relations table and recursive CTEs. The operational simplicity dividend is enormous. If the graph grows past 100K edges, we'd reconsider.

Why asyncpg over SQLAlchemy: Graph traversal queries hit the database frequently during enrichment. Raw asyncpg with prepared statements gives us 2-3x throughput over SQLAlchemy's async adapter for the batch INSERT patterns that bulk_enrich_graph_internal() executes on boot.

12 entity types: task, sprint, session, app, research, revenue, content, user, contact, presentation, skill, tool

Relation types: belongs_to, depends_on, created_in, part_of

Auto-enrichment pipeline:
1. Entity created (task, session, sprint, app) with project_id
2. auto_enrich_relation() fires post-INSERT
3. Graph edge inserted: (entity_type, entity_id) --[relation]--> (target_type, target_id)
4. No manual step, no cron job, no batch process

Boot enrichment: bulk_enrich_graph_internal() runs on application startup, scanning all entities and backfilling any missing edges. Idempotent — safe to run repeatedly.

Key Decisions

Auto-enrich on create over batch enrichment — real-time graph updates mean the graph is always current. Batch enrichment introduces a window where the graph is stale, which breaks agent reasoning.

Simple relation types over rich edge properties — four relation types (belongs_to, depends_on, created_in, part_of) cover 95% of our traversal queries. Adding edge weights, timestamps, or metadata would complicate queries without proportional benefit at this scale.

Boot-time backfill as safety net — auto-enrichment should catch everything, but defensive programming means we verify on startup. The backfill adds under 2 seconds to boot time and guarantees graph completeness.

Impact

  • 315+ nodes and 597+ edges built entirely through auto-enrichment — zero manual graph curation
  • 12 entity types covering every project artifact from tasks to tools
  • Agents query the graph to understand project context before taking action
  • Graph visualization (Mermaid.js) provides instant project topology overview
  • Boot enrichment guarantees no orphaned entities

Trade-offs

PostgreSQL graph queries are O(edges) for traversal, not O(1) like native graph databases. At our scale this is irrelevant, but it would matter at 100K+ edges. The simple relation type system (4 types) occasionally forces awkward modeling — "this research informed this sprint" uses belongs_to, which isn't semantically precise.