A design document written for an AI coding agent (Claude Code, Codex, Cursor, etc.). The user’s friend is a philosophy PhD student with ~1000 PDF papers. She needs a knowledge base that surfaces connections between papers and captures her own ideas, without reading everything first. This document tells you, the agent, exactly what to build. No design decisions needed. Execute.
Overview
Build a local-first knowledge constellation system with three layers:
- Batch processing: ingest ~1000 PDFs, extract structure, generate connections
- A visual constellation: an interactive web UI showing paper clusters and links
- A thinking space: where she can write ideas and link them to papers
No API keys are required. An OpenAI or Anthropic key is optional and only used for LLM extraction and summarization; by default everything runs locally.
Layer 1: Batch PDF Processing
Step 1.1: Project setup
Create a directory knowledge-constellation/ with a Python virtual environment.
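Dependencies: pypdf2 (deprecated upstream; development continues under the name pypdf), sentence-transformers, scikit-learn (1.3 or later, which includes the HDBSCAN used in Step 1.3), networkx, flask, discern (a small local LLM or API wrapper).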
Step 1.2: Ingestion
The user has a folder of PDFs. She runs:
python ingest.py --pdf-dir /path/to/papers
For each PDF, the script:
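- Extracts text using PyPDF2 (falling back to pypdfium2 when extraction fails; scanned PDFs with no text layer need a separate OCR step, which neither library provides)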
- Chunks the text into segments of ~500 words with overlap
- For each chunk, asks an LLM to extract:
- Core claim / thesis statement (1-2 sentences)
- Key concepts (list of 3-10)
- References to other works mentioned in-text
- School of thought / tradition (if identifiable)
- Extracts the bibliography section and parses references (regex-based, not perfect, but good enough for constellation purposes)
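The chunking step above does not fix the overlap size. A minimal sketch, assuming a 50-word overlap (an assumption, not a value from this document) and a hypothetical helper name `chunk_words`:

```python
def chunk_words(text, chunk_size=500, overlap=50):
    """Split text into windows of `chunk_size` words.

    Each window shares `overlap` words with the previous one.
    The 50-word default is an illustrative assumption.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Splitting on whitespace keeps the sketch dependency-free; a sentence-aware splitter may give cleaner chunk boundaries for the LLM extraction step.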
Store everything in a SQLite database with this schema:
CREATE TABLE papers (
    id TEXT PRIMARY KEY,
    title TEXT,
    authors TEXT,
    year INTEGER,
    abstract TEXT,
    file_path TEXT
);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id TEXT REFERENCES papers(id),
    text TEXT,
    core_claim TEXT,
    key_concepts TEXT, -- JSON array
    tradition TEXT
);
CREATE TABLE references_table (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_paper_id TEXT REFERENCES papers(id),
    target_title TEXT,
    target_authors TEXT,
    target_year INTEGER
);
CREATE TABLE ideas (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT,
    content TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE idea_paper_links (
    idea_id INTEGER REFERENCES ideas(id),
    paper_id TEXT REFERENCES papers(id),
    relationship TEXT -- 'supports', 'challenges', 'inspired_by', 'context_for'
);
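As an illustration of how ingest.py might write to this schema, here is a minimal sketch covering only the papers and chunks tables. The helper names (`open_db`, `add_chunk`, `concepts_for_paper`) are hypothetical. It assumes key_concepts is stored as a JSON string, and it turns on foreign key enforcement, which SQLite leaves off by default.

```python
import json
import sqlite3

# Subset of the schema above; IF NOT EXISTS makes re-runs safe.
SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    id TEXT PRIMARY KEY, title TEXT, authors TEXT,
    year INTEGER, abstract TEXT, file_path TEXT
);
CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id TEXT REFERENCES papers(id),
    text TEXT, core_claim TEXT, key_concepts TEXT, tradition TEXT
);
"""

def open_db(path=":memory:"):
    """Open the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript(SCHEMA)
    return conn

def add_chunk(conn, paper_id, text, core_claim, key_concepts, tradition=None):
    """Insert one chunk; key_concepts is a Python list stored as JSON."""
    cur = conn.execute(
        "INSERT INTO chunks (paper_id, text, core_claim, key_concepts, tradition) "
        "VALUES (?, ?, ?, ?, ?)",
        # ensure_ascii=False keeps non-English concepts readable in the DB.
        (paper_id, text, core_claim,
         json.dumps(key_concepts, ensure_ascii=False), tradition),
    )
    conn.commit()
    return cur.lastrowid

def concepts_for_paper(conn, paper_id):
    """Return the sorted union of key concepts across a paper's chunks."""
    rows = conn.execute(
        "SELECT key_concepts FROM chunks WHERE paper_id = ?", (paper_id,))
    concepts = set()
    for (raw,) in rows:
        concepts.update(json.loads(raw))
    return sorted(concepts)
```

With foreign keys enabled, inserting a chunk for an unknown paper_id raises sqlite3.IntegrityError instead of silently creating an orphan row.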
Step 1.3: Embedding and clustering
For each paper, compute an embedding vector using sentence-transformers/all-MiniLM-L6-v2 (runs locally, no GPU needed, 384-dim vectors, fast).
Then:
- Cluster all papers using HDBSCAN (handles noise/outliers well, no need to pre-specify cluster count)
- Extract cluster topics: for each cluster, ask the LLM to examine the top 10 papers’ claims and generate a 1-sentence cluster label
- Find bridges: papers whose embedding lies close to two cluster centroids at once (cosine distance to both centroids < threshold). These are cross-domain connection points
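The clustering and bridge steps above could be sketched as follows. This assumes scikit-learn 1.3 or later for `sklearn.cluster.HDBSCAN`, and the `min_cluster_size=5` and `threshold=0.35` defaults are illustrative guesses rather than values from this document. The function names are hypothetical. Heavy imports are kept inside `embed_and_cluster` so the bridge logic can be tested without downloading a model.

```python
import math
from collections import defaultdict

def embed_and_cluster(texts, min_cluster_size=5):
    """Embed one text per paper and cluster with HDBSCAN (label -1 = noise)."""
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import HDBSCAN  # requires scikit-learn >= 1.3
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    # Unit-length vectors make Euclidean distance rank pairs like cosine distance.
    vectors = model.encode(texts, normalize_embeddings=True)
    labels = HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(vectors)
    return vectors.tolist(), labels.tolist()

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def cluster_centroids(vectors, labels):
    """Mean vector per cluster, ignoring noise points (label -1)."""
    groups = defaultdict(list)
    for vec, label in zip(vectors, labels):
        if label != -1:
            groups[label].append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in groups.items()}

def find_bridges(vectors, labels, threshold=0.35):
    """Return (paper_index, cluster_a, cluster_b) for papers near two centroids."""
    centroids = cluster_centroids(vectors, labels)
    bridges = []
    for idx, vec in enumerate(vectors):
        nearest = sorted((cosine_distance(vec, c), label)
                         for label, c in centroids.items())
        # A bridge needs its second-nearest centroid within the threshold too.
        if len(nearest) >= 2 and nearest[1][0] < threshold:
            bridges.append((idx, nearest[0][1], nearest[1][1]))
    return bridges
```

The threshold will likely need tuning against the real corpus: too loose and most papers count as bridges, too tight and none do.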
Step 1.4: Citation graph
Using the references_table, build a directed graph:
- Inbound citation count = how many papers in this collection cite this paper (a signal that it is a foundation paper)
- Co-citation clusters = papers that are frequently cited together; they form a sub-conversation even if they aren’t directly linked
Merge citation graph + embedding clusters into a single in-memory graph (NetworkX).
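Because references_table stores only target titles, references first have to be matched back to papers in the collection. A minimal sketch of that matching plus the two signals above, using exact match on a normalized title (an assumed strategy; fuzzy matching may recover more links). All function names are hypothetical.

```python
import re
import unicodedata
from collections import Counter
from itertools import combinations

def normalize_title(title):
    """Case-fold, strip accents and punctuation so near-identical titles match."""
    text = unicodedata.normalize("NFKD", title).casefold()
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[\W_]+", " ", text).strip()

def citation_edges(papers, references):
    """papers: {paper_id: title}; references: [(source_paper_id, target_title)].

    Returns sorted (source_id, target_id) pairs for references that resolve
    to a paper inside the collection.
    """
    by_title = {normalize_title(title): pid for pid, title in papers.items()}
    edges = set()
    for source_id, target_title in references:
        target_id = by_title.get(normalize_title(target_title))
        if target_id is not None and target_id != source_id:
            edges.add((source_id, target_id))
    return sorted(edges)

def inbound_counts(edges):
    """How many papers in the collection cite each paper."""
    return Counter(target for _, target in edges)

def cocitation_counts(edges):
    """How often each pair of papers is cited together by the same source."""
    cited_by_source = {}
    for source, target in edges:
        cited_by_source.setdefault(source, set()).add(target)
    pairs = Counter()
    for targets in cited_by_source.values():
        pairs.update(combinations(sorted(targets), 2))
    return pairs
```

The edge list can then be loaded into NetworkX with `networkx.DiGraph().add_edges_from(edges)` for centrality measures.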
Layer 2: Visual Constellation (Web UI)
Build a single-page web app with Flask backend + D3.js or vis-network frontend.
Page 1: The Constellation View
An interactive force-directed graph where:
- Nodes = papers (size = influence score from citation count + centrality)
- Color = cluster membership (from HDBSCAN)
- Edges = citation links or high embedding similarity (>0.85)
- Hover shows paper title + cluster label
- Click opens the Paper Detail panel in Page 2
The graph must support:
- Zoom and pan
- Click-to-select with highlight of direct neighbors
- Search bar (typing a concept highlights all papers containing it)
- Filter by cluster (toggle clusters on/off)
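One way the Flask backend might prepare data for the constellation view is sketched below. It assumes a `{"nodes": [...], "links": [...]}` payload with `source`/`target` keys, the shape D3's force layout commonly consumes. The function name, the `kind` field, and sizing nodes by inbound citations alone (omitting centrality) are simplifying assumptions.

```python
import json
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_graph_payload(papers, embeddings, citation_edges, inbound,
                        sim_threshold=0.85):
    """Build a JSON-serializable graph for the frontend.

    papers: {paper_id: {"title": str, "cluster": int}}
    embeddings: {paper_id: [float, ...]}
    citation_edges: [(source_id, target_id)]
    inbound: {paper_id: inbound citation count}
    """
    nodes = [
        {"id": pid, "title": info["title"], "cluster": info["cluster"],
         "size": 1 + inbound.get(pid, 0)}
        for pid, info in papers.items()
    ]
    links = [{"source": s, "target": t, "kind": "citation"}
             for s, t in citation_edges]
    # Pairwise loop is quadratic; for ~1000 papers a vectorized matrix
    # product would be much faster than this pure-Python version.
    ids = sorted(embeddings)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine_similarity(embeddings[a], embeddings[b]) > sim_threshold:
                links.append({"source": a, "target": b, "kind": "similarity"})
    return {"nodes": nodes, "links": links}
```

A Flask route could return `json.dumps(build_graph_payload(...))`, and the frontend could style citation and similarity links differently using the `kind` field.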
Page 2: Paper Detail Panel
When a paper node is clicked, slide in a right panel showing:
Title, Authors, Year
Core Claims (from extraction)
Key Concepts
Tradition / School
Cited by X papers in this collection
Cites Y papers in this collection
[AI-generated] "This paper is at the intersection of: [cluster names]"
[AI-generated] "If you like this, also look at: [3-5 paper titles from same cluster or citation neighborhood]"
[Her Ideas section]
- List of her ideas linked to this paper
- Button: "Add Idea" (opens an inline editor)
Page 3: Ideas Space
A list/grid view of all her ideas, sorted by recency. Each idea card shows:
- Title + preview (first 100 chars)
- Number of linked papers
- Tags (auto-extracted key concepts, or manually added)
Click opens the full idea editor where she can:
- Write/edit the idea freely (markdown supported)
- Link to papers via a search box that autocompletes paper titles
- Add relationship type (inspired_by / challenges / supports / context_for)
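The idea editor's backend operations could look roughly like this. The sketch assumes the papers, ideas, and idea_paper_links tables from Step 1.2 already exist; the function names are hypothetical, and substring matching with LIKE is one possible way to implement title autocomplete.

```python
import sqlite3

# The four relationship types listed in the schema comment.
RELATIONSHIPS = {"supports", "challenges", "inspired_by", "context_for"}

def add_idea(conn, title, content):
    """Create an idea and return its id."""
    cur = conn.execute(
        "INSERT INTO ideas (title, content) VALUES (?, ?)", (title, content))
    conn.commit()
    return cur.lastrowid

def link_idea(conn, idea_id, paper_id, relationship):
    """Link an idea to a paper, rejecting unknown relationship types."""
    if relationship not in RELATIONSHIPS:
        raise ValueError(f"unknown relationship: {relationship!r}")
    conn.execute(
        "INSERT INTO idea_paper_links (idea_id, paper_id, relationship) "
        "VALUES (?, ?, ?)", (idea_id, paper_id, relationship))
    conn.commit()

def search_titles(conn, query, limit=10):
    """Autocomplete: papers whose title contains `query` (ASCII case-insensitive)."""
    # Escape LIKE wildcards so a literal % or _ in the query is matched as text.
    escaped = (query.replace("\\", "\\\\")
                    .replace("%", "\\%")
                    .replace("_", "\\_"))
    rows = conn.execute(
        "SELECT id, title FROM papers WHERE title LIKE ? ESCAPE '\\' "
        "ORDER BY title LIMIT ?", (f"%{escaped}%", limit))
    return rows.fetchall()

def papers_for_idea(conn, idea_id):
    """Return (paper_id, relationship) pairs linked to an idea."""
    rows = conn.execute(
        "SELECT paper_id, relationship FROM idea_paper_links "
        "WHERE idea_id = ? ORDER BY paper_id", (idea_id,))
    return rows.fetchall()
```

Validating the relationship in Python keeps the four allowed values in one place; a CHECK constraint on the column would be an alternative.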
Layer 3: Trust The Agent To Scaffold
The folder structure the agent should create:
knowledge-constellation/
├── requirements.txt
├── ingest.py              # Step 1.2: PDF ingestion
├── embed_cluster.py       # Step 1.3: embedding + clustering
├── build_graph.py         # Step 1.4: citation graph
├── pipeline.sh            # One script to run all above: ingest → embed → cluster → graph
├── app.py                 # Flask web app serving Pages 1-3
├── templates/
│   └── index.html         # Single page app (Vue.js or vanilla JS + D3/vis)
├── static/
│   ├── constellation.js   # D3 force graph logic
│   └── style.css
├── database/
│   └── papers.db          # SQLite database (created by ingest.py)
└── data/
    ├── pdfs/              # Symlink or copy of her PDF folder
    └── embeddings/        # Cached embeddings (avoid recomputing)
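The embeddings cache in data/embeddings/ could be keyed by a hash of each PDF's contents, so that renamed files are not recomputed and edited files are. A minimal sketch under that assumption; the JSON-per-file layout and the function names are hypothetical choices, not requirements of this document.

```python
import hashlib
import json
from pathlib import Path

def file_digest(path):
    """SHA-256 of the file contents, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()

def cached_embedding(pdf_path, cache_dir, compute):
    """Return the cached vector for a PDF, computing and storing it on a miss.

    `compute` is any callable taking the PDF path and returning a sequence
    of floats (for example a wrapper around the embedding model).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{file_digest(pdf_path)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    vector = [float(x) for x in compute(pdf_path)]
    cache_file.write_text(json.dumps(vector), encoding="utf-8")
    return vector
```

Re-running pipeline.sh after adding a few PDFs then only embeds the new files.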
What the agent must NOT do:
- Do NOT require her to install CUDA, GPU drivers, or any non-trivial system dependencies
- Do NOT assume she has an API key for a paid LLM service. Make the LLM extraction step configurable: local models (default) or API-based (optional)
- Do NOT build a complex build system. Python venv + pip install -r requirements.txt + run pipeline.sh
- Do NOT generate fake/placeholder data. If the pipeline errors on real PDFs, error loudly with a clear message so she knows what’s wrong
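The "local by default, API optional, fail loudly" rules above could be enforced in one small configuration function. In this sketch the environment variable `CONSTELLATION_LLM_BACKEND` and the function name are hypothetical; `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` are the variable names those providers' SDKs conventionally read.

```python
import os

# Which environment variable holds the key for each optional API backend.
API_KEY_VARS = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}

def resolve_llm_backend(env=None):
    """Pick the LLM backend: local unless an API backend is requested.

    Raises with a clear message instead of silently falling back, so a
    misconfiguration is visible immediately.
    """
    env = os.environ if env is None else env
    backend = env.get("CONSTELLATION_LLM_BACKEND", "local").strip().lower()
    if backend == "local":
        return {"backend": "local", "api_key": None}
    if backend not in API_KEY_VARS:
        raise ValueError(
            f"Unknown LLM backend {backend!r}; "
            f"expected 'local', 'openai', or 'anthropic'.")
    key_var = API_KEY_VARS[backend]
    api_key = env.get(key_var)
    if not api_key:
        raise RuntimeError(
            f"Backend {backend!r} was requested but {key_var} is not set. "
            f"Set it, or unset CONSTELLATION_LLM_BACKEND to use the local model.")
    return {"backend": backend, "api_key": api_key}
```

Passing `env` explicitly makes the function easy to test without touching the real environment.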
What the agent SHOULD do:
- The entire system starts with: pip install -r requirements.txt && bash pipeline.sh && python app.py
- The web UI opens at http://localhost:8080
- Make the cluster labels sensible for a philosophy corpus, not computer science
- Handle non-English titles and authors properly (UTF-8 everywhere)
- Write a minimal README.md that she can skim in 30 seconds
If Single-Agent Implementation Is Too Slow
If the agent building this estimates it will take more than 2 hours:
- Focus on ingest.py + pipeline.sh + embed_cluster.py first, since these produce the raw data
- The web UI can be a minimal starter (just the constellation graph); she can ask her agent again to add the Ideas Space later
- Citation parsing is the hardest part and can be simplified: skip bibliography parsing and rely purely on embedding similarity for connections. That gives roughly 80% of the value with 20% of the work
User Instructions (for her to give to her agent)
Save this entire document as DESIGN_FOR_AGENT.md in the project folder, then give Claude Code, Codex CLI, Cursor, or a similar agent this prompt:
“Read the file DESIGN_FOR_AGENT.md and implement everything described. Start by creating the project structure and the ingestion pipeline. Test as you go on a small set of PDFs (5-10 files) before scaling to 1000. Ask me if you need any API keys or clarifications.”