Building a Knowledge Constellation for Philosophy Research

A design document written for an AI coding agent (Claude Code, Codex, Cursor, etc.). The user’s friend is a philosophy PhD student with ~1000 PDF papers. She needs a knowledge base that surfaces connections between papers and captures her own ideas — without reading everything first. This document tells you, the agent, exactly what to build. No design decisions needed. Execute.

Overview

Build a local-first knowledge constellation system with three parts:

Batch processing — ingest ~1000 PDFs, extract structure, generate connections
A visual constellation — an interactive web UI showing paper clusters and links
A thinking space — where she can write ideas and link them to papers

No API keys are required. Everything runs locally by default; an OpenAI or Anthropic key is only needed if she opts into API-based LLM summarization.

Layer 1: Batch PDF Processing

Step 1.1: Project setup

Create a directory knowledge-constellation/ with a Python virtual environment. Dependencies: PyPDF2 (deprecated; its maintained successor is pypdf), sentence-transformers, scikit-learn (version 1.3 or later, which includes HDBSCAN), networkx, flask, discern (a small local LLM or API wrapper).

Step 1.2: Ingestion

The user has a folder of PDFs. She runs:

python ingest.py --pdf-dir /path/to/papers

For each PDF, the script:

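Extracts text using PyPDF2 (or pypdfium2 as a fallback when PyPDF2 fails; neither library performs OCR, so scanned PDFs without a text layer need a separate OCR step)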
Chunks the text into segments of ~500 words with overlap
For each chunk, asks an LLM to extract:
- Core claim / thesis statement (1-2 sentences)
- Key concepts (list of 3-10)
- References to other works mentioned in-text
- School of thought / tradition (if identifiable)
Extracts the bibliography section and parses references (regex-based, not perfect — good enough for constellation purposes)
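
The chunking step above can be sketched as follows. This is a minimal illustration, not a required implementation: the function name chunk_words and the 50-word overlap are assumptions, since the document only says "~500 words with overlap".

```python
def chunk_words(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to `size` words.

    Consecutive chunks share `overlap` words. Both defaults are
    assumptions; tune them to the corpus.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Splitting on whitespace keeps the sketch dependency-free. A sentence-aware splitter would avoid cutting claims in half, at the cost of an extra dependency.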

Store everything in a SQLite database with this schema:

CREATE TABLE papers (
    id TEXT PRIMARY KEY,
    title TEXT,
    authors TEXT,
    year INTEGER,
    abstract TEXT,
    file_path TEXT
);

CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id TEXT REFERENCES papers(id),
    text TEXT,
    core_claim TEXT,
    key_concepts TEXT,  -- JSON array
    tradition TEXT
);

CREATE TABLE references_table (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_paper_id TEXT REFERENCES papers(id),
    target_title TEXT,
    target_authors TEXT,
    target_year INTEGER
);

CREATE TABLE ideas (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT,
    content TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE idea_paper_links (
    idea_id INTEGER REFERENCES ideas(id),
    paper_id TEXT REFERENCES papers(id),
    relationship TEXT  -- 'supports', 'challenges', 'inspired_by', 'context_for'
);
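
A possible way to write extracted chunks into this schema is sketched below. It uses a reduced copy of the papers and chunks tables so that it runs on its own; the helper names open_db, store_chunk, and load_concepts are hypothetical. Note that SQLite only enforces the REFERENCES constraints when foreign keys are switched on per connection.

```python
import json
import sqlite3

# Reduced copy of the papers and chunks tables, enough for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    id TEXT PRIMARY KEY,
    title TEXT
);
CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id TEXT REFERENCES papers(id),
    text TEXT,
    core_claim TEXT,
    key_concepts TEXT,
    tradition TEXT
);
"""

def open_db(path):
    """Open the database and make sure the tables exist (hypothetical helper)."""
    conn = sqlite3.connect(path)
    # SQLite ignores REFERENCES clauses unless this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

def store_chunk(conn, paper_id, text, core_claim, key_concepts, tradition=None):
    """Insert one chunk; key_concepts is a Python list stored as a JSON array."""
    cur = conn.execute(
        "INSERT INTO chunks (paper_id, text, core_claim, key_concepts, tradition) "
        "VALUES (?, ?, ?, ?, ?)",
        # ensure_ascii=False keeps non-English concepts readable in the database.
        (paper_id, text, core_claim,
         json.dumps(key_concepts, ensure_ascii=False), tradition),
    )
    return cur.lastrowid

def load_concepts(conn, chunk_id):
    """Read the JSON array back into a Python list."""
    row = conn.execute(
        "SELECT key_concepts FROM chunks WHERE id = ?", (chunk_id,)
    ).fetchone()
    return json.loads(row[0])
```

Parameterized queries (the ? placeholders) matter here because titles and claims extracted from PDFs can contain quotes and other characters that would break string-built SQL.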

Step 1.3: Embedding and clustering

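For each paper, compute an embedding vector using sentence-transformers/all-MiniLM-L6-v2 (runs locally, no GPU needed, 384-dim vectors, fast). The model truncates input beyond 256 word pieces by default, so embed short texts such as the title, abstract, and per-chunk core claims, then average them into one vector per paper rather than passing in the full text.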

Then:

Cluster all papers using HDBSCAN (handles noise/outliers well, no need to pre-specify cluster count)
Extract cluster topics: for each cluster, ask the LLM to examine the top 10 papers’ claims and generate a 1-sentence cluster label
Find bridges: papers whose embedding lies close to two cluster centroids at once (cosine distance to both centroids below a threshold) → these are cross-domain connection points
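
The bridge-finding rule can be illustrated with plain Python. This sketch assumes cluster labels have already been produced (HDBSCAN marks noise points with -1) and that no embedding is the zero vector. The function names and the 0.3 threshold are hypothetical; a real implementation would more likely use NumPy.

```python
import math
from collections import defaultdict

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def cluster_centroids(embeddings, labels):
    """Mean vector per cluster, skipping noise points (label -1)."""
    groups = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        if label != -1:
            groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }

def find_bridges(embeddings, labels, threshold=0.3):
    """Return (paper_index, [cluster labels]) for papers near 2+ centroids.

    The threshold is an assumed starting value, not taken from the design.
    """
    centroids = cluster_centroids(embeddings, labels)
    bridges = []
    for idx, vec in enumerate(embeddings):
        near = sorted(
            label for label, c in centroids.items()
            if cosine_distance(vec, c) < threshold
        )
        if len(near) >= 2:
            bridges.append((idx, near))
    return bridges
```

With only a few clusters, a loose threshold will flag most papers as bridges, so it is worth inspecting the distance distribution on the real corpus before fixing a value.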

Step 1.4: Citation graph

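Using the references_table, build a directed graph. The table stores each reference as title, authors, and year, so first match references to papers in the collection on those fields; only matched references become edges. From the graph, compute: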

Inbound citation count = how many papers in this collection cite this paper → foundation paper signal
Co-citation clusters = papers that are frequently cited together → they form a sub-conversation even if they aren’t directly linked

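Merge citation graph + embedding clusters into a single in-memory NetworkX graph (NetworkX is a graph library, not a database; persist the result to disk if it needs to survive between runs).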
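
The two citation signals can be sketched without any graph library, which makes the counting logic easy to check. The sketch assumes references have already been matched to paper ids inside the collection, so each citation is a (source_paper_id, target_paper_id) pair; the function names are hypothetical. The same pairs could then be added as edges to a networkx.DiGraph.

```python
from collections import Counter
from itertools import combinations

def inbound_counts(citations):
    """Count how many distinct papers in the collection cite each paper."""
    return Counter(target for _, target in set(citations))

def cocitation_counts(citations):
    """Count how often two papers appear together in one reference list."""
    cited_by_source = {}
    for source, target in set(citations):
        cited_by_source.setdefault(source, set()).add(target)
    pairs = Counter()
    for targets in cited_by_source.values():
        # Sort so that (a, b) and (b, a) land on the same key.
        for a, b in combinations(sorted(targets), 2):
            pairs[(a, b)] += 1
    return pairs
```

Deduplicating with set() guards against the same reference being parsed twice from one bibliography, which regex-based parsing can easily do.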

Layer 2: Visual Constellation (Web UI)

Build a single-page web app with a Flask backend and a D3.js or vis-network frontend. The three "pages" below are views within that one app, not separate HTML pages.

Page 1: The Constellation View

An interactive force-directed graph where:

Nodes = papers (size = influence score from citation count + centrality)
Color = cluster membership (from HDBSCAN)
Edges = citation links or high embedding similarity (cosine similarity > 0.85)
Hover shows paper title + cluster label
Click opens the Paper Detail panel in Page 2
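
One way the backend could shape the graph for the frontend is sketched below. The nodes/links layout with source and target keys follows the convention D3 force layouts commonly use, but the function name, the input dictionaries, and the 5 to 25 size range are assumptions for illustration.

```python
def build_payload(papers, citation_edges, similarities, sim_threshold=0.85):
    """Build a JSON-serializable graph for the constellation view.

    papers:         {paper_id: {"title": str, "cluster": int, "influence": float}}
    citation_edges: iterable of (source_id, target_id)
    similarities:   {(id_a, id_b): cosine similarity}
    """
    # Avoid dividing by zero when no paper has any influence yet.
    max_influence = max((p["influence"] for p in papers.values()), default=0) or 1
    nodes = [
        {
            "id": pid,
            "title": p["title"],
            "cluster": p["cluster"],
            # Assumed scale: radius between 5 and 25.
            "size": 5 + 20 * p["influence"] / max_influence,
        }
        for pid, p in papers.items()
    ]
    links = [
        {"source": s, "target": t, "kind": "citation"}
        for s, t in citation_edges
    ]
    links += [
        {"source": a, "target": b, "kind": "similarity", "weight": sim}
        for (a, b), sim in similarities.items()
        if sim > sim_threshold
    ]
    return {"nodes": nodes, "links": links}
```

Tagging each link with a kind lets the frontend style citation edges and similarity edges differently without a second request.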

The graph must support:

Zoom and pan
Click-to-select with highlight of direct neighbors
Search bar (type a concept → highlight all papers containing it)
Filter by cluster (toggle clusters on/off)
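
The search bar needs a backend lookup from a concept to paper ids. A minimal sketch against the chunks table follows; the function name is hypothetical, and a plain substring match is an assumption (the design does not say whether search should be exact, fuzzy, or semantic).

```python
import sqlite3

def papers_matching_concept(conn, query):
    """Return ids of papers with a chunk whose concepts or text contain the query."""
    term = query.strip().lower()
    if not term:
        return []
    # Escape LIKE wildcards so the user's text is matched literally.
    for ch in ("\\", "%", "_"):
        term = term.replace(ch, "\\" + ch)
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT DISTINCT paper_id FROM chunks "
        "WHERE lower(key_concepts) LIKE ? ESCAPE '\\' "
        "OR lower(text) LIKE ? ESCAPE '\\' "
        "ORDER BY paper_id",
        (pattern, pattern),
    ).fetchall()
    return [row[0] for row in rows]
```

SQLite's built-in lower() only folds ASCII letters, so a query for a term with accented capitals may miss matches; normalizing text in Python at ingestion time is one way around that.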

Page 2: Paper Detail Panel

When a paper node is clicked, slide in a right panel showing:

Title, Authors, Year
Core Claims (from extraction)
Key Concepts
Tradition / School
Cited by X papers in this collection
Cites Y papers in this collection

[AI-generated] "This paper is at the intersection of: [cluster names]"
[AI-generated] "If you like this, also look at: [3-5 paper titles from same cluster or citation neighborhood]"

[Her Ideas section]
  - List of her ideas linked to this paper
  - Button: "Add Idea" → opens inline editor

Page 3: Ideas Space

A list/grid view of all her ideas, sorted by recency. Each idea card shows:

Title + preview (first 100 chars)
Number of linked papers
Tags (auto-extracted key concepts, or manually added)

Click opens the full idea editor where she can:

Write/edit the idea freely (markdown supported)
Link to papers via a search box that autocompletes paper titles
Add relationship type (inspired_by / challenges / supports / context_for)
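
Linking an idea to a paper could be handled by a small helper like the one below. It assumes the ideas and idea_paper_links tables from the schema already exist; the function name is hypothetical. The relationship types are the four listed in the schema comment.

```python
import sqlite3

# The four relationship types named in the schema comment.
RELATIONSHIPS = {"supports", "challenges", "inspired_by", "context_for"}

def link_idea_to_paper(conn, idea_id, paper_id, relationship):
    """Insert one idea-to-paper link, rejecting unknown relationship types."""
    if relationship not in RELATIONSHIPS:
        raise ValueError(
            f"unknown relationship {relationship!r}; "
            f"expected one of {sorted(RELATIONSHIPS)}"
        )
    conn.execute(
        "INSERT INTO idea_paper_links (idea_id, paper_id, relationship) "
        "VALUES (?, ?, ?)",
        (idea_id, paper_id, relationship),
    )
    conn.commit()

def papers_for_idea(conn, idea_id):
    """Return (paper_id, relationship) pairs linked to one idea."""
    return conn.execute(
        "SELECT paper_id, relationship FROM idea_paper_links "
        "WHERE idea_id = ? ORDER BY paper_id",
        (idea_id,),
    ).fetchall()
```

Validating in Python keeps the schema unchanged; a CHECK constraint on the relationship column would be an alternative that enforces the same rule at the database level.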

Layer 3: Trust The Agent To Scaffold

The folder structure the agent should create:

knowledge-constellation/
├── requirements.txt
├── ingest.py              # Step 1.2: PDF ingestion
├── embed_cluster.py       # Step 1.3: embedding + clustering
├── build_graph.py         # Step 1.4: citation graph
├── pipeline.sh            # One script to run all above: ingest → embed → cluster → graph
├── app.py                 # Flask web app serving Pages 1-3
├── templates/
│   └── index.html         # Single page app (Vue.js or vanilla JS + D3/vis)
├── static/
│   ├── constellation.js   # D3 force graph logic
│   └── style.css
├── database/
│   └── papers.db          # SQLite database (created by ingest.py)
└── data/
    ├── pdfs/              # Symlink or copy of her PDF folder
    └── embeddings/        # Cached embeddings (avoid recomputing)
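
The embeddings cache in data/embeddings/ could be keyed by file content, so that renaming or moving a PDF does not trigger recomputation. The sketch below is one possible approach: the function names and the one-JSON-file-per-paper layout are assumptions, not part of the design.

```python
import hashlib
import json
from pathlib import Path

def cache_path_for(pdf_path, cache_dir):
    """Derive the cache file name from a SHA-256 hash of the PDF bytes."""
    digest = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    return Path(cache_dir) / f"{digest}.json"

def load_cached_embedding(pdf_path, cache_dir):
    """Return the cached vector as a list of floats, or None if not cached."""
    path = cache_path_for(pdf_path, cache_dir)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))

def save_cached_embedding(pdf_path, cache_dir, vector):
    """Write the vector next to other cached embeddings."""
    path = cache_path_for(pdf_path, cache_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(list(vector)), encoding="utf-8")
```

JSON is used here only to keep the sketch dependency-free; for ~1000 papers with 384-dim vectors a single NumPy array file would be more compact.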

What the agent must NOT do:

Do NOT require her to install CUDA, GPU drivers, or any non-trivial system dependencies
Do NOT assume she has an API key for a paid LLM service — make the LLM extraction step configurable: local models (default) or API-based (optional)
Do NOT build a complex build system. Python venv + pip install -r requirements.txt + run pipeline.sh
Do NOT generate fake/placeholder data. If the pipeline errors on real PDFs, error loudly with a clear message so she knows what’s wrong
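
The "local by default, API optional" rule and the "error loudly" rule can be combined in one configuration check. This is a sketch under assumptions: the environment variable names CONSTELLATION_LLM_BACKEND and CONSTELLATION_LLM_API_KEY are hypothetical, and the function only decides which backend to use without calling any model.

```python
import os

def choose_llm_backend(env=None):
    """Pick the LLM backend from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    backend = env.get("CONSTELLATION_LLM_BACKEND", "local").strip().lower()
    if backend == "local":
        # Default path: no key needed, nothing leaves the machine.
        return {"backend": "local"}
    if backend == "api":
        key = env.get("CONSTELLATION_LLM_API_KEY", "").strip()
        if not key:
            # Fail loudly instead of silently falling back to another backend.
            raise RuntimeError(
                "CONSTELLATION_LLM_BACKEND=api but CONSTELLATION_LLM_API_KEY "
                "is not set. Set the key, or unset the backend variable to "
                "use the local model."
            )
        return {"backend": "api", "api_key": key}
    raise RuntimeError(
        f"Unknown CONSTELLATION_LLM_BACKEND value {backend!r}; use 'local' or 'api'."
    )
```

Passing env as a parameter makes the check easy to test without touching the real environment.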

What the agent SHOULD do:

The entire system starts with pip install -r requirements.txt && bash pipeline.sh && python app.py
The web UI is served at http://localhost:8080 (set the port explicitly in app.py; Flask's default is 5000)
Make the cluster labels sensible for a philosophy corpus, not computer science
Handle non-English titles and authors properly (UTF-8 everywhere)
Write a minimal README.md that she can skim in 30 seconds

If Single-Agent Implementation Is Too Slow

If the agent building this estimates it will take more than 2 hours:

Focus on ingest.py + pipeline.sh + embed_cluster.py first — these produce the raw data
The web UI can be a minimal starter (just the constellation graph) — she can ask her agent again to add the Ideas Space later
Citation parsing is the hardest part and can be simplified: skip bibliography parsing, rely purely on embedding similarity for connections. 80% of the value with 20% of the work

User Instructions (for her to give to her agent)

Save this entire document as DESIGN_FOR_AGENT.md in the project folder, then open Claude Code, Codex CLI, Cursor, or a similar agent in that folder and give it this prompt:

“Read the file DESIGN_FOR_AGENT.md and implement everything described. Start by creating the project structure and the ingestion pipeline. Test as you go on a small set of PDFs (5-10 files) before scaling to 1000. Ask me if you need any API keys or clarifications.”