
Building a Knowledge Constellation for Philosophy Research

A design document written for an AI coding agent (Claude Code, Codex, Cursor, etc.). The user’s friend is a philosophy PhD student with ~1000 PDF papers. She needs a knowledge base that surfaces connections between papers and captures her own ideas, without her having to read everything first. This document tells you, the agent, exactly what to build. No design decisions needed. Execute.


Overview

Build a local-first knowledge constellation system with three layers:

  1. Batch processing: ingest ~1000 PDFs, extract structure, generate connections
  2. A visual constellation: an interactive web UI showing paper clusters and links
  3. A thinking space: where she can write ideas and link them to papers

No API keys are required. An OpenAI or Anthropic key is optional and used only for LLM summarization; everything else runs locally.


Layer 1: Batch PDF Processing

Step 1.1: Project setup

Create a directory knowledge-constellation/ with a Python virtual environment. Dependencies: pypdf2, sentence-transformers, scikit-learn, networkx, flask, discern (a small local LLM or API wrapper).

Step 1.2: Ingestion

The user has a folder of PDFs. She runs:

python ingest.py --pdf-dir /path/to/papers

For each PDF, the script:

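  1. Extracts text using PyPDF2 (or pypdfium2 as a fallback when PyPDF2 extracts poorly; scanned PDFs with no text layer need a separate OCR step, which neither library performs)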
  2. Chunks the text into segments of ~500 words with overlap
  3. For each chunk, asks an LLM to extract:
    • Core claim / thesis statement (1-2 sentences)
    • Key concepts (list of 3-10)
    • References to other works mentioned in-text
    • School of thought / tradition (if identifiable)
  4. Extracts the bibliography section and parses references (regex-based; not perfect, but good enough for constellation purposes)
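
The chunking in item 2 can be sketched as a sliding window over words. This is a minimal illustration, not a prescribed implementation: the function name chunk_words and the 50-word overlap are assumptions, since the text above only says "with overlap".

```python
def chunk_words(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of about `size` words.

    Each chunk repeats the last `overlap` words of the previous chunk.
    The default overlap of 50 is an assumed value, not one from the design.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks
```

Splitting on whitespace keeps the sketch dependency-free; a sentence-aware splitter would avoid cutting a claim in half, at the cost of an extra dependency.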

Store everything in a SQLite database with this schema:

CREATE TABLE papers (
    id TEXT PRIMARY KEY,
    title TEXT,
    authors TEXT,
    year INTEGER,
    abstract TEXT,
    file_path TEXT
);

CREATE TABLE chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    paper_id TEXT REFERENCES papers(id),
    text TEXT,
    core_claim TEXT,
    key_concepts TEXT,  -- JSON array
    tradition TEXT
);

CREATE TABLE references_table (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_paper_id TEXT REFERENCES papers(id),
    target_title TEXT,
    target_authors TEXT,
    target_year INTEGER
);

CREATE TABLE ideas (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT,
    content TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE idea_paper_links (
    idea_id INTEGER REFERENCES ideas(id),
    paper_id TEXT REFERENCES papers(id),
    relationship TEXT  -- 'supports', 'challenges', 'inspired_by', 'context_for'
);
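
As an illustration of how the chunks table might be written and read, here is a minimal sketch using the standard sqlite3 module. The helper names save_chunk and concepts_for_paper are hypothetical; the column names come from the schema above.

```python
import json
import sqlite3


def save_chunk(conn, paper_id, text, core_claim, key_concepts, tradition=None):
    """Insert one extracted chunk and return its row id.

    key_concepts is a Python list, stored as a JSON array string.
    ensure_ascii=False keeps non-English concepts readable in the database.
    """
    cur = conn.execute(
        "INSERT INTO chunks (paper_id, text, core_claim, key_concepts, tradition) "
        "VALUES (?, ?, ?, ?, ?)",
        (paper_id, text, core_claim,
         json.dumps(key_concepts, ensure_ascii=False), tradition),
    )
    conn.commit()
    return cur.lastrowid


def concepts_for_paper(conn, paper_id):
    """Return the sorted, de-duplicated key concepts across a paper's chunks."""
    rows = conn.execute(
        "SELECT key_concepts FROM chunks WHERE paper_id = ?", (paper_id,)
    )
    concepts = set()
    for (raw,) in rows:
        concepts.update(json.loads(raw or "[]"))
    return sorted(concepts)
```

Note that SQLite does not enforce the REFERENCES clauses unless each connection runs PRAGMA foreign_keys = ON.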

Step 1.3: Embedding and clustering

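For each paper, compute an embedding vector using sentence-transformers/all-MiniLM-L6-v2 (runs locally, no GPU needed, 384-dim vectors, fast). The model truncates long inputs (256 word pieces by default), so embed a short representation of each paper, such as its abstract or extracted core claims, or average its per-chunk embeddings, rather than passing in the full text.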

Then:

  1. Cluster all papers using HDBSCAN (handles noise/outliers well, no need to pre-specify cluster count; available as sklearn.cluster.HDBSCAN in scikit-learn 1.3 and later)
  2. Extract cluster topics: for each cluster, ask the LLM to examine the top 10 papers’ claims and generate a 1-sentence cluster label
  3. Find bridges: papers whose embedding lies close to two cluster centroids at once (cosine distance to both centroids < threshold) → these are cross-domain connection points
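
The bridge test in item 3 can be sketched in plain Python, assuming embeddings and cluster labels are already available as dictionaries. The function names and the 0.3 threshold are illustrative assumptions; the threshold in particular needs tuning on the real corpus.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def find_bridges(embeddings, labels, threshold=0.3):
    """Return {paper_id: [cluster labels]} for papers near two or more centroids.

    embeddings: {paper_id: vector}
    labels: {paper_id: cluster label}, where -1 marks HDBSCAN noise points
    threshold: assumed cutoff on cosine distance, to be tuned
    """
    members = {}
    for pid, label in labels.items():
        if label != -1:  # noise points do not contribute to any centroid
            members.setdefault(label, []).append(embeddings[pid])
    centroids = {label: centroid(vecs) for label, vecs in members.items()}

    bridges = {}
    for pid, vec in embeddings.items():
        near = sorted(
            label for label, c in centroids.items()
            if cosine_distance(vec, c) < threshold
        )
        if len(near) >= 2:
            bridges[pid] = near
    return bridges
```

Noise points are still checked against the centroids, since an outlier sitting between two clusters is exactly the kind of paper this step is meant to surface.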

Step 1.4: Citation graph

Using the references_table, build a directed graph:

  • Inbound citation count = how many papers in this collection cite this paper → foundation paper signal
  • Co-citation clusters = papers that are frequently cited together → they form a sub-conversation even if they aren’t directly linked

Merge citation graph + embedding clusters into a single graph database (NetworkX).
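
The two citation signals above can be sketched with the standard library alone. The sketch assumes each parsed reference has already been matched to a paper id in the collection (references_table stores titles, so that matching is a prerequisite); citation_signals is a hypothetical name, and the same counts could be stored as node and edge attributes in NetworkX.

```python
from collections import Counter
from itertools import combinations


def citation_signals(citations):
    """Compute inbound citation counts and co-citation counts.

    citations: iterable of (source_paper_id, target_paper_id) pairs,
    both ids belonging to papers in the collection.
    Returns (inbound, cocited), where cocited maps a sorted pair of
    paper ids to the number of papers that cite both of them.
    """
    edges = set(citations)  # a paper citing the same work twice counts once
    inbound = Counter(target for _, target in edges)

    cited_by_source = {}
    for source, target in edges:
        cited_by_source.setdefault(source, set()).add(target)

    cocited = Counter()
    for targets in cited_by_source.values():
        for pair in combinations(sorted(targets), 2):
            cocited[pair] += 1
    return inbound, cocited
```

Pairs with a high co-citation count are candidates for an extra edge in the merged graph, even when neither paper cites the other.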


Layer 2: Visual Constellation (Web UI)

Build a single-page web app with Flask backend + D3.js or vis-network frontend.

Page 1: The Constellation View

An interactive force-directed graph where:

  • Nodes = papers (size = influence score from citation count + centrality)
  • Color = cluster membership (from HDBSCAN)
  • Edges = citation links or high embedding similarity (>0.85)
  • Hover shows paper title + cluster label
  • Click opens the Paper Detail panel in Page 2

The graph must support:

  • Zoom and pan
  • Click-to-select with highlight of direct neighbors
  • Search bar (type a concept → highlight all papers containing it)
  • Filter by cluster (toggle clusters on/off)
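
One way the backend could hand the graph to the frontend is a single nodes-and-links payload, a shape both D3 force layouts and vis-network can consume with little adaptation. The sketch below is an assumption about that payload, not a fixed contract: build_graph_payload and its field names are hypothetical, and node size uses inbound citations only, leaving out the centrality term.

```python
def build_graph_payload(papers, citations, similarities, min_similarity=0.85):
    """Build a JSON-serialisable graph description for the constellation view.

    papers: {paper_id: {"title": str, "cluster": int}}
    citations: iterable of (source_id, target_id) pairs
    similarities: {(id_a, id_b): cosine similarity}
    """
    citations = list(citations)  # iterated twice below
    inbound = {}
    for _, target in citations:
        inbound[target] = inbound.get(target, 0) + 1

    nodes = [
        {
            "id": pid,
            "title": info["title"],
            "cluster": info["cluster"],
            "size": 1 + inbound.get(pid, 0),  # simplified influence score
        }
        for pid, info in papers.items()
    ]
    links = [{"source": s, "target": t, "kind": "citation"} for s, t in citations]
    links += [
        {"source": a, "target": b, "kind": "similarity", "weight": sim}
        for (a, b), sim in similarities.items()
        if sim > min_similarity  # keep only strong similarity edges
    ]
    return {"nodes": nodes, "links": links}
```

A Flask route could return this dictionary through flask.jsonify, and the cluster field gives the frontend what it needs for colouring and for the cluster toggle.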

Page 2: Paper Detail Panel

When a paper node is clicked, slide in a right panel showing:

Title, Authors, Year
Core Claims (from extraction)
Key Concepts
Tradition / School
Cited by X papers in this collection
Cites Y papers in this collection

[AI-generated] "This paper is at the intersection of: [cluster names]"
[AI-generated] "If you like this, also look at: [3-5 paper titles from same cluster or citation neighborhood]"

[Her Ideas section]
  - List of her ideas linked to this paper
  - Button: "Add Idea" → opens inline editor

Page 3: Ideas Space

A list/grid view of all her ideas, sorted by recency. Each idea card shows:

  • Title + preview (first 100 chars)
  • Number of linked papers
  • Tags (auto-extracted key concepts, or manually added)

Click opens the full idea editor where she can:

  • Write/edit the idea freely (markdown supported)
  • Link to papers via a search box that autocompletes paper titles
  • Add relationship type (inspired_by / challenges / supports / context_for)
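
The linking and autocomplete behaviour can be sketched against the schema from Step 1.2. Both helper names are hypothetical, and a substring match with LIKE is just one possible autocomplete strategy.

```python
import sqlite3

# The four relationship types listed in the schema comment.
RELATIONSHIPS = {"supports", "challenges", "inspired_by", "context_for"}


def link_idea(conn, idea_id, paper_id, relationship):
    """Link an idea to a paper, rejecting unknown relationship types."""
    if relationship not in RELATIONSHIPS:
        raise ValueError(
            f"unknown relationship {relationship!r}; "
            f"expected one of {sorted(RELATIONSHIPS)}"
        )
    conn.execute(
        "INSERT INTO idea_paper_links (idea_id, paper_id, relationship) "
        "VALUES (?, ?, ?)",
        (idea_id, paper_id, relationship),
    )
    conn.commit()


def search_titles(conn, typed, limit=10):
    """Return (id, title) rows whose title contains the typed text.

    SQLite's LIKE is case-insensitive for ASCII letters only, and % or _
    in the typed text act as wildcards; both are accepted in this sketch.
    """
    rows = conn.execute(
        "SELECT id, title FROM papers WHERE title LIKE ? ORDER BY title LIMIT ?",
        (f"%{typed}%", limit),
    )
    return rows.fetchall()
```

Validating in Python keeps the schema unchanged; a CHECK constraint on the relationship column would enforce the same rule inside the database.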

Layer 3: Trust The Agent To Scaffold

The folder structure the agent should create:

knowledge-constellation/
├── requirements.txt
├── ingest.py              # Step 1.2: PDF ingestion
├── embed_cluster.py       # Step 1.3: embedding + clustering
├── build_graph.py         # Step 1.4: citation graph
├── pipeline.sh            # One script to run all above: ingest → embed → cluster → graph
├── app.py                 # Flask web app serving Pages 1-3
├── templates/
│   └── index.html         # Single page app (Vue.js or vanilla JS + D3/vis)
├── static/
│   ├── constellation.js   # D3 force graph logic
│   └── style.css
├── database/
│   └── papers.db          # SQLite database (created by ingest.py)
└── data/
    ├── pdfs/              # Symlink or copy of her PDF folder
    └── embeddings/        # Cached embeddings (avoid recomputing)

What the agent must NOT do:

  • Do NOT require her to install CUDA, GPU drivers, or any non-trivial system dependencies
  • Do NOT assume she has an API key for a paid LLM service. Make the LLM extraction step configurable: local models (default) or API-based (optional)
  • Do NOT build a complex build system. Python venv + pip install -r requirements.txt + run pipeline.sh
  • Do NOT generate fake/placeholder data. If the pipeline errors on real PDFs, error loudly with a clear message so she knows what’s wrong
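
The "local by default, API optional" rule and the "error loudly" rule can be combined in one small configuration helper. This is a sketch under stated assumptions: the environment variable names CONSTELLATION_LLM_BACKEND and CONSTELLATION_API_KEY are invented for illustration, and how each backend is actually called is left out.

```python
import os


def choose_llm_backend(env=None):
    """Pick the LLM backend from environment-style settings.

    Returns ("local", None) unless the API backend is explicitly requested,
    in which case it returns ("api", key). The variable names are
    hypothetical and exist only for this sketch.
    """
    env = os.environ if env is None else env
    backend = env.get("CONSTELLATION_LLM_BACKEND", "local").lower()

    if backend == "local":
        return ("local", None)
    if backend == "api":
        key = env.get("CONSTELLATION_API_KEY")
        if not key:
            # Fail loudly with a message that says how to fix the problem.
            raise RuntimeError(
                "CONSTELLATION_LLM_BACKEND=api but CONSTELLATION_API_KEY is not set. "
                "Set the key, or unset CONSTELLATION_LLM_BACKEND to use the local model."
            )
        return ("api", key)
    raise RuntimeError(
        f"Unknown CONSTELLATION_LLM_BACKEND {backend!r}; use 'local' or 'api'."
    )
```

Passing a plain dictionary as env makes the helper easy to test without touching the real environment.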

What the agent SHOULD do:

  • The entire system starts with pip install -r requirements.txt && bash pipeline.sh && python app.py
  • The web UI opens at http://localhost:8080
  • Make the cluster labels sensible for a philosophy corpus, not computer science
  • Handle non-English titles and authors properly (UTF-8 everywhere)
  • Write a minimal README.md that she can skim in 30 seconds

If Single-Agent Implementation Is Too Slow

If the agent building this estimates it will take more than 2 hours:

  1. Focus on ingest.py + pipeline.sh + embed_cluster.py first, since these produce the raw data
  2. The web UI can be a minimal starter (just the constellation graph); she can ask her agent again to add the Ideas Space later
  3. Citation parsing is the hardest part and can be simplified: skip bibliography parsing, rely purely on embedding similarity for connections. 80% of the value with 20% of the work

User Instructions (for her to give to her agent)

Save this entire document as DESIGN_FOR_AGENT.md in the project folder, then give Claude Code, Codex CLI, Cursor, or a similar agent this prompt:

“Read the file DESIGN_FOR_AGENT.md and implement everything described. Start by creating the project structure and the ingestion pipeline. Test as you go on a small set of PDFs (5-10 files) before scaling to 1000. Ask me if you need any API keys or clarifications.”