Architecture
CiteKit is designed as a two-phase, local-first system. It separates the expensive, slow process of finding specific content from the cheap, fast process of retrieving it.
High-Level Design
The system consists of three main components:
- Ingestion Engine: Interfaces with a multimodal mapper (Gemini by default) to generate semantic maps.
- Map Storage: Local JSON-based storage that persists content structure.
- Resolution Engine: A local media processing layer (FFmpeg, PyMuPDF) for extracting content.
1. The Ingestion Engine
The ingestion process enables "blind" agents to understand file content.
- Input: A file path (PDF, MP4, MP3, JPG, PY, MD, TXT).
- Process:
  - The file is temporarily uploaded to the configured mapper API.
  - CiteKit prompts the mapper with a domain-specific schema instruction.
  - The model returns a hierarchical JSON structure representing the file's content (Topics, Chapters, Scenes).
  - The remote file is then deleted from the mapper's servers (Google's, for the default Gemini mapper). Depending on the retention policy, this is usually immediate for temporary files.
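The upload, prompt, parse, and delete steps above can be sketched as follows. This is a minimal illustration, not CiteKit's actual implementation: the `Mapper` interface, its method names, and `FakeMapper` are hypothetical stand-ins for whichever mapper API is configured.

```python
import json
from typing import Protocol


class Mapper(Protocol):
    """Hypothetical mapper interface; not CiteKit's actual API."""

    def upload(self, path: str) -> str: ...
    def generate(self, remote_id: str, instruction: str) -> str: ...
    def delete(self, remote_id: str) -> None: ...


def ingest(path: str, mapper: Mapper, instruction: str) -> dict:
    """Upload the file, ask for a map, parse it, and always clean up."""
    remote_id = mapper.upload(path)
    try:
        raw = mapper.generate(remote_id, instruction)
        return json.loads(raw)  # raises ValueError on malformed model output
    finally:
        mapper.delete(remote_id)  # runs even if generation or parsing fails


class FakeMapper:
    """In-memory stand-in so the sketch runs without network access."""

    def __init__(self) -> None:
        self.deleted: list[str] = []

    def upload(self, path: str) -> str:
        return f"remote://{path}"

    def generate(self, remote_id: str, instruction: str) -> str:
        return json.dumps({"resource_id": "paper", "nodes": [{"id": "intro"}]})

    def delete(self, remote_id: str) -> None:
        self.deleted.append(remote_id)


mapper = FakeMapper()
resource_map = ingest("paper.pdf", mapper, "Return the sections as JSON.")
```

Putting the delete call in a `finally` block is one way to make sure the temporary remote copy is removed even when the model returns output that cannot be parsed.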
2. Resource Maps (.resource_maps)
The core data structure in CiteKit is the Resource Map. It is a normalized JSON format that describes where information lives within a file.
Example Structure:
    {
      "resource_id": "research_paper_v1",
      "source_path": "abs/path/to/paper.pdf",
      "metadata": { "title": "Attention Is All You Need", "type": "pdf" },
      "nodes": [
        {
          "id": "intro",
          "title": "Introduction",
          "location": { "pages": [1] },
          "context": "Discussion of RNNs and CNNs..."
        },
        {
          "id": "architecture",
          "title": "Model Architecture",
          "location": { "pages": [3, 4, 5] },
          "context": "Detailed diagram of Transformer..."
        }
      ]
    }

This map allows an agent to "read" the entire structure of a 50-page document in ~500 tokens.
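To illustrate how compactly a map can be presented to an agent, the sketch below renders one line per node. The `outline` helper is hypothetical (it is not part of CiteKit) and assumes only the fields shown in the example structure.

```python
import json


def outline(resource_map: dict) -> str:
    """Render a header line plus one line per node: id, title, location."""
    header = f"{resource_map['resource_id']} ({resource_map['metadata']['type']})"
    lines = [header]
    for node in resource_map["nodes"]:
        location = json.dumps(node["location"])
        lines.append(f"- {node['id']}: {node['title']} @ {location}")
    return "\n".join(lines)


# Abbreviated copy of the example map (context fields omitted).
paper_map = {
    "resource_id": "research_paper_v1",
    "metadata": {"title": "Attention Is All You Need", "type": "pdf"},
    "nodes": [
        {"id": "intro", "title": "Introduction", "location": {"pages": [1]}},
        {"id": "architecture", "title": "Model Architecture",
         "location": {"pages": [3, 4, 5]}},
    ],
}

print(outline(paper_map))
```

An agent can scan an outline like this to pick a node id before asking for any content to be extracted.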
3. The Resolution Engine
The resolution engine is completely local and offline. It supports two distinct modes:
Physical Resolution
Uses specialized libraries to extract byte-perfect segments into new files, based on the map's coordinates.
| Modality | Backend Engine | Operation |
|---|---|---|
| PDF | PyMuPDF / pdf-lib | Extracts page ranges into a new single-file PDF. |
| Video | ffmpeg | Performs stream copy (-c copy) for instant cutting without re-encoding. |
| Audio | ffmpeg | Trims audio streams to specified timestamps. |
| Image | Pillow / sharp | Crops images to specified bounding boxes. |
| Text | Native (Python/JS) | Slices file content by line range. |
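Two rows of the table can be sketched with the standard library alone: text slicing and the argument list for a stream-copy cut. Both helpers are hypothetical illustrations rather than CiteKit's code. The line range is assumed to be 1-indexed and inclusive, and the ffmpeg arguments are only built here, not executed.

```python
def slice_lines(text: str, start: int, end: int) -> str:
    """Return lines start..end (assumed 1-indexed, inclusive)."""
    lines = text.splitlines(keepends=True)  # keep original line endings
    return "".join(lines[start - 1:end])


def ffmpeg_cut_args(src: str, dst: str, start: float, end: float) -> list[str]:
    """Build an ffmpeg argument list that cuts [start, end] with stream copy."""
    if end <= start:
        raise ValueError("end must be greater than start")
    return [
        "ffmpeg",
        "-ss", str(start),       # seek to the start timestamp
        "-i", src,
        "-t", str(end - start),  # duration of the segment to keep
        "-c", "copy",            # copy streams without re-encoding
        dst,
    ]


snippet = slice_lines("a\nb\nc\nd\n", 2, 3)
args = ffmpeg_cut_args("video.mp4", "demo.mp4", 180.5, 210.0)
```

The argument list could be passed to `subprocess.run(args, check=True)` on a machine with ffmpeg installed. With stream copy, ffmpeg can only cut video on keyframes, so the actual start of the output may differ slightly from the requested timestamp.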
Virtual Resolution (Zero-Binary)
Skips the extraction entirely and returns the conceptual coordinates (timestamps, page numbers, or bounding boxes) along with a virtual address. This mode has zero external binary dependencies (no FFmpeg needed) and is ideal for serverless environments.
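A virtual resolution can be approximated as a pure lookup in the map, which is why it needs no binaries. The sketch below is illustrative only: the `resolve_virtual` name, the `start`/`end` location keys, and the `virtual://` address scheme are assumptions, since the actual address format is not specified here.

```python
def resolve_virtual(resource_map: dict, node_id: str) -> dict:
    """Return a node's coordinates and an address without touching the file."""
    for node in resource_map["nodes"]:
        if node["id"] == node_id:
            return {
                "source_path": resource_map["source_path"],
                "location": node["location"],
                # Hypothetical address scheme, for illustration only.
                "address": f"virtual://{resource_map['resource_id']}/{node_id}",
            }
    raise KeyError(f"no node {node_id!r} in {resource_map['resource_id']!r}")


# Assumed shape for a video map; the location keys are illustrative.
video_map = {
    "resource_id": "video",
    "source_path": "video.mp4",
    "nodes": [
        {"id": "demo", "title": "Demo", "location": {"start": 180.5, "end": 210.0}},
    ],
}

result = resolve_virtual(video_map, "demo")
```

Because the function only reads the map, it never opens the media file and can run in environments where FFmpeg is unavailable.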
Data Flow
- Developer calls client.ingest("video.mp4").
- CiteKit generates the video.json map.
- Agent reads video.json and decides it needs the "Demo" section.
- Agent calls client.resolve("video", "demo", { virtual: true }).
- CiteKit identifies timestamps: 180.5 to 210.0.
- Agent receives the metadata directly, using it to point an LLM (GenAI File API) at the original file's specific segment.
The same flow applies to Images (Metadata -> Crop), PDFs (TOC -> Page Extraction), and Code (Structure -> Line Slicing).
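The video flow can be walked through end to end with an in-memory stub. `StubClient` is hypothetical: it mirrors the `ingest` and `resolve` calls named in the steps, uses a Python keyword argument `virtual=True` in place of the options object, and returns a canned map instead of calling a mapper. The `start`/`end` location keys are also assumptions.

```python
class StubClient:
    """Hypothetical in-memory client mirroring the calls in the data flow."""

    def __init__(self, canned_maps: dict[str, dict]) -> None:
        self._canned = canned_maps      # stands in for the remote mapper
        self.maps: dict[str, dict] = {}

    def ingest(self, path: str) -> dict:
        resource_id = path.rsplit(".", 1)[0]  # "video.mp4" -> "video"
        self.maps[resource_id] = self._canned[path]
        return self.maps[resource_id]

    def resolve(self, resource_id: str, node_id: str, virtual: bool = False) -> dict:
        if not virtual:
            raise NotImplementedError("physical resolution is out of scope here")
        nodes = self.maps[resource_id]["nodes"]
        # next() raises StopIteration if the node id is unknown.
        node = next(n for n in nodes if n["id"] == node_id)
        return {"resource_id": resource_id, "location": node["location"]}


client = StubClient({
    "video.mp4": {
        "resource_id": "video",
        "nodes": [
            {"id": "intro", "title": "Intro", "location": {"start": 0.0, "end": 42.0}},
            {"id": "demo", "title": "Demo", "location": {"start": 180.5, "end": 210.0}},
        ],
    }
})

video_map = client.ingest("video.mp4")                      # map is generated
wanted = next(n["id"] for n in video_map["nodes"] if n["title"] == "Demo")
segment = client.resolve("video", wanted, virtual=True)     # coordinates only
```

The agent only ever handles the map and the returned coordinates; the media file itself is not read at any point in this flow.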