Python Client API Reference
The CiteKitClient is the main entry point for the CiteKit Python SDK. It provides a single interface for ingesting resources, managing resource maps locally, and resolving specific content into extracted evidence.
Constructor
from citekit import CiteKitClient
from citekit.mapper.gemini import GeminiMapper
# Using default Gemini mapper
client = CiteKitClient(api_key="YOUR_GEMINI_API_KEY")
# Using custom mapper
from my_mapper import OllamaMapper
client = CiteKitClient(mapper=OllamaMapper(model="llama3"))
# Full options
client = CiteKitClient(
mapper=None, # Will auto-initialize GeminiMapper if api_key is provided
base_dir=".",
storage_dir=".resource_maps",
output_dir=".citekit_output",
concurrency_limit=5,
api_key=None, # Reads GEMINI_API_KEY env var if not provided
model="gemini-2.0-flash",
max_retries=3
)Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
mapper | MapperProvider | None | None | Custom mapper instance. If None and api_key is provided, auto-initializes GeminiMapper. |
base_dir | str | "." | Root directory for all CiteKit operations. Useful for serverless environments. |
storage_dir | str | ".resource_maps" | Relative path (from base_dir) where resource maps are persisted as JSON files. |
output_dir | str | ".citekit_output" | Relative path (from base_dir) where resolved clips/extracts are written. |
concurrency_limit | int | 5 | Maximum number of parallel mapper calls (ingestion). Prevents rate-limiting. |
api_key | str | None | None | Gemini API Key (only used if mapper=None). Falls back to GEMINI_API_KEY environment variable. |
model | str | "gemini-2.0-flash" | Gemini model ID to use (only used if mapper=None). |
max_retries | int | 3 | Retry attempts for failed mapper API calls (only used if mapper=None). |
Raises
RuntimeError: If neithermappernorapi_key(orGEMINI_API_KEYenv) is provided when callingingest().
Methods
async ingest(resource_path, resource_type, resource_id=None)
Analyzes a file using the configured mapper and generates a ResourceMap. This is the primary entry point for structuring your resources.
async def ingest(
resource_path: str,
resource_type: str,
resource_id: str | None = None
) -> ResourceMapParameters
| Parameter | Type | Description |
|---|---|---|
resource_path | str | Absolute or relative path to the resource file (e.g., "lecture.mp4", "/data/paper.pdf"). |
resource_type | str | The resource modality: "document", "video", "audio", "image", or "text". |
resource_id | str | None | Optional custom ID for the resource. If not provided, defaults to the filename stem (e.g., "lecture" from "lecture.mp4"). |
Returns
ResourceMap: The generated resource structure containing nodes, metadata, and location data.
Ingestion Workflow
The ingestion process is atomic and idempotent:
- Path Normalization: Converts the path to absolute.
- SHA-256 Hashing: Computes a content hash for deduplication.
- Cache Lookup: Scans
storage_dirfor an existing map with the same hash (skips LLM call if found). - Concurrency Gate: Waits for a semaphore slot (respects
concurrency_limit). - Mapper Generation: Calls the configured
MapperProvider.generate_map(). - JSON Repair: Automatically extracts and validates JSON from the LLM response.
- Persistence: Saves the map as
<resource_id>.jsoninstorage_dir. - Metadata Injection: Adds
source_hashandsource_sizeto the map.
Examples
Basic ingestion:
import asyncio
from citekit import CiteKitClient
async def main():
client = CiteKitClient(api_key="YOUR_GEMINI_API_KEY")
# Ingest a lecture video
resource_map = await client.ingest("lecture_01.mp4", "video")
print(f"Mapped '{resource_map.resource_id}' with {len(resource_map.nodes)} top-level nodes")
asyncio.run(main())Explicit type and custom ID:
async def main():
client = CiteKitClient(api_key="YOUR_GEMINI_API_KEY")
# Force modality and use custom ID
resource_map = await client.ingest(
resource_path="src/main.py",
resource_type="text",
resource_id="codebase_v2"
)
print(resource_map.resource_id) # "codebase_v2"
asyncio.run(main())Using a custom mapper:
from my_mapper import OllamaMapper
async def main():
client = CiteKitClient(mapper=OllamaMapper(model="llama3"))
# Ingest with local LLM (no API calls)
resource_map = await client.ingest("docs/README.md", "text")
print(f"Mapped locally in {len(resource_map.nodes)} sections")
asyncio.run(main())Raises
FileNotFoundError: Ifresource_pathdoes not exist.RuntimeError: If no mapper is configured.ValueError: Ifresource_typeis not recognized.
resolve(resource_id, node_id, virtual=False)
Resolves a node to extracted evidence. Extracts the physical segment from the resource (video clip, PDF pages, image crop, etc.) or returns a metadata-only reference.
def resolve(
resource_id: str,
node_id: str,
virtual: bool = False
) -> ResolvedEvidenceParameters
| Parameter | Type | Description |
|---|---|---|
resource_id | str | The resource ID (from ingest() or list_maps()). |
node_id | str | The node ID to resolve (e.g., "chapter_1.scene_2"). Use get_map(resource_id).list_node_ids() or citekit list <resource_id> to discover available nodes. |
virtual | bool | If True, returns only metadata without extracting physical files (no FFmpeg/PDF library calls). Defaults to False. |
Returns
ResolvedEvidence: An object containing:output_path(str or None): Path to the extracted file (or None ifvirtual=True)address(str): CiteKit URI address (e.g.,video://lecture_01#t=10-20)modality(str): The node's modality (e.g., "video", "document")node(Node): The resolved node objectresource_id(str): The resource ID
Resolution Workflow
- Map Lookup: Loads the resource map from
storage_dir. - Node Search: Finds the node by ID in the hierarchical structure.
- Address Building: Generates a CiteKit URI based on the node's location.
- Virtual Check: If
virtual=True, returns address without extraction. - Modality Dispatch: Selects the appropriate resolver (VideoResolver, DocumentResolver, etc.).
- Physical Extraction: Resolver writes the extracted segment to
output_dir.
Examples
Virtual resolution (metadata only):
client = CiteKitClient(api_key="YOUR_GEMINI_API_KEY")
# Get address without extracting
evidence = client.resolve(
"lecture_01",
"chapter_1.intro",
virtual=True
)
print(evidence.address) # e.g. "video://lecture_01#t=145-285"
print(evidence.output_path) # NonePhysical resolution (extracts file):
client = CiteKitClient(api_key="YOUR_GEMINI_API_KEY")
# Extracts video segment to .citekit_output/
evidence = client.resolve("lecture_01", "chapter_1.intro")
print(evidence.output_path) # e.g. ".citekit_output/lecture_01_chapter_1_intro.mp4"
print(evidence.modality) # "video"Document page extraction:
evidence = client.resolve("textbook", "chapter_2.definition")
# Output: ".citekit_output/textbook_chapter_2_definition.pdf"
# Contains only pages 12-15 (as specified in the node's location)Raises
FileNotFoundError: If the resource map doesn't exist.ValueError: If the node ID is not found in the resource.RuntimeError: If no resolver is available for the node's modality.
get_map(resource_id)
Loads a previously ingested resource map from local storage.
def get_map(resource_id: str) -> ResourceMapParameters
| Parameter | Type | Description |
|---|---|---|
resource_id | str | The resource ID to retrieve. |
Returns
ResourceMap: The deserialized resource structure.
Example
client = CiteKitClient(api_key="YOUR_GEMINI_API_KEY")
# Load an existing map
resource_map = client.get_map("lecture_01")
print(f"Resource: {resource_map.title}")
print(f"Nodes: {len(resource_map.nodes)}")Raises
FileNotFoundError: If no map exists for the givenresource_id.
list_maps()
Returns all resource IDs (ingested maps) currently stored locally.
def list_maps() -> list[str]Returns
list[str]: Array of resource IDs.
Example
client = CiteKitClient(api_key="YOUR_GEMINI_API_KEY")
maps = client.list_maps()
print(f"Available resources: {maps}")
# Output: ['lecture_01', 'textbook', 'codebase_v2']get_structure(resource_id)
Retrieves a resource map as a plain dictionary (JSON-serializable). Commonly used by MCP servers and integrations.
def get_structure(resource_id: str) -> dictParameters
| Parameter | Type | Description |
|---|---|---|
resource_id | str | The resource ID to retrieve. |
Returns
dict: The resource map as a plain dictionary (Pydantic model in JSON mode).
Example
client = CiteKitClient(api_key="YOUR_GEMINI_API_KEY")
structure = client.get_structure("lecture_01")
# Can be serialized directly to JSON
import json
json_str = json.dumps(structure)Raises
FileNotFoundError: If no map exists for the givenresource_id.
save_map(resource_map)
Manually persists a ResourceMap to local storage. Useful for programmatically constructed or modified maps.
def save_map(self, resource_map: ResourceMap) -> NoneParameters
| Parameter | Type | Description |
|---|---|---|
resource_map | ResourceMap | The resource map to persist. |
Example
from citekit.models import ResourceMap, Node, Location
client = CiteKitClient(api_key="YOUR_GEMINI_API_KEY")
# Create a custom map
custom_map = ResourceMap(
resource_id="custom_resource",
type="text",
title="My Custom Map",
source_path="/path/to/file.txt",
nodes=[
Node(
id="intro",
title="Introduction",
type="section",
location=Location(modality="text", lines=(1, 10))
),
Node(
id="body",
title="Main Content",
type="section",
location=Location(modality="text", lines=(11, 50))
)
]
)
client.save_map(custom_map)
print(f"Saved '{custom_map.resource_id}' to storage")Data Models
See Core Data Models for unified definitions across all implementations.
Quick Reference (Python):
from datetime import datetime
from typing import Literal
from pydantic import BaseModel, Field
ResourceType = Literal["document", "video", "audio", "image", "text", "virtual"]
class Location(BaseModel):
modality: ResourceType
pages: list[int] | None = None # Document (list of pages)
lines: tuple[int, int] | None = None # Text
start: float | None = None # Video/Audio start (seconds)
end: float | None = None # Video/Audio end (seconds)
bbox: tuple[float, float, float, float] | None = None # Image [x1, y1, x2, y2]
virtual_address: str | None = None # Virtual reference URI
class Node(BaseModel):
id: str
title: str | None = None
type: str
location: Location
summary: str | None = None
children: list["Node"] = Field(default_factory=list)
class ResourceMap(BaseModel):
resource_id: str
type: ResourceType
title: str
source_path: str
metadata: dict[str, str | int | float | None] | None = None
nodes: list[Node] = Field(default_factory=list)
created_at: datetime
class ResolvedEvidence(BaseModel):
output_path: str | None = None # None if virtual
modality: str
address: str # e.g., "video://lecture_01#t=145.5-285.0"
node: Node
resource_id: strAll field names use snake_case (e.g., resource_id, not resourceId) for consistency with JSON serialization.
Error Handling
Common Errors
Missing mapper or API key:
try:
client = CiteKitClient() # No mapper, no api_key
await client.ingest("file.mp4")
except RuntimeError as e:
print(f"Error: {e}") # "No mapper provider configured..."Resource not found:
try:
resource_map = client.get_map("nonexistent")
except FileNotFoundError as e:
print(f"Error: {e}") # "No map found for resource 'nonexistent'..."Node not found:
try:
evidence = client.resolve("lecture_01", "invalid.node.id")
except ValueError as e:
print(f"Error: {e}") # "Node 'invalid.node.id' not found..."Complete Example: Multi-Modal RAG Pipeline
import asyncio
from citekit import CiteKitClient
async def rag_pipeline():
# Initialize client
client = CiteKitClient(api_key="YOUR_GEMINI_API_KEY")
# 1. Ingest resources
print("Ingesting lecture...")
video_map = await client.ingest("lecture.mp4", "video", "lecture_01")
print("Ingesting textbook...")
doc_map = await client.ingest("textbook.pdf", "document", "textbook")
# 2. List all resources
all_resources = client.list_maps()
print(f"Mapped resources: {all_resources}")
# 3. Resolve specific nodes
for node_id in ["chapter_1.intro", "chapter_1.definition"]:
print(f"\nResolving {node_id}...")
# Virtual resolution (metadata only)
virtual_evidence = client.resolve("lecture_01", node_id, virtual=True)
print(f" Address: {virtual_evidence.address}")
# Physical extraction
physical_evidence = client.resolve("lecture_01", node_id, virtual=False)
print(f" Extracted to: {physical_evidence.output_path}")
asyncio.run(rag_pipeline())