šŸ“˜ DON Research API - Complete User Guide

Comprehensive Tutorial for Texas A&M University Cai Lab

šŸ“– Web Interface Overview

What This Guide Covers

This comprehensive guide teaches you how to use the DON Research API web interface and its features:

šŸŽÆ Quick Links:
• Homepage: https://api.donsystems.com/ - Quick reference docs
• This Guide: https://api.donsystems.com/guide - Detailed tutorials
• Swagger UI: https://api.donsystems.com/docs - Interactive testing

Who Should Use This Guide?

This guide is designed for Texas A&M researchers who use the DON Research API for single-cell genomics analysis, whether through the browser or from Python scripts.

šŸ” Swagger UI Tutorial: Test APIs in Your Browser

What is Swagger UI?

Swagger UI is an interactive tool that lets you test API endpoints directly in your web browser without writing any code. It is ideal for exploring endpoints, validating parameters, and running quick sanity checks before committing to scripts.

Accessing Swagger UI

  1. Open your web browser (Chrome, Firefox, Safari, or Edge)
  2. Navigate to: https://api.donsystems.com/docs
  3. You'll see a list of all available API endpoints organized by category

Step-by-Step: Testing Your First Endpoint

Step 1: Authenticate with Your Token

  1. Look for the green "Authorize" button at the top right of the page
  2. Click it to open the authentication dialog
  3. In the "Value" field, enter: Bearer your-tamu-token-here
    • āš ļø Important: Include the word "Bearer " (with a space) before your token
    • Example: Bearer tamu_cai_lab_2025_HkRs17sgvbjnQax2KzD1iqYcHWbAs5xvZZ2ApKptWuc
  4. Click "Authorize" button
  5. Click "Close" to return to the main page
āœ“ Success Indicator: The padlock icons next to endpoints change from open šŸ”“ to closed šŸ”’

Step 2: Select an Endpoint to Test

Endpoints are organized into sections. For your first test, try the health check:

  1. Scroll down to find GET /health
  2. Click on it to expand the endpoint details

Step 3: Execute the Request

  1. Click the "Try it out" button (top right of the endpoint section)
  2. Click the "Execute" button (blue button)
  3. Wait a moment for the response...

Step 4: View the Results

After execution, you'll see three important sections:

Request URL:

https://api.donsystems.com/health

Response Body: (should look like this)

{
  "status": "healthy",
  "timestamp": "2025-10-27T15:30:45.123Z",
  "services": "operational"
}

Response Headers:

X-Trace-ID: tamu_20251027_abc123def
content-type: application/json
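Once the Swagger walkthrough works, the same health check can be scripted. A minimal sketch (the token value is a placeholder; `auth_headers` and `check_health` are illustrative helpers, not part of an official SDK):

```python
API_URL = "https://api.donsystems.com"

def auth_headers(token: str) -> dict:
    # The API expects the literal word "Bearer ", space included,
    # before the token value.
    return {"Authorization": f"Bearer {token}"}

def check_health(token: str):
    # Performs the same GET /health request as the Swagger UI walkthrough,
    # returning the JSON body and the X-Trace-ID response header.
    import requests
    response = requests.get(f"{API_URL}/health", headers=auth_headers(token))
    response.raise_for_status()
    return response.json(), response.headers.get("X-Trace-ID")
```

Calling `check_health("your-tamu-token-here")` should return the JSON body shown above along with the trace ID.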

Testing File Upload Endpoints

Example: Build Feature Vectors

  1. Find POST /api/v1/genomics/vectors/build
  2. Click to expand → Click "Try it out"
  3. file parameter: Click "Choose File" → Select your .h5ad file
  4. mode parameter: Select cluster from dropdown
  5. Click "Execute"
  6. View response with vector counts and preview data
āš ļø Best Practices:
• Use small datasets (< 5,000 cells) in Swagger UI
• For large files, use Python scripts instead
• Don't upload sensitive/unpublished data via browser
• Copy the curl command shown for use in scripts

Understanding Response Codes

• 200 Success: request completed successfully
• 400 Bad Request: check parameter format (e.g., file extension)
• 401 Unauthorized: check token format (must include "Bearer ")
• 429 Rate Limit Exceeded: wait 1 hour or send fewer requests
• 500 Server Error: contact support with the trace_id
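In scripts, these codes can drive a small error handler. `explain_status` below is an illustrative helper (not part of the API), with messages taken from the actions listed above:

```python
STATUS_ACTIONS = {
    200: "Success: request completed successfully",
    400: "Bad Request: check parameter format (e.g., file extension)",
    401: "Unauthorized: check token format (must include 'Bearer ')",
    429: "Rate limit exceeded: wait 1 hour or send fewer requests",
    500: "Server error: contact support with the trace_id",
}

def explain_status(code: int) -> str:
    # Fall back to a generic message for codes not covered in this guide.
    return STATUS_ACTIONS.get(code, f"Unexpected status {code}: see Swagger UI docs")
```

For example, `explain_status(response.status_code)` gives a human-readable next step after any request.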

šŸ“ Supported Data Formats

Format Overview

The DON Research API supports multiple input formats, not just H5AD files:

• H5AD Files (e.g., pbmc3k.h5ad): direct upload of single-cell data in AnnData format. Supported by all genomics endpoints and the Bio module.
• GEO Accessions (e.g., GSE12345): auto-download from the NCBI GEO database. Supported by /load.
• Direct URLs (e.g., https://example.com/data.h5ad): download from external HTTP/HTTPS sources. Supported by /load.
• Gene Lists (JSON) (e.g., ["CD3E", "CD8A", "CD4"]): encode cell type marker genes as query vectors. Supported by /query/encode.
• Text Queries (e.g., "T cell markers in PBMC"): natural language biological queries. Supported by /query/encode.
šŸ“Œ Important:
• Bio Module endpoints require H5AD files only (for ResoTrace compatibility)
• Genomics endpoints support all formats above
• GEO accessions and URLs are automatically downloaded and converted to H5AD

Format-Specific Usage Examples

1. H5AD Files (Most Common)

import requests

# Upload H5AD file directly
with open("pbmc_3k.h5ad", "rb") as f:
    files = {"file": ("pbmc_3k.h5ad", f, "application/octet-stream")}
    response = requests.post(
        "https://api.donsystems.com/api/v1/genomics/vectors/build",
        headers={"Authorization": f"Bearer {TOKEN}"},
        files=files,
        data={"mode": "cluster"}
    )
print(response.json())

2. GEO Accessions

# Automatically downloads from NCBI GEO
data = {"accession_or_path": "GSE12345"}
response = requests.post(
    "https://api.donsystems.com/api/v1/genomics/load",
    headers={"Authorization": f"Bearer {TOKEN}"},
    data=data
)
h5ad_path = response.json()["h5ad_path"]
print(f"Downloaded to: {h5ad_path}")

3. Direct URLs

# Download from any HTTP/HTTPS source
data = {"accession_or_path": "https://example.com/dataset.h5ad"}
response = requests.post(
    "https://api.donsystems.com/api/v1/genomics/load",
    headers={"Authorization": f"Bearer {TOKEN}"},
    data=data
)
h5ad_path = response.json()["h5ad_path"]

4. Gene Lists (JSON)

import json

# T cell marker genes
gene_list = ["CD3E", "CD8A", "CD4", "IL7R", "CCR7"]
data = {"gene_list_json": json.dumps(gene_list)}

response = requests.post(
    "https://api.donsystems.com/api/v1/genomics/query/encode",
    headers={"Authorization": f"Bearer {TOKEN}"},
    data=data
)
query_vector = response.json()["psi"]  # 128-dimensional vector

5. Text Queries

# Natural language query
data = {
    "text": "T cell markers in PBMC tissue",
    "cell_type": "T cell",
    "tissue": "PBMC"
}

response = requests.post(
    "https://api.donsystems.com/api/v1/genomics/query/encode",
    headers={"Authorization": f"Bearer {TOKEN}"},
    data=data
)
query_vector = response.json()["psi"]

When to Use Each Format

In short: upload H5AD files directly for Bio module workflows, use GEO accessions or direct URLs with /load when the data lives elsewhere, and use gene lists or text queries with /query/encode to build query vectors.

🧬 Bio Module: ResoTrace Integration

Overview

The Bio Module provides advanced single-cell analysis workflows optimized for ResoTrace integration. Key capabilities:

Sync vs Async Execution Modes

Every Bio endpoint supports two execution modes:

• sync=true: small datasets (< 5K cells); immediate response (< 30 s). Best for quick validation, exploratory analysis, and Swagger UI testing.
• sync=false: large datasets (> 10K cells); runs as a background job. Best for production pipelines, batch processing, and automated workflows.
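Since the cutoff between modes is dataset size, the flag can be set programmatically. A minimal sketch (the 5,000-cell threshold mirrors this guide's guidance; it is not enforced by the API):

```python
def choose_sync_mode(n_cells: int, threshold: int = 5000) -> str:
    # The Bio endpoints take sync as a string form field ("true"/"false").
    return "true" if n_cells < threshold else "false"
```

In practice you might set `data["sync"] = choose_sync_mode(adata.n_obs)` before posting to a Bio endpoint.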

Feature 1: Export Artifacts

POST /api/v1/bio/export-artifacts

What it does: exports ResoTrace-compatible artifacts (cluster nodes and per-cell vectors) from an uploaded H5AD file, returning counts and a trace ID.

Key parameters (all shown in the example below): file, cluster_key, latent_key, sync, project_id, user_id.

Example (Synchronous):

with open("pbmc_3k.h5ad", "rb") as f:
    files = {"file": ("pbmc.h5ad", f, "application/octet-stream")}
    data = {
        "cluster_key": "leiden",
        "latent_key": "X_umap",
        "sync": "true",
        "project_id": "cai_lab_pbmc_study",
        "user_id": "researcher_001"
    }
    
    response = requests.post(
        "https://api.donsystems.com/api/v1/bio/export-artifacts",
        headers={"Authorization": f"Bearer {TOKEN}"},
        files=files,
        data=data
    )

result = response.json()
print(f"āœ“ Exported {result['nodes']} clusters")
print(f"āœ“ {result['vectors']} cell vectors")
print(f"āœ“ Trace ID: {result.get('trace_id')}")

Feature 2: Parasite Detector (QC)

POST /api/v1/bio/qc/parasite-detect

What it does: scores overall dataset quality and flags suspect ("parasite") cells, returning per-cell flags and an overall parasite_score (the percentage of cells flagged).

Recommended Actions:

• 0-5% (Excellent): proceed without filtering
• 5-15% (Good): minor filtering recommended
• 15-30% (Moderate): filter flagged cells
• > 30% (Poor): review QC pipeline
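These thresholds are easy to encode when automating QC decisions. `parasite_action` is an illustrative sketch, not an API feature:

```python
def parasite_action(score: float) -> str:
    # score is the parasite_score percentage returned by the endpoint.
    # Boundary values fall into the stricter (higher-quality) band.
    if score <= 5:
        return "proceed without filtering"
    if score <= 15:
        return "minor filtering recommended"
    if score <= 30:
        return "filter flagged cells"
    return "review QC pipeline"
```

For example, `parasite_action(result["parasite_score"])` turns an endpoint response into a recommended next step.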

For complete Bio module documentation, see the homepage or Swagger UI.

šŸ”¬ Complete Workflow Examples

Workflow 1: Cell Type Discovery with T Cells

Goal: Identify T cell clusters in PBMC dataset using marker genes

import requests
import json

API_URL = "https://api.donsystems.com"
TOKEN = "your-tamu-token-here"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Step 1: Build cluster vectors
print("Step 1: Building vectors...")
with open("pbmc_3k.h5ad", "rb") as f:
    files = {"file": ("pbmc_3k.h5ad", f, "application/octet-stream")}
    response = requests.post(
        f"{API_URL}/api/v1/genomics/vectors/build",
        headers=headers,
        files=files,
        data={"mode": "cluster"}
    )

vectors_result = response.json()
jsonl_path = vectors_result["jsonl"]
print(f"āœ“ Built {vectors_result['count']} cluster vectors")

# Step 2: Encode T cell query
print("\nStep 2: Encoding T cell markers...")
t_cell_genes = ["CD3E", "CD8A", "CD4", "IL7R"]
query_data = {"gene_list_json": json.dumps(t_cell_genes)}

response = requests.post(
    f"{API_URL}/api/v1/genomics/query/encode",
    headers=headers,
    data=query_data
)
query_vector = response.json()["psi"]
print("āœ“ Encoded query vector (128 dimensions)")

# Step 3: Search for matching clusters
print("\nStep 3: Searching for T cell-like clusters...")
search_data = {
    "jsonl_path": jsonl_path,
    "psi": json.dumps(query_vector),
    "k": 5
}
response = requests.post(
    f"{API_URL}/api/v1/genomics/vectors/search",
    headers=headers,
    data=search_data
)

results = response.json()["hits"]
print(f"\nāœ“ Top 5 T cell-like clusters:")
for i, hit in enumerate(results, 1):
    cluster_id = hit['meta']['cluster']
    distance = hit['distance']
    cells = hit['meta']['cells']
    print(f"{i}. Cluster {cluster_id}: distance={distance:.4f}, cells={cells}")

Workflow 2: QC Pipeline with Parasite Detection

Goal: Clean dataset by detecting and removing low-quality cells

import requests
import numpy as np
import scanpy as sc

API_URL = "https://api.donsystems.com"
TOKEN = "your-tamu-token-here"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Step 1: Detect parasites
print("Step 1: Detecting QC parasites...")
with open("pbmc_raw.h5ad", "rb") as f:
    files = {"file": ("pbmc_raw.h5ad", f, "application/octet-stream")}
    data = {
        "cluster_key": "leiden",
        "batch_key": "sample",
        "sync": "true"
    }

    response = requests.post(
        f"{API_URL}/api/v1/bio/qc/parasite-detect",
        headers=headers,
        files=files,
        data=data
    )

result = response.json()
flags = result["flags"]
parasite_score = result["parasite_score"]
print(f"āœ“ Parasite score: {parasite_score:.1f}%")

# Step 2: Filter flagged cells
print("\nStep 2: Filtering flagged cells...")
adata = sc.read_h5ad("pbmc_raw.h5ad")
adata = adata[~np.array(flags), :]
adata.write_h5ad("pbmc_cleaned.h5ad")
print(f"āœ“ Saved {adata.n_obs} clean cells")

šŸ”§ Troubleshooting Common Errors

Error 401: Authentication Failed

Error: {"detail": "Invalid or missing token"}

Solutions:
  • āœ… Verify token format includes "Bearer " prefix
  • āœ… Check for extra whitespace in token
  • āœ… Confirm token hasn't expired (valid 1 year)

Error 400: File Upload Failed

Error: {"detail": "Expected .h5ad file"}

Solutions:
  • āœ… Verify file has .h5ad extension
  • āœ… Validate AnnData format: sc.read_h5ad("file.h5ad")
  • āœ… Check file size < 500MB
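All three checks can be run locally before uploading. `h5ad_upload_problems` is a hypothetical helper; the 500MB limit mirrors the bullet above:

```python
MAX_UPLOAD_BYTES = 500 * 1024 * 1024  # 500 MB upload limit

def h5ad_upload_problems(path: str, size_bytes: int) -> list:
    # Returns human-readable problems; an empty list means safe to upload.
    problems = []
    if not path.endswith(".h5ad"):
        problems.append("file must have a .h5ad extension")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the 500MB upload limit")
    return problems
```

Get the size with `os.path.getsize(path)`; for a deeper check, `sc.read_h5ad(path)` confirms the file parses as AnnData, as suggested above.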

Error 429: Rate Limit Exceeded

Error: {"detail": "Rate limit exceeded"}

Solutions:
  • āœ… Wait 1 hour for rate limit reset (1,000 req/hour)
  • āœ… Implement exponential backoff in scripts
  • āœ… Use cluster mode instead of cell mode
  • āœ… Contact support for higher limits if needed
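Exponential backoff is straightforward to wrap around any of the requests calls in this guide. A sketch (the doubling delays are a common convention, not mandated by the API):

```python
import time

def backoff_delays(max_retries: int, base: float = 1.0) -> list:
    # 1s, 2s, 4s, 8s, ... one delay per retry attempt.
    return [base * (2 ** i) for i in range(max_retries)]

def with_backoff(send, max_retries: int = 4, sleep=time.sleep):
    # send() is a zero-argument callable that performs the request and
    # returns a response object with a status_code attribute.
    response = send()
    for delay in backoff_delays(max_retries):
        if response.status_code != 429:
            break
        sleep(delay)
        response = send()
    return response
```

For example: `with_backoff(lambda: requests.post(url, headers=headers, data=data))` retries only on 429 responses.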

Contact Support

When reporting issues, include:

  1. Institution: Texas A&M University (Cai Lab)
  2. API endpoint and method (e.g., POST /vectors/build)
  3. Full error message (JSON response)
  4. Trace ID from response header (X-Trace-ID)
  5. Dataset description (cells, genes, file size)

Email: support@donsystems.com | Response time: < 24 hours

šŸ’¬ Chat API - Natural Language Queries

Overview

The Chat API allows you to ask questions about your genomics data using natural language. Query gene expression, identify cell types, and explore biological insights - the system automatically retrieves relevant context from your stored data and generates informed responses using state-of-the-art language models.

✨ Key Features:
  • 🧬 Natural Language: Ask questions in plain English
  • šŸ” Context-Aware: Automatically retrieves relevant data
  • šŸ” Your API Keys: Use your own Claude or OpenAI key (never stored)
  • šŸ’¬ Threading: Maintain context across multiple queries
  • šŸ“Š Transparent: See which data informed each answer

Quick Start Example

curl -X POST https://don-research-api.onrender.com/api/v1/genomics/chat \
  -H "Authorization: Bearer YOUR_INSTITUTION_TOKEN" \
  -H "X-LLM-API-Key: YOUR_CLAUDE_OR_OPENAI_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the top marker genes in cluster 0?",
    "llm_provider": "claude"
  }'

Swagger UI Testing

  1. Go to Swagger UI
  2. Navigate to POST /api/v1/genomics/chat
  3. Click "Try it out"
  4. Add headers:
    • Authorization: Bearer your_token
    • X-LLM-API-Key: sk-ant-... (Claude) or sk-... (OpenAI)
  5. Enter your query in the request body
  6. Click "Execute" to get your answer!

Example Queries

• Gene Expression: "Which genes are highly expressed in cluster 2?"
• Cell Type ID: "What cell types are present based on marker expression?"
• Comparative: "How does cluster 0 compare to cluster 3?"
• Biological Insight: "What pathways are enriched in upregulated genes?"

API Parameters

• query (string, required): your question (10-2000 characters)
• llm_provider (string, required): "claude" or "gpt4"
• conversation_id (string, optional): thread multiple queries together
• temperature (float, optional): response creativity (0.0-1.0, default 0.7)
• max_context_items (int, optional): maximum data items to retrieve (1-20, default 5)
šŸ” Security Note: Your LLM API key is NEVER logged or stored. It's used only for your request and discarded immediately. Get API keys from:

Response Format

{
  "answer": "Based on the PBMC data, cluster 0 shows high expression of...",
  "context_used": [
    {
      "text": "Cluster 0: CD3D, CD3E, IL7R (T cells)",
      "relevance": 0.89,
      "source": "pbmc3k_analysis"
    }
  ],
  "conversation_id": "auto-generated-uuid",
  "tokens_used": 245,
  "provider": "claude",
  "model": "claude-3-5-sonnet-20241022"
}
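The curl example above translates directly to Python. `chat_payload` enforces the documented 10-2000 character query limit, and `ask_chat` mirrors the curl request (same base URL and headers); both are illustrative sketches, not an official SDK:

```python
def chat_payload(query: str, llm_provider: str = "claude",
                 conversation_id: str = None, temperature: float = 0.7) -> dict:
    # The query must be 10-2000 characters (see API Parameters above).
    if not 10 <= len(query) <= 2000:
        raise ValueError("query must be 10-2000 characters")
    payload = {"query": query, "llm_provider": llm_provider, "temperature": temperature}
    if conversation_id:
        payload["conversation_id"] = conversation_id
    return payload

def ask_chat(token: str, llm_key: str, query: str, **kwargs) -> dict:
    # Same endpoint and headers as the curl quick start above.
    import requests
    response = requests.post(
        "https://don-research-api.onrender.com/api/v1/genomics/chat",
        headers={
            "Authorization": f"Bearer {token}",
            "X-LLM-API-Key": llm_key,
        },
        json=chat_payload(query, **kwargs),
    )
    return response.json()  # answer, context_used, conversation_id, tokens_used, ...
```

Pass the returned conversation_id back via `conversation_id=` on the next call to thread follow-up questions.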

Complete Documentation

For detailed usage, Python SDK, best practices, and troubleshooting, see the Complete Chat API Guide.

šŸ’” Pro Tips:
  • Use conversation_id to maintain context across multiple questions
  • Increase max_context_items for complex queries that need more data
  • Lower temperature (0.1-0.3) for factual, deterministic responses
  • Higher temperature (0.8-0.9) for creative hypothesis generation


🧠 DMP Memory System - AI-Assisted Research Collaboration

⚔ Live Compression Verification Dashboard

Real-time proof of 768D → 1D compression at 100% fidelity. Watch your data compress as you use the system.

• Compression Ratio: 768:1 (768D → 1D)
• Fidelity: 100.0% (perfect reconstruction)
• Storage per Memory: 310 B, regardless of complexity
• Query Time: last query, in ms (populated once DMP operations run)

šŸ’” How it works: Every store/retrieve operation below updates this dashboard in real time. Watch as your 768-dimensional semantic vectors compress to 1D with perfect fidelity. This violates Shannon entropy - but the proof is right here, live, on your data.

What is DMP Memory?

The DON Memory Protocol (DMP) uses substrate-driven compression to store ALL your research data - structured results, unstructured observations, failed experiments, and exploratory "exhaust" - and makes it coherent and queryable using natural language.

šŸš€ Revolutionary Capability for Cellular Research:
For the first time, you can store everything - not just polished results:

āœ… Structured Data: QC metrics, cluster assignments, gene expression values
āœ… Unstructured Notes: "Cluster 3 looks like cytotoxic T cells based on morphology"
āœ… Failed Experiments: "UMAP with cosine distance didn't separate well"
āœ… Parameter Exploration: "Tried resolutions 0.6, 0.8, 1.0 - 0.8 was cleanest"
āœ… Weird Observations: "Batch 2 has higher doublet rate, might be pipetting issue"
āœ… Literature Connections: "This matches the rare CD14+CD16+ intermediate monocytes from Smith et al. 2023"

DMP compresses it ALL with 96Ɨ compression and 99.5% fidelity, then makes it coherent.
The chaos becomes substrate. The substrate becomes queryable knowledge.

The more you store (even messy exploratory data), the smarter the system gets.

Why This Has Never Existed Before

Traditional tools force you to choose: either curate only polished, structured results, or let exploratory notes and failed experiments scatter across notebooks and chat threads where they can never be queried again.

DMP is different: It uses substrate physics to compress chaos into coherent, queryable memory. Dump in your raw thoughts, failed experiments, parameter sweeps, unexpected findings - DMP figures out the relationships and surfaces what matters when you ask.

What This Means for Your Research

Real Example: The Power of Storing Everything

# Month 1: Store everything as you explore
note("PBMC dataset: 10,247 cells initially", importance=5)
note("QC: removed 1,200 cells with >5% mito", importance=6)
note("Weird: Batch 2 has 2Ɨ doublet rate vs Batch 1", importance=7)
note("Tried PCA with 20, 30, 50 components - 30 gave cleanest UMAP", importance=6)
note("Resolution 0.8 gave 12 clusters, 1.0 gave 15 (too fragmented)", importance=7)
note("Cluster 3: CD3E, CD8A high - definitely cytotoxic T cells", importance=9)
note("Cluster 7: unexpected CD14+CD16+ - matches Smith 2023 intermediate monocytes", importance=9)
note("FAILED: harmony batch correction made things worse, kept default", importance=8)

# Month 3: Query across all that chaos
ask("What issues did we have with batch effects?")
# → DMP surfaces the doublet rate observation AND the harmony failure

ask("What parameter choices worked best for clustering?")
# → DMP finds resolution 0.8, PCA 30, and explains why

ask("Have we seen unusual monocyte populations before?")
# → DMP connects Cluster 7 to Smith 2023 reference you stored

ask("Why are our cell counts different from the paper?")
# → DMP retrieves the QC filtering decision with exact numbers

This is profound: Your research process - the messiness, the failures, the "I wonder if..." moments - becomes an asset instead of lost noise. DMP turns exploration into institutional memory.

šŸŽÆ Common Use Cases:
• Store QC metrics and decisions for each dataset
• Track parameter choices and optimization results
• Document unexpected findings or outliers
• Build a searchable lab knowledge base
• Enable LLM assistants to "remember" your previous work

Interactive Memory Tools

Use the forms below to store and retrieve memories directly from your browser. Your API token is used automatically.

šŸ“ Store a Memory

Save analysis results, observations, or research notes to persistent memory.

Example: pbmc3k_analysis_2025, t_cell_markers_study
Be specific - include numbers, gene names, conclusions
1 (Low) 5 10 (Critical)
Higher importance = retrieved first in queries

šŸ” Query Memories

Search your stored memories using natural language queries.

Leave blank to search all your memories

šŸ“Š Conversation Statistics

View memory counts and usage for a specific conversation.

Example Workflow: Store & Query

šŸ’” The "Brain-Dump Everything" Workflow

Stop filtering your thoughts. Store EVERYTHING.

During Analysis (store as you go):
# Raw observations
"Looking at the UMAP - cluster 3 is really tight, cluster 7 is diffuse"

# Parameter decisions
"Tried resolutions 0.6, 0.8, 1.0 - settled on 0.8 because 1.0 split cluster 3"

# Failures that matter
"Harmony batch correction actually made batch separation worse - skipping it"

# Unexpected findings
"Batch 2 has 12% doublet rate vs 5% in batch 1 - check pipetting protocol"

# Cross-references
"Cluster 7 CD14+CD16+ matches the intermediate monocytes from Smith et al. 2023"

# Future questions
"Need to check if cluster 8 appears in other PBMC datasets - looks weird"

# Technical notes
"Used scanpy.pp.highly_variable_genes with n_top_genes=2000 - worked great"
Weeks Later (natural language queries):
"What clustering parameters have worked well?"
"Show me all batch effect issues we've encountered"
"Have we seen CD14+CD16+ populations before?"
"What QC thresholds did we use for mitochondrial genes?"
"Why did we skip batch correction in the PBMC analysis?"

DMP finds the answers instantly because you stored the chaos as you went. The substrate learns your research patterns, technical choices, and domain knowledge - making future you (and your lab) exponentially smarter.

Standard Workflow Example

# Step 1: Store multiple memories during analysis
Stored memory 1: "PBMC3k dataset: 2700 cells, 13714 genes. QC: 95% viable, median 2000 genes/cell"
  → conversation_id: pbmc3k_project
  → trace_id: abc123...

Stored memory 2: "Leiden clustering: 8 clusters identified. Cluster 3 = CD8+ T cells (CD3E, CD8A high)"
  → conversation_id: pbmc3k_project
  → trace_id: def456...

Stored memory 3: "UMAP embedding reveals clear separation of T, B, NK, and myeloid populations"
  → conversation_id: pbmc3k_project
  → trace_id: ghi789...

# Step 2: Query memories weeks later
Query: "What were the T cell findings in PBMC project?"
Results:
  1. "Leiden clustering: 8 clusters identified. Cluster 3 = CD8+ T cells (CD3E, CD8A high)" (relevance: 0.95)
  2. "UMAP embedding reveals clear separation of T, B, NK, and myeloid populations" (relevance: 0.87)

# Step 3: Get conversation stats
Conversation: pbmc3k_project
  → Total memories: 15
  → Created: 2025-01-15T10:30:00Z
  → Last updated: 2025-01-20T14:22:00Z

API Endpoints Reference

• POST /api/v1/dmp/store: store a new memory. Returns trace_id, conversation_id, timestamp.
• POST /api/v1/dmp/retrieve: query memories by text. Returns an array of memories with relevance scores.
• GET /api/v1/dmp/stats/{conversation_id}: get conversation statistics. Returns total memories and timestamps.
• GET /api/v1/dmp/health: check DMP service status. Returns service health and version.
āš ļø Important Notes:
• All DMP endpoints require your TAMU API token in the Authorization header
• Memories are tenant-isolated - you only see your own data
• conversation_id is optional but recommended for organization
• Higher importance values (8-10) are retrieved first in queries
• The forms above handle authentication automatically in your browser

Python Code Examples

For programmatic access, here's how to use DMP in your Python scripts:

import requests

API_URL = "https://api.donsystems.com"
TOKEN = "your_tamu_token_here"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Store a memory
store_response = requests.post(
    f"{API_URL}/api/v1/dmp/store",
    headers=headers,
    json={
        "conversation_id": "pbmc3k_project",
        "memory_text": "Cluster 3 shows high CD3E, CD8A expression (cytotoxic T cells). QC: 95% viable.",
        "importance": 8
    }
)
print(f"Stored: {store_response.json()['trace_id']}")

# Query memories
retrieve_response = requests.post(
    f"{API_URL}/api/v1/dmp/retrieve",
    headers=headers,
    json={
        "conversation_id": "pbmc3k_project",  # optional
        "query": "What did we find about T cells?",
        "limit": 10
    }
)

memories = retrieve_response.json()["memories"]
for memory in memories:
    print(f"[{memory['relevance_score']:.2f}] {memory['text']}")

# Get conversation stats
stats_response = requests.get(
    f"{API_URL}/api/v1/dmp/stats/pbmc3k_project",
    headers=headers
)
stats = stats_response.json()
print(f"Conversation has {stats['total_memories']} memories")

LLM Integration Pattern

To enable AI assistants (like ChatGPT, Claude, or local models) to "remember" your research:

# 1. After each analysis step, store key findings
# (store_memory is a thin wrapper around POST /api/v1/dmp/store)
def after_qc_analysis():
    store_memory(
        conversation_id="current_project",
        memory_text=f"QC complete: {viable_cells} viable, {median_genes} median genes/cell",
        importance=7
    )

def after_clustering():
    store_memory(
        conversation_id="current_project",
        memory_text=f"Found {n_clusters} clusters. Cluster {i} = {cell_type} (markers: {genes})",
        importance=9
    )

# 2. Before asking LLM questions, retrieve relevant context
def ask_llm_about_project(user_question):
    # Get relevant memories
    memories = retrieve_memories(
        conversation_id="current_project",
        query=user_question,
        limit=5
    )
    
    # Build context for LLM
    context = "\n".join([f"- {m['text']}" for m in memories])
    
    # Construct LLM prompt
    prompt = f"""
    You are analyzing single-cell RNA-seq data. Here's what we know so far:
    
    {context}
    
    User question: {user_question}
    
    Provide a detailed answer based on the context above.
    """
    
    # Send to LLM
    return call_llm(prompt)

# Example usage
answer = ask_llm_about_project("What cell types did we identify and what are their markers?")
print(answer)

Best Practices

Use one conversation_id per project, write memory text that is specific (numbers, gene names, conclusions), and reserve importance 8-10 for key findings so they surface first in queries.

šŸ¤– Connecting Claude AI for Natural Language Queries

Want to query your compressed genomics data using natural language? Connect Claude (or other LLMs) to DMP memory for AI-assisted analysis:

šŸ’” The Power of LLM + DMP:
Instead of writing SQL queries or searching files, just ask:
• "What T cell markers did we find in the PBMC study?"
• "Which clusters had high mitochondrial content?"
• "Summarize the QC metrics across all my datasets"

Claude retrieves relevant memories from DMP and answers with your actual research data!

Quick Start: Python Script

Here's a complete script to connect Claude to your DMP memories:

#!/usr/bin/env python3
# Connect Claude AI to your DMP research memories.
# Install: pip install anthropic requests

import requests
from anthropic import Anthropic

# Configuration
API_URL = "https://api.donsystems.com"
TAMU_TOKEN = "your_tamu_token_here"
CLAUDE_KEY = "your_claude_api_key_here"

headers = {"Authorization": f"Bearer {TAMU_TOKEN}"}
claude = Anthropic(api_key=CLAUDE_KEY)

def ask_claude_about_research(question: str, project_id: str = None):
    """Ask Claude about your genomics research.

    Relevant context is retrieved from DMP and passed to Claude automatically.
    """
    
    # Step 1: Retrieve relevant memories from DMP
    retrieve_payload = {"query": question, "limit": 10}
    if project_id:
        retrieve_payload["conversation_id"] = project_id
    
    dmp_response = requests.post(
        f"{API_URL}/api/v1/dmp/retrieve",
        headers=headers,
        json=retrieve_payload
    )
    
    memories = dmp_response.json()["memories"]
    print(f"šŸ“š Retrieved {len(memories)} relevant memories")
    
    # Step 2: Build context from your research findings
    if not memories:
        return "No relevant findings found. Store some memories first!"
    
    context = "\n\n".join([
        f"[Relevance: {m['relevance_score']:.2f}] {m['text']}"
        for m in memories
    ])
    
    # Step 3: Ask Claude with your research context
    prompt = f"""You are a bioinformatics expert analyzing single-cell RNA-seq data.

Here are relevant findings from the researcher's previous analyses:

{context}

Researcher's Question: {question}

Provide a detailed answer using specific details (gene names, metrics, cluster numbers) from the findings above."""
    
    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.content[0].text

# Example queries
if __name__ == "__main__":
    print("\n" + "="*60)
    print("CLAUDE AI + DMP MEMORY DEMO")
    print("="*60 + "\n")
    
    # Query 1: Cell populations
    print("Q: What cell populations did we identify?")
    answer = ask_claude_about_research(
        "What cell populations did we identify in our analyses?",
        project_id="pbmc_study_2025"  # Optional: filter by project
    )
    print(f"\nšŸ¤– Claude: {answer}\n")
    
    # Query 2: QC metrics
    print("Q: What were the QC metrics?")
    answer = ask_claude_about_research(
        "What were the QC metrics for our datasets?",
        project_id="pbmc_study_2025"
    )
    print(f"\nšŸ¤– Claude: {answer}\n")
    
    # Query 3: Marker genes
    print("Q: What marker genes did we find for T cells?")
    answer = ask_claude_about_research(
        "What marker genes did we identify for T cell populations?",
        project_id="pbmc_study_2025"
    )
    print(f"\nšŸ¤– Claude: {answer}\n")

Example Conversation

Here's what a real conversation looks like:

šŸ‘¤ You: "What T cell populations did we identify and what are their marker genes?"


šŸ¤– Claude: "Based on your analysis, you identified cytotoxic T cells in Cluster 3. The key markers are:

  • CD3E (log2FC = 4.2) - T cell receptor component
  • CD8A (log2FC = 3.8) - Cytotoxic T cell marker
  • CD8B - Additional cytotoxic marker
  • GZMK (log2FC = 3.5) - Granzyme K (cytotoxic effector)
  • CCL5 (log2FC = 3.1) - Chemokine for T cell trafficking

These markers strongly indicate a cytotoxic CD8+ T cell population. The high expression of GZMK suggests these are effector memory T cells rather than naive T cells."

šŸ‘¤ You: "Were there any unusual findings in the clustering?"


šŸ¤– Claude: "Yes, you noted an unusual finding in Cluster 7, which showed dual expression of CD14 and CD16. This pattern is characteristic of intermediate monocytes, which are less common than classical (CD14+CD16-) or non-classical (CD14-CD16+) monocytes. This population represents about 5-10% of monocytes in healthy PBMC samples and may indicate an activated or transitional state."

Jupyter Notebook Integration

For interactive analysis, create a helper module:

# notebook_helper.py
import requests
from anthropic import Anthropic

API_URL = "https://api.donsystems.com"
TAMU_TOKEN = "your_token"
CLAUDE_KEY = "your_claude_key"

headers = {"Authorization": f"Bearer {TAMU_TOKEN}"}
claude = Anthropic(api_key=CLAUDE_KEY)

PROJECT_ID = "my_pbmc_study"

def note(finding: str, importance: int = 7):
    # Quick function to store findings
    response = requests.post(
        f"{API_URL}/api/v1/dmp/store",
        headers=headers,
        json={
            "conversation_id": PROJECT_ID,
            "memory_text": finding,
            "importance": importance
        }
    )
    result = response.json()
    print(f"āœ“ Stored: {result['trace_id']}")

def ask(question: str):
    # Quick function to query with Claude
    # Retrieve memories
    dmp_response = requests.post(
        f"{API_URL}/api/v1/dmp/retrieve",
        headers=headers,
        json={
            "query": question,
            "conversation_id": PROJECT_ID,
            "limit": 10
        }
    )
    
    memories = dmp_response.json()["memories"]
    context = "\n\n".join([
        f"[{m['relevance_score']:.2f}] {m['text']}"
        for m in memories
    ])
    
    # Ask Claude
    prompt = f"""Context from previous analyses:
{context}

Question: {question}

Answer with specific details from the context."""
    
    response = claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    
    answer = response.content[0].text
    print(f"\nšŸ¤– {answer}\n")
    return answer

In your Jupyter notebook:

import scanpy as sc
from notebook_helper import note, ask, PROJECT_ID

# Load and analyze data
adata = sc.read_h5ad("pbmc_10k.h5ad")

# Store findings as you go
note("Loaded 10,247 cells, 20,000 genes")

# QC
sc.pp.filter_cells(adata, min_genes=200)
note(f"After QC: {adata.n_obs} cells retained")

# Clustering
sc.pp.neighbors(adata)
sc.tl.leiden(adata)
note(f"Leiden clustering: {len(adata.obs['leiden'].unique())} clusters")

# Find markers
sc.tl.rank_genes_groups(adata, 'leiden')
markers = sc.get.rank_genes_groups_df(adata, group='3')
note(f"Cluster 3 markers: {markers.head(5)['names'].tolist()}")

# Later: Ask Claude about your analysis
ask("What were the main findings from the PBMC clustering?")
ask("Which cluster represents T cells based on markers?")
ask("What QC steps did we perform?")

Other LLM Options

DMP works with any LLM. Here are examples for other models:

OpenAI GPT-4 (pip install openai):

from openai import OpenAI

client = OpenAI(api_key="...")
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "You are a bioinformatics expert."},
        {"role": "user", "content": f"{context}\n\n{question}"}
    ]
)

Local Ollama (ollama pull llama2):

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": f"{context}\n\n{question}",
        "stream": False
    }
)

Google Gemini (pip install google-generativeai):

import google.generativeai as genai

genai.configure(api_key="...")
model = genai.GenerativeModel('gemini-pro')
response = model.generate_content(
    f"{context}\n\n{question}"
)

Complete Workflow Example

Here's how a typical research workflow looks with LLM integration:

# Day 1: Initial analysis
# Compress dataset with DON Stack
compress_result = compress_dataset("pbmc_10k.h5ad")  # 96Ɨ compression

# Store findings as you analyze
note("Dataset: 10,247 cells, 20,000 genes", importance=6)
note("QC: 95% viable, median 2,150 genes/cell", importance=7)
note("Leiden: 12 clusters identified", importance=8)
note("Cluster 3: CD3E, CD8A, CD8B high (cytotoxic T cells)", importance=9)
note("Cluster 7: Unusual CD14+CD16+ (intermediate monocytes)", importance=9)

# Day 30: Come back weeks later, ask Claude
ask("What did we find in the PBMC analysis?")
# → Claude retrieves all your stored findings and summarizes them!

ask("Were there any unusual cell populations?")
# → Claude finds the intermediate monocytes finding

ask("What were the T cell markers?")
# → Claude retrieves Cluster 3 markers with exact gene names

# Day 60: Compare across multiple projects
ask("How do the T cell populations compare across all my PBMC studies?")
# → Claude searches ALL your conversation_ids and synthesizes insights!
šŸ“š Complete Documentation:
For more examples, troubleshooting, and advanced patterns, see:
docs/DMP_LLM_INTEGRATION.md