DON Research API - User Guide

Quantum-Enhanced Genomics for Academic Research | Texas A&M Cai Lab

🚀 Quick Start

Prerequisites

Supported Data Formats

Format              Example                         Use Case                              Endpoints
H5AD Files          pbmc3k.h5ad                     Direct upload of single-cell data     All genomics + Bio module
GEO Accessions      GSE12345                        Auto-download from NCBI GEO           /load
Direct URLs         https://example.com/data.h5ad   Download from external sources        /load
Gene Lists (JSON)   ["CD3E", "CD8A", "CD4"]         Encode cell type markers as queries   /query/encode
Text Queries        "T cell markers"                Natural language searches             /query/encode
📌 Format Notes:
• Genomics endpoints accept all formats above
• Bio module requires H5AD files only (for ResoTrace integration)
• GEO accessions and URLs are automatically converted to H5AD format

Installation

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install required packages
pip install requests scanpy anndata pandas numpy

First API Call

import requests

API_URL = "https://don-research-api.onrender.com"
TOKEN = "your-texas-am-token-here"  # Replace with your token

headers = {"Authorization": f"Bearer {TOKEN}"}

# Health check
response = requests.get(f"{API_URL}/health", headers=headers)
print(response.json())
# Expected: {"status": "ok", "timestamp": "2025-10-24T..."}
✓ System Ready! If the health check returns {"status": "ok"}, you're connected and authenticated.

Format-Specific Examples

Example 1: H5AD File Upload

with open("pbmc3k.h5ad", "rb") as f:
    files = {"file": ("pbmc3k.h5ad", f, "application/octet-stream")}
    response = requests.post(
        f"{API_URL}/api/v1/genomics/vectors/build",
        headers=headers,
        files=files,
        data={"mode": "cluster"}
    )
print(response.json())

Example 2: GEO Accession

# Automatically downloads from NCBI GEO
data = {"accession_or_path": "GSE12345"}
response = requests.post(
    f"{API_URL}/api/v1/genomics/load",
    headers=headers,
    data=data
)
h5ad_path = response.json()["h5ad_path"]

Example 3: Gene List Query

import json

# T cell markers
gene_list = ["CD3E", "CD8A", "CD4", "IL7R"]
data = {"gene_list_json": json.dumps(gene_list)}
response = requests.post(
    f"{API_URL}/api/v1/genomics/query/encode",
    headers=headers,
    data=data
)
query_vector = response.json()["psi"]  # 128-dimensional vector

Example 4: Text Query

# Natural language query
data = {
    "text": "T cell markers in PBMC tissue",
    "cell_type": "T cell",
    "tissue": "PBMC"
}
response = requests.post(
    f"{API_URL}/api/v1/genomics/query/encode",
    headers=headers,
    data=data
)
query_vector = response.json()["psi"]

📊 System Overview

What is the DON Research API?

The DON (Distributed Order Network) Research API provides access to proprietary quantum-enhanced algorithms for genomics data compression and feature extraction. The system combines classical preprocessing with quantum-inspired compression to generate high-quality 128-dimensional feature vectors from single-cell RNA-seq data.

Core Technologies

Validated Performance (PBMC3k Dataset)

• Input data: 2,700 cells × 13,714 genes
• Compression ratio: 28,928× (37M → 1.3K values)
• Processing time: <30s on standard hardware
• Information retention: 85-90% of biological signal preserved

🔐 Authentication

API Token

Rate Limit: 1,000 requests per hour
Token Format: Bearer token (JWT-style)
Security: Never commit tokens to Git or share publicly

Using Your Token

HTTP Headers (curl):

curl -H "Authorization: Bearer YOUR_TOKEN" \
     https://don-research-api.onrender.com/health

Python requests:

headers = {"Authorization": f"Bearer {YOUR_TOKEN}"}
response = requests.get(url, headers=headers)

Environment Variable (recommended):

export DON_API_TOKEN="your-token-here"

# In Python, read the token from the environment
import os
TOKEN = os.environ.get("DON_API_TOKEN")

🔌 API Endpoints

Base URLs

All endpoints in this guide use the base URL https://don-research-api.onrender.com.

1. Health Check

GET /health

Verify API availability and authentication.

2. Build Feature Vectors

POST /api/v1/genomics/vectors/build

Generate 128-dimensional feature vectors from single-cell h5ad files.

Parameters:

Example Request:

import requests

with open("data/pbmc3k.h5ad", "rb") as f:
    files = {"file": ("pbmc3k.h5ad", f, "application/octet-stream")}
    data = {"mode": "cluster"}
    response = requests.post(
        "https://don-research-api.onrender.com/api/v1/genomics/vectors/build",
        headers={"Authorization": f"Bearer {TOKEN}"},
        files=files,
        data=data
    )

result = response.json()
print(f"Built {result['count']} vectors")
print(f"Saved to: {result['jsonl']}")

Response:

{
  "ok": true,
  "mode": "cluster",
  "jsonl": "/path/to/pbmc3k.cluster.jsonl",
  "count": 10,
  "preview": [
    {
      "vector_id": "pbmc3k.h5ad:cluster:0",
      "psi": [0.929, 0.040, ...],  // 128 dimensions
      "space": "X_pca",
      "metric": "cosine",
      "type": "cluster",
      "meta": {
        "file": "pbmc3k.h5ad",
        "cluster": "0",
        "cells": 560,
        "cell_type": "NA",
        "tissue": "NA"
      }
    }
  ]
}
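
The jsonl value is used later as the jsonl_path parameter for /vectors/search; for quick local inspection you can iterate over the preview entries returned in the same response:

# Inspect the preview entries returned by /vectors/build
for vec in result["preview"]:
    meta = vec["meta"]
    print(f"{vec['vector_id']}: cluster {meta['cluster']}, "
          f"{meta['cells']} cells, {len(vec['psi'])} dimensions")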

3. Encode Query Vector

POST /api/v1/genomics/query/encode

Convert biological queries (gene lists, cell types, tissues) into 128-dimensional vectors for searching.

Parameters:

Example:

import json

# T cell marker query
t_cell_genes = ["CD3E", "CD8A", "CD4"]
data = {"gene_list_json": json.dumps(t_cell_genes)}

response = requests.post(
    "https://don-research-api.onrender.com/api/v1/genomics/query/encode",
    headers={"Authorization": f"Bearer {TOKEN}"},
    data=data
)

query_vector = response.json()["psi"]  # 128-dimensional vector

4. Search Vectors

POST /api/v1/genomics/vectors/search

Find similar cell clusters or cells using cosine similarity search.

Parameters:

Distance Interpretation:

Distance     Interpretation
0.0 - 0.2    Very similar (same cell type)
0.2 - 0.5    Similar (related cell types)
0.5 - 0.8    Moderately similar (different lineages)
0.8+         Dissimilar (unrelated cell types)
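
Example:

The sketch below reuses jsonl_path from /vectors/build and query_vector from /query/encode; k sets the number of hits returned. The same pattern appears in the full workflow example later in this guide.

import json

search_data = {
    "jsonl_path": jsonl_path,         # returned by /vectors/build
    "psi": json.dumps(query_vector),  # 128-dim vector from /query/encode
    "k": 5
}
response = requests.post(
    "https://don-research-api.onrender.com/api/v1/genomics/vectors/search",
    headers={"Authorization": f"Bearer {TOKEN}"},
    data=search_data
)
for hit in response.json()["hits"]:
    print(hit["meta"]["cluster"], hit["distance"])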

5. Generate Entropy Map

POST /api/v1/genomics/entropy-map

Visualize cell-level entropy (gene expression diversity) on UMAP embeddings.

Parameters:

Entropy Interpretation:
• Higher entropy: More diverse/complex expression patterns
• Lower entropy: More specialized/differentiated cell states
• Collapse metric: Quantum-inspired cell state stability measure
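
Example:

A minimal request sketch, assuming the same multipart H5AD upload pattern as the other genomics endpoints; no parameters beyond the file are shown here, and any additional options are not covered in this guide.

with open("pbmc3k.h5ad", "rb") as f:
    files = {"file": ("pbmc3k.h5ad", f, "application/octet-stream")}
    response = requests.post(
        "https://don-research-api.onrender.com/api/v1/genomics/entropy-map",
        headers={"Authorization": f"Bearer {TOKEN}"},
        files=files
    )
print(response.json())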

🔬 Workflow Examples

Example 1: Basic Cell Type Discovery

import requests
import json

API_URL = "https://don-research-api.onrender.com"
TOKEN = "your-token-here"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Step 1: Build vectors
with open("my_dataset.h5ad", "rb") as f:
    files = {"file": ("my_dataset.h5ad", f, "application/octet-stream")}
    response = requests.post(
        f"{API_URL}/api/v1/genomics/vectors/build",
        headers=headers,
        files=files,
        data={"mode": "cluster"}
    )

vectors_result = response.json()
jsonl_path = vectors_result["jsonl"]
print(f"āœ“ Built {vectors_result['count']} cluster vectors")

# Step 2: Encode T cell query
t_cell_genes = ["CD3E", "CD8A", "CD4", "IL7R"]
query_data = {"gene_list_json": json.dumps(t_cell_genes)}
response = requests.post(
    f"{API_URL}/api/v1/genomics/query/encode",
    headers=headers,
    data=query_data
)
query_vector = response.json()["psi"]
print(f"āœ“ Encoded T cell query")

# Step 3: Search for matching clusters
search_data = {
    "jsonl_path": jsonl_path,
    "psi": json.dumps(query_vector),
    "k": 5
}
response = requests.post(
    f"{API_URL}/api/v1/genomics/vectors/search",
    headers=headers,
    data=search_data
)

results = response.json()["hits"]
print(f"\nāœ“ Top 5 T cell-like clusters:")
for i, hit in enumerate(results, 1):
    cluster_id = hit['meta']['cluster']
    distance = hit['distance']
    cells = hit['meta']['cells']
    print(f"{i}. Cluster {cluster_id}: distance={distance:.4f}, cells={cells}")

Common Cell Type Markers

Cell Type          Marker Genes
T cells            CD3E, CD8A, CD4, IL7R
B cells            MS4A1, CD79A, CD19, IGHM
NK cells           NKG7, GNLY, KLRD1, NCAM1
Monocytes          CD14, FCGR3A, CST3, LYZ
Dendritic cells    FCER1A, CST3, CLEC10A
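
These marker sets can be scanned against a built vector collection in a loop, pairing /query/encode with /vectors/search. A sketch: the marker dictionary mirrors the table above, and jsonl_path comes from Step 1 of the workflow example.

markers = {
    "T cells": ["CD3E", "CD8A", "CD4", "IL7R"],
    "B cells": ["MS4A1", "CD79A", "CD19", "IGHM"],
    "NK cells": ["NKG7", "GNLY", "KLRD1", "NCAM1"],
    "Monocytes": ["CD14", "FCGR3A", "CST3", "LYZ"],
    "Dendritic cells": ["FCER1A", "CST3", "CLEC10A"],
}

for cell_type, genes in markers.items():
    # Encode the marker list as a 128-dimensional query vector
    psi = requests.post(
        f"{API_URL}/api/v1/genomics/query/encode",
        headers=headers,
        data={"gene_list_json": json.dumps(genes)}
    ).json()["psi"]
    # Find the closest cluster for this cell type
    hits = requests.post(
        f"{API_URL}/api/v1/genomics/vectors/search",
        headers=headers,
        data={"jsonl_path": jsonl_path, "psi": json.dumps(psi), "k": 1}
    ).json()["hits"]
    print(f"{cell_type}: cluster {hits[0]['meta']['cluster']} "
          f"(distance {hits[0]['distance']:.4f})")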

🧬 Bio Module - ResoTrace Integration

The Bio module provides advanced single-cell analysis workflows including artifact export, signal synchronization, QC parasite detection, and evolution tracking. All endpoints support both synchronous (immediate) and asynchronous (background job) execution modes.

Key Features:
• Sync/Async Modes: Choose immediate results or background processing
• Memory Logging: All operations tracked with trace_id for audit trails
• Job Polling: Monitor long-running async jobs via /bio/jobs/{job_id}
• Project Tracking: Retrieve all traces for a project via /bio/memory/{project_id}

1. Export Artifacts

POST /api/v1/bio/export-artifacts

Convert H5AD files into ResoTrace-compatible collapse maps and vector collections. Exports cluster graphs, PAGA connectivity, and per-cell embeddings.

Parameters:

Example (Synchronous):

import requests

with open("pbmc_processed.h5ad", "rb") as f:
    files = {"file": ("pbmc.h5ad", f, "application/octet-stream")}
    data = {
        "cluster_key": "leiden",
        "latent_key": "X_umap",
        "sync": "true",
        "project_id": "cai_lab_pbmc_study",
        "user_id": "researcher_001"
    }
    
    response = requests.post(
        f"{API_URL}/api/v1/bio/export-artifacts",
        headers={"Authorization": f"Bearer {TOKEN}"},
        files=files,
        data=data
    )

result = response.json()
print(f"āœ“ Exported {result['nodes']} clusters")
print(f"āœ“ {result['vectors']} cell vectors")
print(f"āœ“ Artifacts: {result['artifacts']}")
print(f"āœ“ Trace ID: {result.get('trace_id')}")

Example (Asynchronous):

# Submit job
with open("large_dataset.h5ad", "rb") as f:
    files = {"file": ("data.h5ad", f, "application/octet-stream")}
    data = {"cluster_key": "leiden", "latent_key": "X_pca", "sync": "false"}
    
    response = requests.post(
        f"{API_URL}/api/v1/bio/export-artifacts",
        headers={"Authorization": f"Bearer {TOKEN}"},
        files=files,
        data=data
    )

job_id = response.json()["job_id"]
print(f"Job ID: {job_id}")

# Poll job status
import time
while True:
    status_response = requests.get(
        f"{API_URL}/api/v1/bio/jobs/{job_id}",
        headers={"Authorization": f"Bearer {TOKEN}"}
    )
    
    job = status_response.json()
    status = job["status"]
    print(f"Status: {status}")
    
    if status == "completed":
        result = job["result"]
        print(f"āœ“ Complete! Nodes: {result['nodes']}, Vectors: {result['vectors']}")
        break
    elif status == "failed":
        print(f"āœ— Failed: {job.get('error')}")
        break
    
    time.sleep(2)  # Poll every 2 seconds

Response (Sync):

{
  "job_id": null,
  "nodes": 8,
  "edges": 0,
  "vectors": 2700,
  "artifacts": [
    "collapse_map.json",
    "collapse_vectors.jsonl"
  ],
  "status": "completed",
  "message": "Export completed successfully (trace: abc123...)"
}

2. Signal Sync

POST /api/v1/bio/signal-sync

Compute cross-artifact coherence and synchronization metrics between two collapse maps. Useful for comparing replicates or validating pipeline stability.

Parameters:

Example:

with open("run1_collapse_map.json", "rb") as f1,      open("run2_collapse_map.json", "rb") as f2:
    
    files = {
        "artifact1": ("run1.json", f1, "application/json"),
        "artifact2": ("run2.json", f2, "application/json")
    }
    data = {"coherence_threshold": "0.7", "sync": "true"}
    
    response = requests.post(
        f"{API_URL}/api/v1/bio/signal-sync",
        headers={"Authorization": f"Bearer {TOKEN}"},
        files=files,
        data=data
    )

result = response.json()
print(f"Coherence Score: {result['coherence_score']:.3f}")
print(f"Node Overlap: {result['node_overlap']:.3f}")
print(f"Synchronized: {result['synchronized']}")

Interpretation:

Coherence Score    Interpretation
0.9 - 1.0          Excellent consistency (technical replicates)
0.7 - 0.9          Good consistency (biological replicates)
0.5 - 0.7          Moderate similarity (related conditions)
< 0.5              Low similarity (different conditions/batches)

3. Parasite Detector (QC)

POST /api/v1/bio/qc/parasite-detect

Detect quality control "parasites" including ambient RNA, doublets, and batch effects. Returns per-cell flags and overall contamination score.

Parameters:

Example:

with open("pbmc_raw.h5ad", "rb") as f:
    files = {"file": ("pbmc.h5ad", f, "application/octet-stream")}
    data = {
        "cluster_key": "leiden",
        "batch_key": "sample",
        "ambient_threshold": "0.15",
        "doublet_threshold": "0.25",
        "sync": "true"
    }
    
    response = requests.post(
        f"{API_URL}/api/v1/bio/qc/parasite-detect",
        headers={"Authorization": f"Bearer {TOKEN}"},
        files=files,
        data=data
    )

result = response.json()
print(f"Cells: {result['n_cells']}")
print(f"Flagged: {result['n_flagged']} ({result['n_flagged']/result['n_cells']*100:.1f}%)")
print(f"Parasite Score: {result['parasite_score']:.1f}%")

# Flags: list of booleans, one per cell
print(f"\nFlagged cell indices: {[i for i, f in enumerate(result['flags']) if f][:10]}...")

Recommended Actions:

Parasite Score    Action
0-5%              Excellent quality - proceed
5-15%             Good quality - minor filtering recommended
15-30%            Moderate contamination - filter flagged cells
> 30%             High contamination - review QC pipeline
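
Because flags contains one boolean per cell, flagged cells can be removed locally before downstream analysis. A sketch, assuming the flags follow the cell order of the uploaded H5AD file and that the file is also available locally:

import numpy as np
import scanpy as sc

adata = sc.read_h5ad("pbmc_raw.h5ad")
flags = np.array(result["flags"], dtype=bool)   # True = flagged by the detector

# Keep only unflagged cells and save the filtered dataset
adata_clean = adata[~flags].copy()
adata_clean.write_h5ad("pbmc_filtered.h5ad")
print(f"Removed {int(flags.sum())} of {adata.n_obs} cells")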

4. Evolution Report

POST /api/v1/bio/evolution/report

Compare two pipeline runs to assess stability, drift, and reproducibility. Useful for parameter optimization and batch effect validation.

Parameters:

Example:

with open("run1_leiden05.h5ad", "rb") as f1,      open("run2_leiden10.h5ad", "rb") as f2:
    
    files = {
        "run1_file": ("run1.h5ad", f1, "application/octet-stream"),
        "run2_file": ("run2.h5ad", f2, "application/octet-stream")
    }
    data = {
        "run2_name": "leiden_resolution_1.0",
        "cluster_key": "leiden",
        "latent_key": "X_pca",
        "sync": "true"
    }
    
    response = requests.post(
        f"{API_URL}/api/v1/bio/evolution/report",
        headers={"Authorization": f"Bearer {TOKEN}"},
        files=files,
        data=data
    )

result = response.json()
print(f"Run 1: {result['run1_name']} ({result['n_cells_run1']} cells)")
print(f"Run 2: {result['run2_name']} ({result['n_cells_run2']} cells)")
print(f"Stability Score: {result['stability_score']:.1f}%")
print(f"\nDelta Metrics: {result['delta_metrics']}")

Stability Score Interpretation:

Score      Interpretation
> 90%      Excellent stability (robust pipeline)
70-90%     Good stability (acceptable variation)
50-70%     Moderate drift (review parameters)
< 50%      High drift (investigate batch effects)

Job Management

Poll Job Status

GET /api/v1/bio/jobs/{job_id}

response = requests.get(
    f"{API_URL}/api/v1/bio/jobs/{job_id}",
    headers={"Authorization": f"Bearer {TOKEN}"}
)

job = response.json()
# Fields: job_id, endpoint, status, created_at, completed_at, result, error

Retrieve Project Memory

GET /api/v1/bio/memory/{project_id}

response = requests.get(
    f"{API_URL}/api/v1/bio/memory/cai_lab_pbmc_study",
    headers={"Authorization": f"Bearer {TOKEN}"}
)

memory = response.json()
print(f"Project: {memory['project_id']}")
print(f"Total Traces: {memory['count']}")

for trace in memory['traces']:
    print(f"  {trace['event_type']}: {trace['metrics']}")
💡 Pro Tips:
• Use sync=true for small datasets (<5k cells)
• Use sync=false for large datasets (>10k cells)
• Set project_id to group related operations
• Monitor trace_id for debugging and audit trails
• Check parasite_score before downstream analysis

🧬 Understanding the Output

128-Dimensional Vector Structure

Dimensions    Content             Purpose
0-15          Entropy signature   Gene expression distribution (16 bins)
16            HVG fraction        % of highly variable genes expressed
17            Mitochondrial %     Cell quality indicator
18            Total counts        Library size (normalized)
22            Silhouette score    Cluster separation quality (-1 to 1)
27            Purity score        Neighborhood homogeneity (0 to 1)
28-127        Biological tokens   Hashed cell type & tissue features
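
For quick inspection, a returned psi vector can be sliced according to this layout (a sketch; only the dimensions documented in the table are labeled):

psi = result["preview"][0]["psi"]   # any 128-dimensional vector from /vectors/build

entropy_signature = psi[0:16]    # 16-bin gene expression distribution
hvg_fraction      = psi[16]      # % of highly variable genes expressed
mito_pct          = psi[17]      # mitochondrial % (cell quality indicator)
total_counts      = psi[18]      # normalized library size
silhouette        = psi[22]      # cluster separation quality (-1 to 1)
purity            = psi[27]      # neighborhood homogeneity (0 to 1)
bio_tokens        = psi[28:]     # hashed cell type & tissue features (100 dims)

print(f"silhouette={silhouette:.3f}, purity={purity:.3f}")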

Compression Example (PBMC3k)
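
Using the PBMC3k figures from the performance summary above, the compression ratio works out as follows (a back-of-the-envelope sketch: 10 cluster vectors of 128 dimensions replace the full expression matrix):

n_cells, n_genes = 2_700, 13_714
n_clusters, dims = 10, 128            # 10 cluster vectors from /vectors/build

raw_values = n_cells * n_genes        # ~37.0M expression values
compressed = n_clusters * dims        # 1,280 values (~1.3K)
print(f"{raw_values:,} -> {compressed:,} values "
      f"({raw_values / compressed:,.0f}x compression)")
# ~28,928x, consistent with the validated figure above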

🔧 Troubleshooting

Common Errors

1. Authentication Failed (401)

Error: {"detail": "Invalid or missing token"}

Solutions:
  • Verify token is correct (check for extra spaces)
  • Ensure Authorization: Bearer TOKEN header format
  • Contact support if token expired

2. File Upload Failed (400)

Error: {"detail": "Expected .h5ad file"}

Solutions:
  • Verify file extension is .h5ad
  • Check file is valid AnnData: sc.read_h5ad("file.h5ad")
  • Ensure file size < 500MB

3. Rate Limit Exceeded (429)

Error: {"detail": "Rate limit exceeded"}

Solutions:
  • Wait 1 hour for rate limit reset
  • Implement exponential backoff in your code (see the sketch below)
  • Contact support for higher limits if needed
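
A minimal retry helper with exponential backoff (a sketch; the API returns HTTP status 429 when the hourly limit is exceeded):

import time
import requests

def post_with_backoff(url, max_retries=5, **kwargs):
    """Retry a POST on HTTP 429, doubling the wait between attempts."""
    for attempt in range(max_retries):
        response = requests.post(url, **kwargs)
        if response.status_code != 429:
            return response
        wait = 2 ** attempt              # 1s, 2s, 4s, 8s, 16s
        print(f"Rate limited; retrying in {wait}s")
        time.sleep(wait)
    return response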

Best Practices

Data Preparation Tips

import scanpy as sc

# Quality control
adata = sc.read_h5ad("raw_data.h5ad")
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Compute mitochondrial QC metrics so 'pct_counts_mt' exists before filtering on it
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
adata = adata[adata.obs['pct_counts_mt'] < 5, :].copy()

# Normalization
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Save cleaned data
adata.write_h5ad("cleaned_data.h5ad")

📞 Support & Resources

Research Liaison

research.request@donsystems.com

General inquiries, collaboration requests, token provisioning

Technical Support

support@donsystems.com

API errors, troubleshooting, integration assistance

Partnerships

partnerships@donsystems.com

Academic collaborations, research proposals

Documentation Resources

Office Hours

Monday-Friday, 9 AM - 5 PM CST

Response Time: < 24 hours for technical issues

Reporting Issues

Please include:

  1. Institution name (Texas A&M)
  2. API endpoint and parameters used
  3. Full error message (JSON response)
  4. Sample data file (if < 10MB) or description
  5. Expected vs. actual behavior