This example demonstrates how to chunk a document, generate embeddings, and store them in Chroma Cloud for semantic search and retrieval.
The example performs the following operations:
- Ingestion Mode: Chunks a document (`document.txt`) into smaller pieces, generates embeddings using Jina AI, and stores them in Chroma Cloud
- Query Mode: Performs semantic search on the stored documents using natural language queries
Requirements:
- PHP 8.1 or higher
- Chroma Cloud account with API key
- Jina AI API key (for embeddings)
- Composer dependencies installed (`composer install`)
- Set your API keys as environment variables:

      export CHROMA_API_KEY="your-chroma-cloud-api-key"
      export JINA_API_KEY="your-jina-api-key"

  Or pass them via CLI arguments (see Usage below).
Chunk and store the document to Chroma Cloud:

    php index.php -mode ingest

With custom options:

    php index.php -mode ingest \
      --api-key "your-chroma-api-key" \
      --jina-key "your-jina-api-key" \
      --tenant "my-tenant" \
      --database "my-database"

Search the stored documents:

    php index.php -mode query --query "What happened at the Dartmouth Workshop?"

With custom options:

    php index.php -mode query \
      --query "Who proposed the Turing Test?" \
      --api-key "your-chroma-api-key" \
      --jina-key "your-jina-api-key" \
      --tenant "my-tenant" \
      --database "my-database"

| Argument | Description | Default | Required |
|---|---|---|---|
| `-mode` | Operation mode: `ingest` or `query` | - | Yes |
| `--query` | Query text for search (query mode only) | "Which event marked the birth of symbolic AI?" | No |
| `--api-key` | Chroma Cloud API key | `CHROMA_API_KEY` env var | Yes |
| `--jina-key` | Jina AI API key for embeddings | `JINA_API_KEY` env var | Yes |
| `--tenant` | Chroma Cloud tenant name | `default_tenant` | No |
| `--database` | Chroma Cloud database name | `default_database` | No |
| `--collection-name` | Collection name to use | `history_of_ai` | No |

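The defaults in the table imply a simple precedence: a CLI argument wins, then the matching environment variable, then the built-in default. The following is a minimal sketch of that resolution, not the actual code in `index.php`. The `resolveConfig()` helper, its array keys, and the mode error message are hypothetical, and the arguments are assumed to be parsed already.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not taken from index.php): merges parsed CLI
 * arguments with environment variables and the documented defaults.
 *
 * @param array<string, string> $args Parsed CLI arguments, keyed by name without dashes.
 * @param array<string, string> $env  Environment variables.
 * @return array<string, string>
 */
function resolveConfig(array $args, array $env): array
{
    $config = [
        'mode'       => $args['mode'] ?? '',
        'query'      => $args['query'] ?? 'Which event marked the birth of symbolic AI?',
        // CLI argument first, then the environment variable.
        'api_key'    => $args['api-key'] ?? $env['CHROMA_API_KEY'] ?? '',
        'jina_key'   => $args['jina-key'] ?? $env['JINA_API_KEY'] ?? '',
        'tenant'     => $args['tenant'] ?? 'default_tenant',
        'database'   => $args['database'] ?? 'default_database',
        'collection' => $args['collection-name'] ?? 'history_of_ai',
    ];

    // Fail early on the values the table marks as required.
    if (!in_array($config['mode'], ['ingest', 'query'], true)) {
        throw new InvalidArgumentException('Mode must be "ingest" or "query"');
    }
    if ($config['api_key'] === '') {
        throw new InvalidArgumentException('Chroma Cloud API Key is required');
    }
    if ($config['jina_key'] === '') {
        throw new InvalidArgumentException('Jina API Key is required');
    }

    return $config;
}

// Keys come from the environment here; only the tenant is overridden on the CLI.
$config = resolveConfig(
    ['mode' => 'query', 'tenant' => 'my-tenant'],
    ['CHROMA_API_KEY' => 'env-chroma-key', 'JINA_API_KEY' => 'env-jina-key']
);
echo $config['tenant'], ' / ', $config['database'], PHP_EOL;
```

Resolving everything in one place keeps the "required" checks next to the defaults, so a missing key fails before any network call is made.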
Try these example queries to test the semantic search:

    # Historical events
    php index.php -mode query --query "What happened at the Dartmouth Workshop?"

    # People and contributions
    php index.php -mode query --query "Who proposed the Turing Test?"

    # Technical breakthroughs
    php index.php -mode query --query "What was the significance of AlexNet in 2012?"

    # Concepts and explanations
    php index.php -mode query --query "How do Large Language Models and Generative AI work?"

    # Historical figures
    php index.php -mode query --query "Who is considered the first computer programmer?"

The document is chunked based on:
- CHAPTER markers: New chapters create new chunks
- PAGE markers: New pages create new chunks
- Text accumulation: Text between markers is accumulated into chunks
Each chunk includes:
- Unique ID
- Document text
- Metadata (chapter and page information)
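The marker-based chunking described above can be sketched roughly as follows. This is an illustration, not the `chunkDocument()` function from `index.php`: the function name `chunkBySections()`, the ID scheme, and the assumption that markers are lines starting with `CHAPTER` or `PAGE` are all hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative chunker: starts a new chunk at every line beginning with
 * "CHAPTER" or "PAGE" (assumed marker format) and accumulates the text between.
 *
 * @return list<array{id: string, document: string, metadata: array{chapter: string, page: string}}>
 */
function chunkBySections(string $text): array
{
    $chunks = [];
    $chapter = '';
    $page = '';
    $buffer = [];

    // Emit the accumulated text as one chunk, tagged with the current location.
    $flush = function () use (&$chunks, &$buffer, &$chapter, &$page): void {
        $body = trim(implode("\n", $buffer));
        $buffer = [];
        if ($body === '') {
            return; // nothing accumulated since the last marker
        }
        $chunks[] = [
            'id' => 'chunk_' . count($chunks),
            'document' => $body,
            'metadata' => ['chapter' => $chapter, 'page' => $page],
        ];
    };

    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if (str_starts_with($line, 'CHAPTER')) {
            $flush();          // close the chunk that belongs to the previous location
            $chapter = $line;
            $page = '';
        } elseif (str_starts_with($line, 'PAGE')) {
            $flush();
            $page = $line;
        } elseif ($line !== '') {
            $buffer[] = $line; // plain text: keep accumulating
        }
    }
    $flush(); // final chunk after the last marker

    return $chunks;
}

$sample = "CHAPTER 1: Intro\nPAGE 1\nFirst page text.\nPAGE 2\nSecond page text.\n"
    . "CHAPTER 2: Next\nPAGE 3\nThird page text.";
$chunks = chunkBySections($sample);

foreach ($chunks as $chunk) {
    echo $chunk['id'], ': ', $chunk['metadata']['chapter'], ', ', $chunk['metadata']['page'], PHP_EOL;
}
```

Flushing before the location is updated means each chunk carries the chapter and page it was read under, not the marker that ended it.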
Embedding generation:
- Uses Jina AI's embedding function to convert text chunks into vector embeddings
- Embeddings are generated in batch for efficiency
- All chunks are embedded before storage
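To illustrate the batch pattern, the sketch below embeds all chunk texts in one call and pairs the vectors back to their chunks by position. The `toyEmbedBatch()` function is a deliberately trivial stand-in, not the Jina AI embedding function; a real embedder would send the whole batch in one API request and return much larger vectors.

```php
<?php
declare(strict_types=1);

/**
 * Toy stand-in for an embedding function (assumption for illustration only):
 * maps each text to a tiny 3-dimensional vector.
 *
 * @param list<string> $texts
 * @return list<list<float>>
 */
function toyEmbedBatch(array $texts): array
{
    return array_map(
        fn (string $t): array => [
            (float) strlen($t),
            (float) str_word_count($t),
            (float) substr_count($t, '.'),
        ],
        $texts
    );
}

$chunks = [
    ['id' => 'chunk_0', 'document' => 'Alan Turing proposed the imitation game.'],
    ['id' => 'chunk_1', 'document' => 'The Dartmouth Workshop took place in 1956.'],
];

// Embed every chunk text in a single batch call, before anything is stored.
$texts = array_column($chunks, 'document');
$embeddings = toyEmbedBatch($texts);

// Vectors come back in input order, so index i pairs with chunk i.
foreach ($chunks as $i => $chunk) {
    printf("%s -> %d dimensions\n", $chunk['id'], count($embeddings[$i]));
}
```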
Storage:
- Chunks are stored in a Chroma Cloud collection
- The collection is recreated on each ingestion (previous data is deleted)
- Each chunk maintains its metadata for filtering and context
Querying:
- Natural language queries are converted to embeddings using the same Jina AI function
- Vector similarity search finds the most relevant chunks
- Results include distance scores, documents, and metadata
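The idea behind vector similarity search can be shown with a small in-memory sketch. It ranks stored items by cosine distance to a query vector and returns distance, document, and metadata for the closest ones. This is only an illustration: in the example the search runs inside Chroma Cloud, the collection's actual distance metric may differ, and the `search()` helper and the hand-made 2-dimensional vectors are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Cosine distance: 0.0 for vectors pointing the same way, larger means less similar.
 *
 * @param list<float> $a
 * @param list<float> $b
 */
function cosineDistance(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 1.0; // treat a zero vector as unrelated
    }

    return 1.0 - $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Hypothetical in-memory search: returns the $nResults closest items.
 *
 * @param list<float> $queryEmbedding
 * @param list<array{embedding: list<float>, document: string, metadata: array<string, string>}> $items
 */
function search(array $queryEmbedding, array $items, int $nResults = 2): array
{
    $results = [];
    foreach ($items as $item) {
        $results[] = [
            'distance' => cosineDistance($queryEmbedding, $item['embedding']),
            'document' => $item['document'],
            'metadata' => $item['metadata'],
        ];
    }
    // Smallest distance first.
    usort($results, fn (array $x, array $y): int => $x['distance'] <=> $y['distance']);

    return array_slice($results, 0, $nResults);
}

$items = [
    ['embedding' => [0.0, 1.0], 'document' => 'Text about page two.', 'metadata' => ['page' => 'PAGE 2']],
    ['embedding' => [0.9, 0.1], 'document' => 'Text about page one.', 'metadata' => ['page' => 'PAGE 1']],
];

$results = search([1.0, 0.0], $items);
foreach ($results as $i => $result) {
    printf("[%d] (Distance: %.3f) %s\n", $i, $result['distance'], $result['metadata']['page']);
}
```

The query vector must come from the same embedding function as the stored vectors; otherwise the distances are not comparable.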
Example output in ingestion mode:

    --- Chroma Cloud Example: ingest Mode ---
    Tenant: default_tenant, Database: default_database
    Connected to Chroma Cloud version: 0.1.0
    Starting Ingestion...
    Parsed 9 chunks from document.
    Embedding and adding 9 items...
    Ingestion Complete!

Example output in query mode:

    --- Chroma Cloud Example: query Mode ---
    Tenant: default_tenant, Database: default_database
    Connected to Chroma Cloud version: 0.1.0
    Querying: "What happened at the Dartmouth Workshop?"
    --- Results ---
    [0] (Distance: 0.123)
    Location: CHAPTER 1: The Dawn of Thinking Machines, PAGE 3
    Content: The 1956 Dartmouth Workshop is widely considered the founding event of AI as a field. John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon brought together...
    ---------------------------

Replace `document.txt` with your own document. The chunking logic will automatically process it based on CHAPTER and PAGE markers.

Modify `index.php` to use a different embedding function:

    use Codewithkyrian\ChromaDB\Embeddings\OpenAIEmbeddingFunction;

    $ef = new OpenAIEmbeddingFunction($config['openai_key']);

Modify the `chunkDocument()` function to implement your own chunking logic (e.g., by sentence, by paragraph, or fixed-size chunks).

Error: Chroma Cloud API Key is required
- Set the `CHROMA_API_KEY` environment variable or use the `--api-key` argument
Error: Jina API Key is required
- Set the `JINA_API_KEY` environment variable or use the `--jina-key` argument
Error: Collection not found
- Run ingestion mode first to create and populate the collection
No results returned
- Ensure the collection was successfully ingested
- Try different query phrasings
- Check that the query is related to the document content