🧪 Usage Guide
- Python
- JavaScript
Initiating the Chroma client
import chromadb
By default Chroma uses an in-memory database, which gets persisted on exit and loaded on start (if it exists). This is fine for many experimental / prototyping workloads, limited by your machine's memory.
from chromadb.config import Settings
client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="/path/to/persist/directory"  # Optional, defaults to .chromadb/ in the current directory
))
The persist_directory is where Chroma will store its database files on disk, and load them on start.
import { ChromaClient } from 'chromadb'
The JS client talks to a Chroma server backend. This can run on your local computer via Docker (see below) or be easily deployed to AWS.
const client = new ChromaClient();
The client object has a few useful convenience methods.
client.heartbeat() # returns a nanosecond heartbeat. Useful for making sure the client remains connected.
client.reset() # Empties and completely resets the database. ⚠️ This is destructive and not reversible.
The client object has a few useful convenience methods.
await client.reset() // Empties and completely resets the database. ⚠️ This is destructive and not reversible.
Running Chroma in client/server mode
Chroma can also be configured to use an on-disk database, which is useful for larger data that doesn't fit in memory. To run Chroma in client/server mode, run the Docker container:
docker-compose up -d --build
Then update your chroma client to point at the docker container. Default: localhost:8000
import chromadb
from chromadb.config import Settings
chroma_client = chromadb.Client(Settings(
    chroma_api_impl="rest",
    chroma_server_host="localhost",
    chroma_server_http_port="8000"
))
That's it! Chroma's API will run in client-server mode with just this change.
The JS client then talks to the chroma server backend. This can run on your local computer or be easily deployed to AWS.
import { ChromaClient } from 'chromadb'
const client = new ChromaClient();
Using collections
Chroma lets you manage collections of embeddings, using the collection primitive.
Creating, inspecting, and deleting Collections
Chroma uses collection names in the URL, so there are a few restrictions on naming them:
- The length of the name must be between 3 and 63 characters.
- The name must start and end with a lowercase letter or a digit, and it can contain dots, dashes, and underscores in between.
- The name must not contain two consecutive dots.
- The name must not be a valid IP address.
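The naming rules above can be checked locally before a collection is created. The following is a minimal sketch, not Chroma's own validation code: is_valid_collection_name is a hypothetical helper, and it assumes that the characters between the first and last are limited to lowercase letters, digits, dots, dashes, and underscores.

```python
import ipaddress
import re

# Assumption: only lowercase letters, digits, dots, dashes and underscores
# are allowed between the first and last character.
_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9._-]*[a-z0-9]")


def is_valid_collection_name(name: str) -> bool:
    """Hypothetical helper that mirrors the four naming rules listed above."""
    # Rule 1: length between 3 and 63 characters.
    if not 3 <= len(name) <= 63:
        return False
    # Rule 2: starts and ends with a lowercase letter or digit.
    if not _NAME_PATTERN.fullmatch(name):
        return False
    # Rule 3: no two consecutive dots.
    if ".." in name:
        return False
    # Rule 4: must not parse as an IP address.
    try:
        ipaddress.ip_address(name)
    except ValueError:
        return True
    return False
```

For example, is_valid_collection_name("my_collection") passes all four checks, while "192.168.1.1" is rejected by the last one.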
Chroma collections are created with a name and an optional embedding function. If you supply an embedding function, you must supply it every time you get the collection.
collection = client.create_collection(name="my_collection", embedding_function=emb_fn)
collection = client.get_collection(name="my_collection", embedding_function=emb_fn)
If you later wish to get_collection, you MUST do so with the embedding function you supplied while creating the collection.
The embedding function takes text as input, and performs tokenization and embedding. If no embedding function is supplied, Chroma will use a sentence transformer model as the default.
import { ChromaClient } from 'chromadb'
let collection = await client.createCollection("my_collection", undefined, emb_fn) // name, metadata, embedding function
let collection2 = await client.getCollection("my_collection", emb_fn) // name, embedding function
If you later wish to getCollection, you MUST do so with the embedding function you supplied while creating the collection.
The embedding function takes text as input, and performs tokenization and embedding.
You can learn more about 🧬 embedding functions, and how to create your own.
Existing collections can be retrieved by name with .get_collection, and deleted with .delete_collection. You can also use .get_or_create_collection to get a collection if it exists, or create it if it doesn't.
collection = client.get_collection(name="test") # Get a collection object from an existing collection, by name. Will raise an exception if it's not found.
collection = client.get_or_create_collection(name="test") # Get a collection object from an existing collection, by name. If it doesn't exist, create it.
client.delete_collection(name="my_collection") # Delete a collection and all associated embeddings, documents, and metadata. ⚠️ This is destructive and not reversible
Existing collections can be retrieved by name with .getCollection, and deleted with .deleteCollection.
const collection = await client.getCollection("test") // Get a collection object from an existing collection, by name. Will raise an exception if it's not found.
await client.deleteCollection("my_collection") // Delete a collection and all associated embeddings, documents, and metadata. ⚠️ This is destructive and not reversible
Collections have a few useful convenience methods.
collection.peek() # returns a list of the first 10 items in the collection
collection.count() # returns the number of items in the collection
collection.modify(name="new_name") # Rename the collection
await collection.peek() // returns a list of the first 10 items in the collection
await collection.count() // returns the number of items in the collection
Adding data to a Collection
Add data to Chroma with .add.
Raw documents:
collection.add(
    documents=["lorem ipsum...", "doc2", "doc3", ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ids=["id1", "id2", "id3", ...]
)
await collection.add(
    ["id1", "id2", "id3", ...],
    undefined,
    [{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ["lorem ipsum...", "doc2", "doc3", ...],
)
// input order
// ids - required
// embeddings - optional
// metadata - optional
// documents - optional
If Chroma is passed a list of documents, it will automatically tokenize and embed them with the collection's embedding function (the default will be used if none was supplied at collection creation). Chroma will also store the documents themselves. If the documents are too large to embed using the chosen embedding function, an exception will be raised.
Each document must have a unique associated id. Trying to .add the same ID twice will result in an error. An optional list of metadata dictionaries can be supplied for each document, to store additional information and enable filtering.
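Because a repeated ID results in an error, it can be useful to inspect a batch before passing it to .add. The sketch below is a hypothetical pre-flight helper (check_add_args is not part of Chroma). It only looks for duplicates inside the batch itself, not against items already in the collection, and it assumes that every list supplied must have exactly one entry per id.

```python
from collections import Counter


def check_add_args(ids, documents=None, metadatas=None, embeddings=None):
    """Hypothetical helper: return a list of problems found in a batch.

    An empty list means the batch looks internally consistent.
    """
    problems = []

    # Each id may appear only once within the batch.
    duplicates = sorted(i for i, count in Counter(ids).items() if count > 1)
    if duplicates:
        problems.append(f"duplicate ids: {duplicates}")

    # Assumption: every optional list lines up one-to-one with ids.
    optional_lists = (
        ("documents", documents),
        ("metadatas", metadatas),
        ("embeddings", embeddings),
    )
    for label, values in optional_lists:
        if values is not None and len(values) != len(ids):
            problems.append(f"{label} has {len(values)} entries, expected {len(ids)}")

    return problems
```

Returning a list of messages rather than raising on the first problem makes it easier to report everything wrong with a batch at once.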
Alternatively, you can supply a list of document-associated embeddings directly, and Chroma will store the associated documents without embedding them itself.
collection.add(
    documents=["doc1", "doc2", "doc3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ids=["id1", "id2", "id3", ...]
)
await collection.add(
    ["id1", "id2", "id3", ...],
    [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    [{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ["lorem ipsum...", "doc2", "doc3", ...],
)
If the supplied embeddings are not the same dimension as the collection, an exception will be raised.
You can also store documents elsewhere, and just supply a list of embeddings and metadata to Chroma. You can use the ids to associate the embeddings with your documents stored elsewhere.
collection.add(
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ids=["id1", "id2", "id3", ...]
)
await collection.add(
    ["id1", "id2", "id3", ...],
    [[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    [{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
)
Querying a Collection
Chroma collections can be queried in a variety of ways, using the .query method.
You can query by a set of query_embeddings.
collection.query(
    query_embeddings=[[11.1, 12.1, 13.1], [1.1, 2.3, 3.2], ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains": "search_string"}
)
const result = await collection.query(
    [[11.1, 12.1, 13.1], [1.1, 2.3, 3.2], ...],
    10,
    {"metadata_field": "is_equal_to_this"},
    undefined,
)
// input order
// query_embeddings - optional
// n_results - required
// where - optional
// query_texts - optional
The query will return the n_results closest matches to each query_embedding, in order.
An optional where filter dictionary can be supplied to filter the results by the metadata associated with each document.
Additionally, an optional where_document filter dictionary can be supplied to filter the results by contents of the document.
If the supplied query_embeddings are not the same dimension as the collection, an exception will be raised.
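To make "the n_results closest matches to each query_embedding, in order" concrete, here is a brute-force sketch in plain Python. It is an illustration only: brute_force_query is a hypothetical function, Euclidean distance is chosen purely for the example, and Chroma's actual index and distance metric may differ. It assumes a non-empty mapping of id to embedding.

```python
import math


def brute_force_query(stored, query_embeddings, n_results=10):
    """Illustrative only: rank stored embeddings by distance to each query.

    stored maps id -> embedding (assumed non-empty). Returns one list per
    query, each holding (id, distance) pairs sorted nearest first.
    """
    dimension = len(next(iter(stored.values())))
    results = []
    for query in query_embeddings:
        # Mirror the documented behaviour: mismatched dimensions are an error.
        if len(query) != dimension:
            raise ValueError(
                f"query has dimension {len(query)}, collection has {dimension}"
            )
        scored = [
            (item_id, math.dist(query, embedding))
            for item_id, embedding in stored.items()
        ]
        scored.sort(key=lambda pair: pair[1])
        results.append(scored[:n_results])
    return results
```

Each query embedding produces its own result list, which is why the return value is a list of lists.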
You can also query by a set of query_texts. Chroma will first embed each query_text with the collection's embedding function, and then perform the query with the generated embedding.
collection.query(
    query_texts=["doc10", "thus spake zarathustra", ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains": "search_string"}
)
You can also retrieve items from a collection by id using .get.
collection.get(
    ids=["id1", "id2", "id3", ...],
    where={"style": "style1"}
)
await collection.query(
    undefined, // query_embeddings
    10, // n_results
    {"metadata_field": "is_equal_to_this"}, // where
    ["doc10", "thus spake zarathustra", ...], // query_texts
)
You can also retrieve items from a collection by id using .get.
await collection.get(
    ["id1", "id2", "id3", ...], // ids
    {"style": "style1"} // where
)
.get also supports the where and where_document filters. If no ids are supplied, it will return all items in the collection that match the where and where_document filters.
Choosing which data is returned
When using get or query you can use the include parameter to specify which data you want returned: any of embeddings, documents, metadatas, and for query, distances. By default, Chroma will return the documents, metadatas and, in the case of query, the distances of the results. embeddings are excluded by default for performance and the ids are always returned. You can specify which of these you want returned by passing an array of included field names to the include parameter of the query or get method.
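The defaults described above can be summarized in a few lines of Python. This is an illustrative sketch only: returned_fields is a hypothetical helper rather than a Chroma function, and rejecting unknown field names with an error is an assumption made for the example.

```python
def returned_fields(method, include=None):
    """Hypothetical helper: list the fields a get or query call would return."""
    if method not in ("get", "query"):
        raise ValueError(f"unknown method: {method}")

    allowed = ["embeddings", "documents", "metadatas"]
    defaults = ["documents", "metadatas"]
    if method == "query":
        # distances only make sense for query results.
        allowed.append("distances")
        defaults.append("distances")

    chosen = defaults if include is None else list(include)
    unknown = [field for field in chosen if field not in allowed]
    if unknown:
        # Assumption: invalid field names are treated as an error.
        raise ValueError(f"cannot include {unknown} in {method}")

    # ids are always returned, whatever include says.
    return ["ids"] + chosen
```

For instance, returned_fields("get", ["documents"]) gives ["ids", "documents"], matching the "only get documents and ids" case.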
# Only get documents and ids
collection.get(
    include=["documents"]
)
collection.query(
    query_embeddings=[[11.1, 12.1, 13.1], [1.1, 2.3, 3.2], ...],
    include=["documents"]
)
Using Where filters
Chroma supports filtering queries by metadata and document contents. The where filter is used to filter by metadata, and the where_document filter is used to filter by document contents.
Filtering by metadata
In order to filter on metadata, you must supply a where filter dictionary to the query. The dictionary must have the following structure:
{
    "metadata_field": {
        <Operator>: <Value>
    },
    "metadata_field": {
        <Operator>: <Value>
    },
}
Filtering metadata supports the following operators:
- $eq - equal to (string, int, float)
- $ne - not equal to (string, int, float)
- $gt - greater than (int, float)
- $gte - greater than or equal to (int, float)
- $lt - less than (int, float)
- $lte - less than or equal to (int, float)
Using the $eq operator is equivalent to supplying the value directly in the where filter.
{
    "metadata_field": "search_string"
}
# is equivalent to
{
    "metadata_field": {
        "$eq": "search_string"
    }
}
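The operator semantics and the $eq shorthand can be expressed as a small evaluator. This is a sketch for building intuition, not Chroma's implementation: normalize and matches are hypothetical names, several fields in one dictionary are assumed to be combined with a logical AND, and a record that lacks the field is assumed not to match.

```python
import operator

# One Python comparison per documented filter operator.
OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def normalize(where):
    """Expand the shorthand {"field": value} into {"field": {"$eq": value}}."""
    return {
        field: condition if isinstance(condition, dict) else {"$eq": condition}
        for field, condition in where.items()
    }


def matches(metadata, where):
    """Return True if a metadata dictionary satisfies every condition in where."""
    for field, condition in normalize(where).items():
        # Assumption: a missing field never matches.
        if field not in metadata:
            return False
        for op_name, expected in condition.items():
            if not OPERATORS[op_name](metadata[field], expected):
                return False
    return True
```

With this sketch, matches({"chapter": "3"}, {"chapter": "3"}) and matches({"chapter": "3"}, {"chapter": {"$eq": "3"}}) give the same answer, which is the equivalence shown above.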
Filtering by document contents
In order to filter on document contents, you must supply a where_document filter dictionary to the query. The dictionary must have the following structure:
# Filtering for a search_string
{
    "$contains": "search_string"
}
Using logical operators
You can also use the logical operators $and and $or to combine multiple filters.
An $and operator will return results that match all of the filters in the list.
{
    "$and": [
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        },
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        }
    ]
}
An $or operator will return results that match any of the filters in the list.
{
    "$or": [
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        },
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        }
    ]
}
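The "all of the filters" and "any of the filters" behaviour maps directly onto Python's all and any. The sketch below is a hypothetical evaluator, not Chroma's code. To stay short, its leaf filters only handle the plain {"field": value} equality form, and it assumes that $and and $or may be nested inside each other.

```python
def matches_logical(metadata, where):
    """Hypothetical sketch: evaluate $and / $or over simple equality filters."""
    if "$and" in where:
        # Every sub-filter in the list has to match.
        return all(matches_logical(metadata, sub) for sub in where["$and"])
    if "$or" in where:
        # At least one sub-filter in the list has to match.
        return any(matches_logical(metadata, sub) for sub in where["$or"])
    # Leaf filter: plain {"field": value} equality on every listed field.
    return all(metadata.get(field) == value for field, value in where.items())


example = {"chapter": "3", "verse": "16"}
both = matches_logical(example, {"$and": [{"chapter": "3"}, {"verse": "16"}]})
either = matches_logical(example, {"$or": [{"chapter": "29"}, {"verse": "16"}]})
```

Here both and either are each True: the record satisfies both $and conditions, and one of the two $or conditions.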
Updating data in a collection
Any property of items in a collection can be updated using .update.
collection.update(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)
If an id is not found in the collection, an exception will be raised. If documents are supplied without corresponding embeddings, the embeddings will be recomputed with the collection's embedding function.
If the supplied embeddings are not the same dimension as the collection, an exception will be raised.
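The three update rules above (unknown id, recomputed embedding, dimension check) can be traced with a small in-memory model. Everything here is a stand-in for illustration: apply_update and toy_embed are hypothetical, the store is a plain dictionary rather than a Chroma collection, and replacing the metadata wholesale instead of merging it is an assumption.

```python
def toy_embed(text):
    """Stand-in embedding function: a fixed 3-dimensional vector per text."""
    return [float(len(text)), 0.0, 0.0]


def apply_update(store, embed, item_id, document=None, embedding=None, metadata=None):
    """Hypothetical sketch of the update rules on a dict of id -> record."""
    if item_id not in store:
        raise KeyError(f"id {item_id!r} not found in the collection")
    record = store[item_id]

    # A new document without an embedding triggers re-embedding.
    if document is not None and embedding is None:
        embedding = embed(document)

    # Validate before changing anything, so a failed update leaves no trace.
    if embedding is not None and len(embedding) != len(record["embedding"]):
        raise ValueError("embedding dimension does not match the collection")

    if document is not None:
        record["document"] = document
    if embedding is not None:
        record["embedding"] = list(embedding)
    if metadata is not None:
        # Assumption: metadata is replaced, not merged.
        record["metadata"] = dict(metadata)
    return record
```

Fields that are not passed stay as they were, so updating only the metadata leaves the stored document and embedding untouched.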
Deleting data from a collection
Chroma supports deleting items from a collection by id using .delete. The embeddings, documents, and metadata associated with each item will be deleted.
⚠️ Naturally, this is a destructive operation, and cannot be undone.
collection.delete(
    ids=["id1", "id2", "id3", ...],
    where={"chapter": "20"}
)
await collection.delete(
    ["id1", "id2", "id3", ...], // ids
    {"chapter": "20"} // where
)
.delete also supports the where filter. If no ids are supplied, it will delete all items in the collection that match the where filter.