Concatenate datasets to a single array store

In the previous notebooks, we’ve seen how to incrementally create a collection of scRNA-seq datasets and train models on it.

Sometimes we want to concatenate all datasets into one big array to speed up ad-hoc queries for slices by arbitrary metadata (see this blog post). This is what CELLxGENE does to create Census: a number of .h5ad files are concatenated into a single tiledbsoma array store (CELLxGENE: scRNA-seq).

Note

This notebook shows how lamindb can be used with tiledbsoma append mode, which is also explained in the tiledbsoma documentation.

import lamindb as ln
import pandas as pd
import scanpy as sc
import tiledbsoma.io
from functools import reduce

→ connected lamindb: testuser1/test-scrna

ln.context.uid = "oJN8WmVrxI8m0000"
ln.context.track()

Query the collection of h5ad files that we’d like to convert into a single array.

collection = ln.Collection.get(
    name="My versioned scRNA-seq collection", version="2"
)
collection.describe()

Prepare the AnnData objects

We need to prepare the AnnData objects in the collection to be concatenated into one tiledbsoma.Experiment. They need to have the same .var and .obs columns, and .uns and .obsp should be removed.

adatas = [artifact.load() for artifact in collection.ordered_artifacts]

Compute the intersection of all columns. All AnnData objects should have the same columns in their .obs, .var, and .raw.var to be ingested into one tiledbsoma.Experiment.

obs_columns = reduce(pd.Index.intersection, [adata.obs.columns for adata in adatas])
var_columns = reduce(pd.Index.intersection, [adata.var.columns for adata in adatas])
var_raw_columns = reduce(pd.Index.intersection, [adata.raw.var.columns for adata in adatas])
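
To see what the reduce call does, here is a minimal sketch on toy data. The two tables and their column names are made up for illustration; only reduce and pd.Index.intersection are taken from the cell above.

```python
from functools import reduce

import pandas as pd

# Hypothetical .obs tables of two datasets with partly overlapping columns.
obs_a = pd.DataFrame(columns=["cell_type", "donor", "tissue"])
obs_b = pd.DataFrame(columns=["tissue", "cell_type", "assay"])

# reduce applies Index.intersection pairwise, left to right,
# so only columns present in every table survive.
shared = reduce(pd.Index.intersection, [obs_a.columns, obs_b.columns])

print(sorted(shared))
```

Columns that appear in only one table (donor, assay) are dropped, which is why the later filtering step can rely on the result being valid for every dataset.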

Prepare the AnnData objects for concatenation: prepare id fields, sanitize index names, intersect columns, and drop slots. Here we have to drop .obsp, .uns, and any dataframe columns that are not in the intersections obtained above; otherwise the ingestion will fail. We will need to provide obs and var names in ln.integrations.save_tiledbsoma_experiment, so we create these fields (obs_id, var_id) from the dataframe indices.

for i, adata in enumerate(adatas):
    del adata.obsp
    del adata.uns
    
    adata.obs = adata.obs.filter(obs_columns)
    adata.obs["obs_id"] = adata.obs.index
    adata.obs["dataset"] = i
    adata.obs.index.name = None
    
    adata.var = adata.var.filter(var_columns)
    adata.var["var_id"] = adata.var.index
    adata.var.index.name = None
    
    drop_raw_var_columns = adata.raw.var.columns.difference(var_raw_columns)
    adata.raw.var.drop(columns=drop_raw_var_columns, inplace=True)
    adata.raw.var["var_id"] = adata.raw.var.index
    adata.raw.var.index.name = None
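
The per-table part of this loop can be sketched on a plain pandas DataFrame, without AnnData. The helper name prepare_table and the toy table are hypothetical; the three steps mirror what the loop does to .obs and .var.

```python
import pandas as pd


def prepare_table(df: pd.DataFrame, keep: list[str], id_name: str) -> pd.DataFrame:
    """Keep shared columns, copy the index into an id column, clear the index name."""
    out = df.filter(keep).copy()   # drop columns outside the intersection
    out[id_name] = out.index       # expose the index as a regular column
    out.index.name = None          # sanitize the index name, as the notebook does
    return out


# Hypothetical .obs table with an extra column and a named index.
obs = pd.DataFrame(
    {"cell_type": ["T cell", "B cell"], "batch": [1, 2]},
    index=pd.Index(["AAAC", "TTTG"], name="barcode"),
)
prepared = prepare_table(obs, keep=["cell_type"], id_name="obs_id")
print(prepared)
```

Working on a copy keeps the input table untouched, which makes the step easy to re-run while experimenting.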

Create the array store

Ingest the AnnData objects. This saves the AnnData objects in one array store, creates an Artifact for it, and saves the artifact. The function also writes the current run.uid to the obs of the tiledbsoma.Experiment, under lamin_run_uid.

If you know the tiledbsoma API, note that ln.integrations.save_tiledbsoma_experiment includes both tiledbsoma.io.register_anndatas and tiledbsoma.io.from_anndata.

soma_artifact = ln.integrations.save_tiledbsoma_experiment(
    adatas,
    description="tiledbsoma experiment",
    measurement_name="RNA",
    obs_id_name="obs_id",
    var_id_name="var_id",
    append_obsm_varm=True
)
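
For readers unfamiliar with the two tiledbsoma steps, the following toy illustrates the general idea: a registration pass first assigns every obs and var identifier a stable integer position across all inputs, and an ingestion pass then writes each dataset at those positions. This is plain Python under simplifying assumptions, not the tiledbsoma implementation; register, ingest and the dictionary layout are hypothetical.

```python
def register(datasets, key):
    """Assign each distinct id a position, in order of first appearance."""
    mapping = {}
    for dataset in datasets:
        for label in dataset[key]:
            mapping.setdefault(label, len(mapping))
    return mapping


def ingest(datasets, obs_map, var_map):
    """Write every stored value at its globally registered (row, column)."""
    store = {}
    for dataset in datasets:
        for (obs_id, var_id), value in dataset["X"].items():
            store[(obs_map[obs_id], var_map[var_id])] = value
    return store


# Two hypothetical datasets that share one gene (g2).
datasets = [
    {"obs_id": ["c1", "c2"], "var_id": ["g1", "g2"],
     "X": {("c1", "g1"): 3, ("c2", "g2"): 1}},
    {"obs_id": ["c3"], "var_id": ["g2", "g3"],
     "X": {("c3", "g3"): 5}},
]

obs_map = register(datasets, "obs_id")
var_map = register(datasets, "var_id")
store = ingest(datasets, obs_map, var_map)
print(store)
```

Registering all inputs before writing any of them is what lets the shared gene g2 land in the same column for both datasets.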

Query the array store

Open and query the experiment. We can use the registered Artifact. Here we query obs from the array store.

with soma_artifact.open() as soma_store:
    obs = soma_store["obs"]
    var = soma_store["ms"]["RNA"]["var"]
    
    obs_columns_store = obs.schema.names
    var_columns_store = var.schema.names
    
    obs_store_df = obs.read().concat().to_pandas()
    
    print(obs_store_df)

Append AnnData to the array store

Prepare a new AnnData object to be appended to the store.

adata = ln.core.datasets.anndata_with_obs()

adata.obs_names_make_unique()
adata.var_names_make_unique()

adata.obs["obs_id"] = adata.obs.index
adata.var["var_id"] = adata.var.index

adata.obs["dataset"] = obs_store_df["dataset"].max() + 1  # give the new dataset the next free index

obs_columns_same = [obs_col for obs_col in adata.obs.columns if obs_col in obs_columns_store]
adata.obs = adata.obs[obs_columns_same]

var_columns_same = [var_col for var_col in adata.var.columns if var_col in var_columns_store]
adata.var = adata.var[var_columns_same]

adata.write_h5ad("adata_to_append.h5ad")
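
The column alignment above can be sketched on toy data. The helper align_to_schema, the store schema and the new table are hypothetical; soma_joinid is the row id column that tiledbsoma maintains itself, so it is excluded from the check for missing columns.

```python
import pandas as pd


def align_to_schema(df: pd.DataFrame, schema_names: list[str]) -> pd.DataFrame:
    """Keep only the columns that already exist in the store schema."""
    shared = [col for col in df.columns if col in schema_names]
    return df[shared]


# Hypothetical store schema, as it might be read from obs.schema.names.
store_columns = ["soma_joinid", "cell_type", "obs_id", "dataset"]

new_obs = pd.DataFrame(
    {"cell_type": ["T cell"], "quality": [0.9], "obs_id": ["obs0"], "dataset": [2]}
)
aligned = align_to_schema(new_obs, store_columns)

# Store columns that the new table does not provide.
missing = [
    col for col in store_columns
    if col not in aligned.columns and col != "soma_joinid"
]
print(list(aligned.columns), missing)
```

The filter only removes extra columns; it may be worth inspecting missing before appending, since this sketch does not add columns that the new table lacks.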

Append the AnnData object from disk. This also creates a new version of soma_artifact.

soma_artifact = ln.integrations.save_tiledbsoma_experiment(
    ["adata_to_append.h5ad"],
    revises=soma_artifact,
    measurement_name="RNA",
    obs_id_name="obs_id",
    var_id_name="var_id"
)

Update the array store

Read X from the store.

with soma_artifact.open() as soma_store: # mode="r" by default
    ms_rna = soma_store["ms"]["RNA"]
    n_obs = len(soma_store["obs"])
    n_var = len(ms_rna["var"])
    X = ms_rna["X"]["data"].read().coos((n_obs, n_var)).concat().to_scipy()

Calculate PCA from the queried X.

pca_array = sc.pp.pca(X, n_comps=2)
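
As a rough illustration of what a two-component PCA computes, here is a minimal NumPy sketch based on a singular value decomposition of the centered matrix. It works on a small dense toy matrix; pca_sketch is a hypothetical helper, and scanpy chooses its own solver and may flip component signs, so the values will not match sc.pp.pca exactly.

```python
import numpy as np


def pca_sketch(X: np.ndarray, n_comps: int = 2) -> np.ndarray:
    """Project the rows of X onto the top n_comps principal components."""
    centered = X - X.mean(axis=0)                  # center each feature
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_comps] * S[:n_comps]            # coordinates of each row


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 5))                   # 20 observations, 5 features
coords = pca_sketch(X_toy, n_comps=2)
print(coords.shape)
```

The result has one row per observation and one column per component, which is the shape expected for a matrix stored under obsm.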

soma_artifact

Artifact(uid='nCqIkb08iwOsKEnN0001', is_latest=True, description='tiledbsoma experiment', key='.lamindb/nCqIkb08iwOsKEnN.tiledbsoma', suffix='.tiledbsoma', size=15068509, hash='1zWHgdughEQI1t8Ig1AJQQ', n_objects=173, _hash_type='md5-d', visibility=1, _key_is_virtual=False, created_by_id=1, storage_id=1, transform_id=6, run_id=6, updated_at='2024-09-02 13:27:52 UTC')

Open the array store in write mode and add PCA. When the store is updated, the corresponding artifact also gets updated with a new version.

with soma_artifact.open(mode="w") as soma_store:
    tiledbsoma.io.add_matrix_to_collection(
        exp=soma_store,
        measurement_name="RNA",
        collection_name="obsm",
        matrix_name="pca",
        matrix_data=pca_array
    )

Note that the artifact has changed: compared with the output above, the uid suffix, size, hash, and n_objects all differ.

soma_artifact

Artifact(uid='nCqIkb08iwOsKEnN0002', is_latest=True, description='tiledbsoma experiment', key='.lamindb/nCqIkb08iwOsKEnN.tiledbsoma', suffix='.tiledbsoma', size=15089313, hash='skVIFgqaFfBVQtspHRl0TQ', n_objects=182, _hash_type='md5-d', visibility=1, _key_is_virtual=False, created_by_id=1, storage_id=1, transform_id=6, run_id=6, updated_at='2024-09-02 13:27:53 UTC')
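
Judging from the two outputs in this notebook, the uid keeps its leading part and only the trailing characters change between versions. The sketch below makes that comparison explicit; split_uid is a hypothetical helper, and the 4-character suffix length is an assumption read off these two uids.

```python
def split_uid(uid: str, n_version_chars: int = 4) -> tuple[str, str]:
    """Split a uid into (stem, version suffix); the suffix length is an assumption."""
    return uid[:-n_version_chars], uid[-n_version_chars:]


# The two uids printed in this notebook, before and after adding PCA.
before = split_uid("nCqIkb08iwOsKEnN0001")
after = split_uid("nCqIkb08iwOsKEnN0002")

same_stem = before[0] == after[0]        # both versions share one stem
version_changed = before[1] != after[1]  # only the suffix moved on
print(before, after, same_stem, version_changed)
```

The shared stem also matches the key printed for both versions, .lamindb/nCqIkb08iwOsKEnN.tiledbsoma.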