Analysis flow#
Here, we’ll track typical data transformations, such as subsetting, that occur during analysis.
For a more general introduction, read this first: Project flow.
Setup#
# a lamindb instance containing Bionty schema
!lamin init --storage ./analysis-usecase --schema bionty
✅ saved: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-11-20 22:26:00 UTC)
✅ saved: Storage(uid='KAW722Tk', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-11-20 22:26:00 UTC, created_by_id=1)
💡 loaded instance: testuser1/analysis-usecase
💡 did not register local instance on hub
import lamindb as ln
import lnschema_bionty as lb
from lamin_utils import logger
lb.settings.organism = "human" # globally set organism
lb.settings.auto_save_parents = False
💡 lamindb instance: testuser1/analysis-usecase
Register an initial dataset#
Here we register an initial file with a pipeline script.
# register_example_file.py
def register_example_file():
    # create a pipeline transform to track the registration of the file
    transform = ln.Transform(
        name="register example file", type="pipeline", version="0.0.1"
    )
    ln.track(transform)
    # an example dataset that has a few cell type, tissue and disease annotations
    adata = ln.dev.datasets.anndata_with_obs()
    # validate and register features
    genes = lb.Gene.from_values(adata.var_names, lb.Gene.ensembl_gene_id)
    ln.save(genes)
    obs_features = ln.Feature.from_df(adata.obs)
    ln.save(obs_features)
    # validate and register labels
    cell_types = lb.CellType.from_values(adata.obs["cell_type"])
    ln.save(cell_types)
    tissues = lb.Tissue.from_values(adata.obs["tissue"])
    ln.save(tissues)
    diseases = lb.Disease.from_values(adata.obs["disease"])
    ln.save(diseases)
    # register file and annotate with features & labels
    file = ln.File.from_anndata(
        adata, description="anndata with obs", field=lb.Gene.ensembl_gene_id
    )
    file.save()
    features = ln.Feature.lookup()
    file.labels.add(cell_types, features.cell_type)
    file.labels.add(tissues, features.tissue)
    file.labels.add(diseases, features.disease)


register_example_file()
💡 saved: Transform(uid='HX0OVMxrZbb5Nl', name='register example file', version='0.0.1', type='pipeline', updated_at=2023-11-20 22:26:01 UTC, created_by_id=1)
💡 saved: Run(uid='XuwmTQPinm0P6Us6gnrd', run_at=2023-11-20 22:26:01 UTC, transform_id=1, created_by_id=1)
❗ did not create CellType record for 1 non-validated name: 'my new cell type'
... storing 'cell_type' as categorical
... storing 'cell_type_id' as categorical
... storing 'tissue' as categorical
... storing 'disease' as categorical
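The warning above reports that one name, 'my new cell type', was not validated and therefore got no CellType record. The idea behind such a validation step can be sketched in plain Python, independent of LaminDB and Bionty. The reference set and the helper `split_validated` below are hypothetical and only illustrate the principle; they are not how `from_values` is implemented.

```python
# Toy reference vocabulary (hypothetical); a real ontology is far larger.
REFERENCE_CELL_TYPES = {"T cell", "hematopoietic stem cell", "hepatocyte"}


def split_validated(values, reference):
    """Return (validated, non_validated) unique values in first-seen order."""
    validated, non_validated = [], []
    for value in dict.fromkeys(values):  # de-duplicate while keeping order
        if value in reference:
            validated.append(value)
        else:
            non_validated.append(value)
    return validated, non_validated


observed = ["T cell", "hepatocyte", "my new cell type", "T cell"]
validated, non_validated = split_validated(observed, REFERENCE_CELL_TYPES)
print("validated:", validated)
print("not validated:", non_validated)
```

Only the validated names would be turned into records; the remaining names are reported so that they can be fixed or registered deliberately.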
Pull the registered dataset, apply a transformation, and register the result#
Set the current notebook as the new transform:
ln.track()
💡 notebook imports: lamin_utils==0.11.7 lamindb==0.61.0 lnschema_bionty==0.35.1
💡 saved: Transform(uid='eNef4Arw8nNMz8', name='Analysis flow', short_name='analysis-flow', version='0', type=notebook, updated_at=2023-11-20 22:26:06 UTC, created_by_id=1)
💡 saved: Run(uid='CDEr3LQ6v4E6q2o5t1tT', run_at=2023-11-20 22:26:06 UTC, transform_id=2, created_by_id=1)
file = ln.File.filter(description="anndata with obs").one()
file.describe()
File(uid='89J6XHV6w7ZsLviEvGUv', suffix='.h5ad', accessor='AnnData', description='anndata with obs', size=46992, hash='IJORtcQUSS11QBqD-nTD0A', hash_type='md5', visibility=0, key_is_virtual=True, updated_at=2023-11-20 22:26:06 UTC)
Provenance:
🗃️ storage: Storage(uid='KAW722Tk', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-11-20 22:26:00 UTC, created_by_id=1)
🧩 transform: Transform(uid='HX0OVMxrZbb5Nl', name='register example file', version='0.0.1', type='pipeline', updated_at=2023-11-20 22:26:01 UTC, created_by_id=1)
👣 run: Run(uid='XuwmTQPinm0P6Us6gnrd', run_at=2023-11-20 22:26:01 UTC, transform_id=1, created_by_id=1)
👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-11-20 22:26:00 UTC)
Features:
var: FeatureSet(uid='y6gK7vGLzROkUsEb5x87', n=99, type='number', registry='bionty.Gene', hash='fHbDaAAmJse48vnUQh9C', updated_at=2023-11-20 22:26:06 UTC, created_by_id=1)
'TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'C1orf112', 'FGR', 'CFH', 'FUCA2', 'GCLC', 'NFYA', 'STPG1', 'NIPAL3', 'LAS1L', 'ENPP4', 'SEMA3F', 'CFTR', 'ANKIB1', 'CYP51A1', 'KRIT1', 'RAD52', ...
obs: FeatureSet(uid='Vn3DqHZD5fFDkNg5QZ0I', n=4, registry='core.Feature', hash='GPb-hSMIzU0VkTTskyle', updated_at=2023-11-20 22:26:06 UTC, created_by_id=1)
🔗 cell_type (3, bionty.CellType): 'T cell', 'hematopoietic stem cell', 'hepatocyte'
cell_type_id (category)
🔗 tissue (4, bionty.Tissue): 'kidney', 'liver', 'heart', 'brain'
🔗 disease (4, bionty.Disease): 'chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease'
Labels:
🏷️ tissues (4, bionty.Tissue): 'kidney', 'liver', 'heart', 'brain'
🏷️ cell_types (3, bionty.CellType): 'T cell', 'hematopoietic stem cell', 'hepatocyte'
🏷️ diseases (4, bionty.Disease): 'chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease'
Get a backed AnnData object#
adata = file.backed()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
constructed for the AnnData object 89J6XHV6w7ZsLviEvGUv.h5ad
obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
var: ['_index']
Subset dataset to specific cell types and diseases#
cell_types = file.cell_types.all().lookup(return_field="name")
diseases = file.diseases.all().lookup(return_field="name")
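The lookup objects expose each label under an attribute-friendly key, for example `t_cell` for 'T cell', which enables auto-completion. A minimal sketch of that idea follows, assuming keys are derived by lowercasing and replacing non-alphanumeric characters with underscores. The helpers `to_attribute_name` and `make_lookup` are hypothetical and not part of LaminDB; the library's actual key rules may differ.

```python
import re
from types import SimpleNamespace


def to_attribute_name(name):
    """Lowercase a label and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")


def make_lookup(names):
    """Build an object whose attributes return the original label names."""
    return SimpleNamespace(**{to_attribute_name(n): n for n in names})


# Hypothetical stand-in for file.cell_types.all().lookup(return_field="name")
cell_types_demo = make_lookup(["T cell", "hematopoietic stem cell", "hepatocyte"])
print(cell_types_demo.t_cell)
print(cell_types_demo.hematopoietic_stem_cell)
```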
Create the subset:
subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type disease
T cell chronic kidney disease 10
hematopoietic stem cell liver lymphoma 10
dtype: int64
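The subset is defined by a boolean mask: a row is kept only if its cell type is in the first list and its disease is in the second. The same logic can be sketched without AnnData or pandas on a few made-up rows. The data below is hypothetical and serves only to show how the two conditions combine.

```python
from collections import Counter

# Made-up observations standing in for adata.obs (hypothetical data).
obs = [
    {"cell_type": "T cell", "disease": "chronic kidney disease"},
    {"cell_type": "T cell", "disease": "Alzheimer disease"},
    {"cell_type": "hematopoietic stem cell", "disease": "liver lymphoma"},
    {"cell_type": "hepatocyte", "disease": "liver lymphoma"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# One boolean per row: True only if both conditions hold (logical AND).
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]
subset = [row for row, keep in zip(obs, mask) if keep]

# Rough equivalent of value_counts() over the two columns.
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
for (cell_type, disease), n in counts.items():
    print(cell_type, "|", disease, "|", n)
```

Rows that satisfy only one of the two conditions, such as the hepatocyte with liver lymphoma, are dropped.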
Register the subsetted AnnData:
file_subset = ln.File.from_anndata(
    adata_subset.to_memory(),
    description="anndata with obs subset",
    field=lb.Gene.ensembl_gene_id,
)
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/anndata/_core/anndata.py:1899: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
file_subset.save()
features = ln.Feature.lookup()
file_subset.labels.add(adata_subset.obs.cell_type, features.cell_type)
file_subset.labels.add(adata_subset.obs.disease, features.disease)
file_subset.labels.add(adata_subset.obs.tissue, features.tissue)
Examine data flow#
Query a subsetted .h5ad file containing “hematopoietic stem cell” and “T cell”:
cell_types = lb.CellType.lookup()
my_subset = ln.File.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()
my_subset
File(uid='9qJKeCgrVfv5tnnPPBSt', suffix='.h5ad', accessor='AnnData', description='anndata with obs subset', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', hash_type='md5', visibility=0, key_is_virtual=True, updated_at=2023-11-20 22:26:07 UTC, storage_id=1, transform_id=2, run_id=2, created_by_id=1)
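The filter above uses keyword arguments of the form `field__lookup=value`, in the style of Django query lookups. The toy function below sketches how such conditions can be read, assuming that `__in` on a many-valued field matches when at least one related value is in the given list. The function `matches` and the dictionaries are hypothetical; the real query is translated to SQL by the library rather than evaluated in Python.

```python
def matches(record, **conditions):
    """Check a dict record against `field__lookup=value` style conditions."""
    for key, expected in conditions.items():
        field, _, lookup = key.partition("__")
        actual = record[field]
        if lookup == "":
            ok = actual == expected
        elif lookup == "endswith":
            ok = actual.endswith(expected)
        elif lookup == "in":
            # Many-valued field: match if any related value is in the list.
            values = actual if isinstance(actual, (list, set, tuple)) else [actual]
            ok = any(value in expected for value in values)
        else:
            raise ValueError(f"unsupported lookup: {lookup}")
        if not ok:
            return False
    return True


# Hypothetical stand-ins for the two registered files.
records = [
    {
        "suffix": ".h5ad",
        "description": "anndata with obs",
        "cell_types": ["T cell", "hematopoietic stem cell", "hepatocyte"],
    },
    {
        "suffix": ".h5ad",
        "description": "anndata with obs subset",
        "cell_types": ["T cell", "hematopoietic stem cell"],
    },
]

hits = [
    r
    for r in records
    if matches(
        r,
        suffix=".h5ad",
        description__endswith="subset",
        cell_types__in=["hematopoietic stem cell", "T cell"],
    )
]
first = hits[0] if hits else None
print(first["description"] if first else None)
```

Here the `description__endswith="subset"` condition is what separates the subset from its parent, since both files carry the requested cell types.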
Common questions that might arise are:
What is the history of this file?
Which features and labels are associated with it?
Which notebook analyzed and registered this file?
By whom?
And which file is its parent?
Let’s answer these questions using LaminDB:
print("--> What is the history of this file?\n")
file_subset.view_flow()
print("\n\n--> Which features and labels are associated with it?\n")
logger.print(file_subset.features)
logger.print(file_subset.labels)
print("\n\n--> Which notebook analyzed and registered this file\n")
logger.print(file_subset.transform)
print("\n\n--> By whom\n")
logger.print(file_subset.created_by)
print("\n\n--> And which file is its parent\n")
display(file_subset.run.input_files.df())
--> What is the history of this file?
--> Which features and labels are associated with it?
Features:
var: FeatureSet(uid='y6gK7vGLzROkUsEb5x87', n=99, type='number', registry='bionty.Gene', hash='fHbDaAAmJse48vnUQh9C', updated_at=2023-11-20 22:26:06 UTC, created_by_id=1)
'TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'C1orf112', 'FGR', 'CFH', 'FUCA2', 'GCLC', 'NFYA', 'STPG1', 'NIPAL3', 'LAS1L', 'ENPP4', 'SEMA3F', 'CFTR', 'ANKIB1', 'CYP51A1', 'KRIT1', 'RAD52', ...
obs: FeatureSet(uid='Vn3DqHZD5fFDkNg5QZ0I', n=4, registry='core.Feature', hash='GPb-hSMIzU0VkTTskyle', updated_at=2023-11-20 22:26:06 UTC, created_by_id=1)
🔗 cell_type (2, bionty.CellType): 'T cell', 'hematopoietic stem cell'
cell_type_id (category)
🔗 tissue (2, bionty.Tissue): 'kidney', 'liver'
🔗 disease (2, bionty.Disease): 'chronic kidney disease', 'liver lymphoma'
Labels:
🏷️ tissues (2, bionty.Tissue): 'kidney', 'liver'
🏷️ cell_types (2, bionty.CellType): 'T cell', 'hematopoietic stem cell'
🏷️ diseases (2, bionty.Disease): 'chronic kidney disease', 'liver lymphoma'
--> Which notebook analyzed and registered this file
Transform(uid='eNef4Arw8nNMz8', name='Analysis flow', short_name='analysis-flow', version='0', type=notebook, updated_at=2023-11-20 22:26:06 UTC, created_by_id=1)
--> By whom
User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-11-20 22:26:00 UTC)
--> And which file is its parent
id | uid | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | visibility | key_is_virtual | updated_at | created_by_id
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 89J6XHV6w7ZsLviEvGUv | 1 | None | .h5ad | AnnData | anndata with obs | None | 46992 | IJORtcQUSS11QBqD-nTD0A | md5 | 1 | 1 | None | 0 | True | 2023-11-20 22:26:06.498729+00:00 | 1
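The answers above all follow from three linked records: a file points to the run that created it, the run points to its transform, and the run lists its input files. A conceptual sketch of that data model in plain Python is shown below. The classes and the `lineage` helper are hypothetical simplifications for illustration and are not LaminDB's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Transform:
    name: str
    type: str  # e.g. "pipeline" or "notebook"


@dataclass
class Run:
    transform: Transform
    input_files: list = field(default_factory=list)


@dataclass
class File:
    description: str
    run: Run  # the run that created this file


def lineage(file):
    """Walk upstream from a file, returning (file, transform) name pairs."""
    steps, stack, seen = [], [file], set()
    while stack:
        current = stack.pop()
        if id(current) in seen:  # guard against visiting a file twice
            continue
        seen.add(id(current))
        steps.append((current.description, current.run.transform.name))
        stack.extend(current.run.input_files)
    return steps


# Rebuild the flow of this guide with the toy classes.
pipeline = Transform(name="register example file", type="pipeline")
parent = File(description="anndata with obs", run=Run(transform=pipeline))

notebook = Transform(name="Analysis flow", type="notebook")
child = File(
    description="anndata with obs subset",
    run=Run(transform=notebook, input_files=[parent]),
)

for description, transform_name in lineage(child):
    print(f"{description!r} was created by {transform_name!r}")
```

In this sketch, the parent of a file is simply any file listed in `run.input_files` of the run that created it, which mirrors the `file_subset.run.input_files` query used above.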
!lamin delete --force analysis-usecase
!rm -r ./analysis-usecase
💡 deleting instance testuser1/analysis-usecase
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--analysis-usecase.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase