
Query individual files#

Here, we’ll query individual files and inspect their metadata.

This guide can be skipped if you are only interested in how to leverage the overall dataset.

import lamindb as ln
import lnschema_bionty as lb
import anndata as ad
💡 lamindb instance: testuser1/test-scrna
ln.track()
💡 notebook imports: anndata==0.9.2 lamindb==0.61.0 lnschema_bionty==0.35.1
💡 saved: Transform(uid='agayZTonayqAz8', name='Query individual files', short_name='scrna3', version='0', type=notebook, updated_at=2023-11-20 22:27:15 UTC, created_by_id=1)
💡 saved: Run(uid='YBxTFVlWwO8eYaCU86wh', run_at=2023-11-20 22:27:15 UTC, transform_id=3, created_by_id=1)

Query files by provenance metadata#

users = ln.User.lookup()
ln.Transform.filter(created_by=users.testuser1).search("scrna")
name                        uid             score
scRNA-seq                   Nv48yAceNSh8z8  90.0
Append a new batch of data  ManDYgmftZ8Cz8  36.0
Query individual files      agayZTonayqAz8  36.0
transform = ln.Transform.filter(uid="Nv48yAceNSh8z8").one()
ln.File.filter(transform=transform).df()
id  uid                   storage_id  key                 suffix  accessor  description                      version  size      hash                    hash_type  transform_id  run_id  initial_version_id  visibility  key_is_virtual  updated_at                        created_by_id
1   pscxqpvwQa4OPAZUyTXQ  1           scrna/conde22.h5ad  .h5ad   AnnData   Human immune cells from Conde22  None     57612943  9sXda5E7BYiVoDOQkTC0KB  sha1-fl    1             1       None                0           True            2023-11-20 22:26:39.939281+00:00  1

Query files by biological metadata#

assays = lb.ExperimentalFactor.lookup()
organism = lb.Organism.lookup()
cell_types = lb.CellType.lookup()
query = ln.File.filter(
    experimental_factors=assays.single_cell_rna_sequencing,
    organism=organism.human,
    cell_types=cell_types.gamma_delta_t_cell,
)
query.df()
id  uid                   storage_id  key                 suffix  accessor  description                      version  size      hash                    hash_type  transform_id  run_id  initial_version_id  visibility  key_is_virtual  updated_at                        created_by_id
1   pscxqpvwQa4OPAZUyTXQ  1           scrna/conde22.h5ad  .h5ad   AnnData   Human immune cells from Conde22  None     57612943  9sXda5E7BYiVoDOQkTC0KB  sha1-fl    1             1       None                0           True            2023-11-20 22:26:39.939281+00:00  1
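
The keyword arguments passed to ln.File.filter() above combine with AND semantics, as in the Django-style query sets that lamindb builds on: a file is returned only if it carries all three labels. The sketch below mimics that behavior on plain dictionaries. It is a conceptual illustration only: the `records` list and the `filter_files` helper are hypothetical names, not part of lamindb, and the second record is invented for contrast.

```python
# Toy stand-in for a file registry: each record lists the labels annotating a file.
# The second record is invented for contrast and does not exist in the instance above.
records = [
    {
        "description": "Human immune cells from Conde22",
        "experimental_factors": {"single-cell RNA sequencing", "10x 3' v3"},
        "organism": {"human"},
        "cell_types": {"gamma-delta T cell", "classical monocyte"},
    },
    {
        "description": "hypothetical mouse dataset",
        "experimental_factors": {"single-cell RNA sequencing"},
        "organism": {"mouse"},
        "cell_types": {"gamma-delta T cell"},
    },
]


def filter_files(records, **criteria):
    """Return the records annotated with every requested label (AND semantics)."""
    return [
        record
        for record in records
        if all(label in record[field] for field, label in criteria.items())
    ]


matches = filter_files(
    records,
    experimental_factors="single-cell RNA sequencing",
    organism="human",
    cell_types="gamma-delta T cell",
)
print([record["description"] for record in matches])
```

Dropping a criterion widens the result: filtering on the cell type alone would match both toy records.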

Inspect file metadata#

query_set = ln.File.filter().all()

file1, file2 = query_set[0], query_set[1]
file1.describe()
File(uid='pscxqpvwQa4OPAZUyTXQ', key='scrna/conde22.h5ad', suffix='.h5ad', accessor='AnnData', description='Human immune cells from Conde22', size=57612943, hash='9sXda5E7BYiVoDOQkTC0KB', hash_type='sha1-fl', visibility=0, key_is_virtual=True, updated_at=2023-11-20 22:26:39 UTC)

Provenance:
  🗃️ storage: Storage(uid='3D34qtUi', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-11-20 22:26:18 UTC, created_by_id=1)
  📔 transform: Transform(uid='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type='notebook', updated_at=2023-11-20 22:26:22 UTC, created_by_id=1)
  👣 run: Run(uid='46kG4CBBe1x2VtWQJUIl', run_at=2023-11-20 22:26:22 UTC, transform_id=1, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-11-20 22:26:18 UTC)
  ⬇️ input_of (core.Run): ['2023-11-20 22:26:46 UTC']
Features:
  var: FeatureSet(uid='hv5R7hZCb92nA84ilYwE', n=36390, type='number', registry='bionty.Gene', hash='rMZltwoBCMdVPVR8x6nJ', updated_at=2023-11-20 22:26:37 UTC, created_by_id=1)
    'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'OR4F29', 'None', 'OR4F16', 'None', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'None', ...
  obs: FeatureSet(uid='590M3uln2vLdbw0o79D8', n=4, registry='core.Feature', hash='AJq64lFK7nTxOb_VkyrX', updated_at=2023-11-20 22:26:38 UTC, created_by_id=1)
    🔗 cell_type (32, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
    🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2', '10x 5' v1'
    🔗 tissue (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
    🔗 donor (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
Labels:
  🏷️ organism (1, bionty.Organism): 'human'
  🏷️ tissues (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
  🏷️ cell_types (32, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2', '10x 5' v1'
  🏷️ ulabels (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
file1.view_flow()
[Figure: data lineage graph rendered by file1.view_flow()]
file2.describe()
File(uid='z0DtvXhdZMNVoqIagoG1', suffix='.h5ad', accessor='AnnData', description='10x reference adata', size=857752, hash='SAuVZAKKM_Ypj_0SdrhDIg', hash_type='md5', visibility=0, key_is_virtual=True, updated_at=2023-11-20 22:27:07 UTC)

Provenance:
  🗃️ storage: Storage(uid='3D34qtUi', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-11-20 22:26:18 UTC, created_by_id=1)
  📔 transform: Transform(uid='ManDYgmftZ8Cz8', name='Append a new batch of data', short_name='scrna2', version='0', type='notebook', updated_at=2023-11-20 22:26:46 UTC, created_by_id=1)
  👣 run: Run(uid='cteTfZEj8LfaJtGMS2Ao', run_at=2023-11-20 22:26:46 UTC, transform_id=2, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-11-20 22:26:18 UTC)
Features:
  var: FeatureSet(uid='PVSKwGoomf7OKfeZ4v5R', n=754, type='number', registry='bionty.Gene', hash='WMDxN7253SdzGwmznV5d', updated_at=2023-11-20 22:27:07 UTC, created_by_id=1)
    'IL18', 'NPM3', 'S100A9', 'S100A8', 'CNN2', 'ARHGAP45', 'RNF34', 'GPX4', 'S100A6', 'ADISSP', 'S100A4', 'FAM174C', 'SIT1', 'CCDC107', 'RSL1D1', 'TLN1', 'HES4', 'TNFRSF17', 'PCNA', 'RAB13', ...
  obs: FeatureSet(uid='HUy6VcDHXw1HvB6kySCw', n=1, registry='core.Feature', hash='LqUG3OITAsJPqus9JRcI', updated_at=2023-11-20 22:27:07 UTC, created_by_id=1)
    🔗 cell_type (9, bionty.CellType): 'dendritic cell', 'CD38-positive naive B cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'CD16-positive, CD56-dim natural killer cell, human', 'CD4-positive, alpha-beta T cell', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte'
  external: FeatureSet(uid='fDcFmGwq5TFM9yC5gJtL', n=2, registry='core.Feature', hash='FX4Np14jGDGG5Y5cdVJ_', updated_at=2023-11-20 22:27:07 UTC, created_by_id=1)
    🔗 assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
    🔗 organism (1, bionty.Organism): 'human'
Labels:
  🏷️ organism (1, bionty.Organism): 'human'
  🏷️ cell_types (9, bionty.CellType): 'dendritic cell', 'CD38-positive naive B cell', 'B cell, CD19-positive', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'CD16-positive, CD56-dim natural killer cell, human', 'CD4-positive, alpha-beta T cell', 'cytotoxic T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'CD14-positive, CD16-negative classical monocyte'
  🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
file2.view_flow()
[Figure: data lineage graph rendered by file2.view_flow()]

Compare features#

Here we compute shared genes without loading files:

file1_genes = file1.features["var"]
file2_genes = file2.features["var"]

shared_genes = file1_genes & file2_genes
len(shared_genes)
749
shared_genes.list("symbol")[:10]
['HES4',
 'TNFRSF4',
 'SSU72',
 'PARK7',
 'RBP7',
 'SRM',
 'MAD2L2',
 'AGTRAP',
 'TNFRSF1B',
 'EFHD2']
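
Conceptually, `file1_genes & file2_genes` is a set intersection over the gene records of the two feature sets. The minimal sketch below reproduces the idea with plain Python sets, using a handful of symbols copied from the outputs above. The variable names are illustrative, and the samples are partial: the real feature sets hold 36390 and 754 genes, so a symbol missing from a sample here says nothing about the actual files.

```python
# Partial samples of gene symbols copied from the outputs above.
# They stand in for the full feature sets, which are far larger.
file1_symbols = {"MIR1302-2HG", "FAM138A", "OR4F5", "HES4", "TNFRSF4", "SSU72"}
file2_symbols = {"IL18", "NPM3", "HES4", "TNFRSF4", "SSU72"}

# Set intersection keeps only the symbols present in both samples.
shared_symbols = file1_symbols & file2_symbols

print(len(shared_symbols))
print(sorted(shared_symbols))
```

Because only metadata is compared, neither data file has to be read to answer the question.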

Compare cell types#

file1_celltypes = file1.cell_types.all()
file2_celltypes = file2.cell_types.all()

shared_celltypes = file1_celltypes & file2_celltypes
shared_celltypes_names = shared_celltypes.list("name")
shared_celltypes_names
['CD16-positive, CD56-dim natural killer cell, human']

Load the individual files#

We can either load the files into memory or access them in backed mode through .backed(), which lazily loads their content from the cloud or from disk.

Let’s load them into memory:

adata1 = file1.load()
adata2 = file2.load()

We can now subset the two datasets by shared cell types:

adata1_subset = adata1[adata1.obs["cell_type"].isin(shared_celltypes_names)]

adata2_subset = adata2[adata2.obs["cell_type"].isin(shared_celltypes_names)]
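
The `.isin(...)` calls above build a boolean mask over the cells, and indexing with that mask keeps only the cells whose annotated cell type is shared by both files. A minimal pure-Python sketch of the same masking logic follows. The `cell_types_per_cell` list is made-up toy data standing in for `adata.obs["cell_type"]`, not values read from either file.

```python
# The single shared cell type found in the comparison above.
shared_celltypes_names = ["CD16-positive, CD56-dim natural killer cell, human"]

# Hypothetical per-cell annotations, standing in for adata.obs["cell_type"].
cell_types_per_cell = [
    "classical monocyte",
    "CD16-positive, CD56-dim natural killer cell, human",
    "memory B cell",
    "CD16-positive, CD56-dim natural killer cell, human",
]

# One boolean per cell: True if the cell's type is among the shared names.
shared = set(shared_celltypes_names)
mask = [cell_type in shared for cell_type in cell_types_per_cell]

# Positions of the cells that the mask keeps.
kept_indices = [index for index, keep in enumerate(mask) if keep]

print(mask)
print(kept_indices)
```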