
CellTypist#

Cell types classify cells based on public and private knowledge from studying transcription, morphology, function, and other properties. Established cell types have well-characterized markers and properties; however, cell subtypes and states are continuously being discovered, refined, and better understood.

In this notebook, we register the immune cell type vocabulary from CellTypist, a computational tool used for cell type classification in scRNA-seq data.

Setup#

!lamin init --storage ./celltypist --schema bionty
✅ saved: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-11-20 22:24:29 UTC)
✅ saved: Storage(uid='hVeSFmaW', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/celltypist', type='local', updated_at=2023-11-20 22:24:29 UTC, created_by_id=1)
💡 loaded instance: testuser1/celltypist
💡 did not register local instance on hub

# filter warnings from celltypist
import warnings

warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*")
import lamindb as ln
import lnschema_bionty as lb
import celltypist
import pandas as pd

lb.settings.organism = "human"  # globally set organism
💡 lamindb instance: testuser1/celltypist
ln.track()
💡 notebook imports: celltypist==1.6.2 lamindb==0.61.0 lnschema_bionty==0.35.1 pandas==1.5.3
💡 saved: Transform(uid='s5mkN5NQ1ttIz8', name='CellTypist', short_name='celltypist', version='0', type=notebook, updated_at=2023-11-20 22:24:33 UTC, created_by_id=1)
💡 saved: Run(uid='oOXKKEKPHfF9rMXEAfHF', run_at=2023-11-20 22:24:33 UTC, transform_id=1, created_by_id=1)

Access CellTypist records #

As a first step, we read in CellTypist’s immune cell encyclopedia:

description = "CellTypist Pan Immune Atlas v2: basic cell type information"
celltypist_source_v2_url = "https://github.com/Teichlab/celltypist_wiki/raw/main/atlases/Pan_Immune_CellTypist/v2/tables/Basic_celltype_information.xlsx"

# our source data
celltypist_file = ln.File.filter(description=description).one_or_none()

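if celltypist_file is None:
    celltypist_df = pd.read_excel(celltypist_source_v2_url)
    # store the description so that the filter above finds the file on re-runs
    celltypist_file = ln.File(celltypist_df, description=description).save()
else:
    # load the full table (not only the first rows) for the steps below
    celltypist_df = celltypist_file.load()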

The table provides an ontology_id from the public Cell Ontology for the majority of records.

celltypist_df.head()
   High-hierarchy cell types  Low-hierarchy cell types               Description                                        Cell Ontology ID  Curated markers
0  B cells                    B cells                                B lymphocytes with diverse cell surface immuno...  CL:0000236        CD79A, MS4A1, CD19
1  B cells                    Follicular B cells                     resting mature B lymphocytes found in the prim...  CL:0000843        CXCR5, TNFRSF13B, CD22
2  B cells                    Proliferative germinal center B cells  proliferating germinal center B cells              CL:0000844        MKI67, SUGCT, AICDA
3  B cells                    Germinal center B cells                proliferating mature B cells that undergo soma...  CL:0000844        POU2AF1, CD40, SUGCT
4  B cells                    Memory B cells                         long-lived mature B lymphocytes which are form...  CL:0000787        CR2, CD27, MS4A1

A single “Cell Ontology ID” can be associated with multiple “Low-hierarchy cell types”:

celltypist_df.set_index(["Cell Ontology ID", "Low-hierarchy cell types"]).head(10)
Cell Ontology ID  Low-hierarchy cell types               High-hierarchy cell types  Description                                        Curated markers
CL:0000236        B cells                                B cells                    B lymphocytes with diverse cell surface immuno...  CD79A, MS4A1, CD19
CL:0000843        Follicular B cells                     B cells                    resting mature B lymphocytes found in the prim...  CXCR5, TNFRSF13B, CD22
CL:0000844        Proliferative germinal center B cells  B cells                    proliferating germinal center B cells              MKI67, SUGCT, AICDA
                  Germinal center B cells                B cells                    proliferating mature B cells that undergo soma...  POU2AF1, CD40, SUGCT
CL:0000787        Memory B cells                         B cells                    long-lived mature B lymphocytes which are form...  CR2, CD27, MS4A1
                  Age-associated B cells                 B cells                    CD11c+ T-bet+ memory B cells associated with a...  FCRL2, ITGAX, TBX21
CL:0000788        Naive B cells                          B cells                    mature B lymphocytes which express cell-surfac...  IGHM, IGHD, TCL1A
CL:0000818        Transitional B cells                   B cells                    immature B cell precursors in the bone marrow ...  CD24, MYO1C, MS4A1
CL:0000817        Large pre-B cells                      B-cell lineage             proliferative B lymphocyte precursors derived ...  MME, CD24, MKI67
                  Small pre-B cells                      B-cell lineage             non-proliferative B lymphocyte precursors deri...  MME, CD24, IGLL5
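
To list the ontology IDs that are shared by more than one low-hierarchy cell type, one option is to group the names by ID. The following is a minimal sketch in plain Python, using a handful of (ID, name) pairs copied from the table above; the variable names are illustrative and not part of the notebook.

```python
from collections import defaultdict

# A few (Cell Ontology ID, low-hierarchy name) pairs copied from the table above.
rows = [
    ("CL:0000236", "B cells"),
    ("CL:0000843", "Follicular B cells"),
    ("CL:0000844", "Proliferative germinal center B cells"),
    ("CL:0000844", "Germinal center B cells"),
    ("CL:0000787", "Memory B cells"),
    ("CL:0000787", "Age-associated B cells"),
    ("CL:0000788", "Naive B cells"),
]

# Group the low-hierarchy names by their ontology ID.
names_by_id = defaultdict(list)
for ontology_id, name in rows:
    names_by_id[ontology_id].append(name)

# Keep only the IDs that map to more than one name.
shared_ids = {oid: names for oid, names in names_by_id.items() if len(names) > 1}
for oid, names in shared_ids.items():
    print(oid, "->", names)
```

These shared IDs are the reason the registration step later cannot simply reuse one public record per row.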

Validate CellTypist records #

We want every cell type record that can be validated against the public Cell Ontology to actually be validated.

This prevents us from referring to the same cell type with different identifiers.

We need a Bionty object for this:

bionty = lb.CellType.bionty()
bionty
CellType
Organism: all
Source: cl, 2023-08-24
#terms: 2894

📖 CellType.df(): ontology reference table
🔎 CellType.lookup(): autocompletion of terms
🎯 CellType.search(): free text search of terms
✅ CellType.validate(): strictly validate values
🧐 CellType.inspect(): full inspection of values
👽 CellType.standardize(): convert to standardized names
🪜 CellType.diff(): difference between two versions
🔗 CellType.ontology: Pronto.Ontology object

We can now validate the "Cell Ontology ID" column.

When should I use inspect() and when validate()?

inspect() gives us more logging than validate() but runs a bit slower.

Hence, we use inspect() when we suspect that validation won’t pass and we want to find out why, so that we can curate the data.

bionty.inspect(celltypist_df["Cell Ontology ID"], bionty.ontology_id);

This looks good!

But when inspecting the names, most of them don’t validate:

bionty.inspect(celltypist_df["Low-hierarchy cell types"], bionty.name);
97 terms (99.00%) are not validated for name: B cells, Follicular B cells, Proliferative germinal center B cells, Germinal center B cells, Memory B cells, Age-associated B cells, Naive B cells, Transitional B cells, Large pre-B cells, Small pre-B cells, Pre-pro-B cells, Pro-B cells, Cycling B cells, Cycling DCs, Cycling gamma-delta T cells, Cycling monocytes, Cycling NK cells, Cycling T cells, DC, DC1, ...
   detected 6 terms with synonyms: DC1, DC2, ETP, ILC2, ILC3, pDC
→  standardize terms via .standardize()

A search shows that terms named in the plural in CellTypist appear in the singular in the Cell Ontology:

celltypist_df["Low-hierarchy cell types"][0]
'B cells'
bionty.search(celltypist_df["Low-hierarchy cell types"][0]).head(2)
ontology_id definition synonyms parents __agg__ __ratio__
name
B cell CL:0000236 A Lymphocyte Of B Lineage That Is Capable Of B... B-lymphocyte|B lymphocyte|B-cell [CL:0000945] b cell 92.307692
B-1 B cell CL:0000819 A B Cell Of Distinct Lineage And Surface Marke... B1 B-cell|B1 B cell|B-1 B-cell|B1 cell|B1 B ly... [CL:0000785] b-1 b cell 85.714286

Let’s strip the trailing "s" and inspect whether more names now validate. A few more do:

bionty.inspect(
    [i.rstrip("s") for i in celltypist_df["Low-hierarchy cell types"]],
    bionty.name,
);
93 terms (94.90%) are not validated for name: Follicular B cell, Proliferative germinal center B cell, Germinal center B cell, Memory B cell, Age-associated B cell, Naive B cell, Transitional B cell, Large pre-B cell, Small pre-B cell, Pre-pro-B cell, Pro-B cell, Cycling B cell, Cycling DC, Cycling gamma-delta T cell, Cycling monocyte, Cycling NK cell, Cycling T cell, DC, DC1, DC2, ...
   detected 31 terms with inconsistent casing/synonyms: Follicular B cell, Germinal center B cell, Memory B cell, Naive B cell, Transitional B cell, Small pre-B cell, Pro-B cell, DC1, DC2, Endothelial cell, Epithelial cell, Erythrocyte, ETP, Fibroblast, Granulocyte, Neutrophil, ILC2, ILC3, NK cell, Alveolar macrophage, ...
→  standardize terms via .standardize()
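
A note on the singularization used above: `str.rstrip("s")` removes every trailing "s", not just one. That is harmless for the names shown here, but it can over-strip other strings. The sketch below contrasts it with a helper that drops exactly one trailing "s"; `singularize` is a hypothetical helper introduced for illustration, and "glass" is a made-up input rather than a CellTypist label.

```python
def singularize(name: str) -> str:
    """Drop exactly one trailing "s", if present (hypothetical helper)."""
    return name[:-1] if name.endswith("s") else name


# For a typical CellTypist name both approaches agree.
print("B cells".rstrip("s"))
print(singularize("B cells"))

# For a string ending in a double "s" they differ:
# rstrip removes both characters, the helper removes only one.
print("glass".rstrip("s"))
print(singularize("glass"))
```

Neither approach handles irregular plurals, so the result still needs to be checked with inspect().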

Every “low-hierarchy cell type” has an ontology ID, and most “high-hierarchy cell types” also appear as “low-hierarchy cell types” in the CellTypist table. Four, however, don’t, and therefore don’t have an ontology ID.

high_terms = celltypist_df["High-hierarchy cell types"].unique()
low_terms = celltypist_df["Low-hierarchy cell types"].unique()

high_terms_nonval = set(high_terms).difference(low_terms)
high_terms_nonval
{'B-cell lineage', 'Cycling cells', 'Erythroid', 'T cells'}

Register CellTypist records #

Let’s first add the “High-hierarchy cell types” as a column "parent".

This enables LaminDB to populate the parents and children fields, which will enable you to query for hierarchical relationships.

celltypist_df["parent"] = celltypist_df.pop("High-hierarchy cell types")

# if high and low terms are the same, no parents
celltypist_df.loc[
    (celltypist_df["parent"] == celltypist_df["Low-hierarchy cell types"]), "parent"
] = None

# rename columns, drop markers
celltypist_df.drop(columns=["Curated markers"], inplace=True)
celltypist_df.rename(
    columns={"Low-hierarchy cell types": "name", "Cell Ontology ID": "ontology_id"},
    inplace=True,
)
celltypist_df.columns = celltypist_df.columns.str.lower()
celltypist_df.head(2)
   name                description                                        ontology_id  parent
0  B cells             B lymphocytes with diverse cell surface immuno...  CL:0000236   None
1  Follicular B cells  resting mature B lymphocytes found in the prim...  CL:0000843   B cells
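
The rule applied above (a term whose high- and low-hierarchy names coincide has no parent) can be sketched without pandas. This is an illustrative plain-Python version on four rows taken from the table earlier in the notebook; `parent_of` and `children_of` are names chosen here, not LaminDB fields.

```python
from collections import defaultdict

# (low-hierarchy name, high-hierarchy name) pairs copied from the CellTypist table.
rows = [
    ("B cells", "B cells"),
    ("Follicular B cells", "B cells"),
    ("Large pre-B cells", "B-cell lineage"),
    ("Small pre-B cells", "B-cell lineage"),
]

# A term whose high- and low-hierarchy names are identical gets no parent.
parent_of = {low: (None if high == low else high) for low, high in rows}

# Invert the mapping to obtain the children of each parent.
children_of = defaultdict(list)
for child, parent in parent_of.items():
    if parent is not None:
        children_of[parent].append(child)

print(parent_of)
print(dict(children_of))
```

Note that "B-cell lineage" appears only as a parent in this sample, which is why such high-hierarchy terms need records of their own.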

Now, let’s create records from the public ontology:

public_records = lb.CellType.from_values(
    celltypist_df.ontology_id, lb.CellType.ontology_id
)

Let’s now amend the public ontology records so that they retain any additional annotations that CellTypist provides.

records_names = {}
public_records_dict = {r.ontology_id: r for r in public_records}

for _, row in celltypist_df.iterrows():
    name = row["name"]
    ontology_id = row["ontology_id"]
    public_record = public_records_dict[ontology_id]

    # if both name and ontology_id match public record, use public record
    if name.lower() == public_record.name.lower():
        records_names[name] = public_record
        continue
    else:  # when ontology_id matches the public record and name doesn't match
        # if singular form of the Celltypist name matches public name
        if name.lower().rstrip("s") == public_record.name.lower():
            # add the Celltypist name to the synonyms of the public ontology record
            public_record.add_synonym(name)
            records_names[name] = public_record
            continue
        if public_record.synonyms is not None:
            synonyms = {s.lower() for s in public_record.synonyms.split("|")}
            # if any public synonym matches the Celltypist name (as is or in singular form)
            if synonyms & {name.lower(), name.lower().rstrip("s")}:
                # add the Celltypist name to the synonyms of the public ontology record
                public_record.add_synonym(name)
                records_names[name] = public_record
                continue

        # create a record only based on Celltypist metadata
        records_names[name] = lb.CellType(
            name=name, ontology_id=ontology_id, description=row.description
        )
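
The decision logic of the loop above can be isolated into a small pure function, which makes the four outcomes easier to test. This is only a sketch: `match_kind` is a hypothetical helper, not part of LaminDB or Bionty, and the public names and synonyms passed in are the ones displayed elsewhere in this notebook.

```python
from typing import Optional


def match_kind(name: str, public_name: str, public_synonyms: Optional[str]) -> str:
    """Classify how a CellTypist name relates to a public record (hypothetical helper)."""
    lowered = name.lower()
    singular = lowered.rstrip("s")  # same naive singularization as in the loop above
    if lowered == public_name.lower():
        return "exact"  # reuse the public record as is
    if singular == public_name.lower():
        return "plural"  # reuse the public record, add the name as a synonym
    if public_synonyms is not None:
        synonyms = {s.lower() for s in public_synonyms.split("|")}
        if synonyms & {lowered, singular}:
            return "synonym"  # reuse the public record, add the name as a synonym
    return "new"  # create a record from CellTypist metadata only


print(match_kind("B cells", "B cell", "B-lymphocyte|B lymphocyte|B-cell"))
print(
    match_kind(
        "Age-associated B cells",
        "memory B cell",
        "memory B-lymphocyte|memory B-cell|memory B lymphocyte",
    )
)
```

The second call falls through to "new" because neither the name nor its singular form matches the public name or any of its synonyms.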

You can see certain records are created by adding the Celltypist name to the synonyms of the public record:

records_names["GMP"]
CellType(uid='L8KIZwZR', name='GMP', ontology_id='CL:0000557', description='hematopoietic granulocyte-monocyte progenitors that are committed to the granulocyte and monocyte lineage cells', created_by_id=1)

Other records are created based on CellTypist metadata:

records_names["Age-associated B cells"]
CellType(uid='00ieV0IG', name='Age-associated B cells', ontology_id='CL:0000787', description='CD11c+ T-bet+ memory B cells associated with autoimmunity and aging', created_by_id=1)

Let’s save them to our database:

records = records_names.values()

ln.save(records)
❗ now recursing through parents: this only happens once, but is much slower than bulk saving

Add parent-child relationships between CellTypist records#

We still need to add the remaining 4 high-hierarchy terms:

list(high_terms_nonval)
['Cycling cells', 'T cells', 'B-cell lineage', 'Erythroid']

Let’s get the top hits from a search:

for term in list(high_terms_nonval):
    print(f"Term: {term}")
    display(bionty.search(term).head(1))
Term: Cycling cells

ontology_id definition synonyms parents __agg__ __ratio__
name
circulating cell CL:0000080 A Cell Which Moves Among Different Tissues Of ... None [CL:0000003] circulating cell 75.862069
Term: T cells

ontology_id definition synonyms parents __agg__ __ratio__
name
T cell CL:0000084 A Type Of Lymphocyte Whose Defining Characteri... T-cell|T-lymphocyte|T lymphocyte [CL:0000542] t cell 92.307692
Term: B-cell lineage

ontology_id definition synonyms parents __agg__ __ratio__
name
obsolete cell by lineage CL:0000220 None None [] obsolete cell by lineage 73.684211
Term: Erythroid

ontology_id definition synonyms parents __agg__ __ratio__
name
erythroid lineage cell CL:0000764 A Immature Or Mature Cell In The Lineage Leadi... erythropoietic cell [CL:0000763] erythroid lineage cell 90.0
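
The score of 92.307692 reported for “T cells” equals the normalized similarity ratio 2·M/T of the lowercased strings "t cells" and "t cell" (M = 6 matching characters, T = 13 characters in total). The sketch below reproduces that number with Python's standard-library difflib. It is an assumption that the search compares lowercased names with a ratio of this kind; the actual scorer is not shown in this notebook, and the other scores above need not follow the same formula.

```python
from difflib import SequenceMatcher

query = "T cells"
candidate = "T cell"

# Compare the lowercased strings (assumption: the search ignores case).
ratio = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
score = 100 * ratio

print(f"{score:.6f}")
```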

So we decide to:

  • Add the “T cells” to the synonyms of the public “T cell” record

  • Create the remaining 3 terms using only their names (we think “B-cell lineage” shouldn’t be identified with “B cell”)

for name in high_terms_nonval:
    if name == "T cells":
        record = lb.CellType.from_bionty(name="T cell")
        record.add_synonym(name)
        record.save()
    else:
        record = lb.CellType(name=name)
        record.save()
    records_names[name] = record
❗ records with similar names exist! did you mean to load one of them?
uid synonyms score
name
Cycling B cells ibzfn1zQ 92.9
Cycling T cells TTziQpub 92.9
❗ records with similar names exist! did you mean to load one of them?
uid synonyms score
name
Mid erythroid lveE8XKg 95.0
Early erythroid MiIxaBcE 90.0
Late erythroid NY6Iq1SQ 90.0
Megakaryocyte-erythroid-mast cell progenitor rDuO4MVx 90.0
erythroid lineage cell Gx34JGrp erythropoietic cell 90.0

Now let’s add the parent records:

for _, row in celltypist_df.iterrows():
    record = records_names[row["name"]]
    if row["parent"] is not None:
        parent_record = records_names[row["parent"]]
        record.parents.add(parent_record)

Access the registry#

The CellTypist records registered above are now available in LaminDB. To retrieve the full CellType registry as a pandas DataFrame, we can call .filter() followed by .df():

lb.CellType.filter().df()
uid name ontology_id abbr synonyms description bionty_source_id updated_at created_by_id
id
1 cx8VcggA B cell CL:0000236 None B-cell|B-lymphocyte|B cells|B lymphocyte A Lymphocyte Of B Lineage That Is Capable Of B... 21.0 2023-11-20 22:24:34.966455+00:00 1
2 FMTngXKK follicular B cell CL:0000843 None follicular B-lymphocyte|follicular B-cell|foll... A Resting Mature B Cell That Has The Phenotype... 21.0 2023-11-20 22:24:34.966494+00:00 1
3 TC2eLf0p Proliferative germinal center B cells CL:0000844 None None proliferating germinal center B cells NaN 2023-11-20 22:24:34.966526+00:00 1
4 uMLhrmbZ germinal center B cell CL:0000844 None germinal center B-cell|Germinal center B cells... A Rapidly Cycling Mature B Cell That Has Disti... 21.0 2023-11-20 22:24:34.966557+00:00 1
5 67zMsufW memory B cell CL:0000787 None memory B-lymphocyte|memory B-cell|Memory B cel... A Memory B Cell Is A Mature B Cell That Is Lon... 21.0 2023-11-20 22:24:34.966587+00:00 1
... ... ... ... ... ... ... ... ... ...
145 BxNjby0x T cell CL:0000084 None T cells|T-cell|T lymphocyte|T-lymphocyte A Type Of Lymphocyte Whose Defining Characteri... 21.0 2023-11-20 22:24:58.779165+00:00 1
146 2C5PhwrW mature T cell CL:0002419 None mature T-cell|CD3e-positive T cell A T Cell That Expresses A T Cell Receptor Comp... 21.0 2023-11-20 22:24:58.599869+00:00 1
147 OTJNBiwc Cycling cells None None None None NaN 2023-11-20 22:24:58.764365+00:00 1
148 tAfjGRvg B-cell lineage None None None None NaN 2023-11-20 22:24:58.792235+00:00 1
149 mIwPohXK Erythroid None None None None NaN 2023-11-20 22:24:58.809747+00:00 1

149 rows × 9 columns

This enables us to look for cell types by creating a lookup object from our new CellType registry.

db_lookup = lb.CellType.lookup()
db_lookup.memory_b_cell
CellType(uid='67zMsufW', name='memory B cell', ontology_id='CL:0000787', synonyms='memory B-lymphocyte|memory B-cell|Memory B cells|memory B lymphocyte', description='A Memory B Cell Is A Mature B Cell That Is Long-Lived, Readily Activated Upon Re-Encounter Of Its Antigenic Determinant, And Has Been Selected For Expression Of Higher Affinity Immunoglobulin. This Cell Type Has The Phenotype Cd19-Positive, Cd20-Positive, Mhc Class Ii-Positive, And Cd138-Negative.', updated_at=2023-11-20 22:24:34 UTC, bionty_source_id=21, created_by_id=1)

See cell type hierarchy:

db_lookup.memory_b_cell.view_parents()
[Figure: parent hierarchy of the "memory B cell" record]

Access parents of a record:

db_lookup.memory_b_cell.parents.list()
[CellType(uid='cx8VcggA', name='B cell', ontology_id='CL:0000236', synonyms='B-cell|B-lymphocyte|B cells|B lymphocyte', description='A Lymphocyte Of B Lineage That Is Capable Of B Cell Mediated Immunity.', updated_at=2023-11-20 22:24:34 UTC, bionty_source_id=21, created_by_id=1),
 CellType(uid='0I51jgPp', name='mature B cell', ontology_id='CL:0000785', synonyms='mature B-cell|mature B lymphocyte|mature B-lymphocyte', description='A B Cell That Is Mature, Having Left The Bone Marrow. Initially, These Cells Are Igm-Positive And Igd-Positive, And They Can Be Activated By Antigen.', updated_at=2023-11-20 22:24:42 UTC, bionty_source_id=21, created_by_id=1)]
# clean up test instance
!lamin delete --force celltypist
!rm -r ./celltypist
💡 deleting instance testuser1/celltypist
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--celltypist.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/celltypist