Queries the Human Cell Atlas • cellNexus

cellNexus is a query interface for programmatic exploration and retrieval of harmonised, curated, and reannotated CELLxGENE human-cell-atlas data.

This standalone documentation website provides:

Detailed data-processing information (quality control, harmonisation, and expression representations).
Complete explanations of metadata columns used in filtering and interpretation.
Guided examples for metadata-first exploration and gene-expression analysis.

Data Processing Overview

The harmonisation pipeline standardises data across datasets so queries are consistent across studies:

Metadata are retrieved from cloud-hosted harmonised tables.
Standardised quality control removes empty droplets, dead/damaged cells, and likely doublets.
Cell-level data are served through common assay layers (counts, cpm, sct, pseudobulk).
Outputs are returned in analysis-ready R formats such as SingleCellExperiment and Seurat.

Quality control steps

The QC flags used throughout cellNexus are computed using HPCell. In brief:

Empty droplets (empty_droplet):
- Computed from a SingleCellExperiment.
- Excludes mitochondrial and ribosomal genes before scoring.
- Computes, per cell, the number of expressed genes and flags the cell as an empty droplet when that number is below RNA_feature_threshold (200 by default, except for targeted panels such as the Rhapsody technology).
Alive cells (alive):
- Filters out empty droplets.
- Computes per-cell QC metrics from raw counts using scuttle::perCellQCMetrics(..., subsets=list(Mito=...)), where the mitochondrial subset is defined by ^MT in the feature names.
- Determines high mitochondrial content via scater::isOutlier(subsets_Mito_percent, type="higher"). Outlier calling is performed within each cell-type group using our harmonised label cell_type_unified_ensemble.
- Alive cells are labelled as those without high mitochondrial content (!high_mitochondrion).
Doublets (scDblFinder.class):
- Filters out empty droplets.
- scDblFinder::scDblFinder() is run with default parameters. Cells that scDblFinder cannot classify are assigned the class "Unknown" so that they are not dropped.
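The empty-droplet and alive flags described above can be sketched in a few lines of plain Python. This is an illustration only, not the HPCell implementation: the function names are hypothetical, the ribosomal pattern `^RP[SL]` is an assumption (the text does not state it), and the outlier rule assumes the usual "median plus three scaled MADs" threshold for a `type="higher"` call. The toy data use a threshold of 2 instead of 200 so the example stays small.

```python
import re
from statistics import median

MITO_PATTERN = re.compile(r"^MT")      # mitochondrial subset, as described above
RIBO_PATTERN = re.compile(r"^RP[SL]")  # assumption: ribosomal pattern is not given in the text


def flag_empty_droplets(counts, gene_names, feature_threshold=200):
    """counts[g][c] is the raw count of gene g in cell c; returns one flag per cell."""
    kept_rows = [
        row for row, gene in zip(counts, gene_names)
        if not (MITO_PATTERN.match(gene) or RIBO_PATTERN.match(gene))
    ]
    n_cells = len(counts[0])
    # Number of expressed (non-zero) genes per cell, after the exclusions.
    n_expressed = [sum(1 for row in kept_rows if row[c] > 0) for c in range(n_cells)]
    return [n < feature_threshold for n in n_expressed]


def is_high_outlier(values, nmads=3.0):
    """Flag values above median + nmads * scaled MAD (upper tail only)."""
    med = median(values)
    mad = 1.4826 * median(abs(v - med) for v in values)
    return [v > med + nmads * mad for v in values]


def flag_high_mito(mito_percent, groups, nmads=3.0):
    """Call outliers separately within each cell-type group."""
    high = [False] * len(mito_percent)
    for group in set(groups):
        idx = [i for i, g in enumerate(groups) if g == group]
        flags = is_high_outlier([mito_percent[i] for i in idx], nmads)
        for i, flag in zip(idx, flags):
            high[i] = flag
    return high


# Toy data: 5 genes x 3 cells.
gene_names = ["MT-CO1", "RPS3", "CD3E", "CD4", "MS4A1"]
counts = [
    [5, 0, 9],
    [3, 1, 2],
    [1, 0, 4],
    [2, 0, 0],
    [0, 1, 1],
]
empty = flag_empty_droplets(counts, gene_names, feature_threshold=2)

# Toy mitochondrial percentages for two cell-type groups.
mito_percent = [2.0, 3.0, 2.5, 40.0, 10.0, 11.0, 12.0, 10.5]
groups = ["T"] * 4 + ["hepatocyte"] * 4
high_mito = flag_high_mito(mito_percent, groups)
alive = [not h for h in high_mito]
```

Calling outliers within each group means the cells at 10 to 12% in the second group are judged against their own group's distribution rather than against the low-mitochondrial first group, so only the 40% cell is flagged.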

RNA abundance

RNA counts:
- RNA count distributions per sample are annotated from cellxgenedp, using the x_approximate_distribution column.
CPM:
- Counts-per-million normalisation computed from the raw counts assay via scuttle::calculateCPM().
Rank:
- Per-cell gene-expression ranks computed with singscore::rankGenes().
- Implemented in column chunks (default 1000 cells per slice) to handle very large datasets; slices are written to disk as an HDF5Array-backed sparse integer matrix and then column-bound.
SCT:
- Variance-stabilising normalisation computed with Seurat SCTransform() (v2), with regression of cell-level covariates (subsets_Mito_percent and subsets_Ribo_percent).
- QC filtering is applied first.
- A common scale, the median across the whole resource, is applied (scale_factor=2186).
Pseudobulk:
- All low-quality cells flagged by QC are removed before aggregation.
- Aggregates counts across cells using scuttle::aggregateAcrossCells(), grouping by sample_id and the harmonised cell type (cell_type_harmonised_ensemble).
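As a rough illustration of two of the representations above, the following plain-Python sketch computes counts per million and a sum-based pseudobulk. It is not the cellNexus pipeline: the function names and the `keep` argument (standing in for the QC filter) are hypothetical, and summation is assumed as the aggregation statistic.

```python
def calculate_cpm(counts):
    """counts[g][c] is the raw count of gene g in cell c; returns counts per million."""
    n_cells = len(counts[0])
    lib_sizes = [sum(row[c] for row in counts) for c in range(n_cells)]
    return [
        [row[c] / lib_sizes[c] * 1e6 if lib_sizes[c] else 0.0 for c in range(n_cells)]
        for row in counts
    ]


def aggregate_pseudobulk(counts, sample_ids, cell_types, keep):
    """Sum counts over the QC-passing cells of each (sample, cell type) pair."""
    members = {}
    for c, key in enumerate(zip(sample_ids, cell_types)):
        if keep[c]:  # low-quality cells are dropped before aggregation
            members.setdefault(key, []).append(c)
    return {
        key: {
            "counts": [sum(row[c] for c in cells) for row in counts],
            "n_cells": len(cells),  # number of cells combined into this pseudobulk
        }
        for key, cells in members.items()
    }


# Toy data: 2 genes x 4 cells.
counts = [
    [1, 2, 3, 4],
    [0, 1, 0, 5],
]
sample_ids = ["s1", "s1", "s1", "s2"]
cell_types = ["T", "T", "B", "T"]
keep = [True, True, True, False]  # the last cell fails QC

cpm = calculate_cpm(counts)
pb = aggregate_pseudobulk(counts, sample_ids, cell_types, keep)
```

Each CPM column sums to one million, and the group whose only cell fails QC produces no pseudobulk entry at all.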

Metadata Explore

Through harmonisation and curation, cellNexus adds columns that are not present in the original CELLxGENE metadata alone.

Column	Description
`cell_id`	Cell identifier.
`observation_joinid`	Cell ID join key linking metadata.
`dataset_id`	Primary dataset identifier in the atlas.
`sample_id`	Harmonised sample identifier.
`sample_`	Internal sample subdivision helper.
`experiment___`	Upstream experiment grouping variable.
`sample_heuristic`	Internal sample subdivision helper.
`age_days`	Donor age in days.
`tissue_groups`	Coarse tissue grouping for analysis.
`nFeature_expressed_in_sample`	Number of expressed features per cell.
`nCount_RNA`	Total RNA counts per cell (sample-aware).
`empty_droplet`	Quality-control flag for empty droplets.
`cell_type_unified_ensemble`	Consensus immune identity from Azimuth and `SingleR` (Blueprint, Monaco).
`is_immune`	Curated flag for immune-cell context.
`subsets_Mito_percent`	Percent of each cell’s total counts coming from mitochondrial genes in a sample.
`subsets_Ribo_percent`	Percent of each cell’s total counts coming from ribosomal genes in a sample.
`high_mitochondrion`	TRUE if the cell’s mitochondrial percent exceeds the QC cutoff.
`high_ribosome`	TRUE if the cell’s ribosomal percent exceeds the QC cutoff.
`scDblFinder.class`	Quality-control flag for doublet classification from `scDblFinder`.
`sample_chunk`	Internal sample subdivision chunks.
`cell_chunk`	Internal cell subdivision chunks.
`sample_pseudobulk_chunk`	Internal pseudobulk subdivision chunks.
`file_id_cellNexus_single_cell`	Internal file id for single-cell layers.
`file_id_cellNexus_pseudobulk`	Internal file id for pseudobulk layers.
`count_upper_bound`	Count capping threshold used in transformation.
`nfeature_expressed_thresh`	Threshold of the number of expressed features per cell.
`inverse_transform`	Transformation method used in pre-processing pipeline.
`alive`	Quality-control flag for viable cells (TRUE when the cell is not flagged for high mitochondrial content).
`cell_annotation_blueprint_singler`	`SingleR` annotation (Blueprint).
`cell_annotation_monaco_singler`	`SingleR` annotation (Monaco).
`cell_annotation_azimuth_l2`	Azimuth cell annotation.
`ethnicity_flagging_score`	Supporting score for ethnicity imputation.
`low_confidence_ethnicity`	Supporting flag for low-confidence ethnicity calls.
`.aggregated_cells`	Post-QC cells combined into each pseudobulk sample.
`imputed_ethnicity`	Imputed ethnicity label.
`atlas_id`	cellNexus atlas release identifier (internal use).

Definitions of the fields inherited from CELLxGENE follow the CELLxGENE schema 5.1.0 and the CELLxGENE Census schema.

Client Usage Examples

R client (`cellNexus`)

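library(cellNexus)
library(dplyr)
library(stringr)

# Retrieve the harmonised metadata and join the Census table
metadata <- get_metadata() |>
  join_census_table()

# Keep only cells that pass quality control
metadata <- metadata |>
  keep_quality_cells()

# Select the cells of interest using metadata columns
query <- metadata |>
  filter(
    self_reported_ethnicity == "African",
    str_like(assay, "%10x%"),
    tissue == "lung parenchyma",
    str_like(cell_type, "%CD4%")
  )

# Retrieve single-cell expression (counts and CPM layers) for the query
sce <- get_single_cell_experiment(query, assays = c("counts", "cpm"))
# Retrieve pseudobulk expression for the same query
pb <- get_pseudobulk(query)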

Python client (`cellNexusPy`)

Python support is available in the companion repository: MangiolaLaboratory/cellNexusPy.

from cellnexuspy import get_metadata, get_anndata

# Open the sample metadata table (returns a connection and a table)
sample_dataset = "https://object-store.rc.nectar.org.au/v1/AUTH_06d6e008e3e642da99d806ba3ea629c5/cellNexus-metadata/sample_metadata.1.3.0.parquet"
conn, table = get_metadata(parquet_url=sample_dataset)

# Keep only cells that pass quality control
table = table.filter("""
    empty_droplet = 'false'
    AND alive = 'true'
    AND "scDblFinder.class" != 'doublet'
    AND feature_count >= 5000
""")

# Select the cells of interest using metadata columns
query = table.filter("""
    self_reported_ethnicity = 'African'
    AND assay LIKE '%10%'
    AND tissue = 'lung parenchyma'
    AND cell_type LIKE '%CD4%'
""")

# Retrieve CPM single-cell data and pseudobulk data, then close the connection
adata = get_anndata(query, assay="cpm")
pb = get_anndata(query, aggregation="pseudobulk")
conn.close()

For further implementation details and code examples, see the cellNexus README.
