
This function runs CHOIR clustering to identify a set of clusters that represent statistically distinct populations. CHOIR is based on the premise that, if clusters contain biologically different cell types or states, a classifier that considers features present in cells from each cluster should be able to distinguish the clusters with a higher level of accuracy than classifiers trained on randomly permuted cluster labels.

Usage

CHOIR(
  object,
  key = "CHOIR",
  alpha = 0.05,
  p_adjust = "bonferroni",
  feature_set = "var",
  exclude_features = NULL,
  n_iterations = 100,
  n_trees = 50,
  use_variance = TRUE,
  min_accuracy = 0.5,
  min_connections = 1,
  max_repeat_errors = 20,
  distance_approx = TRUE,
  distance_awareness = 2,
  collect_all_metrics = FALSE,
  sample_max = Inf,
  downsampling_rate = "auto",
  max_clusters = "auto",
  min_cluster_depth = 2000,
  normalization_method = "none",
  subtree_reductions = TRUE,
  reduction_method = NULL,
  reduction_params = list(),
  n_var_features = NULL,
  batch_correction_method = "none",
  batch_correction_params = list(),
  batch_labels = NULL,
  neighbor_params = list(),
  cluster_params = list(algorithm = 1, group.singletons = TRUE),
  use_assay = NULL,
  use_slot = NULL,
  ArchR_matrix = NULL,
  ArchR_depthcol = NULL,
  reduction = NULL,
  var_features = NULL,
  atac = FALSE,
  n_cores = NULL,
  random_seed = 1,
  verbose = TRUE
)
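A minimal sketch of a call that overrides a few defaults is shown here. The argument values and the `seurat_object` placeholder are illustrative assumptions, not recommendations; the call only runs if the CHOIR package is installed and such an object exists in the session.

```r
# Illustrative (hypothetical) overrides of the documented defaults
choir_args <- list(
  key = "CHOIR_strict",  # store this run's results under a separate key
  alpha = 0.01,          # stricter significance level than the default 0.05
  p_adjust = "fdr",      # one of the permitted values
  n_cores = 2,
  random_seed = 1
)

# `seurat_object` is a placeholder for a preprocessed Seurat object
if (requireNamespace("CHOIR", quietly = TRUE) && exists("seurat_object")) {
  seurat_object <- do.call(CHOIR::CHOIR, c(list(object = seurat_object), choir_args))
}
```

Using a distinct `key` for each run keeps results from different parameter settings side by side in the same object.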

Arguments

object

An object of class 'Seurat', 'SingleCellExperiment', or 'ArchRProject'.

key

The name under which CHOIR-related data for this run is stored in the object. Defaults to 'CHOIR'.

alpha

A numeric value indicating the significance level used for permutation test comparisons of cluster prediction accuracies. Defaults to 0.05.

p_adjust

A string indicating which multiple comparison adjustment to use. Permitted values are 'bonferroni', 'fdr', and 'none'. Defaults to 'bonferroni'.
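To illustrate how the choice of adjustment changes which comparisons stay significant, the sketch below applies base R's `stats::p.adjust()` to a set of made-up p-values. Whether CHOIR calls `p.adjust()` internally is an assumption; the example only shows the general effect of each method.

```r
# Hypothetical p-values from four cluster comparisons
p_values <- c(0.001, 0.012, 0.03, 0.2)
alpha <- 0.05

# Bonferroni multiplies each p-value by the number of comparisons (capped at 1)
p_bonferroni <- p.adjust(p_values, method = "bonferroni")

# "fdr" is an alias for the Benjamini-Hochberg procedure in p.adjust()
p_fdr <- p.adjust(p_values, method = "fdr")

# Number of comparisons that remain significant under each adjustment
n_sig_bonferroni <- sum(p_bonferroni < alpha)
n_sig_fdr <- sum(p_fdr < alpha)
```

Bonferroni is the more conservative of the two, so it keeps fewer comparisons significant, which in CHOIR's framework favors merging clusters.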

feature_set

A string indicating whether to train random forest classifiers on 'all' features or only variable ('var') features. Defaults to 'var'.

exclude_features

A character vector indicating features that should be excluded from input to the random forest classifier. Default = NULL will not exclude any features.

n_iterations

A numeric value indicating the number of iterations run for each permutation test comparison. Defaults to 100.

n_trees

A numeric value indicating the number of trees in each random forest. Defaults to 50.

use_variance

A boolean value indicating whether to use the variance of the random forest accuracy scores as part of the permutation test threshold. Defaults to TRUE.

min_accuracy

A numeric value indicating the minimum accuracy required of the random forest classifier, below which clusters will be automatically merged. Defaults to 0.5 (chance).

min_connections

A numeric value indicating the minimum number of nearest neighbors between two clusters for them to be considered 'adjacent'. Non-adjacent clusters will not be merged. Defaults to 1.

max_repeat_errors

Used to account for situations in which random forest classifier errors are concentrated among a few cells that are repeatedly misassigned. A numeric value indicating the maximum number of such 'repeat errors' that will be taken into account. If set to 0, 'repeat errors' will not be evaluated. Defaults to 20.

distance_approx

A boolean value indicating whether or not to use approximate distance calculations. Default = TRUE will use centroid-based distances.

distance_awareness

A numeric value representing the distance threshold above which a cluster will not merge with another cluster. Specifically, this value is multiplied by the distance between a cluster and its closest distinguishable neighbor to set the threshold. Default = 2 sets this threshold at a 2-fold increase in distance. Alternatively, to omit all distance calculations, set to FALSE.
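The threshold arithmetic described for distance_awareness can be sketched as follows. The distances and cluster names are hypothetical, and the code mirrors only the rule stated above, not CHOIR's internal implementation.

```r
distance_awareness <- 2

# Hypothetical distance from a cluster to its closest distinguishable neighbor
nearest_distinguishable <- 1.5

# Hypothetical distances from the same cluster to two merge candidates
candidate_distances <- c(cluster_B = 2.0, cluster_C = 3.5)

# Threshold = multiplier x distance to the closest distinguishable neighbor
threshold <- distance_awareness * nearest_distinguishable

# Candidates farther away than the threshold are not eligible to merge
eligible <- candidate_distances <= threshold
```

Here the threshold is 3, so `cluster_B` remains a merge candidate while `cluster_C` is ruled out by distance alone.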

collect_all_metrics

A boolean value indicating whether to collect and save additional metrics from the random forest classifier comparisons, including feature importances and tree depth. Defaults to FALSE.

sample_max

A numeric value indicating the maximum number of cells used per cluster to train/test each random forest classifier. Default = Inf does not cap the number of cells used.

downsampling_rate

A numeric value indicating the proportion of cells used per cluster to train/test each random forest classifier. Default = "auto" sets the downsampling rate according to the dataset size, for efficiency.

max_clusters

Indicates the extent to which the hierarchical clustering tree will be expanded. Default = 'auto' will expand the tree until instances of underclustering have been eliminated in all branches. Alternatively, supply a numeric value indicating the maximum number of clusters to expand the tree to.

min_cluster_depth

A numeric value indicating the maximum cluster size at the bottom of the clustering tree, prior to pruning branches. Defaults to 2000.

normalization_method

A character string or vector indicating which normalization method to use. In general, input data should be supplied to CHOIR after normalization, except in cases when the user wishes to use Seurat::SCTransform() normalization. Permitted values are 'none' or 'SCTransform'. Defaults to 'none'.

subtree_reductions

A boolean value indicating whether to generate a new dimensionality reduction for each subtree. Defaults to TRUE.

reduction_method

A character string or vector indicating which dimensionality reduction method to use. Permitted values are 'PCA' for principal component analysis, 'LSI' for latent semantic indexing, and 'IterativeLSI' for iterative latent semantic indexing. Default = NULL will specify a method automatically based on the input data type.

reduction_params

A list of additional parameters to be passed to the selected dimensionality reduction method.

n_var_features

A numeric value indicating how many variable features to identify. Default = NULL will use 2000 features, or 25000 features for ATAC-seq data.

batch_correction_method

A character string or vector indicating which batch correction method to use. Permitted values are 'Harmony' and 'none'. Defaults to 'none'.

batch_correction_params

A list of additional parameters to be passed to the selected batch correction method for each iteration. Only applicable when batch_correction_method = 'Harmony'.

batch_labels

If applying batch correction, a character string or vector indicating the name of the column containing the batch labels. Defaults to NULL.

neighbor_params

A list of additional parameters to be passed to Seurat::FindNeighbors() (or, in the case of multi-modal data for Seurat or SingleCellExperiment objects, Seurat::FindMultiModalNeighbors()).

cluster_params

A list of additional parameters to be passed to Seurat::FindClusters() for clustering at each level of the tree. Note that if group.singletons is set to TRUE, CHOIR relabels initial clusters such that each singleton constitutes its own cluster.

use_assay

For Seurat or SingleCellExperiment objects, a character string or vector indicating the assay(s) to use in the provided object. Default = NULL will choose the current active assay for Seurat objects and the logcounts assay for SingleCellExperiment objects.

use_slot

For Seurat objects, a character string or vector indicating the layer(s) (previously known as slot(s)) to use in the provided object. Default = NULL will choose a layer/slot based on the selected assay. If a non-standard assay is provided, do not leave use_slot as NULL.

ArchR_matrix

For ArchR objects, a character string or vector indicating which matrix (or matrices) to use in the provided object. Default = NULL will use the 'TileMatrix' for ATAC-seq data or the 'GeneExpressionMatrix' for RNA-seq data.

ArchR_depthcol

For ArchR objects, a character string or vector indicating which column to use for correlation with sequencing depth. Default = NULL will use the 'nFrags' column for ATAC-seq data or the 'Gex_nUMI' column for RNA-seq data.

reduction

An optional matrix of dimensionality reduction cell embeddings to be used for subsequent clustering steps. Defaults to NULL, whereby dimensionality reduction(s) will instead be calculated using the method specified by reduction_method.

var_features

An optional character vector of variable features to be used for subsequent clustering steps. Defaults to NULL, whereby new sets of variable features will instead be generated.

atac

A boolean value or vector indicating whether the provided data is ATAC-seq data. Defaults to FALSE. For multi-omic datasets containing ATAC-seq data, it is important to supply this parameter as a vector corresponding to each modality in order.

n_cores

A numeric value indicating the number of cores to use for parallelization. Default = NULL will use the number of available cores minus 2.
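The default described for n_cores can be approximated with base R's `parallel` package. This is a sketch only: the floor of one core and the handling of an undetectable core count are assumptions added here, not documented CHOIR behavior.

```r
# Number of cores reported by the system (may be NA on some platforms)
available_cores <- parallel::detectCores()
if (is.na(available_cores)) {
  available_cores <- 1L
}

# Leave two cores free, but never go below one (assumed safeguard)
n_cores <- max(1L, available_cores - 2L)
```

Passing an explicit `n_cores` value is the safer choice on shared machines or schedulers, where the detected core count may exceed what the job is allowed to use.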

random_seed

A numeric value indicating the random seed to be used. Defaults to 1.

verbose

A boolean value indicating whether to use verbose output during the execution of this function. Defaults to TRUE. Can be set to FALSE for a cleaner output.

Value

Returns the object with the following added data stored under the provided key:

cell_IDs

Cell IDs belonging to each subtree

clusters

Final clusters, full hierarchical cluster tree, and stepwise cluster results for each progressive pruning step

graph

All calculated nearest neighbor and shared nearest neighbor adjacency matrices

parameters

Record of parameter values used

records

Metadata for decision points during hierarchical tree construction, all recorded permutation test comparisons, and feature importance scores from all comparisons

reduction

Cell embeddings for all calculated dimensionality reductions

var_features

Variable features for all calculated dimensionality reductions

Details

First, a hierarchical clustering tree is generated using a top-down approach that proceeds from an initial partition, in which all cells are in the same cluster, to a partition in which all cells are demonstrably overclustered. Second, to identify a final set of clusters, this hierarchical clustering tree is pruned from the bottom up using a framework of random forest classifiers and permutation tests.
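The permutation-test idea behind the pruning step can be sketched in base R. This is a simplified stand-in under stated assumptions: a nearest-centroid classifier replaces the random forest, the data are simulated, and the p-value formula is a generic permutation-test estimate rather than CHOIR's exact statistic. The helper `holdout_accuracy()` is hypothetical.

```r
set.seed(1)
n <- 100

# Two simulated clusters separated along a single feature
x <- c(rnorm(n, mean = 0), rnorm(n, mean = 5))
labels <- rep(c("A", "B"), each = n)

# Stand-in classifier: nearest centroid, scored on a random 50/50 holdout split
holdout_accuracy <- function(x, labels) {
  train <- sample(length(x), length(x) / 2)
  centroids <- vapply(split(x[train], labels[train]), mean, numeric(1))
  test_x <- x[-train]
  closer_to_a <- abs(test_x - centroids[["A"]]) < abs(test_x - centroids[["B"]])
  predicted <- ifelse(closer_to_a, "A", "B")
  mean(predicted == labels[-train])
}

n_iterations <- 100

# Accuracies with the true labels versus randomly permuted labels
true_acc <- replicate(n_iterations, holdout_accuracy(x, labels))
perm_acc <- replicate(n_iterations, holdout_accuracy(x, sample(labels)))

# Generic permutation p-value: how often permuted labels do as well as true ones
p_value <- (sum(perm_acc >= mean(true_acc)) + 1) / (n_iterations + 1)
```

When the two simulated clusters are well separated, the true-label accuracies sit far above the permuted-label accuracies and the p-value is small, so the clusters would be kept apart; clusters that cannot be distinguished better than chance would instead be merged.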

For multi-modal data, optionally supply parameter inputs as vectors/lists that sequentially specify the value for each modality.
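As a sketch of per-modality inputs, the argument list below assumes a Seurat object with two assays named "RNA" and "ATAC", in that order. The assay names, the chosen reduction methods, and the `multiome_object` placeholder are all assumptions made for illustration.

```r
# One entry per modality, in the same order throughout
multimodal_args <- list(
  use_assay = c("RNA", "ATAC"),       # assumed assay names
  atac = c(FALSE, TRUE),              # second modality is ATAC-seq
  reduction_method = c("PCA", "LSI")  # permitted values, one per modality
)

# `multiome_object` is a placeholder for a multi-modal Seurat object
if (requireNamespace("CHOIR", quietly = TRUE) && exists("multiome_object")) {
  multiome_object <- do.call(
    CHOIR::CHOIR,
    c(list(object = multiome_object), multimodal_args)
  )
}
```

Keeping every per-modality vector the same length and in the same order avoids mismatches such as applying ATAC-specific settings to the RNA assay.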