Run CHOIR clustering

This function runs CHOIR clustering to identify a set of clusters that represent statistically distinct populations. CHOIR is based on the premise that, if clusters contain biologically different cell types or states, a classifier that considers features present in cells from each cluster should be able to distinguish the clusters with a higher level of accuracy than classifiers trained on randomly permuted cluster labels.

Usage

CHOIR(
  object,
  key = "CHOIR",
  alpha = 0.05,
  p_adjust = "bonferroni",
  feature_set = "var",
  exclude_features = NULL,
  n_iterations = 100,
  n_trees = 50,
  use_variance = TRUE,
  min_accuracy = 0.5,
  min_connections = 1,
  max_repeat_errors = 20,
  distance_approx = TRUE,
  distance_awareness = 2,
  collect_all_metrics = FALSE,
  sample_max = Inf,
  downsampling_rate = "auto",
  min_reads = NULL,
  max_clusters = "auto",
  min_cluster_depth = 2000,
  normalization_method = "none",
  subtree_reductions = TRUE,
  reduction_method = NULL,
  reduction_params = list(),
  n_var_features = NULL,
  batch_correction_method = "none",
  batch_correction_params = list(),
  batch_labels = NULL,
  neighbor_params = list(),
  cluster_params = list(algorithm = 1, group.singletons = TRUE),
  use_assay = NULL,
  use_slot = NULL,
  ArchR_matrix = NULL,
  ArchR_depthcol = NULL,
  countsplit = FALSE,
  countsplit_suffix = NULL,
  reduction = NULL,
  var_features = NULL,
  atac = FALSE,
  n_cores = NULL,
  random_seed = 1,
  verbose = TRUE
)
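
As a minimal sketch of how a call might be assembled, the snippet below collects a few non-default arguments in a list, checks them against the permitted values documented under Arguments, and only calls CHOIR if the package is installed and an input object exists. The object name `seurat_obj` and the key "CHOIR_strict" are hypothetical placeholders, not part of the package.

```r
# Arguments that differ from the defaults; all other arguments keep their defaults.
choir_args <- list(
  key = "CHOIR_strict",    # hypothetical key name for this run
  alpha = 0.01,            # stricter significance level than the default 0.05
  p_adjust = "bonferroni",
  n_iterations = 100,
  n_cores = 8,
  random_seed = 1
)

# Basic sanity checks against the documented permitted values.
stopifnot(
  choir_args$p_adjust %in% c("bonferroni", "fdr", "none"),
  choir_args$alpha > 0, choir_args$alpha < 1
)

# `seurat_obj` is a hypothetical, already normalized Seurat object.
if (requireNamespace("CHOIR", quietly = TRUE) && exists("seurat_obj")) {
  seurat_obj <- do.call(CHOIR::CHOIR, c(list(object = seurat_obj), choir_args))
}
```

Keeping the arguments in a named list makes it easy to record or reuse the same settings across runs stored under different keys.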

Arguments

object: An object of class Seurat, SingleCellExperiment, or ArchRProject. For multi-omic data, we recommend using ArchRProject objects.
key: The name under which CHOIR-related data for this run is stored in the object. Defaults to “CHOIR”.
alpha: A numerical value indicating the significance level used for permutation test comparisons of cluster distinguishability. Defaults to 0.05. Decreasing the alpha value will yield more conservative clusters (fewer clusters) and will often decrease the computational time required, because fewer cluster comparisons may be needed.
p_adjust: A string indicating which multiple comparison adjustment method to use. Permitted values are “bonferroni”, “fdr”, and “none”. Defaults to “bonferroni”. Other correction methods may be less conservative, identifying more clusters, as CHOIR applies filters that reduce the total number of tests performed.
feature_set: A string indicating whether to train random forest classifiers on “all” features or only variable (“var”) features. Defaults to “var”. Computational time and memory required may increase if more features are used. Using all features instead of variable features may result in more conservative cluster calls.
exclude_features: A character vector indicating features that should be excluded from input to the random forest classifier. Defaults to NULL, which means that no features will be excluded. This parameter can be used, for example, to exclude features correlated with cell quality, such as mitochondrial genes. Failure to exclude problematic features could result in clusters driven by cell quality, while over-exclusion of features could reduce the ability of CHOIR to distinguish cell populations that differ by those features.
n_iterations: A numerical value indicating the number of iterations run for each permutation test comparison. Increasing the number of iterations will approximately linearly increase the computational time required but provide a more accurate estimation of the significance of the permutation test. Decreasing the number of iterations runs the risk of leading to underclustering due to lack of statistical power. The default value, 100 iterations, was selected because it avoids underclustering, while minimizing computational time and the diminishing returns from running CHOIR with additional iterations.
n_trees: A numerical value indicating the number of trees in each random forest. Defaults to 50. Increasing the number of trees is likely to increase the computational time required. Though not entirely predictable, increasing the number of trees up to a point may enable more nuanced distinctions, but is likely to provide diminishing returns.
use_variance: A Boolean value indicating whether to use the variance of the random forest accuracy scores as part of the permutation test threshold. Defaults to TRUE. Setting this parameter to FALSE will make CHOIR considerably less conservative, identifying more clusters, particularly on large datasets.
min_accuracy: A numerical value indicating the minimum accuracy required of the random forest classifier, below which clusters will be automatically merged. Defaults to 0.5, representing the random chance probability of assigning correct cluster labels; therefore, decreasing the minimum accuracy is not recommended. Increasing the minimum accuracy will lead to more conservative cluster assignments and will often decrease the computational time required, because fewer cluster comparisons may be needed.
min_connections: A numerical value indicating the minimum number of nearest neighbors between two clusters for those clusters to be considered adjacent. Non-adjacent clusters will not be merged. Defaults to 1. This threshold allows CHOIR to avoid running the full permutation test comparison for clusters that are highly likely to be distinct, saving computational time. Therefore, setting this parameter to 0 will increase the number of permutation test comparisons run and, thus, the computational time. The intent of this parameter is only to avoid running permutation test comparisons between clusters that are so different that they should not be merged. Therefore, we do not recommend increasing this parameter value beyond 10, as higher values may result in instances of overclustering.
max_repeat_errors: A numerical value indicating the maximum number of repeatedly mislabeled cells that will be taken into account during the permutation tests. This parameter is used to account for situations in which random forest classifier errors are concentrated among a few cells that are repeatedly misassigned. If set to 0, such repeat errors will not be evaluated. Defaults to 20. These situations are relatively infrequent, but setting this parameter to lower values (especially 0) may result in underclustering due to a small number of intermediate cells. Setting this parameter to higher values may lead to instances of overclustering and is not recommended.
distance_approx: A Boolean value indicating whether or not to use approximate distance calculations. Defaults to TRUE, which will use centroid-based distances. Setting distance approximation to FALSE will substantially increase the computational time and memory required, particularly for large datasets. Using approximated distances (TRUE) rather than absolute distances (FALSE) is unlikely to have a meaningful effect on the distance thresholds imposed by CHOIR.
distance_awareness: A numerical value representing the distance threshold above which a cluster will not merge with another cluster and significance testing will not be used. Specifically, this value is a multiplier applied to the distance between a cluster and its closest distinguishable neighbor based on random forest comparison. Defaults to 2, which sets this threshold at a two-fold increase in distance over the closest distinguishable neighbor. This threshold allows CHOIR to avoid running the full permutation test comparison for clusters that are highly likely to be distinct, saving computational time. To omit all distance calculations and perform permutation testing on all comparisons, set this parameter to FALSE. Setting this parameter to FALSE or increasing the input value will increase the number of permutation test comparisons run and, thus, the computational time. In rare cases, very small distant clusters may be erroneously merged when distance thresholds are not used. The intent of this parameter is only to avoid running permutation test comparisons between clusters that are so different that they should not be merged. We do not recommend decreasing this parameter value below 1.5, as lower values may result in instances of overclustering.
collect_all_metrics: A Boolean value indicating whether to collect and save additional metrics from the random forest classifiers, including feature importances for every comparison. Defaults to FALSE. Setting this parameter to TRUE will slightly increase the computational time required. This parameter has no effect on the final cluster calls.
sample_max: A numerical value indicating the maximum number of cells to be sampled per cluster to train/test each random forest classifier. Defaults to Inf (infinity), which does not cap the number of cells used, so all cells will be used in all comparisons. Decreasing this parameter may decrease the computational time required, but may result in instances of underclustering. If input is provided to both the downsampling_rate and sample_max parameters, the minimum resulting cell number is calculated and used for each comparison.
downsampling_rate: A numerical value indicating the proportion of cells to be sampled per cluster to train/test each random forest classifier. For efficiency, the default value, "auto", sets the downsampling rate according to the dataset size. Decreasing this parameter may decrease the computational time required, but may also make the final cluster calls more conservative. If input is provided to both downsampling_rate and sample_max parameters, the minimum resulting cell number is calculated and used for each comparison.
min_reads: A numeric value used to filter out features prior to input to the random forest classifier. The default value, NULL, will filter out features with 0 counts for the current clusters being compared. Higher values should be used with caution, but may increase the signal-to-noise ratio encountered by the random forest classifiers.
max_clusters: Indicates the extent to which the hierarchical clustering tree will be expanded. Defaults to “auto”, which will expand the tree until cases of underclustering have been eliminated in all branches. Alternatively, supply a numerical value indicating the maximum number of clusters to which the tree should be expanded. Using the default value for this parameter is highly recommended to avoid instances of underclustering. Setting a numerical value in this parameter hampers the ability of CHOIR to ensure that underclustering has not occurred.
min_cluster_depth: A numerical value indicating the maximum cluster size at the bottom of the clustering tree, prior to pruning branches. Defaults to 2000. Increasing this parameter can cause a computational bottleneck when generating the initial clustering tree for some large datasets; therefore, the default value is recommended. However, changing this value is unlikely to have meaningful effects on the final cluster results.
normalization_method: A character string or vector indicating which normalization method to use. In general, input data should be supplied to CHOIR after normalization, except when the user wishes to use Seurat SCTransform normalization. Permitted values are “none” or “SCTransform”. Defaults to “none”. Because CHOIR has not been tested thoroughly with SCTransform normalization, we do not recommend this approach at this time. For multi-omic datasets, provide a vector with a value corresponding to each provided value of use_assay or ArchR_matrix in the same order.
subtree_reductions: A Boolean value indicating whether to generate a new dimensionality reduction and set of highly variable features for each subtree. Defaults to TRUE, which enables CHOIR to compare similar clusters using a more nuanced set of features. Setting this parameter to FALSE may decrease computational time, but may result in instances of underclustering.
reduction_method: A character string or vector indicating which dimensionality reduction method to use. Permitted values are “PCA” for principal component analysis, “LSI” for latent semantic indexing, and “IterativeLSI” for iterative latent semantic indexing. These three methods implement the Seurat function RunPCA, the Signac function RunSVD, and the ArchR function addIterativeLSI, respectively. The default value, NULL, will select a method based on the input data type, specifically “IterativeLSI” for ArchR objects, “LSI” for Seurat or SingleCellExperiment objects when parameter atac is TRUE, and “PCA” in all other cases. For multi-omic datasets, provide a vector with a value corresponding to each provided value of use_assay or ArchR_matrix in the same order.
reduction_params: A list of additional parameters to be passed to the selected dimensionality reduction method. By default, CHOIR will use the default parameter settings of the dimensionality reduction method indicated by the input to parameter reduction_method. Input to this parameter is passed to each downstream dimensionality reduction method and will overwrite or augment those defaults. Altering the performance of the dimensionality reduction in CHOIR will affect downstream clustering results, but not in ways that are easily predictable.
n_var_features: A numerical value indicating how many variable features to identify. Defaults to 2000 features for most data inputs, or 25000 features for ATAC-seq data. Increasing the number of features may increase the computational time and memory required. If the provided value is substantially higher or lower than the default, instances of underclustering may occur. For multi-omic datasets, provide a vector with a value corresponding to each provided value of use_assay or ArchR_matrix in the same order.
batch_correction_method: A character string indicating which batch correction method to use. Permitted values are “Harmony” and “none”. Defaults to “none”. Batch correction should only be used when the different batches are not expected to also have unique cell types or cell states. Using batch correction would ensure that clusters do not originate from a single batch, thereby making the final cluster calls more conservative.
batch_correction_params: A list of additional parameters to be passed to the selected batch correction method for each iteration. Only applicable when batch_correction_method is “Harmony”.
batch_labels: A character string that, if applying batch correction, specifies the name of the column in the input object metadata containing the batch labels. Defaults to NULL.
neighbor_params: A list of additional parameters to be passed to Seurat function FindNeighbors (or, in the case of multi-modal data for Seurat or SingleCellExperiment objects, Seurat function FindMultiModalNeighbors).
cluster_params: A list of additional parameters to be passed to Seurat function FindClusters for clustering at each level of the tree. By default, when the Seurat::FindClusters parameter group.singletons is set to TRUE, CHOIR relabels clusters such that each singleton constitutes its own cluster.
use_assay: For Seurat or SingleCellExperiment objects, a character string or vector indicating the assay(s) to use in the provided object. The default value, NULL, will choose the current active assay for Seurat objects and the logcounts assay for SingleCellExperiment objects.
use_slot: For Seurat objects, a character string or vector indicating the layer(s), previously known as slot(s), to use in the provided object. The default value, NULL, will choose a layer/slot based on the selected assay. If an assay other than "RNA", "sketch", "SCT", or "integrated" is provided, you must specify a value for use_slot. For multi-omic datasets, provide a vector with a value corresponding to each provided value of use_assay in the same order.
ArchR_matrix: For ArchR objects, a character string or vector indicating which matrix or matrices to use in the provided object. The default value, NULL, will use the “GeneScoreMatrix” for ATAC-seq data or the “GeneExpressionMatrix” for RNA-seq data. For multi-omic datasets, provide a vector with a value corresponding to each modality. When "GeneScoreMatrix" is provided, the "GeneScoreMatrix" will be used as input to the random forest classifiers, but the "TileMatrix" will be used for the initial dimensionality reduction(s).
ArchR_depthcol: For ArchR objects, a character string or vector indicating which column to use for correlation with sequencing depth. The default value, NULL, will use the “nFrags” column for ATAC-seq data or the “Gex_nUMI” column for RNA-seq data. For multi-omic datasets, provide a vector with a value corresponding to each provided value of ArchR_matrix in the same order.
countsplit: A Boolean value indicating whether or not to use count split input data (see countsplit package), such that one matrix of counts is used for clustering tree generation and a separate matrix is used for all random forest classifier permutation testing. Defaults to FALSE. Enabling count splitting is likely to result in more conservative final cluster calls and is likely to perform best in datasets with high read depths.
countsplit_suffix: A character vector indicating the suffixes that distinguish the two count split matrices to be used. Suffixes are appended onto the input string/vector for parameter use_slot for Seurat objects, use_assay for SingleCellExperiment objects, or ArchR_matrix for ArchR objects. When count splitting is enabled, the default value NULL uses suffixes "_1" and "_2".
reduction: An optional matrix of dimensionality reduction cell embeddings provided by the user for subsequent clustering steps. By default, this parameter is set to NULL, and the dimensionality reduction(s) will be calculated using the method specified by the reduction_method parameter.
var_features: An optional character vector of names of variable features to be used for subsequent clustering steps. By default, this parameter is set to NULL, and variable features will be calculated as part of running CHOIR. Input to this parameter is required when a dimensionality reduction is supplied to parameter reduction. For multi-omic datasets, concatenate feature names for all modalities.
atac: A Boolean value or vector indicating whether the provided data is ATAC-seq data. For multi-omic datasets, provide a vector with a value corresponding to each provided value of use_assay or ArchR_matrix in the same order. Defaults to FALSE.
n_cores: A numerical value indicating the number of cores to use for parallelization. By default, CHOIR will use the number of available cores minus 2. CHOIR is parallelized at the computation of permutation test iterations. Therefore, any number of cores up to the number of iterations will theoretically decrease the computational time required. In practice, 8–16 cores are recommended for datasets up to 500,000 cells.
random_seed: A numerical value indicating the random seed to be used. Defaults to 1. CHOIR uses randomization throughout the generation and pruning of the clustering tree. Therefore, changing the random seed may yield slight differences in the final cluster assignments.
verbose: A Boolean value indicating whether to use verbose output during the execution of CHOIR. Defaults to TRUE, but can be set to FALSE for a cleaner output.
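
To illustrate how the choice of p_adjust can change which comparisons remain significant at a given alpha, the following sketch applies the three permitted methods to a handful of toy p-values. It uses base R's p.adjust() as a stand-in and assumes that CHOIR's adjustments follow these standard definitions; the p-values are made up for illustration and are not CHOIR output.

```r
# Hypothetical raw p-values from four cluster comparisons (toy numbers).
p_raw <- c(0.001, 0.012, 0.03, 0.2)
alpha <- 0.05

# Apply each permitted adjustment method using base R's p.adjust().
adjusted <- sapply(
  c("bonferroni", "fdr", "none"),
  function(method) p.adjust(p_raw, method = method)
)

# Number of comparisons that remain significant under each method.
n_significant <- colSums(adjusted < alpha)
print(adjusted)
print(n_significant)
```

In this toy case, fewer comparisons pass the threshold under "bonferroni" than under "fdr" or "none", which matches the note above that other correction methods may be less conservative.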

Value

Returns the object with the following added data stored under the provided key:

cell_IDs: Cell IDs belonging to each subtree
clusters: Final clusters, full hierarchical cluster tree, and stepwise cluster results for each progressive pruning step
graph: All calculated nearest neighbor and shared nearest neighbor adjacency matrices
parameters: Record of parameter values used
records: Metadata for decision points during hierarchical tree construction, all recorded permutation test comparisons, and feature importance scores from all comparisons
reduction: Cell embeddings for all calculated dimensionality reductions
var_features: Variable features for all calculated dimensionality reductions

Details

First, a hierarchical clustering tree is generated using a top-down approach that proceeds from an initial partition, in which all cells are in the same cluster, to a partition in which all cells are demonstrably overclustered. Second, to identify a final set of clusters, this hierarchical clustering tree is pruned from the bottom up using a framework of random forest classifiers and permutation tests.
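
The premise behind the pruning step can be sketched with a toy permutation test in base R. This is an illustration under simplifying assumptions, not CHOIR's implementation: a nearest-centroid classifier stands in for the random forest, the data are simulated, and the p-value is a plain one-sided comparison that ignores CHOIR's variance-based threshold, distance filters, and repeat-error handling. The helper `centroid_accuracy` is hypothetical.

```r
set.seed(1)

# Toy data: two clusters of 100 "cells" with 5 features and shifted means.
n <- 100
x <- rbind(
  matrix(rnorm(n * 5, mean = 0), ncol = 5),
  matrix(rnorm(n * 5, mean = 1), ncol = 5)
)
labels <- rep(c("A", "B"), each = n)

# Held-out accuracy of a nearest-centroid classifier
# (a simple stand-in for a random forest classifier).
centroid_accuracy <- function(x, labels) {
  train <- sample(nrow(x), size = floor(nrow(x) / 2))
  test <- setdiff(seq_len(nrow(x)), train)
  centroid_a <- colMeans(x[train[labels[train] == "A"], , drop = FALSE])
  centroid_b <- colMeans(x[train[labels[train] == "B"], , drop = FALSE])
  dist_a <- rowSums(sweep(x[test, , drop = FALSE], 2, centroid_a)^2)
  dist_b <- rowSums(sweep(x[test, , drop = FALSE], 2, centroid_b)^2)
  predicted <- ifelse(dist_a < dist_b, "A", "B")
  mean(predicted == labels[test])
}

# Accuracy with the true labels versus randomly permuted labels.
n_iterations <- 100
observed <- replicate(n_iterations, centroid_accuracy(x, labels))
permuted <- replicate(n_iterations, centroid_accuracy(x, sample(labels)))

# One-sided permutation p-value with the usual +1 correction.
p_value <- (sum(permuted >= mean(observed)) + 1) / (n_iterations + 1)

cat("mean accuracy, true labels:    ", mean(observed), "\n")
cat("mean accuracy, permuted labels:", mean(permuted), "\n")
cat("permutation p-value:           ", p_value, "\n")
```

When the two groups differ, the classifier trained on the true labels is far more accurate than classifiers trained on permuted labels, so the clusters would be kept separate. If the accuracies were indistinguishable, the clusters would be candidates for merging.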

For multi-modal data, optionally supply parameter inputs as vectors/lists that sequentially specify the value for each modality.
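
As a rough sketch of what per-modality inputs might look like for a Seurat object with two assays, the snippet below builds the vectors in matching order and checks that each has one entry per modality. The assay names "RNA" and "ATAC", the layer name "data", and the object name `multiome_obj` are assumptions made for illustration.

```r
# Hypothetical per-modality settings; entries are ordered to match use_assay.
modality_args <- list(
  use_assay = c("RNA", "ATAC"),        # hypothetical assay names
  use_slot = c("data", "data"),        # must be given for assays such as "ATAC"
  atac = c(FALSE, TRUE),
  reduction_method = c("PCA", "LSI"),
  n_var_features = c(2000, 25000)
)

# Every per-modality vector should have one entry per modality.
n_modalities <- length(modality_args$use_assay)
stopifnot(all(lengths(modality_args) == n_modalities))

# `multiome_obj` is a hypothetical Seurat object containing both assays.
if (requireNamespace("CHOIR", quietly = TRUE) && exists("multiome_obj")) {
  multiome_obj <- do.call(CHOIR::CHOIR, c(list(object = multiome_obj), modality_args))
}
```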