
Identifies a final set of clusters by iteratively pruning the provided hierarchical clustering tree from the bottom up, using a framework of random forest classifiers and permutation tests.

Usage

pruneTree(
  object,
  key = "CHOIR",
  alpha = NULL,
  p_adjust = NULL,
  feature_set = NULL,
  exclude_features = NULL,
  n_iterations = NULL,
  n_trees = NULL,
  use_variance = NULL,
  min_accuracy = NULL,
  min_connections = NULL,
  max_repeat_errors = NULL,
  distance_approx = NULL,
  distance_awareness = 2,
  collect_all_metrics = FALSE,
  sample_max = NULL,
  downsampling_rate = NULL,
  normalization_method = NULL,
  batch_correction_method = NULL,
  batch_labels = NULL,
  cluster_params = NULL,
  use_assay = NULL,
  cluster_tree = NULL,
  input_matrix = NULL,
  nn_matrix = NULL,
  dist_matrix = NULL,
  reduction = NULL,
  n_cores = NULL,
  random_seed = NULL,
  verbose = TRUE
)

Arguments

object

An object of class 'Seurat', 'SingleCellExperiment', or 'ArchRProject'.

key

The name under which CHOIR-related data for this run is stored in the object. Defaults to 'CHOIR'.

alpha

A numeric value indicating the significance level used for permutation test comparisons of cluster prediction accuracies. Defaults to 0.05.

p_adjust

A string indicating which multiple comparison adjustment to use. Permitted values are 'bonferroni', 'fdr', and 'none'. Defaults to 'bonferroni'.
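To illustrate how the choice of adjustment interacts with alpha, here is a minimal sketch using base R's stats::p.adjust(). The p-values are invented for illustration, and whether CHOIR calls p.adjust() internally is an assumption; the sketch only shows the arithmetic behind the three permitted options.

```r
# Hypothetical raw p-values from four cluster comparisons (illustrative only).
p_raw <- c(0.001, 0.01, 0.02, 0.04)
alpha <- 0.05

# Apply each adjustment permitted by `p_adjust`.
p_bonferroni <- p.adjust(p_raw, method = "bonferroni")
p_fdr <- p.adjust(p_raw, method = "fdr")
p_none <- p.adjust(p_raw, method = "none")

# Count the comparisons that remain significant at `alpha` under each option.
n_significant <- c(
  bonferroni = sum(p_bonferroni < alpha),
  fdr = sum(p_fdr < alpha),
  none = sum(p_none < alpha)
)
print(n_significant)
```

With these values, Bonferroni keeps two comparisons significant while 'fdr' and 'none' keep all four, which is why 'bonferroni' is the most conservative of the three settings.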

feature_set

A string indicating whether to train random forest classifiers on 'all' features or only variable ('var') features. Defaults to 'var'.

exclude_features

A character vector indicating features that should be excluded from input to the random forest classifier. Default = NULL will not exclude any features.

n_iterations

A numeric value indicating the number of iterations run for each permutation test comparison. Defaults to 100.

n_trees

A numeric value indicating the number of trees in each random forest. Defaults to 50.

use_variance

A boolean value indicating whether to use the variance of the random forest accuracy scores as part of the permutation test threshold. Defaults to TRUE.

min_accuracy

A numeric value indicating the minimum accuracy required of the random forest classifier, below which clusters will be automatically merged. Defaults to 0.5 (chance).

min_connections

A numeric value indicating the minimum number of nearest neighbors between two clusters for them to be considered 'adjacent'. Non-adjacent clusters will not be merged. Defaults to 1.
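The adjacency check described above can be sketched with a toy nearest-neighbor matrix. The matrix, the cluster labels, and the helper count_connections() are all hypothetical, and counting edges in both directions is an assumption rather than a statement of how CHOIR tallies connections.

```r
# Toy nearest-neighbor adjacency matrix for six cells (1 = neighbor edge).
cells <- paste0("cell", 1:6)
nn <- matrix(0, nrow = 6, ncol = 6, dimnames = list(cells, cells))
nn["cell1", "cell2"] <- 1  # edge within cluster A
nn["cell2", "cell4"] <- 1  # edge from A to B
nn["cell3", "cell5"] <- 1  # edge from A to B
nn["cell5", "cell6"] <- 1  # edge within cluster B

# Hypothetical cluster assignment, in the same order as the matrix rows.
cluster_ids <- c(cell1 = "A", cell2 = "A", cell3 = "A",
                 cell4 = "B", cell5 = "B", cell6 = "B")

# Hypothetical helper: count neighbor edges linking cluster `a` and cluster `b`,
# in either direction.
count_connections <- function(nn, cluster_ids, a, b) {
  in_a <- cluster_ids == a
  in_b <- cluster_ids == b
  sum(nn[in_a, in_b, drop = FALSE]) + sum(nn[in_b, in_a, drop = FALSE])
}

min_connections <- 1
n_conn <- count_connections(nn, cluster_ids, "A", "B")
adjacent <- n_conn >= min_connections
```

In this toy case the two clusters share two edges, so they count as adjacent for any min_connections of 2 or less.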

max_repeat_errors

Used to account for situations in which random forest classifier errors are concentrated among a few cells that are repeatedly misassigned. A numeric value indicating the maximum number of such 'repeat errors' that will be taken into account. If set to 0, 'repeat errors' will not be evaluated. Defaults to 20.

distance_approx

A boolean value indicating whether or not to use approximate distance calculations. Default = TRUE will use centroid-based distances.

distance_awareness

A numeric value representing the distance threshold above which a cluster will not merge with another cluster. Specifically, this value is multiplied by the distance between a cluster and its closest distinguishable neighbor to set the threshold. Default = 2 sets this threshold at a 2-fold increase in distance. Alternatively, to omit all distance calculations, set to FALSE.
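As a worked example of the threshold arithmetic, the following sketch uses invented distances; the variable names are hypothetical and are not CHOIR internals.

```r
# Illustrative distances measured from one cluster (made-up values).
dist_to_closest_distinguishable <- 1.5  # closest distinguishable neighbor
dist_to_candidate <- 4.0                # cluster being considered for a merge

distance_awareness <- 2

# The threshold is the multiplier times the distance to the closest
# distinguishable neighbor.
threshold <- distance_awareness * dist_to_closest_distinguishable

# A candidate farther away than the threshold is not merged.
merge_allowed <- dist_to_candidate <= threshold
```

Here the threshold is 3, so a candidate cluster at distance 4 would not be merged.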

collect_all_metrics

A boolean value indicating whether to collect and save additional metrics from the random forest classifier comparisons, including feature importances and tree depth. Defaults to FALSE.

sample_max

A numeric value indicating the maximum number of cells used per cluster to train/test each random forest classifier. Default = Inf does not cap the number of cells used.

downsampling_rate

A numeric value indicating the proportion of cells used per cluster to train/test each random forest classifier. Default = "auto" sets the downsampling rate according to the dataset size, for efficiency.
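One plausible reading of how sample_max and downsampling_rate interact is sketched below: the rate is applied first and the result is then capped. This ordering, the rounding, and the helper cells_used() are assumptions made for illustration, not CHOIR's actual implementation.

```r
# Hypothetical helper: number of cells drawn from a cluster of `cluster_size`
# cells, assuming the rate is applied first and `sample_max` then caps it.
cells_used <- function(cluster_size, downsampling_rate = 1, sample_max = Inf) {
  min(floor(cluster_size * downsampling_rate), sample_max)
}

n_half <- cells_used(1000, downsampling_rate = 0.5)                      # rate only
n_capped <- cells_used(1000, downsampling_rate = 0.5, sample_max = 200)  # cap applies
n_all <- cells_used(150)                                                 # no downsampling
```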

normalization_method

A character string or vector indicating which normalization method to use. In general, input data should be supplied to CHOIR after normalization, except in cases when the user wishes to use Seurat::SCTransform() normalization. Permitted values are 'none' or 'SCTransform'. Defaults to 'none'.

batch_correction_method

A character string or vector indicating which batch correction method to use. Permitted values are 'Harmony' and 'none'. Defaults to 'none'.

batch_labels

If applying batch correction, a character string or vector indicating the name of the column containing the batch labels. Defaults to NULL.

cluster_params

A list of additional parameters to be passed to Seurat::FindClusters() for clustering at each level of the tree. Note that if group.singletons is set to TRUE, CHOIR relabels initial clusters such that each singleton constitutes its own cluster.

use_assay

For Seurat or SingleCellExperiment objects, a character string or vector indicating the assay(s) to use in the provided object. Default = NULL will choose the current active assay for Seurat objects and the logcounts assay for SingleCellExperiment objects.

cluster_tree

An optional dataframe containing the cluster IDs of each cell across the levels of a hierarchical clustering tree. Default = NULL will use the hierarchical clustering tree generated by function buildTree().

input_matrix

An optional matrix containing the feature x cell data on which to train the random forest classifiers. Default = NULL will use the feature x cell matrix (or matrices) indicated by function buildTree().

nn_matrix

An optional matrix containing the nearest neighbor adjacency of the cells. Default = NULL will look for the adjacency matrix (or matrices) generated by function buildTree().

dist_matrix

An optional distance matrix of cell-to-cell distances (based on dimensionality reduction cell embeddings). Default = NULL will look for the distance matrix (or matrices) generated by function buildTree().

reduction

An optional matrix of dimensionality reduction cell embeddings to be used for distance calculations. Default = NULL will look for the dimensionality reductions generated by function buildTree().

n_cores

A numeric value indicating the number of cores to use for parallelization. Default = NULL will use the number of available cores minus 2.
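The "available cores minus 2" default can be approximated in base R as follows. The helper default_n_cores() is hypothetical, and the floor of one core (and the fallback when the core count cannot be detected) is an assumption added so the sketch behaves sensibly on small machines.

```r
# Hypothetical helper mirroring the documented default: available cores minus 2.
default_n_cores <- function() {
  available <- parallel::detectCores()
  # detectCores() can return NA on some platforms; fall back to one core.
  if (is.na(available)) {
    return(1L)
  }
  # Assumption: never return fewer than one core.
  max(1L, available - 2L)
}

n_cores <- default_n_cores()
```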

random_seed

A numeric value indicating the random seed to be used.

verbose

A boolean value indicating whether to use verbose output during the execution of this function. Can be set to FALSE for a cleaner output.

Value

Returns the object with the following added data stored under the provided key:

clusters

Final clusters and stepwise cluster results for each progressive pruning step

parameters

Record of parameter values used

records

Metadata for all recorded permutation test comparisons and feature importance scores from all comparisons

Details

If CHOIR::buildTree() was run prior to this function, most parameters will be retrieved from the object. Alternatively, parameter values can be supplied. For multi-modal data, optionally supply parameter inputs as vectors/lists that sequentially specify the value for each modality.
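A minimal sketch of this two-step workflow is shown below. The wrapper run_choir_pruning() is hypothetical, `object` stands for an already-normalized object supplied by the caller, and the assumption that buildTree() accepts the same `key` and `n_cores` arguments as pruneTree() should be checked against the buildTree() reference page.

```r
# Hypothetical wrapper: build the clustering tree, then prune it.
# `object` must be a Seurat, SingleCellExperiment, or ArchRProject object.
run_choir_pruning <- function(object, key = "CHOIR", alpha = NULL, n_cores = 2) {
  if (!requireNamespace("CHOIR", quietly = TRUE)) {
    stop("The CHOIR package must be installed to run this sketch.")
  }
  # Build the hierarchical clustering tree; results are stored under `key`.
  object <- CHOIR::buildTree(object, key = key, n_cores = n_cores)
  # Prune the tree. Leaving `alpha` as NULL retrieves the value recorded by
  # buildTree(); passing a number supplies it explicitly instead.
  object <- CHOIR::pruneTree(object, key = key, alpha = alpha, n_cores = n_cores)
  object
}
```

Calling run_choir_pruning(object) would rely on the parameters stored by buildTree(), whereas run_choir_pruning(object, alpha = 0.01) would supply a stricter significance level for the pruning step.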