Prune clustering tree using random forest classifiers
pruneTree.Rd
To identify a final set of clusters, this function iteratively prunes the provided hierarchical clustering tree from the bottom up, using a framework of random forest classifiers and permutation tests.
Usage
pruneTree(
object,
key = "CHOIR",
alpha = NULL,
p_adjust = NULL,
feature_set = NULL,
exclude_features = NULL,
n_iterations = NULL,
n_trees = NULL,
use_variance = NULL,
min_accuracy = NULL,
min_connections = NULL,
max_repeat_errors = NULL,
distance_approx = NULL,
distance_awareness = 2,
collect_all_metrics = FALSE,
sample_max = NULL,
downsampling_rate = NULL,
normalization_method = NULL,
batch_correction_method = NULL,
batch_labels = NULL,
cluster_params = NULL,
use_assay = NULL,
cluster_tree = NULL,
input_matrix = NULL,
nn_matrix = NULL,
dist_matrix = NULL,
reduction = NULL,
n_cores = NULL,
random_seed = NULL,
verbose = TRUE
)
Arguments
- object
An object of class 'Seurat', 'SingleCellExperiment', or 'ArchRProject'.
- key
The name under which CHOIR-related data for this run is stored in the object. Defaults to 'CHOIR'.
- alpha
A numeric value indicating the significance level used for permutation test comparisons of cluster prediction accuracies. Defaults to 0.05.
- p_adjust
A string indicating which multiple comparison adjustment to use. Permitted values are 'bonferroni', 'fdr', and 'none'. Defaults to 'bonferroni'.
- feature_set
A string indicating whether to train random forest classifiers on 'all' features or only variable ('var') features. Defaults to 'var'.
- exclude_features
A character vector indicating features that should be excluded from input to the random forest classifier. Default = NULL will not exclude any features.
- n_iterations
A numeric value indicating the number of iterations run for each permutation test comparison. Defaults to 100.
- n_trees
A numeric value indicating the number of trees in each random forest. Defaults to 50.
- use_variance
A boolean value indicating whether to use the variance of the random forest accuracy scores as part of the permutation test threshold. Defaults to TRUE.
- min_accuracy
A numeric value indicating the minimum accuracy required of the random forest classifier, below which clusters will be automatically merged. Defaults to 0.5 (chance).
- min_connections
A numeric value indicating the minimum number of nearest neighbors between two clusters for them to be considered 'adjacent'. Non-adjacent clusters will not be merged. Defaults to 1.
- max_repeat_errors
A numeric value indicating the maximum number of 'repeat errors' that will be taken into account. Repeat errors arise when random forest classifier errors are concentrated among a few cells that are repeatedly misassigned. If set to 0, 'repeat errors' will not be evaluated. Defaults to 20.
- distance_approx
A boolean value indicating whether or not to use approximate distance calculations. Default = TRUE will use centroid-based distances.
- distance_awareness
A numeric value representing the distance threshold above which a cluster will not merge with another cluster. Specifically, this value is multiplied by the distance between a cluster and its closest distinguishable neighbor to set the threshold. Default = 2 sets this threshold at a 2-fold increase in distance. Alternately, to omit all distance calculations, set to FALSE.
- collect_all_metrics
A boolean value indicating whether to collect and save additional metrics from the random forest classifier comparisons, including feature importances and tree depth. Defaults to FALSE.
- sample_max
A numeric value indicating the maximum number of cells used per cluster to train/test each random forest classifier. Default = Inf does not cap the number of cells used.
- downsampling_rate
A numeric value indicating the proportion of cells used per cluster to train/test each random forest classifier. Default = "auto" sets the downsampling rate according to the dataset size, for efficiency.
- normalization_method
A character string or vector indicating which normalization method to use. In general, input data should be supplied to CHOIR after normalization, except in cases when the user wishes to use Seurat::SCTransform() normalization. Permitted values are 'none' or 'SCTransform'. Defaults to 'none'.
- batch_correction_method
A character string or vector indicating which batch correction method to use. Permitted values are 'Harmony' and 'none'. Defaults to 'none'.
- batch_labels
If applying batch correction, a character string or vector indicating the name of the column containing the batch labels. Defaults to NULL.
- cluster_params
A list of additional parameters to be passed to Seurat::FindClusters() for clustering at each level of the tree. Note that if group.singletons is set to TRUE, CHOIR relabels initial clusters such that each singleton constitutes its own cluster.
- use_assay
For Seurat or SingleCellExperiment objects, a character string or vector indicating the assay(s) to use in the provided object. Default = NULL will choose the current active assay for Seurat objects and the logcounts assay for SingleCellExperiment objects.
- cluster_tree
An optional dataframe containing the cluster IDs of each cell across the levels of a hierarchical clustering tree. Default = NULL will use the hierarchical clustering tree generated by function buildTree().
- input_matrix
An optional matrix containing the feature x cell data on which to train the random forest classifiers. Default = NULL will use the feature x cell matri(ces) indicated by function buildTree().
- nn_matrix
An optional matrix containing the nearest neighbor adjacency of the cells. Default = NULL will look for the adjacency matri(ces) generated by function buildTree().
- dist_matrix
An optional distance matrix of cell to cell distances (based on dimensionality reduction cell embeddings). Default = NULL will look for the distance matri(ces) generated by function buildTree().
- reduction
An optional matrix of dimensionality reduction cell embeddings to be used for distance calculations. Default = NULL will look for the dimensionality reductions generated by function buildTree().
- n_cores
A numeric value indicating the number of cores to use for parallelization. Default = NULL will use the number of available cores minus 2.
- random_seed
A numeric value indicating the random seed to be used.
- verbose
A boolean value indicating whether to use verbose output during the execution of this function. Can be set to FALSE for a cleaner output.
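To give a feel for how the alpha and p_adjust arguments interact, the following base R sketch applies the three permitted adjustment methods to a set of made-up p-values using stats::p.adjust(). The p-values are hypothetical and this is not CHOIR's internal code; it only illustrates how the choice of adjustment changes which comparisons fall below the significance level.

```r
# Hypothetical raw p-values from five cluster-vs-cluster comparisons
# (illustrative numbers only, not output from CHOIR).
p_raw <- c(0.001, 0.012, 0.02, 0.045, 0.3)
alpha <- 0.05

# Apply each adjustment method named in the p_adjust argument description.
methods_to_try <- c("bonferroni", "fdr", "none")
adjusted <- sapply(methods_to_try, function(m) p.adjust(p_raw, method = m))

# Count how many comparisons remain significant at alpha under each method.
n_significant <- colSums(adjusted < alpha)
print(n_significant)
```

With these numbers, 'bonferroni' is the most conservative choice and 'none' the least, so a stricter adjustment leaves fewer comparisons significant at the same alpha.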
Value
Returns the object with the following added data stored under the provided key:
- clusters
Final clusters and stepwise cluster results for each progressive pruning step
- parameters
Record of parameter values used
- records
Metadata for all recorded permutation test comparisons and feature importance scores from all comparisons
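As a rough sketch of how the returned data might be inspected, the helper below pulls the three elements listed above out of an object. It assumes a Seurat-style object in which CHOIR data is kept as a named list in the misc slot under the provided key; that storage location is an assumption not stated on this page, and get_choir_results is a hypothetical helper name, not a CHOIR function. Other object classes would need a different accessor.

```r
# Hypothetical helper: retrieve the elements stored under `key`.
# ASSUMPTION: the object has a `misc` slot holding a named list per key,
# as in Seurat objects; this page does not confirm where CHOIR stores data.
get_choir_results <- function(object, key = "CHOIR") {
  stored <- object@misc[[key]]
  if (is.null(stored)) {
    stop("No data found under key '", key, "'.")
  }
  # Keep only the elements described in the Value section, if present.
  wanted <- c("clusters", "parameters", "records")
  stored[intersect(wanted, names(stored))]
}
```

For example, names(get_choir_results(object)) would list whichever of these elements are present for the default key.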
Details
If CHOIR::buildTree() was run prior to this function, most parameters will be retrieved from the object. Alternately, parameter values can be supplied. For multi-modal data, optionally supply parameter inputs as vectors/lists that sequentially specify the value for each modality.
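The workflow described here can be sketched as a small wrapper that builds the tree and then prunes it. This is a minimal sketch, assuming that object is an already preprocessed Seurat, SingleCellExperiment, or ArchRProject object and that buildTree() accepts the same key argument; run_choir_pruning is a hypothetical helper name. Only arguments documented on this page are passed to pruneTree().

```r
# Hypothetical wrapper: build the clustering tree, then prune it.
# ASSUMPTION: `object` is a preprocessed object of a supported class,
# and CHOIR::buildTree() takes a `key` argument like pruneTree() does.
run_choir_pruning <- function(object, key = "CHOIR") {
  if (!requireNamespace("CHOIR", quietly = TRUE)) {
    stop("The CHOIR package is required for this sketch.")
  }
  # Step 1: generate the hierarchical clustering tree and store it under `key`.
  object <- CHOIR::buildTree(object, key = key)
  # Step 2: prune the tree. Arguments left as NULL are retrieved from the
  # object; here alpha and p_adjust are supplied explicitly for clarity.
  object <- CHOIR::pruneTree(
    object,
    key = key,
    alpha = 0.05,
    p_adjust = "bonferroni",
    verbose = FALSE
  )
  object
}
```

Because most parameters are retrieved from the object after buildTree(), the explicit alpha and p_adjust values here are optional and simply restate the documented defaults.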