Compare any two clusters using CHOIR's random forest classifier permutation testing approach
compareClusters.Rd
This function takes two provided clusters and assesses whether they are distinguishable, using a permutation test on random forest classifier prediction accuracies.
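The general idea of a permutation test on classifier accuracies can be sketched in base R. This is a generic illustration, not CHOIR's internal implementation: the function names `permutation_p_value()` and `decide()` are hypothetical, the accuracies are simulated rather than produced by random forests, and the mapping of a significant result to "split" is an assumption based on the return values documented below.

```r
# Generic permutation-test sketch on classifier accuracies.
# Illustrative only; all names are hypothetical and this is not CHOIR's code.
permutation_p_value <- function(acc_true, acc_permuted) {
  # Share of permuted-label accuracies that match or exceed the mean
  # true-label accuracy; the +1 terms keep the p-value above zero.
  observed <- mean(acc_true)
  (sum(acc_permuted >= observed) + 1) / (length(acc_permuted) + 1)
}

decide <- function(acc_true, acc_permuted, alpha = 0.05, min_accuracy = 0.5) {
  # Classifiers that do no better than the accuracy floor give no evidence
  # that the clusters differ.
  if (mean(acc_true) < min_accuracy) {
    return("merge")
  }
  p <- permutation_p_value(acc_true, acc_permuted)
  if (p < alpha) "split" else "merge"
}

set.seed(1)
n_iterations <- 100

# Simulated accuracies standing in for random forest results.
acc_true     <- rnorm(n_iterations, mean = 0.90, sd = 0.03)  # true labels
acc_permuted <- rnorm(n_iterations, mean = 0.50, sd = 0.03)  # shuffled labels

p <- permutation_p_value(acc_true, acc_permuted)
decision <- decide(acc_true, acc_permuted)
```

With accuracies this well separated, no permuted value reaches the observed mean, so the p-value is the smallest one the test can return for 100 iterations.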
Usage
compareClusters(
object = NULL,
key = "CHOIR",
cluster1_cells = NULL,
cluster2_cells = NULL,
ident1 = NULL,
ident2 = NULL,
group_by = NULL,
alpha = 0.05,
feature_set = "var",
exclude_features = NULL,
n_iterations = 100,
n_trees = 50,
use_variance = TRUE,
min_accuracy = 0.5,
min_connections = 1,
max_repeat_errors = 20,
collect_all_metrics = FALSE,
sample_max = Inf,
downsampling_rate = "auto",
normalization_method = "none",
batch_labels = NULL,
use_assay = NULL,
use_slot = NULL,
ArchR_matrix = NULL,
atac = FALSE,
input_matrix = NULL,
nn_matrix = NULL,
var_features = NULL,
n_cores = NULL,
random_seed = 1,
verbose = TRUE
)
Arguments
- object
An object of class 'Seurat', 'SingleCellExperiment', or 'ArchRProject'. Not used if values are provided for parameters 'input_matrix' and 'nn_matrix'.
- key
The name under which CHOIR-related data is retrieved from the object. Defaults to 'CHOIR'. Not used if values are provided for parameters 'input_matrix' and 'nn_matrix'.
- cluster1_cells
A character vector of cell names belonging to cluster 1.
- cluster2_cells
A character vector of cell names belonging to cluster 2.
- ident1
A string indicating the label for cluster 1.
- ident2
A string indicating the label for cluster 2.
- group_by
A string indicating the column of cluster labels that 'ident1' and 'ident2' belong to.
- alpha
A numeric value indicating the significance level used for permutation test comparisons of cluster prediction accuracies. Defaults to 0.05.
- feature_set
A string indicating whether to train random forest classifiers on 'all' features or only variable ('var') features. Defaults to 'var'.
- exclude_features
A character vector indicating features that should be excluded from input to the random forest classifier. Default = NULL will not exclude any features.
- n_iterations
A numeric value indicating the number of iterations run for each permutation test comparison. Defaults to 100.
- n_trees
A numeric value indicating the number of trees in each random forest. Defaults to 50.
- use_variance
A boolean value indicating whether to use the variance of the random forest accuracy scores as part of the permutation test threshold. Defaults to TRUE.
- min_accuracy
A numeric value indicating the minimum accuracy required of the random forest classifier, below which clusters will be automatically merged. Defaults to 0.5 (chance).
- min_connections
A numeric value indicating the minimum number of nearest neighbors between two clusters for them to be considered 'adjacent'. Non-adjacent clusters will not be merged. Defaults to 1.
- max_repeat_errors
Used to account for situations in which random forest classifier errors are concentrated among a few cells that are repeatedly misassigned. A numeric value indicating the maximum number of such 'repeat errors' that will be taken into account. If set to 0, 'repeat errors' will not be evaluated. Defaults to 20.
- collect_all_metrics
A boolean value indicating whether to collect and save additional metrics from the random forest classifier comparisons, including feature importances and tree depth. Defaults to FALSE.
- sample_max
A numeric value indicating the maximum number of cells used per cluster to train/test each random forest classifier. Default = Inf does not cap the number of cells used.
- downsampling_rate
A numeric value indicating the proportion of cells used per cluster to train/test each random forest classifier. Default = "auto" sets the downsampling rate according to the dataset size, for efficiency.
- normalization_method
A character string or vector indicating which normalization method to use. In general, input data should be supplied to CHOIR after normalization, except when the user wishes to use Seurat::SCTransform() normalization. Permitted values are 'none' or 'SCTransform'. Defaults to 'none'.
- batch_labels
If applying batch correction, a character string or vector indicating the name of the column containing the batch labels. Defaults to NULL.
- use_assay
For Seurat or SingleCellExperiment objects, a character string or vector indicating the assay(s) to use in the provided object. Default = NULL will choose the current active assay for Seurat objects and the logcounts assay for SingleCellExperiment objects.
- use_slot
For Seurat objects, a character string or vector indicating the layer(s), previously known as slot(s), to use in the provided object. Default = NULL will choose a layer/slot based on the selected assay. If a non-standard assay is provided, do not leave use_slot as NULL.
- ArchR_matrix
For ArchR objects, a character string or vector indicating which matrix (or matrices) to use in the provided object. Default = NULL will use the 'TileMatrix' for ATAC-seq data or the 'GeneExpressionMatrix' for RNA-seq data.
- atac
A boolean value or vector indicating whether the provided data is ATAC-seq data. Defaults to FALSE. For multi-omic datasets containing ATAC-seq data, it is important to supply this parameter as a vector corresponding to each modality in order.
- input_matrix
An optional matrix containing the feature x cell data on which to train the random forest classifiers. Default = NULL will use the feature x cell matrix (or matrices) indicated by function buildTree().
- nn_matrix
An optional matrix containing the nearest neighbor adjacency of the cells. Default = NULL will look for the adjacency matrix (or matrices) generated by function buildTree().
- var_features
An optional character vector of variable features to be used for subsequent clustering steps. Default = NULL will use the variable features identified by function buildTree().
- n_cores
A numeric value indicating the number of cores to use for parallelization. Default = NULL will use the number of available cores minus 2.
- random_seed
A numeric value indicating the random seed to be used.
- verbose
A boolean value indicating whether to use verbose output during the execution of this function. Can be set to FALSE for a cleaner output.
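The two ways of specifying clusters listed above (by label via 'ident1', 'ident2', and 'group_by', or by cell names via 'cluster1_cells' and 'cluster2_cells') might be used roughly as follows. This is a sketch under stated assumptions: the wrapper names are hypothetical, the column name "CHOIR_clusters_0.05" is an assumed example of a cluster-label column and should be replaced with the one present in your object, and the overlap check is an added sanity check rather than documented behavior.

```r
# Hypothetical convenience wrappers around compareClusters().
# The CHOIR package is only required when a wrapper is actually called.

compare_by_label <- function(object, ident1, ident2,
                             group_by = "CHOIR_clusters_0.05") {
  # 'group_by' default is an assumed column name; adjust to your object.
  CHOIR::compareClusters(
    object,
    ident1 = ident1,
    ident2 = ident2,
    group_by = group_by,
    alpha = 0.05,        # significance level (documented default)
    n_iterations = 100,  # iterations per permutation test (documented default)
    n_trees = 50,        # trees per random forest (documented default)
    random_seed = 1
  )
}

compare_by_cells <- function(object, cluster1_cells, cluster2_cells) {
  # Added sanity checks (not part of CHOIR): both inputs must be character
  # vectors of cell names, and no cell may appear in both clusters.
  stopifnot(
    is.character(cluster1_cells),
    is.character(cluster2_cells),
    length(intersect(cluster1_cells, cluster2_cells)) == 0
  )
  CHOIR::compareClusters(
    object,
    cluster1_cells = cluster1_cells,
    cluster2_cells = cluster2_cells
  )
}
```

A call such as `compare_by_label(obj, ident1 = "1", ident2 = "2")` assumes `obj` has already been processed so that the data compareClusters() retrieves under 'key' is available.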
Value
Returns a list containing the following elements:
- comparison_result
A string, either "merge" or "split", indicating the result of the comparison.
- comparison_records
A dataframe including the metrics recorded for the comparison.
- feature_importances
If 'collect_all_metrics' is TRUE, a dataframe containing the feature importance scores for each gene in the comparison.
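A returned list with the elements described above could be inspected along these lines. The helper `summarize_comparison()` is hypothetical, and the mock object only imitates the documented element names; its dataframe columns are placeholders, not the actual columns CHOIR records.

```r
# Hypothetical helper for inspecting a compareClusters() result.
summarize_comparison <- function(res) {
  stopifnot(
    is.list(res),
    isTRUE(res$comparison_result %in% c("merge", "split"))
  )
  list(
    # "split" means the two clusters were found to be distinguishable.
    distinguishable = identical(res$comparison_result, "split"),
    n_records = nrow(res$comparison_records),
    # Only present when collect_all_metrics = TRUE was used.
    has_importances = !is.null(res$feature_importances)
  )
}

# Mock result with the documented element names; column names are placeholders.
mock_result <- list(
  comparison_result = "split",
  comparison_records = data.frame(metric = "placeholder", value = NA),
  feature_importances = NULL
)

summary_out <- summarize_comparison(mock_result)
```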