Compare any two clusters using CHOIR's random forest classifier permutation testing approach
compareClusters.Rd
This function takes two provided clusters and uses a permutation test on random forest classifier prediction accuracies to assess whether they are distinguishable.
Usage
compareClusters(
  object = NULL,
  key = "CHOIR",
  cluster1_cells = NULL,
  cluster2_cells = NULL,
  ident1 = NULL,
  ident2 = NULL,
  group_by = NULL,
  alpha = 0.05,
  feature_set = "var",
  exclude_features = NULL,
  n_iterations = 100,
  n_trees = 50,
  use_variance = TRUE,
  min_accuracy = 0.5,
  min_connections = 1,
  max_repeat_errors = 20,
  collect_all_metrics = FALSE,
  sample_max = Inf,
  downsampling_rate = "auto",
  min_reads = NULL,
  normalization_method = "none",
  batch_labels = NULL,
  use_assay = NULL,
  use_slot = NULL,
  ArchR_matrix = NULL,
  atac = FALSE,
  input_matrix = NULL,
  nn_matrix = NULL,
  var_features = NULL,
  n_cores = NULL,
  random_seed = 1,
  verbose = TRUE
)
Arguments
- object
An object of class Seurat, SingleCellExperiment, or ArchRProject. For multi-omic data, we recommend using ArchRProject objects. Not used if values are provided for parameters input_matrix and nn_matrix.
- key
The name under which CHOIR-related data for this run is stored in the object. Defaults to “CHOIR”. Not used if input is provided for parameters input_matrix and nn_matrix.
- cluster1_cells
A character vector of cell names belonging to cluster 1.
- cluster2_cells
A character vector of cell names belonging to cluster 2.
- ident1
A string indicating the label for cluster 1.
- ident2
A string indicating the label for cluster 2.
- group_by
A string indicating the column of cluster labels that 'ident1' and 'ident2' belong to.
- alpha
A numerical value indicating the significance level used for permutation test comparisons of cluster distinguishability. Defaults to 0.05.
- feature_set
A string indicating whether to train random forest classifiers on “all” features or only variable (“var”) features. Defaults to “var”. Computational time and memory required may increase if more features are used. Using all features instead of variable features may result in more conservative cluster calls.
- exclude_features
A character vector indicating features that should be excluded from input to the random forest classifier. Defaults to NULL, which means that no features will be excluded. This parameter can be used, for example, to exclude features correlated with cell quality, such as mitochondrial genes. Failure to exclude problematic features could result in clusters driven by cell quality, while over-exclusion of features could reduce the ability of CHOIR to distinguish cell populations that differ by those features.
- n_iterations
A numerical value indicating the number of iterations run for each permutation test comparison. Increasing the number of iterations will approximately linearly increase the computational time required but provide a more accurate estimation of the significance of the permutation test. Decreasing the number of iterations runs the risk of leading to underclustering due to lack of statistical power. The default value, 100 iterations, was selected because it avoids underclustering, while minimizing computational time and the diminishing returns from running CHOIR with additional iterations.
- n_trees
A numerical value indicating the number of trees in each random forest. Defaults to 50. Increasing the number of trees is likely to increase the computational time required. Though not entirely predictable, increasing the number of trees up to a point may enable more nuanced distinctions, but is likely to provide diminishing returns.
- use_variance
A Boolean value indicating whether to use the variance of the random forest accuracy scores as part of the permutation test threshold. Defaults to TRUE. Setting this parameter to FALSE will make CHOIR considerably less conservative, identifying more clusters, particularly on large datasets.
- min_accuracy
A numerical value indicating the minimum accuracy required of the random forest classifier, below which clusters will be automatically merged. Defaults to 0.5, representing the random chance probability of assigning correct cluster labels; therefore, decreasing the minimum accuracy is not recommended. Increasing the minimum accuracy will lead to more conservative cluster assignments and will often decrease the computational time required, because fewer cluster comparisons may be needed.
- min_connections
A numerical value indicating the minimum number of nearest neighbors between two clusters for those clusters to be considered adjacent. Non-adjacent clusters will not be merged. Defaults to 1. This threshold allows CHOIR to avoid running the full permutation test comparison for clusters that are highly likely to be distinct, saving computational time. The intent of this parameter is only to avoid running permutation test comparisons between clusters that are so different that they should not be merged. Therefore, we do not recommend increasing this parameter value beyond 10, as higher values may result in instances of overclustering.
- max_repeat_errors
A numerical value indicating the maximum number of repeatedly mislabeled cells that will be taken into account during the permutation tests. This parameter is used to account for situations in which random forest classifier errors are concentrated among a few cells that are repeatedly misassigned. If set to 0, such repeat errors will not be evaluated. Defaults to 20. These situations are relatively infrequent, but setting this parameter to lower values (especially 0) may result in underclustering due to a small number of intermediate cells. Setting this parameter to higher values may lead to instances of overclustering and is not recommended.
- collect_all_metrics
A Boolean value indicating whether to collect and save additional metrics from the random forest classifiers, including feature importances for every comparison. Defaults to FALSE. Setting this parameter to TRUE will slightly increase the computational time required. This parameter has no effect on the final cluster calls.
- sample_max
A numerical value indicating the maximum number of cells to be sampled per cluster to train/test each random forest classifier. Defaults to Inf (infinity), which does not cap the number of cells used, so all cells will be used in all comparisons. Decreasing this parameter may decrease the computational time required, but may result in instances of underclustering. If input is provided to both the downsampling_rate and sample_max parameters, the minimum resulting cell number is calculated and used for each comparison.
- downsampling_rate
A numerical value indicating the proportion of cells to be sampled per cluster to train/test each random forest classifier. For efficiency, the default value, "auto", sets the downsampling rate according to the dataset size. Decreasing this parameter may decrease the computational time required, but may also make the final cluster calls more conservative. If input is provided to both the downsampling_rate and sample_max parameters, the minimum resulting cell number is calculated and used for each comparison.
- min_reads
A numeric value used to filter out features prior to input to the random forest classifier. The default value, NULL, will filter out features with 0 counts for the current clusters being compared. Higher values should be used with caution, but may increase the signal-to-noise ratio encountered by the random forest classifiers.
- normalization_method
A character string or vector indicating which normalization method to use. In general, input data should be supplied to CHOIR after normalization, except when the user wishes to use Seurat SCTransform normalization. Permitted values are “none” or “SCTransform”. Defaults to “none”. Because CHOIR has not been tested thoroughly with SCTransform normalization, we do not recommend this approach at this time. For multi-omic datasets, provide a vector with a value corresponding to each provided value of use_assay or ArchR_matrix in the same order.
- batch_labels
A character string that, if applying batch correction, specifies the name of the column in the input object metadata containing the batch labels. Defaults to NULL.
- use_assay
For Seurat or SingleCellExperiment objects, a character string or vector indicating the assay(s) to use in the provided object. The default value, NULL, will choose the current active assay for Seurat objects and the logcounts assay for SingleCellExperiment objects.
- use_slot
For Seurat objects, a character string or vector indicating the layer(s), previously known as slot(s), to use in the provided object. The default value, NULL, will choose a layer/slot based on the selected assay. If an assay other than "RNA", "sketch", "SCT", or "integrated" is provided, you must specify a value for use_slot. For multi-omic datasets, provide a vector with a value corresponding to each provided value of use_assay in the same order.
- ArchR_matrix
For ArchR objects, a character string or vector indicating which matrix or matrices to use in the provided object. The default value, NULL, will use the “GeneScoreMatrix” for ATAC-seq data or the “GeneExpressionMatrix” for RNA-seq data. For multi-omic datasets, provide a vector with a value corresponding to each modality.
- atac
A Boolean value or vector indicating whether the provided data is ATAC-seq data. For multi-omic datasets, provide a vector with a value corresponding to each provided value of use_assay or ArchR_matrix in the same order. Defaults to FALSE.
- input_matrix
An optional matrix containing the feature x cell data provided by the user, on which to train the random forest classifiers. By default, this parameter is set to NULL, and CHOIR will look for the feature x cell matrix (or matrices) indicated by function buildTree.
- nn_matrix
An optional matrix containing the nearest neighbor adjacency of the cells, provided by the user. By default, this parameter is set to NULL, and CHOIR will look for the adjacency matrix (or matrices) generated by function buildTree.
- var_features
An optional character vector of names of variable features to be used for subsequent clustering steps. By default, this parameter is set to NULL, and variable features previously identified by function buildTree will be used. Input to this parameter is required when a dimensionality reduction is supplied to parameter reduction. For multi-omic datasets, concatenate feature names for all modalities.
- n_cores
A numerical value indicating the number of cores to use for parallelization. By default, CHOIR will use the number of available cores minus 2. CHOIR is parallelized at the computation of permutation test iterations. Therefore, any number of cores up to the number of iterations will theoretically decrease the computational time required. In practice, 8–16 cores are recommended for datasets up to 500,000 cells.
- random_seed
A numerical value indicating the random seed to be used. Defaults to 1. CHOIR uses randomization throughout the generation and pruning of the clustering tree. Therefore, changing the random seed may yield slight differences in the final cluster assignments.
- verbose
A Boolean value indicating whether to use verbose output during the execution of CHOIR. Defaults to TRUE, but can be set to FALSE for a cleaner output.
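The interaction between sample_max and downsampling_rate described above can be illustrated with a small sketch. This is not CHOIR's internal code: the helper cells_per_cluster is hypothetical, rounding down with floor() is an assumption, and the "auto" setting of downsampling_rate is not modeled, so a numeric rate has to be passed explicitly.

```r
# Hypothetical helper (not part of CHOIR): shows how the two caps could combine
# when both sample_max and a numeric downsampling_rate are supplied.
cells_per_cluster <- function(cluster_size, sample_max = Inf, downsampling_rate = 1) {
  stopifnot(cluster_size >= 0, downsampling_rate > 0, downsampling_rate <= 1)
  # Cell number implied by the proportional rate (rounding down is an assumption)
  n_by_rate <- floor(cluster_size * downsampling_rate)
  # Use the smaller of the two resulting cell numbers
  min(cluster_size, sample_max, n_by_rate)
}

cells_per_cluster(1000)                                            # 1000
cells_per_cluster(1000, sample_max = 300)                          # 300
cells_per_cluster(1000, downsampling_rate = 0.5)                   # 500
cells_per_cluster(1000, sample_max = 300, downsampling_rate = 0.2) # 200
```

In the last call the rate gives 200 cells and sample_max gives 300, so the smaller number, 200, would be used for that comparison.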
Value
Returns a list containing the following elements:
- comparison_result
A string, either "merge" or "split", indicating the result of the comparison.
- comparison_records
A data frame containing the metrics recorded for the comparison.
- feature_importances
If 'collect_all_metrics' is TRUE, a data frame containing the feature importance scores for each gene in the comparison.
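As a rough illustration of how a call and its return value might be handled, the sketch below builds an argument list from the parameters in the Usage section and inspects the documented list elements. Several things here are assumptions: the object name obj, the cluster labels "1" and "2", the metadata column name "my_cluster_column", and the helper summarize_comparison are all hypothetical, and reading "merge" as "not distinguishable" is an interpretation of the description rather than a statement from the package. The call is guarded so the block runs even when CHOIR or a prepared object is not available.

```r
# Argument names come from the Usage section; the values are illustrative only.
comparison_args <- list(
  key = "CHOIR",
  ident1 = "1",                    # hypothetical cluster label
  ident2 = "2",                    # hypothetical cluster label
  group_by = "my_cluster_column",  # hypothetical metadata column of cluster labels
  alpha = 0.05,
  n_iterations = 100,
  collect_all_metrics = TRUE,      # also return feature importances
  random_seed = 1
)

# Hypothetical helper: turns the documented comparison_result string into a message.
summarize_comparison <- function(result) {
  stopifnot(result$comparison_result %in% c("merge", "split"))
  if (result$comparison_result == "split") {
    "clusters are distinguishable"
  } else {
    "clusters are not distinguishable"
  }
}

# Only runs when CHOIR is installed and an object named `obj` (assumed to have
# been processed by buildTree already) exists in the session.
if (requireNamespace("CHOIR", quietly = TRUE) && exists("obj")) {
  result <- do.call(CHOIR::compareClusters, c(list(object = obj), comparison_args))
  message(summarize_comparison(result))
  print(head(result$comparison_records))
  print(head(result$feature_importances))
}
```

Alternatively, cluster1_cells and cluster2_cells can be supplied as character vectors of cell names in place of ident1, ident2, and group_by.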