
Takes two provided clusters and assesses whether they are distinguishable, using a permutation test on random forest classifier prediction accuracies.

Usage

compareClusters(
  object = NULL,
  key = "CHOIR",
  cluster1_cells = NULL,
  cluster2_cells = NULL,
  ident1 = NULL,
  ident2 = NULL,
  group_by = NULL,
  alpha = 0.05,
  feature_set = "var",
  exclude_features = NULL,
  n_iterations = 100,
  n_trees = 50,
  use_variance = TRUE,
  min_accuracy = 0.5,
  min_connections = 1,
  max_repeat_errors = 20,
  collect_all_metrics = FALSE,
  sample_max = Inf,
  downsampling_rate = "auto",
  normalization_method = "none",
  batch_labels = NULL,
  use_assay = NULL,
  use_slot = NULL,
  ArchR_matrix = NULL,
  atac = FALSE,
  input_matrix = NULL,
  nn_matrix = NULL,
  var_features = NULL,
  n_cores = NULL,
  random_seed = 1,
  verbose = TRUE
)
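A minimal sketch of a label-based call follows. It assumes a hypothetical object named `seurat_obj` that has already been processed by CHOIR, and cluster labels "1" and "2" stored in a hypothetical metadata column named `CHOIR_clusters_0.05`; none of these names come from this page. The call is guarded so that it only runs when the CHOIR package is installed and the object exists.

```r
# Arguments for comparing two clusters by label. The labels and the
# column name below are illustrative assumptions, not values from this page.
compare_args <- list(
  ident1 = "1",
  ident2 = "2",
  group_by = "CHOIR_clusters_0.05",
  alpha = 0.05,
  n_iterations = 100,
  n_trees = 50
)

# Run only when CHOIR is installed and a processed object is available.
if (requireNamespace("CHOIR", quietly = TRUE) && exists("seurat_obj")) {
  result <- do.call(
    CHOIR::compareClusters,
    c(list(object = seurat_obj), compare_args)
  )
  print(result$comparison_result)  # "merge" or "split"
}
```

Keeping the arguments in a list makes it easy to rerun the same comparison with a different `alpha` or `n_iterations`.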

Arguments

object

An object of class 'Seurat', 'SingleCellExperiment', or 'ArchRProject'. Not used if values are provided for parameters 'input_matrix' and 'nn_matrix'.

key

The name under which CHOIR-related data is retrieved from the object. Defaults to 'CHOIR'. Not used if values are provided for parameters 'input_matrix' and 'nn_matrix'.

cluster1_cells

A character vector of cell names belonging to cluster 1.

cluster2_cells

A character vector of cell names belonging to cluster 2.

ident1

A string indicating the label for cluster 1.

ident2

A string indicating the label for cluster 2.

group_by

A string indicating the column of cluster labels that 'ident1' and 'ident2' belong to.
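The two clusters can apparently be identified either by explicit cell names or by labels within a metadata column; this page implies, but does not state, that the two forms are alternatives. The sketch below uses a toy metadata table (the column name `cluster` and the cell names are made up) to show how the two specifications select the same cells.

```r
# Toy metadata table: one row per cell, row names are cell names.
meta <- data.frame(
  cluster = c("A", "A", "B", "B", "C"),
  row.names = paste0("cell_", 1:5),
  stringsAsFactors = FALSE
)

# Option 1: explicit cell-name vectors (cluster1_cells / cluster2_cells).
cluster1_cells <- rownames(meta)[meta$cluster == "A"]
cluster2_cells <- rownames(meta)[meta$cluster == "B"]

# Option 2: labels plus the column they belong to (ident1 / ident2 / group_by).
ident1 <- "A"
ident2 <- "B"
group_by <- "cluster"

# Both options pick out the same cells in this toy example.
same_1 <- identical(cluster1_cells, rownames(meta)[meta[[group_by]] == ident1])
same_2 <- identical(cluster2_cells, rownames(meta)[meta[[group_by]] == ident2])
```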

alpha

A numeric value indicating the significance level used for permutation test comparisons of cluster prediction accuracies. Defaults to 0.05.

feature_set

A string indicating whether to train random forest classifiers on 'all' features or only variable ('var') features. Defaults to 'var'.

exclude_features

A character vector indicating features that should be excluded from input to the random forest classifier. Default = NULL will not exclude any features.

n_iterations

A numeric value indicating the number of iterations run for each permutation test comparison. Defaults to 100.

n_trees

A numeric value indicating the number of trees in each random forest. Defaults to 50.

use_variance

A boolean value indicating whether to use the variance of the random forest accuracy scores as part of the permutation test threshold. Defaults to TRUE.

min_accuracy

A numeric value indicating the minimum accuracy required of the random forest classifier, below which clusters will be automatically merged. Defaults to 0.5 (chance).
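To illustrate how 'alpha', 'n_iterations', and 'min_accuracy' could interact, here is a generic permutation-test sketch in base R. It is not CHOIR's internal procedure: the accuracy vectors are toy values standing in for random forest results, `decide()` is a hypothetical helper, and the one-sided empirical p-value is only one common way to set up such a test. The 'use_variance' option is not modeled.

```r
# Toy accuracy scores; real scores would come from random forest classifiers
# trained on true labels (true_acc) and on permuted labels (perm_acc).
n_iterations <- 100
true_acc <- rep(c(0.90, 0.92, 0.94, 0.88), length.out = n_iterations)
perm_acc <- rep(c(0.48, 0.50, 0.52, 0.51), length.out = n_iterations)

# Hypothetical decision rule, for illustration only.
decide <- function(true_acc, perm_acc, alpha = 0.05, min_accuracy = 0.5) {
  mean_acc <- mean(true_acc)
  # Below the minimum accuracy, the clusters are merged without testing.
  if (mean_acc < min_accuracy) {
    return("merge")
  }
  # One-sided empirical p-value with the usual +1 correction.
  p_value <- (1 + sum(perm_acc >= mean_acc)) / (1 + length(perm_acc))
  if (p_value < alpha) "split" else "merge"
}

decision <- decide(true_acc, perm_acc)
```

With more iterations the smallest attainable p-value shrinks, which is why the number of iterations limits how strict 'alpha' can usefully be in a test of this kind.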

min_connections

A numeric value indicating the minimum number of nearest neighbors between two clusters for them to be considered 'adjacent'. Non-adjacent clusters will not be merged. Defaults to 1.
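The adjacency check can be pictured with a toy nearest neighbor matrix. The sketch below assumes that a "connection" is any nonzero entry linking a cell in one cluster to a cell in the other, counted in both directions; that counting rule, the helper `count_connections()`, and the toy data are assumptions for illustration, not CHOIR's documented behavior.

```r
cells <- paste0("cell_", 1:6)
nn_matrix <- matrix(0, nrow = 6, ncol = 6, dimnames = list(cells, cells))

# Row i has a 1 in column j when cell j is a nearest neighbor of cell i.
nn_matrix["cell_1", "cell_2"] <- 1
nn_matrix["cell_2", "cell_1"] <- 1
nn_matrix["cell_3", "cell_4"] <- 1  # cluster 1 -> cluster 2
nn_matrix["cell_4", "cell_5"] <- 1
nn_matrix["cell_5", "cell_3"] <- 1  # cluster 2 -> cluster 1
nn_matrix["cell_6", "cell_5"] <- 1

cluster1_cells <- c("cell_1", "cell_2", "cell_3")
cluster2_cells <- c("cell_4", "cell_5", "cell_6")

# Hypothetical helper: count cross-cluster neighbor links in both directions.
count_connections <- function(nn, a, b) {
  sum(nn[a, b] != 0) + sum(nn[b, a] != 0)
}

n_connections <- count_connections(nn_matrix, cluster1_cells, cluster2_cells)
min_connections <- 1
adjacent <- n_connections >= min_connections
```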

max_repeat_errors

Used to account for situations in which random forest classifier errors are concentrated among a few cells that are repeatedly misassigned. A numeric value indicating the maximum number of such 'repeat errors' that will be taken into account. If set to 0, 'repeat errors' will not be evaluated. Defaults to 20.

collect_all_metrics

A boolean value indicating whether to collect and save additional metrics from the random forest classifier comparisons, including feature importances and tree depth. Defaults to FALSE.

sample_max

A numeric value indicating the maximum number of cells used per cluster to train/test each random forest classifier. Default = Inf does not cap the number of cells used.

downsampling_rate

A numeric value indicating the proportion of cells used per cluster to train/test each random forest classifier. Default = "auto" sets the downsampling rate according to the dataset size, for efficiency.

normalization_method

A character string or vector indicating which normalization method to use. In general, input data should be supplied to CHOIR after normalization, except in cases when the user wishes to use Seurat::SCTransform() normalization. Permitted values are 'none' or 'SCTransform'. Defaults to 'none'.

batch_labels

If applying batch correction, a character string or vector indicating the name of the column containing the batch labels. Defaults to NULL.

use_assay

For Seurat or SingleCellExperiment objects, a character string or vector indicating the assay(s) to use in the provided object. Default = NULL will choose the current active assay for Seurat objects and the logcounts assay for SingleCellExperiment objects.

use_slot

For Seurat objects, a character string or vector indicating the layer(s), previously known as slot(s), to use in the provided object. Default = NULL will choose a layer/slot based on the selected assay. If a non-standard assay is provided, do not leave 'use_slot' as NULL.

ArchR_matrix

For ArchR objects, a character string or vector indicating which matri(ces) to use in the provided object. Default = NULL will use the 'TileMatrix' for ATAC-seq data or the 'GeneExpressionMatrix' for RNA-seq data.

atac

A boolean value or vector indicating whether the provided data is ATAC-seq data. Defaults to FALSE. For multi-omic datasets containing ATAC-seq data, it is important to supply this parameter as a vector corresponding to each modality in order.

input_matrix

An optional matrix containing the feature x cell data on which to train the random forest classifiers. Default = NULL will use the feature x cell matri(ces) indicated by function buildTree().

nn_matrix

An optional matrix containing the nearest neighbor adjacency of the cells. Default = NULL will look for the adjacency matri(ces) generated by function buildTree().

var_features

An optional character vector of variable features to be used for subsequent clustering steps. Default = NULL will use the variable features identified by function buildTree().

n_cores

A numeric value indicating the number of cores to use for parallelization. Default = NULL will use the number of available cores minus 2.
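The default described above can be reproduced with the base `parallel` package, for example to check in advance how many cores would be used. The floor of one core and the fallback when the core count cannot be detected are assumptions added here so that the sketch also works on small machines; this page does not say how CHOIR handles those cases.

```r
library(parallel)

# detectCores() can return NA on some platforms; fall back to a single core.
available <- detectCores()
if (is.na(available)) {
  available <- 1L
}

# Available cores minus 2, but never fewer than 1 (assumed floor).
n_cores <- max(1L, available - 2L)
```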

random_seed

A numeric value indicating the random seed to be used. Defaults to 1.

verbose

A boolean value indicating whether to use verbose output during the execution of this function. Defaults to TRUE. Can be set to FALSE for cleaner output.

Value

Returns a list containing the following elements:

comparison_result

A string, either "merge" or "split", indicating the result of the comparison.

comparison_records

A dataframe including the metrics recorded for the comparison.

feature_importances

If 'collect_all_metrics' is TRUE, a dataframe containing the feature importance scores for each gene in the comparison.
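The returned list can be inspected as in the sketch below, which uses a mock object rather than real output. Only the three element names come from this page; the data frame columns (`metric_a`, `metric_b`) and the helper `describe_comparison()` are placeholders, and whether 'feature_importances' is NULL or simply absent when 'collect_all_metrics' is FALSE is not stated here, so the check covers both cases.

```r
# Mock return value with the documented element names. The column names
# inside the data frame are placeholders, not CHOIR's actual columns.
mock_result <- list(
  comparison_result = "split",
  comparison_records = data.frame(metric_a = 0.91, metric_b = 0.01),
  feature_importances = NULL
)

# Hypothetical helper: one-line summary of a comparison result.
describe_comparison <- function(result) {
  stopifnot(result$comparison_result %in% c("merge", "split"))
  # is.null() is TRUE both for a NULL element and for a missing one.
  has_importances <- !is.null(result$feature_importances)
  sprintf(
    "Decision: %s; %d record row(s); feature importances %s",
    result$comparison_result,
    nrow(result$comparison_records),
    if (has_importances) "available" else "not collected"
  )
}

summary_line <- describe_comparison(mock_result)
```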