Build parent clustering tree
This function constructs the parent hierarchical clustering tree, starting from a single cluster encompassing all cells. Subtrees can then be generated from this parent tree in subsequent steps outside of this function.
Usage
buildParentTree(
  object,
  key = "CHOIR",
  distance_approx = TRUE,
  downsampling_rate = "auto",
  normalization_method = "none",
  reduction_method = NULL,
  reduction_params = list(),
  n_var_features = NULL,
  batch_correction_method = "none",
  batch_correction_params = list(),
  batch_labels = NULL,
  neighbor_params = list(),
  cluster_params = list(algorithm = 1, group.singletons = TRUE),
  use_assay = NULL,
  use_slot = NULL,
  ArchR_matrix = NULL,
  ArchR_depthcol = NULL,
  countsplit = FALSE,
  countsplit_suffix = NULL,
  reduction = NULL,
  var_features = NULL,
  atac = FALSE,
  n_cores = NULL,
  random_seed = 1,
  verbose = TRUE
)
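Because every argument other than object has a default in the signature above, a minimal call needs only the input object. The following is a sketch under stated assumptions, not a tested recipe: it assumes the CHOIR package is installed, and both the wrapper run_parent_tree and the object seurat_obj are hypothetical names introduced for illustration.

```r
# Sketch only: `seurat_obj` is a hypothetical, already-normalized Seurat object.
run_parent_tree <- function(object, key = "CHOIR") {
  # Fail early with a clear message if the package is unavailable.
  if (!requireNamespace("CHOIR", quietly = TRUE)) {
    stop("This sketch assumes the CHOIR package is installed.")
  }
  # Every argument other than `object` falls back to its documented default.
  CHOIR::buildParentTree(object, key = key)
}

# Example usage (left commented out because it needs a real object):
# seurat_obj <- run_parent_tree(seurat_obj)
```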
Arguments
- object
An object of class Seurat, SingleCellExperiment, or ArchRProject. For multi-omic data, we recommend using ArchRProject objects.
- key
The name under which CHOIR-related data for this run is stored in the object. Defaults to “CHOIR”.
- distance_approx
A Boolean value indicating whether or not to use approximate distance calculations. Defaults to TRUE, which will use centroid-based distances. Setting distance approximation to FALSE will substantially increase the computational time and memory required, particularly for large datasets. Using approximated distances (TRUE) rather than absolute distances (FALSE) is unlikely to have a meaningful effect on the distance thresholds imposed by CHOIR.
- downsampling_rate
A numerical value indicating the proportion of cells to be sampled per cluster to train/test each random forest classifier. For efficiency, the default value, "auto", sets the downsampling rate according to the dataset size. Decreasing this parameter may decrease the computational time required, but may also make the final cluster calls more conservative. If input is provided to both the downsampling_rate and sample_max parameters, the minimum resulting cell number is calculated and used for each comparison. Note that the downsampling_rate is set in the buildParentTree function so that it can be retrieved in later steps when running CHOIR on atlas-scale data.
- normalization_method
A character string or vector indicating which normalization method to use. In general, input data should be supplied to CHOIR after normalization, except when the user wishes to use Seurat SCTransform normalization. Permitted values are “none” or “SCTransform”. Defaults to “none”. Because CHOIR has not been tested thoroughly with SCTransform normalization, we do not recommend this approach at this time. For multi-omic datasets, provide a vector with a value corresponding to each provided value of use_assay or ArchR_matrix in the same order.
- reduction_method
A character string or vector indicating which dimensionality reduction method to use. Permitted values are “PCA” for principal component analysis, “LSI” for latent semantic indexing, and “IterativeLSI” for iterative latent semantic indexing. These three methods implement the Seurat function RunPCA, the Signac function RunSVD, and the ArchR function addIterativeLSI, respectively. The default value, NULL, will select a method based on the input data type, specifically “IterativeLSI” for ArchR objects, “LSI” for Seurat or SingleCellExperiment objects when parameter atac is TRUE, and “PCA” in all other cases. For multi-omic datasets, provide a vector with a value corresponding to each provided value of use_assay or ArchR_matrix in the same order.
- reduction_params
A list of additional parameters to be passed to the selected dimensionality reduction method. By default, CHOIR will use the default parameter settings of the dimensionality reduction method indicated by the input to parameter reduction_method. Input to this parameter is passed to each downstream dimensionality reduction method and will overwrite or augment those defaults. Altering the performance of the dimensionality reduction in CHOIR will affect downstream clustering results, but not in ways that are easily predictable.
- n_var_features
A numerical value indicating how many variable features to identify. Defaults to 2000 features for most data inputs, or 25000 features for ATAC-seq data. Increasing the number of features may increase the computational time and memory required. If the provided value is substantially higher or lower than the default, instances of underclustering may occur. For multi-omic datasets, provide a vector with a value corresponding to each provided value of use_assay or ArchR_matrix in the same order.
- batch_correction_method
A character string indicating which batch correction method to use. Permitted values are “Harmony” and “none”. Defaults to “none”. Batch correction should only be used when the different batches are not expected to also have unique cell types or cell states. Using batch correction would ensure that clusters do not originate from a single batch, thereby making the final cluster calls more conservative.
- batch_correction_params
A list of additional parameters to be passed to the selected batch correction method for each iteration. Only applicable when batch_correction_method is “Harmony”.
- batch_labels
A character string that, if applying batch correction, specifies the name of the column in the input object metadata containing the batch labels. Defaults to NULL.
- neighbor_params
A list of additional parameters to be passed to the Seurat function FindNeighbors (or, in the case of multi-modal data for Seurat or SingleCellExperiment objects, the Seurat function FindMultiModalNeighbors).
- cluster_params
A list of additional parameters to be passed to the Seurat function FindClusters for clustering at each level of the tree. By default, when the Seurat::FindClusters parameter group.singletons is set to TRUE, CHOIR relabels clusters such that each singleton constitutes its own cluster.
- use_assay
For Seurat or SingleCellExperiment objects, a character string or vector indicating the assay(s) to use in the provided object. The default value, NULL, will choose the current active assay for Seurat objects and the logcounts assay for SingleCellExperiment objects.
- use_slot
For Seurat objects, a character string or vector indicating the layer(s), previously known as slot(s), to use in the provided object. The default value, NULL, will choose a layer/slot based on the selected assay. If an assay other than "RNA", "sketch", "SCT", or "integrated" is provided, you must specify a value for use_slot. For multi-omic datasets, provide a vector with a value corresponding to each provided value of use_assay in the same order.
- ArchR_matrix
For ArchR objects, a character string or vector indicating which matrix or matrices to use in the provided object. The default value, NULL, will use the “GeneScoreMatrix” for ATAC-seq data or the “GeneExpressionMatrix” for RNA-seq data. For multi-omic datasets, provide a vector with a value corresponding to each modality. When "GeneScoreMatrix" is provided, the "GeneScoreMatrix" will be used as input to the random forest classifiers, but the "TileMatrix" will be used for the initial dimensionality reduction(s).
- ArchR_depthcol
For ArchR objects, a character string or vector indicating which column to use for correlation with sequencing depth. The default value, NULL, will use the “nFrags” column for ATAC-seq data or the “Gex_nUMI” column for RNA-seq data. For multi-omic datasets, provide a vector with a value corresponding to each provided value of ArchR_matrix in the same order.
- countsplit
A Boolean value indicating whether or not to use count split input data (see the countsplit package), such that one matrix of counts is used for clustering tree generation and a separate matrix is used for all random forest classifier permutation testing. Defaults to FALSE. Enabling count splitting is likely to result in more conservative final cluster calls and is likely to perform best in datasets with high read depths.
- countsplit_suffix
A character vector indicating the suffixes that distinguish the two count split matrices to be used. Suffixes are appended onto the input string/vector for parameter use_slot for Seurat objects, use_assay for SingleCellExperiment objects, or ArchR_matrix for ArchR objects. When count splitting is enabled, the default value NULL uses suffixes "_1" and "_2".
- reduction
An optional matrix of dimensionality reduction cell embeddings provided by the user for subsequent clustering steps. By default, this parameter is set to NULL, and the dimensionality reduction(s) will be calculated using the method specified by the reduction_method parameter.
- var_features
An optional character vector of names of variable features to be used for subsequent clustering steps. By default, this parameter is set to NULL, and variable features will be calculated as part of running CHOIR. Input to this parameter is required when a dimensionality reduction is supplied to parameter reduction. For multi-omic datasets, concatenate feature names for all modalities.
- atac
A Boolean value or vector indicating whether the provided data is ATAC-seq data. For multi-omic datasets, provide a vector with a value corresponding to each provided value of use_assay or ArchR_matrix in the same order. Defaults to FALSE.
- n_cores
A numerical value indicating the number of cores to use for parallelization. By default, CHOIR will use the number of available cores minus 2. CHOIR is parallelized at the computation of permutation test iterations. Therefore, any number of cores up to the number of iterations will theoretically decrease the computational time required. In practice, 8–16 cores are recommended for datasets up to 500,000 cells.
- random_seed
A numerical value indicating the random seed to be used. Defaults to 1. CHOIR uses randomization throughout the generation and pruning of the clustering tree. Therefore, changing the random seed may yield slight differences in the final cluster assignments.
- verbose
A Boolean value indicating whether to use verbose output during the execution of CHOIR. Defaults to TRUE, but can be set to FALSE for a cleaner output.
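Several arguments above take one value per modality "in the same order", and the count split suffixes are appended to a layer, assay, or matrix name. The sketch below illustrates both conventions in base R only. The assay names "RNA" and "ATAC", the layer names "data" and "counts", and the choice of one table row per modality are assumptions made for illustration; they are not CHOIR requirements.

```r
# One row per modality keeps every per-modality vector the same length and order.
# Assay and layer names here are hypothetical placeholders.
modalities <- data.frame(
  use_assay        = c("RNA", "ATAC"),
  use_slot         = c("data", "data"),
  atac             = c(FALSE, TRUE),
  reduction_method = c("PCA", "LSI"),
  n_var_features   = c(2000, 25000),
  stringsAsFactors = FALSE
)

# Convert the columns into a named list of argument vectors.
multiomic_args <- as.list(modalities)
multiomic_args$normalization_method <- rep("none", nrow(modalities))

# Count split naming: the default suffixes appended to a hypothetical layer name.
countsplit_suffix <- c("_1", "_2")
split_layers <- paste0("counts", countsplit_suffix)

# With CHOIR installed and a real object `obj`, the list could be passed on as:
# obj <- do.call(CHOIR::buildParentTree, c(list(object = obj), multiomic_args))
```

Keeping the per-modality settings in a single table makes it harder to reorder one vector and forget the others.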
Value
Returns the object with the following added data stored under the provided key:
- reduction
Cell embeddings for calculated dimensionality reduction
- var_features
Variable features for calculated dimensionality reduction
- cell_IDs
Cell IDs belonging to parent tree
- graph
Nearest neighbor and shared nearest neighbor adjacency matrices
- clusters
Parent hierarchical cluster tree
- parameters
Record of parameter values used
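As a quick check after a run, the entries listed above can be compared against what is actually stored under the key. The following is a minimal sketch: the helper missing_entries and the mock records are hypothetical, and the commented-out location of the records in a Seurat object (the misc slot) is an assumption that should be confirmed against the package documentation.

```r
# Entry names as listed in the Value section above.
expected_entries <- c("reduction", "var_features", "cell_IDs",
                      "graph", "clusters", "parameters")

# Hypothetical helper: report which documented entries are absent
# from a list of records stored under a key.
missing_entries <- function(records) {
  setdiff(expected_entries, names(records))
}

# Mock records standing in for a real run (values are empty placeholders).
mock_records <- list(reduction = list(), var_features = list(),
                     cell_IDs = list(), graph = list(),
                     clusters = list(), parameters = list())
missing_entries(mock_records)

# With a real Seurat object, assuming the records live in the misc slot:
# missing_entries(seurat_obj@misc[["CHOIR"]])
```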