tools: tl

batch correction

combat

Batch effect correction using ComBat [Johnson07].

harmony

Data integration and batch correction using Harmony.

mnn

Data integration and batch correction using mutual nearest neighbors [Haghverdi19].

scanorama

Batch correction using Scanorama [Hie19].

scab.tools.batch_correction.combat(adata: anndata.AnnData, batch_key: str = 'batch', covariates: typing.Iterable | None = None, post_correction_umap: bool = True, verbose: bool = True) → anndata.AnnData

Batch effect correction using ComBat [Johnson07].

See also

W. Evan Johnson, Cheng Li, Ariel Rabinovic
Adjusting batch effects in microarray expression data using empirical Bayes methods
Biostatistics 2007, doi: 10.1093/biostatistics/kxj037
Parameters:
  • adata (anndata.AnnData) – AnnData object containing gene counts data.

  • batch_key (str, default='batch') – Name of the column in adata.obs that corresponds to the batch.

  • covariates (iterable object, optional) – List of additional covariates besides the batch variable such as adjustment variables or biological condition. Not including covariates may lead to the removal of real biological signal.

  • post_correction_umap (bool, default=True) – If True, UMAP will be computed on the post-integration data using scab.tl.umap().

  • verbose (bool, default=True) – If True, print progress.

Returns:

adata

Return type:

anndata.AnnData

scab.tools.batch_correction.harmony(adata: anndata.AnnData, batch_key: str = 'batch', adjusted_basis: str = 'X_pca_harmony', n_dim: int = 50, force_pca: bool = False, post_correction_umap: bool = True, verbose: bool = True) → anndata.AnnData

Data integration and batch correction using Harmony.

See also

Ilya Korsunsky, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-ru Loh & Soumya Raychaudhuri
Fast, sensitive and accurate integration of single-cell data with Harmony
Nature Methods 2019, doi: 10.1038/s41592-019-0619-0
Parameters:
  • adata (anndata.AnnData) – AnnData object containing gene counts data.

  • batch_key (str, default='batch') – Name of the column in adata.obs that corresponds to the batch.

  • adjusted_basis (str, default='X_pca_harmony') – Name of the basis in adata.obsm that will be added by harmony.

  • n_dim (int, default=50) – Number of dimensions to use for PCA.

  • force_pca (bool, default=False) – If True, PCA will be run even if adata.obsm['X_pca'] already exists.

  • post_correction_umap (bool, default=True) – If True, UMAP will be computed on the batch corrected data using scab.tl.umap().

  • verbose (bool, default=True) – If True, print progress.

Returns:

adata

Return type:

anndata.AnnData

scab.tools.batch_correction.mnn(adata: anndata.AnnData, batch_key: str = 'batch', min_hvg_batches: int = 1, post_correction_umap: bool = True, verbose: bool = True) → anndata.AnnData

Data integration and batch correction using mutual nearest neighbors [Haghverdi19]. Uses the scanpy.external.pp.mnn_correct() function.

See also

Laleh Haghverdi, Aaron T L Lun, Michael D Morgan & John C Marioni
Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors
Nature Biotechnology 2018, doi: 10.1038/nbt.4091
Parameters:
  • adata (anndata.AnnData) – AnnData object containing gene counts data.

  • batch_key (str, default='batch') – Name of the column in adata.obs that corresponds to the batch.

  • min_hvg_batches (int, default=1) – Minimum number of batches in which highly variable genes are found in order to be included in the list of genes used for batch correction. Default is 1, which results in the use of all HVGs found in any batch.

  • post_correction_umap (bool, default=True) – If True, UMAP will be computed on the batch corrected data using scab.tl.umap().

  • verbose (bool, default=True) – If True, print progress.

Returns:

adata

Return type:

anndata.AnnData
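
As a rough illustration of how min_hvg_batches narrows the gene list, the sketch below counts the number of batches in which each gene was flagged as highly variable and keeps the genes that meet the minimum. The helper name select_hvgs and the toy gene sets are hypothetical, and this is not scab's actual implementation.

```python
from collections import Counter


def select_hvgs(hvgs_by_batch, min_hvg_batches=1):
    """Return sorted genes flagged as HVGs in at least `min_hvg_batches` batches.

    `hvgs_by_batch` is a hypothetical dict mapping batch name -> set of HVG names.
    """
    # Count, per gene, the number of batches in which it is highly variable.
    counts = Counter(gene for genes in hvgs_by_batch.values() for gene in set(genes))
    return sorted(gene for gene, n in counts.items() if n >= min_hvg_batches)


hvgs_by_batch = {
    "batch1": {"CD19", "MS4A1", "IGHM"},
    "batch2": {"CD19", "CD3E"},
    "batch3": {"CD19", "MS4A1"},
}

# min_hvg_batches=1 keeps the union of HVGs across batches.
union_hvgs = select_hvgs(hvgs_by_batch, min_hvg_batches=1)
# min_hvg_batches=2 keeps only genes that are HVGs in two or more batches.
shared_hvgs = select_hvgs(hvgs_by_batch, min_hvg_batches=2)
print(union_hvgs)
print(shared_hvgs)
```

Raising min_hvg_batches trades a broader gene list for one that is less likely to be driven by a single batch.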

scab.tools.batch_correction.scanorama(adata: anndata.AnnData, batch_key: str = 'batch', scanorama_key: str = 'X_Scanorama', n_dim: int = 50, post_correction_umap: bool = True, verbose: bool = True) → anndata.AnnData

Batch correction using Scanorama [Hie19].

See also

Brian Hie, Bryan Bryson, and Bonnie Berger
Efficient integration of heterogeneous single-cell transcriptomes using Scanorama
Nature Biotechnology 2019, doi: 10.1038/s41587-019-0113-3
Parameters:
  • adata (anndata.AnnData) – AnnData object containing gene counts data.

  • batch_key (str, default='batch') – Name of the column in adata.obs that corresponds to the batch.

  • post_correction_umap (bool, default=True) – If True, UMAP will be computed on the batch corrected data using scab.tl.umap().

  • verbose (bool, default=True) – If True, print progress.

Returns:

adata

Return type:

anndata.AnnData

cellhashes

demultiplex

Demultiplexes cells using cell hashes.

scab.tools.cellhashes.demultiplex(adata: anndata.AnnData, hash_names: typing.Iterable | None = None, cellhash_regex: str = 'cell ?hash', ignore_cellhash_case: bool = True, rename: dict | None = None, assignment_key: str = 'cellhash_assignment', threshold_minimum: float = 4.0, threshold_maximum: float = 10.0, kde_minimum: float = 0.0, kde_maximum: float = 15.0, assignments_only: bool = False, debug: bool = False) → anndata.AnnData | pandas.Series

Demultiplexes cells using cell hashes.

Parameters:
  • adata (anndata.AnnData) – AnnData object containing cellhash UMI counts in adata.obs.

  • hash_names (iterable object, optional) – List of hashnames, which correspond to column names in adata.obs. Overrides cellhash name matching using cellhash_regex. If not provided, all columns in adata.obs that match cellhash_regex will be assumed to be hashnames and processed.

  • cellhash_regex (str, default='cell ?hash') – A regular expression (regex) string used to identify cell hashes. The regex must be found in all cellhash names. The default is 'cell ?hash', which, combined with the default setting for ignore_cellhash_case, will match 'cellhash' or 'cell hash' anywhere in the cell hash name and in any combination of upper or lower case letters.

  • ignore_cellhash_case (bool, default=True) – If True, matching to cellhash_regex will ignore case.

  • rename (dict, optional) –

    A dict linking cell hash names (column names in adata.obs) to the preferred batch name. For example, if the cell hash name 'Cellhash1' corresponded to the sample 'Sample1', an example rename argument would be:

    {'Cellhash1': 'Sample1'}
    

    This would result in all cells classified as positive for 'Cellhash1' being labeled as 'Sample1' in the resulting assignment column (adata.obs.cellhash_assignment by default, adjustable using assignment_key).

  • assignment_key (str, default='cellhash_assignment') – Column name (in adata.obs) into which cellhash assignments will be stored.

  • threshold_minimum (float, default=4.0) – Minimum acceptable log2-normalized UMI count threshold. Potential thresholds below this cutoff value will be ignored.

  • threshold_maximum (float, default=10.0) – Maximum acceptable log2-normalized UMI count threshold. Potential thresholds above this cutoff value will be ignored.

  • kde_minimum (float, default=0.0) – Lower limit of the KDE plot (in log2-normalized UMI counts).

  • kde_maximum (float, default=15.0) – Upper limit of the KDE plot (in log2-normalized UMI counts). This should be greater than threshold_maximum, or you may obtain strange results.

  • assignments_only (bool, default=False) – If True, return a pandas Series object containing only the group assignment. Suitable for appending to an existing dataframe. If False, an updated adata object is returned, containing cell hash group assignments in the assignment_key column of adata.obs.

  • debug (bool, default=False) – If True, saves cell hash KDE plots and prints intermediate information for debugging.

Returns:

output – By default, an updated adata is returned with cell hash assignment groups stored in the assignment_key column of adata.obs. If assignments_only is True, a pandas.Series of cell hash assignments is returned.

Return type:

anndata.AnnData or pandas.Series
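
The name matching and renaming described above can be sketched with the standard library alone. The helper find_hash_names and the column names are hypothetical, and passing unrenamed hash names through unchanged is an assumption rather than documented scab behavior.

```python
import re


def find_hash_names(columns, cellhash_regex="cell ?hash", ignore_case=True):
    """Return the column names in which `cellhash_regex` is found anywhere."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(cellhash_regex, flags)
    # re.search (not re.match) finds the pattern at any position in the name.
    return [col for col in columns if pattern.search(col)]


# Hypothetical adata.obs column names.
columns = ["CellHash1", "cell hash 2", "my_CELLHASH_3", "n_genes", "hash4"]

hash_names = find_hash_names(columns)                   # case-insensitive (default)
strict = find_hash_names(columns, ignore_case=False)    # case-sensitive

# Map hash names to preferred sample names; names missing from `rename`
# are left as-is in this sketch.
rename = {"CellHash1": "Sample1"}
assignments = ["CellHash1", "cell hash 2", "CellHash1"]
renamed = [rename.get(name, name) for name in assignments]

print(hash_names)
print(strict)
print(renamed)
```

If the regex picks up unrelated columns, passing hash_names explicitly sidesteps pattern matching altogether.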

clonality

clonify

Assigns BCR sequences to clonal lineages using the clonify [Briney16] algorithm.

pairwise_distance

Computes length and mutation adjusted Levenshtein distance for a pair of sequences.

scab.tools.clonify.clonify(adata: anndata.AnnData, distance_cutoff: float = 0.32, shared_mutation_bonus: float = 0.65, length_penalty_multiplier: int | float = 2, group_by_v: bool = True, group_by_j: bool = True, group_light_by_v: bool = True, group_light_by_j: bool = True, preclustering: bool = False, preclustering_threshold: float = 0.65, preclustering_field: str = 'cdr3_nt', lineage_field: str = 'lineage', lineage_size_field: str = 'lineage_size', annotation_format: str = 'airr', return_assignment_dict: bool = False, pairs_only: bool = True, use_multiple_heavy_chains: bool = True) → dict | anndata.AnnData

Assigns BCR sequences to clonal lineages using the clonify [Briney16] algorithm.

See also

Bryan Briney, Khoa Le, Jiang Zhu, and Dennis R Burton
Clonify: unseeded antibody lineage assignment from next-generation sequencing data.
Scientific Reports 2016. https://doi.org/10.1038/srep23901
Parameters:
  • adata (anndata.AnnData) – AnnData object containing annotated sequence data at adata.obs.bcr. If data was read using scab.read_10x_mtx(), BCR data should already be in the correct location.

  • distance_cutoff (float, default=0.32) – Distance threshold for lineage clustering.

  • shared_mutation_bonus (float, default=0.65) – Bonus applied for each shared V-gene mutation.

  • length_penalty_multiplier (int, default=2) – Multiplier for the CDR3 length penalty. Default is 2, resulting in CDR3s that differ by n amino acids being penalized n * 2.

  • group_by_v (bool, default=True) – If True, sequences are grouped by V-gene use prior to lineage assignment. This option is additive with group_by_j. For example, if group_by_v == True and group_by_j == True, sequences will be grouped by both V-gene and J-gene.

  • group_by_j (bool, default=True) – If True, sequences are grouped by J-gene use prior to lineage assignment. This option is additive with group_by_v. For example, if group_by_v == True and group_by_j == True, sequences will be grouped by both V-gene and J-gene.

  • group_light_by_v (bool, default=True) – If True, heavy chain sequences are grouped by their paired light chain V-gene prior to lineage assignment. The purpose is to ensure that light chains are coherent across an assigned lineage. Also, if multiple light chains are present, this option makes it easier to identify the “best” light chain by identifying the light chain that best fits the largest lineage. This option is additive with group_light_by_j.

  • group_light_by_j (bool, default=True) – If True, heavy chain sequences are grouped by their paired light chain J-gene prior to lineage assignment. The purpose is to ensure that light chains are coherent across an assigned lineage. Also, if multiple light chains are present, this option makes it easier to identify the “best” light chain by identifying the light chain that best fits the largest lineage. This option is additive with group_light_by_v.

  • preclustering (bool, default=False) – If True, V/J groups are pre-clustered on the preclustering_field sequence, which can potentially speed up lineage assignment and reduce memory usage. If False, each V/J group is processed in its entirety without pre-clustering.

  • preclustering_threshold (float, default=0.65) – Identity threshold for pre-clustering the V/J groups prior to lineage assignment.

  • preclustering_field (str, default='cdr3_nt') – Annotation field on which to pre-cluster sequences.

  • lineage_field (str, default='lineage') – Name of the lineage assignment field.

  • lineage_size_field (str, default='lineage_size') – Name of the lineage size field.

  • annotation_format (str, default='airr') – Format of the input sequence annotations. Choices are 'airr' or 'json'.

  • return_assignment_dict (bool, default=False) – If True, a dictionary linking sequence IDs to lineage names will be returned. If False, the input anndata.AnnData object will be returned, with lineage annotations included.

Returns:

output – By default (return_assignment_dict == False), an updated adata object is returned with two additional columns populated - adata.obs.bcr_lineage, which contains the lineage assignment, and adata.obs.bcr_lineage_size, which contains the lineage size. If return_assignment_dict == True, a dict mapping droplet barcodes (adata.obs_names) to lineage names is returned.

Return type:

anndata.AnnData or dict
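
Because the four grouping options are additive, each enabled flag adds one more field to the key that partitions sequences before lineage assignment. The sketch below illustrates that idea only; the record layout (id, v_gene, j_gene, light_v, light_j) and the helper group_sequences are hypothetical and do not reflect scab's internal data structures.

```python
from collections import defaultdict


def group_sequences(seqs, group_by_v=True, group_by_j=True,
                    group_light_by_v=True, group_light_by_j=True):
    """Partition sequence records by the gene fields selected via the flags."""
    fields = []
    if group_by_v:
        fields.append("v_gene")
    if group_by_j:
        fields.append("j_gene")
    if group_light_by_v:
        fields.append("light_v")
    if group_light_by_j:
        fields.append("light_j")

    groups = defaultdict(list)
    for seq in seqs:
        # The key is the tuple of all enabled fields, so the flags are additive.
        key = tuple(seq[f] for f in fields)
        groups[key].append(seq["id"])
    return dict(groups)


seqs = [
    {"id": "a", "v_gene": "IGHV1-2", "j_gene": "IGHJ4", "light_v": "IGKV3-20", "light_j": "IGKJ1"},
    {"id": "b", "v_gene": "IGHV1-2", "j_gene": "IGHJ4", "light_v": "IGKV3-20", "light_j": "IGKJ1"},
    {"id": "c", "v_gene": "IGHV1-2", "j_gene": "IGHJ6", "light_v": "IGKV3-20", "light_j": "IGKJ1"},
    {"id": "d", "v_gene": "IGHV1-2", "j_gene": "IGHJ4", "light_v": "IGLV2-14", "light_j": "IGLJ2"},
]

all_flags = group_sequences(seqs)
heavy_only = group_sequences(seqs, group_light_by_v=False, group_light_by_j=False)
print(len(all_flags), len(heavy_only))
```

With all flags enabled, sequence d is separated from a and b because its paired light chain differs, which is how light chain coherence is enforced within a lineage.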

scab.tools.clonify.pairwise_distance(s1: abutils.Sequence, s2: abutils.Sequence, shared_mutation_bonus: float = 0.65, length_penalty_multiplier: int | float = 2, cdr3_field: str = 'cdr3', mutations_field: str = 'mutations') → float

Computes length and mutation adjusted Levenshtein distance for a pair of sequences.

Parameters:
  • s1 (abutils.Sequence) – input sequence

  • s2 (abutils.Sequence) – input sequence

  • shared_mutation_bonus (float, optional) – The bonus for each shared mutation, by default 0.65

  • length_penalty_multiplier (Union[int, float], optional) – Used to compute the penalty for differences in CDR3 length. The length difference is multiplied by length_penalty_multiplier, by default 2

  • cdr3_field (str, optional) – Name of the field in s1 and s2 containing the CDR3 sequence, by default “cdr3”

  • mutations_field (str, optional) – Name of the field in s1 and s2 containing mutation information, by default “mutations”

Returns:

distance

Return type:

float
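
To make the roles of shared_mutation_bonus and length_penalty_multiplier concrete, here is a minimal sketch of one plausible length- and mutation-adjusted Levenshtein distance. The exact formula used by scab is not given above, so the normalization by the shorter CDR3 length and the floor at zero are assumptions, and adjusted_distance is a hypothetical name that takes plain strings and mutation lists rather than abutils.Sequence objects.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def adjusted_distance(cdr3_1, cdr3_2, muts_1, muts_2,
                      shared_mutation_bonus=0.65, length_penalty_multiplier=2):
    """Edit distance plus a length penalty, minus a bonus per shared mutation.

    Assumes both CDR3 strings are non-empty. Normalizing by the shorter CDR3
    and flooring at zero are illustrative choices, not scab's confirmed formula.
    """
    length_penalty = abs(len(cdr3_1) - len(cdr3_2)) * length_penalty_multiplier
    bonus = len(set(muts_1) & set(muts_2)) * shared_mutation_bonus
    raw = length_penalty + levenshtein(cdr3_1, cdr3_2) - bonus
    return max(raw, 0.0) / min(len(cdr3_1), len(cdr3_2))


no_shared = adjusted_distance("ARDGYW", "ARDYW", [], [])
two_shared = adjusted_distance("ARDGYW", "ARDYW", ["A12G", "C45T"], ["A12G", "C45T", "G7A"])
print(no_shared, two_shared)
```

In this toy case the one-residue length difference contributes a penalty of 2 on top of an edit distance of 1, while two shared mutations pull the score back down.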

embeddings

pca

Performs principal component analysis (PCA).

umap

Performs PCA, neighborhood graph construction and UMAP embedding.

dimensionality_reduction

Deprecated, but retained for backwards compatibility.

scab.tools.embeddings.pca(adata: anndata.AnnData, solver: str = 'arpack', n_pcs: int = 50, ignore_ig: bool = True, verbose: bool = True) → anndata.AnnData

Performs principal component analysis (PCA).

Parameters:
  • adata (anndata.AnnData) – AnnData object containing gene counts data.

  • solver (str, default='arpack') – Solver to use for the PCA.

  • n_pcs (int, default=50) – Number of principal components to use when computing the neighbor graph. Although the default value is generally appropriate, it is sometimes useful to empirically determine the optimal value for n_pcs.

  • ignore_ig (bool, default=True) – Ignores immunoglobulin V, D and J genes when computing the PCA.

Returns:

adata

Return type:

anndata.AnnData
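
The effect of ignore_ig can be pictured as masking immunoglobulin V, D and J gene segments before the PCA is computed. The sketch below uses a hypothetical regular expression for human gene symbols; the pattern scab actually uses is not documented here and may differ.

```python
import re

# Hypothetical pattern: IGH/IGK/IGL followed by V, D or J and a digit.
# The trailing digit keeps constant-region genes such as IGHD and IGHM.
IG_VDJ_PATTERN = re.compile(r"^IG[HKL][VDJ]\d")


def non_ig_genes(gene_names):
    """Return gene names that do not look like Ig V, D or J gene segments."""
    return [g for g in gene_names if not IG_VDJ_PATTERN.match(g)]


genes = ["IGHV1-2", "IGKJ1", "IGLV2-14", "IGHD3-10", "IGHD", "IGHM", "CD19", "IGKC"]
kept = non_ig_genes(genes)
print(kept)
```

Excluding these segments keeps clonally expanded B cells from clustering on their individual V(D)J usage rather than on broader transcriptional state.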

scab.tools.embeddings.umap(adata: anndata.AnnData, solver: str = 'arpack', n_neighbors: int | None = None, n_pcs: int | None = None, force_pca: bool = False, ignore_ig: bool = True, paga: bool = True, batch_key: str | None = None, use_rna_velocity: bool = False, use_rep: str | None = None, random_state: int | float | str = 42, resolution: float = 1.0, verbose: bool = True) → anndata.AnnData

Performs PCA, neighborhood graph construction and UMAP embedding. PAGA is optional, but is performed by default.

Parameters:
  • adata (anndata.AnnData) – AnnData object containing gene counts data.

  • solver (str, default='arpack') – Solver to use for the PCA.

  • n_neighbors (int, default=10) – Number of neighbors to calculate for the neighbor graph.

  • n_pcs (int, default=40) – Number of principal components to use when computing the neighbor graph. Although the default value is generally appropriate, it is sometimes useful to empirically determine the optimal value for n_pcs.

  • force_pca (bool, default=False) – Construct the PCA even if it has already been constructed ("X_pca" exists in adata.obsm). Default is False, which will use an existing PCA.

  • ignore_ig (bool, default=True) – Ignores immunoglobulin V, D and J genes when computing the PCA.

  • paga (bool, default=True) – If True, performs partition-based graph abstraction (PAGA) prior to UMAP embedding.

  • batch_key (str, optional) – If adata contains batch information, this is the key in adata.obs that contains the batch information. If provided, neighbors will be computed using batch-balanced KNN (scanpy.external.pp.bbknn) rather than scanpy.pp.neighbors.

  • use_rna_velocity (bool, default=False) – If True, uses RNA velocity information to compute PAGA. If False, this option is ignored.

  • use_rep (str, optional) – Representation to use when computing neighbors. For example, if data have been batch corrected with scanorama, the representation should be 'X_Scanorama'. If not provided, scanpy’s default representation is used.

  • random_state (int, default=42) – Seed for the random state used by sc.tl.umap.

  • resolution (float, default=1.0) – Resolution for Leiden clustering.

Returns:

adata

Return type:

anndata.AnnData

scab.tools.embeddings.dimensionality_reduction(**kwargs) → anndata.AnnData

Deprecated, but retained for backwards compatibility. Use scab.tl.umap instead.

specificity

classify_specificity

Classifies BCR specificity using antigen barcodes (AgBCs).

scab.tools.specificity.classify_specificity(adata: anndata.AnnData, raw: anndata.AnnData | str, agbcs: typing.Iterable | None = None, groups: dict | None = None, rename: dict | None = None, percentile: float = 0.997, percentile_dict: dict | None = None, threshold_dict: dict | None = None, agbc_regex: str = 'agbc', update: bool = True, uns_batch: str | None = None, verbose: bool = True) → anndata.AnnData | pandas.DataFrame

Classifies BCR specificity using antigen barcodes (AgBCs). Thresholds are computed by analyzing background AgBC UMI counts in empty droplets.

Caution

In order to set accurate thresholds, we must remove all cell-containing droplets from the raw counts matrix. Because adata comprises only cell-containing droplets, we simply remove all of the droplet barcodes in adata from raw. Thus, it is very important that adata and raw are well matched.

For example, if processing a single Chromium reaction containing several multiplexed samples, adata should contain all of the multiplexed samples, since the raw matrix produced by CellRanger will also include all droplets in the reaction. If adata was missing one or more samples, cell-containing droplets cannot accurately be removed from raw and classification accuracy will be adversely affected.
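
The sketch below illustrates why the match between adata and raw matters. It removes known cell barcodes from a toy raw matrix and takes a percentile of what remains as the threshold. The helper background_threshold, the nearest-rank percentile, and all counts are assumptions made for illustration; scab's own calculation may differ.

```python
import math


def background_threshold(raw_counts, cell_barcodes, percentile=0.997):
    """Nearest-rank percentile of AgBC counts in droplets not listed as cells.

    `raw_counts` is a hypothetical dict of barcode -> log2-normalized UMI count.
    """
    cells = set(cell_barcodes)
    background = sorted(c for bc, c in raw_counts.items() if bc not in cells)
    if not background:
        raise ValueError("no empty droplets remain after removing cell barcodes")
    # Round before ceil to avoid floating-point noise in percentile * n.
    rank = max(1, math.ceil(round(percentile * len(background), 9)))
    return background[rank - 1]


# 1000 empty droplets with low background and 50 cells with a strong signal.
raw_counts = {f"empty{i}": i / 100 for i in range(1000)}
raw_counts.update({f"cell{i}": 12.0 for i in range(50)})
all_cells = [f"cell{i}" for i in range(50)]

# adata contains every cell in the reaction: only background remains.
matched = background_threshold(raw_counts, all_cells)
# adata is missing a sample (40 cells): those cells contaminate the background.
mismatched = background_threshold(raw_counts, all_cells[:10])
print(matched, mismatched)
```

In the mismatched case the leftover cells dominate the upper tail of the distribution, so the threshold jumps to the signal level and true positives would be missed.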

Parameters:
  • adata (anndata.AnnData) – Input AnnData object. Log2-normalized AgBC UMI counts should be found in adata.obs. If data was read using scab.read_10x_mtx(), the resulting AnnData object will already be correctly formatted.

  • raw (anndata.AnnData or str) –

    Raw matrix data. Either a path to a directory containing the raw .mtx file produced by CellRanger, or an anndata.AnnData object containing the raw matrix data. As with adata, log2-normalized AgBC UMIs should be found at raw.obs.

    Tip

    If reading the raw counts matrix with scab.read_10x_mtx(), it can be helpful to include ignore_zero_quantile_agbcs=False. In some cases with very little AgBC background, AgBCs can be incorrectly removed from the raw counts matrix.

  • agbcs (iterable object, optional) – A list of AgBCs to be classified. Either agbcs or groups is required. If both are provided, both will be used.

  • groups (dict, optional) – A dict mapping specificity names to a list of one or more AgBCs. This is particularly useful when multiple AgBCs correspond to the same antigen (either because dual-labeled AgBCs were used, or because several AgBCs are closely-related molecules that would be expected to compete for BCR binding). Either agbcs or groups is required. If both are provided, both will be used.

  • rename (dict, optional) – A dict mapping AgBC or group names to a new name. Keys should be present in either agbcs or groups.keys(). If only a subset of AgBCs or groups are provided in rename, then only those AgBCs or groups will be renamed.

  • percentile (float, default=0.997) – Percentile used to compute the AgBC classification threshold using raw data. Default is 0.997, which corresponds to three standard deviations.

  • percentile_dict (dict, optional) – A dict mapping AgBC or group names to the desired percentile. If only a subset of AgBCs or groups are provided in percentile_dict, all others will use percentile.

  • update (bool, default=True) – If True, update adata with grouped UMI counts and classifications. If False, a Pandas DataFrame containing classifications will be returned and adata will not be modified.

  • uns_batch (str, default=None) –

    If provided, uns_batch will add batch information to the percentile and threshold data stored in adata.uns. This results in an additional layer of nesting, which allows concatenating multiple AnnData objects representing different batches for which classification is performed separately. If not provided, the data stored in uns would be formatted like:

    adata.uns['agbc_percentiles'] = {agbc1: percentile1, ...}
    adata.uns['agbc_thresholds'] = {agbc1: threshold1, ...}
    

    If uns_batch is provided, uns will be formatted like:

    adata.uns['agbc_percentiles'] = {uns_batch: {agbc1: percentile1, ...}}
    adata.uns['agbc_thresholds'] = {uns_batch: {agbc1: threshold1, ...}}
    

  • verbose (bool, default=True) – If True, calculated threshold values are printed.

Returns:

output – If update is True, an updated adata object containing specificity classifications is returned. Otherwise, a Pandas DataFrame containing specificity classifications is returned.

Return type:

anndata.AnnData or pandas.DataFrame
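
The benefit of the extra nesting added by uns_batch shows up when threshold data from separately classified batches are combined. The sketch below merges plain dictionaries shaped like the uns entries described above; merge_uns is a hypothetical helper, the values are invented, and it does not reproduce how anndata merges uns during concatenation.

```python
def merge_uns(*uns_dicts, keys=("agbc_percentiles", "agbc_thresholds")):
    """Merge the top level of each AgBC entry across several uns-like dicts."""
    merged = {key: {} for key in keys}
    for uns in uns_dicts:
        for key in keys:
            # Later dicts overwrite earlier ones when top-level names collide.
            merged[key].update(uns.get(key, {}))
    return merged


# With uns_batch: entries are keyed by batch, so nothing collides.
uns_a = {"agbc_percentiles": {"batchA": {"agbc1": 0.997}},
         "agbc_thresholds": {"batchA": {"agbc1": 5.2}}}
uns_b = {"agbc_percentiles": {"batchB": {"agbc1": 0.997}},
         "agbc_thresholds": {"batchB": {"agbc1": 6.1}}}
nested = merge_uns(uns_a, uns_b)

# Without uns_batch: both batches use the key 'agbc1', so one threshold is lost.
flat_a = {"agbc_thresholds": {"agbc1": 5.2}}
flat_b = {"agbc_thresholds": {"agbc1": 6.1}}
flat = merge_uns(flat_a, flat_b)

print(nested["agbc_thresholds"])
print(flat["agbc_thresholds"])
```

Keeping per-batch thresholds distinguishable matters because each batch has its own background distribution.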