tools: tl
batch correction
Batch effect correction using ComBat [Johnson07].
Data integration and batch correction using mutual nearest neighbors [Haghverdi19].
Data integration and batch correction using mutual nearest neighbors [Haghverdi19].
- scab.tools.batch_correction.combat(adata: anndata.AnnData, batch_key: str = 'batch', covariates: typing.Iterable | None = None, post_correction_umap: bool = True, verbose: bool = True) → anndata.AnnData
Batch effect correction using ComBat [Johnson07].
See also
W. Evan Johnson, Cheng Li, Ariel Rabinovic. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 2007, doi: 10.1093/biostatistics/kxj037
- Parameters:
adata (anndata.AnnData) – AnnData object containing gene counts data.
batch_key (str, default='batch') – Name of the column in adata.obs that corresponds to the batch.
covariates (iterable object, optional) – List of additional covariates besides the batch variable, such as adjustment variables or biological condition. Not including covariates may lead to the removal of real biological signal.
post_correction_umap (bool, default=True) – If True, UMAP will be computed on the post-integration data using scab.tl.umap().
verbose (bool, default=True) – If True, print progress.
- Returns:
adata
- Return type:
anndata.AnnData
- scab.tools.batch_correction.harmony(adata: anndata.AnnData, batch_key: str = 'batch', adjusted_basis: str = 'X_pca_harmony', n_dim: int = 50, force_pca: bool = False, post_correction_umap: bool = True, verbose: bool = True) → anndata.AnnData
Data integration and batch correction using mutual nearest neighbors [Haghverdi19]. Uses the scanpy.external.pp.mnn_correct() function.
See also
Laleh Haghverdi, Aaron T L Lun, Michael D Morgan & John C Marioni. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology 2019, doi: 10.1038/nbt.4091
- Parameters:
adata (anndata.AnnData) – AnnData object containing gene counts data.
batch_key (str, default='batch') – Name of the column in adata.obs that corresponds to the batch.
adjusted_basis (str, default='X_pca_harmony') – Name of the basis in adata.obsm that will be added by harmony.
n_dim (int, default=50) – Number of dimensions to use for PCA.
force_pca (bool, default=False) – If True, PCA will be run even if adata.obsm['X_pca'] already exists.
post_correction_umap (bool, default=True) – If True, UMAP will be computed on the batch corrected data using scab.tl.umap().
verbose (bool, default=True) – If True, print progress.
- Returns:
adata
- Return type:
anndata.AnnData
- scab.tools.batch_correction.mnn(adata: anndata.AnnData, batch_key: str = 'batch', min_hvg_batches: int = 1, post_correction_umap: bool = True, verbose: bool = True) → anndata.AnnData
Data integration and batch correction using mutual nearest neighbors [Haghverdi19]. Uses the scanpy.external.pp.mnn_correct() function.
See also
Laleh Haghverdi, Aaron T L Lun, Michael D Morgan & John C Marioni. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology 2019, doi: 10.1038/nbt.4091
- Parameters:
adata (anndata.AnnData) – AnnData object containing gene counts data.
batch_key (str, default='batch') – Name of the column in adata.obs that corresponds to the batch.
min_hvg_batches (int, default=1) – Minimum number of batches in which a gene must be highly variable in order to be included in the list of genes used for batch correction. Default is 1, which results in the use of all HVGs found in any batch.
post_correction_umap (bool, default=True) – If True, UMAP will be computed on the batch corrected data using scab.tl.umap().
verbose (bool, default=True) – If True, print progress.
- Returns:
adata
- Return type:
anndata.AnnData
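The effect of min_hvg_batches can be pictured as a per-gene count of the batches in which that gene is highly variable. The following is a minimal sketch of that idea only, not scab's internal code: the helper name select_hvgs, the input dictionary and the gene lists are all hypothetical.

```python
from collections import Counter


def select_hvgs(hvgs_by_batch, min_hvg_batches=1):
    """Return genes that are highly variable in at least `min_hvg_batches` batches.

    `hvgs_by_batch` is a hypothetical dict mapping batch name -> iterable of HVG names.
    """
    counts = Counter()
    for genes in hvgs_by_batch.values():
        # use a set so that each gene is counted at most once per batch
        counts.update(set(genes))
    return sorted(gene for gene, n in counts.items() if n >= min_hvg_batches)


# illustrative (made-up) per-batch HVG lists
hvgs_by_batch = {
    "batch1": ["CD19", "MS4A1", "XBP1"],
    "batch2": ["CD19", "XBP1", "JCHAIN"],
}

# min_hvg_batches=1 keeps the union of all per-batch HVGs
union_hvgs = select_hvgs(hvgs_by_batch, min_hvg_batches=1)

# min_hvg_batches=2 keeps only genes that are HVGs in both batches
shared_hvgs = select_hvgs(hvgs_by_batch, min_hvg_batches=2)
```

Raising min_hvg_batches trades gene coverage for robustness: fewer genes are used for correction, but each one is variable in several batches.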
- scab.tools.batch_correction.scanorama(adata: anndata.AnnData, batch_key: str = 'batch', scanorama_key: str = 'X_Scanorama', n_dim: int = 50, post_correction_umap: bool = True, verbose: bool = True) → anndata.AnnData
Batch correction using Scanorama [Hie19].
See also
Brian Hie, Bryan Bryson, and Bonnie Berger. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nature Biotechnology 2019, doi: 10.1038/s41587-019-0113-3
- Parameters:
adata (anndata.AnnData) – AnnData object containing gene counts data.
batch_key (str, default='batch') – Name of the column in adata.obs that corresponds to the batch.
post_correction_umap (bool, default=True) – If True, UMAP will be computed on the batch corrected data using scab.tl.umap().
verbose (bool, default=True) – If True, print progress.
- Returns:
adata
- Return type:
anndata.AnnData
cellhashes
Demultiplexes cells using cell hashes.
- scab.tools.cellhashes.demultiplex(adata: anndata.AnnData, hash_names: typing.Iterable | None = None, cellhash_regex: str = 'cell ?hash', ignore_cellhash_case: bool = True, rename: dict | None = None, assignment_key: str = 'cellhash_assignment', threshold_minimum: float = 4.0, threshold_maximum: float = 10.0, kde_minimum: float = 0.0, kde_maximum: float = 15.0, assignments_only: bool = False, debug: bool = False) → anndata.AnnData | pandas.Series
Demultiplexes cells using cell hashes.
- Parameters:
adata (anndata.AnnData) – AnnData object containing cellhash UMI counts in adata.obs.
hash_names (iterable object, optional) – List of hashnames, which correspond to column names in adata.obs. Overrides cellhash name matching using cellhash_regex. If not provided, all columns in adata.obs that match cellhash_regex will be assumed to be hashnames and processed.
cellhash_regex (str, default='cell ?hash') – A regular expression (regex) string used to identify cell hashes. The regex must be found in all cellhash names. The default is 'cell ?hash', which, combined with the default setting for ignore_cellhash_case, will match 'cellhash' or 'cell hash' anywhere in the cell hash name and in any combination of upper or lower case letters.
ignore_cellhash_case (bool, default=True) – If True, matching to cellhash_regex will ignore case.
rename (dict, optional) – A dict linking cell hash names (column names in adata.obs) to the preferred batch name. For example, if the cell hash name 'Cellhash1' corresponded to the sample 'Sample1', an example rename argument would be:
    {'Cellhash1': 'Sample1'}
This would result in all cells classified as positive for 'Cellhash1' being labeled as 'Sample1' in the resulting assignment column (adata.obs.cellhash_assignment by default, adjustable using assignment_key).
assignment_key (str, default='cellhash_assignment') – Column name (in adata.obs) into which cellhash assignments will be stored.
threshold_minimum (float, default=4.0) – Minimum acceptable log2-normalized UMI count threshold. Potential thresholds below this cutoff value will be ignored.
threshold_maximum (float, default=10.0) – Maximum acceptable log2-normalized UMI count threshold. Potential thresholds above this cutoff value will be ignored.
kde_maximum (float, default=15.0) – Upper limit of the KDE plot (in log2-normalized UMI counts). This should be greater than threshold_maximum, or you may obtain strange results.
assignments_only (bool, default=False) – If True, return a pandas Series object containing only the group assignment. Suitable for appending to an existing dataframe. If False, an updated adata object is returned, containing cell hash group assignments at adata.obs[assignment_key].
debug (bool, default=False) – If True, saves cell hash KDE plots and prints intermediate information for debugging.
- Returns:
output – By default, an updated adata is returned with cell hash assignment groups stored in the assignment_key column of adata.obs. If assignments_only is True, a pandas.Series of cell hash assignments is returned.
- Return type:
anndata.AnnData or pandas.Series
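The column matching described for cellhash_regex can be reproduced with the standard re module. This is a sketch of the documented matching behaviour only, not scab's implementation: the helper find_hash_columns and the column names are hypothetical, and leaving unrenamed hashes under their original names is an assumption.

```python
import re


def find_hash_columns(columns, cellhash_regex="cell ?hash", ignore_case=True):
    """Return the column names in which `cellhash_regex` is found anywhere."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(cellhash_regex, flags)
    # search() matches anywhere in the name, not just at the start
    return [name for name in columns if pattern.search(name)]


# hypothetical adata.obs column names
columns = ["CellHash1", "cell hash 2", "n_genes", "AgBC_HA", "CELLHASH3"]

# default behaviour: case-insensitive match
matched = find_hash_columns(columns)

# with case sensitivity, only the all-lowercase name matches
matched_case_sensitive = find_hash_columns(columns, ignore_case=False)

# ASSUMPTION: names missing from `rename` keep their original label
rename = {"CellHash1": "Sample1"}
labels = [rename.get(name, name) for name in matched]
```

If the default pattern picks up unrelated columns, passing hash_names explicitly avoids regex matching altogether.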
clonality
Assigns BCR sequences to clonal lineages using the clonify [Briney16] algorithm.
Computes length and mutation adjusted Levenshtein distance for a pair of sequences.
- scab.tools.clonify.clonify(adata: anndata.AnnData, distance_cutoff: float = 0.32, shared_mutation_bonus: float = 0.65, length_penalty_multiplier: int | float = 2, group_by_v: bool = True, group_by_j: bool = True, group_light_by_v: bool = True, group_light_by_j: bool = True, preclustering: bool = False, preclustering_threshold: float = 0.65, preclustering_field: str = 'cdr3_nt', lineage_field: str = 'lineage', lineage_size_field: str = 'lineage_size', annotation_format: str = 'airr', return_assignment_dict: bool = False, pairs_only: bool = True, use_multiple_heavy_chains: bool = True) → dict | anndata.AnnData
Assigns BCR sequences to clonal lineages using the clonify [Briney16] algorithm.
See also
Bryan Briney, Khoa Le, Jiang Zhu, and Dennis R Burton. Clonify: unseeded antibody lineage assignment from next-generation sequencing data. Scientific Reports 2016. https://doi.org/10.1038/srep23901
- Parameters:
adata (anndata.AnnData) – AnnData object containing annotated sequence data at adata.obs.bcr. If data was read using scab.read_10x_mtx(), BCR data should already be in the correct location.
distance_cutoff (float, default=0.32) – Distance threshold for lineage clustering.
shared_mutation_bonus (float, default=0.65) – Bonus applied for each shared V-gene mutation.
length_penalty_multiplier (int, default=2) – Multiplier for the CDR3 length penalty. Default is 2, resulting in CDR3s that differ by n amino acids being penalized n * 2.
group_by_v (bool, default=True) – If True, sequences are grouped by V-gene use prior to lineage assignment. This option is additive with group_by_j. For example, if group_by_v == True and group_by_j == True, sequences will be grouped by both V-gene and J-gene.
group_by_j (bool, default=True) – If True, sequences are grouped by J-gene use prior to lineage assignment. This option is additive with group_by_v. For example, if group_by_v == True and group_by_j == True, sequences will be grouped by both V-gene and J-gene.
group_light_by_v (bool, default=True) – If True, heavy chain sequences are grouped by their paired light chain V-gene prior to lineage assignment. The purpose is to ensure that light chains are coherent across an assigned lineage. Also, if multiple light chains are present, this option makes it easier to identify the “best” light chain by identifying the light chain that best fits the largest lineage. This option is additive with group_light_by_j.
group_light_by_j (bool, default=True) – If True, heavy chain sequences are grouped by their paired light chain J-gene prior to lineage assignment. The purpose is to ensure that light chains are coherent across an assigned lineage. Also, if multiple light chains are present, this option makes it easier to identify the “best” light chain by identifying the light chain that best fits the largest lineage. This option is additive with group_light_by_v.
preclustering (bool, default=False) – If True, V/J groups are pre-clustered on the preclustering_field sequence, which can potentially speed up lineage assignment and reduce memory usage. If False, each V/J group is processed in its entirety without pre-clustering.
preclustering_threshold (float, default=0.65) – Identity threshold for pre-clustering the V/J groups prior to lineage assignment.
preclustering_field (str, default='cdr3_nt') – Annotation field on which to pre-cluster sequences.
lineage_field (str, default='lineage') – Name of the lineage assignment field.
lineage_size_field (str, default='lineage_size') – Name of the lineage size field.
annotation_format (str, default='airr') – Format of the input sequence annotations. Choices are 'airr' or 'json'.
return_assignment_dict (bool, default=False) – If True, a dictionary linking sequence IDs to lineage names will be returned. If False, the input anndata.AnnData object will be returned, with lineage annotations included.
- Returns:
output – By default (return_assignment_dict == False), an updated adata object is returned with two additional columns populated: adata.obs.bcr_lineage, which contains the lineage assignment, and adata.obs.bcr_lineage_size, which contains the lineage size. If return_assignment_dict == True, a dict mapping droplet barcodes (adata.obs_names) to lineage names is returned.
- Return type:
anndata.AnnData or dict
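The additive behaviour of group_by_v and group_by_j amounts to building a composite grouping key before any distances are computed. The sketch below illustrates that idea with plain dictionaries; the function group_sequences and the field names "id", "v_gene" and "j_gene" are hypothetical stand-ins, not scab's data model.

```python
from collections import defaultdict


def group_sequences(seqs, group_by_v=True, group_by_j=True):
    """Bucket sequences by V-gene and/or J-gene prior to lineage assignment."""
    groups = defaultdict(list)
    for seq in seqs:
        # a disabled option contributes None, so it does not split the groups
        key = (
            seq["v_gene"] if group_by_v else None,
            seq["j_gene"] if group_by_j else None,
        )
        groups[key].append(seq["id"])
    return dict(groups)


# made-up heavy chain annotations
seqs = [
    {"id": "cell1", "v_gene": "IGHV1-2", "j_gene": "IGHJ4"},
    {"id": "cell2", "v_gene": "IGHV1-2", "j_gene": "IGHJ6"},
    {"id": "cell3", "v_gene": "IGHV1-2", "j_gene": "IGHJ4"},
]

# both options enabled: sequences must share V-gene and J-gene
by_vj = group_sequences(seqs)

# only V-gene grouping: all three sequences land in one group
by_v = group_sequences(seqs, group_by_j=False)
```

Only sequences within the same group are compared, which is why tighter grouping reduces both runtime and the chance of merging unrelated sequences.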
- scab.tools.clonify.pairwise_distance(s1: abutils.Sequence, s2: abutils.Sequence, shared_mutation_bonus: float = 0.65, length_penalty_multiplier: int | float = 2, cdr3_field: str = 'cdr3', mutations_field: str = 'mutations') → float
Computes length and mutation adjusted Levenshtein distance for a pair of sequences.
- Parameters:
s1 (abutils.Sequence) – input sequence
s2 (abutils.Sequence) – input sequence
shared_mutation_bonus (float, optional) – The bonus for each shared mutation, by default 0.65
length_penalty_multiplier (Union[int, float], optional) – Used to compute the penalty for differences in CDR3 length. The length difference is multiplied by length_penalty_multiplier, by default 2
cdr3_field (str, optional) – Name of the field in s1 and s2 containing the CDR3 sequence, by default “cdr3”
mutations_field (str, optional) – Name of the field in s1 and s2 containing mutation information, by default “mutations”
- Returns:
distance
- Return type:
float
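To make the interplay of the three terms concrete, here is a rough sketch of a length and mutation adjusted Levenshtein distance. It takes plain strings and mutation lists rather than abutils.Sequence objects, and the way the terms are combined (normalizing by the shorter CDR3 and flooring at zero) is an assumption made for illustration, not a statement of how scab computes the value.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def adjusted_distance(cdr3_a, muts_a, cdr3_b, muts_b,
                      shared_mutation_bonus=0.65, length_penalty_multiplier=2):
    """Hypothetical helper combining edit distance, length penalty and mutation bonus."""
    # the CDR3 length difference is multiplied by length_penalty_multiplier
    length_penalty = abs(len(cdr3_a) - len(cdr3_b)) * length_penalty_multiplier
    # each mutation present in both sequences earns one bonus
    mutation_bonus = len(set(muts_a) & set(muts_b)) * shared_mutation_bonus
    # ASSUMPTION: normalize by the shorter CDR3 and never return a negative value
    shortest = max(min(len(cdr3_a), len(cdr3_b)), 1)
    raw = levenshtein(cdr3_a, cdr3_b) + length_penalty - mutation_bonus
    return max(raw / shortest, 0.0)


# one substitution, one shared mutation: (1 + 0 - 0.65) / 5
close = adjusted_distance("ARDYW", ["12:A>G", "45:C>T"], "ARDFW", ["12:A>G"])

# two inserted residues, no shared mutations: (2 + 4 - 0) / 5
far = adjusted_distance("ARDYW", [], "ARDYYYW", [])
```

Under these assumptions the first pair would fall below the default distance_cutoff of 0.32 used by clonify, while the second would not.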
embeddings
Performs PCA, neighborhood graph construction and UMAP embedding.
Performs PCA, neighborhood graph construction and UMAP embedding.
Deprecated, but retained for backwards compatibility.
- scab.tools.embeddings.pca(adata: anndata.AnnData, solver: str = 'arpack', n_pcs: int = 50, ignore_ig: bool = True, verbose: bool = True) → anndata.AnnData
Performs PCA, neighborhood graph construction and UMAP embedding. PAGA is optional, but is performed by default.
- Parameters:
adata (anndata.AnnData) – AnnData object containing gene counts data.
solver (str, default='arpack') – Solver to use for the PCA.
n_pcs (int, default=50) – Number of principal components to use when computing the neighbor graph. Although the default value is generally appropriate, it is sometimes useful to empirically determine the optimal value for n_pcs.
ignore_ig (bool, default=True) – Ignores immunoglobulin V, D and J genes when computing the PCA.
- Returns:
adata
- Return type:
anndata.AnnData
- scab.tools.embeddings.umap(adata: anndata.AnnData, solver: str = 'arpack', n_neighbors: int | None = None, n_pcs: int | None = None, force_pca: bool = False, ignore_ig: bool = True, paga: bool = True, batch_key: str | None = None, use_rna_velocity: bool = False, use_rep: str | None = None, random_state: int | float | str = 42, resolution: float = 1.0, verbose: bool = True) → anndata.AnnData
Performs PCA, neighborhood graph construction and UMAP embedding. PAGA is optional, but is performed by default.
- Parameters:
adata (anndata.AnnData) – AnnData object containing gene counts data.
solver (str, default='arpack') – Solver to use for the PCA.
n_neighbors (int, default=10) – Number of neighbors to calculate for the neighbor graph.
n_pcs (int, default=40) – Number of principal components to use when computing the neighbor graph. Although the default value is generally appropriate, it is sometimes useful to empirically determine the optimal value for n_pcs.
force_pca (bool, default=False) – Construct the PCA even if it has already been constructed ("X_pca" exists in adata.obsm). Default is False, which will use an existing PCA.
ignore_ig (bool, default=True) – Ignores immunoglobulin V, D and J genes when computing the PCA.
paga (bool, default=True) – If True, performs partition-based graph abstraction (PAGA) prior to UMAP embedding.
batch_key (str, optional) – If adata contains batch information, this is the key in adata.obs that contains the batch information. If provided, neighbors will be computed using batch-balanced KNN (scanpy.external.pp.bbknn) rather than scanpy.pp.neighbors.
use_rna_velocity (bool, default=False) – If True, uses RNA velocity information to compute PAGA. If False, this option is ignored.
use_rep (str, optional) – Representation to use when computing neighbors. For example, if data have been batch normalized with scanorama, the representation should be 'X_Scanorama'. If not provided, scanpy’s default representation is used.
random_state (int, optional) – Seed for the random state used by sc.tl.umap.
resolution (float, default=1.0) – Resolution for Leiden clustering.
- Returns:
adata
- Return type:
anndata.AnnData
- scab.tools.embeddings.dimensionality_reduction(**kwargs) → anndata.AnnData
Deprecated, but retained for backwards compatibility. Use scab.tl.umap instead.
specificity
Classifies BCR specificity using antigen barcodes (AgBCs).
- scab.tools.specificity.classify_specificity(adata: anndata.AnnData, raw: anndata.AnnData | str, agbcs: typing.Iterable | None = None, groups: dict | None = None, rename: dict | None = None, percentile: float = 0.997, percentile_dict: dict | None = None, threshold_dict: dict | None = None, agbc_regex: str = 'agbc', update: bool = True, uns_batch: str | None = None, verbose: bool = True) → anndata.AnnData | pandas.DataFrame
Classifies BCR specificity using antigen barcodes (AgBCs). Thresholds are computed by analyzing background AgBC UMI counts in empty droplets.
Caution
In order to set accurate thresholds, we must remove all cell-containing droplets from the raw counts matrix. Because adata comprises only cell-containing droplets, we simply remove all of the droplet barcodes in adata from raw. Thus, it is very important that adata and raw are well matched.
For example, if processing a single Chromium reaction containing several multiplexed samples, adata should contain all of the multiplexed samples, since the raw matrix produced by CellRanger will also include all droplets in the reaction. If adata was missing one or more samples, cell-containing droplets cannot accurately be removed from raw and classification accuracy will be adversely affected.
- Parameters:
adata (anndata.AnnData) – Input AnnData object. Log2-normalized AgBC UMI counts should be found in adata.obs. If data was read using scab.read_10x_mtx(), the resulting AnnData object will already be correctly formatted.
raw (anndata.AnnData or str) – Raw matrix data. Either a path to a directory containing the raw .mtx file produced by CellRanger, or an anndata.AnnData object containing the raw matrix data. As with adata, log2-normalized AgBC UMIs should be found at raw.obs.
Tip
If reading the raw counts matrix with scab.read_10x_mtx(), it can be helpful to include ignore_zero_quantile_agbcs=False. In some cases with very little AgBC background, AgBCs can be incorrectly removed from the raw counts matrix.
agbcs (iterable object, optional) – A list of AgBCs to be classified. Either agbcs or groups is required. If both are provided, both will be used.
groups (dict, optional) – A dict mapping specificity names to a list of one or more AgBCs. This is particularly useful when multiple AgBCs correspond to the same antigen (either because dual-labeled AgBCs were used, or because several AgBCs are closely related molecules that would be expected to compete for BCR binding). Either agbcs or groups is required. If both are provided, both will be used.
rename (dict, optional) – A dict mapping AgBC or group names to a new name. Keys should be present in either agbcs or groups.keys(). If only a subset of AgBCs or groups are provided in rename, then only those AgBCs or groups will be renamed.
percentile (float, default=0.997) – Percentile used to compute the AgBC classification threshold using raw data. Default is 0.997, which corresponds to three standard deviations.
percentile_dict (dict, optional) – A dict mapping AgBC or group names to the desired percentile. If only a subset of AgBCs or groups are provided in percentile_dict, all others will use percentile.
update (bool, default=True) – If True, update adata with grouped UMI counts and classifications. If False, a Pandas DataFrame containing classifications will be returned and adata will not be modified.
uns_batch (str, default=None) – If provided, uns_batch will add batch information to the percentile and threshold data stored in adata.uns. This results in an additional layer of nesting, which allows concatenating multiple AnnData objects representing different batches for which classification is performed separately. If not provided, the data stored in uns would be formatted like:
    adata.uns['agbc_percentiles'] = {agbc1: percentile1, ...}
    adata.uns['agbc_thresholds'] = {agbc1: threshold1, ...}
If uns_batch is provided, uns will be formatted like:
    adata.uns['agbc_percentiles'] = {uns_batch: {agbc1: percentile1, ...}}
    adata.uns['agbc_thresholds'] = {uns_batch: {agbc1: threshold1, ...}}
verbose (bool, default=True) – If True, calculated threshold values are printed.
- Returns:
output – If update is True, an updated adata object containing specificity classifications is returned. Otherwise, a Pandas DataFrame containing specificity classifications is returned.
- Return type:
anndata.AnnData or pandas.DataFrame
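The thresholding idea described above (drop cell-containing droplets from the raw data, then take a high percentile of the remaining background) can be sketched with plain Python. Everything here is illustrative: the function background_threshold, the barcodes and the counts are hypothetical, the nearest-rank percentile is an assumption (scab may interpolate differently), and treating counts strictly above the threshold as positive is also an assumption.

```python
import math


def background_threshold(raw_counts, cell_barcodes, percentile=0.997):
    """Percentile of AgBC counts in empty droplets.

    `raw_counts` is a hypothetical dict mapping droplet barcode ->
    log2-normalized AgBC UMI count for every droplet in the reaction.
    """
    cells = set(cell_barcodes)
    # remove every cell-containing droplet, leaving only background
    background = sorted(v for bc, v in raw_counts.items() if bc not in cells)
    if not background:
        raise ValueError("no empty droplets left after removing cell barcodes")
    # ASSUMPTION: nearest-rank percentile
    rank = max(math.ceil(percentile * len(background)), 1)
    return background[rank - 1]


# eight empty droplets plus two cell-containing droplets (made-up values)
raw_counts = {
    "e1": 0.0, "e2": 0.0, "e3": 1.0, "e4": 1.0,
    "e5": 1.0, "e6": 2.0, "e7": 2.0, "e8": 3.0,
    "cellA": 6.5, "cellB": 1.0,
}
cell_barcodes = ["cellA", "cellB"]

# a low percentile is used only because this toy dataset is tiny
threshold = background_threshold(raw_counts, cell_barcodes, percentile=0.75)

# ASSUMPTION: a cell is called positive when its count exceeds the threshold
calls = {bc: raw_counts[bc] > threshold for bc in cell_barcodes}
```

If cellA were missing from cell_barcodes, its high count would leak into the background and inflate the threshold, which is the mismatch the caution above warns about.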