tools: tl
batch correction
combat | Batch effect correction using ComBat [Johnson07].
harmony | Data integration and batch correction using Harmony [Korsunsky19].
mnn | Data integration and batch correction using mutual nearest neighbors [Haghverdi19].
scanorama | Batch correction using Scanorama [Hie19].
- scab.tools.batch_correction.combat(adata: anndata.AnnData, batch_key: str = 'batch', covariates: Iterable | None = None, post_correction_umap: bool = True, verbose: bool = True) → anndata.AnnData
Batch effect correction using ComBat [Johnson07].
See also
W. Evan Johnson, Cheng Li, and Ariel Rabinovic. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 2007, doi: 10.1093/biostatistics/kxj037
- Parameters:
adata (anndata.AnnData) – AnnData object containing gene counts data.
batch_key (str, default='batch') – Name of the column in adata.obs that corresponds to the batch.
covariates (iterable object, optional) – List of additional covariates besides the batch variable, such as adjustment variables or biological condition. Not including covariates may lead to the removal of real biological signal.
post_correction_umap (bool, default=True) – If True, UMAP will be computed on the post-integration data using scab.tl.umap().
verbose (bool, default=True) – If True, print progress.
- Returns:
adata
- Return type:
anndata.AnnData
- scab.tools.batch_correction.harmony(adata: anndata.AnnData, batch_key: str = 'batch', adjusted_basis: str = 'X_pca_harmony', n_dim: int = 50, force_pca: bool = False, post_correction_umap: bool = True, verbose: bool = True) → anndata.AnnData
Data integration and batch correction using Harmony [Korsunsky19]. Uses the scanpy.external.pp.harmony_integrate() function.
See also
Ilya Korsunsky, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods 2019, doi: 10.1038/s41592-019-0619-0
- Parameters:
adata (anndata.AnnData) – AnnData object containing gene counts data.
batch_key (str, default='batch') – Name of the column in adata.obs that corresponds to the batch.
adjusted_basis (str, default='X_pca_harmony') – Name of the basis in adata.obsm that will be added by Harmony.
n_dim (int, default=50) – Number of dimensions to use for PCA.
force_pca (bool, default=False) – If True, PCA will be run even if adata.obsm['X_pca'] already exists.
post_correction_umap (bool, default=True) – If True, UMAP will be computed on the batch-corrected data using scab.tl.umap().
verbose (bool, default=True) – If True, print progress.
- Returns:
adata
- Return type:
anndata.AnnData
- scab.tools.batch_correction.mnn(adata: anndata.AnnData, batch_key: str = 'batch', min_hvg_batches: int = 1, post_correction_umap: bool = True, verbose: bool = True) → anndata.AnnData
Data integration and batch correction using mutual nearest neighbors [Haghverdi19]. Uses the scanpy.external.pp.mnn_correct() function.
See also
Laleh Haghverdi, Aaron T L Lun, Michael D Morgan, and John C Marioni. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology 2019, doi: 10.1038/nbt.4091
- Parameters:
adata (anndata.AnnData) – AnnData object containing gene counts data.
batch_key (str, default='batch') – Name of the column in adata.obs that corresponds to the batch.
min_hvg_batches (int, default=1) – Minimum number of batches in which a gene must be identified as highly variable in order to be included in the list of genes used for batch correction. Default is 1, which results in the use of all HVGs found in any batch.
post_correction_umap (bool, default=True) – If True, UMAP will be computed on the batch-corrected data using scab.tl.umap().
verbose (bool, default=True) – If True, print progress.
- Returns:
adata
- Return type:
anndata.AnnData
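The effect of min_hvg_batches can be illustrated with a small standalone sketch. It does not call scab or scanpy; the helper name select_hvgs and the per-batch gene lists are hypothetical, and the only assumption is that per-batch HVG lists are already available.

```python
from collections import Counter


def select_hvgs(hvgs_by_batch, min_hvg_batches=1):
    """Return genes flagged as highly variable in at least `min_hvg_batches` batches.

    `hvgs_by_batch` maps a batch name to the HVGs found in that batch
    (hypothetical input; scab computes this internally).
    """
    # Count, for every gene, the number of batches in which it is an HVG.
    counts = Counter(gene for genes in hvgs_by_batch.values() for gene in set(genes))
    return sorted(gene for gene, n in counts.items() if n >= min_hvg_batches)


hvgs_by_batch = {
    "batch1": ["CD19", "MS4A1", "IGHM"],
    "batch2": ["CD19", "CD79A"],
    "batch3": ["CD19", "MS4A1"],
}

print(select_hvgs(hvgs_by_batch))     # default of 1: union of HVGs from any batch
print(select_hvgs(hvgs_by_batch, 2))  # only HVGs shared by at least two batches
```

Raising min_hvg_batches trades gene coverage for consistency across batches.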
- scab.tools.batch_correction.scanorama(adata: anndata.AnnData, batch_key: str = 'batch', scanorama_key: str = 'X_Scanorama', n_dim: int = 50, post_correction_umap: bool = True, verbose: bool = True) → anndata.AnnData
Batch correction using Scanorama [Hie19].
See also
Brian Hie, Bryan Bryson, and Bonnie Berger. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nature Biotechnology 2019, doi: 10.1038/s41587-019-0113-3
- Parameters:
adata (anndata.AnnData) – AnnData object containing gene counts data.
batch_key (str, default='batch') – Name of the column in adata.obs that corresponds to the batch.
post_correction_umap (bool, default=True) – If True, UMAP will be computed on the batch-corrected data using scab.tl.umap().
verbose (bool, default=True) – If True, print progress.
- Returns:
adata
- Return type:
anndata.AnnData
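Because the four batch-correction functions above share the adata and batch_key arguments, they can be selected by name. The following is a minimal sketch, not part of scab: run_batch_correction is a hypothetical helper, and the tools argument is expected to be the scab.tools.batch_correction module (a stand-in namespace is used here so the snippet runs without scab installed).

```python
from types import SimpleNamespace

# Function names documented in scab.tools.batch_correction.
METHODS = ("combat", "harmony", "mnn", "scanorama")


def run_batch_correction(tools, adata, method="harmony", batch_key="batch", **kwargs):
    """Call the batch-correction function named `method` on `adata`.

    `tools` is any object exposing the four documented functions; extra
    keyword arguments (e.g. min_hvg_batches for mnn) are passed through.
    """
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}; expected one of {METHODS}")
    return getattr(tools, method)(adata, batch_key=batch_key, **kwargs)


def _fake(name):
    # Stand-in that only records how it was called (illustration only).
    def func(adata, batch_key="batch", **kwargs):
        return {"method": name, "batch_key": batch_key, **kwargs}
    return func


fake_tools = SimpleNamespace(**{name: _fake(name) for name in METHODS})
result = run_batch_correction(
    fake_tools, adata=None, method="mnn", batch_key="sample", min_hvg_batches=2
)
print(result)
```

With scab installed, the same helper would presumably be called as run_batch_correction(scab.tools.batch_correction, adata, "mnn").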
cellhashes
demultiplex | Demultiplexes cells using cell hashes.
- scab.tools.cellhashes.demultiplex(adata: anndata.AnnData, hash_names: Iterable | None = None, cellhash_regex: str = 'cell ?hash', ignore_cellhash_case: bool = True, rename: dict | None = None, assignment_key: str = 'cellhash_assignment', threshold_minimum: float = 4.0, threshold_maximum: float = 10.0, kde_minimum: float = 0.0, kde_maximum: float = 15.0, assignments_only: bool = False, debug: bool = False) → anndata.AnnData | pandas.Series
Demultiplexes cells using cell hashes.
- Parameters:
adata (anndata.AnnData) – AnnData object containing cellhash UMI counts in adata.obs.
hash_names (iterable object, optional) – List of hashnames, which correspond to column names in adata.obs. Overrides cellhash name matching using cellhash_regex. If not provided, all columns in adata.obs that match cellhash_regex will be assumed to be hashnames and processed.
cellhash_regex (str, default='cell ?hash') – A regular expression (regex) string used to identify cell hashes. The regex must be found in all cellhash names. The default is 'cell ?hash', which, combined with the default setting for ignore_cellhash_case, will match 'cellhash' or 'cell hash' anywhere in the cell hash name and in any combination of upper or lower case letters.
ignore_cellhash_case (bool, default=True) – If True, matching to cellhash_regex will ignore case.
rename (dict, optional) – A dict linking cell hash names (column names in adata.obs) to the preferred batch name. For example, if the cell hash name 'Cellhash1' corresponded to the sample 'Sample1', an example rename argument would be:
{'Cellhash1': 'Sample1'}
This would result in all cells classified as positive for 'Cellhash1' being labeled as 'Sample1' in the resulting assignment column (adata.obs.cellhash_assignment by default, adjustable using assignment_key).
assignment_key (str, default='cellhash_assignment') – Column name (in adata.obs) into which cellhash assignments will be stored.
threshold_minimum (float, default=4.0) – Minimum acceptable log2-normalized UMI count threshold. Potential thresholds below this cutoff value will be ignored.
threshold_maximum (float, default=10.0) – Maximum acceptable log2-normalized UMI count threshold. Potential thresholds above this cutoff value will be ignored.
kde_maximum (float, default=15.0) – Upper limit of the KDE plot (in log2-normalized UMI counts). This should be greater than threshold_maximum, or you may obtain strange results.
assignments_only (bool, default=False) – If True, return a pandas Series object containing only the group assignment. Suitable for appending to an existing dataframe. If False, an updated adata object is returned, containing cell hash group assignments at adata.obs[assignment_key].
debug (bool, default=False) – If True, saves cell hash KDE plots and prints intermediate information for debugging.
- Returns:
output – By default, an updated adata is returned with cell hash assignment groups stored in the assignment_key column of adata.obs. If assignments_only is True, a pandas.Series of cell hash assignments is returned.
- Return type:
anndata.AnnData or pandas.Series
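The column matching described for cellhash_regex and ignore_cellhash_case, and the relabeling done by rename, can be sketched with the standard re module. This is an illustration of the documented behavior rather than scab's own code; find_hash_names and the example column names are hypothetical.

```python
import re


def find_hash_names(columns, cellhash_regex="cell ?hash", ignore_cellhash_case=True):
    """Return the column names that contain a match to `cellhash_regex`."""
    flags = re.IGNORECASE if ignore_cellhash_case else 0
    pattern = re.compile(cellhash_regex, flags)
    # search() matches anywhere in the name, not only at the start.
    return [name for name in columns if pattern.search(name)]


# Hypothetical adata.obs column names.
columns = ["n_genes", "CellHash1", "cell hash 2", "AgBC_HA", "sample_cellhash3"]

hash_names = find_hash_names(columns)
case_sensitive = find_hash_names(columns, ignore_cellhash_case=False)

# A rename dict maps hash names to preferred labels; unlisted names are kept.
rename = {"CellHash1": "Sample1"}
labels = [rename.get(name, name) for name in hash_names]

print(hash_names)
print(case_sensitive)
print(labels)
```

Passing hash_names explicitly bypasses this matching entirely, which is useful when hash columns do not share a common naming pattern.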
clonality
clonify | Assigns BCR sequences to clonal lineages using the clonify [Briney16] algorithm.
pairwise_distance | Computes length and mutation adjusted Levenshtein distance for a pair of sequences.
- scab.tools.clonify.clonify(adata: anndata.AnnData, distance_cutoff: float = 0.32, shared_mutation_bonus: float = 0.65, length_penalty_multiplier: int | float = 2, group_by_v: bool = True, group_by_j: bool = True, group_light_by_v: bool = True, group_light_by_j: bool = True, preclustering: bool = False, preclustering_threshold: float = 0.65, preclustering_field: str = 'cdr3_nt', lineage_field: str = 'lineage', lineage_size_field: str = 'lineage_size', annotation_format: str = 'airr', return_assignment_dict: bool = False, pairs_only: bool = True, use_multiple_heavy_chains: bool = True) → dict | anndata.AnnData
Assigns BCR sequences to clonal lineages using the clonify [Briney16] algorithm.
See also
Bryan Briney, Khoa Le, Jiang Zhu, and Dennis R Burton. Clonify: unseeded antibody lineage assignment from next-generation sequencing data. Scientific Reports 2016. https://doi.org/10.1038/srep23901
- Parameters:
adata (anndata.AnnData) – AnnData object containing annotated sequence data at adata.obs.bcr. If data was read using scab.read_10x_mtx(), BCR data should already be in the correct location.
distance_cutoff (float, default=0.32) – Distance threshold for lineage clustering.
shared_mutation_bonus (float, default=0.65) – Bonus applied for each shared V-gene mutation.
length_penalty_multiplier (int, default=2) – Multiplier for the CDR3 length penalty. Default is 2, resulting in CDR3s that differ by n amino acids being penalized n * 2.
group_by_v (bool, default=True) – If True, sequences are grouped by V-gene use prior to lineage assignment. This option is additive with group_by_j. For example, if group_by_v == True and group_by_j == True, sequences will be grouped by both V-gene and J-gene.
group_by_j (bool, default=True) – If True, sequences are grouped by J-gene use prior to lineage assignment. This option is additive with group_by_v. For example, if group_by_v == True and group_by_j == True, sequences will be grouped by both V-gene and J-gene.
group_light_by_v (bool, default=True) – If True, heavy chain sequences are grouped by their paired light chain V-gene prior to lineage assignment. The purpose is to ensure that light chains are coherent across an assigned lineage. Also, if multiple light chains are present, this option makes it easier to identify the “best” light chain by identifying the light chain that best fits the largest lineage. This option is additive with group_light_by_j.
group_light_by_j (bool, default=True) – If True, heavy chain sequences are grouped by their paired light chain J-gene prior to lineage assignment. The purpose is to ensure that light chains are coherent across an assigned lineage. Also, if multiple light chains are present, this option makes it easier to identify the “best” light chain by identifying the light chain that best fits the largest lineage. This option is additive with group_light_by_v.
preclustering (bool, default=False) – If True, V/J groups are pre-clustered on the preclustering_field sequence, which can potentially speed up lineage assignment and reduce memory usage. If False, each V/J group is processed in its entirety without pre-clustering.
preclustering_threshold (float, default=0.65) – Identity threshold for pre-clustering the V/J groups prior to lineage assignment.
preclustering_field (str, default='cdr3_nt') – Annotation field on which to pre-cluster sequences.
lineage_field (str, default='lineage') – Name of the lineage assignment field.
lineage_size_field (str, default='lineage_size') – Name of the lineage size field.
annotation_format (str, default='airr') – Format of the input sequence annotations. Choices are 'airr' or 'json'.
return_assignment_dict (bool, default=False) – If True, a dictionary linking sequence IDs to lineage names will be returned. If False, the input anndata.AnnData object will be returned, with lineage annotations included.
- Returns:
output – By default (return_assignment_dict == False), an updated adata object is returned with two additional columns populated: adata.obs.bcr_lineage, which contains the lineage assignment, and adata.obs.bcr_lineage_size, which contains the lineage size. If return_assignment_dict == True, a dict mapping droplet barcodes (adata.obs_names) to lineage names is returned.
- Return type:
anndata.AnnData or dict
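The additive behavior of group_by_v and group_by_j can be pictured as building a grouping key from whichever genes are enabled. The sketch below is a standalone illustration, not clonify's implementation; the record keys sequence_id, v_gene and j_gene are hypothetical stand-ins for the annotated sequence fields.

```python
from collections import defaultdict


def group_sequences(seqs, group_by_v=True, group_by_j=True):
    """Partition sequence IDs by V-gene and/or J-gene prior to lineage assignment."""
    groups = defaultdict(list)
    for seq in seqs:
        # Disabled components collapse to None, so they never split a group.
        key = (
            seq["v_gene"] if group_by_v else None,
            seq["j_gene"] if group_by_j else None,
        )
        groups[key].append(seq["sequence_id"])
    return dict(groups)


seqs = [
    {"sequence_id": "s1", "v_gene": "IGHV1-2", "j_gene": "IGHJ4"},
    {"sequence_id": "s2", "v_gene": "IGHV1-2", "j_gene": "IGHJ6"},
    {"sequence_id": "s3", "v_gene": "IGHV3-23", "j_gene": "IGHJ4"},
    {"sequence_id": "s4", "v_gene": "IGHV1-2", "j_gene": "IGHJ4"},
]

by_v_and_j = group_sequences(seqs)
by_v_only = group_sequences(seqs, group_by_j=False)
print(by_v_and_j)
print(by_v_only)
```

Only sequences that land in the same group are compared during lineage assignment, so stricter grouping reduces the number of pairwise distances that must be computed.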
- scab.tools.clonify.pairwise_distance(s1: abutils.Sequence, s2: abutils.Sequence, shared_mutation_bonus: float = 0.65, length_penalty_multiplier: int | float = 2, cdr3_field: str = 'cdr3', mutations_field: str = 'mutations') → float
Computes length and mutation adjusted Levenshtein distance for a pair of sequences.
- Parameters:
s1 (abutils.Sequence) – input sequence
s2 (abutils.Sequence) – input sequence
shared_mutation_bonus (float, optional) – The bonus for each shared mutation, by default 0.65
length_penalty_multiplier (Union[int, float], optional) – Used to compute the penalty for differences in CDR3 length. The length difference is multiplied by length_penalty_multiplier, by default 2
cdr3_field (str, optional) – Name of the field in s1 and s2 containing the CDR3 sequence, by default “cdr3”
mutations_field (str, optional) – Name of the field in s1 and s2 containing mutation information, by default “mutations”
- Returns:
distance
- Return type:
float
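A rough sketch of a length and mutation adjusted Levenshtein distance is shown below. The reference above specifies only the ingredients (edit distance, a length penalty scaled by length_penalty_multiplier, and a bonus per shared mutation); how they are combined here, including the small positive floor and the normalization by the shorter CDR3 length, is an assumption made for illustration and may differ from scab's implementation. Plain strings and lists replace abutils.Sequence objects.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def adjusted_distance(cdr3_a, cdr3_b, mutations_a, mutations_b,
                      shared_mutation_bonus=0.65, length_penalty_multiplier=2):
    """Edit distance plus length penalty minus shared-mutation bonus (assumed form).

    Both CDR3 strings must be non-empty.
    """
    edit = levenshtein(cdr3_a, cdr3_b)
    length_penalty = abs(len(cdr3_a) - len(cdr3_b)) * length_penalty_multiplier
    bonus = len(set(mutations_a) & set(mutations_b)) * shared_mutation_bonus
    # Assumption: floor the score at a small positive value, then normalize
    # by the shorter CDR3 so long and short CDR3s are comparable.
    score = max(edit + length_penalty - bonus, 0.001)
    return score / min(len(cdr3_a), len(cdr3_b))


d = adjusted_distance(
    "ARDYSSGWYFDY", "ARDYSSSWYFDV",
    mutations_a=["A23T", "G45C", "T88A"],
    mutations_b=["A23T", "T88A"],
)
print(round(d, 4))
```

Under these assumptions, two shared mutations offset most of the two CDR3 substitutions, so the pair would fall well below the default distance_cutoff of 0.32 used by clonify().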
embeddings
pca | Performs PCA, neighborhood graph construction and UMAP embedding.
umap | Performs PCA, neighborhood graph construction and UMAP embedding.
dimensionality_reduction | Deprecated, but retained for backwards compatibility.
- scab.tools.embeddings.pca(adata: anndata.AnnData, solver: str = 'arpack', n_pcs: int = 50, ignore_ig: bool = True, verbose: bool = True) → anndata.AnnData
Performs PCA, neighborhood graph construction and UMAP embedding. PAGA is optional, but is performed by default.
- Parameters:
adata (anndata.AnnData) – AnnData object containing gene counts data.
solver (str, default='arpack') – Solver to use for the PCA.
n_pcs (int, default=50) – Number of principal components to use when computing the neighbor graph. Although the default value is generally appropriate, it is sometimes useful to empirically determine the optimal value for n_pcs.
ignore_ig (bool, default=True) – Ignores immunoglobulin V, D and J genes when computing the PCA.
- Returns:
adata
- Return type:
anndata.AnnData
- scab.tools.embeddings.umap(adata: anndata.AnnData, solver: str = 'arpack', n_neighbors: int | None = None, n_pcs: int | None = None, force_pca: bool = False, ignore_ig: bool = True, paga: bool = True, batch_key: str | None = None, use_rna_velocity: bool = False, use_rep: str | None = None, random_state: int | float | str = 42, resolution: float = 1.0, verbose: bool = True) → anndata.AnnData
Performs PCA, neighborhood graph construction and UMAP embedding. PAGA is optional, but is performed by default.
- Parameters:
adata (anndata.AnnData) – AnnData object containing gene counts data.
solver (str, default='arpack') – Solver to use for the PCA.
n_neighbors (int, default=10) – Number of neighbors to calculate for the neighbor graph.
n_pcs (int, default=40) – Number of principal components to use when computing the neighbor graph. Although the default value is generally appropriate, it is sometimes useful to empirically determine the optimal value for n_pcs.
force_pca (bool, default=False) – Construct the PCA even if it has already been constructed ("X_pca" exists in adata.obsm). Default is False, which will use an existing PCA.
ignore_ig (bool, default=True) – Ignores immunoglobulin V, D and J genes when computing the PCA.
paga (bool, default=True) – If True, performs partition-based graph abstraction (PAGA) prior to UMAP embedding.
batch_key (str, optional) – If adata contains batch information, this is the key in adata.obs that contains the batch information. If provided, neighbors will be computed using batch-balanced KNN (scanpy.external.pp.bbknn) rather than scanpy.pp.neighbors.
use_rna_velocity (bool, default=False) – If True, uses RNA velocity information to compute PAGA. If False, this option is ignored.
use_rep (str, optional) – Representation to use when computing neighbors. For example, if data have been batch normalized with scanorama, the representation should be 'X_Scanorama'. If not provided, scanpy’s default representation is used.
random_state (int, optional) – Seed for the random state used by sc.tl.umap.
resolution (float, default=1.0) – Resolution for Leiden clustering.
- Returns:
adata
- Return type:
anndata.AnnData
- scab.tools.embeddings.dimensionality_reduction(**kwargs) → anndata.AnnData
Deprecated, but retained for backwards compatibility. Use scab.tl.umap instead.
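A deprecated alias of this kind is commonly kept as a thin wrapper that forwards its keyword arguments to the replacement. The sketch below shows one such pattern using the standard warnings module; whether scab emits a DeprecationWarning is not stated in this reference, and the umap function here is a trivial stand-in rather than scab.tl.umap.

```python
import warnings


def umap(adata=None, **kwargs):
    """Stand-in for scab.tl.umap(); returns its input unchanged (illustration only)."""
    return adata


def dimensionality_reduction(**kwargs):
    """Backwards-compatible alias that forwards all keyword arguments to umap()."""
    warnings.warn(
        "dimensionality_reduction() is deprecated; use umap() instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this wrapper
    )
    return umap(**kwargs)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = dimensionality_reduction(adata="placeholder", n_pcs=30)

print(result, [str(w.category.__name__) for w in caught])
```

Existing code that calls the old name keeps working, while new code should call scab.tl.umap directly.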
specificity
classify_specificity | Classifies BCR specificity using antigen barcodes (AgBCs).
- scab.tools.specificity.classify_specificity(adata: anndata.AnnData, raw: anndata.AnnData | str, agbcs: Iterable | None = None, groups: dict | None = None, rename: dict | None = None, percentile: float = 0.997, percentile_dict: dict | None = None, threshold_dict: dict | None = None, agbc_regex: str = 'agbc', update: bool = True, uns_batch: str | None = None, verbose: bool = True) → anndata.AnnData | pandas.DataFrame
Classifies BCR specificity using antigen barcodes (AgBCs). Thresholds are computed by analyzing background AgBC UMI counts in empty droplets.
Caution
In order to set accurate thresholds, we must remove all cell-containing droplets from the raw counts matrix. Because adata comprises only cell-containing droplets, we simply remove all of the droplet barcodes in adata from raw. Thus, it is very important that adata and raw are well matched.
For example, if processing a single Chromium reaction containing several multiplexed samples, adata should contain all of the multiplexed samples, since the raw matrix produced by CellRanger will also include all droplets in the reaction. If adata is missing one or more samples, cell-containing droplets cannot be accurately removed from raw, and classification accuracy will be adversely affected.
- Parameters:
adata (anndata.AnnData) – Input AnnData object. Log2-normalized AgBC UMI counts should be found in adata.obs. If data was read using scab.read_10x_mtx(), the resulting AnnData object will already be correctly formatted.
raw (anndata.AnnData or str) – Raw matrix data. Either a path to a directory containing the raw .mtx file produced by CellRanger, or an anndata.AnnData object containing the raw matrix data. As with adata, log2-normalized AgBC UMIs should be found at raw.obs.
Tip
If reading the raw counts matrix with scab.read_10x_mtx(), it can be helpful to include ignore_zero_quantile_agbcs=False. In some cases with very little AgBC background, AgBCs can be incorrectly removed from the raw counts matrix.
agbcs (iterable object, optional) – A list of AgBCs to be classified. Either agbcs or groups is required. If both are provided, both will be used.
groups (dict, optional) – A dict mapping specificity names to a list of one or more AgBCs. This is particularly useful when multiple AgBCs correspond to the same antigen (either because dual-labeled AgBCs were used, or because several AgBCs are closely related molecules that would be expected to compete for BCR binding). Either agbcs or groups is required. If both are provided, both will be used.
rename (dict, optional) – A dict mapping AgBC or group names to a new name. Keys should be present in either agbcs or groups.keys(). If only a subset of AgBCs or groups are provided in rename, then only those AgBCs or groups will be renamed.
percentile (float, default=0.997) – Percentile used to compute the AgBC classification threshold using raw data. Default is 0.997, which corresponds to three standard deviations.
percentile_dict (dict, optional) – A dict mapping AgBC or group names to the desired percentile. If only a subset of AgBCs or groups are provided in percentile_dict, all others will use percentile.
update (bool, default=True) – If True, update adata with grouped UMI counts and classifications. If False, a Pandas DataFrame containing classifications will be returned and adata will not be modified.
uns_batch (str, default=None) – If provided, uns_batch will add batch information to the percentile and threshold data stored in adata.uns. This results in an additional layer of nesting, which allows concatenating multiple AnnData objects representing different batches for which classification is performed separately. If not provided, the data stored in uns would be formatted like:
adata.uns['agbc_percentiles'] = {agbc1: percentile1, ...}
adata.uns['agbc_thresholds'] = {agbc1: threshold1, ...}
If uns_batch is provided, uns will be formatted like:
adata.uns['agbc_percentiles'] = {uns_batch: {agbc1: percentile1, ...}}
adata.uns['agbc_thresholds'] = {uns_batch: {agbc1: threshold1, ...}}
verbose (bool, default=True) – If True, calculated threshold values are printed.
- Returns:
output – If update is True, an updated adata object containing specificity classifications is returned. Otherwise, a Pandas DataFrame containing specificity classifications is returned.
- Return type:
anndata.AnnData or pandas.DataFrame
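The thresholding idea described above (drop cell-containing barcodes from the raw matrix, then take a high percentile of the remaining background counts) can be sketched with the standard library alone. The helper names, the nearest-rank quantile rule, and the toy counts are assumptions for illustration; scab's own quantile computation may differ. The first function also checks the statement that 0.997 corresponds to three standard deviations of a normal distribution.

```python
import math


def three_sigma_fraction():
    """Fraction of a normal distribution lying within three standard deviations."""
    return math.erf(3 / math.sqrt(2))


def background_threshold(raw_counts, cell_barcodes, percentile=0.997):
    """Percentile of AgBC counts among empty droplets (nearest-rank method).

    `raw_counts` maps droplet barcode -> log2-normalized AgBC UMI count and
    `cell_barcodes` is the set of barcodes present in adata (hypothetical inputs).
    """
    # Keep only droplets that are not cell-containing.
    empty = sorted(v for bc, v in raw_counts.items() if bc not in cell_barcodes)
    if not empty:
        raise ValueError("no empty droplets left after removing cell barcodes")
    # round() guards against floating-point noise before taking the ceiling.
    rank = max(1, math.ceil(round(percentile * len(empty), 9)))
    return empty[rank - 1]


# Ten empty droplets with background counts 1..10 and two cell-containing droplets.
raw_counts = {f"empty{i}": float(i) for i in range(1, 11)}
raw_counts.update({"cell1": 100.0, "cell2": 120.0})
cell_barcodes = {"cell1", "cell2"}

print(round(three_sigma_fraction(), 4))
print(background_threshold(raw_counts, cell_barcodes, percentile=0.9))
print(background_threshold(raw_counts, set(), percentile=0.9))
```

The last call shows why adata and raw need to be well matched: when cell barcodes are not removed, the high counts from cell-containing droplets inflate the background threshold.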