preprocessing: pp

Functions:

filter_and_normalize

Performs quality filtering and normalization of 10x Genomics count data.

scrublet

Predicts doublets using scrublet [Wolock19].

doubletdetection

Predicts doublets using doubletdetection [Gayoso20].

remove_doublets

Removes doublets.

scab.pp.filter_and_normalize(adata: anndata.AnnData, make_var_names_unique: bool = True, min_genes: int = 200, min_cells: float | None = None, n_genes_by_counts: int = 2500, percent_mito: int | float = 10, percent_ig: int | float = 100, hvg_batch_key: str | None = None, ig_regex_pattern: str = 'IG[HKL][VDJ][1-9].+|TR[ABDG][VDJ][1-9]', regress_out_mt: bool = False, regress_out_ig: bool = False, target_sum: int | None = None, n_top_genes: int | None = None, normalization_flavor: str = 'cell_ranger', log: bool = True, scale_max_value: float | None = None, save_raw: bool = True, verbose: bool = True) → anndata.AnnData

Performs quality filtering and normalization of 10x Genomics count data.

Parameters:
  • adata (anndata.AnnData) – AnnData object containing gene count data.

  • make_var_names_unique (bool, default=True) – If True, adata.var_names_make_unique() will be called before filtering and normalization.

  • min_genes (int, default=200) – Minimum number of identified genes for a droplet to be considered a valid cell. Passed to sc.pp.filter_cells() as the min_genes parameter.

  • min_cells (int, optional) – Minimum number of cells in which a gene has been identified. Genes below this threshold will be filtered. If not provided, a dynamic threshold equal to 0.1% of the total number of cells in the dataset will be used.

  • n_genes_by_counts (int, default=2500) – Threshold for filtering cells based on the number of genes by counts.

  • percent_mito (int or float, default=10) – Threshold for filtering cells based on the percentage of mitochondrial genes.

  • hvg_batch_key (str, optional) – When processing an AnnData object containing multiple samples that may require integration and batch correction, hvg_batch_key will be passed to sc.pp.highly_variable_genes() to force separate identification of highly variable genes for each batch. If not provided, variable genes will be computed on the entire dataset.

  • ig_regex_pattern (str, default='IG[HKL][VDJ][1-9].+|TR[ABDG][VDJ][1-9]') – Regular expression pattern used to identify immunoglobulin genes. The default is designed to match all immunoglobulin germline gene segments (V, D and J). Constant region genes are not matched.

  • target_sum (int, optional) – Target read count for normalization, passed to sc.pp.normalize_total(). If not provided, the median count of all cells (pre-normalization) is used.

  • n_top_genes (int, optional) – The number of top highly variable genes to retain. If not provided, the default number of genes for the selected normalization flavor is used.

  • normalization_flavor (str, default='cell_ranger') – Options are 'cell_ranger', 'seurat' or 'seurat_v3'.

  • log (bool, default=True) – If True, counts will be log-plus-1 transformed.

  • scale_max_value (float, optional) – Value at which normalized count values will be clipped. Default is no clipping.

  • save_raw (bool, default=True) – If True, normalized and filtered data will be saved to adata.raw prior to scaling and regressing out mitochondrial/immunoglobulin genes.

  • verbose (bool, default=True) – If True, progress updates will be printed.
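
Two of the defaults described above can be checked without loading any data. The following is a minimal sketch using only the standard library, not scab's own code: the helper names dynamic_min_cells and is_ig_gene are hypothetical, and anchoring the pattern at the start of the gene name with re.match is an assumption about how the pattern is applied.

```python
import re

# Default value of ig_regex_pattern, copied from the signature above.
IG_PATTERN = re.compile(r"IG[HKL][VDJ][1-9].+|TR[ABDG][VDJ][1-9]")


def dynamic_min_cells(n_cells: int) -> float:
    """Documented fallback for min_cells: 0.1% of the cells in the dataset."""
    return n_cells / 1000


def is_ig_gene(gene_name: str) -> bool:
    """Hypothetical helper: True if the default pattern matches the gene name.

    Assumption: the pattern is anchored at the start of the name (re.match).
    """
    return IG_PATTERN.match(gene_name) is not None


genes = ["IGHV1-69", "IGKV3-20", "IGHD3-10", "TRBV7-2",
         "IGHM", "IGHG1", "IGKC", "TRAC", "CD19"]
flagged = [g for g in genes if is_ig_gene(g)]

print(flagged)                  # ['IGHV1-69', 'IGKV3-20', 'IGHD3-10', 'TRBV7-2']
print(dynamic_min_cells(8000))  # 8.0
```

Constant region genes such as IGHM, IGHG1, IGKC and TRAC are not flagged, consistent with the parameter description. The second alternative in the pattern (TR[ABDG]...) also matches T cell receptor gene segments such as TRBV7-2.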

scab.pp.scrublet(adata: anndata.AnnData, verbose: bool = True) → anndata.AnnData

Predicts doublets using scrublet [Wolock19].

See also

Samuel L. Wolock, Romain Lopez, Allon M. Klein
Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data

Parameters:
  • adata (anndata.AnnData) – AnnData object containing gene count data.

  • verbose (bool, default=True) – If True, progress updates will be printed.

Return type:

Returns an updated adata object with doublet predictions found at adata.obs.is_doublet and doublet scores at adata.obs.doublet_score.

scab.pp.doubletdetection(adata: anndata.AnnData, n_iters: int = 25, use_phenograph: bool = False, standard_scaling: bool = True, p_thresh: float = 1e-16, voter_thresh: float = 0.5, verbose: bool = False) → anndata.AnnData

Predicts doublets using doubletdetection [Gayoso20].

See also

Adam Gayoso, Jonathan Shor, Ambrose J Carr, Roshan Sharma, Dana Pe’er
DoubletDetection (Version v3.0)

Parameters:
  • adata (anndata.AnnData) – AnnData object containing gene count data.

  • n_iters (int, default=25) – Iterations of doubletdetection to perform.

  • use_phenograph (bool, default=False) – Passed directly to doubletdetection.BoostClassifier().

  • standard_scaling (bool, default=True) – Passed directly to doubletdetection.BoostClassifier().

  • p_thresh (float, default=1e-16) – P-value threshold for doublet classification.

  • voter_thresh (float, default=0.5) – Voter threshold, as a fraction of all voters.

  • verbose (bool, default=False) – If True, progress updates will be printed.

Return type:

Returns an updated adata object with doublet predictions found at adata.obs.is_doublet and doublet scores at adata.obs.doublet_score. Note that adata.obs.is_doublet values are 0.0 and 1.0, not True and False. This is the default output of doubletdetection and is useful for plotting doublets using scanpy.pl.umap(), which cannot handle boolean color values.
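
The interaction between n_iters, p_thresh and voter_thresh can be illustrated with a small voting sketch. This is a simplified reading of the parameter descriptions above, not the doubletdetection implementation: the function call_doublets, the toy p-values, and the comparison operators (<= for p_thresh, >= for voter_thresh) are all assumptions. The sketch returns 0.0/1.0 floats to mirror the output format noted above.

```python
from typing import List, Sequence


def call_doublets(
    p_values: Sequence[Sequence[float]],
    p_thresh: float = 1e-16,
    voter_thresh: float = 0.5,
) -> List[float]:
    """Hypothetical voting sketch; p_values[i][j] is cell j's p-value in iteration i."""
    n_iters = len(p_values)
    n_cells = len(p_values[0])
    labels = []
    for j in range(n_cells):
        # An iteration "votes" for a doublet when its p-value passes p_thresh.
        votes = sum(1 for i in range(n_iters) if p_values[i][j] <= p_thresh)
        # The cell is called a doublet when enough iterations agree.
        labels.append(1.0 if votes / n_iters >= voter_thresh else 0.0)
    return labels


# Four iterations (rows) over four cells (columns); values are made up.
toy_p_values = [
    [1e-20, 0.3, 1e-18, 0.9],
    [1e-25, 0.7, 0.2, 0.8],
    [1e-17, 0.5, 1e-40, 1e-30],
    [1e-19, 0.6, 1e-22, 0.4],
]
labels = call_doublets(toy_p_values)
print(labels)  # [1.0, 0.0, 1.0, 0.0]
```

In this toy input the four cells receive 4, 0, 3 and 1 votes out of 4, so only the first and third reach the 0.5 voter threshold.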

scab.pp.remove_doublets(adata: anndata.AnnData, doublet_identification_method: str | None = None, verbose: bool = True) → anndata.AnnData

Removes doublets. If not already performed, doublet identification is performed using either doubletdetection or scrublet.

Parameters:
  • adata (anndata.AnnData) – AnnData object containing gene count data.

  • doublet_identification_method (str, default='doubletdetection') – Method for identifying doublets. Only used if adata.obs.is_doublet does not already exist. Options are 'doubletdetection' and 'scrublet'.

  • verbose (bool, default=True) – If True, progress updates will be printed.

Return type:

An updated adata object that does not contain observations that were identified as doublets.
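
The behaviour described above (identify doublets only if no is_doublet annotation exists, then drop the flagged observations) can be modelled with plain Python records. This is a toy sketch, not scab's implementation: remove_doublets_sketch, the identify callback, and the list-of-dicts stand-in for adata.obs are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Obs = List[Dict[str, object]]


def remove_doublets_sketch(
    obs: Obs,
    identify: Optional[Callable[[Obs], Obs]] = None,
) -> Obs:
    """Toy model: obs is a list of per-cell records, not an AnnData object."""
    if any("is_doublet" not in row for row in obs):
        # Doublets have not been identified yet, so run the supplied method first.
        if identify is None:
            raise ValueError("no is_doublet annotation and no identification method given")
        obs = identify(obs)
    # 1.0 and True are both truthy, so float and boolean flags are handled alike.
    return [row for row in obs if not row["is_doublet"]]


cells = [
    {"barcode": "AAAC", "is_doublet": 0.0},
    {"barcode": "AAAG", "is_doublet": 1.0},
    {"barcode": "AAAT", "is_doublet": False},
    {"barcode": "AACA", "is_doublet": True},
]
kept = remove_doublets_sketch(cells)
print([row["barcode"] for row in kept])  # ['AAAC', 'AAAT']
```

Filtering on truthiness rather than comparing against True keeps the sketch agnostic to whether the flags are the 0.0/1.0 floats produced by doubletdetection or booleans.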