preprocessing: pp
Functions:
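| scab.pp.filter_and_normalize | Performs quality filtering and normalization of 10x Genomics count data. |
| scab.pp.scrublet | Predicts doublets using scrublet [Wolock19]. |
| scab.pp.doubletdetection | Predicts doublets using doubletdetection [Gayoso20]. |
| scab.pp.remove_doublets | Removes doublets. |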
- scab.pp.filter_and_normalize(adata: anndata.AnnData, make_var_names_unique: bool = True, min_genes: int = 200, min_cells: float | None = None, n_genes_by_counts: int = 2500, percent_mito: int | float = 10, percent_ig: int | float = 100, hvg_batch_key: str | None = None, ig_regex_pattern: str = 'IG[HKL][VDJ][1-9].+|TR[ABDG][VDJ][1-9]', regress_out_mt: bool = False, regress_out_ig: bool = False, target_sum: int | None = None, n_top_genes: int | None = None, normalization_flavor: str = 'cell_ranger', log: bool = True, scale_max_value: float | None = None, save_raw: bool = True, verbose: bool = True) → anndata.AnnData
Performs quality filtering and normalization of 10x Genomics count data.
- Parameters:
adata (anndata.AnnData) – AnnData object containing gene count data.
make_var_names_unique (bool, default=True) – If True, adata.var_names_make_unique() will be called before filtering and normalization.
min_genes (int, default=200) – Minimum number of identified genes for a droplet to be considered a valid cell. Passed to sc.pp.filter_cells() as the min_genes parameter.
min_cells (int, optional) – Minimum number of cells in which a gene must be identified. Genes below this threshold will be filtered. If not provided, a dynamic threshold equal to 0.1% of the total number of cells in the dataset will be used.
n_genes_by_counts (int, default=2500) – Threshold for filtering cells based on the number of genes by counts.
percent_mito (int or float, default=10) – Threshold for filtering cells based on the percentage of mitochondrial genes.
hvg_batch_key (str, optional) – When processing an AnnData object containing multiple samples that may require integration and batch correction, hvg_batch_key will be passed to sc.pp.highly_variable_genes() to force separate identification of highly variable genes for each batch. If not provided, variable genes will be computed on the entire dataset.
ig_regex_pattern (str, default='IG[HKL][VDJ][1-9].+|TR[ABDG][VDJ][1-9]') – Regular expression pattern used to identify immunoglobulin genes. The default is designed to match all immunoglobulin germline gene segments (V, D and J). Constant region genes are not matched.
target_sum (int, optional) – Target read count for normalization, passed to sc.pp.normalize_total(). If not provided, the median count of all cells (pre-normalization) is used.
n_top_genes (int, optional) – The number of top highly variable genes to retain. If not provided, the default number of genes for the selected normalization flavor is used.
normalization_flavor (str, default='cell_ranger') – Options are 'cell_ranger', 'seurat' or 'seurat_v3'.
log (bool, default=True) – If True, counts will be log-plus-1 transformed.
scale_max_value (float, optional) – Value at which normalized count values will be clipped. Default is no clipping.
save_raw (bool, default=True) – If True, normalized and filtered data will be saved to adata.raw prior to scaling and regressing out mitochondrial/immunoglobulin genes.
verbose (bool, default=True) – If True, progress updates will be printed.
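To see which gene names the default ig_regex_pattern selects, the pattern can be tried against a few gene symbols. The following is a minimal sketch using only Python's re module. It assumes the pattern is applied at the start of each gene name (re.match), which may differ from how scab applies it internally, and the gene symbols are illustrative rather than taken from any dataset.

```python
import re

# Default value of ig_regex_pattern, copied from the signature above.
IG_REGEX_PATTERN = "IG[HKL][VDJ][1-9].+|TR[ABDG][VDJ][1-9]"

# Illustrative gene symbols (not taken from any particular dataset).
gene_names = [
    "IGHV1-69", "IGKV3-20", "IGLV2-14", "IGHD3-10", "TRBV7-2",
    "IGHM", "IGKC", "IGHD", "CD19", "MT-CO1",
]


def matches_ig_pattern(gene, pattern=IG_REGEX_PATTERN):
    """Return True if the pattern matches at the start of the gene name.

    Anchoring with re.match is an assumption made for this sketch.
    """
    return re.match(pattern, gene) is not None


matched = [g for g in gene_names if matches_ig_pattern(g)]
unmatched = [g for g in gene_names if not matches_ig_pattern(g)]

print("matched:", matched)
print("not matched:", unmatched)
```

In this sketch, constant region genes such as IGHM, IGKC and IGHD fall outside the pattern, consistent with the description of the default. The second alternative of the pattern also picks up gene names beginning with TR[ABDG], such as TRBV7-2.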
- scab.pp.scrublet(adata: anndata.AnnData, verbose: bool = True) → anndata.AnnData
Predicts doublets using scrublet [Wolock19].
See also
Samuel L. Wolock, Romain Lopez, Allon M. Klein. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Systems 2019. https://doi.org/10.1016/j.cels.2018.11.005
- Parameters:
adata (anndata.AnnData) – AnnData object containing gene count data.
verbose (bool, default=True) – If True, progress updates will be printed.
- Return type:
Returns an updated adata object with doublet predictions found at adata.obs.is_doublet and doublet scores at adata.obs.doublet_score.
- scab.pp.doubletdetection(adata: anndata.AnnData, n_iters: int = 25, use_phenograph: bool = False, standard_scaling: bool = True, p_thresh: float = 1e-16, voter_thresh: float = 0.5, verbose: bool = False) → anndata.AnnData
Predicts doublets using doubletdetection [Gayoso20].
See also
Adam Gayoso, Jonathan Shor, Ambrose J Carr, Roshan Sharma, Dana Pe’er. DoubletDetection (Version v3.0). Zenodo 2020. http://doi.org/10.5281/zenodo.2678041
- Parameters:
adata (anndata.AnnData) – AnnData object containing gene count data.
n_iters (int, default=25) – Iterations of doubletdetection to perform.
use_phenograph (bool, default=False) – Passed directly to doubletdetection.BoostClassifier().
standard_scaling (bool, default=True) – Passed directly to doubletdetection.BoostClassifier().
p_thresh (float, default=1e-16) – P-value threshold for doublet classification.
voter_thresh (float, default=0.5) – Voter threshold, as a fraction of all voters.
verbose (bool, default=False) – If True, progress updates will be printed.
- Return type:
Returns an updated adata object with doublet predictions found at adata.obs.is_doublet and doublet scores at adata.obs.doublet_score. Note that adata.obs.is_doublet values are 0.0 and 1.0, not True and False. This is the default output of doubletdetection and is useful for plotting doublets using scanpy.pl.umap(), which cannot handle boolean color values.
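Because is_doublet holds 0.0 and 1.0 rather than booleans, code that filters on it is easier to read when the comparison is explicit. The following is a minimal sketch in which plain Python lists stand in for adata.obs_names and adata.obs.is_doublet; the barcodes and labels are made up for illustration.

```python
# Stand-ins for adata.obs_names and adata.obs.is_doublet (illustrative values).
barcodes = ["AAAC-1", "AAAG-1", "AACT-1", "AAGT-1"]
is_doublet = [0.0, 1.0, 0.0, 1.0]

# Convert the float labels into an explicit boolean mask.
doublet_mask = [label == 1.0 for label in is_doublet]

# Keep only the barcodes that were not called as doublets.
singlets = [bc for bc, is_dbl in zip(barcodes, doublet_mask) if not is_dbl]
n_doublets = sum(doublet_mask)

print(f"{n_doublets} doublets flagged, {len(singlets)} singlets kept: {singlets}")
```

On an actual AnnData object, the same comparison would presumably be written against the adata.obs.is_doublet column directly.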
- scab.pp.remove_doublets(adata: anndata.AnnData, doublet_identification_method: str | None = None, verbose: bool = True) → anndata.AnnData
Removes doublets. If not already performed, doublet identification is performed using either doubletdetection or scrublet.
- Parameters:
adata (anndata.AnnData) – AnnData object containing gene count data.
doublet_identification_method (str, default='doubletdetection') – Method for identifying doublets. Only used if adata.obs.is_doublet does not already exist. Options are 'doubletdetection' and 'scrublet'.
verbose (bool, default=True) – If True, progress updates will be printed.
- Return type:
An updated adata object that does not contain observations that were identified as doublets.
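The functions on this page can be chained into a short preprocessing routine. The sketch below is one possible arrangement, not a recommended pipeline: the helper name preprocess is hypothetical, running doublet removal before normalization is an assumption (both doublet functions are documented as taking gene count data), and the keyword values simply repeat the documented defaults. scab is imported inside the function so that the definition itself loads without scab installed.

```python
def preprocess(adata, doublet_method="doubletdetection"):
    """Hypothetical helper chaining the scab.pp functions documented above."""
    import scab  # lazy import: only needed when the helper is actually called

    # Remove doublets first (ordering is an assumption of this sketch).
    # Identification only runs if adata.obs.is_doublet does not already exist.
    adata = scab.pp.remove_doublets(
        adata,
        doublet_identification_method=doublet_method,
    )

    # Quality filtering and normalization, with documented defaults spelled out.
    adata = scab.pp.filter_and_normalize(
        adata,
        min_genes=200,
        n_genes_by_counts=2500,
        percent_mito=10,
        normalization_flavor="cell_ranger",
    )
    return adata
```

Calling preprocess(adata) on an AnnData object of raw 10x Genomics counts would return the filtered and normalized object; passing doublet_method="scrublet" switches the identification method.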