read and write: io

Functions:

read_10x_mtx

Reads 10x Genomics data into an integrated AnnData object.

read

Reads a serialized AnnData object.

load

Loads a serialized AnnData object.

write

Serializes and writes an AnnData object to disk in h5ad format.

save

Serializes and saves an AnnData object to disk in h5ad format.

concat

Concatenates AnnData objects using anndata.concat().

scab.io.read_10x_mtx(mtx_path: str, *, bcr_file: str | None = None, bcr_annot: str | None = None, bcr_format: Literal['fasta', 'delimited', 'json'] = 'fasta', bcr_delimiter: str = '\t', bcr_id_key: str = 'sequence_id', bcr_sequence_key: str = 'sequence', bcr_id_delimiter: str = '_', bcr_id_delimiter_num: int = 1, tcr_file: str | None = None, tcr_annot: str | None = None, tcr_format: Literal['fasta', 'delimited', 'json'] = 'fasta', tcr_delimiter: str = '\t', tcr_id_key: str = 'sequence_id', tcr_sequence_key: str = 'sequence', tcr_id_delimiter: str = '_', tcr_id_delimiter_num: int = 1, chain_selection_func: Callable | None = None, abstar_output_format: Literal['airr', 'json'] = 'airr', abstar_germ_db: str = 'human', gex_only: bool = False, hashes: Iterable | None = None, cellhash_regex: str = 'cell ?hash', ignore_cellhash_case: bool = True, agbcs: Iterable | None = None, agbc_regex: str = 'agbc', ignore_agbc_case: bool = True, log_transform_cellhashes: bool = True, ignore_zero_quantile_cellhashes: bool = True, rename_cellhashes: Dict[str, str] | None = None, log_transform_agbcs: bool = True, ignore_zero_quantile_agbcs: bool = True, rename_agbcs: Dict[str, str] | None = None, log_transform_features: bool = True, ignore_zero_quantile_features: bool = True, rename_features: Dict[str, str] | None = None, feature_suffix: str = '_FBC', cellhash_quantile: float | int = 0.95, agbc_quantile: float | int = 0.95, feature_quantile: float | int = 0.95, cache: bool = True, verbose: bool = True) -> anndata.AnnData

Reads 10x Genomics data into an integrated AnnData object.

Datasets can include gene expression (GEX), cell hashes, antigen barcodes (AgBCs), feature barcodes, and assembled BCR or TCR contig sequences.

Parameters:
  • mtx_path (str) – Path to a CellRanger counts matrix folder, typically either 'sample_feature_bc_matrix' or 'raw_feature_bc_matrix'.

  • [bcr|tcr]_file (str, optional) –

    Path to a file containing BCR/TCR data. The file can be in one of several formats:

    • FASTA-formatted file, as output by CellRanger

    • delimited text file, containing annotated BCR/TCR sequences

    • JSON-formatted file, containing annotated BCR/TCR sequences

  • [bcr|tcr]_annot (str, optional) – Path to the CSV-formatted BCR/TCR annotations file produced by CellRanger. Matching the annotation file to [bcr|tcr]_file is preferred – if 'all_contig.fasta' is the supplied [bcr|tcr]_file, then 'all_contig_annotations.csv' is the appropriate annotation file.

  • [bcr|tcr]_format (str, default='fasta') – Format of the input [bcr|tcr]_file. Options are: 'fasta', 'delimited', and 'json'. If [bcr|tcr]_format is 'fasta', abstar will be run on the input data to obtain annotated BCR/TCR data. By default, abstar will produce AIRR-formatted (tab-delimited) annotations.

  • [bcr|tcr]_delimiter (str, default='\t') – Delimiter used in [bcr|tcr]_file. Only used if [bcr|tcr]_format is 'delimited'. Default is '\t' (tab), which conforms to AIRR-C data standards.

  • [bcr|tcr]_id_key (str, default='sequence_id') – Name of the column or field in [bcr|tcr]_file that corresponds to the sequence ID.

  • [bcr|tcr]_sequence_key (str, default='sequence') – Name of the column or field in [bcr|tcr]_file that corresponds to the VDJ sequence.

  • [bcr|tcr]_id_delimiter (str, default='_') – The delimiter used to separate the droplet and contig components of the sequence ID. For example, default CellRanger names are formatted as: 'AAACCTGAGAACTGTA-1_contig_1', where 'AAACCTGAGAACTGTA-1' is the droplet identifier and 'contig_1' is the contig identifier.

  • [bcr|tcr]_id_delimiter_num (int, default=1) – The occurrence (1-based numbering) of the [bcr|tcr]_id_delimiter that separates the droplet and contig components of the sequence ID.

  • abstar_output_format (str, default='airr') – Format for abstar annotations. Only used if [bcr|tcr]_format is 'fasta'. Options are 'airr', 'json' and 'tabular'.

  • abstar_germ_db (str, default='human') – Germline database to be used for annotation of BCR/TCR data. Built-in abstar options include: 'human', 'macaque', 'mouse' and 'humouse'. Only used if one or both of [bcr|tcr]_format is 'fasta'.

  • gex_only (bool, default=False) – If True, return only gene expression data and ignore features and hashes. Note that VDJ data will still be included in the returned AnnData object if [bcr|tcr]_file is provided.

  • cellhash_regex (str, default='cell ?hash') – A regular expression (regex) string used to identify cell hashes. The regex must be found in all hash names. The default, combined with the default setting for ignore_cellhash_case, will match 'cellhash' or 'cell hash' in any combination of upper and lower case letters.

  • ignore_cellhash_case (bool, default=True) – If True, searching for cellhash_regex will ignore case.

  • agbc_regex (str, default='agbc') – A regular expression (regex) string used to identify AgBCs. The regex must be found in all AgBC names. The default, combined with the default setting for ignore_agbc_case, will match 'agbc' in any combination of upper and lower case letters.

  • ignore_agbc_case (bool, default=True) – If True, searching for agbc_regex will ignore case.

  • log_transform_cellhashes (bool, default=True) – If True, cell hash UMI counts will be log2-plus-1 transformed.

  • log_transform_agbcs (bool, default=True) – If True, AgBC UMI counts will be log2-plus-1 transformed.

  • log_transform_features (bool, default=True) – If True, feature UMI counts will be log2-plus-1 transformed.

  • ignore_zero_quantile_cellhashes (bool, default=True) – If True, any cell hash whose count at the cellhash_quantile quantile is zero is ignored. With the default cellhash_quantile of 0.95, cell hashes with a count of zero at the 95th percentile are ignored.

  • ignore_zero_quantile_agbcs (bool, default=True) – If True, any AgBC whose count at the agbc_quantile quantile is zero is ignored. With the default agbc_quantile of 0.95, AgBCs with a count of zero at the 95th percentile are ignored.

  • ignore_zero_quantile_features (bool, default=True) – If True, any feature whose count at the feature_quantile quantile is zero is ignored. With the default feature_quantile of 0.95, features with a count of zero at the 95th percentile are ignored.

  • rename_cellhashes (dict, optional) – A dictionary with keys and values corresponding to the existing and new cellhash names, respectively. For example, {'CellHash1': 'donor123'} would result in the renaming of 'CellHash1' to 'donor123'. Cellhashes not found in the rename_cellhashes dictionary will not be renamed.

  • rename_agbcs (dict, optional) – A dictionary with keys and values corresponding to the existing and new AgBC names, respectively. For example, {'AgBC1': 'Influenza H1'} would result in the renaming of 'AgBC1' to 'Influenza H1'. AgBCs not found in the rename_agbcs dictionary will not be renamed.

  • rename_features (dict, optional) – A dictionary with keys and values corresponding to the existing and new feature names, respectively. For example, {'FeatureBC1': 'CD19'} would result in the renaming of 'FeatureBC1' to 'CD19'. Features not found in the rename_features dictionary will not be renamed.

  • feature_suffix (str, default='_FBC') – Suffix to add to the end of each feature name. Useful because feature names may overlap with gene names. The default value will result in the feature 'CD19' being renamed to 'CD19_FBC'. The suffix is added after feature renaming. To skip the addition of a feature suffix, simply supply an empty string ('') as the argument.

  • cellhash_quantile (float, default=0.95) – Percentile for which cellhashes with zero counts will be ignored if ignore_zero_quantile_cellhashes is True. Default is 0.95, which is equivalent to the 95th percentile.

  • agbc_quantile (float, default=0.95) – Percentile for which AgBCs with zero counts will be ignored if ignore_zero_quantile_agbcs is True. Default is 0.95, which is equivalent to the 95th percentile.

  • feature_quantile (float, default=0.95) – Percentile for which features with zero counts will be ignored if ignore_zero_quantile_features is True. Default is 0.95, which is equivalent to the 95th percentile.

  • verbose (bool, default=True) – Print progress updates.
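The sequence ID splitting described for [bcr|tcr]_id_delimiter and [bcr|tcr]_id_delimiter_num can be sketched in plain Python. This is only an illustration of the documented behavior, not scab's implementation; split_sequence_id is a hypothetical helper name.

```python
from typing import Tuple


def split_sequence_id(
    seq_id: str, delimiter: str = "_", delimiter_num: int = 1
) -> Tuple[str, str]:
    """Split seq_id at the Nth (1-based) occurrence of delimiter.

    Hypothetical helper: returns (droplet_id, contig_id).
    """
    parts = seq_id.split(delimiter)
    if delimiter_num < 1 or delimiter_num >= len(parts):
        raise ValueError(
            f"{seq_id!r} has no occurrence number {delimiter_num} of {delimiter!r}"
        )
    # Everything before the chosen occurrence is the droplet identifier,
    # everything after it is the contig identifier.
    droplet_id = delimiter.join(parts[:delimiter_num])
    contig_id = delimiter.join(parts[delimiter_num:])
    return droplet_id, contig_id


# Default CellRanger-style name: split at the first underscore.
droplet, contig = split_sequence_id("AAACCTGAGAACTGTA-1_contig_1")
print(droplet, contig)
```

If the droplet identifier itself contains the delimiter (for example, a sample prefix joined with an underscore), a larger delimiter_num moves the split point to a later occurrence.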

Returns:

adata – An AnnData object containing gene expression data, with VDJ information located at adata.obs.bcr and/or adata.obs.tcr, and cellhash and feature barcode data found in adata.obs. If gex_only is True, cellhash and feature barcode data are not returned.

Return type:

anndata.AnnData
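The barcode-related parameters above (regex matching, zero-quantile filtering, log2-plus-1 transformation, renaming, and suffixing) can be illustrated with a small standard-library sketch. It is written under stated assumptions and is not scab's actual code: process_barcodes and quantile are hypothetical helpers, the counts are toy values, the quantile uses linear interpolation, and the '^CD' pattern is used only to pick the feature out of this toy data (read_10x_mtx has no such argument).

```python
import math
import re
from typing import Dict, List, Optional


def quantile(values: List[float], q: float) -> float:
    """Linearly interpolated quantile (q in [0, 1]) of a non-empty list."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def process_barcodes(
    counts: Dict[str, List[int]],
    regex: str,
    ignore_case: bool = True,
    q: float = 0.95,
    ignore_zero_quantile: bool = True,
    log_transform: bool = True,
    rename: Optional[Dict[str, str]] = None,
    suffix: str = "",
) -> Dict[str, List[float]]:
    """Hypothetical helper mirroring the documented barcode options."""
    flags = re.IGNORECASE if ignore_case else 0
    rename = rename or {}
    processed: Dict[str, List[float]] = {}
    for name, umis in counts.items():
        # Keep only barcodes whose name contains the regex.
        if re.search(regex, name, flags=flags) is None:
            continue
        # Drop barcodes whose count at the requested quantile is zero.
        if ignore_zero_quantile and quantile(umis, q) == 0:
            continue
        # log2-plus-1 transform of the UMI counts.
        values = [math.log2(u + 1) for u in umis] if log_transform else [float(u) for u in umis]
        # Rename first, then append the suffix.
        processed[rename.get(name, name) + suffix] = values
    return processed


# Toy UMI counts per droplet (made up for illustration).
counts = {
    "CellHash1": [0, 12, 250, 3, 0, 310],
    "cell hash 2": [0, 0, 0, 0, 0, 0],
    "CD19": [1, 0, 7, 0, 3, 0],
}

hashes = process_barcodes(counts, regex="cell ?hash", rename={"CellHash1": "donor123"})
features = process_barcodes(counts, regex="^CD", suffix="_FBC")
print(sorted(hashes), sorted(features))
```

In this sketch 'cell hash 2' is dropped because its count at the 0.95 quantile is zero, 'CellHash1' is kept and renamed, and 'CD19' receives the '_FBC' suffix after any renaming.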

scab.io.read(h5ad_file: str | Path) -> anndata.AnnData

Reads a serialized AnnData object.

Similar to scanpy.read(), except that scanpy does not support serialized BCR/TCR data. If BCR/TCR data is included in the serialized AnnData file, it will be separately deserialized into the original abutils.Pair objects.

Parameters:

h5ad_file (str or Path) – Path to the serialized AnnData object. Must be an ".h5ad" file. Required.

Returns:

adata

Return type:

anndata.AnnData

scab.io.load(h5ad_file: str | Path) -> anndata.AnnData

Loads a serialized AnnData object.

Similar to scanpy.read(), except that scanpy does not support serialized BCR/TCR data. If BCR/TCR data is included in the serialized AnnData file, it will be separately deserialized into the original abutils.Pair objects.

Parameters:

h5ad_file (str or Path) – Path to the serialized AnnData object. Must be an ".h5ad" file. Required.

Returns:

adata

Return type:

anndata.AnnData

scab.io.write(adata: anndata.AnnData, h5ad_file: str | Path)

Serializes and writes an AnnData object to disk in h5ad format.

Similar to scanpy.write(), except that scanpy does not support serializing BCR/TCR data. This function serializes abutils.Pair objects stored in either adata.obs.bcr or adata.obs.tcr using pickle prior to writing the AnnData object to disk.

Parameters:
  • adata – An AnnData object containing gene expression, feature barcode and VDJ data. scab.read_10x_mtx() can be used to construct a multi-omics AnnData object from raw CellRanger outputs.

  • h5ad_file – Path to the output file. The output is written in h5ad format, so the file name should end with '.h5ad'. If the extension is missing, it is added automatically.
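The pickle step mentioned in the description can be illustrated with a round trip on toy data. This is a sketch under assumptions, not scab's code: plain dicts stand in for abutils.Pair objects, serialize_pairs and deserialize_pairs are hypothetical helpers, and how the resulting bytes are stored inside the h5ad file is not shown.

```python
import pickle
from typing import Dict, List

# Toy stand-ins for abutils.Pair objects (assumption: plain dicts, not the real class).
pairs: List[Dict[str, str]] = [
    {"name": "AAACCTGAGAACTGTA-1", "heavy": "contig_1", "light": "contig_2"},
    {"name": "AAACCTGAGAAGGCCT-1", "heavy": "contig_1", "light": ""},
]


def serialize_pairs(objects: List[Dict[str, str]]) -> List[bytes]:
    """Pickle each object to bytes so it can be stored alongside other data."""
    return [pickle.dumps(obj) for obj in objects]


def deserialize_pairs(blobs: List[bytes]) -> List[Dict[str, str]]:
    """Restore the original objects from their pickled bytes."""
    return [pickle.loads(blob) for blob in blobs]


blobs = serialize_pairs(pairs)
restored = deserialize_pairs(blobs)
print(restored == pairs)
```

As with any pickled data, files produced this way should only be loaded from trusted sources, because unpickling can execute arbitrary code.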

scab.io.save(adata: anndata.AnnData, h5ad_file: str | Path)

Serializes and saves an AnnData object to disk in h5ad format.

Similar to scanpy.write(), except that scanpy does not support serializing BCR/TCR data. This function serializes abutils.Pair objects stored in either adata.obs.bcr or adata.obs.tcr using pickle prior to writing the AnnData object to disk.

Parameters:
  • adata – An AnnData object containing gene expression, feature barcode and VDJ data. scab.read_10x_mtx() can be used to construct a multi-omics AnnData object from raw CellRanger outputs.

  • h5ad_file – Path to the output file. The output is written in h5ad format, so the file name should end with '.h5ad'. If the extension is missing, it is added automatically.
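The automatic extension handling described for h5ad_file might look like the following. This is a minimal sketch, assuming a case-sensitive check and that the extension is appended rather than substituted; ensure_h5ad_extension is a hypothetical helper, not part of scab.

```python
from pathlib import Path
from typing import Union


def ensure_h5ad_extension(h5ad_file: Union[str, Path]) -> Path:
    """Return the path with '.h5ad' appended if it is not already the suffix."""
    path = Path(h5ad_file)
    if path.suffix != ".h5ad":
        # Append rather than replace, so 'sample.v2' becomes 'sample.v2.h5ad'.
        path = path.with_name(path.name + ".h5ad")
    return path


print(ensure_h5ad_extension("results/pbmc"))
print(ensure_h5ad_extension("results/pbmc.h5ad"))
```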

scab.io.concat(adatas: Collection[anndata.AnnData] | Mapping[str, anndata.AnnData], *, axis: Literal[0, 1] = 0, join: Literal['inner', 'outer'] = 'inner', merge: Literal['same', 'unique', 'first', 'only'] | Callable | None = None, uns_merge: Literal['same', 'unique', 'first', 'only'] | Callable | None = 'unique', label: str | None = None, keys: Collection | None = None, index_unique: str | None = None, fill_value: Any | None = None, pairwise: bool = False, obs_names_make_unique: bool = True) -> anndata.AnnData

Concatenates AnnData objects using anndata.concat().

Documentation was copied almost verbatim from the anndata.concat() docstring.

The only major difference is that the default for uns_merge has been changed from None (which doesn’t merge any of the data in adata.uns) to 'unique', which only merges adata.uns elements for which there is only one possible value.

Parameters:
  • adatas – The objects to be concatenated. If a Mapping is passed, keys are used for the keys argument and values are concatenated.

  • axis – Which axis to concatenate along. 0 is row-wise, 1 is column-wise.

  • join – How to align values when concatenating. If "outer", the union of the other axis is taken. If "inner", the intersection is taken.

  • merge –

    How elements not aligned to the axis being concatenated along are selected. Currently implemented strategies include:

    • None: No elements are kept.

    • "same": Elements that are the same in each of the objects.

    • "unique": Elements for which there is only one possible value.

    • "first": The first element seen at each position.

    • "only": Elements that show up in only one of the objects.

  • uns_merge – How the elements of .uns are selected. Uses the same set of strategies as the merge argument, except applied recursively.

  • label – Column in axis annotation (i.e. .obs or .var) to place batch information in. If it’s None, no column is added.

  • keys – Names for each object being added. These values are used for column values for label or appended to the index if index_unique is not None. Defaults to incrementing integer labels.

  • index_unique – Whether to make the index unique by using the keys. If provided, this is the delimiter between "{orig_idx}{index_unique}{key}". When None, the original indices are kept.

  • fill_value – When join="outer", this is the value that will be used to fill the introduced indices. By default, sparse arrays are padded with zeros, while dense arrays and DataFrames are padded with missing values.

  • pairwise – Whether pairwise elements along the concatenated dimension should be included. This is False by default, since the resulting arrays are often not meaningful.

  • obs_names_make_unique – If True, will call obs_names_make_unique() on the concatenated AnnData object prior to returning. Default is True.
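To make the default uns_merge='unique' strategy concrete, the recursive rule ("keep elements for which there is only one possible value") can be sketched on plain nested dicts. This is an illustration only, not anndata's implementation: merge_unique is a hypothetical helper, it handles only plain Python values that support ==, and dropping empty nested results is an assumption.

```python
from typing import Any, Dict, List


def merge_unique(mappings: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Keep keys whose values agree across every mapping that defines them."""
    # Collect keys in first-seen order.
    keys: List[str] = []
    for mapping in mappings:
        for key in mapping:
            if key not in keys:
                keys.append(key)

    merged: Dict[str, Any] = {}
    for key in keys:
        values = [m[key] for m in mappings if key in m]
        if all(isinstance(v, dict) for v in values):
            # Apply the same rule recursively to nested dicts.
            nested = merge_unique(values)
            if nested:
                merged[key] = nested
        elif all(v == values[0] for v in values):
            # Only one possible value was observed, so it is kept.
            merged[key] = values[0]
    return merged


uns_a = {"a": 1, "b": 2, "c": {"c.a": 3, "c.b": 4}}
uns_b = {"a": 1, "b": 3, "c": {"c.b": 4}}
uns_c = {"a": 1, "b": 4, "c": {"c.a": 3, "c.b": 4, "c.c": 5}}
print(merge_unique([uns_a, uns_b, uns_c]))
```

Here 'b' is dropped because it takes three different values, while 'c.c' is kept even though it appears in only one object, since only one value was ever observed for it.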

Notes

Warning

If you use join='outer' this fills 0s for sparse data when variables are absent in a batch. Use this with care. Dense data is filled with NaN.

Examples

Preparing example objects

>>> import anndata as ad, pandas as pd, numpy as np
>>> from scipy import sparse
>>> a = ad.AnnData(
...     X=sparse.csr_matrix(np.array([[0, 1], [2, 3]])),
...     obs=pd.DataFrame({"group": ["a", "b"]}, index=["s1", "s2"]),
...     var=pd.DataFrame(index=["var1", "var2"]),
...     varm={"ones": np.ones((2, 5)), "rand": np.random.randn(2, 3), "zeros": np.zeros((2, 5))},
...     uns={"a": 1, "b": 2, "c": {"c.a": 3, "c.b": 4}},
... )
>>> b = ad.AnnData(
...     X=sparse.csr_matrix(np.array([[4, 5, 6], [7, 8, 9]])),
...     obs=pd.DataFrame({"group": ["b", "c"], "measure": [1.2, 4.3]}, index=["s3", "s4"]),
...     var=pd.DataFrame(index=["var1", "var2", "var3"]),
...     varm={"ones": np.ones((3, 5)), "rand": np.random.randn(3, 5)},
...     uns={"a": 1, "b": 3, "c": {"c.b": 4}},
... )
>>> c = ad.AnnData(
...     X=sparse.csr_matrix(np.array([[10, 11], [12, 13]])),
...     obs=pd.DataFrame({"group": ["a", "b"]}, index=["s1", "s2"]),
...     var=pd.DataFrame(index=["var3", "var4"]),
...     uns={"a": 1, "b": 4, "c": {"c.a": 3, "c.b": 4, "c.c": 5}},
... )

Concatenating along different axes

>>> ad.concat([a, b]).to_df()
    var1  var2
s1   0.0   1.0
s2   2.0   3.0
s3   4.0   5.0
s4   7.0   8.0
>>> ad.concat([a, c], axis=1).to_df()
    var1  var2  var3  var4
s1   0.0   1.0  10.0  11.0
s2   2.0   3.0  12.0  13.0

Inner and outer joins

>>> inner = ad.concat([a, b])  # Joining on intersection of variables
>>> inner
AnnData object with n_obs × n_vars = 4 × 2
    obs: 'group'
>>> (inner.obs_names, inner.var_names)
(Index(['s1', 's2', 's3', 's4'], dtype='object'),
Index(['var1', 'var2'], dtype='object'))
>>> outer = ad.concat([a, b], join="outer") # Joining on union of variables
>>> outer
AnnData object with n_obs × n_vars = 4 × 3
    obs: 'group', 'measure'
>>> outer.var_names
Index(['var1', 'var2', 'var3'], dtype='object')
>>> outer.to_df()  # Sparse arrays are padded with zeroes by default
    var1  var2  var3
s1   0.0   1.0   0.0
s2   2.0   3.0   0.0
s3   4.0   5.0   6.0
s4   7.0   8.0   9.0

Keeping track of source objects

>>> ad.concat({"a": a, "b": b}, label="batch").obs
   group batch
s1     a     a
s2     b     a
s3     b     b
s4     c     b
>>> ad.concat([a, b], label="batch", keys=["a", "b"]).obs  # Equivalent to previous
   group batch
s1     a     a
s2     b     a
s3     b     b
s4     c     b
>>> ad.concat({"a": a, "b": b}, index_unique="-").obs
     group
s1-a     a
s2-a     b
s3-b     b
s4-b     c

Combining values not aligned to axis of concatenation

>>> ad.concat([a, b], merge="same")
AnnData object with n_obs × n_vars = 4 × 2
    obs: 'group'
    varm: 'ones'
>>> ad.concat([a, b], merge="unique")
AnnData object with n_obs × n_vars = 4 × 2
    obs: 'group'
    varm: 'ones', 'zeros'
>>> ad.concat([a, b], merge="first")
AnnData object with n_obs × n_vars = 4 × 2
    obs: 'group'
    varm: 'ones', 'rand', 'zeros'
>>> ad.concat([a, b], merge="only")
AnnData object with n_obs × n_vars = 4 × 2
    obs: 'group'
    varm: 'zeros'

The same merge strategies can be used for elements in .uns

>>> dict(ad.concat([a, b, c], uns_merge="same").uns)
{'a': 1, 'c': {'c.b': 4}}
>>> dict(ad.concat([a, b, c], uns_merge="unique").uns)
{'a': 1, 'c': {'c.a': 3, 'c.b': 4, 'c.c': 5}}
>>> dict(ad.concat([a, b, c], uns_merge="only").uns)
{'c': {'c.c': 5}}
>>> dict(ad.concat([a, b, c], uns_merge="first").uns)
{'a': 1, 'b': 2, 'c': {'c.a': 3, 'c.b': 4, 'c.c': 5}}