read and write: io
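Functions:
read_10x_mtx – Reads 10x Genomics data into an integrated AnnData object.
read – Reads a serialized AnnData object.
load – Loads a serialized AnnData object.
write – Serializes and writes an AnnData object to disk in h5ad format.
save – Serializes and saves an AnnData object to disk in h5ad format.
concat – Concatenates AnnData objects using anndata.concat().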
- scab.io.read_10x_mtx(mtx_path: str, *, bcr_file: str | None = None, bcr_annot: str | None = None, bcr_format: Literal['fasta', 'delimited', 'json'] = 'fasta', bcr_delimiter: str = '\t', bcr_id_key: str = 'sequence_id', bcr_sequence_key: str = 'sequence', bcr_id_delimiter: str = '_', bcr_id_delimiter_num: int = 1, tcr_file: str | None = None, tcr_annot: str | None = None, tcr_format: Literal['fasta', 'delimited', 'json'] = 'fasta', tcr_delimiter: str = '\t', tcr_id_key: str = 'sequence_id', tcr_sequence_key: str = 'sequence', tcr_id_delimiter: str = '_', tcr_id_delimiter_num: int = 1, chain_selection_func: Callable | None = None, abstar_output_format: Literal['airr', 'json'] = 'airr', abstar_germ_db: str = 'human', gex_only: bool = False, hashes: Iterable | None = None, cellhash_regex: str = 'cell ?hash', ignore_cellhash_case: bool = True, agbcs: Iterable | None = None, agbc_regex: str = 'agbc', ignore_agbc_case: bool = True, log_transform_cellhashes: bool = True, ignore_zero_quantile_cellhashes: bool = True, rename_cellhashes: Dict[str, str] | None = None, log_transform_agbcs: bool = True, ignore_zero_quantile_agbcs: bool = True, rename_agbcs: Dict[str, str] | None = None, log_transform_features: bool = True, ignore_zero_quantile_features: bool = True, rename_features: Dict[str, str] | None = None, feature_suffix: str = '_FBC', cellhash_quantile: float | int = 0.95, agbc_quantile: float | int = 0.95, feature_quantile: float | int = 0.95, cache: bool = True, verbose: bool = True) → anndata.AnnData
Reads 10x Genomics data into an integrated AnnData object.
Datasets can include gene expression (GEX), cell hashes, antigen barcodes (AgBCs), feature barcodes, and assembled BCR or TCR contig sequences.
- Parameters:
mtx_path (str) – Path to a CellRanger counts matrix folder, typically either 'sample_feature_bc_matrix' or 'raw_feature_bc_matrix'.
[bcr|tcr]_file (str, optional) – Path to a file containing BCR/TCR data. The file can be in one of several formats:
FASTA-formatted file, as output by CellRanger
delimited text file, containing annotated BCR/TCR sequences
JSON-formatted file, containing annotated BCR/TCR sequences
[bcr|tcr]_annot (str, optional) – Path to the CSV-formatted BCR/TCR annotations file produced by CellRanger. Matching the annotation file to [bcr|tcr]_file is preferred: if 'all_contig.fasta' is the supplied [bcr|tcr]_file, then 'all_contig_annotations.csv' is the appropriate annotation file.
[bcr|tcr]_format (str, default='fasta') – Format of the input [bcr|tcr]_file. Options are 'fasta', 'delimited', and 'json'. If [bcr|tcr]_format is 'fasta', abstar will be run on the input data to obtain annotated BCR/TCR data. By default, abstar will produce AIRR-formatted (tab-delimited) annotations.
[bcr|tcr]_delimiter (str, default='\t') – Delimiter used in [bcr|tcr]_file. Only used if [bcr|tcr]_format is 'delimited'. Default is '\t' (tab), which conforms to AIRR-C data standards.
[bcr|tcr]_id_key (str, default='sequence_id') – Name of the column or field in [bcr|tcr]_file that corresponds to the sequence ID.
[bcr|tcr]_sequence_key (str, default='sequence') – Name of the column or field in [bcr|tcr]_file that corresponds to the VDJ sequence.
[bcr|tcr]_id_delimiter (str, default='_') – The delimiter used to separate the droplet and contig components of the sequence ID. For example, default CellRanger names are formatted as 'AAACCTGAGAACTGTA-1_contig_1', where 'AAACCTGAGAACTGTA-1' is the droplet identifier and 'contig_1' is the contig identifier.
[bcr|tcr]_id_delimiter_num (int, default=1) – The occurrence (1-based numbering) of the [bcr|tcr]_id_delimiter that separates the droplet and contig components.
abstar_output_format (str, default='airr') – Format for abstar annotations. Only used if [bcr|tcr]_format is 'fasta'. Options are 'airr', 'json' and 'tabular'.
abstar_germ_db (str, default='human') – Germline database to be used for annotation of BCR/TCR data. Built-in abstar options include 'human', 'macaque', 'mouse' and 'humouse'. Only used if one or both of [bcr|tcr]_format is 'fasta'.
gex_only (bool, default=False) – If True, return only gene expression data and ignore features and hashes. Note that VDJ data will still be included in the returned AnnData object if [bcr|tcr]_file is provided.
cellhash_regex (str, default='cell ?hash') – A regular expression (regex) string used to identify cell hashes. The regex must be found in all hash names. The default, combined with the default setting for ignore_cellhash_case, will match 'cellhash' or 'cell hash' in any combination of upper and lower case letters.
ignore_cellhash_case (bool, default=True) – If True, searching for cellhash_regex will ignore case.
agbc_regex (str, default='agbc') – A regular expression (regex) string used to identify AgBCs. The regex must be found in all AgBC names. The default, combined with the default setting for ignore_agbc_case, will match 'agbc' in any combination of upper and lower case letters.
ignore_agbc_case (bool, default=True) – If True, searching for agbc_regex will ignore case.
log_transform_cellhashes (bool, default=True) – If True, cell hash UMI counts will be log2-plus-1 transformed.
log_transform_agbcs (bool, default=True) – If True, AgBC UMI counts will be log2-plus-1 transformed.
log_transform_features (bool, default=True) – If True, feature UMI counts will be log2-plus-1 transformed.
ignore_zero_quantile_cellhashes (bool, default=True) – If True, any cellhash whose count at the cellhash_quantile quantile is zero is ignored. With the defaults (True, and a cellhash_quantile of 0.95), cellhashes with a count of zero at the 95th percentile are ignored.
ignore_zero_quantile_agbcs (bool, default=True) – If True, any AgBC whose count at the agbc_quantile quantile is zero is ignored. With the defaults (True, and an agbc_quantile of 0.95), AgBCs with a count of zero at the 95th percentile are ignored.
ignore_zero_quantile_features (bool, default=True) – If True, any feature whose count at the feature_quantile quantile is zero is ignored. With the defaults (True, and a feature_quantile of 0.95), features with a count of zero at the 95th percentile are ignored.
rename_cellhashes (dict, optional) – A dictionary with keys and values corresponding to the existing and new cellhash names, respectively. For example, {'CellHash1': 'donor123'} would result in the renaming of 'CellHash1' to 'donor123'. Cellhashes not found in the rename_cellhashes dictionary will not be renamed.
rename_agbcs (dict, optional) – A dictionary with keys and values corresponding to the existing and new AgBC names, respectively. For example, {'AgBC1': 'Influenza H1'} would result in the renaming of 'AgBC1' to 'Influenza H1'. AgBCs not found in the rename_agbcs dictionary will not be renamed.
rename_features (dict, optional) – A dictionary with keys and values corresponding to the existing and new feature names, respectively. For example, {'FeatureBC1': 'CD19'} would result in the renaming of 'FeatureBC1' to 'CD19'. Features not found in the rename_features dictionary will not be renamed.
feature_suffix (str, default='_FBC') – Suffix to add to the end of each feature name. Useful because feature names may overlap with gene names. The default value will result in the feature 'CD19' being renamed to 'CD19_FBC'. The suffix is added after feature renaming. To skip the addition of a feature suffix, supply an empty string ('') as the argument.
cellhash_quantile (float, default=0.95) – Quantile at which a cellhash with a count of zero is ignored, if ignore_zero_quantile_cellhashes is True. Default is 0.95, which is equivalent to the 95th percentile.
agbc_quantile (float, default=0.95) – Quantile at which an AgBC with a count of zero is ignored, if ignore_zero_quantile_agbcs is True. Default is 0.95, which is equivalent to the 95th percentile.
feature_quantile (float, default=0.95) – Quantile at which a feature with a count of zero is ignored, if ignore_zero_quantile_features is True. Default is 0.95, which is equivalent to the 95th percentile.
verbose (bool, default=True) – Print progress updates.
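The ID-splitting and regex-matching behaviour described for [bcr|tcr]_id_delimiter, [bcr|tcr]_id_delimiter_num, cellhash_regex and agbc_regex can be sketched in plain Python. This is only an illustration of the documented semantics, not scab's implementation; the helper names split_contig_id and find_matching_names are hypothetical and do not exist in scab.

```python
import re


def split_contig_id(sequence_id, delimiter="_", delimiter_num=1):
    """Split a sequence ID at the Nth (1-based) occurrence of `delimiter`.

    Hypothetical helper: mirrors the documented meaning of
    [bcr|tcr]_id_delimiter and [bcr|tcr]_id_delimiter_num.
    """
    parts = sequence_id.split(delimiter)
    if delimiter_num < 1 or delimiter_num >= len(parts):
        raise ValueError(
            f"{sequence_id!r} has no occurrence number {delimiter_num} of {delimiter!r}"
        )
    droplet = delimiter.join(parts[:delimiter_num])
    contig = delimiter.join(parts[delimiter_num:])
    return droplet, contig


def find_matching_names(names, pattern, ignore_case=True):
    """Return the names in which the regex `pattern` is found.

    Hypothetical helper: mirrors the documented meaning of
    cellhash_regex / agbc_regex and their ignore-case switches.
    """
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [name for name in names if regex.search(name)]


# Default CellRanger contig name, split at the first underscore.
droplet, contig = split_contig_id("AAACCTGAGAACTGTA-1_contig_1")

# Feature names as they might appear in a counts matrix (made-up examples).
feature_names = ["CellHash1", "cell hash 2", "AgBC1", "CD19"]
cellhashes = find_matching_names(feature_names, "cell ?hash")
agbcs = find_matching_names(feature_names, "agbc")
```

With the defaults, droplet is 'AAACCTGAGAACTGTA-1' and contig is 'contig_1'; the two cellhash-like names and the single AgBC name are picked out regardless of case.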
- Returns:
adata – An AnnData object containing gene expression data, with VDJ information located at adata.obs.bcr and/or adata.obs.tcr, and cellhash and feature barcode data found in adata.obs. If gex_only is True, cellhash and feature barcode data are not returned.
- Return type:
anndata.AnnData
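The cellhash, AgBC and feature columns are shaped by the quantile, log-transform, rename and suffix parameters described above. The sketch below illustrates those documented steps on plain Python lists. It assumes a linear-interpolation quantile and a particular order of operations, neither of which is confirmed by this page, and the function names quantile and process_barcode_counts are hypothetical.

```python
import math


def quantile(values, q):
    """Linear-interpolation quantile of `values` for 0 <= q <= 1 (assumed method)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def process_barcode_counts(counts, *, q=0.95, ignore_zero_quantile=True,
                           log_transform=True, rename=None, suffix=""):
    """Filter, rename and transform per-barcode UMI counts (hypothetical helper)."""
    rename = rename or {}
    processed = {}
    for name, values in counts.items():
        # Drop barcodes whose count at the chosen quantile is zero.
        if ignore_zero_quantile and quantile(values, q) == 0:
            continue
        # The suffix is added after renaming, as documented for feature_suffix.
        new_name = rename.get(name, name) + suffix
        if log_transform:
            # log2-plus-1 transform of the raw UMI counts.
            values = [math.log2(v + 1) for v in values]
        processed[new_name] = list(values)
    return processed


# Made-up UMI counts for two feature barcodes across four droplets.
counts = {
    "FeatureBC1": [0, 3, 7, 15],
    "FeatureBC2": [0, 0, 0, 0],
}
result = process_barcode_counts(
    counts, rename={"FeatureBC1": "CD19"}, suffix="_FBC"
)
```

Here FeatureBC2 is dropped because its 95th-percentile count is zero, and FeatureBC1 is renamed to 'CD19_FBC' with transformed values [0.0, 2.0, 3.0, 4.0].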
- scab.io.read(h5ad_file: str | Path) → anndata.AnnData
Reads a serialized AnnData object.
Similar to scanpy.read(), except that scanpy does not support serialized BCR/TCR data. If BCR/TCR data is included in the serialized AnnData file, it will be separately deserialized into the original abutils.Pair objects.
- Parameters:
h5ad_file (str) – Path to the serialized AnnData object. Must be an ".h5ad" file. Required.
- Returns:
adata
- Return type:
anndata.AnnData
- scab.io.load(h5ad_file: str | Path) → anndata.AnnData
Loads a serialized AnnData object.
Similar to scanpy.read(), except that scanpy does not support serialized BCR/TCR data. If BCR/TCR data is included in the serialized AnnData file, it will be separately deserialized into the original abutils.Pair objects.
- Parameters:
h5ad_file (str) – Path to the serialized AnnData object. Must be an ".h5ad" file. Required.
- Returns:
adata
- Return type:
anndata.AnnData
- scab.io.write(adata: anndata.AnnData, h5ad_file: str | pathlib.Path)
Serializes and writes an AnnData object to disk in h5ad format.
Similar to scanpy.write(), except that scanpy does not support serializing BCR/TCR data. This function serializes abutils.Pair objects stored in either adata.obs.bcr or adata.obs.tcr using pickle prior to writing the AnnData object to disk.
- Parameters:
adata – An AnnData object containing gene expression, feature barcode and VDJ data. scab.read_10x_mtx() can be used to construct a multi-omics AnnData object from raw CellRanger outputs.
h5ad_file – Path to the output file. The output is written in h5ad format, so the file name should end with the '.h5ad' extension. If the extension is missing, it is added automatically.
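The pickle step mentioned above can be pictured as a per-cell round trip: each object in the BCR/TCR column is converted to bytes before writing and restored after reading. The sketch below uses plain dictionaries as stand-ins for abutils.Pair objects, and the helper names serialize_column and deserialize_column are hypothetical; how scab actually stores the pickled data inside the h5ad file is not described on this page.

```python
import pickle


def serialize_column(objects):
    """Pickle each object in a column so it can be stored as bytes."""
    return [pickle.dumps(obj) for obj in objects]


def deserialize_column(blobs):
    """Restore the original objects from their pickled bytes."""
    return [pickle.loads(blob) for blob in blobs]


# Stand-ins for abutils.Pair objects (made-up fields); None marks a cell
# for which no BCR was recovered.
bcr_column = [
    {"name": "AAACCTGAGAACTGTA-1", "heavy": "contig_1", "light": "contig_2"},
    None,
]

blobs = serialize_column(bcr_column)
restored = deserialize_column(blobs)
```

Because unpickling can execute arbitrary code, files containing pickled data should only be loaded when they come from a trusted source.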
- scab.io.save(adata: anndata.AnnData, h5ad_file: str | pathlib.Path)
Serializes and saves an AnnData object to disk in h5ad format.
Similar to scanpy.write(), except that scanpy does not support serializing BCR/TCR data. This function serializes abutils.Pair objects stored in either adata.obs.bcr or adata.obs.tcr using pickle prior to writing the AnnData object to disk.
- Parameters:
adata – An AnnData object containing gene expression, feature barcode and VDJ data. scab.read_10x_mtx() can be used to construct a multi-omics AnnData object from raw CellRanger outputs.
h5ad_file – Path to the output file. The output is written in h5ad format, so the file name should end with the '.h5ad' extension. If the extension is missing, it is added automatically.
- scab.io.concat(adatas: Collection[anndata.AnnData] | Mapping[str, anndata.AnnData], *, axis: Literal[0, 1] = 0, join: Literal['inner', 'outer'] = 'inner', merge: Literal['same', 'unique', 'first', 'only'] | Callable | None = None, uns_merge: Literal['same', 'unique', 'first', 'only'] | Callable | None = 'unique', label: str | None = None, keys: Collection | None = None, index_unique: str | None = None, fill_value: Any | None = None, pairwise: bool = False, obs_names_make_unique: bool = True) → anndata.AnnData
Concatenates AnnData objects using anndata.concat().
Documentation was copied almost verbatim from the anndata.concat() docstring.
The only major difference is that the default for uns_merge has been changed from None (which doesn’t merge any of the data in adata.uns) to 'unique', which only merges adata.uns elements for which there is only one possible value.
- Parameters:
adatas – The objects to be concatenated. If a Mapping is passed, keys are used for the keys argument and values are concatenated.
axis – Which axis to concatenate along. 0 is row-wise, 1 is column-wise.
join – How to align values when concatenating. If "outer", the union of the other axis is taken. If "inner", the intersection is taken. See "Inner and outer joins" under Examples.
merge – How elements not aligned to the axis being concatenated along are selected. Currently implemented strategies include:
* None: No elements are kept.
* "same": Elements that are the same in each of the objects.
* "unique": Elements for which there is only one possible value.
* "first": The first element seen at each position.
* "only": Elements that show up in only one of the objects.
uns_merge – How the elements of .uns are selected. Uses the same set of strategies as the merge argument, except applied recursively.
label – Column in axis annotation (i.e. .obs or .var) to place batch information in. If it’s None, no column is added.
keys – Names for each object being added. These values are used for column values for label or appended to the index if index_unique is not None. Defaults to incrementing integer labels.
index_unique – Whether to make the index unique by using the keys. If provided, this is the delimiter between “{orig_idx}{index_unique}{key}”. When None, the original indices are kept.
fill_value – When join="outer", this is the value that will be used to fill the introduced indices. By default, sparse arrays are padded with zeros, while dense arrays and DataFrames are padded with missing values.
pairwise – Whether pairwise elements along the concatenated dimension should be included. This is False by default, since the resulting arrays are often not meaningful.
obs_names_make_unique – If True, will call obs_names_make_unique() on the concatenated AnnData object prior to returning. Default is True.
Notes
Warning
If you use join='outer' this fills 0s for sparse data when variables are absent in a batch. Use this with care. Dense data is filled with NaN.
Examples
Preparing example objects
>>> import anndata as ad, pandas as pd, numpy as np
>>> from scipy import sparse
>>> a = ad.AnnData(
...     X=sparse.csr_matrix(np.array([[0, 1], [2, 3]])),
...     obs=pd.DataFrame({"group": ["a", "b"]}, index=["s1", "s2"]),
...     var=pd.DataFrame(index=["var1", "var2"]),
...     varm={"ones": np.ones((2, 5)), "rand": np.random.randn(2, 3), "zeros": np.zeros((2, 5))},
...     uns={"a": 1, "b": 2, "c": {"c.a": 3, "c.b": 4}},
... )
>>> b = ad.AnnData(
...     X=sparse.csr_matrix(np.array([[4, 5, 6], [7, 8, 9]])),
...     obs=pd.DataFrame({"group": ["b", "c"], "measure": [1.2, 4.3]}, index=["s3", "s4"]),
...     var=pd.DataFrame(index=["var1", "var2", "var3"]),
...     varm={"ones": np.ones((3, 5)), "rand": np.random.randn(3, 5)},
...     uns={"a": 1, "b": 3, "c": {"c.b": 4}},
... )
>>> c = ad.AnnData(
...     X=sparse.csr_matrix(np.array([[10, 11], [12, 13]])),
...     obs=pd.DataFrame({"group": ["a", "b"]}, index=["s1", "s2"]),
...     var=pd.DataFrame(index=["var3", "var4"]),
...     uns={"a": 1, "b": 4, "c": {"c.a": 3, "c.b": 4, "c.c": 5}},
... )
Concatenating along different axes
>>> ad.concat([a, b]).to_df()
    var1  var2
s1   0.0   1.0
s2   2.0   3.0
s3   4.0   5.0
s4   7.0   8.0
>>> ad.concat([a, c], axis=1).to_df()
    var1  var2  var3  var4
s1   0.0   1.0  10.0  11.0
s2   2.0   3.0  12.0  13.0
Inner and outer joins
>>> inner = ad.concat([a, b])  # Joining on intersection of variables
>>> inner
AnnData object with n_obs × n_vars = 4 × 2
    obs: 'group'
>>> (inner.obs_names, inner.var_names)
(Index(['s1', 's2', 's3', 's4'], dtype='object'),
 Index(['var1', 'var2'], dtype='object'))
>>> outer = ad.concat([a, b], join="outer")  # Joining on union of variables
>>> outer
AnnData object with n_obs × n_vars = 4 × 3
    obs: 'group', 'measure'
>>> outer.var_names
Index(['var1', 'var2', 'var3'], dtype='object')
>>> outer.to_df()  # Sparse arrays are padded with zeroes by default
    var1  var2  var3
s1   0.0   1.0   0.0
s2   2.0   3.0   0.0
s3   4.0   5.0   6.0
s4   7.0   8.0   9.0
Keeping track of source objects
>>> ad.concat({"a": a, "b": b}, label="batch").obs
   group batch
s1     a     a
s2     b     a
s3     b     b
s4     c     b
>>> ad.concat([a, b], label="batch", keys=["a", "b"]).obs  # Equivalent to previous
   group batch
s1     a     a
s2     b     a
s3     b     b
s4     c     b
>>> ad.concat({"a": a, "b": b}, index_unique="-").obs
     group
s1-a     a
s2-a     b
s3-b     b
s4-b     c
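The index names in the last example follow the “{orig_idx}{index_unique}{key}” pattern given for index_unique. A minimal sketch of that naming rule, using a hypothetical helper make_unique_index that is not part of scab or anndata:

```python
def make_unique_index(batches, index_unique=None):
    """Build observation names for concatenated batches.

    `batches` maps each key to the observation names of that object.
    When `index_unique` is None the original names are kept unchanged.
    """
    names = []
    for key, obs_names in batches.items():
        for name in obs_names:
            if index_unique is None:
                names.append(name)
            else:
                names.append(f"{name}{index_unique}{key}")
    return names


# Observation names of the example objects a and b.
batches = {"a": ["s1", "s2"], "b": ["s3", "s4"]}
unique_names = make_unique_index(batches, index_unique="-")
original_names = make_unique_index(batches)
```

The resulting unique_names list matches the index shown in the index_unique="-" example: s1-a, s2-a, s3-b, s4-b.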
Combining values not aligned to axis of concatenation
>>> ad.concat([a, b], merge="same")
AnnData object with n_obs × n_vars = 4 × 2
    obs: 'group'
    varm: 'ones'
>>> ad.concat([a, b], merge="unique")
AnnData object with n_obs × n_vars = 4 × 2
    obs: 'group'
    varm: 'ones', 'zeros'
>>> ad.concat([a, b], merge="first")
AnnData object with n_obs × n_vars = 4 × 2
    obs: 'group'
    varm: 'ones', 'rand', 'zeros'
>>> ad.concat([a, b], merge="only")
AnnData object with n_obs × n_vars = 4 × 2
    obs: 'group'
    varm: 'zeros'
The same merge strategies can be used for elements in .uns
>>> dict(ad.concat([a, b, c], uns_merge="same").uns)
{'a': 1, 'c': {'c.b': 4}}
>>> dict(ad.concat([a, b, c], uns_merge="unique").uns)
{'a': 1, 'c': {'c.a': 3, 'c.b': 4, 'c.c': 5}}
>>> dict(ad.concat([a, b, c], uns_merge="only").uns)
{'c': {'c.c': 5}}
>>> dict(ad.concat([a, b, c], uns_merge="first").uns)
{'a': 1, 'b': 2, 'c': {'c.a': 3, 'c.b': 4, 'c.c': 5}}
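Since 'unique' is the default uns_merge in scab.io.concat, it may help to see that strategy written out. The following is a pure-Python sketch of the rule as described above (keep an element only when there is a single possible value, applied recursively to nested dictionaries). It is not anndata's implementation; the function name merge_unique is hypothetical, and dropping nested dictionaries that end up empty is an assumption.

```python
def merge_unique(dicts):
    """Recursively keep keys for which all objects that have the key agree."""
    # Collect keys in first-seen order across all inputs.
    all_keys = []
    for d in dicts:
        for key in d:
            if key not in all_keys:
                all_keys.append(key)

    merged = {}
    for key in all_keys:
        values = [d[key] for d in dicts if key in d]
        if all(isinstance(v, dict) for v in values):
            # Nested mappings are merged with the same rule.
            nested = merge_unique(values)
            if nested:  # assumption: empty nested results are dropped
                merged[key] = nested
        elif all(v == values[0] for v in values[1:]):
            # Only one possible value exists, so it is kept.
            merged[key] = values[0]
    return merged


# The .uns contents of the example objects a, b and c.
uns_a = {"a": 1, "b": 2, "c": {"c.a": 3, "c.b": 4}}
uns_b = {"a": 1, "b": 3, "c": {"c.b": 4}}
uns_c = {"a": 1, "b": 4, "c": {"c.a": 3, "c.b": 4, "c.c": 5}}

result = merge_unique([uns_a, uns_b, uns_c])
```

For these inputs the sketch reproduces the uns_merge="unique" output shown above: 'b' is dropped because it takes three different values, while 'c.c' is kept because only one object defines it.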