vdj

scab.vdj.merge(adata: AnnData, vdj_file: str | None = None, vdj_annot: str | None = None, vdj_field: str = 'bcr', vdj_format: Literal['fasta', 'delimited', 'json'] = 'fasta', vdj_delimiter: str = '\t', vdj_id_key: str = 'sequence_id', vdj_sequence_key: str = 'sequence', vdj_id_delimiter: str = '_', vdj_id_delimiter_num: int = 1, receptor: str = 'bcr', chain_selection_func: Callable | None = None, abstar_output_format: Literal['airr', 'json'] = 'airr', abstar_germ_db: str = 'human', verbose: bool = False) → AnnData

Merge VDJ (either BCR or TCR) sequences into an AnnData object.

Parameters:

adata (AnnData) – AnnData object, typically obtained by first running scab.io.read_10x_mtx(). Required
vdj_file (str, optional) –
Path to a file containing VDJ (BCR or TCR) data. The file can be in one of several formats:
- FASTA-formatted file, as output by CellRanger
- delimited text file, containing annotated VDJ sequences
- JSON-formatted file, containing annotated VDJ sequences
vdj_annot (str, optional) – Path to the CSV-formatted VDJ annotations file produced by CellRanger. Matching the annotation file to vdj_file is preferred: if 'all_contig.fasta' is the supplied vdj_file, then 'all_contig_annotations.csv' is the appropriate annotation file.
vdj_format (str, default='fasta') – Format of the input vdj_file. Options are: 'fasta', 'delimited', and 'json'. If vdj_format is 'fasta', abstar will be run on the input data to obtain annotated VDJ data. By default, abstar will produce AIRR-formatted (tab-delimited) annotations.
vdj_delimiter (str, default='\t') – Delimiter used in vdj_file. Only used if vdj_format is 'delimited'. Default is '\t' (tab), which conforms to AIRR-C data standards.
vdj_id_key (str, default='sequence_id') – Name of the column or field in vdj_file that corresponds to the sequence ID.
vdj_sequence_key (str, default='sequence') – Name of the column or field in vdj_file that corresponds to the VDJ sequence.
vdj_id_delimiter (str, default='_') – The delimiter used to separate the droplet and contig components of the sequence ID. For example, default CellRanger names are formatted as: 'AAACCTGAGAACTGTA-1_contig_1', where 'AAACCTGAGAACTGTA-1' is the droplet identifier and 'contig_1' is the contig identifier.
vdj_id_delimiter_num (int, default=1) – The occurrence (1-based numbering) of vdj_id_delimiter at which the sequence ID is split into its droplet and contig components.
abstar_output_format (str, default='airr') – Format for abstar annotations. Only used if vdj_format is 'fasta'. Options are 'airr', 'json' and 'tabular'.
abstar_germ_db (str, default='human') – Germline database to be used for annotation of VDJ data. Built-in abstar options include: 'human', 'macaque', 'mouse' and 'humouse'. Only used if vdj_format is 'fasta'.
verbose (bool, default=False) – Print progress updates.

Returns:

adata – An AnnData object containing gene expression data, with VDJ information located at adata.obs.{vdj_field}.

Return type:

AnnData
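
The vdj_id_delimiter and vdj_id_delimiter_num parameters together determine where a sequence ID is split into its droplet and contig components. The rule can be sketched in plain Python as below. split_sequence_id is a hypothetical helper written for illustration, not part of scab, and the library's own parsing may differ in detail; the sample-prefixed ID in the second call is likewise an assumed naming scheme.

```python
def split_sequence_id(sequence_id, delimiter="_", delimiter_num=1):
    """Split an ID at the Nth (1-based) occurrence of the delimiter.

    Hypothetical helper for illustration only; not part of scab.
    """
    parts = sequence_id.split(delimiter)
    # The Nth occurrence must exist, so at least N + 1 parts are needed.
    if not 1 <= delimiter_num < len(parts):
        raise ValueError(
            f"{sequence_id!r} has no occurrence #{delimiter_num} of {delimiter!r}"
        )
    droplet = delimiter.join(parts[:delimiter_num])
    contig = delimiter.join(parts[delimiter_num:])
    return droplet, contig


# Default CellRanger naming: split at the first underscore.
print(split_sequence_id("AAACCTGAGAACTGTA-1_contig_1"))

# Assumed sample-prefixed naming: split at the second underscore instead.
print(split_sequence_id("sampleA_AAACCTGAGAACTGTA-1_contig_1", delimiter_num=2))
```

With the defaults, the CellRanger-style ID yields the droplet 'AAACCTGAGAACTGTA-1' and the contig 'contig_1', matching the example in the vdj_id_delimiter description.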

scab.vdj.merge_bcr(adata: AnnData, bcr_file: str | None = None, bcr_annot: str | None = None, bcr_format: Literal['fasta', 'delimited', 'json'] = 'fasta', bcr_delimiter: str = '\t', bcr_id_key: str = 'sequence_id', bcr_sequence_key: str = 'sequence', bcr_id_delimiter: str = '_', bcr_id_delimiter_num: int = 1, chain_selection_func: Callable | None = None, abstar_output_format: Literal['airr', 'json'] = 'airr', abstar_germ_db: str = 'human', verbose: bool = True) → AnnData

Merge BCR sequences into an AnnData object.

Parameters:

adata (AnnData) – AnnData object, typically obtained by first running scab.io.read_10x_mtx(). Required
bcr_file (str, optional) –
Path to a file containing BCR data. The file can be in one of several formats:
- FASTA-formatted file, as output by CellRanger
- delimited text file, containing annotated BCR sequences
- JSON-formatted file, containing annotated BCR sequences
bcr_annot (str, optional) – Path to the CSV-formatted BCR annotations file produced by CellRanger. Matching the annotation file to bcr_file is preferred – if 'all_contig.fasta' is the supplied bcr_file, then 'all_contig_annotations.csv' is the appropriate annotation file.
bcr_format (str, default='fasta') – Format of the input bcr_file. Options are: 'fasta', 'delimited', and 'json'. If bcr_format is 'fasta', abstar will be run on the input data to obtain annotated BCR data. By default, abstar will produce AIRR-formatted (tab-delimited) annotations.
bcr_delimiter (str, default='\t') – Delimiter used in bcr_file. Only used if bcr_format is 'delimited'. Default is '\t' (tab), which conforms to AIRR-C data standards.
bcr_id_key (str, default='sequence_id') – Name of the column or field in bcr_file that corresponds to the sequence ID.
bcr_sequence_key (str, default='sequence') – Name of the column or field in bcr_file that corresponds to the VDJ sequence.
bcr_id_delimiter (str, default='_') – The delimiter used to separate the droplet and contig components of the sequence ID. For example, default CellRanger names are formatted as: 'AAACCTGAGAACTGTA-1_contig_1', where 'AAACCTGAGAACTGTA-1' is the droplet identifier and 'contig_1' is the contig identifier.
bcr_id_delimiter_num (int, default=1) – The occurrence (1-based numbering) of bcr_id_delimiter at which the sequence ID is split into its droplet and contig components.
abstar_output_format (str, default='airr') – Format for abstar annotations. Only used if bcr_format is 'fasta'. Options are 'airr', 'json' and 'tabular'.
abstar_germ_db (str, default='human') – Germline database to be used for annotation of BCR data. Built-in abstar options include: 'human', 'macaque', 'mouse' and 'humouse'. Only used if bcr_format is 'fasta'.
verbose (bool, default=True) – Print progress updates.

Returns:

adata – An AnnData object containing gene expression data, with BCR information located at adata.obs.bcr.

Return type:

AnnData
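
When bcr_format is 'delimited', the bcr_delimiter, bcr_id_key and bcr_sequence_key parameters describe how the file is laid out. A minimal sketch of reading such a file and grouping contigs by droplet, using only the standard library, is shown below. group_by_droplet and the three data rows are made up for illustration; this is not scab's implementation.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for a tab-delimited annotation file (made-up rows).
AIRR_TEXT = (
    "sequence_id\tlocus\tsequence\n"
    "AAACCTGAGAACTGTA-1_contig_1\tIGH\tCAGGTGCAGCTG\n"
    "AAACCTGAGAACTGTA-1_contig_2\tIGK\tGACATCCAGATG\n"
    "AAACGGGCATTTCACT-1_contig_1\tIGH\tGAGGTGCAGCTG\n"
)


def group_by_droplet(handle, delimiter="\t", id_key="sequence_id",
                     sequence_key="sequence", id_delimiter="_",
                     id_delimiter_num=1):
    """Group sequences by droplet barcode (hypothetical helper, not scab)."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter=delimiter):
        # The droplet barcode is everything before the Nth ID delimiter.
        parts = row[id_key].split(id_delimiter)
        droplet = id_delimiter.join(parts[:id_delimiter_num])
        groups[droplet].append(row[sequence_key])
    return dict(groups)


groups = group_by_droplet(io.StringIO(AIRR_TEXT))
for droplet, sequences in groups.items():
    print(droplet, len(sequences))
```

Each droplet barcode then maps to the list of contig sequences observed in that droplet, which is the grouping needed before sequences can be attached to the matching cell.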

scab.vdj.merge_tcr(adata: AnnData, tcr_file: str | None = None, tcr_annot: str | None = None, tcr_format: Literal['fasta', 'delimited', 'json'] = 'fasta', tcr_delimiter: str = '\t', tcr_id_key: str = 'sequence_id', tcr_sequence_key: str = 'sequence', tcr_id_delimiter: str = '_', tcr_id_delimiter_num: int = 1, chain_selection_func: Callable | None = None, abstar_output_format: Literal['airr', 'json'] = 'airr', abstar_germ_db: str = 'human', verbose: bool = True) → AnnData

Merge TCR sequences into an AnnData object.

Parameters:

adata (AnnData) – AnnData object, typically obtained by first running scab.io.read_10x_mtx(). Required
tcr_file (str, optional) –
Path to a file containing TCR data. The file can be in one of several formats:
- FASTA-formatted file, as output by CellRanger
- delimited text file, containing annotated TCR sequences
- JSON-formatted file, containing annotated TCR sequences
tcr_annot (str, optional) – Path to the CSV-formatted TCR annotations file produced by CellRanger. Matching the annotation file to tcr_file is preferred – if 'all_contig.fasta' is the supplied tcr_file, then 'all_contig_annotations.csv' is the appropriate annotation file.
tcr_format (str, default='fasta') – Format of the input tcr_file. Options are: 'fasta', 'delimited', and 'json'. If tcr_format is 'fasta', abstar will be run on the input data to obtain annotated TCR data. By default, abstar will produce AIRR-formatted (tab-delimited) annotations.
tcr_delimiter (str, default='\t') – Delimiter used in tcr_file. Only used if tcr_format is 'delimited'. Default is '\t' (tab), which conforms to AIRR-C data standards.
tcr_id_key (str, default='sequence_id') – Name of the column or field in tcr_file that corresponds to the sequence ID.
tcr_sequence_key (str, default='sequence') – Name of the column or field in tcr_file that corresponds to the VDJ sequence.
tcr_id_delimiter (str, default='_') – The delimiter used to separate the droplet and contig components of the sequence ID. For example, default CellRanger names are formatted as: 'AAACCTGAGAACTGTA-1_contig_1', where 'AAACCTGAGAACTGTA-1' is the droplet identifier and 'contig_1' is the contig identifier.
tcr_id_delimiter_num (int, default=1) – The occurrence (1-based numbering) of tcr_id_delimiter at which the sequence ID is split into its droplet and contig components.
abstar_output_format (str, default='airr') – Format for abstar annotations. Only used if tcr_format is 'fasta'. Options are 'airr', 'json' and 'tabular'.
abstar_germ_db (str, default='human') – Germline database to be used for annotation of TCR data. Built-in abstar options include: 'human', 'macaque', 'mouse' and 'humouse'. Only used if tcr_format is 'fasta'.
verbose (bool, default=True) – Print progress updates.

Returns:

adata – An AnnData object containing gene expression data, with TCR information located at adata.obs.tcr.

Return type:

AnnData
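
A typical call pairs the CellRanger contig FASTA with its matching annotation CSV. The sketch below wraps that call using only the keyword arguments documented above. merge_tcr_from_cellranger and the vdj_dir directory layout are assumptions for illustration; the import is deferred so the function can be defined without scab installed.

```python
import os


def merge_tcr_from_cellranger(adata, vdj_dir, germ_db="human"):
    """Hypothetical convenience wrapper around scab.vdj.merge_tcr.

    Assumes vdj_dir holds CellRanger's 'all_contig.fasta' and
    'all_contig_annotations.csv' side by side.
    """
    # Deferred import: scab is only needed when the wrapper is called.
    from scab.vdj import merge_tcr

    return merge_tcr(
        adata,
        tcr_file=os.path.join(vdj_dir, "all_contig.fasta"),
        tcr_annot=os.path.join(vdj_dir, "all_contig_annotations.csv"),
        tcr_format="fasta",  # abstar is run on the FASTA input
        abstar_germ_db=germ_db,
        verbose=False,
    )
```

Given an AnnData object from scab.io.read_10x_mtx(), the wrapper would be called as merge_tcr_from_cellranger(adata, "path/to/vdj_t"), where the path is a placeholder.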

scab.vdj.get_pairing_info(pairs: Iterable[Pair], receptor: str) → Iterable

Get pairing information for a list of Pair objects.

Parameters:

pairs (Iterable[Pair]) – List of Pair objects.
receptor (str) – Receptor type. Options are 'bcr' and 'tcr'.

Returns:

pair_status

Return type:

Iterable

scab.vdj.clonify(adata, distance_cutoff=0.32, shared_mutation_bonus=0.65, length_penalty_multiplier=2, preclustering=False, preclustering_threshold=0.65, preclustering_field='cdr3_nt', lineage_field='lineage', lineage_size_field='lineage_size', annotation_format='airr', return_assignment_dict=False)

Assigns BCR sequences to clonal lineages using the clonify [Briney16] algorithm.

See also

Thomas Tiller, Eric Meffre, Sergey Yurasov, Makoto Tsuiji, Michel C Nussenzweig, Hedda Wardemann
Efficient generation of monoclonal antibodies from single human B cells by single cell RT-PCR and expression vector cloning
Journal of Immunological Methods 2008, doi: 10.1016/j.jim.2007.09.017

Parameters:

adata (anndata.AnnData) – An anndata.AnnData object containing annotated BCR sequences.
overhang_5 (dict, optional) –
A dict mapping the locus name to 5’ Gibson overhangs. By default, Gibson overhangs corresponding to the expression vectors in Tiller et al., 2008 are used:

IGH: catcctttttctagtagcaactgcaaccggtgtacac

IGK: atcctttttctagtagcaactgcaaccggtgtacac

IGL: atcctttttctagtagcaactgcaaccggtgtacac

To produce constructs without 5’ Gibson overhangs, provide an empty dictionary.
overhang_3 (dict, optional) –
A dict mapping the locus name to 3’ Gibson overhangs. By default, Gibson overhangs corresponding to the expression vectors in Tiller et al., 2008 are used:

IGH: gcgtcgaccaagggcccatcggtcttcc

IGK: cgtacggtggctgcaccatctgtcttcatc

IGL: ggtcagcccaaggctgccccctcggtcactctgttcccgccctcgagtgaggagcttcaagccaacaaggcc

To produce constructs without 3’ Gibson overhangs, provide an empty dictionary.
sequence_key (str, default='sequence_aa') – Field containing the sequence to be codon optimized. Default is 'sequence_aa' if annotation_format == 'airr' or 'vdj_aa' if annotation_format == 'json'. Either nucleotide or amino acid sequences are acceptable.
locus_key (str, default='locus') – Field containing the sequence locus. Default is 'locus' if annotation_format == 'airr', or 'chain' if annotation_format == 'json'. Note that values in locus_key should match the keys in overhang_5 and overhang_3.
name_key (str, optional) – Field (in adata.obs) containing the name of the BCR pair. If not provided, the droplet barcode will be used.
bcr_key (str, default='bcr') – Field (in adata.obs) containing the annotated BCR pair.
sort (bool, default=True) – If True, output will be sorted by sequence name.

Returns:

sequences – A list of abutils.Sequence objects. Each Sequence object has the following descriptive properties:

id: The sequence ID, which includes the pair name and the locus.

sequence: The codon-optimized sequence, including Gibson overhangs.

If sort == True, the output list will be sorted by name_key using natsort.natsorted().

Return type:

list of Sequence objects

scab.vdj.bcr_summary_csv(adata, leading_fields=None, include=None, exclude=None, rename=None, annotation_format='airr', output_file=None)

Summarize the annotated BCR data in adata.obs, either written to a CSV file or returned as a Pandas DataFrame.

Parameters:

adata (anndata.AnnData) – An anndata.AnnData object containing annotated BCR sequences.
leading_fields (iterable object, optional) – A list of fields in adata.obs that should be at the start of the output data. By default, the existing column order in adata.obs is used.
include (iterable object, optional) – A list of columns in adata.obs that should be included in the summary output. By default, all columns in adata.obs are used.
exclude (iterable object, optional) – A list of columns in adata.obs that should be excluded from the summary output. By default, no columns in adata.obs are excluded.
rename (dict, optional) – A dict mapping adata.obs columns to new column names. Any column names not included in rename will not be renamed.
annotation_format (str, default='airr') – Format of the input sequence annotations. Choices are ['airr', 'json'].
output_file (str, optional) – Path to the output file. If not provided, the summary output will be returned as a Pandas DataFrame.

Return type:

If output_file is provided, the summary output will be written to the file in CSV format and nothing is returned. If output_file is not provided, the summary data will be returned as a Pandas DataFrame.
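
The leading_fields, include, exclude and rename options all act on the column list of adata.obs. One plausible way they combine is sketched below on a plain list of column names. select_columns, the order in which the options are applied, and the example column names are assumptions for illustration, not scab's implementation.

```python
def select_columns(columns, leading_fields=None, include=None,
                   exclude=None, rename=None):
    """Return output column names (hypothetical helper, not part of scab)."""
    # Keep only the included columns (all by default), in their existing order.
    kept = [c for c in columns if include is None or c in include]
    # Drop any excluded columns.
    kept = [c for c in kept if exclude is None or c not in exclude]
    # Move the leading fields to the front, in the order they were given.
    leading = [c for c in (leading_fields or []) if c in kept]
    ordered = leading + [c for c in kept if c not in leading]
    # Rename last; columns missing from the mapping keep their names.
    rename = rename or {}
    return [rename.get(c, c) for c in ordered]


obs_columns = ["leiden", "n_genes", "bcr", "sample"]  # made-up column names
print(select_columns(
    obs_columns,
    leading_fields=["sample"],
    exclude=["n_genes"],
    rename={"leiden": "cluster"},
))
```

Here 'sample' moves to the front, 'n_genes' is dropped, and 'leiden' is written out as 'cluster', giving the column order sample, cluster, bcr.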

scab.vdj.to_fasta(adata: AnnData, name: str | None = None, receptor: str = 'bcr', sequence_field: str = 'sequence', locus_field: str = 'locus', pairs_only: bool = False, pairing_status: Iterable | str | None = None, fasta_file: str | None = None) → str | None

Write BCR or TCR sequences to a FASTA file.

Parameters:

adata (AnnData) – The input data, which should contain annotated BCR or TCR sequences.
name (str, optional) – Sequence name to be used. Can be either a column in adata.obs or the name of an annotation field present in each Pair object. If not provided, pair.name will be used.
receptor (str, default='bcr') – Receptor type. Options are 'bcr' and 'tcr'.
sequence_field (str, default='sequence') – Field containing the sequence to be written to the FASTA file. Default is "sequence". Must be present in each BCR/TCR sequence annotation.
locus_field (str, default='locus') – Field containing the sequence locus. Default is "locus". Must be present in each BCR/TCR sequence annotation.
pairs_only (bool, default=False) – If True, only paired sequences will be included. Pairing is determined by calling Pair.is_pair. Default is False, meaning all sequences, even unpaired ones, will be included.
pairing_status (str or iterable, optional) – Pairing status(es) to include. Options are any of the annotations produced by scab.vdj.get_pairing_info(). Multiple statuses can be included as a list. If not provided, all sequences will be included.
fasta_file (str, optional) – Path to the output FASTA file. If not provided, the FASTA sequences will be printed to the console.

Returns:

Output is written to fasta_file if provided. If not, the FASTA sequences are
printed to the console.
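
The write-or-print behavior described above can be sketched with the standard library alone. format_fasta and write_or_print are hypothetical helpers, and the record names (droplet barcode plus locus) are an assumed naming scheme; scab's actual record naming may differ.

```python
def format_fasta(records):
    """Format (name, sequence) pairs as FASTA text (hypothetical helper)."""
    return "\n".join(f">{name}\n{sequence}" for name, sequence in records)


def write_or_print(fasta, fasta_file=None):
    """Write FASTA text to fasta_file if given, otherwise print it."""
    if fasta_file is None:
        print(fasta)
        return
    with open(fasta_file, "w") as handle:
        handle.write(fasta + "\n")


# Made-up records; the "<barcode>_<locus>" names are an assumption.
records = [
    ("AAACCTGAGAACTGTA-1_IGH", "CAGGTGCAGCTG"),
    ("AAACCTGAGAACTGTA-1_IGK", "GACATCCAGATG"),
]
write_or_print(format_fasta(records))
```

Passing a path as fasta_file writes the same text to disk instead of printing it.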