vdj

scab.vdj.merge(adata: AnnData, vdj_file: str | None = None, vdj_annot: str | None = None, vdj_field: str = 'bcr', vdj_format: Literal['fasta', 'delimited', 'json'] = 'fasta', vdj_delimiter: str = '\t', vdj_id_key: str = 'sequence_id', vdj_sequence_key: str = 'sequence', vdj_id_delimiter: str = '_', vdj_id_delimiter_num: int = 1, receptor: str = 'bcr', chain_selection_func: Callable | None = None, abstar_output_format: Literal['airr', 'json'] = 'airr', abstar_germ_db: str = 'human', verbose: bool = False) → AnnData

Merge VDJ (either BCR or TCR) sequences into an AnnData object.

Parameters:
  • adata (AnnData) – AnnData object, typically obtained by first running scab.io.read_10x_mtx(). Required

  • vdj_file (str, optional) –

    Path to a file containing VDJ (BCR or TCR) data. The file can be in one of several formats:

    • FASTA-formatted file, as output by CellRanger

    • delimited text file, containing annotated VDJ sequences

    • JSON-formatted file, containing annotated VDJ sequences

  • vdj_annot (str, optional) – Path to the CSV-formatted VDJ annotations file produced by CellRanger. Matching the annotation file to vdj_file is preferred – if 'all_contig.fasta' is the supplied vdj_file, then 'all_contig_annotations.csv' is the appropriate annotation file.

  • vdj_format (str, default='fasta') – Format of the input vdj_file. Options are 'fasta', 'delimited', and 'json'. If vdj_format is 'fasta', abstar will be run on the input data to obtain annotated VDJ data. By default, abstar will produce AIRR-formatted (tab-delimited) annotations.

  • vdj_delimiter (str, default='\t') – Delimiter used in vdj_file. Only used if vdj_format is 'delimited'. Default is '\t' (tab), which conforms to AIRR-C data standards.

  • vdj_id_key (str, default='sequence_id') – Name of the column or field in vdj_file that corresponds to the sequence ID.

  • vdj_sequence_key (str, default='sequence') – Name of the column or field in vdj_file that corresponds to the VDJ sequence.

  • vdj_id_delimiter (str, default='_') – The delimiter used to separate the droplet and contig components of the sequence ID. For example, default CellRanger names are formatted as: 'AAACCTGAGAACTGTA-1_contig_1', where 'AAACCTGAGAACTGTA-1' is the droplet identifier and 'contig_1' is the contig identifier.

  • vdj_id_delimiter_num (int, default=1) – The occurrence (1-based numbering) of the vdj_id_delimiter that separates the droplet and contig components of the sequence ID.

  • abstar_output_format (str, default='airr') – Format for abstar annotations. Only used if vdj_format is 'fasta'. Options are 'airr' and 'json'.

  • abstar_germ_db (str, default='human') – Germline database to be used for annotation of VDJ data. Built-in abstar options include: 'human', 'macaque', 'mouse' and 'humouse'. Only used if vdj_format is 'fasta'.

  • verbose (bool, default=False) – Print progress updates.

Returns:

adata – An AnnData object containing gene expression data, with VDJ information located at adata.obs.{vdj_field}.

Return type:

AnnData
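
The role of vdj_id_delimiter and vdj_id_delimiter_num can be illustrated with a minimal, standalone sketch. The helper name split_sequence_id is hypothetical and is not part of scab; it only assumes that the sequence ID is split at the nth (1-based) occurrence of the delimiter, as the parameter descriptions above suggest.

```python
def split_sequence_id(sequence_id, delimiter="_", delimiter_num=1):
    """Split a sequence ID at the nth (1-based) occurrence of the delimiter.

    Hypothetical helper for illustration only; scab's internal logic may differ.
    """
    parts = sequence_id.split(delimiter)
    # Everything before the nth delimiter is treated as the droplet identifier.
    droplet = delimiter.join(parts[:delimiter_num])
    # Everything after the nth delimiter is treated as the contig identifier.
    contig = delimiter.join(parts[delimiter_num:])
    return droplet, contig


# Default CellRanger-style name from the parameter description above.
droplet, contig = split_sequence_id("AAACCTGAGAACTGTA-1_contig_1")
print(droplet)
print(contig)
```

If the droplet identifier itself contains the delimiter (for example, a sample prefix joined with '_'), a larger delimiter_num would be needed so that the split falls between the droplet and contig components.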

scab.vdj.merge_bcr(adata: AnnData, bcr_file: str | None = None, bcr_annot: str | None = None, bcr_format: Literal['fasta', 'delimited', 'json'] = 'fasta', bcr_delimiter: str = '\t', bcr_id_key: str = 'sequence_id', bcr_sequence_key: str = 'sequence', bcr_id_delimiter: str = '_', bcr_id_delimiter_num: int = 1, chain_selection_func: Callable | None = None, abstar_output_format: Literal['airr', 'json'] = 'airr', abstar_germ_db: str = 'human', verbose: bool = True) → AnnData

Merge BCR sequences into an AnnData object.

Parameters:
  • adata (AnnData) – AnnData object, typically obtained by first running scab.io.read_10x_mtx(). Required

  • bcr_file (str, optional) –

    Path to a file containing BCR data. The file can be in one of several formats:

    • FASTA-formatted file, as output by CellRanger

    • delimited text file, containing annotated BCR sequences

    • JSON-formatted file, containing annotated BCR sequences

  • bcr_annot (str, optional) – Path to the CSV-formatted BCR annotations file produced by CellRanger. Matching the annotation file to bcr_file is preferred – if 'all_contig.fasta' is the supplied bcr_file, then 'all_contig_annotations.csv' is the appropriate annotation file.

  • bcr_format (str, default='fasta') – Format of the input bcr_file. Options are: 'fasta', 'delimited', and 'json'. If bcr_format is 'fasta', abstar will be run on the input data to obtain annotated BCR data. By default, abstar will produce AIRR-formatted (tab-delimited) annotations.

  • bcr_delimiter (str, default='\t') – Delimiter used in bcr_file. Only used if bcr_format is 'delimited'. Default is '\t' (tab), which conforms to AIRR-C data standards.

  • bcr_id_key (str, default='sequence_id') – Name of the column or field in bcr_file that corresponds to the sequence ID.

  • bcr_sequence_key (str, default='sequence') – Name of the column or field in bcr_file that corresponds to the VDJ sequence.

  • bcr_id_delimiter (str, default='_') – The delimiter used to separate the droplet and contig components of the sequence ID. For example, default CellRanger names are formatted as: 'AAACCTGAGAACTGTA-1_contig_1', where 'AAACCTGAGAACTGTA-1' is the droplet identifier and 'contig_1' is the contig identifier.

  • bcr_id_delimiter_num (int, default=1) – The occurrence (1-based numbering) of the bcr_id_delimiter that separates the droplet and contig components of the sequence ID.

  • abstar_output_format (str, default='airr') – Format for abstar annotations. Only used if bcr_format is 'fasta'. Options are 'airr' and 'json'.

  • abstar_germ_db (str, default='human') – Germline database to be used for annotation of BCR data. Built-in abstar options include: 'human', 'macaque', 'mouse' and 'humouse'. Only used if bcr_format is 'fasta'.

  • verbose (bool, default=True) – Print progress updates.

Returns:

adata – An AnnData object containing gene expression data, with BCR information located at adata.obs.bcr.

Return type:

AnnData
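
As a rough illustration of how the 'delimited' input options fit together, the sketch below reads a small tab-delimited table and groups sequences by droplet barcode. It is not scab's implementation: the function group_by_droplet, the in-memory table, and the nucleotide sequences are all hypothetical, and only the column names and delimiters mirror the documented defaults.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-delimited annotations using the default column names.
airr_text = (
    "sequence_id\tsequence\tlocus\n"
    "AAACCTGAGAACTGTA-1_contig_1\tCAGGTGCAGCTG\tIGH\n"
    "AAACCTGAGAACTGTA-1_contig_2\tGACATCCAGATG\tIGK\n"
)


def group_by_droplet(
    handle,
    delimiter="\t",
    id_key="sequence_id",
    sequence_key="sequence",
    id_delimiter="_",
    id_delimiter_num=1,
):
    """Group sequences by droplet barcode (illustrative sketch only)."""
    droplets = defaultdict(list)
    for row in csv.DictReader(handle, delimiter=delimiter):
        # Keep the part of the ID that precedes the nth delimiter.
        barcode = id_delimiter.join(row[id_key].split(id_delimiter)[:id_delimiter_num])
        droplets[barcode].append(row[sequence_key])
    return dict(droplets)


grouped = group_by_droplet(io.StringIO(airr_text))
print(grouped)
```

Grouping by barcode is what allows contigs from the same droplet to be attached to the matching row of adata.obs.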

scab.vdj.merge_tcr(adata: AnnData, tcr_file: str | None = None, tcr_annot: str | None = None, tcr_format: Literal['fasta', 'delimited', 'json'] = 'fasta', tcr_delimiter: str = '\t', tcr_id_key: str = 'sequence_id', tcr_sequence_key: str = 'sequence', tcr_id_delimiter: str = '_', tcr_id_delimiter_num: int = 1, chain_selection_func: Callable | None = None, abstar_output_format: Literal['airr', 'json'] = 'airr', abstar_germ_db: str = 'human', verbose: bool = True) → AnnData

Merge TCR sequences into an AnnData object.

Parameters:
  • adata (AnnData) – AnnData object, typically obtained by first running scab.io.read_10x_mtx(). Required

  • tcr_file (str, optional) –

    Path to a file containing TCR data. The file can be in one of several formats:

    • FASTA-formatted file, as output by CellRanger

    • delimited text file, containing annotated TCR sequences

    • JSON-formatted file, containing annotated TCR sequences

  • tcr_annot (str, optional) – Path to the CSV-formatted TCR annotations file produced by CellRanger. Matching the annotation file to tcr_file is preferred – if 'all_contig.fasta' is the supplied tcr_file, then 'all_contig_annotations.csv' is the appropriate annotation file.

  • tcr_format (str, default='fasta') – Format of the input tcr_file. Options are: 'fasta', 'delimited', and 'json'. If tcr_format is 'fasta', abstar will be run on the input data to obtain annotated TCR data. By default, abstar will produce AIRR-formatted (tab-delimited) annotations.

  • tcr_delimiter (str, default='\t') – Delimiter used in tcr_file. Only used if tcr_format is 'delimited'. Default is '\t' (tab), which conforms to AIRR-C data standards.

  • tcr_id_key (str, default='sequence_id') – Name of the column or field in tcr_file that corresponds to the sequence ID.

  • tcr_sequence_key (str, default='sequence') – Name of the column or field in tcr_file that corresponds to the VDJ sequence.

  • tcr_id_delimiter (str, default='_') – The delimiter used to separate the droplet and contig components of the sequence ID. For example, default CellRanger names are formatted as: 'AAACCTGAGAACTGTA-1_contig_1', where 'AAACCTGAGAACTGTA-1' is the droplet identifier and 'contig_1' is the contig identifier.

  • tcr_id_delimiter_num (int, default=1) – The occurrence (1-based numbering) of the tcr_id_delimiter that separates the droplet and contig components of the sequence ID.

  • abstar_output_format (str, default='airr') – Format for abstar annotations. Only used if tcr_format is 'fasta'. Options are 'airr' and 'json'.

  • abstar_germ_db (str, default='human') – Germline database to be used for annotation of TCR data. Built-in abstar options include: 'human', 'macaque', 'mouse' and 'humouse'. Only used if tcr_format is 'fasta'.

  • verbose (bool, default=True) – Print progress updates.

Returns:

adata – An AnnData object containing gene expression data, with TCR information located at adata.obs.tcr.

Return type:

AnnData

scab.vdj.get_pairing_info(pairs: Iterable[Pair], receptor: str) → Iterable

Get pairing information for a list of Pair objects.

Parameters:
  • pairs (Iterable[Pair]) – List of Pair objects.

  • receptor (str) – Receptor type. Options are 'bcr' and 'tcr'.

Returns:

pair_status

Return type:

Iterable

scab.vdj.clonify(adata, distance_cutoff=0.32, shared_mutation_bonus=0.65, length_penalty_multiplier=2, preclustering=False, preclustering_threshold=0.65, preclustering_field='cdr3_nt', lineage_field='lineage', lineage_size_field='lineage_size', annotation_format='airr', return_assignment_dict=False)

Assigns BCR sequences to clonal lineages using the clonify [Briney16] algorithm.

See also

Bryan Briney, Khoa Le, Jiang Zhu, and Dennis R Burton
Clonify: unseeded antibody lineage assignment from next-generation sequencing data.
Scientific Reports 2016. https://doi.org/10.1038/srep23901

Parameters:
  • adata (anndata.AnnData) – AnnData object containing annotated sequence data at adata.obs.bcr. If data was read using scab.read_10x_mtx(), BCR data should already be in the correct location.

  • distance_cutoff (float, default=0.32) – Distance threshold for lineage clustering.

  • shared_mutation_bonus (float, default=0.65) – Bonus applied for each shared V-gene mutation.

  • length_penalty_multiplier (int, default=2) – Multiplier for the CDR3 length penalty. Default is 2, resulting in CDR3s that differ by n amino acids being penalized n * 2.

  • preclustering (bool, default=False) – If True, V/J groups are pre-clustered on the preclustering_field sequence, which can potentially speed up lineage assignment and reduce memory usage. If False, each V/J group is processed in its entirety without pre-clustering.

  • preclustering_threshold (float, default=0.65) – Identity threshold for pre-clustering the V/J groups prior to lineage assignment.

  • preclustering_field (str, default='cdr3_nt') – Annotation field on which to pre-cluster sequences.

  • lineage_field (str, default='lineage') – Name of the lineage assignment field.

  • lineage_size_field (str, default='lineage_size') – Name of the lineage size field.

  • annotation_format (str, default='airr') – Format of the input sequence annotations. Choices are 'airr' or 'json'.

  • return_assignment_dict (bool, default=False) – If True, a dictionary linking sequence IDs to lineage names will be returned. If False, the input anndata.AnnData object will be returned, with lineage annotations included.

Returns:

output – By default (return_assignment_dict == False), an updated adata object is returned with two additional columns populated - adata.obs.bcr_lineage, which contains the lineage assignment, and adata.obs.bcr_lineage_size, which contains the lineage size. If return_assignment_dict == True, a dict mapping droplet barcodes (adata.obs_names) to lineage names is returned.

Return type:

anndata.AnnData or dict
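
Two of the quantities described above can be sketched in plain Python. This is not the clonify implementation: the function names are hypothetical, the CDR3 strings and barcodes are made up, and the sketch covers only the CDR3 length penalty (n * length_penalty_multiplier) and the derivation of lineage sizes from an assignment dictionary of the kind returned when return_assignment_dict is True.

```python
from collections import Counter


def cdr3_length_penalty(cdr3_a, cdr3_b, length_penalty_multiplier=2):
    """Penalty for CDR3s differing in length by n residues: n * multiplier."""
    return abs(len(cdr3_a) - len(cdr3_b)) * length_penalty_multiplier


def lineage_sizes(assignments):
    """Map each barcode to the number of barcodes sharing its lineage."""
    counts = Counter(assignments.values())
    return {barcode: counts[lineage] for barcode, lineage in assignments.items()}


# Hypothetical CDR3 amino acid sequences differing in length by 2.
penalty = cdr3_length_penalty("ARDYW", "ARDGSYW")

# Hypothetical assignment dict: droplet barcode -> lineage name.
assignments = {
    "AAAC-1": "lineage_a",
    "AAAG-1": "lineage_a",
    "AAAT-1": "lineage_b",
}
sizes = lineage_sizes(assignments)
print(penalty, sizes)
```

The distance calculation in clonify also involves the distance cutoff and the shared mutation bonus, which this sketch deliberately leaves out.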

scab.vdj.build_synthesis_constructs(adata, overhang_5=None, overhang_3=None, annotation_format='airr', sequence_key=None, locus_key=None, name_key=None, bcr_key='bcr', sort=True)

Builds codon-optimized synthesis constructs, including Gibson overhangs suitable for cloning IGH, IGK and IGL variable region constructs into antibody expression vectors.

See also

Thomas Tiller, Eric Meffre, Sergey Yurasov, Makoto Tsuiji, Michel C Nussenzweig, Hedda Wardemann
Efficient generation of monoclonal antibodies from single human B cells by single cell RT-PCR and expression vector cloning
Journal of Immunological Methods 2008, doi: 10.1016/j.jim.2007.09.017

Parameters:
  • adata (anndata.AnnData) – An anndata.AnnData object containing annotated BCR sequences.

  • overhang_5 (dict, optional) –

    A dict mapping the locus name to 5’ Gibson overhangs. By default, the Gibson overhangs correspond to the expression vectors in Tiller et al., 2008:

    IGH: catcctttttctagtagcaactgcaaccggtgtacac
    IGK: atcctttttctagtagcaactgcaaccggtgtacac
    IGL: atcctttttctagtagcaactgcaaccggtgtacac

    To produce constructs without 5’ Gibson overhangs, provide an empty dictionary.

  • overhang_3 (dict, optional) –

    A dict mapping the locus name to 3’ Gibson overhangs. By default, the Gibson overhangs correspond to the expression vectors in Tiller et al., 2008:

    IGH: gcgtcgaccaagggcccatcggtcttcc
    IGK: cgtacggtggctgcaccatctgtcttcatc
    IGL: ggtcagcccaaggctgccccctcggtcactctgttcccgccctcgagtgaggagcttcaagccaacaaggcc

    To produce constructs without 3’ Gibson overhangs, provide an empty dictionary.

  • sequence_key (str, default='sequence_aa') – Field containing the sequence to be codon optimized. Default is 'sequence_aa' if annotation_format == 'airr' or 'vdj_aa' if annotation_format == 'json'. Either nucleotide or amino acid sequences are acceptable.

  • locus_key (str, default='locus') – Field containing the sequence locus. Default is 'locus' if annotation_format == 'airr', or 'chain' if annotation_format == 'json'. Note that values in locus_key should match the keys in overhang_5 and overhang_3.

  • name_key (str, optional) – Field (in adata.obs) containing the name of the BCR pair. If not provided, the droplet barcode will be used.

  • bcr_key (str, default='bcr') – Field (in adata.obs) containing the annotated BCR pair.

  • sort (bool, default=True) – If True, output will be sorted by sequence name.

Returns:

sequences – A list of abutils.Sequence objects. Each Sequence object has the following descriptive properties:

id: The sequence ID, which includes the pair name and the locus.
sequence: The codon-optimized sequence, including Gibson overhangs.

If sort == True, the output list will be sorted by name_key using natsort.natsorted().

Return type:

list of Sequence objects
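
The way the overhang dictionaries are keyed by locus can be sketched as follows. This is an illustration under assumptions, not the library's code: add_overhangs is a hypothetical helper, codon optimization is not performed (the input is assumed to be an already-optimized nucleotide sequence), and the default overhangs are copied from the parameter descriptions above.

```python
# Default overhangs as listed in the parameter descriptions above.
DEFAULT_OVERHANG_5 = {
    "IGH": "catcctttttctagtagcaactgcaaccggtgtacac",
    "IGK": "atcctttttctagtagcaactgcaaccggtgtacac",
    "IGL": "atcctttttctagtagcaactgcaaccggtgtacac",
}
DEFAULT_OVERHANG_3 = {
    "IGH": "gcgtcgaccaagggcccatcggtcttcc",
    "IGK": "cgtacggtggctgcaccatctgtcttcatc",
    "IGL": "ggtcagcccaaggctgccccctcggtcactctgttcccgccctcgagtgaggagcttcaagccaacaaggcc",
}


def add_overhangs(sequence, locus, overhang_5=None, overhang_3=None):
    """Flank a sequence with the 5' and 3' overhangs for its locus.

    Hypothetical helper: None selects the defaults, while an empty dict
    yields a construct without overhangs.
    """
    if overhang_5 is None:
        overhang_5 = DEFAULT_OVERHANG_5
    if overhang_3 is None:
        overhang_3 = DEFAULT_OVERHANG_3
    # Loci missing from a dict contribute no overhang on that side.
    return overhang_5.get(locus, "") + sequence + overhang_3.get(locus, "")


with_defaults = add_overhangs("atg", "IGK")
without_overhangs = add_overhangs("atg", "IGK", overhang_5={}, overhang_3={})
print(with_defaults)
print(without_overhangs)
```

Distinguishing None from an empty dictionary is what lets callers opt out of overhangs entirely, which matches the documented behavior of passing an empty dictionary.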

scab.vdj.bcr_summary_csv(adata, leading_fields=None, include=None, exclude=None, rename=None, annotation_format='airr', output_file=None)

Summarizes annotated BCR data from adata.obs, either as a CSV file or as a Pandas DataFrame.

Parameters:
  • adata (anndata.AnnData) – An anndata.AnnData object containing annotated BCR sequences.

  • leading_fields (iterable object, optional) – A list of fields in adata.obs that should be at the start of the output data. By default, the existing column order in adata.obs is used.

  • include (iterable object, optional) – A list of columns in adata.obs that should be included in the summary output. By default, all columns in adata.obs are used.

  • exclude (iterable object, optional) – A list of columns in adata.obs that should be excluded from the summary output. By default, no columns in adata.obs are excluded.

  • rename (dict, optional) – A dict mapping adata.obs columns to new column names. Any column names not included in rename will not be renamed.

  • annotation_format (str, default='airr') – Format of the input sequence annotations. Choices are ['airr', 'json'].

  • output_file (str, optional) – Path to the output file. If not provided, the summary output will be returned as a Pandas DataFrame.

Return type:

If output_file is provided, the summary output will be written to the file in CSV format and nothing is returned. If output_file is not provided, the summary data will be returned as a Pandas DataFrame.
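
The interaction of leading_fields, include, exclude and rename can be sketched on a plain list of column names. The helper select_columns and the column names are hypothetical, and the order of operations (include, then exclude, then reordering, then renaming) is an assumption rather than documented behavior.

```python
def select_columns(columns, leading_fields=None, include=None, exclude=None, rename=None):
    """Return output column names after filtering, reordering and renaming.

    Illustrative sketch only; the order of operations is an assumption.
    """
    # Keep only included columns (all columns if include is not given).
    selected = [c for c in columns if include is None or c in include]
    # Drop excluded columns (none if exclude is not given).
    selected = [c for c in selected if exclude is None or c not in exclude]
    # Move leading fields to the front, preserving the order requested.
    if leading_fields:
        lead = [c for c in leading_fields if c in selected]
        selected = lead + [c for c in selected if c not in lead]
    # Rename mapped columns; unmapped columns keep their names.
    rename = rename or {}
    return [rename.get(c, c) for c in selected]


columns = ["barcode", "lineage", "locus", "v_gene"]
result = select_columns(
    columns,
    leading_fields=["v_gene"],
    exclude=["locus"],
    rename={"lineage": "clone"},
)
print(result)
```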

scab.vdj.to_fasta(adata: AnnData, name: str | None = None, receptor: str = 'bcr', sequence_field: str = 'sequence', locus_field: str = 'locus', pairs_only: bool = False, pairing_status: Iterable | str | None = None, fasta_file: str | None = None) → str | None

Write BCR or TCR sequences to a FASTA file.

Parameters:
  • adata (AnnData) – The input data, which should contain annotated BCR or TCR sequences.

  • name (str, optional) – Sequence name to be used. Can be either a column in adata.obs or the name of an annotation field present in each Pair object. If not provided, pair.name will be used.

  • receptor (str, default='bcr') – Receptor type. Options are 'bcr' and 'tcr'.

  • sequence_field (str, default='sequence') – Field containing the sequence to be written to the FASTA file. Default is "sequence". Must be present in each BCR/TCR sequence annotation.

  • locus_field (str, default='locus') – Field containing the sequence locus. Default is "locus". Must be present in each BCR/TCR sequence annotation.

  • pairs_only (bool, default=False) – If True, only paired sequences will be included. Pairing is determined by calling Pair.is_pair. Default is False, meaning all sequences, even unpaired, will be included.

  • pairing_status (str or iterable, optional) – Pairing status(es) to include. Options are any of the annotations produced by scab.vdj.get_pairing_info(). Multiple statuses can be included as a list. If not provided, all sequences will be included.

  • fasta_file (str, optional) – Path to the output FASTA file. If not provided, the FASTA sequences will be printed to the console.

Returns:

Output is written to fasta_file if provided. If not, the FASTA sequences are printed to the console.
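
The write-or-print behavior described above can be sketched in a few lines. Both helpers are hypothetical, as are the record names (which here combine a pair name with the locus) and the sequences; the real function builds its records from the Pair objects stored in adata.

```python
def to_fasta_string(records):
    """Format (name, sequence) tuples as FASTA text."""
    return "\n".join(f">{name}\n{sequence}" for name, sequence in records)


def write_or_print(records, fasta_file=None):
    """Write FASTA text to fasta_file if given, otherwise print it."""
    fasta = to_fasta_string(records)
    if fasta_file is None:
        print(fasta)
    else:
        with open(fasta_file, "w") as handle:
            handle.write(fasta + "\n")


# Hypothetical records: names combine a pair name with the locus.
records = [("AAACCTGAGAACTGTA-1_IGH", "CAGGTGCAGCTG"), ("AAACCTGAGAACTGTA-1_IGK", "GACATCCAGATG")]
fasta_text = to_fasta_string(records)
write_or_print(records)
```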