vdj
- scab.vdj.merge(adata: AnnData, vdj_file: str | None = None, vdj_annot: str | None = None, vdj_field: str = 'bcr', vdj_format: Literal['fasta', 'delimited', 'json'] = 'fasta', vdj_delimiter: str = '\t', vdj_id_key: str = 'sequence_id', vdj_sequence_key: str = 'sequence', vdj_id_delimiter: str = '_', vdj_id_delimiter_num: int = 1, receptor: str = 'bcr', chain_selection_func: Callable | None = None, abstar_output_format: Literal['airr', 'json'] = 'airr', abstar_germ_db: str = 'human', verbose: bool = False) -> AnnData
Merge VDJ (either BCR or TCR) sequences into an AnnData object.
- Parameters:
adata (AnnData) – AnnData object, typically obtained by first running scab.io.read_10x_mtx(). Required.
vdj_file (str, optional) –
Path to a file containing VDJ (BCR or TCR) data. The file can be in one of several formats:
FASTA-formatted file, as output by CellRanger
delimited text file, containing annotated VDJ sequences
JSON-formatted file, containing annotated VDJ sequences
vdj_annot (str, optional) – Path to the CSV-formatted VDJ annotations file produced by CellRanger. Matching the annotation file to vdj_file is preferred: if 'all_contig.fasta' is the supplied vdj_file, then 'all_contig_annotations.csv' is the appropriate annotation file.
vdj_format (str, default='fasta') – Format of the input vdj_file. Options are: 'fasta', 'delimited', and 'json'. If vdj_format is 'fasta', abstar will be run on the input data to obtain annotated VDJ data. By default, abstar will produce AIRR-formatted (tab-delimited) annotations.
vdj_delimiter (str, default='\t') – Delimiter used in vdj_file. Only used if vdj_format is 'delimited'. Default is '\t', which conforms to AIRR-C data standards.
vdj_id_key (str, default='sequence_id') – Name of the column or field in vdj_file that corresponds to the sequence ID.
vdj_sequence_key (str, default='sequence') – Name of the column or field in vdj_file that corresponds to the VDJ sequence.
vdj_id_delimiter (str, default='_') – The delimiter used to separate the droplet and contig components of the sequence ID. For example, default CellRanger names are formatted as 'AAACCTGAGAACTGTA-1_contig_1', where 'AAACCTGAGAACTGTA-1' is the droplet identifier and 'contig_1' is the contig identifier.
vdj_id_delimiter_num (int, default=1) – The occurrence (1-based numbering) of the vdj_id_delimiter used to split the sequence ID.
abstar_output_format (str, default='airr') – Format for abstar annotations. Only used if vdj_format is 'fasta'. Options are 'airr', 'json' and 'tabular'.
abstar_germ_db (str, default='human') – Germline database to be used for annotation of VDJ data. Built-in abstar options include: 'human', 'macaque', 'mouse' and 'humouse'. Only used if vdj_format is 'fasta'.
verbose (bool, default=False) – Print progress updates.
- Returns:
adata – An AnnData object containing gene expression data, with VDJ information located at adata.obs.{vdj_field}.
- Return type:
AnnData
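The interaction between vdj_id_delimiter and vdj_id_delimiter_num can be illustrated with a small standalone sketch. The helper name split_sequence_id is hypothetical and is not part of scab; it only assumes that the sequence ID is split at the Nth (1-based) occurrence of the delimiter, as the parameter descriptions suggest.

```python
def split_sequence_id(sequence_id, delimiter="_", delimiter_num=1):
    """Split an ID at the Nth (1-based) occurrence of `delimiter`.

    Hypothetical helper for illustration only; not a scab function.
    Returns a (droplet, contig) tuple.
    """
    parts = sequence_id.split(delimiter)
    if delimiter_num < 1 or delimiter_num >= len(parts):
        raise ValueError(
            f"{sequence_id!r} has fewer than {delimiter_num} occurrence(s) of {delimiter!r}"
        )
    # Everything before the Nth delimiter is the droplet identifier,
    # everything after it is the contig identifier.
    droplet = delimiter.join(parts[:delimiter_num])
    contig = delimiter.join(parts[delimiter_num:])
    return droplet, contig


# Default CellRanger naming: split at the first underscore.
droplet, contig = split_sequence_id("AAACCTGAGAACTGTA-1_contig_1")

# A made-up ID with an extra prefix: split at the second underscore instead.
prefixed = split_sequence_id("sample1_AAACCTGAGAACTGTA-1_contig_1", delimiter_num=2)
```

With the defaults, the droplet component matches the barcodes that typically appear in adata.obs_names, which is presumably why the split position matters when IDs carry extra prefixes.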
- scab.vdj.merge_bcr(adata: AnnData, bcr_file: str | None = None, bcr_annot: str | None = None, bcr_format: Literal['fasta', 'delimited', 'json'] = 'fasta', bcr_delimiter: str = '\t', bcr_id_key: str = 'sequence_id', bcr_sequence_key: str = 'sequence', bcr_id_delimiter: str = '_', bcr_id_delimiter_num: int = 1, chain_selection_func: Callable | None = None, abstar_output_format: Literal['airr', 'json'] = 'airr', abstar_germ_db: str = 'human', verbose: bool = True) -> AnnData
Merge BCR sequences into an AnnData object.
- Parameters:
adata (AnnData) – AnnData object, typically obtained by first running scab.io.read_10x_mtx(). Required.
bcr_file (str, optional) –
Path to a file containing BCR data. The file can be in one of several formats:
FASTA-formatted file, as output by CellRanger
delimited text file, containing annotated BCR sequences
JSON-formatted file, containing annotated BCR sequences
bcr_annot (str, optional) – Path to the CSV-formatted BCR annotations file produced by CellRanger. Matching the annotation file to bcr_file is preferred: if 'all_contig.fasta' is the supplied bcr_file, then 'all_contig_annotations.csv' is the appropriate annotation file.
bcr_format (str, default='fasta') – Format of the input bcr_file. Options are: 'fasta', 'delimited', and 'json'. If bcr_format is 'fasta', abstar will be run on the input data to obtain annotated BCR data. By default, abstar will produce AIRR-formatted (tab-delimited) annotations.
bcr_delimiter (str, default='\t') – Delimiter used in bcr_file. Only used if bcr_format is 'delimited'. Default is '\t', which conforms to AIRR-C data standards.
bcr_id_key (str, default='sequence_id') – Name of the column or field in bcr_file that corresponds to the sequence ID.
bcr_sequence_key (str, default='sequence') – Name of the column or field in bcr_file that corresponds to the VDJ sequence.
bcr_id_delimiter (str, default='_') – The delimiter used to separate the droplet and contig components of the sequence ID. For example, default CellRanger names are formatted as 'AAACCTGAGAACTGTA-1_contig_1', where 'AAACCTGAGAACTGTA-1' is the droplet identifier and 'contig_1' is the contig identifier.
bcr_id_delimiter_num (int, default=1) – The occurrence (1-based numbering) of the bcr_id_delimiter used to split the sequence ID.
abstar_output_format (str, default='airr') – Format for abstar annotations. Only used if bcr_format is 'fasta'. Options are 'airr', 'json' and 'tabular'.
abstar_germ_db (str, default='human') – Germline database to be used for annotation of BCR data. Built-in abstar options include: 'human', 'macaque', 'mouse' and 'humouse'. Only used if bcr_format is 'fasta'.
verbose (bool, default=True) – Print progress updates.
- Returns:
adata – An AnnData object containing gene expression data, with BCR information located at adata.obs.bcr.
- Return type:
AnnData
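To make the roles of bcr_delimiter, bcr_id_key and bcr_sequence_key concrete, the sketch below reads a small tab-delimited table with the standard library and groups the rows by droplet. It is an illustration only and does not reproduce scab's parsing; the function name group_by_droplet, the example rows and the 'locus' column are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def group_by_droplet(handle, delimiter="\t", id_key="sequence_id",
                     sequence_key="sequence", id_delimiter="_",
                     id_delimiter_num=1):
    """Group (contig, sequence) tuples by droplet barcode.

    Hypothetical helper for illustration only; not a scab function.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter=delimiter):
        parts = row[id_key].split(id_delimiter)
        droplet = id_delimiter.join(parts[:id_delimiter_num])
        contig = id_delimiter.join(parts[id_delimiter_num:])
        grouped[droplet].append((contig, row[sequence_key]))
    return dict(grouped)


# A tiny made-up table using AIRR-style column names.
example = io.StringIO(
    "sequence_id\tsequence\tlocus\n"
    "AAACCTGAGAACTGTA-1_contig_1\tCAGGTGCAG\tIGH\n"
    "AAACCTGAGAACTGTA-1_contig_2\tGACATCCAG\tIGK\n"
    "AAACCTGAGCTAGTGG-1_contig_1\tGAGGTGCAG\tIGH\n"
)
by_droplet = group_by_droplet(example)
```

A file with differently named columns would only need different id_key and sequence_key values, which is the situation the corresponding merge_bcr parameters appear designed for.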
- scab.vdj.merge_tcr(adata: AnnData, tcr_file: str | None = None, tcr_annot: str | None = None, tcr_format: Literal['fasta', 'delimited', 'json'] = 'fasta', tcr_delimiter: str = '\t', tcr_id_key: str = 'sequence_id', tcr_sequence_key: str = 'sequence', tcr_id_delimiter: str = '_', tcr_id_delimiter_num: int = 1, chain_selection_func: Callable | None = None, abstar_output_format: Literal['airr', 'json'] = 'airr', abstar_germ_db: str = 'human', verbose: bool = True) -> AnnData
Merge TCR sequences into an AnnData object.
- Parameters:
adata (AnnData) – AnnData object, typically obtained by first running scab.io.read_10x_mtx(). Required.
tcr_file (str, optional) –
Path to a file containing TCR data. The file can be in one of several formats:
FASTA-formatted file, as output by CellRanger
delimited text file, containing annotated TCR sequences
JSON-formatted file, containing annotated TCR sequences
tcr_annot (str, optional) – Path to the CSV-formatted TCR annotations file produced by CellRanger. Matching the annotation file to tcr_file is preferred: if 'all_contig.fasta' is the supplied tcr_file, then 'all_contig_annotations.csv' is the appropriate annotation file.
tcr_format (str, default='fasta') – Format of the input tcr_file. Options are: 'fasta', 'delimited', and 'json'. If tcr_format is 'fasta', abstar will be run on the input data to obtain annotated TCR data. By default, abstar will produce AIRR-formatted (tab-delimited) annotations.
tcr_delimiter (str, default='\t') – Delimiter used in tcr_file. Only used if tcr_format is 'delimited'. Default is '\t', which conforms to AIRR-C data standards.
tcr_id_key (str, default='sequence_id') – Name of the column or field in tcr_file that corresponds to the sequence ID.
tcr_sequence_key (str, default='sequence') – Name of the column or field in tcr_file that corresponds to the VDJ sequence.
tcr_id_delimiter (str, default='_') – The delimiter used to separate the droplet and contig components of the sequence ID. For example, default CellRanger names are formatted as 'AAACCTGAGAACTGTA-1_contig_1', where 'AAACCTGAGAACTGTA-1' is the droplet identifier and 'contig_1' is the contig identifier.
tcr_id_delimiter_num (int, default=1) – The occurrence (1-based numbering) of the tcr_id_delimiter used to split the sequence ID.
abstar_output_format (str, default='airr') – Format for abstar annotations. Only used if tcr_format is 'fasta'. Options are 'airr', 'json' and 'tabular'.
abstar_germ_db (str, default='human') – Germline database to be used for annotation of TCR data. Built-in abstar options include: 'human', 'macaque', 'mouse' and 'humouse'. Only used if tcr_format is 'fasta'.
verbose (bool, default=True) – Print progress updates.
- Returns:
adata – An AnnData object containing gene expression data, with TCR information located at adata.obs.tcr.
- Return type:
AnnData
- scab.vdj.get_pairing_info(pairs: Iterable[Pair], receptor: str) -> Iterable
Get pairing information for a list of Pair objects.
- Parameters:
pairs (Iterable[Pair]) – List of Pair objects.
receptor (str) – Receptor type. Options are 'bcr' and 'tcr'.
- Returns:
pair_status
- Return type:
Iterable
- scab.vdj.clonify(adata, distance_cutoff=0.32, shared_mutation_bonus=0.65, length_penalty_multiplier=2, preclustering=False, preclustering_threshold=0.65, preclustering_field='cdr3_nt', lineage_field='lineage', lineage_size_field='lineage_size', annotation_format='airr', return_assignment_dict=False)
Assigns BCR sequences to clonal lineages using the clonify [Briney16] algorithm.
See also
Bryan Briney, Khoa Le, Jiang Zhu, and Dennis R Burton. Clonify: unseeded antibody lineage assignment from next-generation sequencing data. Scientific Reports 2016. https://doi.org/10.1038/srep23901
- Parameters:
adata (anndata.AnnData) – AnnData object containing annotated sequence data at adata.obs.bcr. If data was read using scab.read_10x_mtx(), BCR data should already be in the correct location.
distance_cutoff (float, default=0.32) – Distance threshold for lineage clustering.
shared_mutation_bonus (float, default=0.65) – Bonus applied for each shared V-gene mutation.
length_penalty_multiplier (int, default=2) – Multiplier for the CDR3 length penalty. Default is 2, resulting in CDR3s that differ by n amino acids being penalized n * 2.
preclustering (bool, default=False) – If True, V/J groups are pre-clustered on the preclustering_field sequence, which can potentially speed up lineage assignment and reduce memory usage. If False, each V/J group is processed in its entirety without pre-clustering.
preclustering_threshold (float, default=0.65) – Identity threshold for pre-clustering the V/J groups prior to lineage assignment.
preclustering_field (str, default='cdr3_nt') – Annotation field on which to pre-cluster sequences.
lineage_field (str, default='lineage') – Name of the lineage assignment field.
lineage_size_field (str, default='lineage_size') – Name of the lineage size field.
annotation_format (str, default='airr') – Format of the input sequence annotations. Choices are 'airr' or 'json'.
return_assignment_dict (bool, default=False) – If True, a dictionary linking sequence IDs to lineage names will be returned. If False, the input anndata.AnnData object will be returned, with lineage annotations included.
- Returns:
output – By default (return_assignment_dict == False), an updated adata object is returned with two additional columns populated: adata.obs.bcr_lineage, which contains the lineage assignment, and adata.obs.bcr_lineage_size, which contains the lineage size. If return_assignment_dict == True, a dict mapping droplet barcodes (adata.obs_names) to lineage names is returned.
- Return type:
anndata.AnnData or dict
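The way distance_cutoff, shared_mutation_bonus and length_penalty_multiplier might interact can be sketched with a toy pairwise score. Only the three parameter meanings come from the descriptions above; how the terms are combined here (edit distance plus length penalty, minus the mutation bonus, normalized by the shorter CDR3) is an assumption made for illustration and should not be read as scab's or clonify's actual implementation. The function names and example CDR3s are hypothetical.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def sketch_distance(cdr3_a, cdr3_b, shared_mutations,
                    shared_mutation_bonus=0.65,
                    length_penalty_multiplier=2):
    """Toy pairwise score (ASSUMED combination of terms, not scab's code)."""
    # CDR3s that differ by n amino acids are penalized n * multiplier.
    length_penalty = abs(len(cdr3_a) - len(cdr3_b)) * length_penalty_multiplier
    # Each shared V-gene mutation earns a bonus that lowers the score.
    bonus = shared_mutations * shared_mutation_bonus
    shortest = min(len(cdr3_a), len(cdr3_b))
    score = (length_penalty + levenshtein(cdr3_a, cdr3_b) - bonus) / shortest
    return max(score, 0.0)


distance_cutoff = 0.32

# One substitution, equal length, no shared mutations: 1 / 12.
close = sketch_distance("ARDYYGSSYFDY", "ARDYYGSGYFDY", shared_mutations=0)

# Two extra residues: length penalty 4 + edit distance 2, over length 12.
far = sketch_distance("ARDYYGSSYFDY", "ARDYYGSSYFDYWG", shared_mutations=0)

# The same pair with four shared mutations: (6 - 2.6) / 12.
rescued = sketch_distance("ARDYYGSSYFDY", "ARDYYGSSYFDYWG", shared_mutations=4)
```

Under these assumptions, the first and third pairs fall below the 0.32 cutoff and the second does not, which shows how shared mutations can pull sequences with different CDR3 lengths into the same lineage.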
- scab.vdj.build_synthesis_constructs(adata, overhang_5=None, overhang_3=None, annotation_format='airr', sequence_key=None, locus_key=None, name_key=None, bcr_key='bcr', sort=True)
Builds codon-optimized synthesis constructs, including Gibson overhangs suitable for cloning IGH, IGK and IGL variable region constructs into antibody expression vectors.
See also
Thomas Tiller, Eric Meffre, Sergey Yurasov, Makoto Tsuiji, Michel C Nussenzweig, Hedda Wardemann. Efficient generation of monoclonal antibodies from single human B cells by single cell RT-PCR and expression vector cloning. Journal of Immunological Methods 2008, doi: 10.1016/j.jim.2007.09.017
- Parameters:
adata (anndata.AnnData) – An anndata.AnnData object containing annotated BCR sequences.
overhang_5 (dict, optional) –
A dict mapping the locus name to 5’ Gibson overhangs. By default, Gibson overhangs corresponding to the expression vectors in Tiller et al., 2008 are used:
IGH: catcctttttctagtagcaactgcaaccggtgtacac
IGK: atcctttttctagtagcaactgcaaccggtgtacac
IGL: atcctttttctagtagcaactgcaaccggtgtacac
To produce constructs without 5’ Gibson overhangs, provide an empty dictionary.
overhang_3 (dict, optional) –
A dict mapping the locus name to 3’ Gibson overhangs. By default, Gibson overhangs corresponding to the expression vectors in Tiller et al., 2008 are used:
IGH: gcgtcgaccaagggcccatcggtcttcc
IGK: cgtacggtggctgcaccatctgtcttcatc
IGL: ggtcagcccaaggctgccccctcggtcactctgttcccgccctcgagtgaggagcttcaagccaacaaggcc
To produce constructs without 3’ Gibson overhangs, provide an empty dictionary.
sequence_key (str, default='sequence_aa') – Field containing the sequence to be codon optimized. Default is 'sequence_aa' if annotation_format == 'airr' or 'vdj_aa' if annotation_format == 'json'. Either nucleotide or amino acid sequences are acceptable.
locus_key (str, default='locus') – Field containing the sequence locus. Default is 'locus' if annotation_format == 'airr', or 'chain' if annotation_format == 'json'. Note that values in locus_key should match the keys in overhang_5 and overhang_3.
name_key (str, optional) – Field (in adata.obs) containing the name of the BCR pair. If not provided, the droplet barcode will be used.
bcr_key (str, default='bcr') – Field (in adata.obs) containing the annotated BCR pair.
sort (bool, default=True) – If True, output will be sorted by sequence name.
- Returns:
sequences – A list of abutils.Sequence objects. Each Sequence object has the following descriptive properties:
id: The sequence ID, which includes the pair name and the locus.
sequence: The codon-optimized sequence, including Gibson overhangs.
If sort == True, the output list will be sorted by name_key using natsort.natsorted().
- Return type:
list of Sequence objects
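The overhang handling described above can be sketched in a few lines: the locus selects a 5’ and a 3’ overhang, and an empty dictionary disables them. The helper add_overhangs is hypothetical and skips codon optimization entirely; the overhang sequences are copied from the parameter descriptions, while the treatment of a missing locus (no overhang added) is an assumption.

```python
# Default overhangs as listed in the parameter descriptions.
DEFAULT_OVERHANG_5 = {
    "IGH": "catcctttttctagtagcaactgcaaccggtgtacac",
    "IGK": "atcctttttctagtagcaactgcaaccggtgtacac",
    "IGL": "atcctttttctagtagcaactgcaaccggtgtacac",
}
DEFAULT_OVERHANG_3 = {
    "IGH": "gcgtcgaccaagggcccatcggtcttcc",
    "IGK": "cgtacggtggctgcaccatctgtcttcatc",
    "IGL": "ggtcagcccaaggctgccccctcggtcactctgttcccgccctcgagtgaggagcttcaagccaacaaggcc",
}


def add_overhangs(sequence, locus, overhang_5=None, overhang_3=None):
    """Flank `sequence` with locus-specific overhangs.

    Hypothetical helper for illustration only; not a scab function.
    None selects the defaults, while an empty dict yields no overhang.
    """
    if overhang_5 is None:
        overhang_5 = DEFAULT_OVERHANG_5
    if overhang_3 is None:
        overhang_3 = DEFAULT_OVERHANG_3
    # ASSUMPTION: a locus missing from the dict simply gets no overhang.
    return overhang_5.get(locus, "") + sequence + overhang_3.get(locus, "")


insert = "CAGGTGCAGCTGGTGGAG"  # made-up placeholder for an optimized sequence
with_defaults = add_overhangs(insert, "IGH")
without_overhangs = add_overhangs(insert, "IGH", overhang_5={}, overhang_3={})
```

Testing for None explicitly, rather than for falsiness, is what lets an empty dictionary mean "no overhangs" instead of falling back to the defaults.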
- scab.vdj.bcr_summary_csv(adata, leading_fields=None, include=None, exclude=None, rename=None, annotation_format='airr', output_file=None)
Summarizes annotated BCR data, either writing the summary to a CSV file or returning it as a Pandas DataFrame.
- Parameters:
adata (anndata.AnnData) – An anndata.AnnData object containing annotated BCR sequences.
leading_fields (iterable object, optional) – A list of fields in adata.obs that should be at the start of the output data. By default, the existing column order in adata.obs is used.
include (iterable object, optional) – A list of columns in adata.obs that should be included in the summary output. By default, all columns in adata.obs are used.
exclude (iterable object, optional) – A list of columns in adata.obs that should be excluded from the summary output. By default, no columns in adata.obs are excluded.
rename (dict, optional) – A dict mapping adata.obs columns to new column names. Any column names not included in rename will not be renamed.
annotation_format (str, default='airr') – Format of the input sequence annotations. Choices are ['airr', 'json'].
output_file (str, optional) – Path to the output file. If not provided, the summary output will be returned as a Pandas DataFrame.
- Return type:
If output_file is provided, the summary output will be written to the file in CSV format and nothing is returned. If output_file is not provided, the summary data will be returned as a Pandas DataFrame.
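The column-selection parameters can be read as a small pipeline, sketched below on a plain list of column names. The order of operations (include, then exclude, then leading_fields, then rename) is an assumption, as are the function name select_columns and the example column names; scab may resolve conflicts between these options differently.

```python
def select_columns(columns, leading_fields=None, include=None,
                   exclude=None, rename=None):
    """Return (original, output) column-name pairs in output order.

    Hypothetical helper for illustration only; not a scab function.
    """
    # Keep only included columns (all columns if include is not given).
    selected = [c for c in columns if include is None or c in include]
    # Drop excluded columns (none if exclude is not given).
    selected = [c for c in selected if exclude is None or c not in exclude]
    # Move leading fields to the front, preserving the remaining order.
    leading = [c for c in (leading_fields or []) if c in selected]
    ordered = leading + [c for c in selected if c not in leading]
    # Columns missing from `rename` keep their original names.
    rename = rename or {}
    return [(c, rename.get(c, c)) for c in ordered]


columns = ["barcode", "leiden", "bcr_lineage", "bcr_lineage_size"]
result = select_columns(
    columns,
    leading_fields=["bcr_lineage"],
    exclude=["leiden"],
    rename={"barcode": "cell"},
)
```

Here the lineage column moves to the front, the clustering column is dropped, and only the barcode column is renamed.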
- scab.vdj.to_fasta(adata: AnnData, name: str | None = None, receptor: str = 'bcr', sequence_field: str = 'sequence', locus_field: str = 'locus', pairs_only: bool = False, pairing_status: Iterable | str | None = None, fasta_file: str | None = None) -> str | None
Write BCR or TCR sequences to a FASTA file.
- Parameters:
adata (AnnData) – The input data, which should contain annotated BCR or TCR sequences.
name (str, optional) – Sequence name to be used. Can be either a column in adata.obs or the name of an annotation field present in each Pair object. If not provided, pair.name will be used.
receptor (str, default='bcr') – Receptor type. Options are 'bcr' and 'tcr'.
sequence_field (str, default='sequence') – Field containing the sequence to be written to the FASTA file. Default is "sequence". Must be present in each BCR/TCR sequence annotation.
locus_field (str, default='locus') – Field containing the sequence locus. Default is "locus". Must be present in each BCR/TCR sequence annotation.
pairs_only (bool, default=False) – If True, only paired sequences will be included. Pairing is determined by calling Pair.is_pair. Default is False, meaning all sequences, even unpaired, will be included.
pairing_status (str or iterable, optional) – Pairing status(es) to include. Options are any of the annotations produced by scab.vdj.get_pairing_info(). Multiple statuses can be included as a list. If not provided, all sequences will be included.
fasta_file (str, optional) – Path to the output FASTA file. If not provided, the FASTA sequences will be printed to the console.
- Returns:
Output is written to fasta_file if provided. If not, the FASTA sequences are printed to the console.
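As a rough picture of the output, the sketch below formats annotation dictionaries as FASTA records using the sequence_field and locus_field keys. It does not call scab; the function name format_fasta, the "{name}_{locus}" header layout and the example annotations are assumptions made for illustration.

```python
def format_fasta(annotations, name, sequence_field="sequence",
                 locus_field="locus"):
    """Format annotation dicts as FASTA text.

    Hypothetical helper for illustration only; not a scab function.
    ASSUMPTION: each header is "<name>_<locus>".
    """
    lines = []
    for annotation in annotations:
        lines.append(f">{name}_{annotation[locus_field]}")
        lines.append(annotation[sequence_field])
    return "\n".join(lines)


# Made-up annotations for one paired BCR.
pair_annotations = [
    {"locus": "IGH", "sequence": "CAGGTGCAG"},
    {"locus": "IGK", "sequence": "GACATCCAG"},
]
fasta_text = format_fasta(pair_annotations, name="AAACCTGAGAACTGTA-1")
```

The resulting text could then be written to a file or printed, mirroring the two behaviors described for fasta_file.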