scabranger#

CellRanger is a great way to process 10x Genomics data, but running it can often be cumbersome. This is especially true if you are running multiple samples, or if your data is spread across multiple sequencing runs. A typical workflow might look like this:

$ cellranger mkfastq --run path/to/myrun --csv samplesheet.csv --output-dir fastq
$ cellranger multi --id sample1 --csv sample1_config.csv
$ cellranger multi --id sample2 --csv sample2_config.csv
...
$ cellranger multi --id sampleN --csv sampleN_config.csv

Each of these steps can take quite awhile to run, and if you have a lot of samples, you need to keep coming back to the terminal to start the next command. This is where scabranger comes in. It is a simple wrapper around CellRanger that allows you to automate the processing of multiple samples across one or more sequencing runs. Running scabranger is as simple as:

$ scabranger -p path/to/myproject -c config.yaml

A scabranger run is defined by a single YAML file that specifies the location of your sequencing run data, the samples to process, and the libraries to use for each sample. Here is an example:

# SEQUENCING RUNS
sequencing_runs:
    MyRun:
        path: path/to/my_run
        is_compressed: False
        simple_csv: my_run.csv
        copy_to_project: False
    MyOtherRun:
        url: https://s3.amazonaws.com/brineylab/runs/my_other_run.tar.gz
        samplesheet: my_other_run.csv

# SAMPLES
samples:
    sample1:
        Gene Expression: sample1_gex
        Antibody Capture: sample1_features
        VDJ-B: sample1_ig
    sample2:
        Gene Expression: sample2_gex
        VDJ-T: sample1_tcr

# REFERENCES
gex_reference:
    default: /path/to/default_gex_reference
vdj_reference:
    default: /path/to/default_vdj_reference
    sample2: /path/to/alternate_vdj_reference
feature_reference:
    default: /path/to/default_feature_reference

# CELLRANGER OPTIONS
cli_options:
    mkfastq:
        default: ""
    multi:
        default: ""
        sample2: "--expect-cells=5000 --no-bam=true"

# MISCELLENEOUS OPTIONS
uiport: 40909
cellranger: cellranger

scabranger configuration#

The only thing you need to run scabranger (aside from data, of course) is a YAML-formatted config file and the name of a project directory into which the results will be deposited. The config file specifies the location of your sequencing run data, the samples to process, the libraries that have been generated for each sample, and any other options you want to pass to CellRanger. Let’s walk through each of these sections in a little more detail.


sequencing runs#

This section is required, and you must provide information for at least one sequencing run. Each run is identified by a unique name – you can pick whatever name you want, but names with spaces and/or special characters may cause unexpected problems. The following options are available:

  • path: The path to the sequencing run data. This should be a local path to a directory containing the sequencing run data or to a compressed file containing the sequencing run data. This option is mutually exclusive with url.

  • url: A URL to a compressed file containing the sequencing run data. This option is mutually exclusive with path.

Note

at least one of path or url must be provided.

  • simple_csv: A simple CSV file containing the sample name and the index sequences for each sample. This option is mutually exclusive with samplesheet.

  • samplesheet: A CSV file containing the sample name and the index sequences for each sample. This option is mutually exclusive with simple_csv.

Note

at least one of simple_csv or samplesheet must be provided.

  • is_compressed: A boolean indicating whether the sequencing run data is compressed.

  • copy_to_project: A boolean indicating whether the sequencing run data should be copied to the project directory. If True, the data will be copied to the project directory. Only really applicable when paired with path, because if url is provided, the linked data is downloaded into the project directory regardless of the value of copy_to_project. The default value is True.

The library names in the simple_csv or samplesheet files must match the library names in the samples configuration block.

Tip

If libraries are present in more than one sequencing run (for example, the libraries were re-sequenced to increase the total amount of data generated), the matched libraries should be given identical names in the samplesheet or simple_csv files for each run. If named in this way, scabranger can automatically combine the data from all applicable runs when running CellRanger.


samples#

This section is required, and you must provide information for at least one sample. Each sample is identified by a unique name – you can pick whatever name you want, but names with spaces and/or special characters may cause unexpected problems. For each sample, you You must specify the libraries that have been generated using a key/value pair in which the key is the name of the library type and the value is the name of the library.

Warning

While samples can be given arbitrary names, library names must match the name of a library present in the samplesheet or simple_csv files provided in the sequencing runs configuration block.

The following library types are available:

  • Gene Expression: The name of the library containing the gene expression data for this sample.

  • VDJ-B: The name of the library containing the B-cell VDJ data for this sample.

  • VDJ-T: The name of the library containing the T-cell VDJ data for this sample.

  • VDJ-T-GD: The name of the library containing the T-cell VDJ data (gamma-delta chains) for this sample.

  • Antibody Capture: The name of the library containing the antibody capture data for this sample.

  • Antigen Capture: The name of the library containing Barcode Enabled Antigen Mapping (BEAM) data for this sample.

  • CRISPR Guide Capture: The name of the library containing the CRISPR guide capture data for this sample.

  • Custom: The name of the library containing custom feature barcode data for this sample.

At least one library must be provided for each sample. If you do not have data for a particular library type, you can omit it from the sample definition. For example, if you only have gene expression data for a sample, you can define the sample like this:

samples:
    sample1:
        Gene Expression: sample1_gex

references#

This section is required, and you must provide at least one reference for each library type you are using. Each reference type (GEX, VDJ, and Feature) has a default reference that will be used for all samples unless a sample-specific reference is provided. The default references are specified using the default key. Sample-specific references are specified using the sample name as the key. For example, if you have a sample named sample2 that uses a different VDJ reference than the default, you would specify it like this:

vdj_reference:
    default: /path/to/default_vdj_reference
    sample2: /path/to/alternate_vdj_reference

cli options#

This section is optional, and you can provide options for any or all of the CellRanger commands you want to run. Each command has a default option that will be used for all samples unless a sample-specific option is provided. Sample-specific options are specified using the sample name as the key. For example, if you have a sample named sample2 that uses a different number of expected cells than the default and for which you would prefer that BAMs not be generated, you would specify the additional options (which will be passed diretly to cellranger multi) like this:

cli_options:
    multi:
        default: ""
        sample2: "--expect-cells=5000 --no-bam=true"

miscellaneous options#

This section is optional, and you can provide a few additional options that probably aren’t very common but are included for completeness. The following options are available:

  • uiport: The port to use for the CellRanger runtime UI. The default value is 40909.

  • cellranger: The command to use to invoke CellRanger. The default value is cellranger.

    This is useful if you have multiple versions of CellRanger installed and want to use a specific version for a particular run. If the command is not in your environment’s PATH, you must provide the full path to the command.

The following example shows how to update both the uiport and cellranger options:

uiport: 40404
cellranger: path/to/cellranger7.0