Tutorial¶

Motivation¶

We’ve all found ourselves hard coding paths to input files in analysis scripts and notebooks. Putting paths and metadata in spreadsheets is an improvement, but still requires significant code to work with conveniently and doesn’t handle multiple, loosely defined schemas. Databases are probably the right solution for huge projects but are inconvenient at smaller scale.

With Sefara, you write Python code to specify the files you’re analyzing and their metadata. Defining your datasets with code may seem strange but it allows for a lot flexibility. These files describing your datasets are called “resource collections” and can be put under version control and in general treated like code. Sefara is a Python library and a few commandline tools for reading, writing, and querying these files.

Essentials¶

Specifying a resource collection¶

A “resource” describes an entity used in an analysis. Resources often point to files on a filesystem.

Users define a “resource collection” in a file that sefara reads. Two formats are supported for these files: executable Python and JSON.

Here’s a resource collection defined using Python:

from sefara import export

export(
    "patientA_sequencing_blood_2010",
    path="/path/to/file1.bam",
    tags=["patient_A", "sequencing", "2010"],
    capture_kit="Agilent sureselect",
    description="Sample taken from normal blood"
)

export(
    "patientA_sequencing_tumor_2012",
    path="/path/to/file2.bam",
    tags=["patient_A", "sequencing", "2012"],
    capture_kit="Nimblegen",
    description="Primary tumor, left ovary"
)

export(
    "patientB_sequencing_normal_tissue_2012",
    path="/path/to/file3.bam",
    tags=["patient_B", "sequencing", "2012"],
    capture_kit="unknown",
    description="Matched normal tissue, pancreas"
)

export(
    "patientB_sequencing_tumor_2014",
    path="/path/to/file4.bam",
    tags=["patient_B", "sequencing", "2014"],
    capture_kit="Agilent sureselect",
    description="liver metastasis"
)

export(
    "patientA_somatic_variant_calls",
    path="/path/to/variants.vcf",
    tags=["patient_A", "variants"],
    tool="strelka",
    normal_reads="patientA_sequencing_blood_2010",
    tumor_reads="patientA_sequencing_tumor_2012"
)

This defines five resources named patientA_sequencing_blood_2010, patientA_sequencing_tumor_2012, and so on (this example is from a cancer DNA sequencing analysis).

Sefara resources always have a name attribute, given as the first parameter to the export function. Everything else is optional. If you define a tags attribute for your resource, as in the file above, then Sefara treats that attribute slightly specially, as we’ll see below. All the other fields (like path and capture_kit) are up to the user and are not interpreted by Sefara in any way.

Note

Using Python as a configuration language for resource collections has pros and cons. Python enables concise descriptions of collections that have repetitive structure. However, full-fledged programs can be difficult to debug and maintain.

The best practice is to keep collection files very simple. Loops and string manipulation are fine, but using libraries, filesystems, or network services should be avoided.

If you find yourself writing a complex Python script to define a collection, consider instead writing a script that creates the collection. That script can be called once to write out a collection (in either Python or JSON format), to be used for subsequent analysis.

Loading¶

The collection shown before can be opened in Python using sefara.load:

>>> import sefara
>>> resources = sefara.load("resource-collections/ex1.py")
>>> print(resources)
<ResourceCollection: 5 resources
	patientA_sequencing_blood_2010
	patientA_sequencing_tumor_2012
	patientB_sequencing_normal_tissue_2012
	patientB_sequencing_tumor_2014
	patientA_somatic_variant_calls>

To get a detailed summary of each resource, use the ResourceCollection.summary property:

>>> print(resources.summary)
ResourceCollection: 5 resources from /home/travis/build/timodonnell/sefara/docs/resource-collections/ex1.py
<Resource: name        = patientA_sequencing_blood_2010
           tags        = sequencing patient_A 2010
           capture_kit = Agilent sureselect
           description = Sample taken from normal blood
           path        = /path/to/file1.bam >

<Resource: name        = patientA_sequencing_tumor_2012
           tags        = sequencing patient_A 2012
           capture_kit = Nimblegen
           description = Primary tumor, left ovary
           path        = /path/to/file2.bam >

<Resource: name        = patientB_sequencing_normal_tissue_2012
           tags        = sequencing patient_B 2012
           capture_kit = unknown
           description = Matched normal tissue, pancreas
           path        = /path/to/file3.bam >

<Resource: name        = patientB_sequencing_tumor_2014
           tags        = sequencing patient_B 2014
           capture_kit = Agilent sureselect
           description = liver metastasis
           path        = /path/to/file4.bam >

<Resource: name         = patientA_somatic_variant_calls
           tags         = patient_A variants
           normal_reads = patientA_sequencing_blood_2010
           path         = /path/to/variants.vcf
           tool         = strelka
           tumor_reads  = patientA_sequencing_tumor_2012 >

Individual resources in a collection can be accessed by name or numeric index:

>>> resources["patientA_sequencing_tumor_2012"]
<Resource: name        = patientA_sequencing_tumor_2012
           tags        = sequencing patient_A 2012
           capture_kit = Nimblegen
           description = Primary tumor, left ovary
           path        = /path/to/file2.bam >
>>> resources[0]
<Resource: name        = patientA_sequencing_blood_2010
           tags        = sequencing patient_A 2010
           capture_kit = Agilent sureselect
           description = Sample taken from normal blood
           path        = /path/to/file1.bam >

Each Resource is a dict-like object (an AttrDict). You can access fields in the resource using either brackets (resource["name"]) or attributes (resource.name).

You can iterate over a resource collection to get each resource in turn:

>>> for resource in resources.filter("tags.patient_A"):
...        print(resource.path)
/path/to/file1.bam
/path/to/file2.bam
/path/to/variants.vcf

Here we used the filter method, described next.

Filtering¶

The filter method returns a new ResourceCollection with only the resources for which a Python expression evaluates to True:

>>> resources.filter("capture_kit == 'Nimblegen'")
<ResourceCollection: 1 resources: patientA_sequencing_tumor_2012>

The expressions are evaluated with the resource’s attributes as local variables. Here’s another example:

>>> resources.filter("tags.patient_A")
<ResourceCollection: 3 resources
	patientA_sequencing_blood_2010
	patientA_sequencing_tumor_2012
	patientA_somatic_variant_calls>

Here we used the fact that sefara handles a resource attribute called tags a bit specially. If your resource has an attribute called tags giving a list of strings, it becomes an instance of Tags, a class that inherits from Python’s set class, but also provides attribute-style access for membership testing.

Simple expressions combining tags are a convenient way to filter collections:

>>> resources.filter("tags.patient_A and not tags.sequencing")
<ResourceCollection: 1 resources: patientA_somatic_variant_calls>

Instead of a string expression, you can also pass a Python callable to filter:

>>> resources.filter(lambda resource: resource.tags.variants)
<ResourceCollection: 1 resources: patientA_somatic_variant_calls>

Selecting¶

Use ResourceCollection.select to pick out fields across a collection in a pandas.DataFrame:

>>> resources.select("name", "path")
                                     name                   path
        patientA_sequencing_blood_2010     /path/to/file1.bam
        patientA_sequencing_tumor_2012     /path/to/file2.bam
patientB_sequencing_normal_tissue_2012     /path/to/file3.bam
        patientB_sequencing_tumor_2014     /path/to/file4.bam
        patientA_somatic_variant_calls  /path/to/variants.vcf

You can pass Python expressions to select:

>>> resources.select("name.upper()", "os.path.basename(path)")
                             name.upper() os.path.basename(path)
        PATIENTA_SEQUENCING_BLOOD_2010              file1.bam
        PATIENTA_SEQUENCING_TUMOR_2012              file2.bam
PATIENTB_SEQUENCING_NORMAL_TISSUE_2012              file3.bam
        PATIENTB_SEQUENCING_TUMOR_2014              file4.bam
        PATIENTA_SOMATIC_VARIANT_CALLS           variants.vcf

To set the column names in the dataframe, give a “name: expression” string:

>>> resources.select("name: name.upper()", "filename: os.path.basename(path)")
                                     name      filename
        PATIENTA_SEQUENCING_BLOOD_2010     file1.bam
        PATIENTA_SEQUENCING_TUMOR_2012     file2.bam
PATIENTB_SEQUENCING_NORMAL_TISSUE_2012     file3.bam
        PATIENTB_SEQUENCING_TUMOR_2014     file4.bam
        PATIENTA_SOMATIC_VARIANT_CALLS  variants.vcf

To select one column as a pandas series instead of a dataframe, use select_series:

>>> resources.select_series("path")
     /path/to/file1.bam
     /path/to/file2.bam
     /path/to/file3.bam
     /path/to/file4.bam
  /path/to/variants.vcf
Name: value, dtype: object

Note that in Python expressions evaluated by select or filter, if a field exists in some resources but not others, it defaults to None in the resources where it is missing:

>>> resources.select("name", "capture_kit")
                                     name         capture_kit
        patientA_sequencing_blood_2010  Agilent sureselect
        patientA_sequencing_tumor_2012           Nimblegen
patientB_sequencing_normal_tissue_2012             unknown
        patientB_sequencing_tumor_2014  Agilent sureselect
        patientA_somatic_variant_calls                None

Loading with URLs and filters¶

The sefara.load function supports two other features worth knowing about:

you can pass an http or https URL.

you can specify a filter as part of the filename.

The syntax for filtering is “<filename>#filter=<filter expression>”. Here’s an example:

>>> sefara.load("resource-collections/ex1.py#filter=tags.variants")
<ResourceCollection: 1 resources: patientA_somatic_variant_calls>

This comes in handy for analysis scripts that take a resource collection as a commandline argument. If the script passes the argument to load, then it automatically supports filtering.

Writing¶

The ResourceCollection.to_python method gives string of Python code representing the collection:

>>> print(resources.filter("tags.patient_B").to_python())
# Generated on 2015-07-31 06:16:28.218137 by travis with the command:
# /home/travis/virtualenv/python2.7.9/bin/sphinx-build -b html -d _build/doctrees . _build/html

from sefara import export

export(
    name="patientB_sequencing_normal_tissue_2012",
    tags=["sequencing", "patient_B", "2012"],
    capture_kit="unknown",
    path="/path/to/file3.bam",
    description="Matched normal tissue, pancreas")

export(
    name="patientB_sequencing_tumor_2014",
    tags=["sequencing", "patient_B", "2014"],
    capture_kit="Agilent sureselect",
    path="/path/to/file4.bam",
    description="liver metastasis")

See the write method for a convenient way to write this to a file.

JSON¶

Resource collections can also be specified with JSON. They are loaded with load the same way as python collections, and there is a ResourceCollection.to_json method to get the JSON representation of any collection.

Here’s what the same collection from before looks like using JSON:

# Generated on 2015-07-31 06:16:28.509231 by travis with the command:
# /home/travis/virtualenv/python2.7.9/bin/sefara-dump resource-collections/ex1.py

from sefara import export

export(
    name="patientA_sequencing_blood_2010",
    tags=["sequencing", "patient_A", "2010"],
    capture_kit="Agilent sureselect",
    path="/path/to/file1.bam",
    description="Sample taken from normal blood")

export(
    name="patientA_sequencing_tumor_2012",
    tags=["sequencing", "patient_A", "2012"],
    capture_kit="Nimblegen",
    path="/path/to/file2.bam",
    description="Primary tumor, left ovary")

export(
    name="patientB_sequencing_normal_tissue_2012",
    tags=["sequencing", "patient_B", "2012"],
    capture_kit="unknown",
    path="/path/to/file3.bam",
    description="Matched normal tissue, pancreas")

export(
    name="patientB_sequencing_tumor_2014",
    tags=["sequencing", "patient_B", "2014"],
    capture_kit="Agilent sureselect",
    path="/path/to/file4.bam",
    description="liver metastasis")

export(
    name="patientA_somatic_variant_calls",
    tags=["patient_A", "variants"],
    tool="strelka",
    normal_reads="patientA_sequencing_blood_2010",
    path="/path/to/variants.vcf",
    tumor_reads="patientA_sequencing_tumor_2012")

Commandline tools¶

sefara-dump¶

The sefara-dump tool is a utility for reading in a collection and writing it out again after filtering, transforming, or changing the format. For example, we can write out a filtered version of our example collection:

$ sefara-dump resource-collections/ex1.py --filter tags.patient_B --format python
# Generated on 2015-07-31 06:16:28.762339 by travis with the command:
# /home/travis/virtualenv/python2.7.9/bin/sefara-dump resource-collections/ex1.py --filter tags.patient_B --format python

from sefara import export

export(
    name="patientB_sequencing_normal_tissue_2012",
    tags=["sequencing", "patient_B", "2012"],
    capture_kit="unknown",
    path="/path/to/file3.bam",
    description="Matched normal tissue, pancreas")

export(
    name="patientB_sequencing_tumor_2014",
    tags=["sequencing", "patient_B", "2014"],
    capture_kit="Agilent sureselect",
    path="/path/to/file4.bam",
    description="liver metastasis")

Or in JSON format:

$ sefara-dump resource-collections/ex1.py --filter tags.patient_B --format json
{
    "patientB_sequencing_normal_tissue_2012": {
        "tags": [
            "sequencing", 
            "patient_B", 
            "2012"
        ], 
        "capture_kit": "unknown", 
        "path": "/path/to/file3.bam", 
        "description": "Matched normal tissue, pancreas"
    }, 
    "patientB_sequencing_tumor_2014": {
        "tags": [
            "sequencing", 
            "patient_B", 
            "2014"
        ], 
        "capture_kit": "Agilent sureselect", 
        "path": "/path/to/file4.bam", 
        "description": "liver metastasis"
    }
}

The --code argument is an experimental feature that makes it possible to add, remove, and modify attributes of the collection by specifying Python code on the commandline. For example, we can capitalize all the resource names like this:

$ sefara-dump resource-collections/ex1.py --code 'name = name.upper()'
# Generated on 2015-07-31 06:16:29.264684 by travis with the command:
# /home/travis/virtualenv/python2.7.9/bin/sefara-dump resource-collections/ex1.py --code 'name = name.upper()'

from sefara import export

export(
    name="PATIENTA_SEQUENCING_BLOOD_2010",
    tags=["sequencing", "patient_A", "2010"],
    capture_kit="Agilent sureselect",
    path="/path/to/file1.bam",
    description="Sample taken from normal blood")

export(
    name="PATIENTA_SEQUENCING_TUMOR_2012",
    tags=["sequencing", "patient_A", "2012"],
    capture_kit="Nimblegen",
    path="/path/to/file2.bam",
    description="Primary tumor, left ovary")

export(
    name="PATIENTB_SEQUENCING_NORMAL_TISSUE_2012",
    tags=["sequencing", "patient_B", "2012"],
    capture_kit="unknown",
    path="/path/to/file3.bam",
    description="Matched normal tissue, pancreas")

export(
    name="PATIENTB_SEQUENCING_TUMOR_2014",
    tags=["sequencing", "patient_B", "2014"],
    capture_kit="Agilent sureselect",
    path="/path/to/file4.bam",
    description="liver metastasis")

export(
    name="PATIENTA_SOMATIC_VARIANT_CALLS",
    tags=["patient_A", "variants"],
    tool="strelka",
    normal_reads="patientA_sequencing_blood_2010",
    path="/path/to/variants.vcf",
    tumor_reads="patientA_sequencing_tumor_2012")

sefara-select¶

The sefara-select tool picks out fields in your collection and gives results in CSV or a few other formats:

$ sefara-select resource-collections/ex1.py name path
# name,path
patientA_sequencing_blood_2010,/path/to/file1.bam
patientA_sequencing_tumor_2012,/path/to/file2.bam
patientB_sequencing_normal_tissue_2012,/path/to/file3.bam
patientB_sequencing_tumor_2014,/path/to/file4.bam
patientA_somatic_variant_calls,/path/to/variants.vcf

Like the select method, you can also use Python expressions, and set the column names:

$ sefara-select resource-collections/ex1.py name 'filename: os.path.basename(path)'
# name,filename
patientA_sequencing_blood_2010,file1.bam
patientA_sequencing_tumor_2012,file2.bam
patientB_sequencing_normal_tissue_2012,file3.bam
patientB_sequencing_tumor_2014,file4.bam
patientA_somatic_variant_calls,variants.vcf

Filters are supported:

$ sefara-select resource-collections/ex1.py name --filter tags.patient_B
patientB_sequencing_normal_tissue_2012
patientB_sequencing_tumor_2014

If you only select one field, by default no header is printed. That’s convenient for writing shell loops like this:

$ for p in $(sefara-select resource-collections/ex1.py path) ; do echo $p ; done
/path/to/file1.bam
/path/to/file2.bam
/path/to/file3.bam
/path/to/file4.bam
/path/to/variants.vcf

The tool provides rudimentary support for building up argument lists, suitable for passing to analysis scripts that don’t use sefara. Here are two examples (note that the first line is the command we typed – the second line is the output of the command, which is an argument string):

$ sefara-select resource-collections/ex1.py name path --format args
 --name patientA_sequencing_blood_2010 patientA_sequencing_tumor_2012 patientB_sequencing_normal_tissue_2012 patientB_sequencing_tumor_2014 patientA_somatic_variant_calls --path /path/to/file1.bam /path/to/file2.bam /path/to/file3.bam /path/to/file4.bam /path/to/variants.vcf

$ sefara-select resource-collections/ex1.py 'filename:path' --format args-repeated
 --filename /path/to/file1.bam --filename /path/to/file2.bam --filename /path/to/file3.bam --filename /path/to/file4.bam --filename /path/to/variants.vcf

The usual way to use this functionality is to put the sefara-select command as a “command substitution” using the shell’s $() operator, for example:

$ ./example_tool.py --foo bar $(sefara-select resource-collections/ex1.py path --format args-repeated)

# The tool was invoked with these arguments:
# ['--foo', 'bar', '--path', '/path/to/file1.bam', '--path',
# '/path/to/file2.bam', '--path', '/path/to/file3.bam', '--path',
# '/path/to/file4.bam', '--path', '/path/to/variants.vcf']

Note that for any sefara tool (and the load function), you can pass - as a path to a resource collection to read from stdin.

With creative use of the --code argument to sefara-dump and piping the results to sefara-select, it’s often possible to munge a resource collection into the argument format your tool expects.

$ ./example_tool.py $(sefara-dump resource-collections/ex1.py --code 'kind = os.path.splitext(path)[1][1:]' | sefara-select - kind path --format args-repeated)

# The tool was invoked with these arguments:
# ['--kind', 'bam', '--path', '/path/to/file1.bam', '--kind', 'bam', '--
# path', '/path/to/file2.bam', '--kind', 'bam', '--path',
# '/path/to/file3.bam', '--kind', 'bam', '--path', '/path/to/file4.bam',
# '--kind', 'vcf', '--path', '/path/to/variants.vcf']

Hooks¶

Sefara supports a mechanism to transform resources as they are loaded, and also to configure arbitrary validation routines. This advanced feature comes in handy if you want to integrate Sefara with other pieces of infrastructure at your site. If you’re just getting started with sefara, you can probably skip this section.

The features described here are configured with a few environment variables. To print out what these variables are set to in your environment, run sefara-env.

Checking¶

Sefara doesn’t concern itself with what attributes your resources have or what they signify, but you probably do. For example, your resources might have a path attribute that should always be a filename. You might want to verify that your resources with this attribute point to a file that exists. Sefara provides a mechanism to help make this a bit more convenient than doing it manually.

Make a file that defines a check function, like the following:

import os

def check(collection):
    '''
    Check that all resources with a ``path`` attribute specify a valid path
    that exists on the filesystem.

    For each resource in the collection, we yield a tuple:

        (resource, attempted, problem)
    
    where:
        
        resource : `Resource`
            the `Resource` instance

        attempted : boolean
            whether we attempted to validate this resource. We only attempt to
            validate resources that have a ``path`` attribute.

        problem : string or None
            None if validation was successful, error message if validation failed
    '''

    for resource in collection:
        if 'path' in resource:
            try:
                with open(resource.path):
                    pass
            except Exception as e:
                # Failure
                message = "Couldn't open path: %s" % e
                yield (resource, True, message)
            else:
               # Success
               yield (resource, True, None)
        else:
            # Can't validate this resource.
            yield (resource, False, "No path specified.")

The check function should yield tuples, as documented in the code above.

Now set the environment variable SEFARA_CHECKER to the path to this file:

$ export SEFARA_CHECKER=/path/to/example_hook.py

This environment variable can specify any number of checkers, separated by a colon.

You can now use the sefara-check tool to validate this file:

sefara-check resource-collections/ex1.py

which, if all the resources validate, will give this output:

Checkers:
	[0]	hook_always_success.py

[  1 /   5] patientA_sequencing_blood_2010                          OK
[  2 /   5] patientA_sequencing_tumor_2012                          OK
[  3 /   5] patientB_sequencing_normal_tissue_2012                  OK
[  4 /   5] patientB_sequencing_tumor_2014                          OK
[  5 /   5] patientA_somatic_variant_calls                          OK

ALL OK

or will spot errors:

Checkers:
	[0]	hook_always_success.py

[  1 /   5] patientA_sequencing_blood_2010                          OK
[  2 /   5] patientA_sequencing_tumor_2012                          OK
[  3 /   5] patientB_sequencing_normal_tissue_2012                  ER
    [0] Couldn't open path: [Errno 2] No such file or directory: '/path/to/file3.bam'

[  4 /   5] patientB_sequencing_tumor_2014                          OK
[  5 /   5] patientA_somatic_variant_calls                          OK

PROBLEMS (1 failed / 5 total):
-------------------------------patientB_sequencing_normal_tissue_2012-------------------------------
<Resource: name        = patientB_sequencing_normal_tissue_2012
           tags        = sequencing patient_B 2012
           capture_kit = unknown
           description = Matched normal tissue, pancreas
           path        = /path/to/file3.bam >

	hook_always_success.py
	---> Couldn't open path: [Errno 2] No such file or directory: '/path/to/file3.bam'

You may have multiple kinds of resources, each with their own concept of validation. One way to handle this is to write a checker for each type of resource, and include all of them in the SEFARA_CHECKER environment variable. Each checker should skip the resources that don’t match the schema it knows how to validate, as our example does for resources that don’t have a path attribute. The sefara-check tool will raise errors for any resources that are not validated by at least one checker.

Transforming¶

In some specialized circumstances, you may want to change resources as they are loaded. To do this, we can add a transform function to our example_hook.py file:

def transform(collection):
    '''
    For resources that have a 'path' attribute pointing to a certain
    subdirectory, add a "http_url" attribute.
    '''
    for resource in collection:
        if 'path' in resource and resource.path.startswith("/path/to"):
            # Add http_url
            resource.http_url = "http://our-internal-file-server" + resource.path

Note that the transform function mutates the resources in the collection. It does not return a new collection.

We can configure this by setting the SEFARA_TRANSFORM environment variable, which, like SEFARA_CHECKER, can be a colon separated list of files:

$ export SEFARA_TRANSFORM=/path/to/example_hook.py

Now, whenever we call sefara.load or use any of sefara’s commandline tools, we will be working with the transformed resources:

sefara-select resource-collections/ex1.py name http_url

gives this output:

# name,http_url
patientA_sequencing_blood_2010,http://our-internal-file-server/path/to/file1.bam
patientA_sequencing_tumor_2012,http://our-internal-file-server/path/to/file2.bam
patientB_sequencing_normal_tissue_2012,http://our-internal-file-server/path/to/file3.bam
patientB_sequencing_tumor_2014,http://our-internal-file-server/path/to/file4.bam
patientA_somatic_variant_calls,http://our-internal-file-server/path/to/variants.vcf