Tutorial¶
Motivation¶
We’ve all found ourselves hard coding paths to input files in analysis scripts and notebooks. Putting paths and metadata in spreadsheets is an improvement, but still requires significant code to work with conveniently and doesn’t handle multiple, loosely defined schemas. Databases are probably the right solution for huge projects but are inconvenient at smaller scale.
With Sefara, you write Python code to specify the files you’re analyzing and their metadata. Defining your datasets with code may seem strange but it allows for a lot flexibility. These files describing your datasets are called “resource collections” and can be put under version control and in general treated like code. Sefara is a Python library and a few commandline tools for reading, writing, and querying these files.
Essentials¶
Specifying a resource collection¶
A “resource” describes an entity used in an analysis. Resources often point to files on a filesystem.
Users define a “resource collection” in a file that sefara reads. Two formats are supported for these files: executable Python and JSON.
Here’s a resource collection defined using Python:
from sefara import export
export(
"patientA_sequencing_blood_2010",
path="/path/to/file1.bam",
tags=["patient_A", "sequencing", "2010"],
capture_kit="Agilent sureselect",
description="Sample taken from normal blood"
)
export(
"patientA_sequencing_tumor_2012",
path="/path/to/file2.bam",
tags=["patient_A", "sequencing", "2012"],
capture_kit="Nimblegen",
description="Primary tumor, left ovary"
)
export(
"patientB_sequencing_normal_tissue_2012",
path="/path/to/file3.bam",
tags=["patient_B", "sequencing", "2012"],
capture_kit="unknown",
description="Matched normal tissue, pancreas"
)
export(
"patientB_sequencing_tumor_2014",
path="/path/to/file4.bam",
tags=["patient_B", "sequencing", "2014"],
capture_kit="Agilent sureselect",
description="liver metastasis"
)
export(
"patientA_somatic_variant_calls",
path="/path/to/variants.vcf",
tags=["patient_A", "variants"],
tool="strelka",
normal_reads="patientA_sequencing_blood_2010",
tumor_reads="patientA_sequencing_tumor_2012"
)
This defines five resources named patientA_sequencing_blood_2010
,
patientA_sequencing_tumor_2012
, and so on (this example is from a cancer DNA sequencing analysis).
Sefara resources always have a name
attribute, given as the first parameter to the export
function.
Everything else is optional. If you define a tags
attribute for your
resource, as in the file above, then Sefara treats that attribute slightly
specially, as we’ll see below. All the other fields (like path
and
capture_kit
) are up to the user and are not interpreted by Sefara in
any way.
Note
Using Python as a configuration language for resource collections has pros and cons. Python enables concise descriptions of collections that have repetitive structure. However, full-fledged programs can be difficult to debug and maintain.
The best practice is to keep collection files very simple. Loops and string manipulation are fine, but using libraries, filesystems, or network services should be avoided.
If you find yourself writing a complex Python script to define a collection, consider instead writing a script that creates the collection. That script can be called once to write out a collection (in either Python or JSON format), to be used for subsequent analysis.
Loading¶
The collection shown before can be opened in Python using sefara.load
:
>>> import sefara
>>> resources = sefara.load("resource-collections/ex1.py")
>>> print(resources)
<ResourceCollection: 5 resources
patientA_sequencing_blood_2010
patientA_sequencing_tumor_2012
patientB_sequencing_normal_tissue_2012
patientB_sequencing_tumor_2014
patientA_somatic_variant_calls>
To get a detailed summary of each resource, use the ResourceCollection.summary
property:
>>> print(resources.summary)
ResourceCollection: 5 resources from /home/travis/build/timodonnell/sefara/docs/resource-collections/ex1.py
<Resource: name = patientA_sequencing_blood_2010
tags = sequencing patient_A 2010
capture_kit = Agilent sureselect
description = Sample taken from normal blood
path = /path/to/file1.bam >
<Resource: name = patientA_sequencing_tumor_2012
tags = sequencing patient_A 2012
capture_kit = Nimblegen
description = Primary tumor, left ovary
path = /path/to/file2.bam >
<Resource: name = patientB_sequencing_normal_tissue_2012
tags = sequencing patient_B 2012
capture_kit = unknown
description = Matched normal tissue, pancreas
path = /path/to/file3.bam >
<Resource: name = patientB_sequencing_tumor_2014
tags = sequencing patient_B 2014
capture_kit = Agilent sureselect
description = liver metastasis
path = /path/to/file4.bam >
<Resource: name = patientA_somatic_variant_calls
tags = patient_A variants
normal_reads = patientA_sequencing_blood_2010
path = /path/to/variants.vcf
tool = strelka
tumor_reads = patientA_sequencing_tumor_2012 >
Individual resources in a collection can be accessed by name or numeric index:
>>> resources["patientA_sequencing_tumor_2012"]
<Resource: name = patientA_sequencing_tumor_2012
tags = sequencing patient_A 2012
capture_kit = Nimblegen
description = Primary tumor, left ovary
path = /path/to/file2.bam >
>>> resources[0]
<Resource: name = patientA_sequencing_blood_2010
tags = sequencing patient_A 2010
capture_kit = Agilent sureselect
description = Sample taken from normal blood
path = /path/to/file1.bam >
Each Resource
is a dict-like object (an AttrDict
). You can access fields in the resource using either brackets (resource["name"]
) or attributes (resource.name
).
You can iterate over a resource collection to get each resource in turn:
>>> for resource in resources.filter("tags.patient_A"):
... print(resource.path)
/path/to/file1.bam
/path/to/file2.bam
/path/to/variants.vcf
Here we used the filter
method, described next.
Filtering¶
The filter
method returns a new ResourceCollection
with only the resources for which a Python expression evaluates to True
:
>>> resources.filter("capture_kit == 'Nimblegen'")
<ResourceCollection: 1 resources: patientA_sequencing_tumor_2012>
The expressions are evaluated with the resource’s attributes as local variables. Here’s another example:
>>> resources.filter("tags.patient_A")
<ResourceCollection: 3 resources
patientA_sequencing_blood_2010
patientA_sequencing_tumor_2012
patientA_somatic_variant_calls>
Here we used the fact that sefara handles a resource attribute called tags
a bit specially. If your resource has an attribute called tags
giving a list of strings, it becomes an instance of Tags
, a class that inherits from Python’s set
class, but also provides attribute-style access for membership testing.
Simple expressions combining tags are a convenient way to filter collections:
>>> resources.filter("tags.patient_A and not tags.sequencing")
<ResourceCollection: 1 resources: patientA_somatic_variant_calls>
Instead of a string expression, you can also pass a Python callable to filter
:
>>> resources.filter(lambda resource: resource.tags.variants)
<ResourceCollection: 1 resources: patientA_somatic_variant_calls>
Selecting¶
Use ResourceCollection.select
to pick out fields across a collection
in a pandas.DataFrame
:
>>> resources.select("name", "path")
name path
0 patientA_sequencing_blood_2010 /path/to/file1.bam
1 patientA_sequencing_tumor_2012 /path/to/file2.bam
2 patientB_sequencing_normal_tissue_2012 /path/to/file3.bam
3 patientB_sequencing_tumor_2014 /path/to/file4.bam
4 patientA_somatic_variant_calls /path/to/variants.vcf
You can pass Python expressions to select
:
>>> resources.select("name.upper()", "os.path.basename(path)")
name.upper() os.path.basename(path)
0 PATIENTA_SEQUENCING_BLOOD_2010 file1.bam
1 PATIENTA_SEQUENCING_TUMOR_2012 file2.bam
2 PATIENTB_SEQUENCING_NORMAL_TISSUE_2012 file3.bam
3 PATIENTB_SEQUENCING_TUMOR_2014 file4.bam
4 PATIENTA_SOMATIC_VARIANT_CALLS variants.vcf
To set the column names in the dataframe, give a “name: expression” string:
>>> resources.select("name: name.upper()", "filename: os.path.basename(path)")
name filename
0 PATIENTA_SEQUENCING_BLOOD_2010 file1.bam
1 PATIENTA_SEQUENCING_TUMOR_2012 file2.bam
2 PATIENTB_SEQUENCING_NORMAL_TISSUE_2012 file3.bam
3 PATIENTB_SEQUENCING_TUMOR_2014 file4.bam
4 PATIENTA_SOMATIC_VARIANT_CALLS variants.vcf
To select one column as a pandas series instead of a dataframe, use select_series
:
>>> resources.select_series("path")
0 /path/to/file1.bam
1 /path/to/file2.bam
2 /path/to/file3.bam
3 /path/to/file4.bam
4 /path/to/variants.vcf
Name: value, dtype: object
Note that in Python expressions evaluated by select
or filter
, if a field exists in some resources but not others, it defaults to None
in the resources where it is missing:
>>> resources.select("name", "capture_kit")
name capture_kit
0 patientA_sequencing_blood_2010 Agilent sureselect
1 patientA_sequencing_tumor_2012 Nimblegen
2 patientB_sequencing_normal_tissue_2012 unknown
3 patientB_sequencing_tumor_2014 Agilent sureselect
4 patientA_somatic_variant_calls None
Loading with URLs and filters¶
The sefara.load
function supports two other features worth knowing about:
- you can pass an http or https URL.
- you can specify a filter as part of the filename.
The syntax for filtering is “<filename>#filter=<filter expression>”. Here’s an example:
>>> sefara.load("resource-collections/ex1.py#filter=tags.variants")
<ResourceCollection: 1 resources: patientA_somatic_variant_calls>
This comes in handy for analysis scripts that take a resource collection as a commandline argument. If the script passes the argument to load
, then it automatically supports filtering.
Writing¶
The ResourceCollection.to_python
method gives string of Python code representing the collection:
>>> print(resources.filter("tags.patient_B").to_python())
# Generated on 2015-07-31 06:16:28.218137 by travis with the command:
# /home/travis/virtualenv/python2.7.9/bin/sphinx-build -b html -d _build/doctrees . _build/html
from sefara import export
export(
name="patientB_sequencing_normal_tissue_2012",
tags=["sequencing", "patient_B", "2012"],
capture_kit="unknown",
path="/path/to/file3.bam",
description="Matched normal tissue, pancreas")
export(
name="patientB_sequencing_tumor_2014",
tags=["sequencing", "patient_B", "2014"],
capture_kit="Agilent sureselect",
path="/path/to/file4.bam",
description="liver metastasis")
See the write
method for a convenient way to write this to a file.
JSON¶
Resource collections can also be specified with JSON. They are loaded with load
the same way as python collections, and there is a ResourceCollection.to_json
method to get the JSON representation of any collection.
Here’s what the same collection from before looks like using JSON:
# Generated on 2015-07-31 06:16:28.509231 by travis with the command:
# /home/travis/virtualenv/python2.7.9/bin/sefara-dump resource-collections/ex1.py
from sefara import export
export(
name="patientA_sequencing_blood_2010",
tags=["sequencing", "patient_A", "2010"],
capture_kit="Agilent sureselect",
path="/path/to/file1.bam",
description="Sample taken from normal blood")
export(
name="patientA_sequencing_tumor_2012",
tags=["sequencing", "patient_A", "2012"],
capture_kit="Nimblegen",
path="/path/to/file2.bam",
description="Primary tumor, left ovary")
export(
name="patientB_sequencing_normal_tissue_2012",
tags=["sequencing", "patient_B", "2012"],
capture_kit="unknown",
path="/path/to/file3.bam",
description="Matched normal tissue, pancreas")
export(
name="patientB_sequencing_tumor_2014",
tags=["sequencing", "patient_B", "2014"],
capture_kit="Agilent sureselect",
path="/path/to/file4.bam",
description="liver metastasis")
export(
name="patientA_somatic_variant_calls",
tags=["patient_A", "variants"],
tool="strelka",
normal_reads="patientA_sequencing_blood_2010",
path="/path/to/variants.vcf",
tumor_reads="patientA_sequencing_tumor_2012")
Commandline tools¶
sefara-dump¶
The sefara-dump
tool is a utility for reading in a collection and writing it out again after filtering, transforming, or changing the format. For example, we can write out a filtered version of our example collection:
$ sefara-dump resource-collections/ex1.py --filter tags.patient_B --format python
# Generated on 2015-07-31 06:16:28.762339 by travis with the command:
# /home/travis/virtualenv/python2.7.9/bin/sefara-dump resource-collections/ex1.py --filter tags.patient_B --format python
from sefara import export
export(
name="patientB_sequencing_normal_tissue_2012",
tags=["sequencing", "patient_B", "2012"],
capture_kit="unknown",
path="/path/to/file3.bam",
description="Matched normal tissue, pancreas")
export(
name="patientB_sequencing_tumor_2014",
tags=["sequencing", "patient_B", "2014"],
capture_kit="Agilent sureselect",
path="/path/to/file4.bam",
description="liver metastasis")
Or in JSON format:
$ sefara-dump resource-collections/ex1.py --filter tags.patient_B --format json
{
"patientB_sequencing_normal_tissue_2012": {
"tags": [
"sequencing",
"patient_B",
"2012"
],
"capture_kit": "unknown",
"path": "/path/to/file3.bam",
"description": "Matched normal tissue, pancreas"
},
"patientB_sequencing_tumor_2014": {
"tags": [
"sequencing",
"patient_B",
"2014"
],
"capture_kit": "Agilent sureselect",
"path": "/path/to/file4.bam",
"description": "liver metastasis"
}
}
The --code
argument is an experimental feature that makes it possible to add, remove, and modify
attributes of the collection by specifying Python code on the commandline. For example, we can capitalize
all the resource names like this:
$ sefara-dump resource-collections/ex1.py --code 'name = name.upper()'
# Generated on 2015-07-31 06:16:29.264684 by travis with the command:
# /home/travis/virtualenv/python2.7.9/bin/sefara-dump resource-collections/ex1.py --code 'name = name.upper()'
from sefara import export
export(
name="PATIENTA_SEQUENCING_BLOOD_2010",
tags=["sequencing", "patient_A", "2010"],
capture_kit="Agilent sureselect",
path="/path/to/file1.bam",
description="Sample taken from normal blood")
export(
name="PATIENTA_SEQUENCING_TUMOR_2012",
tags=["sequencing", "patient_A", "2012"],
capture_kit="Nimblegen",
path="/path/to/file2.bam",
description="Primary tumor, left ovary")
export(
name="PATIENTB_SEQUENCING_NORMAL_TISSUE_2012",
tags=["sequencing", "patient_B", "2012"],
capture_kit="unknown",
path="/path/to/file3.bam",
description="Matched normal tissue, pancreas")
export(
name="PATIENTB_SEQUENCING_TUMOR_2014",
tags=["sequencing", "patient_B", "2014"],
capture_kit="Agilent sureselect",
path="/path/to/file4.bam",
description="liver metastasis")
export(
name="PATIENTA_SOMATIC_VARIANT_CALLS",
tags=["patient_A", "variants"],
tool="strelka",
normal_reads="patientA_sequencing_blood_2010",
path="/path/to/variants.vcf",
tumor_reads="patientA_sequencing_tumor_2012")
sefara-select¶
The sefara-select
tool picks out fields in your collection and gives results in CSV or a few other formats:
$ sefara-select resource-collections/ex1.py name path
# name,path
patientA_sequencing_blood_2010,/path/to/file1.bam
patientA_sequencing_tumor_2012,/path/to/file2.bam
patientB_sequencing_normal_tissue_2012,/path/to/file3.bam
patientB_sequencing_tumor_2014,/path/to/file4.bam
patientA_somatic_variant_calls,/path/to/variants.vcf
Like the select
method, you can also use Python expressions, and set the column names:
$ sefara-select resource-collections/ex1.py name 'filename: os.path.basename(path)'
# name,filename
patientA_sequencing_blood_2010,file1.bam
patientA_sequencing_tumor_2012,file2.bam
patientB_sequencing_normal_tissue_2012,file3.bam
patientB_sequencing_tumor_2014,file4.bam
patientA_somatic_variant_calls,variants.vcf
Filters are supported:
$ sefara-select resource-collections/ex1.py name --filter tags.patient_B
patientB_sequencing_normal_tissue_2012
patientB_sequencing_tumor_2014
If you only select one field, by default no header is printed. That’s convenient for writing shell loops like this:
$ for p in $(sefara-select resource-collections/ex1.py path) ; do echo $p ; done
/path/to/file1.bam
/path/to/file2.bam
/path/to/file3.bam
/path/to/file4.bam
/path/to/variants.vcf
The tool provides rudimentary support for building up argument lists, suitable for passing to analysis scripts that don’t use sefara. Here are two examples (note that the first line is the command we typed – the second line is the output of the command, which is an argument string):
$ sefara-select resource-collections/ex1.py name path --format args
--name patientA_sequencing_blood_2010 patientA_sequencing_tumor_2012 patientB_sequencing_normal_tissue_2012 patientB_sequencing_tumor_2014 patientA_somatic_variant_calls --path /path/to/file1.bam /path/to/file2.bam /path/to/file3.bam /path/to/file4.bam /path/to/variants.vcf
$ sefara-select resource-collections/ex1.py 'filename:path' --format args-repeated
--filename /path/to/file1.bam --filename /path/to/file2.bam --filename /path/to/file3.bam --filename /path/to/file4.bam --filename /path/to/variants.vcf
The usual way to use this functionality is to put the sefara-select
command as a “command substitution” using the shell’s $()
operator, for example:
$ ./example_tool.py --foo bar $(sefara-select resource-collections/ex1.py path --format args-repeated)
# The tool was invoked with these arguments:
# ['--foo', 'bar', '--path', '/path/to/file1.bam', '--path',
# '/path/to/file2.bam', '--path', '/path/to/file3.bam', '--path',
# '/path/to/file4.bam', '--path', '/path/to/variants.vcf']
Note that for any sefara tool (and the load
function), you can pass -
as a path to a resource collection to read from stdin.
With creative use of the --code
argument to sefara-dump
and piping the results to sefara-select
, it’s often possible to munge a resource collection into the argument format your tool expects.
$ ./example_tool.py $(sefara-dump resource-collections/ex1.py --code 'kind = os.path.splitext(path)[1][1:]' | sefara-select - kind path --format args-repeated)
# The tool was invoked with these arguments:
# ['--kind', 'bam', '--path', '/path/to/file1.bam', '--kind', 'bam', '--
# path', '/path/to/file2.bam', '--kind', 'bam', '--path',
# '/path/to/file3.bam', '--kind', 'bam', '--path', '/path/to/file4.bam',
# '--kind', 'vcf', '--path', '/path/to/variants.vcf']
Hooks¶
Sefara supports a mechanism to transform resources as they are loaded, and also to configure arbitrary validation routines. This advanced feature comes in handy if you want to integrate Sefara with other pieces of infrastructure at your site. If you’re just getting started with sefara, you can probably skip this section.
The features described here are configured with a few environment variables. To print out what these variables are set to in your environment, run sefara-env
.
Checking¶
Sefara doesn’t concern itself with what attributes your resources have or what they signify, but you probably do. For example, your resources might have a path
attribute that should always be a filename. You might want to verify that your resources with this attribute point to a file that exists. Sefara provides a mechanism to help make this a bit more convenient than doing it manually.
Make a file that defines a check
function, like the following:
import os
def check(collection):
'''
Check that all resources with a ``path`` attribute specify a valid path
that exists on the filesystem.
For each resource in the collection, we yield a tuple:
(resource, attempted, problem)
where:
resource : `Resource`
the `Resource` instance
attempted : boolean
whether we attempted to validate this resource. We only attempt to
validate resources that have a ``path`` attribute.
problem : string or None
None if validation was successful, error message if validation failed
'''
for resource in collection:
if 'path' in resource:
try:
with open(resource.path):
pass
except Exception as e:
# Failure
message = "Couldn't open path: %s" % e
yield (resource, True, message)
else:
# Success
yield (resource, True, None)
else:
# Can't validate this resource.
yield (resource, False, "No path specified.")
The check
function should yield tuples, as documented in the code above.
Now set the environment variable SEFARA_CHECKER
to the path to this file:
$ export SEFARA_CHECKER=/path/to/example_hook.py
This environment variable can specify any number of checkers, separated by a colon.
You can now use the sefara-check
tool to validate this file:
sefara-check resource-collections/ex1.py
which, if all the resources validate, will give this output:
Checkers:
[0] hook_always_success.py
[ 1 / 5] patientA_sequencing_blood_2010 OK
[ 2 / 5] patientA_sequencing_tumor_2012 OK
[ 3 / 5] patientB_sequencing_normal_tissue_2012 OK
[ 4 / 5] patientB_sequencing_tumor_2014 OK
[ 5 / 5] patientA_somatic_variant_calls OK
ALL OK
or will spot errors:
Checkers:
[0] hook_always_success.py
[ 1 / 5] patientA_sequencing_blood_2010 OK
[ 2 / 5] patientA_sequencing_tumor_2012 OK
[ 3 / 5] patientB_sequencing_normal_tissue_2012 ER
[0] Couldn't open path: [Errno 2] No such file or directory: '/path/to/file3.bam'
[ 4 / 5] patientB_sequencing_tumor_2014 OK
[ 5 / 5] patientA_somatic_variant_calls OK
PROBLEMS (1 failed / 5 total):
-------------------------------patientB_sequencing_normal_tissue_2012-------------------------------
<Resource: name = patientB_sequencing_normal_tissue_2012
tags = sequencing patient_B 2012
capture_kit = unknown
description = Matched normal tissue, pancreas
path = /path/to/file3.bam >
hook_always_success.py
---> Couldn't open path: [Errno 2] No such file or directory: '/path/to/file3.bam'
You may have multiple kinds of resources, each with their own concept of validation. One way to handle this is to write a checker for each type of resource, and include all of them in the SEFARA_CHECKER
environment variable. Each checker should skip the resources that don’t match the schema it knows how to validate, as our example does for resources that don’t have a path
attribute. The sefara-check
tool will raise errors for any resources that are not validated by at least one checker.
Transforming¶
In some specialized circumstances, you may want to change resources as they are loaded. To do this, we can add a transform
function to our example_hook.py
file:
def transform(collection):
'''
For resources that have a 'path' attribute pointing to a certain
subdirectory, add a "http_url" attribute.
'''
for resource in collection:
if 'path' in resource and resource.path.startswith("/path/to"):
# Add http_url
resource.http_url = "http://our-internal-file-server" + resource.path
Note that the transform
function mutates the resources in the collection. It does not return a new collection.
We can configure this by setting the SEFARA_TRANSFORM
environment variable, which, like SEFARA_CHECKER
, can be a colon separated list of files:
$ export SEFARA_TRANSFORM=/path/to/example_hook.py
Now, whenever we call sefara.load
or use any of sefara’s commandline tools, we will be working with the transformed resources:
sefara-select resource-collections/ex1.py name http_url
gives this output:
# name,http_url
patientA_sequencing_blood_2010,http://our-internal-file-server/path/to/file1.bam
patientA_sequencing_tumor_2012,http://our-internal-file-server/path/to/file2.bam
patientB_sequencing_normal_tissue_2012,http://our-internal-file-server/path/to/file3.bam
patientB_sequencing_tumor_2014,http://our-internal-file-server/path/to/file4.bam
patientA_somatic_variant_calls,http://our-internal-file-server/path/to/variants.vcf