sefara package

Submodules

sefara.environment module

Environment variables used by sefara.

sefara.exporting module

Functions to define a resource collection.

These functions should be called only from sefara resource collections, not normal scripts.

sefara.exporting.export(*args, **kwargs)

Create and export a Resource with the specified attributes.

All arguments are passed to Resource.

sefara.exporting.export_resources(resources)

Export one or more Resource instances.

Parameters:

resources : list of Resource instances

Resource instances to be exported.

sefara.exporting.transform_exports(path_or_callable)

Transform the resources exported by this collection.

Parameters:

path_or_callable : string or callable

Passed to hooks.transform; see those docs for details.

sefara.hooks module

Hooks are a mechanism for running site-specific transforms or validation routines on Sefara resources.

exception sefara.hooks.NoCheckers

Bases: exceptions.Exception

sefara.hooks.check(collection, checkers=None, include_environment_checkers=True)

Run “checkers”, either specified as an argument or using environment variables, on the resource collection.

Checkers are used to validate that the resources in a collection meet user-defined criteria, for example, that the paths they point to exist.

Parameters:

checkers : list of either callables, strings, or tuples [optional]

If tuples, the elements are (path, name, args, kwargs).

Like transforms, checkers are called with this ResourceCollection instance as an argument. Unlike transforms, checkers should NOT mutate the resources. They are expected to return a list or generator of three-element tuples, (resource, attempted, problem), where resource is a Resource in this collection, attempted is True if validation was attempted on this resource, and problem is a string error message if validation was unsuccessful (None otherwise).

By returning attempted=False in the tuple above, checkers indicate that a resource did not conform to the schema the checker knows how to validate. For example, a checker might be verifying that files pointed to by resources exist, but some resource may not specify any file. In this case, the checker should set attempted=False. Another checker in use may know how to validate that resource. The sefara-check tool reports any resources that were not attempted by any checker as an error.

Checkers should generate a tuple for every resource in the collection in the order they appear in the collection.

include_environment_checkers : boolean [optional, default: True]

If True, then checkers configured in environment variables are run, in addition to any checkers specified in the first argument.

See the environment module for the definition of the environment variable used here.

Returns:

Generator giving (resource, tuples) pairs, where resource is a Resource in this collection, and tuples is a list of (checker, attempted, error) giving whether each checker attempted validation of that resource, and, if so, the result. error is None if validation was successful, otherwise a string giving the error.
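
As an illustration of the checker contract described above, here is a minimal sketch of a checker that verifies that file paths exist. It is not taken from sefara: it uses a list of plain dicts as a stand-in for a ResourceCollection so that it runs without the library, and the function name check_paths_exist and the attribute name path are hypothetical.

```python
import os
import tempfile


def check_paths_exist(collection):
    """Yield one (resource, attempted, problem) tuple per resource, in order."""
    for resource in collection:
        path = resource.get("path")  # "path" is a hypothetical attribute name
        if path is None:
            # Nothing this checker knows how to validate: report not attempted.
            yield (resource, False, None)
        elif os.path.exists(path):
            yield (resource, True, None)
        else:
            yield (resource, True, "path does not exist: %s" % path)


# Stand-in for a ResourceCollection: a list of dict-like resources.
with tempfile.NamedTemporaryFile() as handle:
    resources = [
        {"name": "present", "path": handle.name},
        {"name": "missing", "path": handle.name + ".does-not-exist"},
        {"name": "no_path"},
    ]
    results = [(r["name"], attempted, problem)
               for (r, attempted, problem) in check_paths_exist(resources)]
```

The third resource has no path, so the checker reports attempted=False for it rather than an error, leaving it for another checker to validate.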

sefara.hooks.run_hook(collection, path_or_callable, name, *args, **kwargs)

Invoke a Python callable (either passed directly or defined in the specified Python file) on the ResourceCollection and return the result.

Parameters:

path_or_callable : string or callable

If string, then this is interpreted as a path to a Python file. The file will be exec’d and is expected to define a module attribute given by the name argument. This attribute will be used as the callable. It should take a ResourceCollection instance as an argument.

Otherwise, this parameter should be a callable that takes a ResourceCollection instance. It will be invoked on this ResourceCollection.

name : string

If path_or_callable is a string giving a path to a Python file to execute, this parameter gives the attribute in that module to use as the callable.

*args, **kwargs

Additional args and kwargs are passed to the callable after the ResourceCollection.

sefara.hooks.transform(collection, path_or_callable, name='transform', *args, **kwargs)

Run a function on the resources in the collection. The function is expected to mutate the resources in the collection; a new collection is NOT returned.

Parameters:

path_or_callable : string or callable

If string, then this is interpreted as a path to a Python file. The file will be exec’d and is expected to define a module attribute given by the name argument. This attribute will be used as the callable. It should take a ResourceCollection instance as an argument.

Otherwise, this parameter should be a callable that takes a ResourceCollection instance. It will be invoked on this ResourceCollection.

name : string [optional, default ‘transform’]

If path_or_callable is a string giving a path to a Python file to execute, this parameter gives the attribute in that module to use as the callable. Defaults to “transform”, i.e. the Python file specified by path_or_callable is expected to define a function called “transform”.

*args, **kwargs

Additional args and kwargs are passed to the transform function.
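
A transform, as described above, might look like the following minimal sketch. It assumes a list of plain dicts in place of a real ResourceCollection so that it runs without sefara; the path attribute and the prefix keyword argument are hypothetical and only illustrate how extra kwargs could reach the function.

```python
def transform(collection, prefix="/data"):
    """Mutate each resource in place; nothing is returned."""
    for resource in collection:
        if "path" in resource:  # "path" is a hypothetical attribute
            resource["path"] = prefix + "/" + resource["path"]


# Stand-in for a ResourceCollection: a list of dict-like resources.
resources = [{"name": "a", "path": "a.csv"}, {"name": "b"}]
returned = transform(resources, prefix="/mnt/shared")
```

Because the function is named transform, a file containing it would match the default value of the name parameter.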

sefara.hooks.transform_from_environment(collection)

Run the environment-variable-defined transforms on the resources in the collection. The transforms will mutate the resources in this collection; a new ResourceCollection is NOT returned.

See the environment module for the definition of the environment variable used here.

sefara.loading module

Instantiate ResourceCollection instances from files or strings.

sefara.loading.load(filename, format=None, filters=None, transforms=None, environment_transforms=None)

Load a ResourceCollection from a file or URL.

Collections can be defined using either Python or JSON.

Parameters:

filename : string

Path or URL to resource collection. If a path is given, it is equivalent to a “file://<path>” URL. Supports any protocol handled by urlopen, such as HTTP and HTTPS.

Can be the string '-' to read from stdin.

May include a “fragment”, the part of a URL following a “#” symbol, e.g. “file1.py#filter=tags.foo”. The fragment is a query string of key/value pairs separated by “&” symbols, e.g. “file1.txt#filter=tags.foo&format=json”.

Valid fragment keys are:

filter

The value is a Python expression giving a sefara filter.

transform

The value is a path to a Python file with a sefara transform.

format

The value gives the format of the data, either “python” or “json”.

environment_transforms

The value should be “true” or “false” indicating whether to run transforms configured with environment variables.

The fragment operations are processed in order, left to right, and can be specified multiple times. That is, the URL:

file.py#filter=tags.bar&filter=tags.baz

is equivalent to:

file.py#filter=tags.bar and tags.baz

Fragment values can have spaces, and should not be quoted even if they do (as in the above example).
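
To make the fragment syntax concrete, the sketch below splits a URL into its base and an ordered list of fragment operations using only the standard library. It illustrates the format described above and is not sefara's actual parsing code; the helper name split_fragment is hypothetical.

```python
from urllib.parse import parse_qsl, urldefrag


def split_fragment(url):
    """Return (base, operations); operations is an ordered list of (key, value)."""
    base, fragment = urldefrag(url)
    return base, parse_qsl(fragment)


base, operations = split_fragment("file.py#filter=tags.bar&filter=tags.baz")

# Repeated filters behave like a single filter joined with "and".
combined = " and ".join(value for key, value in operations if key == "filter")

# Values containing spaces are kept intact without quoting.
_, spaced = split_fragment("file.py#filter=tags.bar and tags.baz")
```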

format : “python” or “json” [optional]

Format of data. Overrides any setting specified in the filename URL. If it is not specified in either place, it is guessed from the filename extension.

filters : list of strings or callables [optional]

Filters to run on the ResourceCollection, in addition to any specified in the filename. Anything you can pass to ResourceCollection.filter is accepted.

transforms : list of strings [optional]

Transforms to run on the ResourceCollection, in addition to any specified in the filename or from the environment. Anything you can pass to hooks.transform is accepted.

environment_transforms : Boolean [optional]

Whether to run environment_transforms. If specified, this will override the “environment_transforms” fragment setting specified in the filename URL. If not specified in either place, the default is True.

Returns:

ResourceCollection instance.

sefara.loading.loads(data, filename=None, format=None, environment_transforms=True)

Load a ResourceCollection from a string.

Parameters:

data : string

ResourceCollection specification in either Python or JSON.

filename : string [optional]

Filename the data originally came from, used in error messages.

format : string, either “python” or “json” [default: guess from data]

Format of the data.

environment_transforms : Boolean [default: True]

Whether to run transforms configured in environment variables.

Returns:

ResourceCollection instance.

sefara.resource module

class sefara.resource.Resource(name=None, **fields)

Bases: attrdict.mapping.AttrMap

A Resource gives information on how to access some dataset under analysis (such as a file), along with any optional metadata meaningful to the user.

The only required attribute for a resource is name. If name is not specified when the Resource is defined, one is automatically generated.

The tags attribute, if specified, is handled specially. It is stored using the Tags class, a Python set that also supports attribute access as a convenient membership test: if tags is a Tags instance, then tags.foo returns whether the string “foo” is in the set.

RAISE = <object object>

evaluate(expression, error_value=<object object>, extra_bindings={})

Evaluate a Python expression or callable in the context of this resource.

Parameters:

expression : string or callable

If a string, then it should give a valid Python expression. This expression will be evaluated with the attributes of this resource in the local namespace. For example, since the resource has a name attribute, the expression “name.lower()” would return the name in lower case. Tags can be accessed through the tags variable. If the resource has a tag called foo, then the expression “tags.foo” will evaluate to True. If there is no such tag, then “tags.foo” will evaluate to False.

A few common modules are included in the evaluation namespace, including os, sys, collections, re, and json. The resource object itself is also available in the resource variable.

As a hack to support a primitive form of exception handling, a function called on_error is also included in the evaluation namespace. This function takes a single argument, value, of any type and returns None. If on_error is called while evaluating the expression, and the expression subsequently raises an exception, then the exception is caught and value is returned as the value of the expression. This means you can write expressions like:

on_error(False) or foo.startswith("bar")

and if the right side of the expression raises an error (for example, if there is no such attribute foo in the resource), then the value False will be used as the expression’s value. Note that you must write the expression as it is here: put the on_error clause first, and connect it with the main expression with or (this ensures that it gets called before the rest of the expression).

If expression is a callable, then it will be called and passed this Resource instance as its argument.

error_value : object [optional]

If evaluating the expression results in an uncaught exception, the error_value value will be returned instead. If not specified, then evaluate will raise the exception to the caller.

extra_bindings : dict [optional]

Additional local variables to include in the evaluation context.

Returns:

The Python object returned by evaluating the expression.
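
The string-expression behaviour and the on_error idiom described above can be sketched in plain Python as follows. This is an illustrative stand-in, not sefara's implementation: the standalone evaluate function, the _UNSET sentinel, and the use of a dict for the resource's attributes are all assumptions made so that the example runs without the library.

```python
import json
import os
import re

_UNSET = object()  # sentinel meaning "on_error was never called"


def evaluate(attributes, expression, extra_bindings=None):
    """Evaluate a string expression with the attributes as local variables."""
    fallback = [_UNSET]

    def on_error(value):
        # Remember the fallback value; implicitly returns None, which is
        # falsy, so "on_error(x) or <expr>" goes on to evaluate <expr>.
        fallback[0] = value

    namespace = {"os": os, "re": re, "json": json, "on_error": on_error}
    namespace.update(attributes)
    namespace.update(extra_bindings or {})
    try:
        return eval(expression, {}, namespace)
    except Exception:
        if fallback[0] is _UNSET:
            raise  # no fallback registered: propagate to the caller
        return fallback[0]


resource = {"name": "My_Dataset", "path": "/tmp/x.csv"}
lowered = evaluate(resource, "name.lower()")
# "foo" is not an attribute, so the right-hand side raises NameError
# and the value registered with on_error is returned instead.
guarded = evaluate(resource, 'on_error(False) or foo.startswith("bar")')
```

Writing on_error first matters because Python evaluates the operands of or from left to right: the fallback is registered before the main expression has a chance to raise.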

to_plain_types(): Return this resource represented using Python dicts, lists, and strings.

class sefara.resource.Tags(tags)

Bases: set

A set of strings used to group resources.

This class inherits from Python’s set class and supports all the functionality of that class. Additionally, it supports attribute access as a way to test membership: tags.foo will return True if the string “foo” is in the set, and False otherwise.
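
One way such attribute-based membership testing could be built is sketched below. The class name TagSet is hypothetical and the code is a stand-in, not sefara's Tags implementation.

```python
class TagSet(set):
    """Hypothetical stand-in for a set with attribute-style membership tests."""

    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails,
        # so real set methods such as .add or .union are never shadowed.
        if name.startswith("__"):
            raise AttributeError(name)
        return name in self


tags = TagSet(["foo", "bar"])
has_foo = tags.foo   # membership test via attribute access
has_baz = tags.baz
merged = tags.union({"baz"})  # ordinary set behaviour still works
```

Under this design a tag named after a set method (for example "add") could not be tested this way, because tags.add would resolve to the method rather than to a membership test.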

to_plain_types()

sefara.resource.check_valid_tag(tag): Raise an error if the given name is not a valid tag name. Tags must be valid Python identifiers and also not builtin methods of the set class, to avoid ambiguity in using attribute access to test membership in the set.

sefara.resource_collection module

exception sefara.resource_collection.NoCheckers

Bases: exceptions.Exception

class sefara.resource_collection.ResourceCollection(resources, filename='<no file>')

Bases: object

Collection of zero or more resources.

Resources in a ResourceCollection can be accessed either by name, as in resource_collection["my_dataset"], or by index (or slice), as in resource_collection[0].

attributes: The attribute names used by resources in this collection.

filter(expression)

Return a new collection containing only those resources for which expression evaluated to True.

Parameters:

expression : string or callable

If a string, then it should give a valid Python expression. This expression will be evaluated with the attributes of this resource in the local namespace. For example, since the resource has a name attribute, the expression “name.startswith(‘bar’)” is a valid expression. Tags can be accessed through the tags variable. If the resource has a tag called foo, then the expression “tags.foo” will evaluate to True. If there is no such tag, then “tags.foo” will evaluate to False.

If a callable, then it will be called once per resource and passed that Resource instance as its argument.

Returns:

A new ResourceCollection containing those resources for which expression evaluated to True.

select(*expressions, **kwargs)

Select fields (or expressions) from each resource as a pandas DataFrame.

Parameters:

*expressions : string, callable, or (string, string or callable) pair

One or more expressions giving the fields to select.

Each expression can be either a string expression, a callable, or a (string, string or callable) pair giving a label and an expression.

Labels give the column names in the result. Labels can be specified either by giving a (label, expression) pair, or giving a string of the form “LABEL: EXPRESSION”, such as “upper_name: name.upper()”. Here “upper_name” is the label, and “name.upper()” is the expression that will be evaluated. If not specified, labels default to the text of the expression if expression is a string, and an automatically generated label if expression is a callable.

Each expression will be passed to Resource.evaluate for each resource in the collection. See that method’s docs for details on expression evaluation.

if_error : string, one of “raise”, “skip”, or “none” [default: “raise”]

Must be specified as a keyword argument. Controls the behavior when evaluation of an expression raises an uncaught exception. One of:

raise

Raise the exception to the caller. This is the default.

skip

Skip resources where evaluation of any of the expressions raises an error. These resources will be omitted from the result.

none

If evaluating an expression on a resource raises an exception, set that entry in the result to None.

Returns:

A pandas.DataFrame. Rows correspond to resources. Columns correspond to the specified expressions.

select_series(expression)

Select a single field as a pandas Series.

See select.

singleton(raise_on_multiple=True): If this ResourceCollection contains exactly one resource, return it. Otherwise, raise a ValueError.

summary: A string summarizing the resources in this collection, including their attributes.

tags: The tags associated with any resources in this collection.

to_json(indent=4): Return a string giving this collection represented as JSON.

to_plain_types(): Return a representation of this collection using Python dicts, lists, and strings.

to_python(indent=4): Return a string giving this collection represented as Python code.

write(file=None, format=None, indent=None)

Serialize this collection to disk.

Parameters:

file : string or file handle [optional, default: sys.stdout]

Path or file handle to write to.

format : string, one of “python” or “json” [optional]

Output format. If not specified, it is guessed from the filename extension.

indent : int [optional]

Number of spaces to use for indentation.

sefara.util module

A few utility functions and imports.

sefara.util.exec_in_directory(filename=None, code=None)

Execute Python code from either a file or passed as an argument. If a file is specified, the code will be executed with the current working directory set to the directory where the file resides.

If both filename and code are specified, then code is executed, but filename is used to set the current working directory, and in error messages.

Parameters:

filename : string [optional]

Path to file with Python code to execute.

code : string [optional]

Python code to execute

Returns:

dict giving module-level attributes defined by the executed code
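
The behaviour described above could be approximated as in the following sketch. It is an assumption-laden illustration rather than sefara's code: in particular, restoring the previous working directory afterwards and dropping __builtins__ from the returned dict are choices made here, not documented behaviour.

```python
import os
import tempfile


def exec_in_directory(filename=None, code=None):
    """Run code (or the contents of filename) from filename's directory."""
    old_cwd = os.getcwd()
    if filename is not None:
        if code is None:
            with open(filename) as handle:
                code = handle.read()
        os.chdir(os.path.dirname(os.path.abspath(filename)))
    namespace = {}
    try:
        # Passing filename to compile() makes it appear in tracebacks.
        exec(compile(code, filename or "<string>", "exec"), namespace)
    finally:
        os.chdir(old_cwd)  # assumption: restore the caller's directory
    namespace.pop("__builtins__", None)
    return namespace


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "collection.py")
    with open(path, "w") as handle:
        handle.write("import os\nhere = os.path.realpath(os.getcwd())\nanswer = 42\n")
    before = os.getcwd()
    defined = exec_in_directory(path)
    after = os.getcwd()
    expected_dir = os.path.realpath(directory)
```

Running the code from the file's own directory lets a collection file refer to data files by paths relative to itself.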

sefara.util.move_to_front(lst, *items)

Move the specified items to the front of the given list. If an item is not in the list, it is ignored.

Mutates the list. Does not return anything.
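
A function with this behaviour might be written as in the sketch below. It is not sefara's implementation, and it assumes that the moved items end up at the front in the order they were passed, which the description above does not state.

```python
def move_to_front(lst, *items):
    """Move items to the front of lst in place, skipping absent items."""
    # Walk the items in reverse so that, after repeated insert-at-0,
    # they end up at the front in the order they were given (an assumption).
    for item in reversed(items):
        if item in lst:
            lst.remove(item)
            lst.insert(0, item)


columns = ["path", "tags", "name", "size"]
result = move_to_front(columns, "name", "missing", "tags")
# columns is now ["name", "tags", "path", "size"]; "missing" was ignored.
```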

Module contents

sefara.load(filename, format=None, filters=None, transforms=None, environment_transforms=None)

Load a ResourceCollection from a file or URL.

Collections can be defined using either Python or JSON.

Parameters:

filename : string

Path or URL to resource collection. If a path is given, it is equivalent to a “file://<path>” URL. Supports any protocol handled by urlopen, such as HTTP and HTTPS.

Can be the string '-' to read from stdin.

May include a “fragment”, the part of a URL following a “#” symbol, e.g. “file1.py#filter=tags.foo”. The fragment is a query string of key/value pairs separated by “&” symbols, e.g. “file1.txt#filter=tags.foo&format=json”.

Valid fragment keys are:

filter

The value is a Python expression giving a sefara filter.

transform

The value is a path to a Python file with a sefara transform.

format

The value gives the format of the data, either “python” or “json”.

environment_transforms

The value should be “true” or “false” indicating whether to run transforms configured with environment variables.

The fragment operations are processed in order, left to right, and can be specified multiple times. That is, the URL:

file.py#filter=tags.bar&filter=tags.baz

is equivalent to:

file.py#filter=tags.bar and tags.baz

Fragment values can have spaces, and should not be quoted even if they do (as in the above example).

format : “python” or “json” [optional]

Format of data. Overrides any setting specified in the filename URL. If it is not specified in either place, it is guessed from the filename extension.

filters : list of strings or callables [optional]

Filters to run on the ResourceCollection, in addition to any specified in the filename. Anything you can pass to ResourceCollection.filter is accepted.

transforms : list of strings [optional]

Transforms to run on the ResourceCollection, in addition to any specified in the filename or from the environment. Anything you can pass to hooks.transform is accepted.

environment_transforms : Boolean [optional]

Whether to run environment_transforms. If specified, this will override the “environment_transforms” fragment setting specified in the filename URL. If not specified in either place, the default is True.

Returns:

ResourceCollection instance.

sefara.loads(data, filename=None, format=None, environment_transforms=True)

Load a ResourceCollection from a string.

Parameters:

data : string

ResourceCollection specification in either Python or JSON.

filename : string [optional]

Filename the data originally came from, used in error messages.

format : string, either “python” or “json” [default: guess from data]

Format of the data.

environment_transforms : Boolean [default: True]

Whether to run transforms configured in environment variables.

Returns:

ResourceCollection instance.

class sefara.ResourceCollection(resources, filename='<no file>')

Bases: object

Collection of zero or more resources.

Resources in a ResourceCollection can be accessed either by name, as in resource_collection["my_dataset"], or by index (or slice), as in resource_collection[0].

attributes: The attribute names used by resources in this collection.

filter(expression)

Return a new collection containing only those resources for which expression evaluated to True.

Parameters:

expression : string or callable

If a string, then it should give a valid Python expression. This expression will be evaluated with the attributes of this resource in the local namespace. For example, since the resource has a name attribute, the expression “name.startswith(‘bar’)” is a valid expression. Tags can be accessed through the tags variable. If the resource has a tag called foo, then the expression “tags.foo” will evaluate to True. If there is no such tag, then “tags.foo” will evaluate to False.

If a callable, then it will be called once per resource and passed that Resource instance as its argument.

Returns:

A new ResourceCollection containing those resources for which expression evaluated to True.

select(*expressions, **kwargs)

Select fields (or expressions) from each resource as a pandas DataFrame.

Parameters:

*expressions : string, callable, or (string, string or callable) pair

One or more expressions giving the fields to select.

Each expression can be either a string expression, a callable, or a (string, string or callable) pair giving a label and an expression.

Labels give the column names in the result. Labels can be specified either by giving a (label, expression) pair, or giving a string of the form “LABEL: EXPRESSION”, such as “upper_name: name.upper()”. Here “upper_name” is the label, and “name.upper()” is the expression that will be evaluated. If not specified, labels default to the text of the expression if expression is a string, and an automatically generated label if expression is a callable.

Each expression will be passed to Resource.evaluate for each resource in the collection. See that method’s docs for details on expression evaluation.

if_error : string, one of “raise”, “skip”, or “none” [default: “raise”]

Must be specified as a keyword argument. Controls the behavior when evaluation of an expression raises an uncaught exception. One of:

raise

Raise the exception to the caller. This is the default.

skip

Skip resources where evaluation of any of the expressions raises an error. These resources will be omitted from the result.

none

If evaluating an expression on a resource raises an exception, set that entry in the result to None.

Returns:

A pandas.DataFrame. Rows correspond to resources. Columns correspond to the specified expressions.

select_series(expression)

Select a single field as a pandas Series.

See select.

singleton(raise_on_multiple=True): If this ResourceCollection contains exactly one resource, return it. Otherwise, raise a ValueError.

summary: A string summarizing the resources in this collection, including their attributes.

tags: The tags associated with any resources in this collection.

to_json(indent=4): Return a string giving this collection represented as JSON.

to_plain_types(): Return a representation of this collection using Python dicts, lists, and strings.

to_python(indent=4): Return a string giving this collection represented as Python code.

write(file=None, format=None, indent=None)

Serialize this collection to disk.

Parameters:

file : string or file handle [optional, default: sys.stdout]

Path or file handle to write to.

format : string, one of “python” or “json” [optional]

Output format. If not specified, it is guessed from the filename extension.

indent : int [optional]

Number of spaces to use for indentation.

class sefara.Resource(name=None, **fields)

Bases: attrdict.mapping.AttrMap

A Resource gives information on how to access some dataset under analysis (such as a file), along with any optional metadata meaningful to the user.

The only required attribute for a resource is name. If name is not specified when the Resource is defined, one is automatically generated.

The tags attribute, if specified, is handled specially. It is stored using the Tags class, a Python set that also supports attribute access as a convenient membership test: if tags is a Tags instance, then tags.foo returns whether the string “foo” is in the set.

RAISE = <object object>

evaluate(expression, error_value=<object object>, extra_bindings={})

Evaluate a Python expression or callable in the context of this resource.

Parameters:

expression : string or callable

If a string, then it should give a valid Python expression. This expression will be evaluated with the attributes of this resource in the local namespace. For example, since the resource has a name attribute, the expression “name.lower()” would return the name in lower case. Tags can be accessed through the tags variable. If the resource has a tag called foo, then the expression “tags.foo” will evaluate to True. If there is no such tag, then “tags.foo” will evaluate to False.

A few common modules are included in the evaluation namespace, including os, sys, collections, re, and json. The resource object itself is also available in the resource variable.

As a hack to support a primitive form of exception handling, a function called on_error is also included in the evaluation namespace. This function takes a single argument, value, of any type and returns None. If on_error is called while evaluating the expression, and the expression subsequently raises an exception, then the exception is caught and value is returned as the value of the expression. This means you can write expressions like:

on_error(False) or foo.startswith("bar")

and if the right side of the expression raises an error (for example, if there is no such attribute foo in the resource), then the value False will be used as the expression’s value. Note that you must write the expression as it is here: put the on_error clause first, and connect it with the main expression with or (this ensures that it gets called before the rest of the expression).

If expression is a callable, then it will be called and passed this Resource instance as its argument.

error_value : object [optional]

If evaluating the expression results in an uncaught exception, the error_value value will be returned instead. If not specified, then evaluate will raise the exception to the caller.

extra_bindings : dict [optional]

Additional local variables to include in the evaluation context.

Returns:

The Python object returned by evaluating the expression.

to_plain_types(): Return this resource represented using Python dicts, lists, and strings.

sefara.export(*args, **kwargs)

Create and export a Resource with the specified attributes.

All arguments are passed to Resource.

sefara.export_resources(resources)

Export one or more Resource instances.

Parameters:

resources : list of Resource instances

Resource instances to be exported.

sefara.transform_exports(path_or_callable)

Transform the resources exported by this collection.

Parameters:

path_or_callable : string or callable

Passed to hooks.transform; see those docs for details.