sefara package
Submodules
sefara.environment module
Environment variables used by sefara.
sefara.exporting module
Functions to define a resource collection.
These functions should be called only from sefara resource collections, not from normal scripts.
- sefara.exporting.export(*args, **kwargs)
Create and export a Resource with the specified attributes.
All arguments are passed to Resource.
- sefara.exporting.export_resources(resources)
Export one or more Resource instances.
Parameters:
resources : list of Resource instances
    Resource instances to be exported.
- sefara.exporting.transform_exports(path_or_callable)
Transform the resources exported by this collection.
Parameters:
path_or_callable : string or callable
    Passed to hooks.transform; see those docs for details.
sefara.hooks module
Hooks are a mechanism for running site-specific transforms or validation routines on Sefara resources.
- sefara.hooks.check(collection, checkers=None, include_environment_checkers=True)
Run “checkers”, specified either as an argument or through environment variables, on the resource collection.
Checkers validate that the resources in a collection meet user-defined criteria, for example, that the paths they point to exist.
Parameters:
checkers : list of either callables, strings, or tuples [optional]
    If tuples, the elements are (path, name, args, kwargs).
    Like transforms, checkers are called with this ResourceCollection instance as an argument. Unlike transforms, checkers should NOT mutate the resources. They are expected to return a list or generator of three-element tuples, (resource, attempted, problem), where resource is a Resource in this collection, attempted is True if validation was attempted on this resource, and problem is a string error message if validation was unsuccessful (None otherwise).
    By returning attempted=False in the tuple above, checkers indicate that a resource did not conform to the schema the checker knows how to validate. For example, a checker might verify that files pointed to by resources exist, but some resource may not specify any file. In this case, the checker should set attempted=False. Another checker in use may know how to validate that resource. The sefara-check tool reports any resources that were not attempted by any checker as an error.
    Checkers should generate a tuple for every resource in the collection, in the order they appear in the collection.
include_environment_checkers : boolean [optional, default: True]
    If True, then checkers configured in environment variables are run, in addition to any checkers specified in the checkers argument.
    See the environment module for the definition of the environment variable used here.
Returns:
    Generator giving (resource, tuples) pairs, where resource is a Resource in this collection, and tuples is a list of (checker, attempted, error) giving whether each checker attempted validation of that resource, and, if so, the result. error is None if validation was successful, otherwise a string giving the error.
- sefara.hooks.run_hook(collection, path_or_callable, name, *args, **kwargs)
Invoke a Python callable (either passed directly or defined in the specified Python file) on the ResourceCollection and return the result.
Parameters:
path_or_callable : string or callable
    If a string, it is interpreted as a path to a Python file. The file will be exec’d and is expected to define a module attribute given by the name argument. This attribute will be used as the callable. It should take a ResourceCollection instance as an argument.
    Otherwise, this parameter should be a callable that takes a ResourceCollection instance. It will be invoked on this ResourceCollection.
name : string [optional, default ‘transform’]
    If path_or_callable is a string giving a path to a Python file to execute, this parameter gives the attribute in that module to use as the callable.
*args, **kwargs
    Additional args and kwargs are passed to the callable after the ResourceCollection.
- sefara.hooks.transform(collection, path_or_callable, name='transform', *args, **kwargs)
Run a function on the resources in the collection. The function is expected to mutate the resources in the collection; a new collection is NOT returned.
Parameters:
path_or_callable : string or callable
    If a string, it is interpreted as a path to a Python file. The file will be exec’d and is expected to define a module attribute given by the name argument. This attribute will be used as the callable. It should take a ResourceCollection instance as an argument.
    Otherwise, this parameter should be a callable that takes a ResourceCollection instance. It will be invoked on this ResourceCollection.
name : string [optional, default ‘transform’]
    If path_or_callable is a string giving a path to a Python file to execute, this parameter gives the attribute in that module to use as the callable. Defaults to “transform”, i.e. the Python file specified by path_or_callable is expected to define a function called “transform”.
*args, **kwargs
    Additional args and kwargs are passed to the transform function.
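A transform function of the shape described above might look like the following sketch. It assumes resources behave like mappings; plain dicts stand in for them here, and the prefix keyword argument and the path attribute are hypothetical, chosen only to show extra keyword arguments being passed through.

```python
def transform(collection, prefix="/data"):
    """Hypothetical transform: make relative 'path' attributes absolute."""
    for resource in collection:
        path = resource.get("path")
        if path is not None and not path.startswith("/"):
            resource["path"] = prefix + "/" + path
    # Nothing is returned: the resources are mutated in place.


# Plain dicts stand in for Resource instances in this sketch.
resources = [
    {"name": "a", "path": "a.csv"},
    {"name": "b", "path": "/abs/b.csv"},
    {"name": "c"},
]
result = transform(resources, prefix="/mnt/data")
```

Because the function is named transform, a file containing it would match the default value of the name parameter.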
- sefara.hooks.transform_from_environment(collection)
Run the environment-variable-defined transforms on the resources in the collection. The transforms will mutate the resources in this collection; a new ResourceCollection is NOT returned.
See the environment module for the definition of the environment variable used here.
sefara.loading module
Instantiate ResourceCollection instances from files or strings.
- sefara.loading.load(filename, format=None, filters=None, transforms=None, environment_transforms=None)
Load a ResourceCollection from a file or URL.
Collections can be defined using either Python or JSON.
Parameters:
filename : string
    Path or URL to resource collection. If a path is given, it is equivalent to a “file://<path>” URL. Supports any protocol handled by urlopen, such as HTTP and HTTPS.
    Can be the string ‘-’ to read from stdin.
    May include a “fragment”, the part of a URL following a “#” symbol, e.g. “file1.py#filter=tags.foo”. The fragment is a query string of key/value pairs separated by “&” symbols, e.g. “file1.txt#filter=tags.foo&format=json”.
    Valid fragment keys are:
    - filter
      The value is a Python expression giving a sefara filter.
    - transform
      The value is a path to a Python file with a sefara transform.
    - format
      The value gives the format of the data, either “python” or “json”.
    - environment_transforms
      The value should be “true” or “false”, indicating whether to run transforms configured with environment variables.
    The fragment operations are processed in order, left to right, and can be specified multiple times. That is, the URL:
        file.py#filter=tags.bar&filter=tags.baz
    is equivalent to:
        file.py#filter=tags.bar and tags.baz
    Fragment values can have spaces, and should not be quoted even if they do (as in the above example).
format : “python” or “json” [optional]
    Format of data. Overrides any setting specified in the filename URL. If it is not specified in either place, it is guessed from the filename extension.
filters : list of strings or callables [optional]
    Filters to run on the ResourceCollection, in addition to any specified in the filename. Anything you can pass to ResourceCollection.filter is accepted.
transforms : list of strings [optional]
    Transforms to run on the ResourceCollection, in addition to any specified in the filename or from the environment. Anything you can pass to hooks.transform is accepted.
environment_transforms : Boolean [optional]
    Whether to run environment_transforms. If specified, this will override the “environment_transforms” fragment setting specified in the filename URL. If not specified in either place, the default is True.
Returns:
    ResourceCollection instance.
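To make the fragment syntax concrete, here is a small sketch that splits a filename into its path and an ordered list of fragment operations using Python's standard urllib.parse. It is an illustration only and is not taken from sefara's own parser; split_fragment and combined_filter are hypothetical helper names.

```python
from urllib.parse import parse_qsl, urldefrag


def split_fragment(filename):
    """Split 'path#key=value&key=value' into (path, [(key, value), ...])."""
    path, fragment = urldefrag(filename)
    # parse_qsl preserves the left-to-right order and repeated keys.
    operations = parse_qsl(fragment, keep_blank_values=True)
    return path, operations


def combined_filter(operations):
    """Join repeated 'filter' operations into one expression with 'and'."""
    filters = [value for key, value in operations if key == "filter"]
    return " and ".join("(%s)" % f for f in filters)


path, operations = split_fragment(
    "file.py#filter=tags.bar&filter=tags.baz&format=python")
```

Here operations keeps both filter entries in order, which is why repeating the filter key behaves like joining the expressions with "and". A value containing spaces, as in "file.py#filter=tags.bar and tags.baz", comes through as a single unquoted value.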
- sefara.loading.loads(data, filename=None, format=None, environment_transforms=True)
Load a ResourceCollection from a string.
Parameters:
data : string
    ResourceCollection specification in either Python or JSON.
filename : string [optional]
    Filename where this data originally came from, for use in error messages.
format : string, either “python” or “json” [default: guess from data]
    Format of the data.
environment_transforms : Boolean [default: True]
    Whether to run transforms configured in environment variables.
Returns:
    ResourceCollection instance.
sefara.resource module
- class sefara.resource.Resource(name=None, **fields)
Bases: attrdict.mapping.AttrMap
A Resource gives information on how to access some dataset under analysis (such as a file), along with any optional metadata meaningful to the user.
The only required attribute for a resource is name. If name is not specified when the Resource is defined, one is automatically generated.
The tags attribute, if specified, is handled specially. It is stored using the Tags class, which is a Python set that also supports a convenient method of membership testing. If tags is a Tags instance, then tags.foo returns whether the string “foo” is in the set.
- RAISE = <object object>
- evaluate(expression, error_value=<object object>, extra_bindings={})
Evaluate a Python expression or callable in the context of this resource.
Parameters:
expression : string or callable
    If a string, then it should give a valid Python expression. This expression will be evaluated with the attributes of this resource in the local namespace. For example, since the resource has a name attribute, the expression “name.lower()” would return the name in lower case. Tags can be accessed through the tags variable. If the resource has a tag called foo, then the expression “tags.foo” will evaluate to True. If there is no such tag, then “tags.foo” will evaluate to False.
    A few common modules are included in the evaluation namespace, including os, sys, collections, re, and json. The resource object itself is also available in the resource variable.
    As a hack to support a primitive form of exception handling, a function called on_error is also included in the evaluation namespace. This function takes a single argument, value, of any type and returns None. If on_error is called while evaluating the expression, and the expression subsequently raises an exception, then the exception is caught and value is returned as the value of the expression. This means you can write expressions like:
        on_error(False) or foo.startswith("bar")
    and if the right side of the expression raises an error (for example, if there is no such attribute foo in the resource), then the value False will be used as the expression’s value. Note that you must write the expression as it is here: put the on_error clause first, and connect it to the main expression with or (this ensures that it gets called before the rest of the expression).
    If expression is a callable, then it will be called and passed this Resource instance as its argument.
error_value : object [optional]
    If evaluating the expression results in an uncaught exception, the error_value value will be returned instead. If not specified, then evaluate will raise the exception to the caller.
extra_bindings : dict [optional]
    Additional local variables to include in the evaluation context.
Returns:
    The Python object returned by evaluating the expression.
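The on_error behaviour described above can be reproduced in a few lines. The following is a sketch of one way such a mechanism could work, written for illustration; it is not sefara's source, and evaluate_sketch and its bindings argument are hypothetical.

```python
_RAISE = object()  # sentinel meaning "no error_value given, so re-raise"


def evaluate_sketch(expression, bindings, error_value=_RAISE):
    """Evaluate a string expression with bindings as local variables."""
    recorded = []

    def on_error(value):
        # Remember the fallback and return None, so that
        # "on_error(x) or <expr>" goes on to evaluate <expr>.
        recorded.append(value)
        return None

    namespace = dict(bindings)
    namespace["on_error"] = on_error
    try:
        return eval(expression, {}, namespace)
    except Exception:
        if recorded:
            return recorded[-1]
        if error_value is not _RAISE:
            return error_value
        raise


fallback = evaluate_sketch(
    'on_error(False) or foo.startswith("bar")', {"name": "sample"})
normal = evaluate_sketch(
    'on_error(False) or name.startswith("sam")', {"name": "sample"})
```

Because on_error returns None, the "or" always continues to the right-hand side. If on_error appeared after the main expression, an exception would be raised before the fallback was recorded, which is why the ordering matters.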
- class sefara.resource.Tags(tags)
Bases: set
A set of strings used to group resources.
This class inherits from Python’s set class and supports all the functionality of that class. Additionally, it supports attribute access as a way to test membership: tags.foo will return True if the string “foo” is in the set, and False otherwise.
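A set subclass with attribute-style membership testing can be sketched as below. This is an assumption about how such a class might be written, not sefara's implementation; the class name SketchTags is hypothetical.

```python
class SketchTags(set):
    """A set of strings where tags.foo tests whether "foo" is a member."""

    def __getattr__(self, name):
        # __getattr__ is only consulted when normal attribute lookup fails,
        # so set methods such as add() and union() keep working.
        if name.startswith("__"):
            # Leave special-method probes (copying, pickling) alone.
            raise AttributeError(name)
        return name in self


tags = SketchTags(["foo", "bar"])
has_foo = tags.foo
has_baz = tags.baz
```

One consequence of this design is that a tag whose name matches a set method (for example "add") cannot be tested through attribute access in this sketch; the ordinary "add" in tags form still works.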
sefara.resource_collection module
- class sefara.resource_collection.ResourceCollection(resources, filename='<no file>')
Bases: object
Collection of zero or more resources.
Resources in a ResourceCollection can be accessed either by name:
    resource_collection["my_dataset"]
or by index (or slice):
    resource_collection[0]
- attributes
The attribute names used by resources in this collection.
- filter(expression)
Return a new collection containing only those resources for which expression evaluated to True.
Parameters:
expression : string or callable
    If a string, then it should give a valid Python expression. This expression will be evaluated with the attributes of each resource in the local namespace. For example, since each resource has a name attribute, the expression “name.startswith(‘bar’)” is a valid expression. Tags can be accessed through the tags variable. If the resource has a tag called foo, then the expression “tags.foo” will evaluate to True. If there is no such tag, then “tags.foo” will evaluate to False.
    If a callable, then it will be called and passed each Resource instance as its argument.
Returns:
    A new ResourceCollection containing those resources for which expression evaluated to True.
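The two accepted forms of a filter expression, a string or a callable, can be illustrated with a small stand-alone sketch. Plain dicts stand in for resources and a list stands in for the collection; filter_resources and _SketchTags are hypothetical names, and the evaluation shown is a simplification of what the text describes.

```python
class _SketchTags(set):
    """Stand-in for a tag set where tags.foo tests membership."""

    def __getattr__(self, name):
        if name.startswith("__"):
            raise AttributeError(name)
        return name in self


def filter_resources(resources, expression):
    """Return a new list of the resources for which expression is true."""
    kept = []
    for resource in resources:
        if callable(expression):
            result = expression(resource)
        else:
            # The resource's attributes become local variables.
            bindings = dict(resource)
            bindings["tags"] = _SketchTags(resource.get("tags", ()))
            result = eval(expression, {}, bindings)
        if result:
            kept.append(resource)
    return kept


resources = [
    {"name": "bar_1", "tags": ["foo"]},
    {"name": "baz_2", "tags": ["foo", "big"]},
    {"name": "bar_3", "tags": []},
]
by_string = filter_resources(resources, "tags.foo and name.startswith('bar')")
by_callable = filter_resources(resources, lambda r: "big" in r["tags"])
```

The input list is left untouched and a new list is returned, matching the documented behaviour of returning a new collection.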
- select(*expressions, **kwargs)
Select fields (or expressions) from each resource as a pandas DataFrame.
Parameters:
*expressions : string, callable, or (string, string or callable) pair
    One or more expressions giving the fields to select.
    Each expression can be either a string expression, a callable, or a (string, string or callable) pair giving a label and an expression.
    Labels give the column names in the result. Labels can be specified either by giving a (label, expression) pair, or by giving a string of the form “LABEL: EXPRESSION”, such as “upper_name: name.upper()”. Here “upper_name” is the label, and “name.upper()” is the expression that will be evaluated. If not specified, labels default to the text of the expression if expression is a string, and to an automatically generated label if expression is a callable.
    Each expression will be passed to Resource.evaluate for each resource in the collection. See that method’s docs for details on expression evaluation.
if_error : string, one of “raise”, “skip”, or “none” [default: “raise”]
    Must be specified as a keyword argument. Controls the behavior when evaluation of an expression raises an uncaught exception. One of:
    - raise
      Raise the exception to the caller. This is the default.
    - skip
      Skip resources where evaluation of any of the expressions raises an error. These resources will be omitted from the result.
    - none
      If evaluating an expression on a resource raises an exception, set that entry in the result to None.
Returns:
    A pandas.DataFrame. Rows correspond to resources. Columns correspond to the specified expressions.
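The three if_error modes can be illustrated without pandas. The sketch below builds a list of row dicts instead of a DataFrame and accepts only (label, expression) pairs with string expressions; select_rows is a hypothetical helper, not part of sefara.

```python
def select_rows(resources, expressions, if_error="raise"):
    """Evaluate (label, expression) pairs per resource into row dicts."""
    if if_error not in ("raise", "skip", "none"):
        raise ValueError("unknown if_error mode: %r" % (if_error,))
    rows = []
    for resource in resources:
        row = {}
        skipped = False
        for label, expression in expressions:
            try:
                row[label] = eval(expression, {}, dict(resource))
            except Exception:
                if if_error == "raise":
                    raise
                if if_error == "skip":
                    # Drop the whole resource from the result.
                    skipped = True
                    break
                row[label] = None  # if_error == "none"
        if not skipped:
            rows.append(row)
    return rows


# Plain dicts stand in for Resource instances; the second has no "size".
resources = [{"name": "a", "size": 3}, {"name": "b"}]
expressions = [("upper_name", "name.upper()"), ("size", "size")]
with_none = select_rows(resources, expressions, if_error="none")
with_skip = select_rows(resources, expressions, if_error="skip")
```

With "none" every resource keeps its row and the failing entry becomes None; with "skip" the resource that raised is omitted entirely.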
- singleton(raise_on_multiple=True)
If this ResourceCollection contains exactly 1 resource, return it. Otherwise, raise a ValueError.
- summary
A string summarizing the resources in this collection, including their attributes.
The tags associated with any resources in this collection.
- to_plain_types()
Return a representation of this collection using Python dicts, lists, and strings.
- write(file=None, format=None, indent=None)
Serialize this collection to disk.
Parameters:
file : string or file handle [optional, default: sys.stdout]
    Path or file handle to write to.
format : string, one of “python” or “json” [optional]
    Output format. If not specified, it is guessed from the filename extension.
indent : int [optional]
    Number of spaces to use for indentation.
sefara.util module
A few utility functions and imports.
- sefara.util.exec_in_directory(filename=None, code=None)
Execute Python code, either read from a file or passed as an argument. If a file is specified, the code will be executed with the current working directory set to the directory where the file resides.
If both filename and code are specified, then code is executed, but filename is used to set the current working directory and in error messages.
Parameters:
filename : string [optional]
    Path to file with Python code to execute.
code : string [optional]
    Python code to execute.
Returns:
    dict giving module-level attributes defined by the executed code
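The behaviour described above (run code with the working directory temporarily set to the file's directory, then return the names it defined) can be sketched as follows. This is an illustrative re-implementation under stated assumptions, not sefara's code; exec_in_directory_sketch is a hypothetical name.

```python
import os
import tempfile


def exec_in_directory_sketch(filename=None, code=None):
    """Run code (or the file's contents) from the file's own directory."""
    if code is None:
        with open(filename) as handle:
            code = handle.read()
    original_directory = os.getcwd()
    namespace = {}
    try:
        if filename is not None:
            os.chdir(os.path.dirname(os.path.abspath(filename)))
        # Passing the filename to compile() makes it show up in tracebacks.
        exec(compile(code, filename or "<string>", "exec"), namespace)
    finally:
        # Always restore the caller's working directory.
        os.chdir(original_directory)
    # Drop the __builtins__ entry that exec() adds to the namespace.
    namespace.pop("__builtins__", None)
    return namespace


before = os.getcwd()
directory = tempfile.mkdtemp()
script = os.path.join(directory, "collection.py")
with open(script, "w") as handle:
    handle.write("import os\nworking_directory = os.getcwd()\n")
attributes = exec_in_directory_sketch(script)
```

Changing directory this way lets a collection file refer to data files by paths relative to itself; the try/finally ensures the caller's working directory is restored even if the executed code raises.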
Module contents
The following names are also available directly from the sefara package. Their signatures and documentation are identical to the corresponding submodule entries above.
- sefara.load(filename, format=None, filters=None, transforms=None, environment_transforms=None)
See sefara.loading.load.
- sefara.loads(data, filename=None, format=None, environment_transforms=True)
See sefara.loading.loads.
- class sefara.ResourceCollection(resources, filename='<no file>')
See sefara.resource_collection.ResourceCollection.
- class sefara.Resource(name=None, **fields)
See sefara.resource.Resource.
- sefara.export(*args, **kwargs)
See sefara.exporting.export.
- sefara.export_resources(resources)
See sefara.exporting.export_resources.
- sefara.transform_exports(path_or_callable)
See sefara.exporting.transform_exports.