sefara package
Submodules
sefara.environment module
Environment variables used by sefara.
sefara.exporting module
Functions to define a resource collection.
These functions should be called only from sefara resource collections, not normal scripts.
-
sefara.exporting.export(*args, **kwargs)
Create and export a Resource with the specified attributes.
All arguments are passed to Resource.
-
sefara.exporting.export_resources(resources)
Export one or more Resource instances.
Parameters: resources : list of Resource instances
Resource instances to be exported.
-
sefara.exporting.transform_exports(path_or_callable)
Transform the resources exported by this collection.
Parameters: path_or_callable : string or callable
Passed to hooks.transform; see those docs for details.
sefara.hooks module
Hooks are a mechanism for running site-specific transforms or validation routines on Sefara resources.
-
sefara.hooks.check(collection, checkers=None, include_environment_checkers=True)
Run “checkers”, specified either as an argument or through environment variables, on the resource collection.
Checkers validate that the resources in a collection meet user-defined criteria, for example, that the paths they point to exist.
Parameters: checkers : list of callables, strings, or tuples [optional]
If tuples, the elements are (path, name, args, kwargs).
Like transforms, checkers are called with this ResourceCollection instance as an argument. Unlike transforms, checkers should NOT mutate the resources. They are expected to return a list or generator of three-element tuples, (resource, attempted, problem), where resource is a Resource in this collection, attempted is True if validation was attempted on this resource, and problem is a string error message if validation was unsuccessful (None otherwise).
By returning attempted=False in the tuple above, a checker indicates that a resource did not conform to the schema the checker knows how to validate. For example, a checker might verify that files pointed to by resources exist, but some resources may not specify any file. In this case, the checker should set attempted=False. Another checker in use may know how to validate that resource. The sefara-check tool reports any resource that was not attempted by any checker as an error.
Checkers should generate a tuple for every resource in the collection, in the order the resources appear in the collection.
include_environment_checkers : boolean [optional, default: True]
If True, checkers configured in environment variables are run in addition to any checkers specified in the checkers argument.
See the environment module for the definition of the environment variable used here.
Returns: Generator giving (resource, tuples) pairs, where resource is a Resource in this collection and tuples is a list of (checker, attempted, error) giving whether each checker attempted validation of that resource and, if so, the result. error is None if validation was successful, otherwise a string giving the error.
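A checker that follows the contract described above might look like the sketch below. It is illustrative only: plain dicts stand in for Resource objects, and the path attribute and the function name check_paths_exist are assumptions made for this example, not part of sefara.

```python
import os


def check_paths_exist(collection):
    """Yield one (resource, attempted, problem) tuple per resource, in order."""
    for resource in collection:
        path = resource.get("path")  # "path" is a hypothetical attribute
        if path is None:
            # Nothing this checker knows how to validate.
            yield (resource, False, None)
        elif os.path.exists(path):
            yield (resource, True, None)
        else:
            yield (resource, True, "path does not exist: %s" % path)


# Stand-in collection: plain dicts instead of sefara Resource objects.
resources = [
    {"name": "here", "path": os.curdir},
    {"name": "gone", "path": os.path.join(os.curdir, "no-such-file-sefara-example")},
    {"name": "no_path"},
]
results = list(check_paths_exist(resources))
```

The third resource has no path, so the checker reports attempted=False and leaves it for some other checker to validate.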
-
sefara.hooks.run_hook(collection, path_or_callable, name, *args, **kwargs)
Invoke a Python callable (either passed directly or defined in the specified Python file) on the ResourceCollection and return the result.
Parameters: path_or_callable : string or callable
If a string, this is interpreted as a path to a Python file. The file will be exec’d and is expected to define a module attribute given by the name argument. This attribute will be used as the callable. It should take a ResourceCollection instance as an argument.
Otherwise, this parameter should be a callable that takes a ResourceCollection instance. It will be invoked on this ResourceCollection.
name : string [optional, default ‘transform’]
If path_or_callable is a string giving a path to a Python file to execute, this parameter gives the attribute in that module to use as the callable.
*args, **kwargs
Additional args and kwargs are passed to the callable after the ResourceCollection.
-
sefara.hooks.transform(collection, path_or_callable, name='transform', *args, **kwargs)
Run a function on the resources in the collection. The function is expected to mutate the resources in the collection; a new collection is NOT returned.
Parameters: path_or_callable : string or callable
If a string, this is interpreted as a path to a Python file. The file will be exec’d and is expected to define a module attribute given by the name argument. This attribute will be used as the callable. It should take a ResourceCollection instance as an argument.
Otherwise, this parameter should be a callable that takes a ResourceCollection instance. It will be invoked on this ResourceCollection.
name : string [optional, default ‘transform’]
If path_or_callable is a string giving a path to a Python file to execute, this parameter gives the attribute in that module to use as the callable. Defaults to “transform”, i.e. the Python file specified by path_or_callable is expected to define a function called “transform”.
*args, **kwargs
Additional args and kwargs are passed to the transform function.
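As a rough illustration of what a transform file could contain, the sketch below defines a function called transform that mutates resources in place and returns nothing. Plain dicts stand in for Resource objects, and the path attribute and the two directory prefixes are hypothetical.

```python
def transform(collection):
    """Rewrite a path prefix on every resource that has one, in place."""
    for resource in collection:
        if "path" in resource:  # "path" is a hypothetical attribute
            resource["path"] = resource["path"].replace("/old/root/", "/new/root/")


# Stand-in collection: plain dicts instead of sefara Resource objects.
resources = [
    {"name": "a", "path": "/old/root/a.csv"},
    {"name": "b"},
]
result = transform(resources)
```

Because the function works by mutation, its return value is not meaningful; the caller keeps using the same collection object.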
-
sefara.hooks.transform_from_environment(collection)
Run the environment-variable-defined transforms on the resources in the collection. The transforms will mutate the resources in this collection; a new ResourceCollection is NOT returned.
See the environment module for the definition of the environment variable used here.
sefara.loading module
Instantiate ResourceCollection instances from files or strings.
-
sefara.loading.load(filename, format=None, filters=None, transforms=None, environment_transforms=None)
Load a ResourceCollection from a file or URL.
Collections can be defined using either Python or JSON.
Parameters: filename : string
Path or URL to resource collection. If a path is given, it is equivalent to a “file://<path>” URL. Supports any protocol handled by urlopen, such as HTTP and HTTPS.
Can be the string ‘-’ to read from stdin.
May include a “fragment”, the part of a URL following a “#” symbol, e.g. “file1.py#filter=tags.foo”. The fragment is a query string of key/value pairs separated by “&” symbols, e.g. “file1.txt#filter=tags.foo&format=json”.
Valid fragment keys are:
- filter
The value is a Python expression giving a sefara filter.
- transform
The value is a path to a Python file with a sefara transform.
- format
The value gives the format of the data, either “python” or “json”.
- environment_transforms
The value should be “true” or “false” indicating whether to run transforms configured with environment variables.
The fragment operations are processed in order, left to right, and can be specified multiple times. That is, the URL:
file.py#filter=tags.bar&filter=tags.baz
is equivalent to:
file.py#filter=tags.bar and tags.baz
Fragment values can contain spaces and should not be quoted even if they do (as in the example above).
format : “python” or “json” [optional]
Format of data. Overrides any setting specified in the filename URL. If it is not specified in either place, it is guessed from the filename extension.
filters : list of strings or callables [optional]
Filters to run on the ResourceCollection, in addition to any specified in the filename. Anything you can pass to ResourceCollection.filter is accepted.
transforms : list of strings [optional]
Transforms to run on the ResourceCollection, in addition to any specified in the filename or from the environment. Anything you can pass to hooks.transform is accepted.
environment_transforms : Boolean [optional]
Whether to run transforms configured with environment variables. If specified, this overrides the “environment_transforms” fragment setting in the filename URL. If not specified in either place, the default is True.
Returns: ResourceCollection instance.
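The fragment syntax described for the filename parameter can be pictured with the standard library alone. The sketch below is not sefara's own parser; the helper name split_fragment is an assumption, and it only shows how a filename splits into a base and an ordered list of key/value operations.

```python
from urllib.parse import parse_qsl, urldefrag


def split_fragment(filename):
    """Return (base, operations); operations is an ordered list of (key, value)."""
    base, fragment = urldefrag(filename)
    return base, parse_qsl(fragment, keep_blank_values=True)


# Repeated keys are kept in left-to-right order.
base, operations = split_fragment("file.py#filter=tags.bar&filter=tags.baz&format=json")

# Unquoted spaces survive inside a value.
spaced = split_fragment("file.py#filter=tags.bar and tags.baz")
```

Note that parse_qsl decodes percent-escapes and turns “+” into a space, so a real parser for filter expressions may need different handling of those characters.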
-
sefara.loading.loads(data, filename=None, format=None, environment_transforms=True)
Load a ResourceCollection from a string.
Parameters: data : string
ResourceCollection specification in either Python or JSON.
filename : string [optional]
Filename where this data originally came from, used in error messages.
format : string, either “python” or “json” [default: guess from data]
Format of the data.
environment_transforms : Boolean [default: True]
Whether to run transforms configured in environment variables.
Returns: ResourceCollection instance.
sefara.resource module
-
class sefara.resource.Resource(name=None, **fields)
Bases: attrdict.mapping.AttrMap
A Resource gives information on how to access some dataset under analysis (such as a file), along with any optional metadata meaningful to the user.
The only required attribute for a resource is name. If name is not specified when the Resource is defined, one is automatically generated.
The tags attribute, if specified, is handled specially. It is stored using the Tags class, which is a Python set that also supports a convenient method of membership testing. If tags is a Tags instance, then tags.foo returns whether the string “foo” is in the set.
-
RAISE = <object object>
-
evaluate(expression, error_value=<object object>, extra_bindings={})
Evaluate a Python expression or callable in the context of this resource.
Parameters: expression : string or callable
If a string, it should give a valid Python expression. This expression will be evaluated with the attributes of this resource in the local namespace. For example, since the resource has a name attribute, the expression “name.lower()” would return the name in lower case. Tags can be accessed through the tags variable. If the resource has a tag called foo, then the expression “tags.foo” will evaluate to True. If there is no such tag, then “tags.foo” will evaluate to False.
A few common modules are included in the evaluation namespace, including os, sys, collections, re, and json. The resource object itself is also available in the resource variable.
As a hack to support a primitive form of exception handling, a function called on_error is also included in the evaluation namespace. This function takes a single argument, value, of any type and returns None. If on_error is called while evaluating the expression, and the expression subsequently raises an exception, then the exception is caught and value is returned as the value of the expression. This means you can write expressions like:
on_error(False) or foo.startswith("bar")
and if the right side of the expression raises an error (for example, if there is no such attribute foo in the resource), then the value False will be used as the expression’s value. Note that you must write the expression as it is here: put the on_error clause first, and connect it to the main expression with or (this ensures that it gets called before the rest of the expression).
If expression is a callable, it will be called and passed this Resource instance as its argument.
error_value : object [optional]
If evaluating the expression results in an uncaught exception, error_value will be returned instead. If not specified, evaluate will raise the exception to the caller.
extra_bindings : dict [optional]
Additional local variables to include in the evaluation context.
Returns: The Python object returned by evaluating the expression.
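The on_error mechanism described above can be reproduced in a few lines. The sketch below is one possible way to get that behavior, not sefara's actual implementation; evaluate_sketch and its bindings argument are hypothetical names.

```python
_UNSET = object()


def evaluate_sketch(expression, bindings):
    """Evaluate a string expression, honouring on_error(value) fallbacks."""
    fallback = [_UNSET]

    def on_error(value):
        # Record the fallback; the implicit None return lets `or` continue.
        fallback[0] = value

    namespace = dict(bindings, on_error=on_error)
    try:
        return eval(expression, {}, namespace)
    except Exception:
        if fallback[0] is _UNSET:
            raise  # on_error was never called, so propagate the exception
        return fallback[0]


expression = 'on_error(False) or foo.startswith("bar")'
missing = evaluate_sketch(expression, {})                  # no attribute "foo"
present = evaluate_sketch(expression, {"foo": "barbell"})  # attribute exists
```

The sketch also shows why the on_error clause has to come first: if the main expression were evaluated before it, the exception would be raised before any fallback had been recorded.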
-
-
class sefara.resource.Tags(tags)
Bases: set
A set of strings used to group resources.
This class inherits from Python’s set class and supports all the functionality of that class. Additionally, it supports attribute access as a way to test membership: tags.foo will return True if the string “foo” is in the set, and False otherwise.
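Attribute-style membership testing on a set subclass can be sketched as follows. TagSet is a hypothetical stand-in written for this example; the real Tags class may be implemented differently.

```python
class TagSet(set):
    """A set whose unknown attributes act as membership tests."""

    def __getattr__(self, name):
        # __getattr__ runs only when normal attribute lookup fails.
        if name.startswith("__"):
            raise AttributeError(name)  # leave special-method lookups alone
        return name in self


tags = TagSet(["foo", "bar"])
has_foo = tags.foo
has_baz = tags.baz
```

In this sketch, names that are already set methods (such as add or union) resolve to those methods, so tags with such names would have to be tested with the in operator instead.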
sefara.resource_collection module
-
class sefara.resource_collection.ResourceCollection(resources, filename='<no file>')
Bases: object
Collection of zero or more resources.
Resources in a ResourceCollection can be accessed either by name:
resource_collection["my_dataset"]
or by index (or slice):
resource_collection[0]
-
attributes
The attribute names used by resources in this collection.
-
filter(expression)
Return a new collection containing only those resources for which expression evaluated to True.
Parameters: expression : string or callable
If a string, it should give a valid Python expression. This expression will be evaluated for each resource, with that resource’s attributes in the local namespace. For example, since every resource has a name attribute, “name.startswith(‘bar’)” is a valid expression. Tags can be accessed through the tags variable. If the resource has a tag called foo, then the expression “tags.foo” will evaluate to True. If there is no such tag, then “tags.foo” will evaluate to False.
If a callable, it will be called once per resource and passed that Resource instance as its argument.
Returns: A new ResourceCollection containing those resources for which expression evaluated to True.
-
select(*expressions, **kwargs)
Select fields (or expressions) from each resource as a pandas DataFrame.
Parameters: *expressions : string, callable, or (string, string or callable) pair
One or more expressions giving the fields to select.
Each expression can be either a string expression, a callable, or a (string, string or callable) pair giving a label and an expression.
Labels give the column names in the result. Labels can be specified either by giving a (label, expression) pair, or by giving a string of the form “LABEL: EXPRESSION”, such as “upper_name: name.upper()”. Here “upper_name” is the label, and “name.upper()” is the expression that will be evaluated. If not specified, labels default to the text of the expression if expression is a string, and to an automatically generated label if expression is a callable.
Each expression will be passed to Resource.evaluate for each resource in the collection. See that method’s docs for details on expression evaluation.
if_error : string, one of “raise”, “skip”, or “none” [default: “raise”]
Must be specified as a keyword argument. Controls the behavior when evaluation of an expression raises an uncaught exception. One of:
- raise
Raise the exception to the caller. This is the default.
- skip
Skip resources where evaluation of any of the expressions raises an error. These resources will be omitted from the result.
- none
If evaluating an expression on a resource raises an exception, set that entry in the result to None.
Returns: A pandas.DataFrame. Rows correspond to resources. Columns correspond to the specified expressions.
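The three if_error modes listed above differ only in what happens when an expression fails for some resource. The sketch below mimics those semantics with plain dicts and lists of rows rather than Resource objects and a pandas DataFrame; select_rows and the path attribute are hypothetical names used for illustration.

```python
def select_rows(resources, expressions, if_error="raise"):
    """Evaluate each expression per resource, applying the if_error policy."""
    if if_error not in ("raise", "skip", "none"):
        raise ValueError("unknown if_error: %r" % (if_error,))
    rows = []
    for resource in resources:
        row = {}
        for expression in expressions:
            try:
                row[expression] = eval(expression, {}, dict(resource))
            except Exception:
                if if_error == "raise":
                    raise
                if if_error == "skip":
                    row = None  # drop this resource entirely
                    break
                row[expression] = None  # if_error == "none"
        if row is not None:
            rows.append(row)
    return rows


resources = [{"name": "a", "path": "/data/a.csv"}, {"name": "b"}]
expressions = ["name.upper()", "path"]
skipped = select_rows(resources, expressions, if_error="skip")
filled = select_rows(resources, expressions, if_error="none")
```

With “skip” the second resource disappears from the result, while with “none” it is kept and only the failing entry is set to None.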
-
singleton(raise_on_multiple=True)
If this ResourceCollection contains exactly 1 resource, return it. Otherwise, raise a ValueError.
-
summary
A string summarizing the resources in this collection, including their attributes.
-
tags
The tags associated with any resources in this collection.
-
to_plain_types()
Return a representation of this collection using Python dicts, lists, and strings.
-
write(file=None, format=None, indent=None)
Serialize this collection to a file, or to standard output if no file is given.
Parameters: file : string or file handle [optional, default: sys.stdout]
Path or file handle to write to.
format : string, one of “python” or “json” [optional]
Output format. If not specified, it is guessed from the filename extension.
indent : int [optional]
Number of spaces to use for indentation.
sefara.util module
A few utility functions and imports.
-
sefara.util.exec_in_directory(filename=None, code=None)
Execute Python code from either a file or passed as an argument. If a file is specified, the code will be executed with the current working directory set to the directory where the file resides.
If both filename and code are specified, then code is executed, but filename is used to set the current working directory and in error messages.
Parameters: filename : string [optional]
Path to file with Python code to execute.
code : string [optional]
Python code to execute
Returns: dict giving module-level attributes defined by the executed code
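The behavior described for exec_in_directory can be approximated with the standard library, as in the sketch below. This is an assumption about how such a helper might work, not the sefara source; exec_in_directory_sketch and the file name collection.py are hypothetical.

```python
import os
import tempfile


def exec_in_directory_sketch(filename=None, code=None):
    """Execute code with the working directory set to filename's directory."""
    if code is None:
        with open(filename) as handle:
            code = handle.read()
    previous = os.getcwd()
    try:
        if filename is not None:
            os.chdir(os.path.dirname(os.path.abspath(filename)))
        namespace = {}
        # filename is also passed to compile() so it appears in tracebacks.
        exec(compile(code, filename or "<string>", "exec"), namespace)
    finally:
        os.chdir(previous)  # always restore the caller's directory
    namespace.pop("__builtins__", None)
    return namespace


# Usage: the executed file records the directory it ran in.
with tempfile.TemporaryDirectory() as directory:
    script = os.path.join(directory, "collection.py")
    with open(script, "w") as handle:
        handle.write("import os\nran_in = os.getcwd()\n")
    before = os.getcwd()
    attributes = exec_in_directory_sketch(filename=script)
    after = os.getcwd()
    same_directory = os.path.realpath(attributes["ran_in"]) == os.path.realpath(directory)
```

Changing the working directory lets a collection file refer to data by paths relative to its own location; restoring it in a finally clause keeps the caller unaffected even if the executed code raises.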
Module contents
The package namespace exposes the following names. Their signatures and documentation are identical to the module-level entries above:
- sefara.load(filename, format=None, filters=None, transforms=None, environment_transforms=None): documented as sefara.loading.load.
- sefara.loads(data, filename=None, format=None, environment_transforms=True): documented as sefara.loading.loads.
- class sefara.ResourceCollection(resources, filename='<no file>'): documented as sefara.resource_collection.ResourceCollection.
- class sefara.Resource(name=None, **fields): documented as sefara.resource.Resource.
- sefara.export(*args, **kwargs): documented as sefara.exporting.export.
- sefara.export_resources(resources): documented as sefara.exporting.export_resources.
- sefara.transform_exports(path_or_callable): documented as sefara.exporting.transform_exports.