PyRDF’s supported backends

The parent backend class

class PyRDF.backend.Backend.Backend(config={})[source]

Base class for RDataFrame backends. Subclasses of this class need to implement the ‘execute’ method.

supported_operations

List of operations supported by the backend.

Type

list

initialization

Stores the user's initialization function, if defined.

Type

function

__init__(config={})[source]

Creates a new instance of the desired implementation of Backend.

Parameters

config (dict) – The config object for the required backend. The default value is an empty Python dictionary: {}.

check_supported(operation_name)[source]

Checks if a given operation is supported by this backend.

Parameters

operation_name (str) – Name of the operation to be checked.

Raises
  • Exception – Raised when operation_name does not exist in the supported_operations instance attribute.
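
A minimal sketch of the documented behaviour (not PyRDF's exact code):

def check_supported(self, operation_name):
    # Raise if the operation is not in the backend's list of
    # supported operations.
    if operation_name not in self.supported_operations:
        raise Exception(
            "The operation '{}' is not supported "
            "by this backend".format(operation_name))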

abstract execute(generator)[source]

Subclasses must define how to run the RDataFrame graph on a given environment.

classmethod register_initialization(fun, *args, **kwargs)[source]

Converts the initialization function and its arguments into a callable without arguments. This callable is saved on the backend parent class. Therefore, changing the backend at runtime does not require users to set the initialization function again.

Parameters
  • fun (function) – Function to be executed.

  • *args (list) – Variable length argument list used to execute the function.

  • **kwargs (dict) – Keyword arguments used to execute the function.
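
Example:

A minimal sketch of the mechanism, using functools.partial as a stand-in for how the arguments can be bound (not PyRDF's exact code; set_cache_size is a hypothetical user function):

import functools

class Backend(object):
    # Default: a no-op initialization.
    initialization = staticmethod(lambda: None)

    @classmethod
    def register_initialization(cls, fun, *args, **kwargs):
        # Bind the arguments now, so the stored attribute is a callable
        # that takes no arguments. Storing it on the parent class keeps
        # it valid when the runtime backend changes.
        cls.initialization = functools.partial(fun, *args, **kwargs)

# Backend.register_initialization(set_cache_size, 256)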

The local backend

class PyRDF.backend.Local.Local(config={})[source]

Backend that relies on the C++ implementation of RDataFrame to execute the current graph locally.

config

The config object for the Local backend.

Type

dict

__init__(config={})[source]

Creates a new instance of the Local implementation of Backend.

Parameters

config (dict, optional) – The config object for the required backend. The default value is an empty Python dictionary: {}.

execute(generator)[source]

Executes the current RDataFrame graph locally.

Parameters

generator (PyRDF.CallableGenerator) – An instance of CallableGenerator that is responsible for generating the callable function.
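
Example:

A minimal local run; the tree, file and branch names are placeholders:

import PyRDF

df = PyRDF.RDataFrame("myTree", "myFile.root")  # Local is the default backend
count = df.Filter("x > 0").Count()
print(count.GetValue())  # triggers the local event loop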

The distributed backend parent class

class PyRDF.backend.Dist.Dist(config={})[source]

Base class for implementing all distributed backends.

npartitions

The number of chunks to divide the dataset into; each chunk is then processed in parallel.

Type

int

supported_operations

List of supported RDataFrame operations in a distributed environment.

Type

list

friend_info

A class instance that holds information about any friend trees of the main ROOT.TTree.

Type

PyRDF.Dist.FriendInfo

abstract ProcessAndMerge(mapper, reducer)[source]

Subclasses must define how to run map-reduce functions on a given backend.

__init__(config={})[source]

Creates an instance of Dist.

Parameters

config (dict, optional) – The config options for the current distributed backend. The default value is an empty Python dictionary: {}.

build_ranges()[source]

Defines two types of ranges based on the arguments passed to the RDataFrame head node.
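
For illustration, the simpler case can be sketched as a balanced split of nentries entries into npartitions contiguous [start, end) ranges (build_balanced_ranges is a hypothetical helper, not PyRDF's exact code):

def build_balanced_ranges(nentries, npartitions):
    # Split [0, nentries) into npartitions contiguous ranges,
    # spreading the remainder over the first partitions.
    chunk, remainder = divmod(nentries, npartitions)
    ranges, start = [], 0
    for i in range(npartitions):
        end = start + chunk + (1 if i < remainder else 0)
        ranges.append((start, end))
        start = end
    return ranges

print(build_balanced_ranges(10, 3))  # [(0, 4), (4, 7), (7, 10)]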

abstract distribute_files(includes_list)[source]

Subclasses must define how to send all files needed for the analysis (like headers and libraries) to the workers.

execute(generator)[source]

Executes the current RDataFrame graph in the given distributed environment.

Parameters

generator (PyRDF.CallableGenerator) – An instance of CallableGenerator that is responsible for generating the callable function.

get_clusters(treename, filelist)[source]

Extracts a list of cluster boundaries for the given tree and files.

Parameters
  • treename (str) – Name of the TTree split into one or more files.

  • filelist (list) – List of one or more ROOT files.

Returns

List of tuples defining the cluster boundaries. Each tuple contains four elements: the first entry of the cluster, the last entry of the cluster, the offset of the cluster and the file the cluster belongs to.

Return type

list
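
Example:

Hypothetical usage showing the shape of the returned tuples (the backend instance, tree and file names are placeholders):

clusters = backend.get_clusters("myTree", ["file1.root", "file2.root"])
for start, end, offset, filename in clusters:
    print("entries [{}, {}] at offset {} in {}".format(
        start, end, offset, filename))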

class PyRDF.backend.Dist.FriendInfo(friend_names=[], friend_file_names=[])[source]

A simple class to hold information about friend trees.

friend_names

A list with the names of the ROOT.TTree objects which are friends of the main ROOT.TTree.

Type

list

friend_file_names

A list with the paths to the files corresponding to the trees in the friend_names attribute. Each element of friend_names can correspond to multiple file names.

Type

list

__bool__()[source]

Defines the behaviour of a FriendInfo instance when it is evaluated as a boolean. Both lists have to be non-empty in order to return True.

Returns

True if both lists are non-empty, False otherwise.

Return type

bool

__init__(friend_names=[], friend_file_names=[])[source]

Creates an instance of FriendInfo.

Parameters
  • friend_names (list) – A list containing the treenames of the friend trees.

  • friend_file_names (list) – A list containing the file names corresponding to a given treename in friend_names. Each treename can correspond to multiple file names.

__nonzero__()[source]

Python 2 dunder method for __bool__. Kept for compatibility.
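
Example:

Hypothetical usage; the nested list reflects the fact that a single friend tree may span several files:

from PyRDF.backend.Dist import FriendInfo

friends = FriendInfo(
    friend_names=["friendTree"],
    friend_file_names=[["friend_part1.root", "friend_part2.root"]],
)
print(bool(friends))       # True: both lists are non-empty
print(bool(FriendInfo()))  # False: both lists default to empty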

class PyRDF.backend.Dist.Range(start, end, filelist=None, friend_info=None)[source]

Base class to represent ranges.

A range represents a logical partition of the entries of a chain and is the basis for parallelization. The first entry of the range (start) is inclusive, while the second one (end) is not.

__init__(start, end, filelist=None, friend_info=None)[source]

Creates an instance of a Range.

Parameters
  • start (int) – First entry of the range.

  • end (int) – Last entry of the range, which is exclusive.

  • filelist (list, optional) – Files containing the entries of the range.

  • friend_info (PyRDF.backend.Dist.FriendInfo, optional) – Information about any friend trees of the dataset.

__repr__()[source]

Return a string representation of the range composition.
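
Example:

A hypothetical range covering entries 0 to 99 (start is inclusive, end is exclusive):

from PyRDF.backend.Dist import Range

r = Range(0, 100, filelist=["myFile.root"])
print(r)  # the exact text depends on __repr__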

The Spark distributed backend

class PyRDF.backend.Spark.Spark(config={})[source]

Backend that executes the computational graph using the Spark framework for distributed execution.

ProcessAndMerge(mapper, reducer)[source]

Performs map-reduce using the Spark framework.

Parameters
  • mapper (function) – A function that runs the computational graph and returns a list of values.

  • reducer (function) – A function that merges two lists that were returned by the mapper.

Returns

A list representing the values of action nodes returned after computation (Map-Reduce).

Return type

list

__init__(config={})[source]

Creates an instance of the Spark backend class.

Parameters

config (dict, optional) – The config options for the Spark backend. The default value is an empty Python dictionary: {}. config should be a dictionary of Spark configuration options and their values, with npartitions as the only allowed extra parameter.

Example:

config = {
    'npartitions': 20,
    'spark.master': 'myMasterURL',
    'spark.executor.instances': 10,
    'spark.app.name': 'mySparkAppName'
}

Note

If a SparkContext is already set in the current environment, the Spark configuration parameters from config will be ignored and the already existing SparkContext will be used.
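
Example:

A typical way to select the Spark backend through PyRDF.use; the tree and file names are placeholders:

import PyRDF

PyRDF.use("spark", {'npartitions': 20, 'spark.app.name': 'mySparkAppName'})
df = PyRDF.RDataFrame("myTree", "myFile.root")
print(df.Count().GetValue())  # runs the computation on the Spark cluster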

distribute_files(includes_list)[source]

Spark supports sending files to the executors via the SparkContext.addFile method. This method takes as input the path to the file (relative to the path of the current Python session). The file is initially added to the Spark driver and then sent to the workers when they are initialized.

Parameters

includes_list (list) – A list consisting of all necessary C++ files as strings, created by one of the include functions of the PyRDF API.
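
A sketch of the mechanism described above; the file names are placeholders:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
for filepath in ["myHeader.h", "myHelpers.h"]:
    # Ship each file to the driver, from which it reaches the workers.
    sc.addFile(filepath)

# On a worker, the local copy can later be located with:
#   from pyspark import SparkFiles
#   local_path = SparkFiles.get("myHeader.h")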

PyRDF’s utility functions

class PyRDF.backend.Utils.Utils[source]

Class that houses general utility functions.

classmethod declare_headers(headers_to_include)[source]

Declares all required headers using ROOT's C++ interpreter.

Parameters

headers_to_include (list) – This list should consist of all necessary C++ headers as strings.
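
A sketch of the underlying mechanism, assuming the headers are declared through TInterpreter::Declare (the header name is a placeholder):

import ROOT

for header in ["myHelpers.h"]:
    ROOT.gInterpreter.Declare('#include "{}"'.format(header))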

classmethod declare_shared_libraries(libraries_to_include)[source]

Declares all required shared libraries using ROOT's C++ interpreter.

Parameters

libraries_to_include (list) – This list should consist of all necessary C++ shared libraries as strings.
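
A sketch under the assumption that the libraries are loaded through gSystem.Load (the library name is a placeholder):

import ROOT

for library in ["libMyAnalysis.so"]:
    ROOT.gSystem.Load(library)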

classmethod extend_include_path(include_path)[source]

Extends the list of paths in which ROOT looks for headers and libraries. Every header directory is added to ROOT's internal include path so that the interpreter can find the files. Even if the same path is added twice, ROOT keeps a collection of unique paths. Find out more at TInterpreter: https://root.cern.ch/doc/master/classTInterpreter.html

Parameters

include_path (str) – The path to the directory containing files needed for the analysis.
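
A sketch of the effect, assuming TInterpreter::AddIncludePath is used (the path is a placeholder):

import ROOT

ROOT.gInterpreter.AddIncludePath("/path/to/my/headers")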