Configuring pliers

Pliers contains a number of package-wide options that can be configured via the pliers.config module. These include:

Setting                     Type  Default     Description
cache_transformers          bool  True        Whether or not to cache Transformer outputs in memory
default_converters          dict  see module  See explanation in the Converters section
drop_bad_extractor_results  bool  True        When True, automatically removes any None values returned by any Extractor
log_transformations         bool  True        Whether or not to log transformation details in each Stim’s .history attribute
n_jobs                      int   None        Number of simultaneous jobs to execute (if parallelize is True); None means number of CPU cores minus 1
parallelize                 bool  False       Whether or not to use naive parallelization by default
progress_bar                bool  True        Whether or not to display progress bars when looping over Stims
use_generators              bool  False       Whether Transformers should return generators rather than lists when iterating over Stims

Setting options

Package-wide options can be changed either at initialization or at run-time.

At initialization

By default, when pliers is first imported, it will look in three places for configuration files that override the package defaults. In order of precedence, these are:

  1. A pliers_config.json file in the current (working) directory.
  2. A filename set in the PLIERS_CONFIG environment variable.
  3. A pliers_config.json file located in the user’s home directory.
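The lookup order above can be sketched with the standard library alone. This is a minimal illustration, not pliers' own loader: find_config_file is a hypothetical helper, and it assumes that the first file found wins (whether pliers merges several files is not stated here).

```python
from pathlib import Path
from typing import Mapping, Optional


def find_config_file(cwd: Path, environ: Mapping[str, str], home: Path) -> Optional[Path]:
    """Return the first existing config file, in the order listed above.

    Hypothetical helper for illustration; not part of the pliers API.
    """
    # 1. pliers_config.json in the working directory
    candidates = [cwd / "pliers_config.json"]
    # 2. the file named by the PLIERS_CONFIG environment variable, if set
    env_path = environ.get("PLIERS_CONFIG")
    if env_path:
        candidates.append(Path(env_path))
    # 3. pliers_config.json in the user's home directory
    candidates.append(home / "pliers_config.json")

    for path in candidates:
        if path.is_file():
            return path
    return None  # nothing found: package defaults apply
```

Calling find_config_file(Path.cwd(), os.environ, Path.home()) (after importing os) would report which file, if any, takes effect in the current environment under these assumptions.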

In all cases, the file must be a standard .json file containing only valid option names as keys. Default package values will continue to be used for any options not explicitly specified in the file. For example:

{
        "parallelize": true,
        "n_jobs": 4
}

If the above is placed in a pliers_config.json file in one’s home directory, pliers will execute all iterable transformations in parallel (with 4 jobs).
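One way to avoid hand-written JSON mistakes (such as Python-style True instead of JSON's lowercase true) is to generate the file contents with the json module. The sketch below also shows how overrides might be layered on top of defaults. The DEFAULTS dict is an illustrative subset rather than something read from pliers, merge_config is a hypothetical helper, and rejecting unknown keys with a ValueError is an assumption about how invalid option names could be handled.

```python
import json

# Illustrative subset of defaults; not read from pliers itself.
DEFAULTS = {
    "n_jobs": None,
    "parallelize": False,
    "progress_bar": True,
    "use_generators": False,
}


def merge_config(json_text: str, defaults: dict = DEFAULTS) -> dict:
    """Overlay options parsed from JSON text onto the defaults (hypothetical helper)."""
    overrides = json.loads(json_text)
    unknown = set(overrides) - set(defaults)
    if unknown:
        # Assumption: option names outside the known set are treated as errors.
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")
    # Options not mentioned in the file keep their default values.
    return {**defaults, **overrides}


# json.dumps serializes Python's True as JSON's lowercase true.
config_text = json.dumps({"parallelize": True, "n_jobs": 4}, indent=4)
settings = merge_config(config_text)
```

Writing config_text to a pliers_config.json file produces contents equivalent to the example above.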

At run-time

Package options can also be changed dynamically, via the .get_option() and .set_option() (or, for multiple options, .set_options()) accessors:

>>> import pliers as pl
>>> pl.get_option('use_generators')
False
>>> pl.set_option('use_generators', True)
# Or...
>>> pl.set_options(use_generators=True, progress_bar=False)

Option details

cache_transformers (bool)

When set to True, the output produced by all .transform() calls will be cached in memory (filesystem caching is not currently available). This is the default, and can be very useful in cases where (a) many calls to commercial feature extraction services (e.g., the Google or IBM families of Extractors) are being made, or (b) there are intermediate Stim representations generated by Converter classes that are computationally expensive to produce. Setting cache_transformers to False will result in every .transform() call being recomputed, with no intermediates stored in memory.

Note that caching in pliers (really, memoization) is based on the combination of the Transformer class, its initialization parameters, and the id of the input Stim. If any of these changes, results will be computed anew. So, for example, creating two separate instances of the ClarifaiAPIImageExtractor, each with different model arguments, will result in two separate calls being made to the Clarifai API even if the exact same Stim inputs are passed. (However, different instances of the same ClarifaiAPIImageExtractor initialized using the same arguments will still point to the same entry in the cache.)
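The keying scheme described above can be illustrated with a toy memoizer. This is a sketch of the idea only, not pliers' implementation: ToyExtractor, ToyStim and cached_transform are hypothetical names, and "the id of the input Stim" is taken here to mean Python's built-in id().

```python
class ToyStim:
    """Hypothetical stand-in for a Stim."""


class ToyExtractor:
    """Hypothetical stand-in for a Transformer; counts how often real work happens."""

    calls = 0

    def __init__(self, model="general"):
        self.model = model

    def _compute(self, stim):
        ToyExtractor.calls += 1
        return f"{self.model} features"


_cache = {}


def cached_transform(transformer, stim):
    """Memoize on (class name, initialization parameters, id of the input)."""
    # Assumes all initialization parameters are hashable instance attributes.
    key = (
        type(transformer).__name__,
        tuple(sorted(vars(transformer).items())),
        id(stim),
    )
    if key not in _cache:
        _cache[key] = transformer._compute(stim)
    return _cache[key]


stim = ToyStim()
a = cached_transform(ToyExtractor(model="general"), stim)  # computed
b = cached_transform(ToyExtractor(model="general"), stim)  # same class and args: cache hit
c = cached_transform(ToyExtractor(model="food"), stim)     # different args: computed again
```

Two distinct instances with identical arguments share one cache entry, while changing the model argument forces a second computation, matching the behavior described for the Clarifai example.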

default_converters (dict)

This option specifies what Converter classes to use for implicit conversion between Stim types (i.e., in cases where the code does not explicitly specify every conversion step). The format for this setting is a bit more involved; for details, see Package-wide conversion defaults.

drop_bad_extractor_results (bool)

In certain conditions, .transform() calls may return None values. Typically this happens either because of an unexpected internal failure (e.g., a timeout occurs in an API-based Extractor), or because None is the expected behavior for a Transformer given certain inputs. Either way, such values can wreak havoc on downstream transformations, because None is not a valid input to any .transform() call in pliers.

To avoid having entire workflows fail unpredictably as soon as any single Transformer/Stim combination returns a None value, pliers will, by default, drop bad values as it encounters them. While this is usually desired, in cases where failures (or other causes of an invalid return value) are important to identify, we can disable this sanitization process by setting the drop_bad_extractor_results option to False. Note that this will typically result in an Exception being raised the first time a bad value is encountered.
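The effect of this option can be mimicked in plain Python. The sketch below is an illustration of the filtering behavior only; apply_all, flaky_length and double are hypothetical toy functions, not pliers code.

```python
def apply_all(func, stims, drop_bad_results=True):
    """Apply func to each input; optionally filter out None results."""
    results = [func(s) for s in stims]
    if drop_bad_results:
        results = [r for r in results if r is not None]
    return results


def flaky_length(text):
    """Toy extractor: returns None for empty input to mimic a failed call."""
    return len(text) if text else None


def double(value):
    """Toy downstream step that cannot handle None."""
    return value * 2


inputs = ["cat", "", "horse"]

# With filtering on, the bad value is dropped and the downstream step succeeds.
cleaned = [double(r) for r in apply_all(flaky_length, inputs)]

# With filtering off, the None reaches the downstream step and raises.
try:
    [double(r) for r in apply_all(flaky_length, inputs, drop_bad_results=False)]
    failed = False
except TypeError:
    failed = True
```

The second call surfaces the failure immediately, which is the trade-off described above: less convenient, but easier to diagnose.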

log_transformations (bool)

By default, pliers logs every transformation applied to a Stim object in the Stim’s .history attribute. While this is usually desirable, in contexts where hundreds of thousands or even millions of Stim objects are being processed, the aggregate memory footprint of all of these logs may be non-trivial. We can disable transformation logging at any time by setting the log_transformations option to False.

parallelize (bool), n_jobs (int)

By default, pliers executes all transformations serially, even in cases where an iterable of Stims is passed in (so that transformation is, in principle, embarrassingly parallel). However, pliers also supports rudimentary parallelization of transformations via the pathos package. If the parallelize option is set to True, any Transformer passed an iterable of Stims as input will apply its transformations to the elements of the iterable in parallel.

The n_jobs option specifies how many workers to launch. The default value of None will be interpreted as the number of CPU cores minus 1. Note that n_jobs will be ignored unless parallelize is set to True.
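How these two options interact can be sketched as follows. This is not pliers' implementation: it swaps in the standard library's ThreadPoolExecutor in place of pathos so that it runs without extra dependencies, resolve_n_jobs and transform_all are hypothetical names, and the floor of one worker on single-core machines is an assumption.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_n_jobs(n_jobs=None):
    """Interpret None as (number of CPU cores - 1), with an assumed floor of 1."""
    if n_jobs is not None:
        return n_jobs
    return max(1, (os.cpu_count() or 1) - 1)


def transform_all(func, stims, parallelize=False, n_jobs=None):
    """Apply func to every input, serially or with a pool of workers."""
    if not parallelize:
        # Serial path: n_jobs is ignored entirely.
        return [func(s) for s in stims]
    with ThreadPoolExecutor(max_workers=resolve_n_jobs(n_jobs)) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(func, stims))


serial = transform_all(str.upper, ["a", "b", "c"], n_jobs=8)            # n_jobs has no effect
parallel = transform_all(str.upper, ["a", "b", "c"], parallelize=True)  # default worker count
```

Both calls produce the same results; only the execution strategy differs.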

progress_bar (bool)

By default, pliers shows a progress bar (using tqdm) when transforming iterable inputs (e.g., lists of Stims). To disable this behavior, set progress_bar to False.

use_generators (bool)

Internally, pliers uses generators whenever iteration over Stims occurs, in order to (potentially) reduce its memory footprint. However, generators can be confusing to users new to Python. To minimize confusion, pliers therefore converts all generators to lists before returning results to the user (or passing them as inputs to the next Transformer in a Graph). More experienced users who are comfortable with generator expressions and want to take advantage of their potential memory-saving benefits can enable generators by setting use_generators to True. (Note that it is not a foregone conclusion that enabling generators will reduce memory consumption; if caching is enabled and/or the number of intermediate conversions is large, using generators is unlikely to help much.)
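The difference between the two return styles is easy to see in a toy version. The sketch below is illustrative only; transform_each is a hypothetical function, not part of pliers.

```python
from types import GeneratorType


def transform_each(func, stims, use_generators=False):
    """Apply func lazily, materializing a list unless generators are requested."""
    results = (func(s) for s in stims)  # nothing is computed yet
    return results if use_generators else list(results)


eager = transform_each(len, ["a", "bb"])                      # a list, fully computed
lazy = transform_each(len, ["a", "bb"], use_generators=True)  # a generator, computed on demand

first_pass = list(lazy)   # consumes the generator
second_pass = list(lazy)  # already exhausted, so this is empty
```

The empty second pass is the kind of behavior that can surprise users unfamiliar with generators: a generator can only be iterated once, whereas a list can be reused freely.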