
Parallelization is crucial to scientific computing, especially for large cases involving millions or even trillions of elements. OpFlow provides two levels of parallelization: multithreading and multiprocessing. Currently, they are implemented with OpenMP and MPI, respectively. This section shows how to set up the parallel configuration for a program and introduces some utilities for workload balancing.


To enable multiprocessing parallelization, OPFLOW_WITH_MPI must be specified during the cmake configuration phase. In addition, the OPFLOW_DISTRIBUTE_MODEL_MPI macro must be defined before including OpFlow. (Hint: this can be set with the cmake command target_compile_definitions(<proj_name> PRIVATE OPFLOW_DISTRIBUTE_MODEL_MPI).)

Environment setup

Currently, OpFlow shares one parallel configuration throughout the program. Therefore, the environment must be configured globally before declaring any OpFlow objects:

int main(int argc, char* argv[]) {
    // pass the command line args to the initializer
    InitEnvironment(argc, argv);

    // ...
    // call the finalizer at the end of program
    FinalizeEnvironment();
    return 0;
}

To configure the number & type of workers, a ParallelPlan needs to be generated and registered:

// collect the available parallel resources on the current system
auto info = makeParallelInfo();
// you can modify the info to use specified number of processors
// register the global parallel info
setGlobalParallelInfo(info);
// construct a parallel plan & register to the environment
setGlobalParallelPlan(makeParallelPlan(getGlobalParallelInfo(), ParallelIdentifier::DistributeMem));

makeParallelPlan takes a ParallelInfo object and a bit flag describing the parallel mode to be used. Possible masks are: None, DistributeMem, SharedMem and Heterogeneous. The constructed ParallelPlan object records the number of workers for each level of parallelism. It can be accessed globally via getGlobalParallelPlan.
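To illustrate how a bit-flag mode of this kind is typically combined and queried, here is a minimal standalone sketch. It does not use OpFlow itself: the enum ParallelFlag, its numeric values, and the helpers hasFlag and hybridMode are hypothetical and only mirror the mask names listed above. Whether OpFlow's ParallelIdentifier masks are meant to be combined this way is an assumption.

```cpp
#include <cstdint>

namespace sketch {
// Hypothetical flag values; OpFlow's actual ParallelIdentifier may differ.
enum ParallelFlag : std::uint32_t {
    None          = 0,
    DistributeMem = 1u << 0,
    SharedMem     = 1u << 1,
    Heterogeneous = 1u << 2
};

// Returns true if `flag` is a non-empty mask and all of its bits are set in `mode`.
constexpr bool hasFlag(std::uint32_t mode, ParallelFlag flag) {
    return flag != None && (mode & flag) == flag;
}

// Combine distributed- and shared-memory parallelism into a single mode value.
constexpr std::uint32_t hybridMode() {
    return DistributeMem | SharedMem;
}
}  // namespace sketch
```

With this encoding, a plan builder could test each level independently, e.g. hasFlag(mode, SharedMem) to decide whether to spawn threads inside each process.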

Range based loops

Although OpFlow is designed to let users write loop-free code as much as possible, there are cases where a simple loop does the job, e.g., printing data to the console or interacting with non-OpFlow data structures. For these cases, two range based loops, rangeFor and rangeFor_s, are provided. Both do the same work, except that rangeFor_s runs in serial mode while rangeFor tries to use multithreading as much as possible. For example,

CartesianField<Real, Mesh> f;
// print f's value to the console
rangeFor_s(f.accessibleRange, [&](auto&& i) {
    OP_INFO("f[{}] = {}", i.toString(), f[i]);
});

// copy the data of f to a C array
Real buffer[nx][ny];
rangeFor(f.accessibleRange, [&](auto&& i) {
    buffer[i[0]][i[1]] = f[i];
});

Both take a range and a functor as arguments. The functor receives an index whose type corresponds to the range's type and which indicates the current position of the loop.
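The range-plus-functor pattern can be sketched without OpFlow as follows. Range2D, rangeForSerial and toNested are hypothetical names introduced only for this illustration, and the traversal order shown is an assumption rather than OpFlow's documented behavior.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {
// Hypothetical 2D half-open index range [start, end); not OpFlow's range type.
struct Range2D {
    std::array<int, 2> start;
    std::array<int, 2> end;
};

// Serial traversal: invoke `f` once for every index in the range.
template <typename F>
void rangeForSerial(const Range2D& r, F&& f) {
    assert(r.start[0] <= r.end[0] && r.start[1] <= r.end[1]);
    for (int j = r.start[1]; j < r.end[1]; ++j)
        for (int i = r.start[0]; i < r.end[0]; ++i)
            f(std::array<int, 2>{i, j});
}

// Copy a flat row-major field of size nx * ny into a nested vector.
inline std::vector<std::vector<double>> toNested(const std::vector<double>& flat,
                                                 int nx, int ny) {
    assert(nx >= 0 && ny >= 0);
    assert(flat.size() == static_cast<std::size_t>(nx) * static_cast<std::size_t>(ny));
    std::vector<std::vector<double>> out(nx, std::vector<double>(ny));
    rangeForSerial(Range2D{{0, 0}, {nx, ny}}, [&](const std::array<int, 2>& idx) {
        out[idx[0]][idx[1]] = flat[idx[0] * ny + idx[1]];
    });
    return out;
}
}  // namespace sketch
```

In a multithreaded variant the functor may be invoked concurrently, so it should only write to locations that no other iteration touches, as the copy above does. Order-dependent work such as printing is better suited to the serial form.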

SplitStrategy & Distributed data maintenance

The first thing to do when using distributed memory parallelization is to divide the total work among the nodes as evenly as possible. To automate this process, OpFlow introduces the concept of a Split Strategy. For example, you can create an EvenSplitStrategy object

std::shared_ptr<AbstractSplitStrategy<Field>> strategy
        = std::make_shared<EvenSplitStrategy<Field>>();

and pass it to the builder while building the fields

auto u = builder.setPadding(1).setSplitStrategy(strategy).build();

The builder will automatically calculate the split and store the split range into u’s localRange. The setPadding method is used to reserve halo zones for internal data communication.
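As a rough picture of what an even split with padding involves, the following standalone sketch divides a 1D index range among a number of workers and then widens a local range by its halo. Range1D, evenSplit and withPadding are hypothetical names for this illustration. OpFlow's EvenSplitStrategy works on multidimensional fields and may use a different rule, and clamping the halo at the global boundary is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

namespace sketch {
// Half-open 1D index range [start, end).
struct Range1D {
    int start;
    int end;
};

// Split `total` into `parts` contiguous ranges whose sizes differ by at most one.
inline std::vector<Range1D> evenSplit(Range1D total, int parts) {
    assert(parts > 0 && total.start <= total.end);
    const int n = total.end - total.start;
    const int base = n / parts;   // minimum number of cells per part
    const int extra = n % parts;  // the first `extra` parts get one more cell
    std::vector<Range1D> out;
    out.reserve(parts);
    int begin = total.start;
    for (int p = 0; p < parts; ++p) {
        const int len = base + (p < extra ? 1 : 0);
        out.push_back({begin, begin + len});
        begin += len;
    }
    return out;
}

// Extend a local range by `padding` halo cells on each side, clamped to `global`.
inline Range1D withPadding(Range1D local, Range1D global, int padding) {
    assert(padding >= 0);
    return {std::max(global.start, local.start - padding),
            std::min(global.end, local.end + padding)};
}
}  // namespace sketch
```

For example, splitting [0, 10) three ways gives [0, 4), [4, 7) and [7, 10). With a padding of 1, the middle part also needs cells 3 and 7 from its neighbors, which is the data the halo zones hold.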


Due to the use-after-declaration constraint imposed by the embedded language, OpFlow currently cannot deduce the required padding width from the algorithms composed later. It's the user's responsibility to set an appropriate padding width.

Typically, no other changes are needed for parallelization. All fields are automatically updated after assignment. If you make local changes, remember to invoke updatePadding() after your modification:

f[DS::MDIndex<2>{10, 10}] = 1.;
// call updatePadding to exchange the padding zones
f.updatePadding();
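To show why the padding zones go stale after a local write, here is a minimal in-process sketch of a halo exchange between two neighboring 1D subdomains. Block and exchangeHalo are hypothetical names. In a real distributed run the copies would be messages between processes, and this sketch makes no claim about how OpFlow's updatePadding() is implemented.

```cpp
#include <cassert>
#include <vector>

namespace sketch {
// A 1D subdomain with `pad` halo cells on each side of its interior cells.
// Storage layout: [left halo | interior | right halo].
struct Block {
    int pad;
    std::vector<double> data;

    Block(int interiorCells, int padding, double value)
        : pad(padding), data(interiorCells + 2 * padding, value) {
        assert(padding >= 0 && interiorCells >= padding);
    }

    int interior() const { return static_cast<int>(data.size()) - 2 * pad; }
};

// Refresh the halo cells on the boundary shared by `left` and `right`.
inline void exchangeHalo(Block& left, Block& right) {
    assert(left.pad == right.pad);
    const int p = left.pad;
    const int n = left.interior();
    for (int k = 0; k < p; ++k) {
        // last p interior cells of `left` -> left halo of `right`
        right.data[k] = left.data[n + k];
        // first p interior cells of `right` -> right halo of `left`
        left.data[p + n + k] = right.data[p + k];
    }
}
}  // namespace sketch
```

If one block's interior cell next to the boundary is modified, the neighbor's halo copy of that cell keeps the old value until an exchange like this runs.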


You can now check out the Heat transfer example and compare the performance of different parallelization configurations.