Using dask distributed for single-machine parallel computing

This example shows the simplest usage of the dask distributed backend, on the local computer.

This is useful for prototyping a solution that will later run on a truly distributed cluster, as the only change required is the address of the scheduler.

Another realistic usage scenario: combining dask code with joblib code, for instance using dask for preprocessing data and scikit-learn for machine learning. In such a setting, it may be useful to use distributed as the backend scheduler for both dask and joblib, so that a single scheduler orchestrates the whole computation.
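A minimal sketch of that mixed workflow, without a cluster: dask.array expresses a chunked preprocessing step (standardizing columns), and joblib then parallelizes per-column work, with plain NumPy standing in for scikit-learn estimators. The array shapes and the preprocessing step are illustrative, not taken from the example above; once a distributed Client is created, it also becomes dask's default scheduler, which is the orchestration point mentioned here.

```python
import numpy as np
import dask.array as da
import joblib

# Illustrative preprocessing with dask: standardize the columns of a
# chunked random array (shapes and chunking are arbitrary choices).
x = da.random.random((1000, 4), chunks=(250, 4))
x_std = (x - x.mean(axis=0)) / x.std(axis=0)
features = x_std.compute()  # runs on dask's default scheduler

# Downstream, joblib parallelizes per-column work; np.mean stands in
# for a real per-column scikit-learn fit.
col_means = joblib.Parallel(n_jobs=2)(
    joblib.delayed(np.mean)(features[:, j])
    for j in range(features.shape[1]))
```

After standardization, each column mean is (numerically) zero, which the joblib step recovers in parallel.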

Set up the distributed client

from distributed import Client
# Typically, to execute on a remote machine, the address of the scheduler
# would go here
client = Client()

# Recover the address
address = client.scheduler_info()['address']

# This import registers the dask.distributed backend for joblib
import distributed.joblib  # noqa
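As noted above, moving from this local prototype to a real cluster only changes how the Client is constructed; everything else stays the same. A hypothetical sketch (the address below is illustrative, drawn from the documentation-reserved 192.0.2.0/24 range, and the connecting line is left commented since no such cluster exists here):

```python
# Hypothetical remote scheduler address; the rest of the example is unchanged.
scheduler_address = 'tcp://192.0.2.10:8786'
# client = Client(scheduler_address)  # uncomment when a real cluster is available
```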

Run parallel computation using dask.distributed

import time
import joblib


def long_running_function(i):
    time.sleep(.1)
    return i

The verbose messages below show that the backend is indeed the dask.distributed one

with joblib.parallel_backend('dask.distributed', scheduler_host=address):
    joblib.Parallel(n_jobs=2, verbose=100)(
        joblib.delayed(long_running_function)(i)
        for i in range(10))

Out:

[Parallel(n_jobs=2)]: Using backend DaskDistributedBackend with 4 concurrent workers.
[Parallel(n_jobs=2)]: Done   1 tasks      | elapsed:    0.3s
[Parallel(n_jobs=2)]: Done   2 tasks      | elapsed:    0.3s
[Parallel(n_jobs=2)]: Done   3 tasks      | elapsed:    0.3s
[Parallel(n_jobs=2)]: Done   4 out of  10 | elapsed:    0.4s remaining:    0.6s
[Parallel(n_jobs=2)]: Done   5 out of  10 | elapsed:    0.4s remaining:    0.4s
[Parallel(n_jobs=2)]: Done   6 out of  10 | elapsed:    0.4s remaining:    0.3s
[Parallel(n_jobs=2)]: Done   7 out of  10 | elapsed:    0.4s remaining:    0.2s
[Parallel(n_jobs=2)]: Done   8 out of  10 | elapsed:    0.5s remaining:    0.1s
[Parallel(n_jobs=2)]: Done  10 out of  10 | elapsed:    0.5s remaining:    0.0s
[Parallel(n_jobs=2)]: Done  10 out of  10 | elapsed:    0.5s finished

Progress of the computation can be followed on the distributed web interface; see http://distributed.readthedocs.io/en/latest/web.html

Total running time of the script: ( 0 minutes 1.152 seconds)
