Labels: Docs, Groupby, IO HDF5 (read_hdf, HDFStore), Performance (memory or execution speed)
Description
I would like to promote the idea of applying multiprocessing.Pool() to embarrassingly parallel tasks, e.g. applying a function to a large number of columns.
Obviously there is some overhead to set this up, so there will be a minimum number of columns below which it cannot beat the already fast Cython approach. I will add my performance tests to this issue later.
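To make the idea concrete, here is a minimal sketch of the kind of thing I have in mind: fan the per-column work out to a Pool and reassemble the result. The transform function and the array sizes are made up for illustration.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def transform(item):
    """Stand-in for an expensive per-column computation."""
    name, values = item
    return name, np.sqrt(values) * np.log1p(values)


if __name__ == "__main__":
    df = pd.DataFrame(np.random.rand(200_000, 64),
                      columns=["c%d" % i for i in range(64)])

    # Each task is (column name, ndarray); results come back in task order.
    with mp.Pool(processes=4) as pool:
        pairs = pool.map(transform, [(c, df[c].values) for c in df.columns])

    result = pd.DataFrame(dict(pairs), columns=df.columns)
```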
I was emailing @jreback about this and he added the following remarks/complications:
- transferring numpy data between processes is not that efficient, but still probably good enough
- you cannot transfer (pickle) lambda-type functions, so we may need to look at a library called dill, which solves this problem; alternatively msgpack could possibly be slightly modified to do this (and it is already pretty efficient at transferring other types of objects) - see the dill sketch after this list
- could also investigate joblib - I think statsmodels uses it [ed: that seems to be correct; I read about joblib in their group] - see the joblib sketch after this list
- I would create a new top level dir core/parallel for this type of stuff
- the strategy in this link could be a good way to follow (see the shared-memory sketch after this list): http://stackoverflow.com/questions/17785275/share-large-read-only-numpy-array-between-multiprocessing-processes
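On the pickling point, a quick illustration of why the stdlib pickle falls over on lambdas while dill round-trips them (the lambda itself is just a toy example):

```python
import pickle

import dill

f = lambda x: x + 1

try:
    pickle.dumps(f)            # stdlib pickle cannot serialize a lambda
except pickle.PicklingError as exc:
    print("pickle failed:", exc)

g = dill.loads(dill.dumps(f))  # dill serializes the function body itself
assert g(1) == 2
```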
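For completeness, the joblib route mentioned above looks roughly like this; the per-column function is again a made-up placeholder:

```python
import numpy as np
from joblib import Parallel, delayed


def col_stat(values):
    # Placeholder for an expensive per-column computation.
    return values.std()


data = np.random.rand(100_000, 32)

# n_jobs=-1 uses all cores; joblib manages the worker pool and pickling.
stats = Parallel(n_jobs=-1)(delayed(col_stat)(data[:, j])
                            for j in range(data.shape[1]))
```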
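Finally, a sketch of the shared-memory strategy from the Stack Overflow answer linked above: copy the big read-only array into a multiprocessing.RawArray once, and let each worker wrap it in a numpy view instead of receiving a pickled copy. All names here are illustrative.

```python
import multiprocessing as mp

import numpy as np

_shared = {}  # populated once per worker by the initializer


def _init_worker(raw, shape):
    # Wrap the shared buffer in an ndarray; no data is copied.
    _shared["arr"] = np.frombuffer(raw, dtype=np.float64).reshape(shape)


def col_sum(j):
    # Workers index into the shared array rather than receiving it.
    return _shared["arr"][:, j].sum()


if __name__ == "__main__":
    data = np.random.rand(500_000, 16)

    # One copy into shared memory up front; workers then read it for free.
    raw = mp.RawArray("d", data.size)
    np.frombuffer(raw, dtype=np.float64)[:] = data.ravel()

    with mp.Pool(4, initializer=_init_worker,
                 initargs=(raw, data.shape)) as pool:
        sums = pool.map(col_sum, range(data.shape[1]))
```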
Links:
http://stackoverflow.com/questions/13065172/multiprocess-python-numpy-code-for-processing-data-faster
http://docs.cython.org/src/userguide/parallelism.html
http://distarray.readthedocs.org/en/v0.5.0/