Labels: Docs, Groupby, IO HDF5 (read_hdf, HDFStore), Performance (memory or execution speed)
Description
I would like to promote the idea of applying multiprocessing.Pool() to embarrassingly parallel tasks, e.g. applying a function to a large number of columns.
Obviously there is some overhead to set this up, so there will be a minimum number of columns below which it cannot beat the already fast Cython approach. I will add my performance tests to this issue later.
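To make the idea concrete, here is a minimal sketch of the kind of thing I have in mind: fan the per-column work out to a Pool and reassemble the result. The transform function and the array sizes are made up for illustration.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def transform(item):
    """Stand-in for an expensive per-column computation."""
    name, values = item
    return name, np.sqrt(values) * np.log1p(values)


if __name__ == "__main__":
    df = pd.DataFrame(np.random.rand(200_000, 64),
                      columns=["c%d" % i for i in range(64)])

    # Each task is (column name, ndarray); results come back in task order.
    with mp.Pool(processes=4) as pool:
        pairs = pool.map(transform, [(c, df[c].values) for c in df.columns])

    result = pd.DataFrame(dict(pairs), columns=df.columns)
```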
I was emailing @jreback about this and he added the following remarks/complications:
- transferring numpy data between processes is not that efficient, but still probably good enough
- you cannot transfer (pickle) lambda-type functions, so we may need to look at a library called dill, which solves this problem; alternatively msgpack could possibly be slightly modified to do this (and it is already pretty efficient at transferring other types of objects) - see the dill sketch after this list
- could also investigate joblib - I think statsmodels uses it [ed: that seems to be correct; I read about joblib in their group] - see the joblib sketch after this list
- I would create a new top level dir core/parallel for this type of stuff
- the strategy in this link could be a good way to follow (see the shared-memory sketch after this list): http://stackoverflow.com/questions/17785275/share-large-read-only-numpy-array-between-multiprocessing-processes
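On the pickling point, a quick illustration of why the stdlib pickle falls over on lambdas while dill round-trips them (the lambda itself is just a toy example):

```python
import pickle

import dill

f = lambda x: x + 1

try:
    pickle.dumps(f)            # stdlib pickle cannot serialize a lambda
except pickle.PicklingError as exc:
    print("pickle failed:", exc)

g = dill.loads(dill.dumps(f))  # dill serializes the function body itself
assert g(1) == 2
```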
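For completeness, the joblib route mentioned above looks roughly like this; the per-column function is again a made-up placeholder:

```python
import numpy as np
from joblib import Parallel, delayed


def col_stat(values):
    # Placeholder for an expensive per-column computation.
    return values.std()


data = np.random.rand(100_000, 32)

# n_jobs=-1 uses all cores; joblib manages the worker pool and pickling.
stats = Parallel(n_jobs=-1)(delayed(col_stat)(data[:, j])
                            for j in range(data.shape[1]))
```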
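Finally, a sketch of the shared-memory strategy from the Stack Overflow answer linked above: copy the big read-only array into a multiprocessing.RawArray once, and let each worker wrap it in a numpy view instead of receiving a pickled copy. All names here are illustrative.

```python
import multiprocessing as mp

import numpy as np

_shared = {}  # populated once per worker by the initializer


def _init_worker(raw, shape):
    # Wrap the shared buffer in an ndarray; no data is copied.
    _shared["arr"] = np.frombuffer(raw, dtype=np.float64).reshape(shape)


def col_sum(j):
    # Workers index into the shared array rather than receiving it.
    return _shared["arr"][:, j].sum()


if __name__ == "__main__":
    data = np.random.rand(500_000, 16)

    # One copy into shared memory up front; workers then read it for free.
    raw = mp.RawArray("d", data.size)
    np.frombuffer(raw, dtype=np.float64)[:] = data.ravel()

    with mp.Pool(4, initializer=_init_worker,
                 initargs=(raw, data.shape)) as pool:
        sums = pool.map(col_sum, range(data.shape[1]))
```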
Links:
http://stackoverflow.com/questions/13065172/multiprocess-python-numpy-code-for-processing-data-faster
http://docs.cython.org/src/userguide/parallelism.html
http://distarray.readthedocs.org/en/v0.5.0/