ARROW-4629: [Python] Pandas arrow conversion slowed down by imports #3706

fjetter · 2019-02-19T18:14:51Z

The local imports slow down the conversion from pandas to arrow significantly (see here)

kszucs · 2019-02-19T18:20:55Z

python/pyarrow/array.pxi

@@ -15,6 +15,7 @@
 # specific language governing permissions and limitations
 # under the License.

+import pyarrow.pandas_compat as pdcompat


Pandas is an optional dependency of pyarrow, that's why pandas is not imported here.

I changed it to an import iff pandas is available

xhochy · 2019-02-20T09:00:21Z

python/pyarrow/array.pxi

@@ -165,7 +170,6 @@ def array(object obj, type=None, mask=None, size=None, bint from_pandas=False,
                from_pandas=True, safe=safe,
                memory_pool=memory_pool)
        else:
-            import pyarrow.pandas_compat as pdcompat


@wesm I'm just realising that here is one of the potential problems users may have reported about pdcompat import problems. This path should be supported without pandas but currently isn't.

yes, I would agree with you. can you create a follow up issue about this? We should set up a docker-compose "no pandas" build to make sure that the project is usable without pandas

I went ahead and opened https://issues.apache.org/jira/browse/ARROW-4640

xhochy

@fjetter Can you expand this to also only trigger the line once a user has Pandas? I added the necessary changes as Suggested Edits.

python/pyarrow/array.pxi

Co-Authored-By: fjetter <[email protected]>

codecov-io · 2019-02-20T10:11:08Z

Codecov Report

Merging #3706 into master will decrease coverage by 21.31%.
The diff coverage is 57.14%.

@@             Coverage Diff             @@
##           master    #3706       +/-   ##
===========================================
- Coverage   87.79%   66.48%   -21.32%     
===========================================
  Files         688      323      -365     
  Lines       84280    48123    -36157     
  Branches     1081        0     -1081     
===========================================
- Hits        73994    31993    -42001     
- Misses      10175    16130     +5955     
+ Partials      111        0      -111

Impacted Files	Coverage Δ
python/pyarrow/pandas_compat.py	`96.3% <100%> (-0.04%)`	⬇️
python/pyarrow/array.pxi	`68.75% <50%> (-0.45%)`	⬇️
cpp/src/arrow/pretty_print.h	`0% <0%> (-100%)`	⬇️
cpp/src/parquet/murmur3.h	`0% <0%> (-100%)`	⬇️
cpp/src/arrow/testing/util.h	`0% <0%> (-100%)`	⬇️
cpp/src/arrow/io/hdfs-internal.h	`0% <0%> (-100%)`	⬇️
cpp/src/arrow/util/sse-util.h	`0% <0%> (-100%)`	⬇️
cpp/src/arrow/testing/random.h	`0% <0%> (-100%)`	⬇️
cpp/src/parquet/hasher.h	`0% <0%> (-100%)`	⬇️
cpp/src/arrow/util/int-util.cc	`0.39% <0%> (-99.21%)`	⬇️
... and 512 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e61bcf...eb5c8ba. Read the comment docs.

pitrou · 2019-02-20T12:25:28Z

I'm surprised we're importing Pandas inconditionally. We probably shouldn't do that, as Pandas is quite slow to import:

$ time python -c "import numpy"

real	0m0,097s
user	0m0,077s
sys	0m0,020s

$ time python -c "import pandas"

real	0m0,300s
user	0m0,272s
sys	0m0,028s

Here is a comparison of PyArrow import time with and without Pandas:

$ time python -c "import pyarrow"

real	0m0,360s
user	0m0,305s
sys	0m0,037s

$ time python -c "import sys; sys.modules['pandas'] = None; import pyarrow"

real	0m0,144s
user	0m0,124s
sys	0m0,020s

=> more than twice faster without.

xhochy · 2019-02-20T12:29:22Z

@pitrou Any ideas on how to avoid these local imports and also have the benefit of only loading pandas when needed?

pitrou · 2019-02-20T12:42:06Z

Imports are reasonably cheap once the module is already loaded, but it's probably better to avoid doing them in a tight loop. So hoisting the import outside of critical loops should be sufficient.

>>> %timeit import pandas                                                                                                                                         
102 ns ± 0.497 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit import pyarrow.pandas_compat as pdcompat                                                                                                              
253 ns ± 7.27 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

This is in pure Python, though. Cython does not seem to implement the same optimizations as CPython does:

>>> %timeit lib._noop_bench()                                                                                                                                     
66.3 ns ± 0.379 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit lib._import_bench()                                                                                                                                   
928 ns ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

(edit: issue opened for Cython at cython/cython#2854)

xhochy · 2019-02-20T13:35:50Z

@pitrou Are you happy with the change here and we will deal with the Pandas import issue separately or should this patch be adapted before merging?

pitrou · 2019-02-20T13:43:14Z

Yes, I've created ARROW-4637 for the pandas import.

The local imports slow down the conversion from pandas to arrow significantly (see [here](https://issues.apache.org/jira/browse/ARROW-4629)) Author: fjetter <[email protected]> Author: Uwe L. Korn <[email protected]> Closes apache#3706 from fjetter/local_imports and squashes the following commits: eb5c8ba <Uwe L. Korn> Apply suggestions from code review b4604be <fjetter> Only import pandas_compat if pandas is available f1c8b40 <fjetter> Don't use local imports

wesm · 2019-02-20T17:46:11Z

@xhochy @pitrou we could create a singleton object that exposes the pandas APIs that we need. The imports will be performed on creation of the singleton

Don't use local imports

f1c8b40

kszucs reviewed Feb 19, 2019

View reviewed changes

kszucs changed the title ~~[Python] ARROW-4629 - Pandas arrow conversion slowed down by imports~~ ARROW-4629: [Python] Pandas arrow conversion slowed down by imports Feb 19, 2019

Only import pandas_compat if pandas is available

b4604be

xhochy reviewed Feb 20, 2019

View reviewed changes

python/pyarrow/array.pxi Outdated Show resolved Hide resolved

python/pyarrow/array.pxi Outdated Show resolved Hide resolved

Apply suggestions from code review

eb5c8ba

Co-Authored-By: fjetter <[email protected]>

xhochy approved these changes Feb 20, 2019

View reviewed changes

xhochy closed this in 957fe15 Feb 20, 2019

fjetter deleted the local_imports branch February 20, 2019 14:22

jreback mentioned this pull request Feb 23, 2019

REF: use _constructor and ABCFoo to avoid runtime imports pandas-dev/pandas#25272

Merged

4 tasks

asfimport mentioned this pull request Feb 20, 2019

[Python] Pandas to arrow conversion slowed down by local imports #21165

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ARROW-4629: [Python] Pandas arrow conversion slowed down by imports #3706

ARROW-4629: [Python] Pandas arrow conversion slowed down by imports #3706

Uh oh!

fjetter commented Feb 19, 2019

Uh oh!

kszucs Feb 19, 2019 •

edited

Loading

Uh oh!

fjetter Feb 19, 2019

Uh oh!

xhochy Feb 20, 2019

Uh oh!

wesm Feb 20, 2019

Uh oh!

wesm Feb 20, 2019

Uh oh!

xhochy left a comment

Uh oh!

Uh oh!

Uh oh!

codecov-io commented Feb 20, 2019

Uh oh!

pitrou commented Feb 20, 2019

Uh oh!

xhochy commented Feb 20, 2019

Uh oh!

pitrou commented Feb 20, 2019 •

edited

Loading

Uh oh!

xhochy commented Feb 20, 2019 •

edited

Loading

Uh oh!

pitrou commented Feb 20, 2019

Uh oh!

wesm commented Feb 20, 2019

Uh oh!

Uh oh!

ARROW-4629: [Python] Pandas arrow conversion slowed down by imports #3706

ARROW-4629: [Python] Pandas arrow conversion slowed down by imports #3706

Uh oh!

Conversation

fjetter commented Feb 19, 2019

Uh oh!

kszucs Feb 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fjetter Feb 19, 2019

Choose a reason for hiding this comment

Uh oh!

xhochy Feb 20, 2019

Choose a reason for hiding this comment

Uh oh!

wesm Feb 20, 2019

Choose a reason for hiding this comment

Uh oh!

wesm Feb 20, 2019

Choose a reason for hiding this comment

Uh oh!

xhochy left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

codecov-io commented Feb 20, 2019

Codecov Report

Uh oh!

pitrou commented Feb 20, 2019

Uh oh!

xhochy commented Feb 20, 2019

Uh oh!

pitrou commented Feb 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xhochy commented Feb 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pitrou commented Feb 20, 2019

Uh oh!

wesm commented Feb 20, 2019

Uh oh!

Uh oh!

kszucs Feb 19, 2019 •

edited

Loading

pitrou commented Feb 20, 2019 •

edited

Loading

xhochy commented Feb 20, 2019 •

edited

Loading