Skip to content

ARROW-4629: [Python] Pandas arrow conversion slowed down by imports #3706

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

fjetter
Copy link
Contributor

@fjetter fjetter commented Feb 19, 2019

The local imports slow down the conversion from pandas to arrow significantly (see here)

@@ -15,6 +15,7 @@
# specific language governing permissions and limitations
# under the License.

import pyarrow.pandas_compat as pdcompat
Copy link
Member

@kszucs kszucs Feb 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pandas is an optional dependency of pyarrow, that's why pandas is not imported here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to an import iff pandas is available

@kszucs kszucs changed the title [Python] ARROW-4629 - Pandas arrow conversion slowed down by imports ARROW-4629: [Python] Pandas arrow conversion slowed down by imports Feb 19, 2019
@@ -165,7 +170,6 @@ def array(object obj, type=None, mask=None, size=None, bint from_pandas=False,
from_pandas=True, safe=safe,
memory_pool=memory_pool)
else:
import pyarrow.pandas_compat as pdcompat
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wesm I'm just realising that here is one of the potential problems users may have reported about pdcompat import problems. This path should be supported without pandas but currently isn't.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I would agree with you. can you create a follow up issue about this? We should set up a docker-compose "no pandas" build to make sure that the project is usable without pandas

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fjetter Can you expand this to also only trigger the line once a user has Pandas? I added the necessary changes as Suggested Edits.

@codecov-io
Copy link

Codecov Report

Merging #3706 into master will decrease coverage by 21.31%.
The diff coverage is 57.14%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #3706       +/-   ##
===========================================
- Coverage   87.79%   66.48%   -21.32%     
===========================================
  Files         688      323      -365     
  Lines       84280    48123    -36157     
  Branches     1081        0     -1081     
===========================================
- Hits        73994    31993    -42001     
- Misses      10175    16130     +5955     
+ Partials      111        0      -111
Impacted Files Coverage Δ
python/pyarrow/pandas_compat.py 96.3% <100%> (-0.04%) ⬇️
python/pyarrow/array.pxi 68.75% <50%> (-0.45%) ⬇️
cpp/src/arrow/pretty_print.h 0% <0%> (-100%) ⬇️
cpp/src/parquet/murmur3.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/testing/util.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/io/hdfs-internal.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/util/sse-util.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/testing/random.h 0% <0%> (-100%) ⬇️
cpp/src/parquet/hasher.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/util/int-util.cc 0.39% <0%> (-99.21%) ⬇️
... and 512 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e61bcf...eb5c8ba. Read the comment docs.

@pitrou
Copy link
Member

pitrou commented Feb 20, 2019

I'm surprised we're importing Pandas inconditionally. We probably shouldn't do that, as Pandas is quite slow to import:

$ time python -c "import numpy"

real	0m0,097s
user	0m0,077s
sys	0m0,020s

$ time python -c "import pandas"

real	0m0,300s
user	0m0,272s
sys	0m0,028s

Here is a comparison of PyArrow import time with and without Pandas:

$ time python -c "import pyarrow"

real	0m0,360s
user	0m0,305s
sys	0m0,037s

$ time python -c "import sys; sys.modules['pandas'] = None; import pyarrow"

real	0m0,144s
user	0m0,124s
sys	0m0,020s

=> more than twice faster without.

@xhochy
Copy link
Member

xhochy commented Feb 20, 2019

@pitrou Any ideas on how to avoid these local imports and also have the benefit of only loading pandas when needed?

@pitrou
Copy link
Member

pitrou commented Feb 20, 2019

Imports are reasonably cheap once the module is already loaded, but it's probably better to avoid doing them in a tight loop. So hoisting the import outside of critical loops should be sufficient.

>>> %timeit import pandas                                                                                                                                         
102 ns ± 0.497 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit import pyarrow.pandas_compat as pdcompat                                                                                                              
253 ns ± 7.27 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

This is in pure Python, though. Cython does not seem to implement the same optimizations as CPython does:

>>> %timeit lib._noop_bench()                                                                                                                                     
66.3 ns ± 0.379 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit lib._import_bench()                                                                                                                                   
928 ns ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

(edit: issue opened for Cython at cython/cython#2854)

@xhochy
Copy link
Member

xhochy commented Feb 20, 2019

@pitrou Are you happy with the change here and we will deal with the Pandas import issue separately or should this patch be adapted before merging?

@pitrou
Copy link
Member

pitrou commented Feb 20, 2019

Yes, I've created ARROW-4637 for the pandas import.

@xhochy xhochy closed this in 957fe15 Feb 20, 2019
xhochy pushed a commit to xhochy/arrow that referenced this pull request Feb 20, 2019
The local imports slow down the conversion from pandas to arrow significantly (see [here](https://issues.apache.org/jira/browse/ARROW-4629))

Author: fjetter <[email protected]>
Author: Uwe L. Korn <[email protected]>

Closes apache#3706 from fjetter/local_imports and squashes the following commits:

eb5c8ba <Uwe L. Korn> Apply suggestions from code review
b4604be <fjetter> Only import pandas_compat if pandas is available
f1c8b40 <fjetter> Don't use local imports
@fjetter fjetter deleted the local_imports branch February 20, 2019 14:22
@wesm
Copy link
Member

wesm commented Feb 20, 2019

@xhochy @pitrou we could create a singleton object that exposes the pandas APIs that we need. The imports will be performed on creation of the singleton

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants