ARROW-4629: [Python] Pandas arrow conversion slowed down by imports #3706
python/pyarrow/array.pxi
@@ -15,6 +15,7 @@ | |||
# specific language governing permissions and limitations | |||
# under the License. | |||
|
|||
import pyarrow.pandas_compat as pdcompat |
pandas is an optional dependency of pyarrow, which is why it is not imported here.
I changed it to an import iff pandas is available
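A minimal sketch of what "import iff pandas is available" can look like in plain Python. It is not the code from this patch: the helper `optional_import`, the flag `HAVE_PANDAS`, and the function `describe` are hypothetical names used only for illustration, and the sketch handles top-level package names only.

```python
import importlib
import importlib.util


def optional_import(name):
    """Return the imported top-level module, or None if it is not installed."""
    if importlib.util.find_spec(name) is None:
        return None
    return importlib.import_module(name)


# Resolved once at module import time, not on every function call.
pd = optional_import("pandas")  # None when pandas is not installed
HAVE_PANDAS = pd is not None


def describe(obj):
    """Take the pandas-specific branch only when pandas could be imported."""
    if HAVE_PANDAS and isinstance(obj, pd.Series):
        return "pandas.Series"
    return type(obj).__name__
```

The trade-off is that pandas, when installed, is still imported eagerly at module load, so this removes the per-call import cost but not the start-up cost.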
@@ -165,7 +170,6 @@ def array(object obj, type=None, mask=None, size=None, bint from_pandas=False, | |||
from_pandas=True, safe=safe, | |||
memory_pool=memory_pool) | |||
else: | |||
import pyarrow.pandas_compat as pdcompat |
@wesm I just realised that this is probably one of the pdcompat import problems users have reported. This path should be supported without pandas but currently isn't.
Yes, I agree. Can you create a follow-up issue about this? We should set up a docker-compose "no pandas" build to make sure that the project is usable without pandas.
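Short of a dedicated "no pandas" build, a test can simulate a missing package inside one process. The sketch below relies on the documented behaviour that a `None` entry in `sys.modules` makes the import of that name fail. `module_blocked` and `can_import` are hypothetical helpers, not part of pyarrow, and the usage example blocks `json` as a stand-in for pandas.

```python
import contextlib
import importlib
import sys


@contextlib.contextmanager
def module_blocked(name):
    """Temporarily make `import name` fail, as if the package were missing."""
    missing = object()
    saved = sys.modules.get(name, missing)
    sys.modules[name] = None  # a None entry makes the import raise ImportError
    try:
        yield
    finally:
        # Restore the previous state so later imports behave normally.
        if saved is missing:
            del sys.modules[name]
        else:
            sys.modules[name] = saved


def can_import(name):
    """Return True if `name` can be imported right now."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


with module_blocked("json"):
    print(can_import("json"))  # import is blocked inside the context
print(can_import("json"))      # and works again afterwards
```

This only affects imports executed while the context is active; modules that already hold a reference to the blocked package keep working, so it complements rather than replaces a clean environment.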
I went ahead and opened https://issues.apache.org/jira/browse/ARROW-4640
@fjetter Can you extend this so that the line also only runs when the user has pandas installed? I added the necessary changes as suggested edits.
Co-Authored-By: fjetter <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #3706 +/- ##
===========================================
- Coverage 87.79% 66.48% -21.32%
===========================================
Files 688 323 -365
Lines 84280 48123 -36157
Branches 1081 0 -1081
===========================================
- Hits 73994 31993 -42001
- Misses 10175 16130 +5955
+ Partials 111 0 -111
|
I'm surprised we're importing Pandas unconditionally. We probably shouldn't do that, as Pandas is quite slow to import:
Here is a comparison of PyArrow import time with and without Pandas:
=> more than twice as fast without.
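One way to reproduce such a comparison is to time a fresh interpreter for each import, so that nothing is cached in `sys.modules`. This is only a sketch and not necessarily how the numbers above were obtained. `cold_import_seconds` is a hypothetical helper, the measurement includes interpreter start-up, and `json` stands in for the module of interest.

```python
import subprocess
import sys
import time


def cold_import_seconds(statement, repeats=3):
    """Best-of-N wall time to run `statement` in a fresh Python process."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", statement], check=True)
        best = min(best, time.perf_counter() - start)
    return best


baseline = cold_import_seconds("pass")           # interpreter start-up only
with_import = cold_import_seconds("import json")  # start-up plus the import
print(f"start-up: {baseline:.3f}s, with import: {with_import:.3f}s")
```

Replacing `"import json"` with `"import pyarrow"` in environments with and without pandas installed would give the comparison described here. CPython's `-X importtime` option provides a per-module breakdown if more detail is needed.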
@pitrou Any ideas on how to avoid these local imports and also have the benefit of only loading pandas when needed? |
Imports are reasonably cheap once the module is already loaded, but it's probably better to avoid doing them in a tight loop. So hoisting the import outside of critical loops should be sufficient.
>>> %timeit import pandas
102 ns ± 0.497 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit import pyarrow.pandas_compat as pdcompat
253 ns ± 7.27 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
This is in pure Python, though. Cython does not seem to implement the same optimizations as CPython does:
>>> %timeit lib._noop_bench()
66.3 ns ± 0.379 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit lib._import_bench()
928 ns ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
(edit: issue opened for Cython at cython/cython#2854)
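Hoisting an import out of a hot loop can be sketched in pure Python as follows. Both function names are hypothetical and `json` is used as a stand-in for `pyarrow.pandas_compat`. The two functions return the same result and differ only in how often the `import` statement executes.

```python
import timeit


def convert_local_import(values):
    out = []
    for v in values:
        # Executed on every iteration: a sys.modules lookup plus a name binding.
        import json
        out.append(json.dumps(v))
    return out


def convert_hoisted(values):
    # Executed once per call, outside the loop.
    import json
    out = []
    for v in values:
        out.append(json.dumps(v))
    return out


data = list(range(1000))
for fn in (convert_local_import, convert_hoisted):
    seconds = timeit.timeit(lambda: fn(data), number=200)
    print(f"{fn.__name__}: {seconds:.4f}s")
```

In pure Python the difference per iteration is small, consistent with the `%timeit` figures above. According to the Cython measurement, the repeated import costs considerably more in compiled `.pxi` code.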
@pitrou Are you happy with the change here, so that we deal with the Pandas import issue separately, or should this patch be adapted before merging?
Yes, I've created ARROW-4637 for the pandas import. |
The local imports slow down the conversion from pandas to arrow significantly (see [here](https://issues.apache.org/jira/browse/ARROW-4629))

Author: fjetter <[email protected]>
Author: Uwe L. Korn <[email protected]>

Closes apache#3706 from fjetter/local_imports and squashes the following commits:

eb5c8ba <Uwe L. Korn> Apply suggestions from code review
b4604be <fjetter> Only import pandas_compat if pandas is available
f1c8b40 <fjetter> Don't use local imports