PERF: get_block_type heavy use could benefit performance improvements #48212

@Code0x58

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

.

Installed Versions

INSTALLED VERSIONS
------------------
commit           : b5958ee1999e9aead1938c0bba2b674378807b3d
python           : 3.6.9.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.15.0-191-generic
Version          : #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8
pandas           : 1.1.5
numpy            : 1.19.4
pytz             : 2019.3
dateutil         : 2.8.0
pip              : 19.3.1
setuptools       : 41.6.0
Cython           : 0.23.4
pytest           : 4.0.1
hypothesis       : 3.66.8
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : 0.7.3
lxml.etree       : 4.3.2
html5lib         : None
pymysql          : None
psycopg2         : 2.8.4 (dt dec pq3 ext lo64)
jinja2           : 2.8
IPython          : 3.2.1
pandas_datareader: None
bs4              : 4.4.1
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.3
numexpr          : None
odfpy            : None
openpyxl         : 3.0.8
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.5.4
sqlalchemy       : 1.3.1
tables           : None
tabulate         : 0.8.5
xarray           : None
xlrd             : 1.2.0
xlwt             : None
numba            : 0.31.0

Prior Performance

While upgrading a large, pandas-heavy codebase from 0.19.2 to 1.1.5, I saw a fairly consistent ~25% performance drop across various slices of the codebase (other libraries were upgraded at the same time, so pandas is not necessarily the only contributor). While looking for areas to improve, profiling showed that making blocks takes quite a lot longer than before, and get_block_type is an easy place to boost performance because it is heavily used. This looks to be the case on master as well.

Caching the results of the method produced a ~5% performance increase in quite a large test suite, but it would be nice to see the change measured with asv.
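To illustrate the kind of caching meant here, below is a minimal sketch of memoizing a dtype classification with `functools.lru_cache`. It is not pandas code: `classify_dtype_uncached`, `classify_dtype`, the string dtype names, and the returned categories are all hypothetical stand-ins for a chain of `is_*dtype(...)` checks. It also assumes the result depends only on a hashable dtype key and not on the values themselves.

```python
from functools import lru_cache

# Counts how often the slow path runs, so the effect of caching is visible.
CALLS = {"uncached": 0}


def classify_dtype_uncached(dtype_name: str) -> str:
    """Hypothetical slow lookup: walks a chain of checks, much as a
    sequence of is_*dtype(...) calls would. Not pandas' get_block_type."""
    CALLS["uncached"] += 1
    if dtype_name.startswith("datetime64"):
        return "datetime"
    if dtype_name.startswith("timedelta64"):
        return "timedelta"
    if dtype_name.startswith(("int", "uint", "float", "complex")):
        return "numeric"
    if dtype_name == "bool":
        return "bool"
    return "object"


@lru_cache(maxsize=None)
def classify_dtype(dtype_name: str) -> str:
    """Memoized wrapper: each distinct dtype name is classified only once."""
    return classify_dtype_uncached(dtype_name)


if __name__ == "__main__":
    # Many lookups, but only a handful of distinct dtypes, which is the
    # situation where caching pays off.
    names = ["float64", "int64", "datetime64[ns]", "object"] * 1000
    kinds = [classify_dtype(name) for name in names]
    print(CALLS["uncached"], classify_dtype.cache_info())
```

Whether this pattern carries over depends on the real function's inputs: if the classification inspects the values as well as the dtype, the cache key would need to account for that.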

Generally, it looks like the is_*dtype(...) call sites might benefit from attention, so I will look into those if a usable pattern comes out of this. IIRC, it was an extension dtype added in 2016, probably just after 0.19.2, that looked like it took a fair bit of time.

Labels: Closing Candidate (may be closeable, needs more eyeballs); Internals (related to non-user accessible pandas implementation); Performance (memory or execution speed performance)
