Skip to content

LinkedDataFrame linkage indexes do not update on groupby #4

@aclarry

Description

@aclarry

Overview

GGHM demand model code makes extensive use of LinkedDataFrame.groupby, which worked under old versions of pandas, but are broken in pandas >=0.25.

Reproduction:

Install cheval with pandas >=0.25, and run the following

df1_dict = {
    "df2_id": [1,     2,   3,   1,   2,   3],
    "cats":   ["M", "F", "M", "F", "M", "F"],
}

df2_dict = {
    "col1":   ["a", "b", "c"],
}

df1 = LinkedDataFrame(pd.DataFrame(df1_dict))
df2 = LinkedDataFrame(pd.DataFrame(df2_dict))

df1.link_to(df2, "df2", on_self="df2_id")

for cat, group_df in df1.groupby("cats"):
    group_df.df2.col1 # ValueError: Length of values (6) does not match length of index (3)

This is happening because the private method interception to update the linkage indexes (_take and _take_with_is_copy) is no longer getting called in newer versions of pandas.

Recommended fix:

Find a less hack-ey method to update indexes. Either:

  • Move to having all updates happen in a stable public endpoint (likely __finalize__)
  • Have all linkages check indexes when they are accessed and update them on-the-fly if necessary (would have to be careful, as a naïve approach may subtly break code if the indexes are the same length)

If performance poses a concern, we could do both and use a stable endpoind, but do lazy update of indexes.

Note:

In pandas 1.3, there is a separate bug which causes LinkedDataFrames to not be properly initialized during groupbys: see pandas issue #45363. This would result in an infinite recursion on LinkedDataFrame.__getattr__ when doing basically anything on the grouped LinkedDataFrame objects, and so the above code would issue a RecursionError.

This issue is basically unfixable on our end; it was fixed in pandas 1.4, but we may want cheval to issue a warning if it detects this incompatible version of pandas.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions