-
Notifications
You must be signed in to change notification settings - Fork 2
Description
Overview
GGHM demand model code makes extensive use of LinkedDataFrame.groupby
, which worked under old versions of pandas, but are broken in pandas >=0.25.
Reproduction:
Install cheval with pandas >=0.25, and run the following
df1_dict = {
"df2_id": [1, 2, 3, 1, 2, 3],
"cats": ["M", "F", "M", "F", "M", "F"],
}
df2_dict = {
"col1": ["a", "b", "c"],
}
df1 = LinkedDataFrame(pd.DataFrame(df1_dict))
df2 = LinkedDataFrame(pd.DataFrame(df2_dict))
df1.link_to(df2, "df2", on_self="df2_id")
for cat, group_df in df1.groupby("cats"):
group_df.df2.col1 # ValueError: Length of values (6) does not match length of index (3)
This is happening because the private method interception to update the linkage indexes (_take
and _take_with_is_copy
) is no longer getting called in newer versions of pandas.
Recommended fix:
Find a less hack-ey method to update indexes. Either:
- Move to having all updates happen in a stable public endpoint (likely
__finalize__
) - Have all linkages check indexes when they are accessed and update them on-the-fly if necessary (would have to be careful, as a naïve approach may subtly break code if the indexes are the same length)
If performance poses a concern, we could do both and use a stable endpoind, but do lazy update of indexes.
Note:
In pandas 1.3, there is a separate bug which causes LinkedDataFrames to not be properly initialized during groupbys: see pandas issue #45363. This would result in an infinite recursion on LinkedDataFrame.__getattr__
when doing basically anything on the grouped LinkedDataFrame
objects, and so the above code would issue a RecursionError
.
This issue is basically unfixable on our end; it was fixed in pandas 1.4, but we may want cheval to issue a warning if it detects this incompatible version of pandas.