New defaults for concat, merge, combine_* #10062
Conversation
coords="different", | ||
compat="equals", | ||
join="outer", | ||
) |
I hard-coded these to the old defaults since there is no way for the user to set them.
I agree with this approach. These options result in confusing groupby behaviour (#2145) but we can tackle that later
The last test file that I need to work on is test_concat.py
Thanks for taking this on!
I'd like to avoid adding the extra option. Can you remind us why that would be useful, please?
coords="different", | ||
compat="equals", | ||
join="outer", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this approach. These options result in confusing groupby behaviour (#2145) but we can tackle that later
xarray/tests/test_merge.py (Outdated)
assert_identical(ds2, actual)

- actual = ds2.merge(ds1)
+ actual = ds2.merge(ds1, compat="override")
Ok, this is dangerous: the dimensionality of 'x' in the output depends on the order of merging. I intended compat="override" to be used when "x" is the same in all datasets, not merely 'compatible' as it is here. It seems like we would want to, at minimum, assert that ndim is the same for "x" across all datasets.
Opened #10094
The option is part of the deprecation process. Basically how it works is that in this PR the default values for these kwargs do not change, BUT you get a FutureWarning whenever the result would differ under the new defaults. This is how I'm thinking about the plan after this PR lands:
I guess you could get rid of the option; the tradeoff is that to get rid of the …
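For illustration, here is a rough sketch of what that transition could look like from a user's perspective; the specific calls and the exact warning behaviour are assumptions for this sketch, not lifted from the PR:

```python
import xarray as xr

a = xr.Dataset({"v": ("x", [1, 2])}, coords={"x": [0, 1]})
b = xr.Dataset({"w": ("x", [3, 4])}, coords={"x": [1, 2]})

# Relying on the old defaults (join="outer", compat="no_conflicts") where the
# result would differ under the new ones would, under this scheme, emit a
# FutureWarning:
merged = xr.merge([a, b])

# Spelling out the kwargs you actually want keeps today's behaviour and
# should silence the warning:
merged = xr.merge([a, b], join="outer", compat="no_conflicts")
```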
This is ready for review! The failing test looks like it is also failing on main.
Wellll maybe the unit tests will pass now. I'll fix mypy and the doctests next week.
The mypy failures now match those in other PRs - I opened #10110 to track them.
except AlignmentError as e:
    e.add_note(
        "If you are intending to concatenate datasets, please specify the concatenation dimension explicitly. "
        "Using merge to concatenate is quite inefficient."
So happy we can disallow this nonsense!
I didn't know about e.add_note! That's very nice.
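For reference, a minimal sketch of how add_note behaves (available on exceptions since Python 3.11); the exception type and messages here are made up for illustration:

```python
try:
    raise ValueError("cannot align objects with conflicting indexes")
except ValueError as e:
    # add_note attaches extra context to the exception without changing its
    # type or message; the note is printed at the end of the traceback.
    e.add_note("Hint: pass an explicit `dim` to concat instead of merging.")
    raise
```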
That is awesome! Thanks so much for taking a look. It's great to have fewer false positives! Let me know if there is anything I can do to help!
@@ -100,14 +108,18 @@ def concat(
      load the data payload of data variables into memory if they are not
      already loaded.
    * "all": All data variables will be concatenated.
    * None: Means ``"all"`` if ``dim`` is not present in any of the ``objs``,
This is new. We want to distinguish between concatenating along an existing dimension (numpy's concatenate) and concatenating along a new dimension (numpy's stack).
Without this (and with data_vars="minimal", coords="minimal" as defaults), we are not doing any concatenation if asked to concatenate along a new dim. That behaviour is certainly wrong. So now the new default we are migrating to is data_vars=None, coords="minimal", where we will at least stack the data vars when asked to concat along a new dimension.
Ah ok, so effectively make the kwarg conditional on the presence of dim in the dataset? I guess that's ok because it is discernible without loading data?
Yes, the improvement is that we don't ever check the data to make the decision. And because of named dimensions, we can distinguish between concatenate and stack (in numpy terms).
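As a reference point, a small sketch of the two cases being distinguished; the datasets are made up, and the comment about data_vars=None describes the behaviour proposed above rather than a released default:

```python
import xarray as xr

ds1 = xr.Dataset({"a": ("x", [1, 2])})
ds2 = xr.Dataset({"a": ("x", [3, 4])})

# "x" already exists in both objects, so this is the numpy.concatenate-like
# case: "a" ends up with shape (4,) along "x".
along_existing = xr.concat([ds1, ds2], dim="x")

# "t" exists in neither object, so this is the numpy.stack-like case: "a"
# gains a new "t" dimension and ends up with shape (2, 2). Under the proposed
# data_vars=None default this case would still behave like data_vars="all".
along_new = xr.concat([ds1, ds2], dim="t")
```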
I'll bring this up for discussion at Wednesday's meeting and then merge if there are no objections.
Thanks for your work here!
- ``coords``: "minimal" | ||
- ``compat``: "override" | ||
- ``join``: "exact" | ||
|
||
By `Julia Signell <https://github.com/jsignell>`_. | ||
(:issue:`8778`, :issue:`1385`, :pull:`10062`). By `Julia Signell <https://github.com/jsignell>`_. |
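To make the join change concrete, a hypothetical example of how the default switch from "outer" to "exact" would surface to users (datasets invented for illustration):

```python
import xarray as xr

a = xr.Dataset({"v": ("x", [1, 2])}, coords={"x": [0, 1]})
b = xr.Dataset({"v": ("x", [3, 4])}, coords={"x": [1, 2]})

# Old default join="outer": the mismatched "x" indexes are unioned and the
# missing values filled, silently producing NaNs.
old_behaviour = xr.concat([a, b], dim="y", join="outer")

# New default join="exact": the same call without an explicit join would
# instead raise an alignment error, forcing the user to opt in to the union.
# xr.concat([a, b], dim="y")  # expected to raise under the new defaults
```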
You should probably add yourself as an author on this one 😅 @dcherian
There might just be a few cleanups left with data_vars=None as the default. I can push if that is helpful, just don't want to step on your toes if you are still going :)
Absolutely, please go ahead. I wasn't sure that you were still interested in pushing this over the line given how long this PR has been open :). I can review again when you're done.
Hahah yeah still interested! I just didn't know what to do to make it palatable :) so I'm stoked to have your review!
Ok! I pushed the changes I was talking about and merged in main. I think the flaky tests are just being flaky 🤷🏻
Hehe, it's just a hard complex change. I had to check it out and use a debugger on a few tests to fully grok things, which then led to the fixes I pushed.
Oh totally! People should take their time to feel comfortable with it.
Replaces #10051
- FutureWarnings
- whats-new.rst
- New functions/methods are listed in api.rst
This PR attempts to throw warnings if and only if the output would change with the new kwarg defaults. To exercise the new default I am toggling the option back and forth and running the existing test suite.
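Roughly, the toggling looks like this; exposing the option through xr.set_options is an assumption based on how other xarray options work:

```python
import xarray as xr

datasets = [
    xr.Dataset({"a": ("time", [1, 2])}),
    xr.Dataset({"a": ("time", [3, 4])}),
]

# Run the same operation under the new defaults...
with xr.set_options(use_new_combine_kwarg_defaults=True):
    result_new = xr.concat(datasets, dim="time")

# ...and under the old defaults, then compare the results (or check for a
# FutureWarning when they would differ).
with xr.set_options(use_new_combine_kwarg_defaults=False):
    result_old = xr.concat(datasets, dim="time")
```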
With new kwarg values
- use_new_combine_kwarg_defaults=True -> 78 failed
- use_new_combine_kwarg_defaults=False and run the last failed -> 76 failed, 2 passed

Those 2 are missed alarms. In these two cases the behavior will be different when xarray switches to the new default kwarg values, and there is no warning. But I am not totally sure whether we need to handle them, because they are tests for conditions that used to raise an error and with the new defaults they do not.
With old kwarg values
- use_new_combine_kwarg_defaults=False -> 119 failed
- use_new_combine_kwarg_defaults=True and run the last failed -> 76 failed, 43 passed

Those 43 are potentially false alarms - they could also be tests that just don't assert that the result exactly matches some fixed expectation - but mostly they are false alarms.
Here is a list of them.
About half of them are triggered by my attempt to catch cases where different datasets have matching overlapping variables, where with compat='no_conflicts' you might get different results than with compat='override'. There might be a better way, but we want to be careful to avoid calling compute.
Updating tests
After investigating the impact on existing tests, I updated the tests to make sure they still test what they were meant to test by passing in any required kwargs, and I added new tests that explicitly ensure that, for a bunch of different inputs, using the old defaults throws a warning OR the output is the same with new and old defaults.
Notes