Skip to content

New defaults for concat, merge, combine_* #10062

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 56 commits into
base: main
Choose a base branch
from

Conversation

jsignell
Copy link
Contributor

@jsignell jsignell commented Feb 19, 2025

Replaces #10051

This PR attempts to throw warnings if and only if the output would change with the new kwarg defaults. To exercise the new default I am toggling the option back and forth and running the existing test suite.

With new kwarg values

  • Run all the tests with use_new_combine_kwarg_defaults=True -> 78 failed
  • Change to use_new_combine_kwarg_defaults=False and run the last failed -> 76 failed, 2 passed

Those 2 are missed alarms. In these two cases the behavior will be different when xarray switched to using new default kwarg values and there is no warning. But I am not totally sure whether we need to handle them because they are tests for conditions that used to raise an error and with the new defaults they do not.

FAILED xarray/tests/test_concat.py::TestConcatDataset::test_concat_do_not_promote - Failed: DID NOT RAISE <class 'ValueError'>
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_error - Failed: DID NOT RAISE <class 'xarray.core.merge.MergeError'>

With old kwarg values

  • Run all the tests with use_new_combine_kwarg_defaults=False -> 119 failed
  • Change to use_new_combine_kwarg_defaults=True and run the last failed -> 76 failed, 43 passed

Those 44 are potentially false alarms - they could also be tests that just don't assert that the result exactly matches some fixed expectation - but mostly they are false alarms.

Here is a list of them
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[outer-minimal-nested-t] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[outer-minimal-by_coords-None] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[inner-minimal-nested-t] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[inner-minimal-by_coords-None] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[left-minimal-nested-t] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[left-minimal-by_coords-None] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[right-minimal-nested-t] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_does_same_as_concat[right-minimal-by_coords-None] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_dataset_combine_attrs[drop] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_dataset_combine_attrs[override] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_dataset_combine_attrs[no_conflicts] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_dataset_combine_attrs[identical] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_dataset_combine_attrs[drop_conflicts] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_dataset_attr_by_coords - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_open_mfdataset_dataarray_attr_by_coords - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_common_coord_when_datavars_minimal - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestOpenMFDatasetWithDataVarsAndCoordsKw::test_invalid_data_vars_value_should_fail - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_backends.py::TestDask::test_open_mfdataset_concat_dim_none - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::TestDask::test_open_mfdataset_concat_dim_default_none - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_backends.py::test_h5netcdf_storage_options - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_combine.py::TestNestedCombine::test_auto_combine_2d - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_combine.py::TestNestedCombine::test_auto_combine_2d_combine_attrs_kwarg - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_combine.py::TestCombineDatasetsbyCoords::test_infer_order_from_coords - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_combine.py::TestCombineDatasetsbyCoords::test_combine_by_coords_no_concat - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_concat.py::TestConcatDataset::test_concat_coords_kwarg[dim1-minimal] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_concat.py::TestConcatDataset::test_concat_coords_kwarg[dim1-all] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_concat.py::TestConcatDataset::test_concat_errors - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_concat.py::TestConcatDataArray::test_concat_avoids_index_auto_creation - FutureWarning: In a future version of xarray the default value for coords will change from coords='different' to coords='minimal'. This is likely to lead to different results when multiple datasets have m...
FAILED xarray/tests/test_dask.py::test_map_blocks_da_ds_with_template[obj1] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_dask.py::test_map_blocks_errors_bad_template[obj1] - FutureWarning: In a future version of xarray the default value for data_vars will change from data_vars='all' to data_vars='minimal'. This is likely to lead to different results when multiple datasets hav...
FAILED xarray/tests/test_dataset.py::TestDataset::test_dataset_math_auto_align - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_datasets - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[no_conflicts-attrs10-attrs20-expected_attrs0-False] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[no_conflicts-attrs11-attrs21-expected_attrs1-False] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[no_conflicts-attrs12-attrs22-expected_attrs2-False] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[no_conflicts-attrs13-attrs23-expected_attrs3-True] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[drop-attrs14-attrs24-expected_attrs4-False] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[identical-attrs15-attrs25-expected_attrs5-False] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[identical-attrs16-attrs26-expected_attrs6-True] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[override-attrs17-attrs27-expected_attrs7-False] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[drop_conflicts-attrs18-attrs28-expected_attrs8-False] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_arrays_attrs_variables[<lambda>-attrs19-attrs29-expected_attrs9-False] - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...
FAILED xarray/tests/test_merge.py::TestMergeFunction::test_merge_no_conflicts_preserve_attrs - FutureWarning: In a future version of xarray the default value for compat will change from compat='no_conflicts' to compat='override'. This is likely to lead to different results when combining overlappin...

About half of them are triggered by my attempt to catch cases where different datasets have matching overlapping variables and with compat='no_conflicts' you might get different results than with compat='override'. There might be a better way, but we want to be careful to avoid calling compute.

Updating tests

After investigating the impact on existing tests, I updated the tests to make sure they still test what they were meant to test by passing in any required kwargs and I added new tests that explicitly ensure that for a bunch of different inputs, using old defaults throws a warning OR output is the same with new and old defaults.

Notes

  • I used sentinel values to indicate a kwarg value that the user has not explicitly set. The benefit of using this approach is that it makes it more obvious that people should not be setting these kwargs to None in their own code. Also with this approach it is possible to encode both the old and the new default value in the function signature.
  • When users have opted in to the new default values there is logic that tries to catch any errors that might be related to the new default kwargs and adds that context to the error message. In case people don't pay attention to the warnings this will at least give them some indication that things have changed and explain new errors that crop up. This only applies when there is already an error, so they don't catch places where the code succeeds but the result is different than before.

coords="different",
compat="equals",
join="outer",
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hard-coded these to the old defaults since there is no way for the user to set them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this approach. These options result in confusing groupby behaviour (#2145) but we can tackle that later

@jsignell jsignell force-pushed the concat_default_kwargs branch from 0e65034 to 5461a9f Compare February 24, 2025 20:07
@jsignell
Copy link
Contributor Author

The last test file that I need to work on is test_concat.py

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on!

I'd like to avoid adding the extra option. Can you remind us of what that would be useful please?

coords="different",
compat="equals",
join="outer",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this approach. These options result in confusing groupby behaviour (#2145) but we can tackle that later

assert_identical(ds2, actual)

actual = ds2.merge(ds1)
actual = ds2.merge(ds1, compat="override")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is dangerous, the dimensionality of 'x' in the output depends on the order of merging.

I intended compat="override" to be used when "x" is the same in all datasets, not merely 'compatible' as is here. It seems like we would want to at minimum, assert that ndim is same for all x in all datasets.

Opened #10094

@jsignell
Copy link
Contributor Author

jsignell commented Mar 4, 2025

I'd like to avoid adding the extra option. Can you remind us of what that would be useful please?

The option is part of the deprecation process. Basically how it works is in this PR the default values for these kwargs do not change. BUT you get FutureWarnings if the new default values would lead to a different result or an error. The use_new_combine_kwarg_defaults (this absolutely does not have to be the final name) allows you to opt into the new defaults prematurely. Setting that option to True lets you encounter any issues that might arise and it turns off the FutureWarnings.

This is how I'm thinking about the plan after this PR lands:

  1. after a sufficiently long deprecation cycle there should be another PR that:
    • removes the option (but (probably) doesn't throw if people are still setting it)
    • removes all the FutureWarnings
  2. after even more time one last PR that:
    • encodes the new defaults right into the function signature
    • removes the extra context about new defaults that is included in some error messages

I guess you could get rid of the option the tradeoff is that to get rid of the FutureWarnings people may have to hardcode to the new defaults. But there aren't too many false alarms (I need to update the PR description UPDATED) so most of the time when there is a FutureWarning it does indicate that hardcoding the kwarg value is a good idea.

@jsignell jsignell marked this pull request as ready for review March 4, 2025 21:01
@jsignell
Copy link
Contributor Author

jsignell commented Mar 4, 2025

This is ready for review! The failing test looks like it is also failing on main.

@jsignell
Copy link
Contributor Author

jsignell commented Mar 7, 2025

Wellll maybe the unit tests will pass now. I'll fix mypy and the doctests next week.

@jsignell
Copy link
Contributor Author

The mypy failures now match those in other PRs - I opened #10110 to track

@dcherian dcherian force-pushed the concat_default_kwargs branch from b683b37 to 6e0e0eb Compare July 19, 2025 04:26
except AlignmentError as e:
e.add_note(
"If you are intending to concatenate datasets, please specify the concatenation dimension explicitly. "
"Using merge to concatenate is quite inefficient."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So happy we can disallow this nonsense!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about e.add_note! That's very nice.

@dcherian dcherian force-pushed the concat_default_kwargs branch from 6e0e0eb to f63ed66 Compare July 19, 2025 04:36
@jsignell
Copy link
Contributor Author

Thanks Julia! I powered through test_concat and test_merge. I'll need another 2 hours to do test_backends and test_combine. But hopefully we can get this in for the next release.

It all looks good so far, I managed to reduce the false positive warnings on merge, so that's a good win!

That is awesome! Thanks so much for taking a look. It's great to have fewer false positives! Let me know if there is anything I can do to help!

@dcherian dcherian force-pushed the concat_default_kwargs branch from 69b9ef6 to bec25f6 Compare July 26, 2025 23:53
@@ -100,14 +108,18 @@ def concat(
load the data payload of data variables into memory if they are not
already loaded.
* "all": All data variables will be concatenated.
* None: Means ``"all"`` if ``dim`` is not present in any of the ``objs``,
Copy link
Contributor

@dcherian dcherian Jul 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is new.

We want to distinguish between concatenating ålong an existing dimension (numpy's concatenate) and concatenating along a new dimension (numpy's stack).

Without this (and with data_vars="minimal", coords="minimal" as default), we are not doing any concatenation if asked to concatenate along a new dim. That behaviour is certainly wrong. So now the new default we are migrating to is data_vars=None, coords="minimal" where we will at least stack the data vars when ask to concat along a new dimension

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok so effectively make the kwarg conditional on the presence of dim in the dataset? I guess that's ok because it is discernible without loading data?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes the improvement is that we don't ever check the data to make the decision. And because of named dimensions, we can distinguish between concatenate and stack (in numpy terms).

I'll bring this up for discussion at Wednesday's meeting and then merge if there are no objections.

Thanks for your work here!

@dcherian dcherian force-pushed the concat_default_kwargs branch 2 times, most recently from 29e0da6 to a45ff5f Compare July 27, 2025 02:38
@dcherian dcherian force-pushed the concat_default_kwargs branch from a45ff5f to f0eab2e Compare July 27, 2025 04:21
- ``coords``: "minimal"
- ``compat``: "override"
- ``join``: "exact"

By `Julia Signell <https://github.com/jsignell>`_.
(:issue:`8778`, :issue:`1385`, :pull:`10062`). By `Julia Signell <https://github.com/jsignell>`_.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should probably add yourself as an author on this one 😅 @dcherian

Copy link
Contributor Author

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might just be a few cleanups left with data_vars=None as the default. I can push if that is helpful, just don't want to step on your toes if you are still going :)

except AlignmentError as e:
e.add_note(
"If you are intending to concatenate datasets, please specify the concatenation dimension explicitly. "
"Using merge to concatenate is quite inefficient."
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about e.add_note! That's very nice.

@dcherian
Copy link
Contributor

I can push if that is helpful,

Absolutely, please go ahead.

I wasn't sure that you were still interested in pushing this over line given how long this PR has been open :) . I can review again when you're done

@jsignell
Copy link
Contributor Author

Hahah yeah still interested! I just didn't know what to do to make it palatable :) so I'm stoked to have your review!

@jsignell
Copy link
Contributor Author

Ok! I pushed the changes I was talking about and merged in main. I think the flaky tests are just being flaky 🤷🏻

@dcherian
Copy link
Contributor

I just didn't know what to do to make it palatable :)

Hehe, it's just a hard complex change. I had to check it out and use a debugger on a few tests to fully grok things, which then led to the fixes I pushed.

@jsignell
Copy link
Contributor Author

I just didn't know what to do to make it palatable :)

Hehe, it's just a hard complex change. I had to check it out and use a debugger on a few tests to fully grok things, which then led to the fixes I pushed.

Oh totally! People should take their time to feel comfortable with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants