Add refresh_attributes() and implement cache_attrs for Group and Array #3215

bojidar-bg · 2025-07-08T10:53:25Z

Should resolve #3178, if one passes cache_attrs=False when creating the various groups.

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/user-guide/*.rst
Changes documented as a new file in changes/
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

Should resolve zarr-developers#3178

codecov · 2025-07-08T11:05:04Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.75%. Comparing base (378d5af) to head (075a330).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3215      +/-   ##
==========================================
+ Coverage   94.73%   94.75%   +0.01%     
==========================================
  Files          78       78              
  Lines        8646     8670      +24     
==========================================
+ Hits         8191     8215      +24     
  Misses        455      455

| Files with missing lines | Coverage Δ |
|---|---|
| src/zarr/api/synchronous.py | `94.66% <ø> (ø)` |
| src/zarr/core/array.py | `98.48% <100.00%> (+0.02%)` ⬆️ |
| src/zarr/core/group.py | `94.94% <100.00%> (+0.06%)` ⬆️ |


d-v-b · 2025-07-08T11:19:22Z

src/zarr/api/synchronous.py

+    cache_attrs : bool, optional
+        If True (default), user attributes will be cached for attribute read
+        operations. If False, user attributes are reloaded from the store prior
+        to all attribute read operations.


This docstring says True is the default, but the parameter itself has a default of None.

Other places defining a cache_attrs argument did the same, so I decided to follow suit. (:

Let's break with the past and ensure that the docstring matches the annotation here.
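
One way to keep the docstring and the annotation in agreement is to make the default explicit in the signature. The sketch below uses a hypothetical `open_group_sketch` function (a stand-in, not the zarr-python API) and reads the default back with `inspect.signature`:

```python
import inspect


def open_group_sketch(path: str, cache_attrs: bool = True) -> dict:
    """Hypothetical stand-in for a function taking ``cache_attrs``.

    Parameters
    ----------
    path : str
        Location of the group (illustrative only).
    cache_attrs : bool, optional
        If True (default), user attributes are cached for attribute read
        operations. If False, they are reloaded from the store before
        every attribute read.
    """
    return {"path": path, "cache_attrs": cache_attrs}


# The signature default now agrees with what the docstring promises.
default = inspect.signature(open_group_sketch).parameters["cache_attrs"].default
```

If `None` has to stay as a sentinel (for example to defer to a global config), the docstring would instead need to say what `None` resolves to.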

d-v-b · 2025-07-08T11:21:17Z

Does anyone have opinions on whether cache_attrs should default to True or False?

TomAugspurger · 2025-07-08T18:32:18Z

Can someone write up a detailed proposal of the model here? The interaction between in-memory metadata objects and their serialized counterparts is a big topic.

If we're formally committing to something then we should do so carefully, and document it extensively.

bojidar-bg · 2025-07-10T21:22:32Z

From my perspective of "just solving the issue", the crux of this PR is the refresh_attributes() function, and not the cache_attrs argument—I implemented cache_attrs only because it looked convenient to knock off another v3 missing feature. (:

As for a model... that would be a good idea, I agree.

If I understand the current main branch code correctly (from a cursory reading, mind you), the current model for caching data is that the store may cache data, but Array/Group never cache data themselves, only metadata. Metadata is cached when opening an array/group, and kept cached until you discard the respective object.
As such, array data and group items are never cached; meanwhile, group attributes and consolidated metadata, as well as array attributes, shapes, codecs, encoders, chunk sizes and such are always cached.
(Critically, since stores can never notify us of remote changes, writing to the same chunk/shard of data from multiple open instances of the same Array, or writing to the metadata of multiple open instances of the same Array or Group results in latest-write-wins behavior; which, with multiple peers resizing the same array, can result in various kinds of inconsistent data.)
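
A toy model of the behavior described above, using a plain dict as the "store" and a hypothetical `Handle` class (neither is zarr-python code), shows both the stale snapshot and the latest-write-wins outcome:

```python
import json

# Toy "store": a dict from key to serialized bytes (assumption, not a zarr Store).
store: dict[str, bytes] = {"grp/attrs.json": json.dumps({"n": 0}).encode()}


class Handle:
    """Hypothetical handle that snapshots attributes once, when opened."""

    def __init__(self, store: dict[str, bytes], key: str) -> None:
        self.store, self.key = store, key
        self.attrs = json.loads(store[key])  # cached for the handle's lifetime

    def set_attr(self, name: str, value: object) -> None:
        # Update the local snapshot, then overwrite the whole stored document.
        self.attrs[name] = value
        self.store[self.key] = json.dumps(self.attrs).encode()


a = Handle(store, "grp/attrs.json")
b = Handle(store, "grp/attrs.json")

a.set_attr("from_a", 1)
stale = "from_a" in b.attrs  # b still holds its snapshot from open time

b.set_attr("from_b", 2)  # b writes its stale snapshot back
final = json.loads(store["grp/attrs.json"])  # a's update has been overwritten
```

After the second write, the stored document contains `from_b` but no longer `from_a`, which is the kind of silent loss the paragraph above warns about.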

This PR adds the possibility of manually (or automatically) reloading just the attributes of an array or group. Ideally, there would be support for reloading the whole metadata in case there might have been changes made by others—but for that, we would need to either make arrays/groups unfrozen, or guide users to re-open the whole array/group (and replace all references to the old version) in case there have been changes made by another process.

As for a better model:

Automatically watching for metadata changes seems infeasible. Perhaps support for it can be added in the store, but apart from the local and memory stores, I don't think that can work for s3- and http-based stores.
Leaving things as-is and relying on users to manually re-open modified groups and arrays sounds like a footgun waiting to happen, especially with longer-running processes which might keep a reference to an array around.
Adding a refresh_metadata method, with some ways to control metadata caching, seems plausible, but would require the AsyncArray() and AsyncGroup() classes to be unfrozen (so that they can replace their metadata on-the-fly).
Merging this PR as-is with its refresh_attributes method is probably also a footgun waiting to happen; as now not only would users have a way to slowly drift out of sync with their arrays' newest shapes, but they would also have a function to sync everything but the array's shape 😅
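
To make the "frozen" constraint concrete, here is a minimal sketch with a hypothetical `FrozenHandle` dataclass (an illustration only, not the actual AsyncArray/AsyncGroup definitions). An in-place refresh is rejected, so the only route is to build a new instance:

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class FrozenHandle:
    """Hypothetical frozen handle holding a metadata snapshot."""

    path: str
    metadata: tuple[tuple[str, int], ...]  # immutable stand-in for real metadata


h = FrozenHandle(path="arr", metadata=(("rows", 10),))

try:
    h.metadata = (("rows", 20),)  # in-place refresh is rejected
    refreshed_in_place = True
except FrozenInstanceError:
    refreshed_in_place = False

# The alternative: build a new instance and let callers swap their references.
h2 = replace(h, metadata=(("rows", 20),))
```

Every holder of the old `h` keeps seeing the old snapshot until it is handed `h2`, which is why re-opening is awkward for long-lived references.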

Thoughts?

TomAugspurger · 2025-07-14T17:35:01Z

Very briefly, my understanding of the current model is that there isn't any metadata caching at the Python Array / Group object level. (Store instances are free to do what they want, and I think fsspec-based ones might cache things, but let's focus on the zarr-python objects.)

Under that model, you can rely on your Group / Array metadata reflecting the state of the Store at the time that you created the instance. Methods that update the store in some way (creating, deleting, or updating) immediately synchronize with the Store. That's relatively simple to explain and understand.

IIUC, this attribute (or more generally, metadata) caching / refreshing is equivalent to users creating a new Group / Array instance. Should we steer users towards that? I hesitate to overcomplicate our API.

bojidar-bg · 2025-07-14T22:34:27Z

"Group / Array metadata reflecting the state of the Store at the time that you created the instance" is metadata being stored for the lifetime of the Group/Array instance, with no way to invalidate that stored metadata except by throwing away the whole Group / Array instance and recreating it anew. As "cached" data is data stored outside the main source of truth, especially for optimization purposes, I would claim that this metadata stored for the lifetime of the Group/Array instance is a caching mechanism at the Group / Array level.

Now, if Group / Array instances are short-lived, say, you create a group and immediately use it, there shouldn't be any harm coming from such caching. However, if users are likely to reuse and share Group / Array instances between multiple user-defined objects, then such caching can be rather unintuitive. As exemplified by the original issue (#3178), such behavior is confusing if you somehow end up with multiple long-lived instances of the same group. And, what's worse, currently there's no indication that "creating a new Group / Array instance" is ever going to be necessary when you are just starting out, which can then tempt users to reuse the same Array instance across all their classes, making future recreation of that Array instance somewhat challenging.

--

Since Array already has e.g. an append method which modifies the Array object's metadata in-place, I think it's safe to say that the API is already incompatible with Array being an immutable object that stores metadata. Likewise with Group.attributes allowing direct modification of metadata.
As such, I feel like there are three generally-consistent positions one can take:

If metadata is part of the Array / Group, and both don't behave as short-lived immutable frozen pure-data objects, it makes sense to completely unfreeze them, and add methods with which to update the cached metadata.
If metadata is not part of the Array / Group, and both don't store anything past their Store and path, it makes sense to remove the metadata property from the Array/Group and push caching of metadata entirely to the store. (*Or: to an extra MetadataCache singleton of some sort, that stores and maintains a mutable mapping from Store/path to metadata)
If metadata is part of the Array / Group, yet both do behave as short-lived pure-data objects, it makes sense to adapt the API (likely turning it into a sort of fluent API) as well as the documentation and examples to always use method chaining instead of variables, and to treat Array / Group as short-lived utility objects which just expose a set of methods—with red capital letters around the risks of holding long-lived references to Array / Group, and maybe even warnings when the library can detect a long-lived reference.
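
The second position (handles that hold only a store and a path, with metadata living in a shared cache) could look roughly like the sketch below. `MetadataCache`, `ThinHandle`, and the dict-based store are all hypothetical names invented for illustration:

```python
import json


class MetadataCache:
    """Hypothetical shared cache mapping (store id, path) to parsed metadata."""

    def __init__(self) -> None:
        self._entries: dict[tuple[int, str], dict] = {}

    def get(self, store: dict[str, bytes], path: str) -> dict:
        key = (id(store), path)
        if key not in self._entries:
            self._entries[key] = json.loads(store[path])
        return self._entries[key]

    def invalidate(self, store: dict[str, bytes], path: str) -> None:
        self._entries.pop((id(store), path), None)


CACHE = MetadataCache()


class ThinHandle:
    """Holds only a store and a path; metadata is looked up on demand."""

    def __init__(self, store: dict[str, bytes], path: str) -> None:
        self.store, self.path = store, path

    @property
    def metadata(self) -> dict:
        return CACHE.get(self.store, self.path)


store = {"arr": json.dumps({"shape": [10]}).encode()}
h1, h2 = ThinHandle(store, "arr"), ThinHandle(store, "arr")

first = h1.metadata["shape"]  # parsed once and cached
store["arr"] = json.dumps({"shape": [20]}).encode()  # change behind the cache
CACHE.invalidate(store, "arr")  # a single invalidation ...
seen = (h1.metadata["shape"], h2.metadata["shape"])  # ... reaches every handle
```

The point of the design is that invalidation happens in one place, so no handle can drift out of sync with another handle for the same path.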

TomAugspurger · 2025-07-22T16:32:39Z

On

> with no way to invalidate that stored metadata except by throwing away the whole Group / Array instance and recreating it anew

and

> I would claim that this metadata stored for the lifetime of the Group/Array instance is a caching mechanism at the Group / Array level.

I can't tell if it's just semantics or not, but I feel like there is a difference between "this object doesn't attempt to cache" (my words) and "this metadata stored for the lifetime of the Group/Array instance is a caching mechanism at the Group / Array level" (your words), despite them being functionally the same.

If I read a JSON file into a dictionary:

d = json.loads(Path('file.json').read_text())

Would we say that d is caching the contents of file.json? Would users have any expectation that updating file.json results in changes to d?
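
For concreteness, a runnable version of that snippet (the temporary directory and the file contents are assumptions added for illustration) shows that `d` is a snapshot taken at read time:

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "file.json"
    path.write_text(json.dumps({"a": 1}))

    d = json.loads(path.read_text())  # snapshot of the file at read time

    path.write_text(json.dumps({"a": 2}))  # the file changes afterwards
    reread = json.loads(path.read_text())  # only a fresh read sees the change
```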

So in general, I'd push people toward recreating a Group or Array if they need to ensure that their object is up to date, rather than providing a .refresh_attributes() method (and we can provide a helper method that provides a new Group / Array with up to date metadata by re-reading from the store, if that's helpful).

Very good point about us being loose with whether or not our Array and Group objects really are immutable at the metadata level (Array.append). I'm not sure, but I'd guess that's coming from trying to mirror the NumPy API and that clashing with our other goals like immutability.

Overall, I'd push us towards simple, stateless objects in zarr-python. I recognize that trying to layer stateless behavior on top of a fundamentally stateful thing (objects in a Store) is fraught. But zarr-python really isn't equipped to completely and robustly handle synchronization and consistency. You'd need something like icechunk for that.

IMO zarr-python can keep things simple and provide the stateless building blocks that other things can be built on top of.

But that's just my loosely held opinion. I don't want to hold this up anymore if other maintainers disagree with me.

bojidar-bg · 2025-07-24T12:14:45Z

To reply to your question:

> Would we say that d is caching the contents of file.json? Would users have any expectation that updating file.json results in changes to d?

You could easily make a point that "d is the cached version of file.json (at the time of reading)". You wouldn't expect that the cache is kept up to date automatically, unless you've configured some mechanism to do that (and as .read_text returns a str and not some fancy subscribable-to object, you would quickly assume that the cached version is never updated).

I guess there is some difference in semantics there; but to me, "cache" is a very loose concept, which applies even in cases when you do something as simple as:

d = load_d()
print(d['a'] + d['b'])

instead of:

print(load_d()['a'] + load_d()['b'])

I can't comment on whether a simple stateless/immutable implementation or a fancy stateful implementation is better for Zarr Python—so I'll trust you on that 😅

In this case, if you would rather double down on making things stateless, I think the best option might be to remove the metadata field from the Array/Group, and instead fetch it every time it's accessed from storage. It would make the code somewhat slower, even with caching at the Store level, since we would have to re-parse the metadata JSON every time an array is resized or even just read from (*in case someone replaced the whole array from "under our feet" and now the filters are different).

Alternatively, we could make things simple, if inconsistent: leave things as they are, with a (cached) copy of the metadata in the Array/Group instance, and carefully document that behavior. In that case, the original issue this PR was made for could be closed as "wontfix", with users advised that opening a group loads the metadata as of that moment, while listing subgroups or reading from an array reads the state at the time of listing/reading instead. Array.append / update_metadata and similar cached-state-mutating functions would have to remain inconsistent, however: even if we force users to get a new immutable instance every time, Group.attributes[..] = .. cannot return the new Group instance it creates.
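
The last point can be seen in plain Python: item assignment is a statement, so whatever `__setitem__` returns is discarded, whereas a method call can hand back a new instance. `ImmutableGroup` and `MutableAttrs` below are hypothetical classes written only to illustrate this:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ImmutableGroup:
    """Hypothetical group whose updates return new instances."""

    attrs: tuple[tuple[str, object], ...] = ()

    def with_attr(self, key: str, value: object) -> ImmutableGroup:
        merged = {**dict(self.attrs), key: value}
        return ImmutableGroup(attrs=tuple(merged.items()))


class MutableAttrs(dict):
    """Item assignment is a statement: __setitem__'s return value is discarded."""

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        return "new instance"  # never reaches the caller of attrs[key] = value


g1 = ImmutableGroup()
g2 = g1.with_attr("units", "m")  # the caller receives the new instance

attrs = MutableAttrs()
attrs["units"] = "m"  # no way to hand back a replacement object here
```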

Add refresh_attributes() and implement cache_attrs for Group, Array

16675fa

Should resolve zarr-developers#3178

github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jul 8, 2025

d-v-b reviewed Jul 8, 2025


Merge branch 'main' into 3178-group-attribute-cache

075a330
