can we reject dtype inference for numpy object arrays #3077

@d-v-b

Description

numpy arrays with dtype "O" are ambiguous, in the sense that they could contain values that zarr should store as:

  • variable-length strings
  • variable-length arrays
  • arbitrary python objects
  • etc
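
To make the ambiguity concrete, nothing zarr-specific is needed to see the problem: arrays with completely different payloads all report the same dtype, so the dtype alone cannot determine a zarr data type.

```python
import numpy as np

# Three arrays with the same dtype but completely different contents.
strings = np.array(["a", "bb", "ccc"], dtype=object)
ragged = np.array([np.arange(2), np.arange(5)], dtype=object)
mixed = np.array([1, "two", {"three": 3}], dtype=object)

print(strings.dtype, ragged.dtype, mixed.dtype)  # object object object
```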

Unlike the object dtype, every other numpy dtype has a simple mapping to a zarr metadata representation. For these dtypes (e.g., int8, int16, etc), a user can provide a numpy array and we can automatically pick the right zarr data type representation from that array. But for the object dtype, this is not possible. Extra information is needed to resolve a zarr data type for object dtype arrays.
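
As a minimal sketch of the asymmetry (illustrative only, not zarr's actual inference code; the name to_zarr_dtype_name is made up here): every fixed dtype resolves on its own, while "O" has no unique answer.

```python
import numpy as np

def to_zarr_dtype_name(dtype: np.dtype) -> str | None:
    """Illustrative sketch: every dtype except 'O' resolves by itself."""
    if dtype == np.dtype("O"):
        return None  # ambiguous: extra information is required
    return dtype.name  # e.g. 'int8', 'float64'

assert to_zarr_dtype_name(np.dtype("int16")) == "int16"
assert to_zarr_dtype_name(np.dtype("O")) is None
```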

In zarr-python 2, we used an optional object_codec keyword argument in the array creation routines. If a user provided dtype=np.dtype('O') (or equivalent) without an object_codec, zarr-python 2 would raise an error.
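
For reference, the zarr-python 2 pattern looked roughly like this (VLenUTF8 is one of the object codecs shipped by numcodecs; the exact error type and message below are from memory and may differ by version):

```python
import numcodecs
import zarr  # zarr-python 2.x

# The object_codec tells zarr how the Python objects are encoded.
z = zarr.empty(10, dtype=object, object_codec=numcodecs.VLenUTF8())
z[0] = "hello"

# Without an object_codec, creation is rejected.
try:
    zarr.empty(10, dtype=object)
except ValueError as e:
    print(e)
```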

I don't want to use this exact pattern today: object_codec is not well-defined, and an extra parameter used only for numpy object dtypes would greatly complicate dtype inference for every other dtype. Here is my alternative proposal: we refuse to do any dtype inference for numpy object dtypes. Instead, the user must provide an explicit zarr dtype that is compatible with the numpy object dtype.

e.g.:
create_array(..., dtype=np.dtype('O')) would raise an informative exception, guiding the user to do this instead:
create_array(..., dtype=zarr.dtypes.VariableLengthString()), or create_array(..., dtype='numpy.variable_length_string')
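
A sketch of what that guard might look like at the top of create_array (a hypothetical helper; the message and the zarr.dtypes.VariableLengthString / 'numpy.variable_length_string' spellings are just the ones from the example above):

```python
import numpy as np

def reject_object_dtype(dtype: object) -> None:
    """Hypothetical guard: refuse to infer a zarr dtype from numpy 'O'."""
    if isinstance(dtype, np.dtype) and dtype == np.dtype("O"):
        raise TypeError(
            "numpy object dtypes are ambiguous, and zarr cannot infer a "
            "data type from them. Pass an explicit zarr dtype instead, "
            "e.g. zarr.dtypes.VariableLengthString() or "
            "'numpy.variable_length_string'."
        )
```

Explicit zarr dtypes and string identifiers would pass through this check untouched and be resolved by the normal dtype machinery.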

Thoughts on this pattern?
