numpy arrays with dtype `"O"` are ambiguous, in the sense that they could contain values that zarr should store as:
- variable-length strings
- variable-length arrays
- arbitrary python objects
- etc
Unlike the object dtype, every other numpy dtype has a simple mapping to a zarr metadata representation. For these dtypes (e.g., `int8`, `int16`, etc.), a user can provide a numpy array and we can automatically pick the right zarr data type representation from that array. But for the object dtype, this is not possible. Extra information is needed to resolve a zarr data type for object dtype arrays.
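To make the ambiguity concrete, here is a minimal illustration (plain numpy, no zarr required):

```python
import numpy as np

# Three arrays that numpy reports identically as dtype "O", even though
# zarr should store their contents very differently:
strings = np.array(["a", "bb", "ccc"], dtype=object)           # variable-length strings
ragged = np.array([np.arange(2), np.arange(3)], dtype=object)  # variable-length arrays
misc = np.array([{"k": 1}, (2, 3)], dtype=object)              # arbitrary python objects

# The dtype alone cannot distinguish them:
assert strings.dtype == ragged.dtype == misc.dtype == np.dtype("O")
```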
In zarr-python 2, we used an optional `object_codec` keyword argument to array creation routines. If a user provided `dtype=np.dtype('O')` or equivalent without an `object_codec`, then zarr-python 2 would raise an error.
I don't want to use this exact pattern today, because `object_codec` is not well-defined, and this extra parameter, used only for numpy object dtypes, would greatly complicate dtype inference for all the other dtypes. Here is my alternative proposal: we refuse to do any dtype inference for numpy object dtypes. Instead, the user must provide an explicit zarr dtype that is compatible with the numpy object dtype.
For example, `create_array(...., dtype=np.dtype('O'))` would raise an informative exception, guiding the user to do this instead: `create_array(..., dtype=zarr.dtypes.VariableLengthString())`, or `create_array(..., dtype='numpy.variable_length_string')`.
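The refusal could be sketched roughly like this (the `resolve_dtype` function and `ObjectDtypeError` exception are hypothetical names for illustration, not the actual zarr-python API):

```python
import numpy as np

class ObjectDtypeError(TypeError):
    """Hypothetical error raised for ambiguous object-dtype input."""

def resolve_dtype(dtype):
    """Hypothetical sketch: map a user-supplied dtype to a zarr dtype name.

    Strings are assumed to already be explicit zarr dtype identifiers;
    numpy dtypes are inferred, except for the ambiguous object dtype.
    """
    if isinstance(dtype, str):
        return dtype
    np_dtype = np.dtype(dtype)
    if np_dtype == np.dtype("O"):
        raise ObjectDtypeError(
            "numpy object dtypes are ambiguous: pass an explicit zarr dtype "
            "(e.g. 'numpy.variable_length_string') instead of dtype('O')."
        )
    return np_dtype.name

# Ordinary dtypes are inferred automatically...
assert resolve_dtype(np.int16) == "int16"
# ...but object dtypes require an explicit choice from the user.
```

The point of the sketch is that the error path carries the guidance: the exception message tells the user exactly which explicit zarr dtypes resolve the ambiguity.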
Thoughts on this pattern?