Want an option to ignore variables of unsupported data type when opening zarr files #2465

Open
amberjungminlee opened this issue Jul 27, 2022 · 6 comments

Comments

@amberjungminlee

Our team is working with earth science zarr data in which some variable metadata is stored as a string object type. These variables contain strings such as source URLs that correspond to each chunk. We are aware that, according to the documentation, string types are not supported as variables. This is fine. Because these fields are just additional metadata for internal purposes, they are not necessary for our use of netCDF4. We would like the netCDF4 code to either ignore the string variables, emit a warning that string variables are not supported, or simply offer limited functionality for string variables. We just don't want the code to break when we open the zarr file.
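
To see which arrays in a store would hit this problem, the store metadata can be inspected directly. This is a minimal sketch, not part of netCDF or zarr. It assumes a Zarr v2 directory store where every array directory holds a `.zarray` JSON file with a `dtype` field, and it assumes variable-length string variables are recorded with the object dtype `|O`. The function name `find_object_arrays` is hypothetical.

```python
import json
import os


def find_object_arrays(store_path):
    """Return sorted relative paths of arrays whose dtype is the object type '|O'."""
    flagged = []
    for dirpath, _dirnames, filenames in os.walk(store_path):
        if ".zarray" not in filenames:
            continue  # not an array directory
        with open(os.path.join(dirpath, ".zarray"), encoding="utf-8") as fh:
            meta = json.load(fh)
        # Assumption: Zarr v2 records object (variable-length string) arrays as "|O".
        if meta.get("dtype") == "|O":
            flagged.append(os.path.relpath(dirpath, store_path))
    return sorted(flagged)
```

Calling `find_object_arrays("path/to/store.zarr")` on an unzipped store lists the candidate variables, which could then be skipped or reported with a warning by the caller.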

@dopplershift
Member

Out of curiosity, why store these URLs (which I assume have no dimensionality) as variables instead of global attributes?

@DennisHeimbigner
Collaborator

Any chance you could send me one of those files, either in zip format
or as a tar'd directory?
Also, I have been working at a low level on adding fixed-size string support to nczarr.
Are you willing to act as a test case for it?

@amberjungminlee
Author

These URLs do have dimensionality. They correspond to each time chunk and contain source information from where the individual file was downloaded.

And yes, @DennisHeimbigner, we would be open to being a test case for string support.

Here is the file that has the issue. It has a bogus URL for now, but it has the same dimensions as the time variable.

In case you are interested in replicating the issue, the error that I get when opening this file is "Assertion failed: (type && type->format_type_info != NULL), function zclose_type, file zclose.c, line 228."

generated.zip

@dopplershift
Member

@amberjungminlee Ah, that makes sense then.

DennisHeimbigner added a commit to DennisHeimbigner/netcdf-c that referenced this issue Aug 1, 2022
re: Issue Unidata#2465
re: Issue Unidata#2259

[Note: It also tangentially affects PR Unidata#2466 since this PR requires that PR to be merged before this one and actually includes that PR here.]

The primary issue to be addressed is to provide a way for users to
specify the size of the fixed-length strings. This is handled by providing
the following new special attributes:
1. **_nczarr_default_maxstrlen**:
This is an attribute of the root group. It specifies the default
maximum string length for string types. If not specified, then
it has the value of 64 characters.
2. **_nczarr_maxstrlen**:
This is a per-variable attribute. It specifies the maximum
string length for the string type associated with the variable.
If not specified, then it is assigned the value of
**_nczarr_default_maxstrlen**.
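
The fallback order described by these two attributes can be sketched as follows. This is only an illustration in Python, not the netcdf-c implementation. The function name `resolve_maxstrlen` and the use of plain dictionaries for the attribute sets are assumptions.

```python
DEFAULT_MAXSTRLEN = 64  # value used when the root group sets no default


def resolve_maxstrlen(root_attrs, var_attrs):
    """Pick the maximum string length that applies to one variable."""
    # The per-variable attribute takes precedence when present.
    if "_nczarr_maxstrlen" in var_attrs:
        return int(var_attrs["_nczarr_maxstrlen"])
    # Otherwise use the root-group default, or 64 if that is also absent.
    return int(root_attrs.get("_nczarr_default_maxstrlen", DEFAULT_MAXSTRLEN))
```

For example, a variable with no `_nczarr_maxstrlen` in a file whose root group sets `_nczarr_default_maxstrlen` to 128 would resolve to 128 under this sketch.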

This PR also requires some hacking to handle the existing netcdf-c NC_CHAR
type, which does not exist in zarr. The goal was to choose numpy types for
both the netcdf-c NC_STRING type and the netcdf-c NC_CHAR type such that
if a pure zarr implementation read them, it would still work and an
NC_CHAR type would be handled by zarr as a string of length 1.

For writing variables and NCZarr attributes, the type mapping is as follows:
* "|S1" for NC_CHAR.
* ">S1" for NC_STRING && MAXSTRLEN==1
* ">Sn" for NC_STRING && MAXSTRLEN==n

Note that it is a bit of a hack to use endianness, but it should be ok since for
string/char, the endianness has no meaning.
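
The write-side mapping above can be expressed as a small lookup. This is a sketch in Python rather than the C code of the PR. The function names are hypothetical, and the inverse direction is an assumption inferred from the mapping; the commit message does not spell it out.

```python
def nczarr_dtype(nc_type, maxstrlen=1):
    """Map a netcdf-c type name to the numpy dtype string used when writing."""
    if nc_type == "NC_CHAR":
        return "|S1"
    if nc_type == "NC_STRING":
        if maxstrlen < 1:
            raise ValueError("maxstrlen must be at least 1")
        # The '>' marker is what distinguishes NC_STRING from NC_CHAR.
        return f">S{maxstrlen}"
    raise ValueError(f"no string mapping for {nc_type}")


def nc_type_from_dtype(dtype):
    """Assumed inverse: recover (type name, string length) from a dtype string."""
    if dtype == "|S1":
        return ("NC_CHAR", 1)
    if dtype.startswith(">S") and dtype[2:].isdigit():
        return ("NC_STRING", int(dtype[2:]))
    raise ValueError(f"not a fixed-size string dtype: {dtype}")
```

Note that `|S1` and `>S1` describe the same bytes to numpy, so only the byte-order character separates a one-character NC_STRING from an NC_CHAR in this scheme.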

For reading attributes with pure zarr (i.e. with no nczarr
attribute types defined), they will always be interpreted as of
type NC_CHAR.

## Misc. Other Changes
1. Convert the nczarr special attributes and keys to be all lower case. So "_NCZARR_ATTR" is now "_nczarr_attr". Backward compatibility with the upper case names is supported.
2. Clean up my too-clever-by-half handling of scalars in libnczarr.

@DennisHeimbigner
Collaborator

This PR (#2467) is an experimental draft PR that attempts to add Zarr/Numpy fixed-size string support to NCZarr.

@DennisHeimbigner
Collaborator

Fixed by #2492


3 participants