ShuffleFilter fails to round trip #172

Open
nhz2 opened this issue Dec 21, 2024 · 3 comments

Comments

@nhz2
Member

nhz2 commented Dec 21, 2024

julia> using Zarr

julia> codec = Zarr.ShuffleFilter(elementsize=4)
Zarr.ShuffleFilter(0x0000000000000004)

julia> Zarr.zdecode(Zarr.zencode(UInt8[0x05], codec), codec)
1-element Vector{UInt8}:
 0xe0

The round trip should return 0x05 but returns 0xe0. From what I can tell, the shuffle filter is missing the "Add leftover to the end of data" step from https://github.com/HDFGroup/hdf5/blob/f2642985d8c23ff7e876c6228c7cc0cf20515923/src/H5Zshuffle.c#L279-L284

@mkitti am I reading that HDF5 code correctly, and do you know if appending leftover data at the end after the shuffle is a standard thing to do? I can't find a place where this is documented.

@mkitti
Member

mkitti commented Dec 21, 2024

Shuffling under Zarr should error if the input array byte count is not a multiple of the element size.

https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/shuffle.py
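
To make the strict behavior described above concrete, here is a minimal Python sketch of a byte shuffle that rejects input whose length is not a multiple of the element size. The function name `strict_shuffle` and the choice of `ValueError` are assumptions for illustration; this is not the numcodecs or Zarr.jl implementation, and it makes no claim about what those libraries currently do.

```python
def strict_shuffle(data: bytes, elementsize: int) -> bytes:
    """Byte shuffle that refuses inputs with a partial trailing element.

    Hypothetical helper for illustration only.
    """
    if elementsize < 1:
        raise ValueError("elementsize must be at least 1")
    if len(data) % elementsize != 0:
        raise ValueError(
            f"input length {len(data)} is not a multiple of "
            f"element size {elementsize}"
        )
    # Gather byte 0 of every element, then byte 1 of every element, and so on.
    return b"".join(data[j::elementsize] for j in range(elementsize))
```

With this sketch, the one-byte input from the report (`b"\x05"` with element size 4) is rejected up front instead of producing an output that cannot be decoded.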

HDF5 filter implementations should not be assumed to be compatible with their Zarr counterparts.

Additionally, Zarr v2 codecs and Zarr v3 codecs may have subtly distinct behavior and defaults.

@nhz2
Member Author

nhz2 commented Dec 22, 2024

Interesting. I think the shuffle filter was originally meant to be compatible with HDF5 (ref: fsspec/kerchunk#11), but they took the implementation from https://github.com/HDFGroup/hsds/blob/03890edfa735cc77da3bc06f6cf5de5bd40d1e23/hsds/util/storUtil.py#L43

@nhz2
Member Author

nhz2 commented Dec 30, 2024

I've tested in nhz2/ChunkCodecs.jl#6 that HDF5 copies the leftover bytes unchanged to the end of the output if the data length is not evenly divisible by the element size.
For example, "12312312312345" with element size 3 gets byte-shuffled to "11112222333345".
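
The behavior described above can be sketched in Python as follows. This is a minimal illustration of a shuffle that copies the trailing partial element through unchanged, not the HDF5 or Zarr.jl code; the names `shuffle` and `unshuffle` are made up for this example.

```python
def shuffle(data: bytes, elementsize: int) -> bytes:
    """Byte shuffle; a trailing partial element is appended unchanged."""
    if elementsize < 1:
        raise ValueError("elementsize must be at least 1")
    n = len(data) // elementsize      # number of whole elements
    body = n * elementsize            # bytes covered by whole elements
    out = bytearray(len(data))
    for j in range(elementsize):
        # Byte j of every whole element goes into the j-th block of n bytes.
        out[j * n:(j + 1) * n] = data[j:body:elementsize]
    out[body:] = data[body:]          # add leftover to the end
    return bytes(out)


def unshuffle(data: bytes, elementsize: int) -> bytes:
    """Inverse of shuffle, including the unchanged leftover tail."""
    if elementsize < 1:
        raise ValueError("elementsize must be at least 1")
    n = len(data) // elementsize
    body = n * elementsize
    out = bytearray(len(data))
    for j in range(elementsize):
        # Scatter the j-th block back to byte j of every whole element.
        out[j:body:elementsize] = data[j * n:(j + 1) * n]
    out[body:] = data[body:]
    return bytes(out)
```

Under this sketch, `shuffle(b"12312312312345", 3)` gives `b"11112222333345"`, and a single byte with element size 4 passes through unchanged, so the round trip from the original report would return `0x05`.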
