Support reading multi-member gzip files or providing access to remaining data #102
Comments
I'll preface by saying that this is a very niche use case, so even if it is easy to implement I'll have to weigh the bundle size costs to see if it's worth adding. That being said, this might be possible to add to the streaming API, i.e.
Great, thanks for taking a look! If there's at least a way to get the amount of data consumed after the first member (I couldn't find that in the current state object), the rest could even be implemented on my end. Besides my use case, it looks like this has come up for other use cases, at least in pako (e.g. bgzip, nodeca/pako#139).
That's judicious. I'd like to elaborate on another use case, to outline this feature's potential value. In addition to the original description's use case for WARC, as mentioned above, this feature would help efficiently implement bgzip in JavaScript. That's an important algorithm for biology and medicine. Enabling bgzip via fflate would improve speed and maintainability for bioinformatics applications. For example, bgzip allows genome visualization packages like JBrowse and igv.js to quickly load segments of the human genome or other genomic data, helping scientists assess DNA samples relating to cancer and other genetic diseases. Currently for those use cases, genomics JS packages depend on Support for reading multi-member gzip files or providing access to remaining data would presumably enable packages like bgzf-filehandle or others to build atop fflate instead of pako. So, for a range of biological use cases, that'd save scientists time on every page load by having their browsers parse less JS, and speed up development by requiring bioinformatics engineers to only know fflate and not also pako for JavaScript that deals with compression and decompression. I hope that helps explain some additional value in making
I had forgotten about this issue but what you've proposed does seem compelling. I'll look into the
OK, after looking at the requirements for this, it's actually not too difficult. Support for GZIP extra fields will need to be added on the compression side if you want to create @ikreymer if you still need this for WARC, could you explain why exactly you need access to the byte offsets of each new member? At the moment I'm thinking of simply allowing you to push after the final block. Also @eweitz random access into the GZIP from a
This is needed to create an index of the records in the WARC file, which is kept in a separate file or data structure. The index is created once by reading the entire WARC, but after that, the WARC is typically accessed via random access: seeking to a single member and inflating just that (e.g. by performing an HTTP range request for the data of a single member). To create an index in the browser, I need to be able to get the offset of each member.
Bump for simple multi-member decompression support?
Added support for this and releasing in v0.8.0. The implementation transparently decompresses concatenated GZIP archives (as the
Great! Didn't see the
v0.8.0 published with these changes. Let me know if you find any issues!
What can't you do right now?
The gzip format supports 'multi-member' files, in which several complete gzip streams are concatenated one after another.
This is used in certain formats, such as WARC.
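To make the layout concrete, here is a minimal sketch that builds a two-member file by concatenation and reads it back. It uses Node's built-in `zlib` rather than fflate or pako, purely so the example has no dependencies; the record contents and variable names are made up for illustration.

```javascript
// Node.js built-in zlib; no third-party packages needed.
const zlib = require('zlib');

// Compress two records independently; each result is a complete gzip member.
const memberA = zlib.gzipSync(Buffer.from('first record\n'));
const memberB = zlib.gzipSync(Buffer.from('second record\n'));

// A multi-member file is just the members placed back to back.
const multiMember = Buffer.concat([memberA, memberB]);

// Node's gunzip continues into the next member when another gzip header follows.
const allText = zlib.gunzipSync(multiMember).toString('utf8');

// The second member starts exactly where the first one ends,
// so it can also be decompressed on its own from that offset.
const secondMemberOffset = memberA.length;
const secondText = zlib
  .gunzipSync(multiMember.subarray(secondMemberOffset))
  .toString('utf8');

console.log(JSON.stringify(allText), JSON.stringify(secondText));
```

Here the offset is known only because the file was just written; the harder problem described in this issue is recovering those offsets while reading.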
An optimal solution
An optimal solution would be for fflate's Inflate to provide a way to parse multi-member gzip files via an option,
plus an additional callback that fires when a new member starts (and reports the offset of that member in the stream).
Another option is to expose how many bytes of the input buffer were consumed while reading a gzip member, allowing the developer to manually create a new Inflate object for the remaining data.
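As a rough sketch of the second option (using the consumed-byte count to find each member), the following walks a multi-member buffer and records every member's offset and length. It is not fflate's API: it uses Node's built-in `zlib`, and it assumes that the `info: true` option returns the engine and that `engine.bytesWritten` reflects the input bytes consumed. `gzipHeaderLength` and `listGzipMembers` are hypothetical helper names.

```javascript
const zlib = require('zlib');

// Length of the gzip header that starts at `pos` (field layout per RFC 1952).
function gzipHeaderLength(buf, pos) {
  if (buf[pos] !== 0x1f || buf[pos + 1] !== 0x8b) {
    throw new Error(`no gzip member at offset ${pos}`);
  }
  const flags = buf[pos + 3];
  let p = pos + 10; // fixed part: magic, method, flags, mtime, xfl, os
  if (flags & 0x04) p += 2 + buf.readUInt16LE(p); // FEXTRA: 2-byte length + payload
  if (flags & 0x08) p = buf.indexOf(0, p) + 1; // FNAME: zero-terminated string
  if (flags & 0x10) p = buf.indexOf(0, p) + 1; // FCOMMENT: zero-terminated string
  if (flags & 0x02) p += 2; // FHCRC: 2-byte header checksum
  return p - pos;
}

// Walk the buffer one member at a time, collecting offset, length and data.
function listGzipMembers(buf) {
  const members = [];
  let offset = 0;
  while (offset < buf.length) {
    const headerLen = gzipHeaderLength(buf, offset);
    // Raw inflate stops at the end of this member's deflate stream.
    // Assumption: engine.bytesWritten is the number of input bytes consumed.
    const { buffer, engine } = zlib.inflateRawSync(
      buf.subarray(offset + headerLen),
      { info: true }
    );
    const length = headerLen + engine.bytesWritten + 8; // 8 = CRC32 + ISIZE trailer
    members.push({ offset, length, data: buffer });
    offset += length;
  }
  return members;
}

// Usage: two independently compressed records, concatenated.
const first = zlib.gzipSync(Buffer.from('alpha'));
const second = zlib.gzipSync(Buffer.from('beta'));
const members = listGzipMembers(Buffer.concat([first, second]));
for (const m of members) {
  console.log(m.offset, m.length, m.data.toString('utf8'));
}
```

The sketch skips CRC verification and works on a whole in-memory buffer; a streaming version would need to carry leftover input between chunks.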
(How) is this done by other libraries?
pako provides an avail_in counter, which keeps track of how many bytes have not yet been consumed. One approach I've used is something like this:
https://github.com/webrecorder/warcio.js/blob/main/src/readers.js#L282 (though this is with an earlier version of pako).
The latest version of pako may try to read multi-member gzip files as a single buffer, though in my tests this doesn't always work.
A key requirement of my use case is being able to get the offset of the beginning of each member, and to flush the data buffer at the end of each member.
Ideally, there would be a callback, onnewgzipmember, that fires when a new member starts and reports the offset of that member.
Any ondata callbacks after onnewgzipmember are assumed to carry data from that gzip member, and ondata always flushes when a member boundary is reached.