
[Feature Request]: Compressing large folders, for example compressing the entire D: drive #223

Closed
yuzhengjun1204 opened this issue Jun 28, 2024 · 19 comments

@yuzhengjun1204

Feature description

When compressing a very large folder, calling the compress function can take a long time, and it exits as soon as an error occurs. Is it possible to divide a large folder into n smaller subsets of files and call the compress function on each one individually?


@rikyoz (Owner) commented Jun 28, 2024

Hi!

> Is it possible to divide a large folder into n smaller subsets of files and call the compress function on each one individually?

Bit7z cannot automatically separate the files into subsets for you.
If you manage to do it in your code, one possible way to achieve what you need is to create an initial archive from a known subset of the files (e.g., via BitArchiveWriter or BitFileCompressor), and then incrementally update the newly created archive with the remaining known subsets of files (e.g., via BitArchiveEditor, or using BitFileCompressor with setUpdateMode(UpdateMode::Append)).
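
For illustration, here is a minimal sketch of that approach, assuming the files have already been split into subsets (the subsets variable and the splitIntoSubsets helper are hypothetical placeholders, not part of bit7z):

// Minimal sketch, not a definitive implementation: create the archive
// from the first subset, then append the remaining subsets.
Bit7zLibrary lib{ "7z.dll" };
BitFileCompressor compressor{ lib, BitFormat::Zip };
compressor.setCompressionLevel( BitCompressionLevel::Normal );

// Hypothetical helper returning the pre-computed subsets of file paths.
std::vector< std::vector< std::string > > subsets = splitIntoSubsets();

// Create the initial archive from the first subset
// (assuming "output.zip" does not exist yet).
compressor.compress( subsets.front(), "output.zip" );

// Incrementally append each remaining subset to the archive.
compressor.setUpdateMode( UpdateMode::Append );
for ( std::size_t i = 1; i < subsets.size(); ++i ) {
    compressor.compress( subsets[ i ], "output.zip" );
}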

@yuzhengjun1204 (Author)

Thank you for your response, I will give it a try.

@yuzhengjun1204 (Author)


When I split a large folder into individual files and process them one at a time, I find the speed to be very slow. Is this normal, and what can I do to improve the compression speed? My code looks like this:

Bit7zLibrary lib{"7z.dll"};
BitFileCompressor compressor{lib, BitFormat::Zip};
// .....
for(auto& file_info: file_list){
    compressor.setCompressionLevel(BitCompressionLevel::Normal);
    std::map<std::string, std::string> temp_file_map;
    temp_file_map[file_info.absolute_file_path_] = file_info.relative_file_path_;
    compressor.compress(temp_file_map, dst_file_path_.toStdString());
    compressor.setOverwriteMode(OverwriteMode::None);
    compressor.setUpdateMode(UpdateMode::Append);
}

@yuzhengjun1204 (Author)

Compressing a package of about 15 GB takes about 5 minutes using this method, while it might take 50 minutes using the method above. Here, file_list only includes a single directory (about 15 GB).

BitFileCompressor compressor{lib, BitFormat::Zip};
std::vector<std::string> file_list;
// .....
compressor.setOverwriteMode(OverwriteMode::Overwrite);
compressor.setCompressionLevel(BitCompressionLevel::Normal);
compressor.compress(file_list, dst_file_path);

@rikyoz (Owner) commented Jul 1, 2024

A few notes first:

  • You can call setCompressionLevel, setOverwriteMode and setUpdateMode once outside the loop, as the values they set will be used for all compression operations.
    • Unless, of course, you need to change the settings for each compression operation, which doesn't seem to be the case from your code.
  • The default overwrite mode of BitFileCompressor is already None, so you don't need to set it.

As for the performance of the code:

  • I guess that file_info refers to single files, not directories. If this is the case, you can use compressor.compressFile(file_info.absolute_file_path_, dst_file_path_.toStdString(), file_info.relative_file_path_).
    • This has the advantage of not having to allocate a std::map on each iteration.
    • The allocation of the map is probably not the bottleneck here, though, so I don't expect any significant improvement.
  • Each compression operation that needs to append files to an existing archive is increasingly expensive in terms of performance:
    • Updating an archive basically works by creating a new temporary archive, copying the data from the original archive, appending the newly compressed data, and then overwriting the original archive with the temporary one.
    • This is unavoidable due to how 7-Zip's interface works (and 7-Zip GUI/CLI does the same, as far as I know).

So yes, the slowness is normal, especially for such a large archive and because you're doing a lot of single file append compressions.
The only way to improve performance is to reduce the number of compressions, which is exactly what you have done here:

> Compressing a package of about 15 GB takes about 5 minutes using this method, while it might take 50 minutes using the method above. Here, file_list only includes a single directory (about 15 GB).
>
> BitFileCompressor compressor{lib, BitFormat::Zip};
> std::vector<std::string> file_list;
> // .....
> compressor.setOverwriteMode(OverwriteMode::Overwrite);
> compressor.setCompressionLevel(BitCompressionLevel::Normal);
> compressor.compress(file_list, dst_file_path);

In my first reply, I suggested the "update" approach because it was the only way to achieve what you originally described in this issue.
But I don't think it's the best approach, performance-wise, as you also noted.

Ideally, only one compression should be performed.
If this is not possible, the fewer compressions the better.
I would avoid doing single file updates, but rather compress "clusters" of files together.
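
A rough sketch of what I mean, assuming file_list and compressor are set up as in your code (the cluster size of 1000 is an arbitrary placeholder):

// Rough sketch: compress clusters of files instead of single files.
// Requires <algorithm> for std::min; cluster_size is arbitrary.
constexpr std::size_t cluster_size = 1000;
compressor.setCompressionLevel( BitCompressionLevel::Normal );

for ( std::size_t i = 0; i < file_list.size(); i += cluster_size ) {
    const std::size_t last = std::min( i + cluster_size, file_list.size() );
    const std::vector< std::string > cluster( file_list.begin() + i,
                                              file_list.begin() + last );
    compressor.compress( cluster, dst_file_path );
    // From the second cluster onward, append to the existing archive.
    compressor.setUpdateMode( UpdateMode::Append );
}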

In the issue, you mentioned that your program terminated because of an error during the compression process. This could be due to an underlying problem with the files being compressed, so I would check that before taking the "update" approach.

@yuzhengjun1204 (Author)

When compressing, if there is an exception, an error is thrown and the compression stops. Is there a way to report the error but not stop compressing? I hope the compression can continue and skip the erroneous files.

@yuzhengjun1204 (Author)

When compressing a particularly large file, about 200 GB, a stack overflow may occur, causing an abnormal exit. Are there any good ways to handle this?

BitFileCompressor compressor{lib, BitFormat::Zip};
std::vector<std::string> file_list;
// ...
compressor.setCompressionLevel(BitCompressionLevel::Normal);
compressor.compress(file_list, dst_file_path_.toStdString());


@rikyoz (Owner) commented Jul 2, 2024

> When compressing, if there is an exception, an error is thrown and the compression stops. Is there a way to report the error but not stop compressing? I hope the compression can continue and skip the erroneous files.

Unfortunately, no. But I think it could be a useful feature, so I'll definitely add it to the library.

Here is what you can do for now:

  • Try the whole compression once, check which files are failing (BitException provides a failedFiles() method which can help), and retry the whole compression without those files (see the sketch after this list); or
  • Use incremental compression of clusters of files, as explained above; or
  • If the files you need to compress are always the same, you could check the reason why they cannot be compressed and avoid compressing them preemptively.
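
A rough sketch of the first option, assuming failedFiles() returns a list of path/error-code pairs and that file_list, compressor, and dst_file_path are set up as in your code:

// Sketch: try once, drop the failing paths reported by the exception,
// then retry without them. Requires <algorithm> for std::remove.
try {
    compressor.compress( file_list, dst_file_path );
} catch ( const bit7z::BitException& ex ) {
    for ( const auto& failed : ex.failedFiles() ) {
        // failed.first is the file path, failed.second its error code.
        file_list.erase( std::remove( file_list.begin(), file_list.end(), failed.first ),
                         file_list.end() );
    }
    compressor.compress( file_list, dst_file_path ); // retry without the failing files
}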

> When compressing a particularly large file, about 200 GB, a stack overflow may occur, causing an abnormal exit. Are there any good ways to handle this?

Um, that's weird, I'll try to replicate the problem.
What OS, compiler and configuration flags are you using?
Is the 7-Zip GUI/CLI successfully compressing the same file or is it failing?

@yuzhengjun1204 (Author)

The 7-Zip GUI works fine. I can catch the stack overflow exception with try/catch(...), and GetLastError() returns 5; this is reliably reproducible in my tests. I would like to avoid the stack overflow exceptions.
7-Zip: 24.05
OS: Win10, MSVC 2019, x86.
Flags: BIT7Z_AUTO_FORMAT, BIT7Z_REGEX_MATCHING, BIT7Z_AUTO_PREFIX_LONG_PATHS, BIT7Z_PATH_SANITIZATION

@rikyoz (Owner) commented Jul 4, 2024

Thanks for the further details.

I've already tried a similar configuration: MSVC 2022, x86, same flags, 7-Zip 23.01, single ~110GB file to be compressed as a zip file.
In my tests, I could not reproduce the issue, and the compression finished flawlessly.
I'll try to use a configuration closer to yours and see if I can reproduce the issue.

@yuzhengjun1204 (Author) commented Jul 5, 2024


Test code. After running compressor.compress, the process crashed.

#include <iostream>
#include "include/bit7z/bitextractor.hpp"
#include "include/bit7z/bitcompressor.hpp"
#include "include/bit7z/bitfilecompressor.hpp"
#include "include/bit7z/bitexception.hpp"
#include "include/bit7z/bitfileextractor.hpp"
using namespace bit7z;
void compressfile() {
	try { // bit7z classes can throw BitException objects
		using namespace bit7z;
		Bit7zLibrary lib{ "7z.dll" };
		BitFileCompressor compressor{ lib, BitFormat::Zip };
		compressor.setProgressCallback([&](uint64_t in_size) -> bool {
			std::cout << "setProgressCallback: " << in_size << std::endl;
			return true;
			});
		compressor.setFileCallback([&](std::string file_path) {
			std::cout << "filecallback" << file_path << std::endl;
			});
		std::vector<std::string> file_list;
		file_list.push_back("d:\\");
		compressor.compress(file_list, "d://d.zip");
	}
	catch (const bit7z::BitException& ex) {
		std::cout << "Hello World!" << ex.code() << ex.what();
	}
}
int main()
{
	std::cout << "Hello World!\n";
	compressfile();
}


@yuzhengjun1204 (Author) commented Jul 5, 2024

This code catches error:3.
Debug output:

bit7z_test.exe (Win32): Loaded "C:\Windows\System32\DriverStore\FileRepository\iigd_dch.inf_amd64_4be767c332df1d04\igdumdim32.dll".
Exception at 0x773C590F (ntdll.dll) in bit7z_test.exe: 0xC00000FD: Stack overflow (parameters: 0x00000001, 0x01042FD0).

void compressfile() {
	try { // bit7z classes can throw BitException objects
		using namespace bit7z;
		Bit7zLibrary lib{ "7z.dll" };
		BitFileCompressor compressor{ lib, BitFormat::Zip };
		compressor.setProgressCallback([&](uint64_t in_size) -> bool {
			std::cout << "setProgressCallback: " << in_size << std::endl;
			return true;
			});
		compressor.setFileCallback([&](std::string file_path) {
			std::cout << "filecallback" << file_path << std::endl;
			});
		std::vector<std::string> file_list;
		file_list.push_back("d:\\");
		compressor.compress(file_list, "d://d.zip");
	}
	catch (const bit7z::BitException& ex) {
		std::cout << "error:1" << ex.code() << ex.what();
	}
	catch (const std::exception& e) {
		std::cout << "error:2" << e.what();
	}
	catch (...) {
		std::cout << "error:3:";
	}
}

@yuzhengjun1204 (Author)

It seems that the issue is caused by the recursive call in this code.

void FilesystemIndexer::listDirectoryItems( vector< unique_ptr< GenericInputItem > >& result,
                                            bool recursive,
                                            const fs::path& prefix ) {
    fs::path path = mDirItem.filesystemPath();
    if ( !prefix.empty() ) {
        path = path / prefix;
    }
    const bool includeRootPath = mFilter.empty() ||
                                 !mDirItem.filesystemPath().has_parent_path() ||
                                 mDirItem.inArchivePath().filename() != mDirItem.filesystemName();
    const bool shouldIncludeMatchedItems = mPolicy == FilterPolicy::Include;
    std::error_code error;
    for ( const auto& currentEntry : fs::directory_iterator( path, error ) ) {
        auto searchPath = includeRootPath ? mDirItem.inArchivePath() : fs::path{};
        if ( !prefix.empty() ) {
            searchPath = searchPath.empty() ? prefix : searchPath / prefix;
        }

        const FilesystemItem currentItem{ currentEntry, searchPath, mSymlinkPolicy };
        /* An item matches if:
         *  - Its name matches the wildcard pattern, and
         *  - Either is a file, or we are interested also to include folders in the index.
         *
         * Note: The boolean expression uses short-circuiting to optimize the evaluation. */
        const bool itemMatches = ( !mOnlyFiles || !currentItem.isDir() ) &&
                                 fsutil::wildcard_match( mFilter, currentItem.name() );
        if ( itemMatches == shouldIncludeMatchedItems ) {
            result.emplace_back( std::make_unique< FilesystemItem >( currentItem ) );
        }

        if ( currentItem.isDir() && ( recursive || ( itemMatches == shouldIncludeMatchedItems ) ) ) {
            //currentItem is a directory, and we must list it only if:
            // > indexing is done recursively
            // > indexing is not recursive, but the directory name matched the filter.
            const fs::path nextDir = prefix.empty() ?
                                     currentItem.filesystemName() : prefix / currentItem.filesystemName();
            listDirectoryItems( result, true, nextDir );
        }
    }
}

@rikyoz (Owner) commented Jul 5, 2024

Ah, I see!
I was investigating a file size issue, but the problem is actually caused by trying to recursively index a directory structure that is too deeply nested.
I'll try rewriting the indexing code to not use recursion.
Thanks for the additional details, they really helped!

rikyoz added the 🐞 bug label Jul 5, 2024
rikyoz added a commit that referenced this issue Jul 7, 2024
Fixes stack overflow issues when indexing deeply nested directory structures (close issue #223)
rikyoz added a commit that referenced this issue Jul 10, 2024
Fixes stack overflow issues when indexing deeply nested directory structures (close issue #223)
@rikyoz (Owner) commented Jul 11, 2024

Hi!
Just wanted to give you a quick update on this issue.
I have managed to rewrite the indexing algorithm to not use recursion, and from my tests it seems to work well.
I've pushed the change to both the develop branch and the new hotfix/v4.0.8 branch, as I plan to include the fix in the next patch release.
Please let me know if the fix works for you.

@yuzhengjun1204 (Author) commented Jul 11, 2024


I tried it.
When I use version v4.0.8 to compress the D: drive, it throws a std::exception indicating "recursive_directory_iterator::operator++: access denied." However, when I use the following code, it works fine.

void FilesystemIndexer::listDirectoryItems(vector<unique_ptr<GenericInputItem>>& result, bool recursive,
                                           const fs::path& prefix) {
    fs::path path = mDirItem.filesystemPath();
    if (!prefix.empty()) {
        path = path / prefix;
    }
    const bool includeRootPath = mFilter.empty() || !mDirItem.filesystemPath().has_parent_path() ||
                                 mDirItem.inArchivePath().filename() != mDirItem.filesystemName();
    const bool shouldIncludeMatchedItems = mPolicy == FilterPolicy::Include;

    std::queue<fs::path> dirs_to_process;
    dirs_to_process.push(path);

    while (!dirs_to_process.empty()) {
        path = dirs_to_process.front();
        dirs_to_process.pop();

        std::error_code error;
        for (const auto& currentEntry : fs::directory_iterator(path, error)) {
            auto searchPath = includeRootPath ? mDirItem.inArchivePath() : fs::path{};
            if (!prefix.empty()) {
                searchPath = searchPath.empty() ? prefix : searchPath / prefix;
            }

            const FilesystemItem currentItem{currentEntry, searchPath, mSymlinkPolicy};
            const bool itemMatches =
                (!mOnlyFiles || !currentItem.isDir()) && fsutil::wildcard_match(mFilter, currentItem.name());
            if (itemMatches == shouldIncludeMatchedItems) {
                result.emplace_back(std::make_unique<FilesystemItem>(currentItem));
            }

            if (currentItem.isDir() && (recursive || (itemMatches == shouldIncludeMatchedItems))) {
                dirs_to_process.push(currentEntry.path());
            }
        }
    }
}

@rikyoz (Owner) commented Jul 11, 2024

Interesting.
Yeah, I could use a queue as in your code, but I found a possible solution using just std::filesystem: passing fs::directory_options::skip_permission_denied to the constructor of fs::recursive_directory_iterator.
I've just pushed a commit (6b35220) with the fix to the hotfix/v4.0.8 branch.
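
For reference, here is a minimal standalone illustration of that option (not the actual bit7z code; the D:\ path is just an example):

#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

int main() {
    // skip_permission_denied makes the iterator silently skip directories
    // that cannot be entered, instead of failing with an access-denied error.
    for ( const auto& entry : fs::recursive_directory_iterator{
              "D:\\", fs::directory_options::skip_permission_denied } ) {
        std::cout << entry.path().string() << '\n';
    }
}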

@yuzhengjun1204 (Author)

Yes, it works.

@rikyoz (Owner) commented Jul 11, 2024

Perfect, thanks! 🙏

rikyoz closed this as completed Aug 17, 2024