
[Feature Request]: Compressing large folders, for example compressing the entire D: drive #223

Closed
yuzhengjun1204 opened this issue Jun 28, 2024 · 19 comments

@yuzhengjun1204

Feature description

When compressing a very large folder, calling the compress function can take a long time, and it exits as soon as an error occurs. Is it possible to divide a large folder into n smaller subsets of files and call the compress function on each one individually?


@rikyoz (Owner) commented Jun 28, 2024

Hi!

> Is it possible to divide a large folder into n smaller subsets of files and call the compress function on each one individually?

Bit7z cannot automatically separate the files into subsets for you.
If you manage to do it in your code, one possible way to achieve what you need is to create an initial archive from a known subset of the files (e.g., via BitArchiveWriter or BitFileCompressor), and then incrementally update the newly created archive with the remaining known subsets of files (e.g., via BitArchiveEditor, or using BitFileCompressor with setUpdateMode(UpdateMode::Append)).
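
For illustration, here is a minimal sketch of that approach, assuming the files have already been split into subsets (the subsets variable and the splitIntoSubsets helper are hypothetical placeholders, not part of bit7z):

// Minimal sketch, not a definitive implementation: create the archive
// from the first subset, then append the remaining subsets.
Bit7zLibrary lib{ "7z.dll" };
BitFileCompressor compressor{ lib, BitFormat::Zip };
compressor.setCompressionLevel( BitCompressionLevel::Normal );

// Hypothetical helper returning the pre-computed subsets of file paths.
std::vector< std::vector< std::string > > subsets = splitIntoSubsets();

// Create the initial archive from the first subset
// (assuming "output.zip" does not exist yet).
compressor.compress( subsets.front(), "output.zip" );

// Incrementally append each remaining subset to the archive.
compressor.setUpdateMode( UpdateMode::Append );
for ( std::size_t i = 1; i < subsets.size(); ++i ) {
    compressor.compress( subsets[ i ], "output.zip" );
}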

@yuzhengjun1204 (Author)

Thank you for your response, I will give it a try.

@yuzhengjun1204 (Author)


When I split a large folder into individual files and process them one at a time, I find the speed to be very slow. Is this normal, and what can I do to improve the compression speed? My code looks like this:

Bit7zLibrary lib{"7z.dll"};
BitFileCompressor compressor{lib, BitFormat::Zip};
// .....
for(auto& file_info: file_list){
    compressor.setCompressionLevel(BitCompressionLevel::Normal);
    std::map<std::string, std::string> temp_file_map;
    temp_file_map[file_info.absolute_file_path_] = file_info.relative_file_path_;
    compressor.compress(temp_file_map, dst_file_path_.toStdString());
    compressor.setOverwriteMode(OverwriteMode::None);
    compressor.setUpdateMode(UpdateMode::Append);
}

@yuzhengjun1204 (Author)

Compressing a package of about 15 GB takes about 5 minutes using this method, while it might take 50 minutes using the method above. Here, file_list only includes a single directory (about 15 GB).

BitFileCompressor compressor{lib, BitFormat::Zip};
std::vector<std::string> file_list;
// .....
compressor.setOverwriteMode(OverwriteMode::Overwrite);
compressor.setCompressionLevel(BitCompressionLevel::Normal);
compressor.compress(file_list, dst_file_path);

@rikyoz (Owner) commented Jul 1, 2024

A few notes first:

  • You can call setCompressionLevel, setOverwriteMode and setUpdateMode once outside the loop, as the values they set will be used for all compression operations.
    • Unless, of course, you need to change the settings for each compression operation, which doesn't seem to be the case from your code.
  • The default overwrite mode of BitFileCompressor is already None, so you don't need to set it.

As for the performance of the code:

  • I guess that file_info refers to single files, not directories. If this is the case, you can use compressor.compressFile(file_info.absolute_file_path_, dst_file_path_.toStdString(), file_info.relative_file_path_).
    • This has the advantage of not having to allocate a std::map on each iteration.
    • The allocation of the map is probably not the bottleneck here, though, so I don't expect any significant improvement.
  • Each compression operation that needs to append files to an existing archive is increasingly expensive in terms of performance:
    • Updating an archive basically works by creating a new temporary archive, copying the data from the original archive, appending the newly compressed data, and then overwriting the original archive with the temporary one.
    • This is unavoidable due to how 7-Zip's interface works (and 7-Zip GUI/CLI does the same, as far as I know).

So yes, the slowness is normal, especially for such a large archive and because you're doing a lot of single file append compressions.
The only way to improve performance is to reduce the number of compressions, which is exactly what you have done here:

> Compressing a package of about 15 GB takes about 5 minutes using this method, while it might take 50 minutes using the method above. Here, file_list only includes a single directory (about 15 GB).
>
> BitFileCompressor compressor{lib, BitFormat::Zip};
> std::vector<std::string> file_list;
> // .....
> compressor.setOverwriteMode(OverwriteMode::Overwrite);
> compressor.setCompressionLevel(BitCompressionLevel::Normal);
> compressor.compress(file_list, dst_file_path);

In my first reply, I suggested the "update" approach because it was the only way to achieve what you originally described in this issue.
But I don't think it's the best approach, performance-wise, as you also noted.

Ideally, only one compression should be performed.
If this is not possible, the fewer compressions the better.
I would avoid doing single file updates, but rather compress "clusters" of files together.
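
A rough sketch of what I mean, assuming file_list and compressor are set up as in your code (the cluster size of 1000 is an arbitrary placeholder):

// Rough sketch: compress clusters of files instead of single files.
// Requires <algorithm> for std::min; cluster_size is arbitrary.
constexpr std::size_t cluster_size = 1000;
compressor.setCompressionLevel( BitCompressionLevel::Normal );

for ( std::size_t i = 0; i < file_list.size(); i += cluster_size ) {
    const std::size_t last = std::min( i + cluster_size, file_list.size() );
    const std::vector< std::string > cluster( file_list.begin() + i,
                                              file_list.begin() + last );
    compressor.compress( cluster, dst_file_path );
    // From the second cluster onward, append to the existing archive.
    compressor.setUpdateMode( UpdateMode::Append );
}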

In the issue, you mentioned that your program terminated because of an error during the compression process. This could be due to an underlying problem with the files being compressed, so I would check that before taking the "update" approach.

@yuzhengjun1204 (Author)

When compressing, if there is an exception, an error is thrown and the compression stops. Is there a way to report the error but not stop compressing? I hope the compression can continue and skip the erroneous files.

@yuzhengjun1204 (Author)

When compressing a particularly large file, about 200 GB, a stack overflow may occur, causing an abnormal exit. Are there any good ways to handle this?

BitFileCompressor compressor{lib, BitFormat::Zip};
std::vector<std::string> file_list;
// ...
compressor.setCompressionLevel(BitCompressionLevel::Normal);
compressor.compress(file_list, dst_file_path_.toStdString());


@rikyoz (Owner) commented Jul 2, 2024

> When compressing, if there is an exception, an error is thrown and the compression stops. Is there a way to report the error but not stop compressing? I hope the compression can continue and skip the erroneous files.

Unfortunately, no. But I think it could be a useful feature, so I'll definitely add it to the library.

Here is what you can do for now:

  • Try the whole compression once, check which files are failing (BitException provides a failedFiles() method which can help), and retry the whole compression without those files (see the sketch after this list); or
  • Use incremental compression of clusters of files, as explained above; or
  • If the files you need to compress are always the same, you could check the reason why they cannot be compressed and avoid compressing them preemptively.
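
A rough sketch of the first option, assuming failedFiles() returns a list of path/error-code pairs and that file_list, compressor, and dst_file_path are set up as in your code:

// Sketch: try once, drop the failing paths reported by the exception,
// then retry without them. Requires <algorithm> for std::remove.
try {
    compressor.compress( file_list, dst_file_path );
} catch ( const bit7z::BitException& ex ) {
    for ( const auto& failed : ex.failedFiles() ) {
        // failed.first is the file path, failed.second its error code.
        file_list.erase( std::remove( file_list.begin(), file_list.end(), failed.first ),
                         file_list.end() );
    }
    compressor.compress( file_list, dst_file_path ); // retry without the failing files
}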

> When compressing a particularly large file, about 200 GB, a stack overflow may occur, causing an abnormal exit. Are there any good ways to handle this?

Um, that's weird, I'll try to replicate the problem.
What OS, compiler and configuration flags are you using?
Is the 7-Zip GUI/CLI successfully compressing the same file or is it failing?

@yuzhengjun1204 (Author)

The 7-Zip GUI works fine. I can catch the stack overflow exception with try/catch(...), and GetLastError() returns 5; this is reliably reproducible in my tests. I would like to avoid the stack overflow exceptions.
7-Zip: 24.05
OS: Win10, MSVC 2019, x86.
Flags: BIT7Z_AUTO_FORMAT, BIT7Z_REGEX_MATCHING, BIT7Z_AUTO_PREFIX_LONG_PATHS, BIT7Z_PATH_SANITIZATION

@rikyoz (Owner) commented Jul 4, 2024

Thanks for the further details.

I've already tried a similar configuration: MSVC 2022, x86, same flags, 7-Zip 23.01, single ~110GB file to be compressed as a zip file.
In my tests, I could not reproduce the issue, and the compression finished flawlessly.
I'll try to use a configuration closer to yours and see if I can reproduce the issue.

@yuzhengjun1204 (Author) commented Jul 5, 2024


Test code. After running compressor.compress, the process crashed.

#include <iostream>
#include "include/bit7z/bitextractor.hpp"
#include "include/bit7z/bitcompressor.hpp"
#include "include/bit7z/bitfilecompressor.hpp"
#include "include/bit7z/bitexception.hpp"
#include "include/bit7z/bitfileextractor.hpp"
using namespace bit7z;
void compressfile() {
	try { // bit7z classes can throw BitException objects
		using namespace bit7z;
		Bit7zLibrary lib{ "7z.dll" };
		BitFileCompressor compressor{ lib, BitFormat::Zip };
		compressor.setProgressCallback([&](uint64_t in_size) -> bool {
			std::cout << "setProgressCallback: " << in_size << std::endl;
			return true;
			});
		compressor.setFileCallback([&](std::string file_path) {
			std::cout << "filecallback" << file_path << std::endl;
			});
		std::vector<std::string> file_list;
		file_list.push_back("d:\\");
		compressor.compress(file_list, "d://d.zip");
	}
	catch (const bit7z::BitException& ex) {
		std::cout << "Hello World!" << ex.code() << ex.what();
	}
}
int main()
{
	std::cout << "Hello World!\n";
	compressfile();
}


@yuzhengjun1204 (Author) commented Jul 5, 2024

This code catches error:3.
Debug output:

bit7z_test.exe (Win32): Loaded "C:\Windows\System32\DriverStore\FileRepository\iigd_dch.inf_amd64_4be767c332df1d04\igdumdim32.dll".
Exception at 0x773C590F (ntdll.dll) in bit7z_test.exe: 0xC00000FD: Stack overflow (parameters: 0x00000001, 0x01042FD0).

void compressfile() {
	try { // bit7z classes can throw BitException objects
		using namespace bit7z;
		Bit7zLibrary lib{ "7z.dll" };
		BitFileCompressor compressor{ lib, BitFormat::Zip };
		compressor.setProgressCallback([&](uint64_t in_size) -> bool {
			std::cout << "setProgressCallback: " << in_size << std::endl;
			return true;
			});
		compressor.setFileCallback([&](std::string file_path) {
			std::cout << "filecallback" << file_path << std::endl;
			});
		std::vector<std::string> file_list;
		file_list.push_back("d:\\");
		compressor.compress(file_list, "d://d.zip");
	}
	catch (const bit7z::BitException& ex) {
		std::cout << "error:1" << ex.code() << ex.what();
	}
	catch (const std::exception& e) {
		std::cout << "error:2" << e.what();
	}
	catch (...) {
		std::cout << "error:3:";
	}
}

@yuzhengjun1204 (Author)

It seems that the issue is caused by the recursive call in this code.

void FilesystemIndexer::listDirectoryItems( vector< unique_ptr< GenericInputItem > >& result,
                                            bool recursive,
                                            const fs::path& prefix ) {
    fs::path path = mDirItem.filesystemPath();
    if ( !prefix.empty() ) {
        path = path / prefix;
    }
    const bool includeRootPath = mFilter.empty() ||
                                 !mDirItem.filesystemPath().has_parent_path() ||
                                 mDirItem.inArchivePath().filename() != mDirItem.filesystemName();
    const bool shouldIncludeMatchedItems = mPolicy == FilterPolicy::Include;
    std::error_code error;
    for ( const auto& currentEntry : fs::directory_iterator( path, error ) ) {
        auto searchPath = includeRootPath ? mDirItem.inArchivePath() : fs::path{};
        if ( !prefix.empty() ) {
            searchPath = searchPath.empty() ? prefix : searchPath / prefix;
        }

        const FilesystemItem currentItem{ currentEntry, searchPath, mSymlinkPolicy };
        /* An item matches if:
         *  - Its name matches the wildcard pattern, and
         *  - Either is a file, or we are interested also to include folders in the index.
         *
         * Note: The boolean expression uses short-circuiting to optimize the evaluation. */
        const bool itemMatches = ( !mOnlyFiles || !currentItem.isDir() ) &&
                                 fsutil::wildcard_match( mFilter, currentItem.name() );
        if ( itemMatches == shouldIncludeMatchedItems ) {
            result.emplace_back( std::make_unique< FilesystemItem >( currentItem ) );
        }

        if ( currentItem.isDir() && ( recursive || ( itemMatches == shouldIncludeMatchedItems ) ) ) {
            //currentItem is a directory, and we must list it only if:
            // > indexing is done recursively
            // > indexing is not recursive, but the directory name matched the filter.
            const fs::path nextDir = prefix.empty() ?
                                     currentItem.filesystemName() : prefix / currentItem.filesystemName();
            listDirectoryItems( result, true, nextDir );
        }
    }
}

@rikyoz (Owner) commented Jul 5, 2024

Ah, I see!
I was investigating a file size issue, but the problem is actually caused by trying to recursively index a directory structure that is too deeply nested.
I'll try rewriting the indexing code to not use recursion.
Thanks for the additional details, they really helped!

rikyoz added the 🐞 bug label Jul 5, 2024
rikyoz added a commit that referenced this issue Jul 7, 2024
Fixes stack overflow issues when indexing deeply nested directory structures (close issue #223)
rikyoz added a commit that referenced this issue Jul 10, 2024
Fixes stack overflow issues when indexing deeply nested directory structures (close issue #223)
@rikyoz (Owner) commented Jul 11, 2024

Hi!
Just wanted to give you a quick update on this issue.
I have managed to rewrite the indexing algorithm to not use recursion, and from my tests it seems to work well.
I've pushed the change to both the develop branch and the new hotfix/v4.0.8 branch, as I plan to include the fix in the next patch release.
Please let me know if the fix works for you.

@yuzhengjun1204 (Author) commented Jul 11, 2024


I tried it.
When I use version v4.0.8 to compress the D: drive, it throws a std::exception indicating "recursive_directory_iterator::operator++: access denied." However, when I use the following code, it works fine.

void FilesystemIndexer::listDirectoryItems(vector<unique_ptr<GenericInputItem>>& result, bool recursive,
                                           const fs::path& prefix) {
    fs::path path = mDirItem.filesystemPath();
    if (!prefix.empty()) {
        path = path / prefix;
    }
    const bool includeRootPath = mFilter.empty() || !mDirItem.filesystemPath().has_parent_path() ||
                                 mDirItem.inArchivePath().filename() != mDirItem.filesystemName();
    const bool shouldIncludeMatchedItems = mPolicy == FilterPolicy::Include;

    std::queue<fs::path> dirs_to_process;
    dirs_to_process.push(path);

    while (!dirs_to_process.empty()) {
        path = dirs_to_process.front();
        dirs_to_process.pop();

        std::error_code error;
        for (const auto& currentEntry : fs::directory_iterator(path, error)) {
            auto searchPath = includeRootPath ? mDirItem.inArchivePath() : fs::path{};
            if (!prefix.empty()) {
                searchPath = searchPath.empty() ? prefix : searchPath / prefix;
            }

            const FilesystemItem currentItem{currentEntry, searchPath, mSymlinkPolicy};
            const bool itemMatches =
                (!mOnlyFiles || !currentItem.isDir()) && fsutil::wildcard_match(mFilter, currentItem.name());
            if (itemMatches == shouldIncludeMatchedItems) {
                result.emplace_back(std::make_unique<FilesystemItem>(currentItem));
            }

            if (currentItem.isDir() && (recursive || (itemMatches == shouldIncludeMatchedItems))) {
                dirs_to_process.push(currentEntry.path());
            }
        }
    }
}

@rikyoz (Owner) commented Jul 11, 2024

Interesting.
Yeah, I could use a queue as in your code, but I found a possible solution using just std::filesystem: passing fs::directory_options::skip_permission_denied to the constructor of fs::recursive_directory_iterator.
I've just pushed a commit (6b35220) with the fix to the hotfix/v4.0.8 branch.
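
For reference, here is a minimal standalone illustration of that option (not the actual bit7z code; the D:\ path is just an example):

#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

int main() {
    // skip_permission_denied makes the iterator silently skip directories
    // that cannot be entered, instead of failing with an access-denied error.
    for ( const auto& entry : fs::recursive_directory_iterator{
              "D:\\", fs::directory_options::skip_permission_denied } ) {
        std::cout << entry.path().string() << '\n';
    }
}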

@yuzhengjun1204 (Author)

Yes, it works.

@rikyoz (Owner) commented Jul 11, 2024

Perfect, thanks! 🙏

rikyoz closed this as completed Aug 17, 2024