cleaning up data api file cache #737
Changes from 2 commits
```diff
@@ -23,6 +23,7 @@
 import hashlib
 import json
 import logging
+from os.path import commonprefix
 from pathlib import Path
 from urllib.parse import quote, unquote, urlsplit, urlunsplit
 import time
```
```diff
@@ -595,13 +596,13 @@ def _download_file(self, local_path, fileinfo, check=checksize, retries=3):
             downloaded = self._tracked_download(remote_url=fileinfo.url, local_path=local_path)
             if not downloaded.enddownload:
                 raise Download.Failed("Download seems to be in progress, please try again later"
-                                      " or remove cache entry by calling"
-                                      f" `Client.purge_cache(Path('{local_path}'))`!")
+                                      " or remove cache entry from database by calling"
+                                      f" `Client.purge_cache_db(Path('{local_path}'))`!")
```
**Reviewer:** I think this is a bit too difficult to understand. When should a user purge the cache? When should a user wait for the download? Any way to help the user better here?

**Author:** The exception is more verbose now.

**Reviewer:** Very clear now!
```diff
             try:
                 check(local_path, fileinfo)
             except Download.Failed as dlf:
                 local_path.unlink(missing_ok=True)
-                self.purge_cache(local_path)
+                self.purge_cache_db(local_path)
                 raise dlf
             return local_path
         except Download.Failed as dle:
```
```diff
@@ -663,7 +664,7 @@ def _organize_path(dataset, target_dir):
         return target_dir

     @staticmethod
-    def purge_cache(local_path):
+    def purge_cache_db(local_path):
         """Removes entry from the sqlite database that keeps track of files downloaded by
         `cached_download`. This may be necessary in case a previous attempt has failed
         in an uncontrolled way (power outage or the like).
```
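For context, a minimal recovery sketch of how the renamed helper is meant to be used. The `climada.util.api_client` import path and the file path below are assumptions for illustration only, not taken from this diff; `purge_cache_db` itself only drops the sqlite tracking entry, so the leftover file has to be removed separately.

```python
# Hypothetical recovery flow after a "Download seems to be in progress" error.
# Import path and stale_path are assumed for illustration only.
from pathlib import Path
from climada.util.api_client import Client

stale_path = Path("/home/user/climada/data/hazard/some_file.hdf5")  # hypothetical path

stale_path.unlink(missing_ok=True)   # remove the half-downloaded file, if any
Client.purge_cache_db(stale_path)    # drop its entry from the sqlite download tracker
```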
```diff
@@ -1009,3 +1010,55 @@ def into_files_df(dataset_infos):
         """
         return Client.into_datasets_df(dataset_infos) \
             .merge(pd.DataFrame([dsfile for ds in dataset_infos for dsfile in ds.files]))
+
+    def purge_cache(self, target_dir=SYSTEM_DIR, keep_testfiles=True):
+        """Removes downloaded dataset files from the given directory if they have been downloaded
+        with the API client, if they are beneath the given directory and if one of the following
+        is the case:
+        - their status is neither 'active' nor 'test_dataset'
+        - their status is 'test_dataset' and keep_testfiles is set to False
+        - their status is 'active' and they are outdated, i.e., there is a dataset with the same
+          data_type and name but a newer version.
+
+        Parameters
+        ----------
+        target_dir : Path or str, optional
+            files downloaded beneath this directory and empty subdirectories will be removed.
+            default: SYSTEM_DIR
+        keep_testfiles : bool, optional
+            if set to True, files from datasets with status 'test_dataset' will not be removed.
+            default: True
+        """
```
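Based on the docstring above, a hedged usage sketch; the import path and the argument-less `Client()` construction are assumptions, not shown in this diff. The method body continues in the diff below.

```python
# Usage sketch for the new purge_cache method (import path assumed).
from climada.util.api_client import Client

client = Client()

# Remove outdated downloads beneath the default SYSTEM_DIR, keeping test datasets:
client.purge_cache()

# Clean a custom download directory and drop 'test_dataset' files as well:
client.purge_cache(target_dir="/data/climada_downloads", keep_testfiles=False)
```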
```diff
+        # collect urls from datasets that should not be removed
+        test_datasets = self.list_dataset_infos(status='test_dataset') if keep_testfiles else []
+        test_urls = set(filinf.url for dsinf in test_datasets for filinf in dsinf.files)
```
**Reviewer:** I do not understand the short variables

**Author:** they're called
```diff
+        active_datasets = self.list_dataset_infos(status='active', version='newest')
+        active_urls = set(filinf.url for dsinf in active_datasets for filinf in dsinf.files)
+
+        not_to_be_removed = test_urls.union(active_urls)
+
+        # make a list of downloaded files that could be removed
+        to_be_removed = [d for d in Download.select() if d.url not in not_to_be_removed]
+
+        # helper function for filtering by target_dir
+        target_dir = Path(target_dir).absolute()
+
+        # remove files and sqlite db entries
+        for obsolete in to_be_removed:
+            opath = Path(obsolete.path)
+            if opath.exists() and Path(commonprefix([target_dir, opath])) == target_dir:
+                opath.unlink()
+                obsolete.delete_instance()
```
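The `commonprefix` expression above is the containment test: a file is only deleted if the common string prefix of its absolute path and `target_dir` is `target_dir` itself. A small illustration with made-up paths:

```python
# Illustration of the containment check used in the removal loop (paths made up).
from os.path import commonprefix
from pathlib import Path

target_dir = Path("/home/user/climada/data").absolute()
inside = Path("/home/user/climada/data/storm_europe/icon_file.nc")
outside = Path("/home/user/other_project/icon_file.nc")

print(Path(commonprefix([target_dir, inside])) == target_dir)   # True  -> candidate for removal
print(Path(commonprefix([target_dir, outside])) == target_dir)  # False -> left untouched
```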
```diff
+
+        # clean up: remove all empty directories beneath target_dir
+        def rm_empty_dirs(directory: Path):
+            for subdir in directory.iterdir():
+                if subdir.is_dir():
+                    rm_empty_dirs(subdir)
+            try:
+                directory.rmdir()
+            except OSError:
+                pass
+        rm_empty_dirs(target_dir)
```

**Reviewer:** What error is ignored here? I do not understand.

**Author:** I've added an inline comment.
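On the question above: `Path.rmdir()` removes only empty directories and raises `OSError` otherwise, so the `except OSError: pass` simply keeps any directory that still contains files. A self-contained illustration:

```python
# rmdir() refuses to delete non-empty directories; the OSError is the signal
# that the directory should simply be left in place.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
(root / "still_needed.txt").write_text("keep me")

try:
    root.rmdir()          # raises OSError: directory is not empty
except OSError:
    pass                  # leave the non-empty directory alone

print(root.exists())      # True - the directory and its file survived
```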
**Reviewer:** Why are there 3 test files?

**Author:** That's just the nature of that test dataset. Datasets can have any number of files.

**Reviewer:** Thanks. I meant, why does this particular test result in 3 files?

**Author:** Beats me. I picked it for being small (in file size) and expired. I suspect it's an experimental dataset that was used to explore the data api itself.

**Reviewer:** Should we use a better-known test file then? The test looks rather mysterious like this.

**Author:** Oh. Wrong. Sorry. The one with 3 files is used in `TestStormEurope.test_icon_read`. Reading icon files takes a directory as input and collects data from there. Having more than one makes complete sense. And the test is fairly known. Apart from that, I think it doesn't really matter which dataset we pick as long as size is acceptable and status and version make a difference.

**Reviewer:** All right, if it is clear to you, I am fine with it.