cleaning up data api file cache #737

Merged
5 commits merged on Jun 19, 2023
Changes from 2 commits

42 changes: 41 additions & 1 deletion climada/test/test_api_client.py
@@ -19,8 +19,8 @@
Test save module.
"""
from pathlib import Path
import tempfile
import unittest
from shutil import rmtree

import numpy as np

@@ -233,6 +233,46 @@ def test_multiplicity_split(self):
self.assertEqual(straight, {'b': '1'})
self.assertEqual(multi, {'country_name': ['x', 'y', 'z']})

def test_purge_cache(self):
client = Client()

active_ds = client.get_dataset_info(data_type="litpop", name="LitPop_150arcsec_ABW", version="v2")
outdated_ds = client.get_dataset_info(data_type="litpop", name="LitPop_150arcsec_ABW", version="v1")
test_ds = client.get_dataset_info(data_type="storm_europe", name="test_storm_europe_icon_2021012800", version="v1", status="test_dataset")
expired_ds = client.get_dataset_info(data_type="tropical_cyclone", name="rename_files2", version="v1", status="expired")

with tempfile.TemporaryDirectory() as temp_dir:
for ds in [active_ds, outdated_ds, test_ds, expired_ds]:
client.download_dataset(dataset=ds, target_dir=Path(temp_dir))
self.assertEqual( # outdated dataset present
1,
len(list(Path(temp_dir).joinpath('exposures/litpop/LitPop_150arcsec_ABW/v1').iterdir()))
)
self.assertEqual( # expired data set present
1,
len(list(Path(temp_dir).joinpath('hazard/tropical_cyclone/rename_files2/v1').iterdir()))
)

client.purge_cache(target_dir=temp_dir)
self.assertFalse( # outdated data set removed
Path(temp_dir).joinpath('exposures/litpop/LitPop_150arcsec_ABW/v1').is_dir()
)
self.assertFalse( # expired data set removed
Path(temp_dir).joinpath('hazard/tropical_cyclone/rename_files2/v1').is_dir()
)
self.assertEqual( # test files are still there
3,
Member:
Why are there 3 test files?

Collaborator Author (@emanuel-schmid, Jun 19, 2023):
That's just the nature of that test dataset. Datasets can have any number of files.

Member:
Thanks. I meant, why does this particular test result in 3 files?

Collaborator Author:
Beats me. I picked it for being small (in file size) and expired. I suspect it's an experimental dataset that was used to explore the data api itself.

Member:
Should we use a better-known test file then? The test looks rather mysterious like this.

Collaborator Author:
Oh. Wrong. Sorry. The one with 3 files is used in TestStormEurope.test_icon_read. Reading icon files takes a directory as input and collects data from there. Having more than one makes complete sense. And the test is fairly known. Apart from that I think it doesn't really matter which dataset we pick as long as size is acceptable and status and version make a difference.

Member:
All right, if it is clear to you, I am fine with it.

len(list(Path(temp_dir).joinpath('hazard/storm_europe/test_storm_europe_icon_2021012800/v1').iterdir()))
)

client.purge_cache(target_dir=temp_dir, keep_testfiles=False)
self.assertTrue( # uptodate active dataset file still there
Path(temp_dir).joinpath('exposures/litpop/LitPop_150arcsec_ABW/v2/LitPop_150arcsec_ABW.hdf5').exists()
)
self.assertFalse( # test data removed, empty directories removed
Path(temp_dir).joinpath('hazard/').exists()
)


def rm_empty_dir(folder):
for subfolder in folder.iterdir():
61 changes: 57 additions & 4 deletions climada/util/api_client.py
@@ -23,6 +23,7 @@
import hashlib
import json
import logging
from os.path import commonprefix
from pathlib import Path
from urllib.parse import quote, unquote, urlsplit, urlunsplit
import time
@@ -595,13 +596,13 @@ def _download_file(self, local_path, fileinfo, check=checksize, retries=3):
downloaded = self._tracked_download(remote_url=fileinfo.url, local_path=local_path)
if not downloaded.enddownload:
raise Download.Failed("Download seems to be in progress, please try again later"
" or remove cache entry by calling"
f" `Client.purge_cache(Path('{local_path}'))`!")
" or remove cache entry from database by calling"
f" `Client.purge_cache_db(Path('{local_path}'))`!")
Member:
I think this is a bit too difficult to understand. When should a user purge the cache? When should a user wait for the download? Any way to help the user better here?

Collaborator Author:
The exception is more verbose now.

Member:
Very clear now!
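
Editor's note, for illustration only and not part of the diff: a minimal sketch of how a caller might act on this exception, assuming no other process is actually downloading the file. The variable `stuck_file` is a placeholder for the path reported in the exception message, not part of the API.

from pathlib import Path

from climada.util.api_client import Client, Download

client = Client()
dataset = client.get_dataset_info(data_type="litpop", name="LitPop_150arcsec_ABW", version="v2")
try:
    client.download_dataset(dataset=dataset)
except Download.Failed:
    # If the download is not actually running elsewhere, the tracking entry is
    # stale (e.g. after a power outage): remove it from the database and retry.
    stuck_file = Path("LitPop_150arcsec_ABW.hdf5")  # placeholder for the path in the message
    Client.purge_cache_db(stuck_file)
    client.download_dataset(dataset=dataset)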

try:
check(local_path, fileinfo)
except Download.Failed as dlf:
local_path.unlink(missing_ok=True)
self.purge_cache(local_path)
self.purge_cache_db(local_path)
raise dlf
return local_path
except Download.Failed as dle:
@@ -663,7 +664,7 @@ def _organize_path(dataset, target_dir):
return target_dir

@staticmethod
def purge_cache(local_path):
def purge_cache_db(local_path):
"""Removes entry from the sqlite database that keeps track of files downloaded by
`cached_download`. This may be necessary in case a previous attempt has failed
in an uncontrolled way (power outage or the like).
@@ -1009,3 +1010,55 @@ def into_files_df(dataset_infos):
"""
return Client.into_datasets_df(dataset_infos) \
.merge(pd.DataFrame([dsfile for ds in dataset_infos for dsfile in ds.files]))

def purge_cache(self, target_dir=SYSTEM_DIR, keep_testfiles=True):
"""Removes downloaded dataset files from the given directory if they have been downloaded
with the API client, if they are beneath the given directory and if one of the following
is the case:
- their status is neither 'active' nor 'test_dataset'
= their status is 'test_dataset' and keep_testfiles is set to False
Member:
Suggested change
= their status is 'test_dataset' and keep_testfiles is set to False
- their status is 'test_dataset' and keep_testfiles is set to False

- their status is 'active' and they are outdated, i.e., there is a dataset with the same
data_type and name but a newer version.

Parameters
----------
target_dir : Path or str, optional
files downloaded beneath this directory and empty subdirectories will be removed.
default: SYSTEM_DIR
keep_testfiles : bool, optional
if set to True, files from datasets with status 'test_dataset' will not be removed.
default: True
"""

# collect urls from datasets that should not be removed
test_datasets = self.list_dataset_infos(status='test_dataset') if keep_testfiles else []
test_urls = set(filinf.url for dsinf in test_datasets for filinf in dsinf.files)
Member:
I do not understand the short variables dsinf and filinf. Are these some standard abbreviations?

Collaborator Author:
They're called ds_info and file_info now.


active_datasets = self.list_dataset_infos(status='active', version='newest')
active_urls = set(filinf.url for dsinf in active_datasets for filinf in dsinf.files)

not_to_be_removed = test_urls.union(active_urls)

# make a list of downloaded files that could be removed
to_be_removed = [d for d in Download.select() if d.url not in not_to_be_removed]

# helper function for filtering by target_dir
target_dir = Path(target_dir).absolute()

# remove files and sqlite db entries
for obsolete in to_be_removed:
opath = Path(obsolete.path)
if opath.exists() and Path(commonprefix([target_dir, opath])) == target_dir:
opath.unlink()
obsolete.delete_instance()

# clean up: remove all empty directories beneath target_dir
def rm_empty_dirs(directory: Path):
for subdir in directory.iterdir():
if subdir.is_dir():
rm_empty_dirs(subdir)
try:
directory.rmdir()
except OSError:
Member:
What error is ignored here? I do not understand.

Collaborator Author:
I've added an inline comment # raised when directory is not empty

pass
rm_empty_dirs(target_dir)
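
Editor's note, not from the PR: a hedged usage sketch of the new method, assuming the default SYSTEM_DIR cache and a reachable data API. The custom directory path below is an illustrative placeholder.

from pathlib import Path

from climada.util.api_client import Client

client = Client()

# Drop outdated, expired and otherwise inactive dataset files from the default
# cache directory, keeping files that belong to test datasets (the default).
client.purge_cache()

# Clean a custom download directory and remove test dataset files as well.
client.purge_cache(target_dir=Path("/tmp/climada_cache"), keep_testfiles=False)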