Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add classmethod to Hazard for reading raster-like data from NetCDF file #487

Closed
peanutfun opened this issue Jun 21, 2022 · 8 comments · Fixed by #507
Closed

Add classmethod to Hazard for reading raster-like data from NetCDF file #487

peanutfun opened this issue Jun 21, 2022 · 8 comments · Fixed by #507

Comments

@peanutfun
Copy link
Member

peanutfun commented Jun 21, 2022

The Hazard class offers several options to instantiate it from data files, e.g. from_raster, from_excel, etc. The classmethod from_raster, in particular, uses rasterio to open datasets and read their metadata, coordinates, and data. In this issue, I want to discuss if a general-purpose classmethod for reading data from a NetCDF file into a Hazard object might be useful, and how such a method could look like. A method implementing such a functionality to some extent can be found at climada_petals/blob/feature/wildfire/climada_petals/hazard/wildfire.py#L2247.

What the method should do

Use a single NetCDF file to load data for a consistent instance of Hazard, meaning that if data is missing, it will be set to a sensible default.

The minimal (i.e., essential) data supplied as variables in the file should be

  • hazard intensity data (2D or 3D dataset)
  • coordinates (1D dataset each)
  • time (1D dataset, if applicable)

Optional data could include:

  • hazard fraction data (same dimensions as intensity)
  • event frequency (1D)
  • event names (1D)
  • event IDs (1D)
  • coordinate system information (attributes/metadata)

Method signature

from_netcdf should take the following arguments:

  • data (path-like or xarray.Dataset, required): The dataset. Open the file if it is a path.
  • intensity_var (string, required): The name of the hazard intensity variable in the dataset
  • fraction_var (string, optional): The name of the hazard fraction variable in the dataset
  • coordinate_vars (dict, optional): A mapping from default coordinate names to the variables used as coords in the dataset, e.g. dict(longitude="lon", latitude="y")
  • tbd

Method outline

Suppose a netCDF file contains the following data:

  • intensity: 3D dataset (dims: "time", "longitude", "latitude")
  • 1D coordinate dataset for each dimension

Then the following code creates a consistent Hazard instance from this data:

import xarray as xr
from scipy.sparse import csr_matrix
from climada.hazard import Hazard
from climada.hazard.centroids.centr import Centroids

data = xr.open_dataset("...")
hazard = Hazard()

# Transpose the data so we flatten it with longitude running "fastest"
intensity = data["intensity"].transpose("time", "latitude", "longitude")
hazard.intensity = csr_matrix(intensity.values.reshape((data.sizes["time"], -1)))
hazard.intensity.eliminate_zeros()

# Build centroids
lat, lon = np.meshgrid(data["latitude"].values, data["longitude"].values, indexing="ij")
hazard.centroids = Centroids.from_lat_lon(lat.flatten(), lon.flatten())
hazard.centroids.set_lat_lon_to_meta()

# Consistent Hazard also needs
# hazard.fraction, hazard.event_id, hazard.event_name, hazard.frequency, hazard.date
# but these can be defaulted, e.g.
hazard.event_id = np.array(range(1, data.sizes["time"] + 1))
@chahank
Copy link
Member

chahank commented Jun 22, 2022

Great idea. Three points:

  • we need some flexibility in the input variable naming
  • we should keep the Hazard object in vector format
  • if would be nice to potentially extract some metadata for the 'tag'

@timschmi95
Copy link
Collaborator

Some points from my side:

  • Thomas Röösli already wrote the centroids_from_nc method, which gets the centroids from a netcdf, of which we could probably use some parts. In particular, the method should probably be able to read netcdf file with 1D arrays for lat/lon (then it creates a meshsgrid), but also with 2D arrays of lat/lon in which case no meshgrid needs to be created.
  • I used the xr method .stack to flatten the dataset. Not sure if it's better or worse than the .transpose option in your suggestion
  • @peanutfun I'll send you my netcdf file and reader method on slack to compare

@peanutfun
Copy link
Member Author

peanutfun commented Jul 7, 2022

How should we handle optional information (i.e. information for which we can supply a default) in general? If we don't want users to always specify which data exactly to read from a file, we would need to use some kind of lookup. Consider the following:

The data file contains a coordinate "frequency". By default, the user probably expects this to be loaded as hazard event frequency, even without stating a frequency="frequency" parameter specifically. If such a coordinate does not exist, and the user did not specify a frequency parameter, then the method should choose a default value. Also, the user must be able to specify that the frequency coordinate should not be read.

So I see these use cases:

  • User does not specify a parameter. The method looks for the default coordinate name, and if found, uses this data. If not, it uses a "sensible" default value.
  • User specifies a coordinate name. The method should only use this coordinate and throw an error if it cannot find it.
  • User speficies "" as coordinate name. The method should fall back to the default value, even though the default coordinate is available in the data.

Examples in code:

# Signature:
def from_raster_netcdf(self, file, frequency=None, **kwargs):
    pass

Hazard.from_raster_netcdf(file)  # Load 'frequency' data if available, use default otherwise
Hazard.from_raster_netcdf(file, frequency="freq")  # Load 'freq' data or throw error
Hazard.from_raster_netcdf(file, frequency="")  # Ignore 'frequency' data, use default

@peanutfun
Copy link
Member Author

@chahank @emanuel-schmid @timschmi95 Some input/opinions would be welcome here 🙏

How should we handle optional information (i.e. information for which we can supply a default) in general? If we don't want users to always specify which data exactly to read from a file, we would need to use some kind of lookup.

See #487 (comment)

@chahank
Copy link
Member

chahank commented Jul 15, 2022

User speficies "" as coordinate name. The method should fall back to the default value, even though the default coordinate is available in the data.

I like the general idea, but I am a bit confused by this use case. I would rather say that it does then not give any frequency value and the use should define it?

@peanutfun
Copy link
Member Author

Good point! I think the goal should be that the new method always returns a consistent Hazard object, meaning that it is ready to be used in computations. This is not the case if hazard.frequency is set to None. Therefore, I would go for the default even in this use case, because the user still has to do the same steps if they want to add custom frequency information. Consider:

# Load hazard, ignoring the 'frequency' data in the file
hazard = Hazard.from_raster_netcdf(file, frequency="")

# Default frequency is loaded, 'hazard' is consistent
np.testing.assert_array_equal(hazard.frequency, np.ones(hazard.event_id.size))

# Overwrite the frequency
# NOTE: Exactly the same if hazard.frequency is None or the default np.array
hazard.frequency = my_fancy_frequency

@chahank
Copy link
Member

chahank commented Jul 15, 2022

Good point, it should return a consistent object.

@peanutfun peanutfun linked a pull request Feb 8, 2023 that will close this issue
11 tasks
@peanutfun
Copy link
Member Author

Fixed by #507, #578, #589

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants