ClusterBasedNormalizer should only select the minimum number of required components #700

Open
fealho opened this issue Aug 29, 2023 · 0 comments

fealho (Member) commented Aug 29, 2023

Problem Description

The ClusterBasedNormalizer usually keeps the maximum number of clusters allowed, even when fewer clusters would be sufficient to represent the data properly. This affects the performance of CTGAN, so ideally it would select only as many components as necessary.

Investigation

There are three values that can be tweaked to improve the component selection process:

  • weight_threshold: this attribute controls which components are selected in the line below. However, the threshold is usually too small to properly filter the components, so it should either be increased, removed, or detected automatically based on the data.
    self.valid_component_indicator = self._bgm_transformer.weights_ > self.weight_threshold
  • weight_concentration_prior: it's not obvious that this parameter helps achieve our goal at all. If it does not, it should be removed.
  • max_clusters: the default value of 10 is often higher than what the dataset actually needs. If we cannot find a good value for weight_threshold, perhaps we can detect max_clusters automatically instead (in which case we can remove the entire logic for valid_component_indicator).
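
To make the effect of weight_threshold concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the weights are made-up values standing in for a fitted mixture's weights_ with max_clusters=10, the two thresholds are illustrative, and components_for_coverage is a hypothetical helper showing one way the number of components might be detected from the data.

```python
def valid_component_indicator(weights, weight_threshold):
    """Plain-list version of `weights_ > weight_threshold`."""
    return [w > weight_threshold for w in weights]


def components_for_coverage(weights, coverage=0.99):
    """Hypothetical alternative: the smallest number of components, taken
    from heaviest to lightest, whose weights sum to at least `coverage`."""
    total = 0.0
    for count, w in enumerate(sorted(weights, reverse=True), start=1):
        total += w
        if total >= coverage:
            return count
    return len(weights)


# Made-up weights for 10 components: two dominant modes plus small leftovers.
weights = [0.52, 0.40, 0.02, 0.015, 0.012, 0.01, 0.008, 0.006, 0.005, 0.004]

# Illustrative thresholds (assumptions, not values taken from the library).
low = sum(valid_component_indicator(weights, 0.005))
high = sum(valid_component_indicator(weights, 0.05))
covered = components_for_coverage(weights, coverage=0.9)

print(low, high, covered)
```

With these made-up weights, the smaller threshold keeps 8 of the 10 components, while the larger threshold and the coverage rule both keep only the 2 dominant ones. Whether either rule generalizes to real datasets is exactly what would need to be validated.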

Additional Notes

Ensure that CTGAN works well with these changes and that it works for any type of dataset. If it is not possible to find a strict improvement over the current implementation, then perhaps it's best to leave the code as is.
