feat: Add handling of NaN values to Scaler TableTransformers #873

Open
Tarmandan opened this issue Jun 28, 2024 · 2 comments
Assignees
Labels
enhancement 💡 New feature or request lab Suitable for the lab team1

Comments

@Tarmandan
Contributor

Is your feature request related to a problem?

Currently, RobustScaler, StandardScaler, and possibly RangeScaler do not work on columns containing NaN values.

Desired solution

The fit and transform methods of the scalers should ignore NaN values, just as they ignore None.
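
As a rough illustration of the desired behaviour, here is a minimal pure-Python sketch of a standard scaler whose fit step skips both None and NaN and whose transform step passes them through unchanged. The function names (`fit_standard`, `transform_standard`) are hypothetical and not part of the library, and the use of the population standard deviation is an assumption.

```python
import math
import statistics


def _is_missing(value):
    """Return True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_standard(values):
    """Compute mean and standard deviation from the non-missing values only."""
    present = [v for v in values if not _is_missing(v)]
    # Assumption: population standard deviation; the library may differ.
    return statistics.fmean(present), statistics.pstdev(present)


def transform_standard(values, mean, std):
    """Scale present values; leave None and NaN untouched."""
    return [v if _is_missing(v) else (v - mean) / std for v in values]


column = [1.0, float("nan"), 3.0, None]
mean, std = fit_standard(column)
scaled = transform_standard(column, mean, std)
```

The sketch does not guard against a standard deviation of 0; a real implementation would need to decide how to handle constant columns.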

Possible alternatives (optional)

Implementing a ContainsNaNError and raising it for the end user might be preferable, since the choice of how to handle rows and columns containing NaN can be relevant to the data science problem at hand.
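
The alternative could look roughly like the following sketch. `ContainsNaNError` is only proposed in this issue, so the class below, its message, and the helper `check_no_nan` are hypothetical; the choice to tolerate None while rejecting NaN is likewise an assumption.

```python
import math


class ContainsNaNError(ValueError):
    """Hypothetical error raised when a column to be scaled contains NaN."""

    def __init__(self, column_name):
        super().__init__(f"Column {column_name!r} contains NaN values.")
        self.column_name = column_name


def check_no_nan(column_name, values):
    """Raise ContainsNaNError if any float in `values` is NaN; None is allowed."""
    if any(isinstance(v, float) and math.isnan(v) for v in values):
        raise ContainsNaNError(column_name)


check_no_nan("a", [1.0, None, 3.0])  # passes: None is tolerated

try:
    check_no_nan("b", [1.0, float("nan")])
except ContainsNaNError as error:
    message = str(error)
```

Raising early leaves the decision (drop rows, impute, or replace with None) to the caller instead of making it silently inside the scaler.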

Screenshots (optional)

No response

Additional Context (optional)

Currently there is no test coverage for NaN and None values in tables for the scalers.

@Tarmandan Tarmandan added the enhancement 💡 New feature or request label Jun 28, 2024
@lars-reimann lars-reimann added the lab Suitable for the lab label Jun 28, 2024
lars-reimann pushed a commit that referenced this issue Jul 1, 2024
Closes #650 

### Summary of Changes

Adds a RobustScaler class that works like the StandardScaler but uses
median instead of mean and interquartile range instead of standard
deviation. If the interquartile range is 0, it will only subtract the
median from all rows.

For now, it cannot handle columns containing NaN values. See issue #873.
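
The scaling rule described above can be sketched in pure Python as follows. This is not the library's implementation: the function names are hypothetical, and the quartile interpolation (`method="inclusive"`) is an assumption that may differ from what RobustScaler actually uses.

```python
import statistics


def fit_robust(values):
    """Return (median, interquartile range) of the values."""
    # Assumption: inclusive quartiles; needs at least two values.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.median(values), q3 - q1


def transform_robust(values, median, iqr):
    """Center by the median; divide by the IQR only when it is nonzero."""
    if iqr == 0:
        return [v - median for v in values]
    return [(v - median) / iqr for v in values]


spread = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled_spread = transform_robust(spread, *fit_robust(spread))

constant = [7.0, 7.0, 7.0]
scaled_constant = transform_robust(constant, *fit_robust(constant))
```

The constant column shows the special case: with an interquartile range of 0, the values are only shifted by the median.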

---------

Co-authored-by: srose <118634249+wastedareas@users.noreply.github.com>
Co-authored-by: Simon <simon@schwubbel.dip0.t-ipconnect.de>
Co-authored-by: megalinter-bot <129584137+megalinter-bot@users.noreply.github.com>
@Tarmandan
Contributor Author

The route currently chosen is to replace NaN values with None. This is implemented and works for StandardScaler, but it needs more test coverage.
Should a warning be added when NaN values have been replaced? Silent replacement might be unexpected behaviour.
We have also not checked the runtime implications. The implementation uses fill_nan from polars, which might hurt performance.
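
One way the warning question could be answered is sketched below in plain Python, independent of polars. The helper name `replace_nan_with_none`, the warning category, and the message text are all hypothetical.

```python
import math
import warnings


def replace_nan_with_none(values):
    """Return a new list with NaN replaced by None, warning if any were found."""
    cleaned = [
        None if isinstance(v, float) and math.isnan(v) else v for v in values
    ]
    # Count only values that were NaN before, not values that were already None.
    replaced = sum(
        1 for old, new in zip(values, cleaned) if old is not None and new is None
    )
    if replaced:
        warnings.warn(
            f"Replaced {replaced} NaN value(s) with None.",
            UserWarning,
            stacklevel=2,
        )
    return cleaned


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cleaned = replace_nan_with_none([1.0, float("nan"), None, 4.0])
```

A warning keeps the convenient default while still making the replacement visible; users who consider it noise can filter it with the standard `warnings` machinery.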

@Tarmandan
Contributor Author

We discovered that the implementation may not be working as intended. We forgot that with_columns returns a new data frame (even if it only modifies data in the background), so its result has to be kept. After accounting for that, however, we get an error in the robust scaler that we cannot explain yet; see the TODO in the test_fit method.

lars-reimann pushed a commit that referenced this issue Jul 19, 2024
## [0.27.0](v0.26.0...v0.27.0) (2024-07-19)

### Features

* join ([#870](#870)) ([5764441](5764441)), closes [#745](#745)
* activation function for forward layer ([#891](#891)) ([5b5bb3f](5b5bb3f)), closes [#889](#889)
* add `ImageDataset.split` ([#846](#846)) ([3878751](3878751)), closes [#831](#831)
* add FunctionalTableTransformer ([#901](#901)) ([37905be](37905be)), closes [#858](#858)
* add InvalidFitDataError ([#824](#824)) ([487854c](487854c)), closes [#655](#655)
* add KNearestNeighborsImputer ([#864](#864)) ([fcdfecf](fcdfecf)), closes [#743](#743)
* add moving average plot ([#836](#836)) ([abcf68a](abcf68a))
* add RobustScaler ([#874](#874)) ([62320a3](62320a3)), closes [#650](#650) [#873](#873)
* add SequentialTableTransformer ([#893](#893)) ([e93299f](e93299f)), closes [#802](#802)
* add temporal operations ([#832](#832)) ([06eab77](06eab77))
* added 'histogram_2d' in TablePlotter  ([#903](#903)) ([4e65ba9](4e65ba9)), closes [#869](#869) [#798](#798)
* added from_str_to_temporal and continues prediction ([#767](#767)) ([35f468a](35f468a)), closes [#806](#806) [#765](#765) [#740](#740) [#773](#773)
* added GRU layer ([#845](#845)) ([d33cb5d](d33cb5d))
* Adds Dropout Layer ([#868](#868)) ([a76f0a1](a76f0a1)), closes [#848](#848)
* dark mode for plots ([#911](#911)) ([5447551](5447551)), closes [#798](#798)
* easily create a baseline model ([#811](#811)) ([8e1b995](8e1b995)), closes [#710](#710)
* get first cell with value other than `None` ([#904](#904)) ([5a0cdb3](5a0cdb3)), closes [#799](#799)
* hyperparameter optimization for fnn models ([#897](#897)) ([c1f66e5](c1f66e5)), closes [#861](#861)
* implement violin plots ([#900](#900)) ([9f5992a](9f5992a)), closes [#867](#867)
* plot decision tree ([#876](#876)) ([d3f81dc](d3f81dc)), closes [#856](#856)
* prediction no longer takes a time series dataset only table ([#838](#838)) ([762e5c2](762e5c2)), closes [#837](#837)
* raise if `remove_columns` is called with unknown column by default ([#852](#852)) ([8f78163](8f78163)), closes [#807](#807)
* regularization strength for logistic classifier ([#866](#866)) ([9f74e92](9f74e92)), closes [#750](#750)
* reorders parameters of RangeScaler and makes them keyword-only ([#847](#847)) ([2b82db7](2b82db7)), closes [#809](#809)
* replace seaborn with matplotlib for box_plot ([#863](#863)) ([4ef078e](4ef078e)), closes [#805](#805) [#849](#849)
* replaced seaborn with matplotlib for correlation_heatmap ([#850](#850)) ([d4680d4](d4680d4)), closes [#800](#800) [#849](#849)

### Bug Fixes

* **deps:** bump urllib3 from 2.2.1 to 2.2.2 ([#842](#842)) ([b81bcd6](b81bcd6)), closes [#3122](https://github.com/Safe-DS/Library/issues/3122) [#3363](https://github.com/Safe-DS/Library/issues/3363) [#3122](https://github.com/Safe-DS/Library/issues/3122) [#3363](https://github.com/Safe-DS/Library/issues/3363) [#3406](https://github.com/Safe-DS/Library/issues/3406) [#3398](https://github.com/Safe-DS/Library/issues/3398) [#3399](https://github.com/Safe-DS/Library/issues/3399) [#3396](https://github.com/Safe-DS/Library/issues/3396) [#3394](https://github.com/Safe-DS/Library/issues/3394) [#3391](https://github.com/Safe-DS/Library/issues/3391) [#3316](https://github.com/Safe-DS/Library/issues/3316) [#3387](https://github.com/Safe-DS/Library/issues/3387) [#3386](https://github.com/Safe-DS/Library/issues/3386)
* labels of correlation heatmap ([#894](#894)) ([a88a609](a88a609)), closes [#871](#871)
* make multi-processing in baseline models more consistent ([#909](#909)) ([fa24560](fa24560)), closes [#907](#907)

### Performance Improvements

* improved performance in various methods in `Image` and `ImageList` ([#879](#879)) ([134e7d8](134e7d8))
Projects
Status: In Progress

4 participants