Cleanup DataConfig implementation #1187

Merged
merged 1 commit into from
Jun 6, 2024

Conversation

@shaahji (Contributor) commented Jun 5, 2024

Cleanup DataConfig implementation

  • Removed DataConfig::params_config
  • Removed DataConfig::components/component_args

All component-specific parameters are now grouped into four separate objects:

  • load_dataset_config
  • pre_process_data_config
  • post_process_data_config
  • dataloader_config
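
The grouping described above can be pictured with a small sketch. It uses plain dataclasses rather than Olive's real classes, and the names `ComponentConfig` and `DataConfigSketch` are hypothetical, introduced only for illustration; the four field names and the example parameters are the ones mentioned in this PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ComponentConfig:
    """Hypothetical stand-in for one component's settings (not Olive's class)."""

    # None is taken here to mean "use the default component type".
    type: Optional[str] = None
    # Free-form arguments handed to the component.
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class DataConfigSketch:
    """Hypothetical container holding one config object per component."""

    name: str
    load_dataset_config: ComponentConfig = field(default_factory=ComponentConfig)
    pre_process_data_config: ComponentConfig = field(default_factory=ComponentConfig)
    post_process_data_config: ComponentConfig = field(default_factory=ComponentConfig)
    dataloader_config: ComponentConfig = field(default_factory=ComponentConfig)


# Only the load step is customized; the other three keep their defaults.
config = DataConfigSketch(
    name="glue_mrpc",
    load_dataset_config=ComponentConfig(
        params={"data_name": "glue", "subset": "mrpc", "split": "validation"}
    ),
)
```

Each component carries its own parameters, so no shared `params_config` or `components`/`component_args` mapping is needed in this sketch.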

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Lint and apply fixes to your code by running lintrunner -a
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
  • Is this PR including examples changes? If yes, please remember to update example documentation in a follow-up PR.

(Optional) Issue link


}
},
"pre_process_data_config": {
"params": {

@guotuofeng (Collaborator), Jun 5, 2024

Do we really need the params key name? It would be better if it looked like this:

            "load_dataset_config": {
                    "data_name": "glue",
                    "subset": "mrpc",
                    "split": "validation"
            },

@guotuofeng (Collaborator), Jun 5, 2024

The type info is unnecessary, since load_dataset_config already contains the type info.

Contributor (Author)

> Do we really need the params key name? It would be better if it looked like this:
>
>             "load_dataset_config": {
>                     "data_name": "glue",
>                     "subset": "mrpc",
>                     "split": "validation"
>             },

I agree that params seems redundant. I wasn't sure what the config class would look like, since type is a required parameter and the others are all optional. There is no strict list of what the params can be (think of a custom dataset or data container implementation, which can take arbitrary user arguments).

Contributor (Author)

> The type info is unnecessary, since load_dataset_config already contains the type info.

Hmm, I am unsure what you mean by the config already containing the type info. Please elaborate.

@jambayk (Contributor), Jun 5, 2024

I also agree that the params nesting is not needed. But, as Hitesh said, the type field is still needed if the user wants to override the default component type.

Contributor (Author)

If the changes in this PR look good, I would like to merge it before making the cosmetic params change. This PR modifies too many files, and I don't want to sit on them for too long; otherwise, merging will become very hard.

Contributor

Sounds good to me! As you said, it's a cosmetic change that only looks different on the user side. In the backend, we need a way to hold the params anyway.

@guotuofeng (Collaborator), Jun 6, 2024

Is the type required? I don't see it in the example.
What I mean is that we don't change DataComponentConfig and instead just convert the plain JSON dict to a DataComponentConfig in a validator.
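
A minimal sketch of that kind of conversion, assuming the nested form is a dict with a `type` key and a `params` key. The function name `to_component_config` is hypothetical, and the sketch returns a plain dict instead of Olive's actual DataComponentConfig; in a real config class the same logic might sit inside a validator.

```python
from typing import Any, Dict


def to_component_config(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Normalize a user-supplied dict into {"type": ..., "params": {...}}.

    Hypothetical helper: accepts either the flat form proposed in the
    review or the already-nested form, and always returns the nested form.
    """
    raw = dict(raw)  # work on a copy so the caller's dict is untouched

    # Already nested: only "type"/"params" keys, and "params" is a dict.
    if set(raw) <= {"type", "params"} and isinstance(raw.get("params"), dict):
        return {"type": raw.get("type"), "params": dict(raw["params"])}

    # Flat form: pull out the optional type, everything else is a parameter.
    component_type = raw.pop("type", None)
    return {"type": component_type, "params": raw}


flat = {"data_name": "glue", "subset": "mrpc", "split": "validation"}
normalized = to_component_config(flat)
```

One consequence of the flat form is that `type` becomes a reserved key: a component whose own arguments include a parameter called `type` could no longer receive it this way.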

Contributor

Example of a type override: (image not included)

Contributor (Author)

Even without that, params is still required to collect only the parameters that are passed down to the creator function. I noticed that the DataSet/DataContainer constructors fail if they are passed unknown arguments (including in kwargs). So it might not actually be possible to remove params entirely. If we did, we would still need a custom include/exclude list to collect only the arguments that are passed down to the constructor (and then what are the other parameters for? Are they discarded?).
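
The include-list idea mentioned above could be sketched with the standard library's `inspect.signature`. Both `split_known_kwargs` and `load_dataset_stub` are hypothetical names used only for illustration; they are not part of Olive. Note that the sketch only covers callables with an explicit signature: a constructor that accepts `**kwargs` and forwards them to a stricter callee would still fail later.

```python
import inspect
from typing import Any, Callable, Dict, Tuple


def split_known_kwargs(
    func: Callable[..., Any], kwargs: Dict[str, Any]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split kwargs into (accepted, rejected) based on func's signature."""
    parameters = inspect.signature(func).parameters.values()

    # A **kwargs parameter means the callable itself accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in parameters):
        return dict(kwargs), {}

    # Names that can legally be passed by keyword.
    allowed = {
        p.name
        for p in parameters
        if p.kind
        in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    }
    accepted = {k: v for k, v in kwargs.items() if k in allowed}
    rejected = {k: v for k, v in kwargs.items() if k not in allowed}
    return accepted, rejected


def load_dataset_stub(data_name, subset=None, split="train"):
    """Hypothetical creator function with a strict signature."""
    return (data_name, subset, split)


user_args = {"data_name": "glue", "split": "validation", "batch_size": 8}
accepted, rejected = split_known_kwargs(load_dataset_stub, user_args)
result = load_dataset_stub(**accepted)  # passing user_args directly raises TypeError
```

The `rejected` dict makes the open question in the comment concrete: something still has to decide whether leftover arguments are an error, a warning, or silently discarded.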

@shaahji shaahji marked this pull request as ready for review June 5, 2024 17:16
@shaahji shaahji force-pushed the shaahji/dcparams branch 3 times, most recently from eac2ab0 to e0ff9e8 Compare June 5, 2024 21:12
@shaahji shaahji merged commit 1358acf into main Jun 6, 2024
35 checks passed
@shaahji shaahji deleted the shaahji/dcparams branch June 6, 2024 05:17
DavitGrigoryan132 pushed a commit to DavitGrigoryan132/Olive that referenced this pull request Aug 14, 2024
3 participants