-
-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add parse function to DataFrameModel #1181
Conversation
Reopen due to description being updated. |
fdc3c08
to
ffa255f
Compare
This is awesome @ShishinMo! I think this is a good start 🔥 We should also make sure this works with the object-based Essentially a Steps Needed
Code Exampleobject-based API pa.DataFrameSchema(
# parsers at the schema level have access to the entire dataframe
parsers=[pa.Parser(lambda df: df.transform("sqrt"))],
columns={
# parsers at the column level
"col1": pa.Column(parsers=[pa.Parser(lambda series: seriers.transform("sqrt"))])
}
) And the equivalent class-based API # parsers at the schema level have access to the entire dataframe
class Model(pa.DataFrameModel):
col1: pa.typing.Series[float]
@pa.dataframe_parse
def dataframe_sqrt(cls, df):
return df.transform("sqrt")
# parsers at the column level
@pa.parse("col1")
def sqrt(cls, series):
return series.transform("sqrt") Open Questions
Whew! Sorry if this is a lot to dump on this one PR, but your initiative kicking this off got me going! This is a lot of additional changes, and I can definitely help implement many of these. As a good starting point, I'd recommend adding support just for column-level/series-level parser support and getting everything working there. I can help get this over the finish-line, but if you can get start would love to get this feature into the core codebase! |
(Sorry for replying from different account since I merge my organization account and my individual account.) For your questions,
|
Sounds good! That should also collect |
1d32676
to
b7159f4
Compare
@cosmicBboy Sorry for very late commit, but I have
Now we can parse a dataframe in two ways as following. import pandera as pa
import pandas as pd
df = pd.DataFrame({"col1": [144.0, 256.0, 1024.0], "col2": [144.0, 256.0, 1024.0]})
schema = pa.DataFrameSchema(
parsers=[pa.Parser(lambda df: df.transform("sqrt"))],
columns={
"col1": pa.Column(parsers=[pa.Parser(lambda series: series.transform("sqrt"))])
},
)
schema.validate(df)
### col1 col2
### 0 3.464102 12.0
### 1 4.000000 16.0
### 2 5.656854 32.0 or, import pandera as pa
import pandas as pd
df = pd.DataFrame({"col1": [144.0, 256.0, 1024.0], "col2": [144.0, 256.0, 1024.0]})
class Model(pa.DataFrameModel):
col1: pa.typing.Series[float]
@pa.dataframe_parse
def df_sqrt(cls, df):
return df.transform("sqrt")
@pa.parse("col1")
def sqrt(cls, series):
return series.transform("sqrt")
print(Model.validate(df))
### col1 col2
### 0 3.464102 12.0
### 1 4.000000 16.0
### 2 5.656854 32.0 |
thanks @utopianf there are linter and unit test errors (see failing checks). check out the contributing guide to learn how to run linters and unit tests locally to make sure everything's passing. I can help out some time in the next two weeks if you're still have trouble |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1181 +/- ##
===========================================
- Coverage 94.29% 83.60% -10.69%
===========================================
Files 91 114 +23
Lines 7024 8446 +1422
===========================================
+ Hits 6623 7061 +438
- Misses 401 1385 +984 ☔ View full report in Codecov by Sentry. |
@utopianf any progress on this PR? |
@cosmicBboy I'm sorry. I was busy for a while and then I neglected it. I will make corrections related to the Linter and push them this weekend. |
@cosmicBboy Sorry for keeping you waiting. The code from upstream has changed significantly compared to before, so I rewrite my code to fit the current code. I've pushed the version that passed all tests locally for now. |
79a7071
to
e21748c
Compare
Hi @cosmicBboy - added documentation (sorry my english should be pool), feel free to request any additional work if needed |
thanks @utopianf ! looks like a bunch of unit tests are failing ^^ |
@utopianf can you give me push permission to your fork of pandera? I recently made some changes to the docs using myst instead of rst) and need to make some changes |
@cosmicBboy of course! I have invited you to be a collaborator, but I am not sure if it is enough. Feel free to tell me if it does not work |
ccb6da5
to
29c7e13
Compare
Done! Hopefully all tests and docs build should pass now. Thanks again for all the work on this PR, this is gonna be awesome! In terms of the docs, I think it's almost there, I think a few more sections needs to be added:
|
Will also need to add more tests to get better code coverage of the new parser functionality. (see the codecov warnings: https://github.com/unionai-oss/pandera/pull/1181/files#diff-1756d33acfa3e810dfcc642cb0f42630446dea86a262e3201290e23c19400404) |
pandera/backends/pandas/parsers.py
Outdated
self.parser_fn = partial(parser._parser_fn, **parser._parser_kwargs) | ||
|
||
@overload | ||
def prerprocess( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: prerprocess -> preprocess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
ClassParser = Callable[[Union[classmethod, AnyCallable]], classmethod] | ||
|
||
|
||
def parse(*fields, **parse_kwargs) -> ClassParser: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be parser
? I feel like verb tense parse
also makes sense, but my sense is it should just be the lowercase form of the Parser
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with either, so I respect your opinion.
Fixed
@utopianf if you can add the tests I can work on the docs |
9c72575
to
f2a73b4
Compare
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com> Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com> Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com> Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com> Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
@cosmicBboy sorry for being late. I have add some tests and do some refactoring to remove some unused codes. Hope this work |
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
…t branches Signed-off-by: Shishin Mo <maoson0307@gmail.com>
… behaviour for DataframeSchame Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff @utopianf ! Thanks for you work on this 🚀
Added the parse function to DataFrameModel for doing some preprocessing on series.
I understand there are some discussing(#252) on this enhancement, so if this way doesn't meet your needs, please feel free to reject.
e.g. If we want to make a dataframe with 6-digit string series of codes, we can do it by following: