
Add parse function to DataFrameModel #1181

Merged
merged 15 commits on Apr 12, 2024

Conversation


@ghost ghost commented May 10, 2023

Added the parse function to DataFrameModel for doing some preprocessing on series.

I understand there has been some discussion (#252) on this enhancement, so if this approach doesn't meet your needs, please feel free to reject it.

e.g. If we want to make a dataframe with a series of 6-digit zero-padded string codes, we can do it as follows:

import pandas as pd
import pandera as pa

class PassSchema(pa.DataFrameModel):
    code: pa.typing.Series[str]

    @pa.parse("code")
    def to_str(cls, series):
        return series.astype(str).str.zfill(6)


class ErrorSchema(pa.DataFrameModel):
    code: pa.typing.Series[str]


df = pd.DataFrame({"code": [123, 234]})

print(PassSchema.parse_and_validate(df))

# Output:
#      code
# 0  000123
# 1  000234

print(ErrorSchema.parse_and_validate(df))

# Output:
# ~~~~
# pandera.errors.SchemaError: expected series 'code' to have type str:
# failure cases:
#    index  failure_case
# 0      0           123
# 1      1           234

@ghost ghost closed this May 10, 2023

ghost commented May 10, 2023

Reopened after updating the description.

@ghost ghost reopened this May 10, 2023
@ghost ghost force-pushed the feature/parse branch 2 times, most recently from fdc3c08 to ffa255f Compare May 10, 2023 05:08

cosmicBboy commented May 10, 2023

This is awesome @ShishinMo! I think this is a good start 🔥

We should also make sure this works with the object-based DataFrameSchema API. In fact, the parsing logic should be implemented in the DataFrameSchemaBackend.validate method, similar to run_checks.

Essentially, a Parser is analogous to a Check, but it returns transformed data instead of boolean data. The idea is that DataFrameSchema and DataFrameSchemaBackend (and the corresponding schema components) provide the base layer that implements the schema specification and the validation/parsing logic. DataFrameModel and model components "compile" down to DataFrameSchema and shouldn't actually do any validation/parsing themselves.

Steps Needed

  • pandera.api.base.parsers: you did this already with MetaParse and BaseParse. (nit: rename parses to parsers)
  • pandera.api.parsers: you did this already with Parse.
  • pandera.backends.base.BaseParserBackend: you already did this with BaseParseBackend
  • pandera.backends.pandas.parsers: this should implement a PandasParserBackend, similar to PandasCheckBackend. This implements the core logic of parser data transformation based on Parser.parse_fn. Note that this will be substantially simpler than PandasCheckBackend, since all it does is apply the parse_fn to the data.
  • The container, array, and components API modules need to be updated so they accept a parsers argument, similar to a checks argument. For the initial support, let's only support parsers for Column, DataFrameSchema, and SeriesSchema classes (not Index or MultiIndex).
  • The container, array, and components backend modules need to be updated so that they have a run_parsers method (similar to run_checks)
  • I think this parser functionality should be part of the validate method call, no need for a separate parse_and_validate method.
  • Implement pandera.api.pandas.model_components.parse and dataframe_parse method decorators, similar to check and dataframe_check.
  • Add tests for DataFrameModel and DataFrameSchema and model/schema components
  • Add documentation
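To make the backend step concrete, here is a minimal sketch of the proposed Parser/backend split. This is illustrative only, not pandera's actual implementation: the class names mirror the modules listed above, but the bodies are assumptions.

```python
import pandas as pd

# Hypothetical sketch (not pandera internals): a Parser holds a
# transformation function, and the backend applies it to the data --
# the analogue of PandasCheckBackend, but returning transformed data
# instead of booleans.
class Parser:
    def __init__(self, parse_fn):
        self.parse_fn = parse_fn


class PandasParserBackend:
    def __init__(self, parser):
        self.parser = parser

    def __call__(self, data):
        # Core logic: simply apply parse_fn to the series/dataframe.
        return self.parser.parse_fn(data)


backend = PandasParserBackend(Parser(lambda s: s.transform("sqrt")))
print(backend(pd.Series([144.0, 256.0])).tolist())  # [12.0, 16.0]
```

As noted above, this is substantially simpler than the check backend, since all it does is apply parse_fn to the data.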

Code Example

object-based API

pa.DataFrameSchema(
    # parsers at the schema level have access to the entire dataframe
    parsers=[pa.Parser(lambda df: df.transform("sqrt"))],
    columns={
        # parsers at the column level
        "col1": pa.Column(parsers=[pa.Parser(lambda series: series.transform("sqrt"))])
    }
)

And the equivalent class-based API

# parsers at the schema level have access to the entire dataframe
class Model(pa.DataFrameModel):
    col1: pa.typing.Series[float]

    @pa.dataframe_parse
    def dataframe_sqrt(cls, df):
        return df.transform("sqrt")

    # parsers at the column level
    @pa.parse("col1")
    def sqrt(cls, series):
        return series.transform("sqrt")

Open Questions

  • How many parsers should be allowed per schema/schema component? Should it be one per schema, or can the user provide a list of parsers? If we want to support built-in parsers (e.g. normalizing by mean/std, clipping negative values, etc.) it may make sense to support a list of parsers that form a sort of constrained data transformation pipeline (i.e. the output shape must match the input shape). This affects the next question:
  • In what order are parsers executed? I feel like type-coercion -> dataframe parsers -> component-level parsers -> checks makes sense. Basically, do all data transformations first, then apply the checks. Checks are all independent of each other anyway, so this is a clean way of guaranteeing that, after calling validate, the data returned is type-correct, parsed according to the parsers specified, and passing all checks.
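The proposed ordering can be sketched with plain callables. This is an illustration of the pipeline idea only, not pandera's API; the parser and check functions here are made up for the example.

```python
import pandas as pd

# Illustrative ordering: type coercion -> dataframe-level parsers ->
# column-level parsers -> checks.
df = pd.DataFrame({"col1": ["144.0", "-256.0"]})

coerce = lambda d: d.astype(float)
dataframe_parsers = [lambda d: d.abs()]                  # e.g. drop signs
column_parsers = {"col1": [lambda s: s.transform("sqrt")]}
checks = {"col1": [lambda s: bool((s >= 0).all())]}

out = coerce(df)
for fn in dataframe_parsers:                             # dataframe level first
    out = fn(out)
for col, fns in column_parsers.items():                  # then per-column
    for fn in fns:
        out[col] = fn(out[col])
assert all(fn(out[c]) for c, fns in checks.items() for fn in fns)
print(out["col1"].tolist())  # [12.0, 16.0]
```

Note how the check only passes because the parsers ran first: the raw strings are coerced, the negative value is clipped by the dataframe parser, and only then is sqrt applied and the non-negativity check evaluated.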

Whew! Sorry if this is a lot to dump on this one PR, but your initiative kicking this off got me going! This is a lot of additional changes, and I can definitely help implement many of them.

As a good starting point, I'd recommend adding support just for column-level/series-level parsers and getting everything working there. I can help get this over the finish line, but if you can get it started, I'd love to get this feature into the core codebase!

@utopianf
Contributor

(Sorry for replying from a different account; I merged my organization account and my individual account.)
Thanks @cosmicBboy, I will try to go through the steps you wrote and update the feature as you suggested!

For your questions,

  1. I defined def _collect_parse_infos(), which collects all the parsers into __parses__. __parses__ is a list, so parsers are applied to a series in the same order as their definitions. This will force users to order their function definitions, which may be similar to pydantic.
  2. In parse_and_validate(), the series is first parsed and then validated, which means the current order is parse -> coercion. I totally agree with your recommendation and will update it.

@cosmicBboy
Collaborator

parses is a list, so parsers are applied to a series in the same order as their definitions. This will force users to order their function definitions, which may be similar to pydantic.

Sounds good! That should also collect dataframe_parse methods, which should then be fed into the DataFrameSchema constructor

@utopianf utopianf force-pushed the feature/parse branch 4 times, most recently from 1d32676 to b7159f4 Compare June 14, 2023 04:52
@utopianf
Contributor

@cosmicBboy Sorry for the very late commit, but I have:

  • updated container, array, and components modules
  • implemented pandera.api.pandas.model_components.parse and dataframe_parse

Now we can parse a dataframe in two ways, as follows.

import pandera as pa
import pandas as pd

df = pd.DataFrame({"col1": [144.0, 256.0, 1024.0], "col2": [144.0, 256.0, 1024.0]})

schema = pa.DataFrameSchema(
    parsers=[pa.Parser(lambda df: df.transform("sqrt"))],
    columns={
        "col1": pa.Column(parsers=[pa.Parser(lambda series: series.transform("sqrt"))])
    },
)
schema.validate(df)
###        col1  col2
### 0  3.464102  12.0
### 1  4.000000  16.0
### 2  5.656854  32.0

or,

import pandera as pa
import pandas as pd

df = pd.DataFrame({"col1": [144.0, 256.0, 1024.0], "col2": [144.0, 256.0, 1024.0]})


class Model(pa.DataFrameModel):
    col1: pa.typing.Series[float]

    @pa.dataframe_parse
    def df_sqrt(cls, df):
        return df.transform("sqrt")

    @pa.parse("col1")
    def sqrt(cls, series):
        return series.transform("sqrt")

print(Model.validate(df))
###        col1  col2
### 0  3.464102  12.0
### 1  4.000000  16.0
### 2  5.656854  32.0

@cosmicBboy
Collaborator

thanks @utopianf, there are linter and unit test errors (see the failing checks). Check out the contributing guide to learn how to run linters and unit tests locally to make sure everything's passing.

I can help out some time in the next two weeks if you're still having trouble.


codecov bot commented Jun 29, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.60%. Comparing base (4df61da) to head (3cc986d).
Report is 55 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1181       +/-   ##
===========================================
- Coverage   94.29%   83.60%   -10.69%     
===========================================
  Files          91      114       +23     
  Lines        7024     8446     +1422     
===========================================
+ Hits         6623     7061      +438     
- Misses        401     1385      +984     


@cosmicBboy
Collaborator

@utopianf any progress on this PR?

@utopianf
Contributor

@cosmicBboy I'm sorry. I was busy for a while and then I neglected it. I will make the linter-related corrections and push them this weekend.

@utopianf
Contributor

utopianf commented Mar 26, 2024

@cosmicBboy Sorry for keeping you waiting. The upstream code has changed significantly compared to before, so I rewrote my code to fit the current codebase. I've pushed a version that passes all tests locally for now. I'll start writing new tests from here. Here are some tests as well!
And now the remaining task should be the documentation.

@utopianf utopianf force-pushed the feature/parse branch 7 times, most recently from 79a7071 to e21748c Compare March 29, 2024 07:51
@utopianf
Contributor

Hi @cosmicBboy - added documentation (sorry, my English may be poor); feel free to request any additional work if needed.

@cosmicBboy
Collaborator

thanks @utopianf! Looks like a bunch of unit tests are failing ^^

@cosmicBboy
Collaborator

@utopianf can you give me push permission to your fork of pandera? I recently made some changes to the docs (using myst instead of rst) and need to make some changes.

@utopianf
Contributor

utopianf commented Apr 1, 2024

@cosmicBboy of course! I have invited you as a collaborator, but I'm not sure if that is enough. Feel free to tell me if it does not work.

@cosmicBboy
Collaborator

Done! Hopefully all tests and docs build should pass now.

Thanks again for all the work on this PR, this is gonna be awesome!

In terms of the docs, I think it's almost there; a few more sections need to be added:

  1. how to create dataframe-level parsers
  2. creating parsers with pa.DataFrameModel using the @parse and @dataframe_parse decorators
  3. explaining the order of the validation pipeline with an example: dataframe parsing -> column parsing -> dataframe checks -> column checks
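For the ordering section in point 3, a minimal sketch could show why the pipeline order is observable. This uses plain callables rather than pandera's API, and the parser functions are invented for illustration.

```python
import pandas as pd

# The dataframe-level parser runs before the column-level one, so col1
# becomes (x + 1) * 2 rather than x * 2 + 1.
df = pd.DataFrame({"col1": [1.0, 2.0]})

dataframe_parser = lambda d: d + 1   # runs first
column_parser = lambda s: s * 2      # runs second

out = dataframe_parser(df)
out["col1"] = column_parser(out["col1"])
print(out["col1"].tolist())  # [4.0, 6.0]
```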

@cosmicBboy
Collaborator

Will also need to add more tests to get better code coverage of the new parser functionality. (see the codecov warnings: https://github.com/unionai-oss/pandera/pull/1181/files#diff-1756d33acfa3e810dfcc642cb0f42630446dea86a262e3201290e23c19400404)

self.parser_fn = partial(parser._parser_fn, **parser._parser_kwargs)

@overload
def prerprocess(
Collaborator


typo: prerprocess -> preprocess

Contributor


Fixed

ClassParser = Callable[[Union[classmethod, AnyCallable]], classmethod]


def parse(*fields, **parse_kwargs) -> ClassParser:
Collaborator


should this be parser? I feel like the verb form parse also makes sense, but my sense is it should just be the lowercase form of the Parser object.

Contributor


I am fine with either, so I respect your opinion.
Fixed

@cosmicBboy
Collaborator

@utopianf if you can add the tests I can work on the docs

utopianf and others added 11 commits April 11, 2024 03:07
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
@utopianf
Contributor

@cosmicBboy sorry for being late. I have added some tests and done some refactoring to remove some unused code. Hope this works!

cosmicBboy and others added 4 commits April 11, 2024 15:02
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
…t branches

Signed-off-by: Shishin Mo <maoson0307@gmail.com>
… behaviour for DataframeSchame

Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
Collaborator

@cosmicBboy cosmicBboy left a comment


Great stuff @utopianf! Thanks for your work on this 🚀

@cosmicBboy cosmicBboy merged commit eff9329 into unionai-oss:main Apr 12, 2024
73 of 74 checks passed
@utopianf utopianf deleted the feature/parse branch April 16, 2024 13:09