
Add parse function to DataFrameModel #1181

Merged
merged 15 commits on Apr 12, 2024

Conversation


@ghost ghost commented May 10, 2023

Added the parse function to DataFrameModel for doing some preprocessing on series.

I understand there has been some discussion (#252) on this enhancement, so if this approach doesn't meet your needs, please feel free to reject it.

e.g. If we want to make a dataframe with a series of 6-digit zero-padded string codes, we can do it as follows:

import pandas as pd
import pandera as pa

class PassSchema(pa.DataFrameModel):
    code: pa.typing.Series[str]

    @pa.parse("code")
    def to_str(cls, series):
        return series.astype(str).str.zfill(6)


class ErrorSchema(pa.DataFrameModel):
    code: pa.typing.Series[str]


df = pd.DataFrame({"code": [123, 234]})

print(PassSchema.parse_and_validate(df))

# Output:
#      code
# 0  000123
# 1  000234

print(ErrorSchema.parse_and_validate(df))

# Output:
# ~~~~
# pandera.errors.SchemaError: expected series 'code' to have type str:
# failure cases:
#    index  failure_case
# 0      0           123
# 1      1           234

@ghost ghost closed this May 10, 2023

ghost commented May 10, 2023

Reopened after updating the description.

@ghost ghost reopened this May 10, 2023
@ghost ghost force-pushed the feature/parse branch 2 times, most recently from fdc3c08 to ffa255f Compare May 10, 2023 05:08

cosmicBboy commented May 10, 2023

This is awesome @ShishinMo! I think this is a good start 🔥

We should also make sure this works with the object-based DataFrameSchema API. In fact, the parsing logic should be implemented in the DataFrameSchemaBackend.validate method, similar to run_checks.

Essentially, a Parser is analogous to a Check, but it returns transformed data instead of boolean data. The idea is that DataFrameSchema and DataFrameSchemaBackend (and the corresponding schema components) provide the base layer that implements the schema specification and the validation/parsing logic. DataFrameModel and model components "compile" down to DataFrameSchema and shouldn't actually do any validation/parsing themselves.

Steps Needed

  • pandera.api.base.parsers: you did this already with MetaParse and BaseParse. (nit: rename parses to parsers)
  • pandera.api.parsers: you did this already with Parse.
  • pandera.backends.base.BaseParserBackend: you already did this with BaseParseBackend
  • pandera.backends.pandas.parsers: this should implement a PandasParserBackend, similar to PandasCheckBackend. This implements the core logic of parser data transformation based on Parser.parse_fn. Note that this will be substantially simpler than PandasCheckBackend, since all it does is apply the parse_fn to the data.
  • The container, array, and components API modules need to be updated so they accept a parsers argument, similar to a checks argument. For the initial support, let's only support parsers for Column, DataFrameSchema, and SeriesSchema classes (not Index or MultiIndex).
  • The container, array, and components backend modules need to be updated so that they have a run_parsers method (similar to run_checks)
  • I think this parser functionality should be part of the validate method call, no need for a separate parse_and_validate method.
  • Implement pandera.api.pandas.model_components.parse and dataframe_parse method decorators, similar to check and dataframe_check.
  • Add tests for DataFrameModel and DataFrameSchema and model/schema components
  • Add documentation
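To make the backend step concrete, here is a minimal sketch of the proposed Parser/backend split. This is illustrative only, not pandera's actual implementation: the class names mirror the modules listed above, but the bodies are assumptions.

```python
import pandas as pd

# Hypothetical sketch (not pandera internals): a Parser holds a
# transformation function, and the backend applies it to the data --
# the analogue of PandasCheckBackend, but returning transformed data
# instead of booleans.
class Parser:
    def __init__(self, parse_fn):
        self.parse_fn = parse_fn


class PandasParserBackend:
    def __init__(self, parser):
        self.parser = parser

    def __call__(self, data):
        # Core logic: simply apply parse_fn to the series/dataframe.
        return self.parser.parse_fn(data)


backend = PandasParserBackend(Parser(lambda s: s.transform("sqrt")))
print(backend(pd.Series([144.0, 256.0])).tolist())  # [12.0, 16.0]
```

As noted above, this is substantially simpler than the check backend, since all it does is apply parse_fn to the data.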

Code Example

object-based API

pa.DataFrameSchema(
    # parsers at the schema level have access to the entire dataframe
    parsers=[pa.Parser(lambda df: df.transform("sqrt"))],
    columns={
        # parsers at the column level
        "col1": pa.Column(parsers=[pa.Parser(lambda series: series.transform("sqrt"))])
    }
)

And the equivalent class-based API

# parsers at the schema level have access to the entire dataframe
class Model(pa.DataFrameModel):
    col1: pa.typing.Series[float]

    @pa.dataframe_parse
    def dataframe_sqrt(cls, df):
        return df.transform("sqrt")

    # parsers at the column level
    @pa.parse("col1")
    def sqrt(cls, series):
        return series.transform("sqrt")

Open Questions

  • How many parsers should be allowed per schema/schema component? Should it be one per schema, or can the user provide a list of parsers? If we want to support built-in parsers (e.g. normalizing by mean/std, clipping negative values, etc.) it may make sense to support a list of parsers that form a sort of constrained data transformation pipeline (i.e. the output shape must match the input shape). This affects the next question:
  • In what order are parsers executed? I feel like type-coercion -> dataframe parsers -> component-level parsers -> checks makes sense. Basically, do all data transformations first, then apply the checks. Checks are all independent of each other anyway, so this is a clean way of guaranteeing that, after calling validate, the data returned is type-correct, parsed according to the parsers specified, and passing all checks.
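The proposed ordering can be sketched with plain callables. This is an illustration of the pipeline idea only, not pandera's API; the parser and check functions here are made up for the example.

```python
import pandas as pd

# Illustrative ordering: type coercion -> dataframe-level parsers ->
# column-level parsers -> checks.
df = pd.DataFrame({"col1": ["144.0", "-256.0"]})

coerce = lambda d: d.astype(float)
dataframe_parsers = [lambda d: d.abs()]                  # e.g. drop signs
column_parsers = {"col1": [lambda s: s.transform("sqrt")]}
checks = {"col1": [lambda s: bool((s >= 0).all())]}

out = coerce(df)
for fn in dataframe_parsers:                             # dataframe level first
    out = fn(out)
for col, fns in column_parsers.items():                  # then per-column
    for fn in fns:
        out[col] = fn(out[col])
assert all(fn(out[c]) for c, fns in checks.items() for fn in fns)
print(out["col1"].tolist())  # [12.0, 16.0]
```

Note how the check only passes because the parsers ran first: the raw strings are coerced, the negative value is clipped by the dataframe parser, and only then is sqrt applied and the non-negativity check evaluated.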

Whew! Sorry if this is a lot to dump on this one PR, but your initiative kicking this off got me going! This is a lot of additional changes, and I can definitely help implement many of them.

As a good starting point, I'd recommend adding support just for column-level/series-level parsers and getting everything working there. I can help get this over the finish line, but if you can get it started, I'd love to get this feature into the core codebase!

@utopianf
Contributor

(Sorry for replying from a different account; I merged my organization account and my individual account.)
Thanks @cosmicBboy, I will try to go through the steps you wrote and update the feature as you suggested!

For your questions,

  1. I defined def _collect_parse_infos(), which collects all the parsers into __parses__. __parses__ is a list, so parsers are applied to a series in the same order as their definitions. This will force users to order their function definitions, which may be similar to pydantic.
  2. In parse_and_validate(), the series is first parsed and then validated, which means the current order is parse -> coercion. I totally agree with your recommendation and will update it.

@cosmicBboy
Collaborator

parses is a list, so parsers are applied to a series in the same order as their definitions. This will force users to order their function definitions, which may be similar to pydantic.

Sounds good! That should also collect dataframe_parse methods, which should then be fed into the DataFrameSchema constructor

@utopianf utopianf force-pushed the feature/parse branch 4 times, most recently from 1d32676 to b7159f4 Compare June 14, 2023 04:52
@utopianf
Contributor

@cosmicBboy Sorry for the very late commit, but I have:

  • updated container, array, and components modules
  • implemented pandera.api.pandas.model_components.parse and dataframe_parse

Now we can parse a dataframe in two ways, as follows.

import pandera as pa
import pandas as pd

df = pd.DataFrame({"col1": [144.0, 256.0, 1024.0], "col2": [144.0, 256.0, 1024.0]})

schema = pa.DataFrameSchema(
    parsers=[pa.Parser(lambda df: df.transform("sqrt"))],
    columns={
        "col1": pa.Column(parsers=[pa.Parser(lambda series: series.transform("sqrt"))])
    },
)
schema.validate(df)
###        col1  col2
### 0  3.464102  12.0
### 1  4.000000  16.0
### 2  5.656854  32.0

or,

import pandera as pa
import pandas as pd

df = pd.DataFrame({"col1": [144.0, 256.0, 1024.0], "col2": [144.0, 256.0, 1024.0]})


class Model(pa.DataFrameModel):
    col1: pa.typing.Series[float]

    @pa.dataframe_parse
    def df_sqrt(cls, df):
        return df.transform("sqrt")

    @pa.parse("col1")
    def sqrt(cls, series):
        return series.transform("sqrt")

print(Model.validate(df))
###        col1  col2
### 0  3.464102  12.0
### 1  4.000000  16.0
### 2  5.656854  32.0

@cosmicBboy
Collaborator

thanks @utopianf, there are linter and unit test errors (see the failing checks). Check out the contributing guide to learn how to run linters and unit tests locally to make sure everything's passing.

I can help out some time in the next two weeks if you're still having trouble.


codecov bot commented Jun 29, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.60%. Comparing base (4df61da) to head (3cc986d).
Report is 55 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1181       +/-   ##
===========================================
- Coverage   94.29%   83.60%   -10.69%     
===========================================
  Files          91      114       +23     
  Lines        7024     8446     +1422     
===========================================
+ Hits         6623     7061      +438     
- Misses        401     1385      +984     


@cosmicBboy
Collaborator

@utopianf any progress on this PR?

@utopianf
Contributor

@cosmicBboy I'm sorry. I was busy for a while and then I neglected it. I will make the linter-related corrections and push them this weekend.

@utopianf
Contributor

utopianf commented Mar 26, 2024

@cosmicBboy Sorry for keeping you waiting. The upstream code has changed significantly compared to before, so I rewrote my code to fit the current codebase. I've pushed a version that passes all tests locally for now. I'll start writing new tests from here. Here are some tests as well!
And now the remaining task should be the documentation.

@utopianf utopianf force-pushed the feature/parse branch 7 times, most recently from 79a7071 to e21748c Compare March 29, 2024 07:51
@utopianf
Contributor

Hi @cosmicBboy - added documentation (sorry, my English may be poor); feel free to request any additional work if needed.

@cosmicBboy
Collaborator

thanks @utopianf! Looks like a bunch of unit tests are failing ^^

@cosmicBboy
Collaborator

@utopianf can you give me push permission to your fork of pandera? I recently made some changes to the docs (using myst instead of rst) and need to make some changes.

@utopianf
Contributor

utopianf commented Apr 1, 2024

@cosmicBboy of course! I have invited you as a collaborator, but I'm not sure if that is enough. Feel free to tell me if it does not work.

@cosmicBboy
Collaborator

Done! Hopefully all tests and docs build should pass now.

Thanks again for all the work on this PR, this is gonna be awesome!

In terms of the docs, I think it's almost there; a few more sections need to be added:

  1. how to create dataframe-level parsers
  2. creating parsers with pa.DataFrameModel using the @parse and @dataframe_parse decorators
  3. explaining the order of the validation pipeline with an example: dataframe parsing -> column parsing -> dataframe checks -> column checks
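For the ordering section in point 3, a minimal sketch could show why the pipeline order is observable. This uses plain callables rather than pandera's API, and the parser functions are invented for illustration.

```python
import pandas as pd

# The dataframe-level parser runs before the column-level one, so col1
# becomes (x + 1) * 2 rather than x * 2 + 1.
df = pd.DataFrame({"col1": [1.0, 2.0]})

dataframe_parser = lambda d: d + 1   # runs first
column_parser = lambda s: s * 2      # runs second

out = dataframe_parser(df)
out["col1"] = column_parser(out["col1"])
print(out["col1"].tolist())  # [4.0, 6.0]
```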

@cosmicBboy
Collaborator

Will also need to add more tests to get better code coverage of the new parser functionality. (see the codecov warnings: https://github.com/unionai-oss/pandera/pull/1181/files#diff-1756d33acfa3e810dfcc642cb0f42630446dea86a262e3201290e23c19400404)

self.parser_fn = partial(parser._parser_fn, **parser._parser_kwargs)

@overload
def prerprocess(
Collaborator


typo: prerprocess -> preprocess

Contributor


Fixed

ClassParser = Callable[[Union[classmethod, AnyCallable]], classmethod]


def parse(*fields, **parse_kwargs) -> ClassParser:
Collaborator


should this be parser? I feel like the verb form parse also makes sense, but my sense is it should just be the lowercase form of the Parser object.

Contributor


I am fine with either, so I respect your opinion.
Fixed

@cosmicBboy
Collaborator

@utopianf if you can add the tests I can work on the docs

utopianf and others added 11 commits April 11, 2024 03:07
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: Shishin Mo <maoson0307@gmail.com>
@utopianf
Contributor

@cosmicBboy sorry for being late. I have added some tests and done some refactoring to remove some unused code. Hope this works!

cosmicBboy and others added 4 commits April 11, 2024 15:02
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
…t branches

Signed-off-by: Shishin Mo <maoson0307@gmail.com>
… behaviour for DataframeSchame

Signed-off-by: Shishin Mo <maoson0307@gmail.com>
Signed-off-by: cosmicBboy <niels.bantilan@gmail.com>
Collaborator

@cosmicBboy cosmicBboy left a comment


Great stuff @utopianf! Thanks for your work on this 🚀

@cosmicBboy cosmicBboy merged commit eff9329 into unionai-oss:main Apr 12, 2024
73 of 74 checks passed
@utopianf utopianf deleted the feature/parse branch April 16, 2024 13:09