Fast validation of large pyspark dataframes #1312

Answered by cosmicBboy
DanielLenz asked this question in Q&A

> Is there a better way to validate the schema?

So the head kwarg isn't actually used in the validate method (it's there for API compatibility)... this should really raise an error or warning.

A few questions:

  • I'm assuming based on head=100 you don't actually want to validate the entire dataset. Is that correct?
  • Is there a way you can take only a few partitions of the data when you load it so that you're only validating a subset?
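
Since the `head` argument is effectively ignored, any subsetting has to happen on the Spark side before `validate` is called. Here is a minimal sketch of that idea; the parquet path, column names, and schema are illustrative assumptions, not taken from the original question:

```python
import pandera as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for the real schema from the question.
schema = pa.DataFrameSchema(
    {
        "id": pa.Column(int),
        "value": pa.Column(float, pa.Check.ge(0)),
    }
)

# Option 1: read only a few of the dataset's partition files via a path glob.
subset = spark.read.parquet("/data/my_dataset/part-0000*.parquet")

# Option 2: read the full dataset but cap the number of rows up front.
# subset = spark.read.parquet("/data/my_dataset").limit(10_000)

# Validate just the small subset, converted to pandas for a pandas-based schema.
schema.validate(subset.toPandas())
```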

> I do not need the .validate() method to return anything; I'm just using it for the check. Is there a better way to use pandera here?

The way you're using it is currently the recommended way to pull out the errors.
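
For reference, the pattern being described is pandera's lazy validation: call `validate(..., lazy=True)`, catch `SchemaErrors`, and inspect its `failure_cases` dataframe, which lists every failed check. A self-contained sketch with toy data (the schema and dataframe here are illustrative, not the ones from the question):

```python
import pandas as pd
import pandera as pa
from pandera.errors import SchemaErrors

# Illustrative schema and data.
schema = pa.DataFrameSchema(
    {
        "id": pa.Column(int),
        "value": pa.Column(float, pa.Check.ge(0)),
    }
)
df = pd.DataFrame({"id": [1, 2], "value": [0.5, -1.0]})

try:
    # lazy=True collects all failures instead of raising on the first one.
    schema.validate(df, lazy=True)
except SchemaErrors as exc:
    # failure_cases is a dataframe with one row per failed check.
    print(exc.failure_cases)
```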

@NeerajMalhotra-QB @jaskaransinghsidana
