Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infer Pandas string columns in Arrow conversion on Python 2 #679

Merged
merged 4 commits into from
May 7, 2020

Conversation

rshkv
Copy link

@rshkv rshkv commented May 2, 2020

Problem

Python 2 strings are stored as binary (as opposed to unicode in Python 3). That means when serializing a Pandas dataframe, Arrow can't tell if columns are string or actually binary, so it assumes binary.

In [1]: pandas_df = pd.DataFrame(data={"string_col": ["a", "b", "c"], "nested_string_col": [["a"], ["b"], ["c"]]})

In [2]: spark.createDataFrame(pandas_df).printSchema()
root
 |-- nested_string_col: array (nullable = true)  
 |    |-- element: binary (containsNull = true) # oops 
 |-- string_col: binary (nullable = true) # oops

Why We Care

This is only an issue when using createDataFrame(pandas_df) with the spark.sql.execution.arrow.enabled flag. We have two internal products that serialize Pandas data frames. And the one with the arrows enables this flag by default.

If used with pyarrow <0.10, Arrow serialization fails anyway courtesy of this check (GHE). This is why unexpected binary columns haven't surprised us until recently, when we upgraded to >0.10.

Note that this is irrelevant for @pandas_udf which requires you to provide a schema. If the schema says string, Arrow serialization will do the right thing and not give you binary.

Proposed Changes

We give pyarrow a chance to infer the right column type, and then we nudge it a little depending on what it thinks the type is. There are three different cases:

  1. If pyarrow says the column is binary, we use Pandas' type inference to check if it isn't actually string instead. And if is, we tell Arrow to serialize as string type. We only do this for Python 2.
  2. For array<binary>, we throw an error falling back to non-Arrow serialization which doesn't confuse binary and string. Again, only Python 2.
  3. For struct containing binary, we don't do anything because Arrow doesn't support it anyway. This will change when we bump to 0.15.1. I added a test to remind us.

Why Not Take This From Upstream

Upstream upgraded their arrow / pyarrow for the Spark 3 release which won't support Python 2. They don't have to deal with it. But we want a higher pyarrow version before we can move off Python 2.

How This Patch Was Tested

I added three unit tests for createDataFrame with string, array<string>, and struct<string, string> as input. Each verifies that the schema is the same regardless of whether Arrow serialization was used. I also verified manually that pandas_udf still works for array<string> and array<binary> outputs for pyarrow 0.8.0 and 0.12.1.

In [1]: pandas_df = pd.DataFrame({"string_col": ["a", "b", "c"]})

In [2]: spark.createDataFrame(pandas_df).printSchema()
root
 |-- string_col: string (nullable = true)


In [3]: pandas_df = pd.DataFrame({"string_col": ["a", "b", "c"], "nested_string_col": [["a"], ["b"], ["c"]]})

In [4]: spark.createDataFrame(pandas_df).printSchema()
/Volumes/git/spark/python/pyspark/sql/session.py:758: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
  Unsupported type in conversion from Arrow: list<item: binary>
Please use Python3 for support of BinaryType in arrays.
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.
  warnings.warn(msg)
root
 |-- nested_string_col: array (nullable = true)
 |    |-- element: string (containsNull = true) # nice
 |-- string_col: string (nullable = true) # nice

@rshkv rshkv force-pushed the wr/python2-arrow-strings branch from 2b34ede to 1828aee Compare May 2, 2020 17:08
@rshkv rshkv requested a review from robert3005 May 2, 2020 17:09
@rshkv rshkv changed the title Infer Pandas string columns in conversion to Arrow under Python 2 Infer Pandas string columns in Arrow conversion on Python 2 May 2, 2020
@rshkv rshkv force-pushed the wr/python2-arrow-strings branch from 3b9148e to 347a968 Compare May 3, 2020 16:44
@rshkv rshkv mentioned this pull request May 5, 2020
@robert3005
Copy link

I never looked into this issue but the code looks reasonable to me 👍

@rshkv rshkv merged commit e4f64d3 into master May 7, 2020
@rshkv rshkv deleted the wr/python2-arrow-strings branch May 7, 2020 10:39
rshkv added a commit that referenced this pull request Feb 26, 2021
When serializing a Pandas dataframe using Arrow under Python 2, Arrow
can't tell if string columns should be serialized as string type or as
binary (due to how Python 2 stores strings). The result is that Arrow
serializes string columns in Pandas dataframes to binary ones.

We can remove this when we discontinue support for Python 2.

See original PR [1] and follow-up for 'mixed' type columns [2].

[1] #679
[2] #702
rshkv added a commit that referenced this pull request Mar 2, 2021
When serializing a Pandas dataframe using Arrow under Python 2, Arrow
can't tell if string columns should be serialized as string type or as
binary (due to how Python 2 stores strings). The result is that Arrow
serializes string columns in Pandas dataframes to binary ones.

We can remove this when we discontinue support for Python 2.

See original PR [1] and follow-up for 'mixed' type columns [2].

[1] #679
[2] #702
jdcasale pushed a commit that referenced this pull request Mar 3, 2021
When serializing a Pandas dataframe using Arrow under Python 2, Arrow
can't tell if string columns should be serialized as string type or as
binary (due to how Python 2 stores strings). The result is that Arrow
serializes string columns in Pandas dataframes to binary ones.

We can remove this when we discontinue support for Python 2.

See original PR [1] and follow-up for 'mixed' type columns [2].

[1] #679
[2] #702
rshkv added a commit that referenced this pull request Mar 4, 2021
When serializing a Pandas dataframe using Arrow under Python 2, Arrow
can't tell if string columns should be serialized as string type or as
binary (due to how Python 2 stores strings). The result is that Arrow
serializes string columns in Pandas dataframes to binary ones.

We can remove this when we discontinue support for Python 2.

See original PR [1] and follow-up for 'mixed' type columns [2].

[1] #679
[2] #702
rshkv added a commit that referenced this pull request Mar 9, 2021
When serializing a Pandas dataframe using Arrow under Python 2, Arrow
can't tell if string columns should be serialized as string type or as
binary (due to how Python 2 stores strings). The result is that Arrow
serializes string columns in Pandas dataframes to binary ones.

We can remove this when we discontinue support for Python 2.

See original PR [1] and follow-up for 'mixed' type columns [2].

[1] #679
[2] #702
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants