Infer Pandas string columns in Arrow conversion on Python 2 #679

rshkv · 2020-05-02T16:50:54Z

Problem

Python 2 strings are stored as bytes (as opposed to unicode in Python 3). That means when serializing a Pandas dataframe, Arrow can't tell whether a column holds strings or actual binary data, so it assumes binary.

In [1]: pandas_df = pd.DataFrame(data={"string_col": ["a", "b", "c"], "nested_string_col": [["a"], ["b"], ["c"]]})

In [2]: spark.createDataFrame(pandas_df).printSchema()
root
 |-- nested_string_col: array (nullable = true)  
 |    |-- element: binary (containsNull = true) # oops 
 |-- string_col: binary (nullable = true) # oops
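
The ambiguity can be seen from plain Python. A minimal check that should run on either interpreter: on Python 2 the names str and bytes refer to the same type, so a text literal cannot be told apart from binary data by type alone, while on Python 3 they are distinct. The variable names are only for illustration.

```python
import sys

value = "a"  # a plain string literal, like the values in the dataframe above

# On Python 2, str is an alias of bytes, so this is True.
# On Python 3, str is unicode text, so this is False.
literal_is_bytes = isinstance(value, bytes)
is_python2 = sys.version_info[0] == 2

print(literal_is_bytes, is_python2)
```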

Why We Care

This is only an issue when using createDataFrame(pandas_df) with the spark.sql.execution.arrow.enabled flag set. We have two internal products that serialize Pandas dataframes, and the one with the arrows enables this flag by default.

With pyarrow <0.10, Arrow serialization fails anyway because of this check (GHE). That is why unexpected binary columns didn't surprise us until recently, when we upgraded to >0.10.

Note that this is irrelevant for @pandas_udf, which requires you to provide a schema. If the schema says string, Arrow serialization will do the right thing and not give you binary.

Proposed Changes

We give pyarrow a chance to infer the right column type, and then we nudge it a little depending on what it thinks the type is. There are three different cases:

If pyarrow says the column is binary, we use Pandas' type inference to check whether it is actually string instead. If it is, we tell Arrow to serialize it as string type. We only do this on Python 2.
For array<binary>, we raise an error so that conversion falls back to non-Arrow serialization, which doesn't confuse binary and string. Again, only on Python 2.
For a struct containing binary, we don't do anything because Arrow doesn't support it anyway. This will change when we bump to 0.15.1. I added a test to remind us.
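
The three cases above can be sketched as a small decision function. This is an illustration only, not the code in the patch: resolve_column_type, ArrowFallback, and the use of plain type-name strings in place of real pyarrow types are all assumptions made to keep the example self-contained.

```python
# Hypothetical sketch of the three cases. Type names are plain strings here,
# not real pyarrow types, and none of these names come from the patch.


class ArrowFallback(Exception):
    """Raised so the caller falls back to non-Arrow serialization."""


def resolve_column_type(arrow_type, pandas_inferred, is_python2):
    """Return the type name a column should be serialized as.

    arrow_type: what pyarrow inferred, e.g. "binary" or "list<item: binary>".
    pandas_inferred: what Pandas' own inference reports, e.g. "string".
    is_python2: whether we are running under Python 2.
    """
    if not is_python2:
        # Python 3 keeps text and bytes apart, so trust pyarrow.
        return arrow_type
    if arrow_type == "binary":
        # Case 1: ask Pandas whether the column is really string.
        return "string" if pandas_inferred == "string" else "binary"
    if arrow_type == "list<item: binary>":
        # Case 2: give up on Arrow and let the caller fall back.
        raise ArrowFallback("list<item: binary> is ambiguous on Python 2")
    # Case 3: structs containing binary (and everything else) are untouched.
    return arrow_type
```

A caller would catch ArrowFallback and continue with the non-Arrow path, which matches the warning-then-fallback behaviour shown in the test output further down.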

Why Not Take This From Upstream

Upstream upgraded their arrow / pyarrow for the Spark 3 release, which won't support Python 2, so they don't have to deal with this. But we want a higher pyarrow version before we can move off Python 2.

How This Patch Was Tested

I added three unit tests for createDataFrame with string, array<string>, and struct<string, string> as input. Each verifies that the schema is the same regardless of whether Arrow serialization was used. I also verified manually that pandas_udf still works for array<string> and array<binary> outputs for pyarrow 0.8.0 and 0.12.1.

In [1]: pandas_df = pd.DataFrame({"string_col": ["a", "b", "c"]})

In [2]: spark.createDataFrame(pandas_df).printSchema()
root
 |-- string_col: string (nullable = true)


In [3]: pandas_df = pd.DataFrame({"string_col": ["a", "b", "c"], "nested_string_col": [["a"], ["b"], ["c"]]})

In [4]: spark.createDataFrame(pandas_df).printSchema()
/Volumes/git/spark/python/pyspark/sql/session.py:758: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
  Unsupported type in conversion from Arrow: list<item: binary>
Please use Python3 for support of BinaryType in arrays.
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.
  warnings.warn(msg)
root
 |-- nested_string_col: array (nullable = true)
 |    |-- element: string (containsNull = true) # nice
 |-- string_col: string (nullable = true) # nice

robert3005 · 2020-05-05T14:05:26Z

I never looked into this issue but the code looks reasonable to me 👍

When serializing a Pandas dataframe using Arrow under Python 2, Arrow can't tell if string columns should be serialized as string type or as binary (due to how Python 2 stores strings). The result is that Arrow serializes string columns in Pandas dataframes to binary ones. We can remove this when we discontinue support for Python 2. See original PR [1] and follow-up for 'mixed' type columns [2]. [1] #679 [2] #702

helenyuyu and others added 3 commits May 2, 2020 19:01

Test and fix binary/string conversion through Arrow under Python 2

e75da85

Don't break pandas_udf

89c9828

Don't unnecessarily infer for pyarrow < 0.10.0

1828aee

rshkv force-pushed the wr/python2-arrow-strings branch from 2b34ede to 1828aee Compare May 2, 2020 17:08

rshkv requested a review from robert3005 May 2, 2020 17:09

rshkv changed the title ~~Infer Pandas string columns in conversion to Arrow under Python 2~~ Infer Pandas string columns in Arrow conversion on Python 2 May 2, 2020

rshkv mentioned this pull request May 2, 2020

Revert binary/string column logic once we move off Python 2 #678

Open

Support pyarrow 0.10.0 as well

347a968

rshkv force-pushed the wr/python2-arrow-strings branch from 3b9148e to 347a968 Compare May 3, 2020 16:44

rshkv mentioned this pull request May 5, 2020

Bump pyarrow to 0.12.1 #649

Merged

helenyugithub mentioned this pull request May 6, 2020

Decode strings in python2 prior to doing pandas to pyspark conversion #673

Closed

rshkv merged commit e4f64d3 into master May 7, 2020

rshkv deleted the wr/python2-arrow-strings branch May 7, 2020 10:39

rshkv mentioned this pull request Jul 21, 2020

Infer 'mixed' types as strings when using Arrow serialization #702

Merged
