
[BUG] csv_test:test_basic_csv_read FAILED #5211

Closed · NvTimLiu opened this issue Apr 12, 2022 · 6 comments · Fixed by #5230

Assignees: andygrove
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

@NvTimLiu (Collaborator) commented Apr 12, 2022

Describe the bug

csv_test:test_basic_csv_read FAILED on branch-22.06 with an assert_gpu_and_cpu_are_equal_collect assertion error

[2022-04-12T04:54:29.825Z] =========================== short test summary info ============================
[2022-04-12T04:54:29.825Z] FAILED ../../src/main/python/csv_test.py::test_basic_csv_read[true--read_csv_df-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}][APPROXIMATE_FLOAT]
[2022-04-12T04:54:29.825Z] FAILED ../../src/main/python/csv_test.py::test_basic_csv_read[true--read_csv_sql-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}][APPROXIMATE_FLOAT]
[2022-04-12T04:54:29.825Z] FAILED ../../src/main/python/csv_test.py::test_basic_csv_read[true-csv-read_csv_df-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}][APPROXIMATE_FLOAT]
[2022-04-12T04:54:29.825Z] FAILED ../../src/main/python/csv_test.py::test_basic_csv_read[true-csv-read_csv_sql-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}][APPROXIMATE_FLOAT]
[2022-04-12T04:54:29.826Z] FAILED ../../src/main/python/csv_test.py::test_basic_csv_read[false--read_csv_df-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}][APPROXIMATE_FLOAT]
[2022-04-12T04:54:29.826Z] FAILED ../../src/main/python/csv_test.py::test_basic_csv_read[false--read_csv_sql-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}][APPROXIMATE_FLOAT]
[2022-04-12T04:54:29.826Z] FAILED ../../src/main/python/csv_test.py::test_basic_csv_read[false-csv-read_csv_df-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}][APPROXIMATE_FLOAT]
[2022-04-12T04:54:29.826Z] FAILED ../../src/main/python/csv_test.py::test_basic_csv_read[false-csv-read_csv_sql-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}][APPROXIMATE_FLOAT]

Detailed log:

../../src/main/python/asserts.py:82: AssertionError
 ----------------------------- Captured stdout call -----------------------------
 ### CPU RUN ###
 ### GPU RUN ###
 ### COLLECT: GPU TOOK 0.2552223205566406 CPU TOOK 0.2035841941833496 ###
 CPU OUTPUT: [Row(number=1.042), Row(number=None), Row(number=None), Row(number=None), Row(number=None), Row(number=98.343), Row(number=223823.9484), Row(number=23848545.0374), Row(number=184721.23987223), Row(number=3.4028235e+38), Row(number=-0.0), Row(number=3.4028235e+38), Row(number=3.4028236e+38), Row(number=3.4028236e+38), Row(number=1.7976931348623157e+308), Row(number=1.7976931348623157e+308), Row(number=1.7976931348623157e+308), Row(number=1.2e-234)]
 GPU OUTPUT: [Row(number=1.042), Row(number=None), Row(number=None), Row(number=None), Row(number=None), Row(number=98.343), Row(number=223823.9484), Row(number=23848545.0374), Row(number=184721.23987223), Row(number=3.4028235e+38), Row(number=-0.0), Row(number=3.4028235e+38), Row(number=3.4028236e+38), Row(number=3.4028236e+38), Row(number=1.7976931348623157e+308), Row(number=1.7976931348623157e+308), Row(number=inf), Row(number=1.2e-234)]
 ----------------------------- Captured stderr call -----------------------------
 22/04/12 03:44:15 WARN SecuredHiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider csv. Persisting data source table `default`.`tmp_table_gw1_651411914_0` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
 22/04/12 03:44:15 WARN SecuredHiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider csv. Persisting data source table `default`.`tmp_table_gw1_651411914_1` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
 _ test_basic_csv_read[true-csv-read_csv_df-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}] _
 [gw1] linux -- Python 3.8.13 /databricks/conda/envs/cudf-udf/bin/python
 
 std_input_path = '/home/ubuntu/spark-rapids/integration_tests/src/test/resources'
 name = 'simple_float_values.csv'
 schema = StructType(List(StructField(number,DoubleType,true)))
 options = {'header': 'true'}
 read_func = <function read_csv_df at 0x7ff760e02b80>, v1_enabled_list = 'csv'
 ansi_enabled = 'true'
 spark_tmp_table_factory = <conftest.TmpTableFactory object at 0x7ff74a9faee0>
 
    @approximate_float
    @pytest.mark.parametrize('name,schema,options', [
         ('Acquisition_2007Q3.txt', _acq_schema, {'sep': '|'}),
         ('Performance_2007Q3.txt_0', _perf_schema, {'sep': '|'}),
         ('ts.csv', _date_schema, {}),
         ('date.csv', _date_schema, {}),
         ('ts.csv', _ts_schema, {}),
         ('str.csv', _ts_schema, {}),
         ('str.csv', _bad_str_schema, {'header': 'true'}),
         ('str.csv', _good_str_schema, {'header': 'true'}),
         ('no-comments.csv', _three_str_schema, {}),
         ('empty.csv', _three_str_schema, {}),
         ('just_comments.csv', _three_str_schema, {'comment': '#'}),
         ('trucks.csv', _trucks_schema, {'header': 'true'}),
         ('trucks.tsv', _trucks_schema, {'sep': '\t', 'header': 'true'}),
         ('trucks-different.csv', _trucks_schema, {'sep': '|', 'header': 'true', 'quote': "'"}),
         ('trucks-blank-names.csv', _trucks_schema, {'header': 'true'}),
         ('trucks-windows.csv', _trucks_schema, {'header': 'true'}),
         ('trucks-empty-values.csv', _trucks_schema, {'header': 'true'}),
         ('trucks-extra-columns.csv', _trucks_schema, {'header': 'true'}),
         pytest.param('trucks-comments.csv', _trucks_schema, {'header': 'true', 'comment': '~'}, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/2066')),
         ('trucks-more-comments.csv', _trucks_schema,  {'header': 'true', 'comment': '#'}),
         pytest.param('trucks-missing-quotes.csv', _trucks_schema, {'header': 'true'}, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/130')),
         pytest.param('trucks-null.csv', _trucks_schema, {'header': 'true', 'nullValue': 'null'}, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/2068')),
         pytest.param('trucks-null.csv', _trucks_schema, {'header': 'true'}, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/1986')),
         pytest.param('simple_int_values.csv', _byte_schema, {'header': 'true'}),
         pytest.param('simple_int_values.csv', _short_schema, {'header': 'true'}),
         pytest.param('simple_int_values.csv', _int_schema, {'header': 'true'}),
         pytest.param('simple_int_values.csv', _long_schema, {'header': 'true'}),
         ('simple_int_values.csv', _float_schema, {'header': 'true'}),
         ('simple_int_values.csv', _double_schema, {'header': 'true'}),
         ('simple_int_values.csv', _decimal_10_2_schema, {'header': 'true'}),
         ('decimals.csv', _decimal_10_2_schema, {'header': 'true'}),
         ('decimals.csv', _decimal_10_3_schema, {'header': 'true'}),
         pytest.param('empty_int_values.csv', _empty_byte_schema, {'header': 'true'}),
         pytest.param('empty_int_values.csv', _empty_short_schema, {'header': 'true'}),
         pytest.param('empty_int_values.csv', _empty_int_schema, {'header': 'true'}),
         pytest.param('empty_int_values.csv', _empty_long_schema, {'header': 'true'}),
         pytest.param('empty_int_values.csv', _empty_float_schema, {'header': 'true'}),
         pytest.param('empty_int_values.csv', _empty_double_schema, {'header': 'true'}),
         pytest.param('nan_and_inf.csv', _float_schema, {'header': 'true'}, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/125')),
         pytest.param('floats_invalid.csv', _float_schema, {'header': 'true'}),
         pytest.param('floats_invalid.csv', _double_schema, {'header': 'true'}),
         pytest.param('simple_float_values.csv', _byte_schema, {'header': 'true'}),
         pytest.param('simple_float_values.csv', _short_schema, {'header': 'true'}),
         pytest.param('simple_float_values.csv', _int_schema, {'header': 'true'}),
         pytest.param('simple_float_values.csv', _long_schema, {'header': 'true'}),
         pytest.param('simple_float_values.csv', _float_schema, {'header': 'true'}),
         pytest.param('simple_float_values.csv', _double_schema, {'header': 'true'}),
         pytest.param('simple_float_values.csv', _decimal_10_2_schema, {'header': 'true'}),
         pytest.param('simple_float_values.csv', _decimal_10_3_schema, {'header': 'true'}),
         pytest.param('simple_boolean_values.csv', _bool_schema, {'header': 'true'}),
         pytest.param('ints_with_whitespace.csv', _number_as_string_schema, {'header': 'true'}, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/2069')),
         pytest.param('ints_with_whitespace.csv', _byte_schema, {'header': 'true'}, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/130'))
         ], ids=idfn)
     @pytest.mark.parametrize('read_func', [read_csv_df, read_csv_sql])
     @pytest.mark.parametrize('v1_enabled_list', ["", "csv"])
     @pytest.mark.parametrize('ansi_enabled', ["true", "false"])
     def test_basic_csv_read(std_input_path, name, schema, options, read_func, v1_enabled_list, ansi_enabled, spark_tmp_table_factory):
         updated_conf=copy_and_update(_enable_all_types_conf, {
             'spark.sql.sources.useV1SourceList': v1_enabled_list,
             'spark.sql.ansi.enabled': ansi_enabled
         })
 >       assert_gpu_and_cpu_are_equal_collect(read_func(std_input_path + '/' + name, schema, spark_tmp_table_factory, options),
                 conf=updated_conf)
 
 ../../src/main/python/csv_test.py:257: 
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
 ../../src/main/python/asserts.py:508: in assert_gpu_and_cpu_are_equal_collect
     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
 ../../src/main/python/asserts.py:439: in _assert_gpu_and_cpu_are_equal
     assert_equal(from_cpu, from_gpu)
 ../../src/main/python/asserts.py:106: in assert_equal
     _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
 ../../src/main/python/asserts.py:42: in _assert_equal
     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
 ../../src/main/python/asserts.py:35: in _assert_equal
     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
 
 cpu = 1.7976931348623157e+308, gpu = inf
 float_check = <function get_float_check.<locals>.<lambda> at 0x7ff74b1b7ca0>
 path = [16, 'number']
 
     def _assert_equal(cpu, gpu, float_check, path):
         t = type(cpu)
         if (t is Row):
             assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
             if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
                 assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)
                 for field in cpu.__fields__:
                     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
             else:
                 for index in range(len(cpu)):
                     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
         elif (t is list):
             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
             for index in range(len(cpu)):
                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
         elif (t is tuple):
             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
             for index in range(len(cpu)):
                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
         elif (t is pytypes.GeneratorType):
             index = 0
             # generator has no zip :( so we have to do this the hard way
             done = False
             while not done:
                 sub_cpu = None
                 sub_gpu = None
                 try:
                     sub_cpu = next(cpu)
                 except StopIteration:
                     done = True
     
                 try:
                     sub_gpu = next(gpu)
                 except StopIteration:
                     done = True
     
                 if done:
                     assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
                 else:
                     _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])
     
                 index = index + 1
         elif (t is dict):
             # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark
             # so sort the items to do our best with ignoring the order of dicts
             cpu_items = list(cpu.items()).sort(key=_RowCmp)
             gpu_items = list(gpu.items()).sort(key=_RowCmp)
             _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
         elif (t is int):
             assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
         elif (t is float):
             if (math.isnan(cpu)):
                 assert math.isnan(gpu), "GPU and CPU float values are different at {}".format(path)
             else:
 >               assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
 E               AssertionError: GPU and CPU float values are different [16, 'number']
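
For context, the comparison that fails here is the float_check lambda returned by get_float_check. Its exact tolerance is not visible in this trace, but it amounts to a relative-error check, and no finite relative tolerance can match inf against a finite value. A minimal sketch of the idea (the 1e-6 tolerance is an assumed placeholder, not the harness's actual setting):

    import math

    def make_float_check(rel_tol=1e-6):
        # Relative-tolerance comparison in the spirit of get_float_check;
        # the 1e-6 tolerance is an assumed placeholder.
        def check(cpu, gpu):
            # inf only ever matches inf of the same sign, so a finite CPU
            # value against a GPU inf fails regardless of the tolerance.
            if math.isinf(cpu) or math.isinf(gpu):
                return cpu == gpu
            return abs(cpu - gpu) <= rel_tol * max(abs(cpu), abs(gpu))
        return check

    float_check = make_float_check()
    assert not float_check(1.7976931348623157e+308, float('inf'))
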
NvTimLiu added labels: bug (Something isn't working), ? - Needs Triage (Need team to review and classify) on Apr 12, 2022
@NvTimLiu (Collaborator, Author) commented Apr 12, 2022

This may relate to these commits:

#4992

#5195

#5185

Observed on the 22.06 Databricks nightly build + integration tests (rapids_databricks_nightly-dev-github 357). Let's check whether these failures occur in other IT environments.

@tgravescs (Collaborator) commented

Just a guess, but this is perhaps related to rapidsai/cudf@012af64.

@tgravescs (Collaborator) commented

I did verify that the latest spark-rapids plugin with a cuDF jar from the 10th does not have these test failures.

@andygrove (Contributor) commented

Yes, this appears to be caused by rapidsai/cudf#10622 and hitting the documented limitations we have when casting from string to float. Specifically, 1.7976931348623157e+308 being parsed as Inf.

andygrove self-assigned this on Apr 12, 2022
andygrove added this to the Apr 4 - Apr 15 milestone on Apr 12, 2022
@andygrove (Contributor) commented

> Yes, this appears to be caused by rapidsai/cudf#10622 and hitting the documented limitations we have when casting from string to float. Specifically, 1.7976931348623157e+308 being parsed as Inf.

Specifically, CPU is producing 1.7976931348623157e+308 and GPU is producing inf for at least one of these tests, which actually seems to be the opposite of the documented limitation. I am looking into this.
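
For context on why this exact value is fragile: 1.7976931348623157e+308 is the shortest decimal string that round-trips to DBL_MAX, the largest finite double, and the next representable value above DBL_MAX is infinity. A parser whose intermediate arithmetic rounds up by even one ulp therefore overflows to inf, while a correctly rounded parse stays finite. A quick illustration in plain Python (not the cuDF code path):

    import math
    import sys

    s = "1.7976931348623157e+308"

    parsed = float(s)                     # correctly rounded parse
    assert parsed == sys.float_info.max   # finite: this is DBL_MAX
    assert not math.isinf(parsed)

    # The next double above DBL_MAX is infinity (math.nextafter needs
    # Python 3.9+), so a one-ulp round-up inside a parser's intermediate
    # arithmetic tips the result over to inf.
    assert math.isinf(math.nextafter(parsed, math.inf))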

sameerz added the P0 (Must have for release) label and removed ? - Needs Triage (Need team to review and classify) on Apr 12, 2022
@sameerz (Collaborator) commented Apr 12, 2022

The fix in rapidsai/cudf#10622 addresses the handling of small floating-point values, which were being converted in a way that appeared to lose too much precision (1.0 -> 0.99999). We should be OK on the upper ranges of floats, where very large values, such as those in this test, are being parsed as Inf.

We should also add tests to confirm float conversion at the lower end of the decimal range, regardless of the epsilon tolerance our tests use; see the sketch below.
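
A rough sketch of what such a test could look like, reusing the helpers and fixtures visible in the traceback above. Note that 'small_float_values.csv' is a hypothetical fixture file (covering values near 1.0 such as 0.99999 and deep subnormals such as 4.9e-324), not an existing resource:

    @approximate_float
    @pytest.mark.parametrize('v1_enabled_list', ["", "csv"])
    def test_csv_read_small_floats(std_input_path, v1_enabled_list, spark_tmp_table_factory):
        # 'small_float_values.csv' is a placeholder; _double_schema,
        # copy_and_update, read_csv_df, and _enable_all_types_conf are the
        # csv_test.py helpers shown in the failure above.
        updated_conf = copy_and_update(_enable_all_types_conf, {
            'spark.sql.sources.useV1SourceList': v1_enabled_list})
        assert_gpu_and_cpu_are_equal_collect(
            read_csv_df(std_input_path + '/small_float_values.csv',
                        _double_schema, spark_tmp_table_factory, {'header': 'true'}),
            conf=updated_conf)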
