
[BUG] csv_test:test_basic_csv_read FAILED #5211

Closed · NvTimLiu opened this issue Apr 12, 2022 · 6 comments · Fixed by #5230

Assignees: andygrove
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

@NvTimLiu (Collaborator) commented Apr 12, 2022

Describe the bug

csv_test:test_basic_csv_read FAILED on branch-22.06 with an assert_gpu_and_cpu_are_equal_collect assertion error

[2022-04-12T04:54:29.825Z] =========================== short test summary info ============================
[2022-04-12T04:54:29.825Z] FAILED ../../src/main/python/csv_test.py::test_basic_csv_read[true--read_csv_df-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}][APPROXIMATE_FLOAT]
[2022-04-12T04:54:29.825Z] FAILED ../../src/main/python/csv_test.py::test_basic_csv_read[true--read_csv_sql-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}][APPROXIMATE_FLOAT]
[2022-04-12T04:54:29.825Z] FAILED ../../src/main/python/csv_test.py::test_basic_csv_read[true-csv-read_csv_df-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}][APPROXIMATE_FLOAT]
[2022-04-12T04:54:29.825Z] FAILED ../../src/main/python/csv_test.py::test_basic_csv_read[true-csv-read_csv_sql-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}][APPROXIMATE_FLOAT]
[2022-04-12T04:54:29.826Z] FAILED ../../src/main/python/csv_test.py::test_basic_csv_read[false--read_csv_df-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}][APPROXIMATE_FLOAT]
[2022-04-12T04:54:29.826Z] FAILED ../../src/main/python/csv_test.py::test_basic_csv_read[false--read_csv_sql-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}][APPROXIMATE_FLOAT]
[2022-04-12T04:54:29.826Z] FAILED ../../src/main/python/csv_test.py::test_basic_csv_read[false-csv-read_csv_df-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}][APPROXIMATE_FLOAT]
[2022-04-12T04:54:29.826Z] FAILED ../../src/main/python/csv_test.py::test_basic_csv_read[false-csv-read_csv_sql-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}][APPROXIMATE_FLOAT]

Detailed log:

../../src/main/python/asserts.py:82: AssertionError
 ----------------------------- Captured stdout call -----------------------------
 ### CPU RUN ###
 ### GPU RUN ###
 ### COLLECT: GPU TOOK 0.2552223205566406 CPU TOOK 0.2035841941833496 ###
 CPU OUTPUT: [Row(number=1.042), Row(number=None), Row(number=None), Row(number=None), Row(number=None), Row(number=98.343), Row(number=223823.9484), Row(number=23848545.0374), Row(number=184721.23987223), Row(number=3.4028235e+38), Row(number=-0.0), Row(number=3.4028235e+38), Row(number=3.4028236e+38), Row(number=3.4028236e+38), Row(number=1.7976931348623157e+308), Row(number=1.7976931348623157e+308), Row(number=1.7976931348623157e+308), Row(number=1.2e-234)]
 GPU OUTPUT: [Row(number=1.042), Row(number=None), Row(number=None), Row(number=None), Row(number=None), Row(number=98.343), Row(number=223823.9484), Row(number=23848545.0374), Row(number=184721.23987223), Row(number=3.4028235e+38), Row(number=-0.0), Row(number=3.4028235e+38), Row(number=3.4028236e+38), Row(number=3.4028236e+38), Row(number=1.7976931348623157e+308), Row(number=1.7976931348623157e+308), Row(number=inf), Row(number=1.2e-234)]
 ----------------------------- Captured stderr call -----------------------------
 22/04/12 03:44:15 WARN SecuredHiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider csv. Persisting data source table `default`.`tmp_table_gw1_651411914_0` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
 22/04/12 03:44:15 WARN SecuredHiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider csv. Persisting data source table `default`.`tmp_table_gw1_651411914_1` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
 _ test_basic_csv_read[true-csv-read_csv_df-simple_float_values.csv-StructType(List(StructField(number,DoubleType,true)))-{'header': 'true'}] _
 [gw1] linux -- Python 3.8.13 /databricks/conda/envs/cudf-udf/bin/python
 
 std_input_path = '/home/ubuntu/spark-rapids/integration_tests/src/test/resources'
 name = 'simple_float_values.csv'
 schema = StructType(List(StructField(number,DoubleType,true)))
 options = {'header': 'true'}
 read_func = <function read_csv_df at 0x7ff760e02b80>, v1_enabled_list = 'csv'
 ansi_enabled = 'true'
 spark_tmp_table_factory = <conftest.TmpTableFactory object at 0x7ff74a9faee0>
 
    @approximate_float
    @pytest.mark.parametrize('name,schema,options', [
         ('Acquisition_2007Q3.txt', _acq_schema, {'sep': '|'}),
         ('Performance_2007Q3.txt_0', _perf_schema, {'sep': '|'}),
         ('ts.csv', _date_schema, {}),
         ('date.csv', _date_schema, {}),
         ('ts.csv', _ts_schema, {}),
         ('str.csv', _ts_schema, {}),
         ('str.csv', _bad_str_schema, {'header': 'true'}),
         ('str.csv', _good_str_schema, {'header': 'true'}),
         ('no-comments.csv', _three_str_schema, {}),
         ('empty.csv', _three_str_schema, {}),
         ('just_comments.csv', _three_str_schema, {'comment': '#'}),
         ('trucks.csv', _trucks_schema, {'header': 'true'}),
         ('trucks.tsv', _trucks_schema, {'sep': '\t', 'header': 'true'}),
         ('trucks-different.csv', _trucks_schema, {'sep': '|', 'header': 'true', 'quote': "'"}),
         ('trucks-blank-names.csv', _trucks_schema, {'header': 'true'}),
         ('trucks-windows.csv', _trucks_schema, {'header': 'true'}),
         ('trucks-empty-values.csv', _trucks_schema, {'header': 'true'}),
         ('trucks-extra-columns.csv', _trucks_schema, {'header': 'true'}),
         pytest.param('trucks-comments.csv', _trucks_schema, {'header': 'true', 'comment': '~'}, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/2066')),
         ('trucks-more-comments.csv', _trucks_schema,  {'header': 'true', 'comment': '#'}),
         pytest.param('trucks-missing-quotes.csv', _trucks_schema, {'header': 'true'}, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/130')),
         pytest.param('trucks-null.csv', _trucks_schema, {'header': 'true', 'nullValue': 'null'}, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/2068')),
         pytest.param('trucks-null.csv', _trucks_schema, {'header': 'true'}, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/1986')),
         pytest.param('simple_int_values.csv', _byte_schema, {'header': 'true'}),
         pytest.param('simple_int_values.csv', _short_schema, {'header': 'true'}),
         pytest.param('simple_int_values.csv', _int_schema, {'header': 'true'}),
         pytest.param('simple_int_values.csv', _long_schema, {'header': 'true'}),
         ('simple_int_values.csv', _float_schema, {'header': 'true'}),
         ('simple_int_values.csv', _double_schema, {'header': 'true'}),
         ('simple_int_values.csv', _decimal_10_2_schema, {'header': 'true'}),
         ('decimals.csv', _decimal_10_2_schema, {'header': 'true'}),
         ('decimals.csv', _decimal_10_3_schema, {'header': 'true'}),
         pytest.param('empty_int_values.csv', _empty_byte_schema, {'header': 'true'}),
         pytest.param('empty_int_values.csv', _empty_short_schema, {'header': 'true'}),
         pytest.param('empty_int_values.csv', _empty_int_schema, {'header': 'true'}),
         pytest.param('empty_int_values.csv', _empty_long_schema, {'header': 'true'}),
         pytest.param('empty_int_values.csv', _empty_float_schema, {'header': 'true'}),
         pytest.param('empty_int_values.csv', _empty_double_schema, {'header': 'true'}),
         pytest.param('nan_and_inf.csv', _float_schema, {'header': 'true'}, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/125')),
         pytest.param('floats_invalid.csv', _float_schema, {'header': 'true'}),
         pytest.param('floats_invalid.csv', _double_schema, {'header': 'true'}),
         pytest.param('simple_float_values.csv', _byte_schema, {'header': 'true'}),
         pytest.param('simple_float_values.csv', _short_schema, {'header': 'true'}),
         pytest.param('simple_float_values.csv', _int_schema, {'header': 'true'}),
         pytest.param('simple_float_values.csv', _long_schema, {'header': 'true'}),
         pytest.param('simple_float_values.csv', _float_schema, {'header': 'true'}),
         pytest.param('simple_float_values.csv', _double_schema, {'header': 'true'}),
         pytest.param('simple_float_values.csv', _decimal_10_2_schema, {'header': 'true'}),
         pytest.param('simple_float_values.csv', _decimal_10_3_schema, {'header': 'true'}),
         pytest.param('simple_boolean_values.csv', _bool_schema, {'header': 'true'}),
         pytest.param('ints_with_whitespace.csv', _number_as_string_schema, {'header': 'true'}, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/2069')),
         pytest.param('ints_with_whitespace.csv', _byte_schema, {'header': 'true'}, marks=pytest.mark.xfail(reason='https://github.com/NVIDIA/spark-rapids/issues/130'))
         ], ids=idfn)
     @pytest.mark.parametrize('read_func', [read_csv_df, read_csv_sql])
     @pytest.mark.parametrize('v1_enabled_list', ["", "csv"])
     @pytest.mark.parametrize('ansi_enabled', ["true", "false"])
     def test_basic_csv_read(std_input_path, name, schema, options, read_func, v1_enabled_list, ansi_enabled, spark_tmp_table_factory):
         updated_conf=copy_and_update(_enable_all_types_conf, {
             'spark.sql.sources.useV1SourceList': v1_enabled_list,
             'spark.sql.ansi.enabled': ansi_enabled
         })
 >       assert_gpu_and_cpu_are_equal_collect(read_func(std_input_path + '/' + name, schema, spark_tmp_table_factory, options),
                 conf=updated_conf)
 
 ../../src/main/python/csv_test.py:257: 
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
 ../../src/main/python/asserts.py:508: in assert_gpu_and_cpu_are_equal_collect
     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
 ../../src/main/python/asserts.py:439: in _assert_gpu_and_cpu_are_equal
     assert_equal(from_cpu, from_gpu)
 ../../src/main/python/asserts.py:106: in assert_equal
     _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
 ../../src/main/python/asserts.py:42: in _assert_equal
     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
 ../../src/main/python/asserts.py:35: in _assert_equal
     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
 
 cpu = 1.7976931348623157e+308, gpu = inf
 float_check = <function get_float_check.<locals>.<lambda> at 0x7ff74b1b7ca0>
 path = [16, 'number']
 
     def _assert_equal(cpu, gpu, float_check, path):
         t = type(cpu)
         if (t is Row):
             assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
             if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
                 assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)
                 for field in cpu.__fields__:
                     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
             else:
                 for index in range(len(cpu)):
                     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
         elif (t is list):
             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
             for index in range(len(cpu)):
                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
         elif (t is tuple):
             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
             for index in range(len(cpu)):
                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
         elif (t is pytypes.GeneratorType):
             index = 0
             # generator has no zip :( so we have to do this the hard way
             done = False
             while not done:
                 sub_cpu = None
                 sub_gpu = None
                 try:
                     sub_cpu = next(cpu)
                 except StopIteration:
                     done = True
     
                 try:
                     sub_gpu = next(gpu)
                 except StopIteration:
                     done = True
     
                 if done:
                     assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
                 else:
                     _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])
     
                 index = index + 1
         elif (t is dict):
             # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark
             # so sort the items to do our best with ignoring the order of dicts
             cpu_items = list(cpu.items()).sort(key=_RowCmp)
             gpu_items = list(gpu.items()).sort(key=_RowCmp)
             _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
         elif (t is int):
             assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
         elif (t is float):
             if (math.isnan(cpu)):
                 assert math.isnan(gpu), "GPU and CPU float values are different at {}".format(path)
             else:
 >               assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
 E               AssertionError: GPU and CPU float values are different [16, 'number']
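
For context, the comparison that fails here is the float_check lambda returned by get_float_check. Its exact tolerance is not visible in this trace, but it amounts to a relative-error check, and no finite relative tolerance can match inf against a finite value. A minimal sketch of the idea (the 1e-6 tolerance is an assumed placeholder, not the harness's actual setting):

    import math

    def make_float_check(rel_tol=1e-6):
        # Relative-tolerance comparison in the spirit of get_float_check;
        # the 1e-6 tolerance is an assumed placeholder.
        def check(cpu, gpu):
            # inf only ever matches inf of the same sign, so a finite CPU
            # value against a GPU inf fails regardless of the tolerance.
            if math.isinf(cpu) or math.isinf(gpu):
                return cpu == gpu
            return abs(cpu - gpu) <= rel_tol * max(abs(cpu), abs(gpu))
        return check

    float_check = make_float_check()
    assert not float_check(1.7976931348623157e+308, float('inf'))
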
NvTimLiu added labels: bug (Something isn't working), ? - Needs Triage (Need team to review and classify) on Apr 12, 2022
@NvTimLiu (Collaborator, Author) commented Apr 12, 2022

This may relate to these commits:

#4992

#5195

#5185

Observed on the 22.06 Databricks nightly build + integration tests (rapids_databricks_nightly-dev-github 357). Let's check whether these failures occur in other IT environments.

@tgravescs (Collaborator) commented

Just a guess, but this is perhaps related to rapidsai/cudf@012af64.

@tgravescs (Collaborator) commented

I did verify that the latest spark-rapids plugin with a cuDF jar from the 10th does not have these test failures.

@andygrove (Contributor) commented

Yes, this appears to be caused by rapidsai/cudf#10622 and hitting the documented limitations we have when casting from string to float. Specifically, 1.7976931348623157e+308 being parsed as Inf.

andygrove self-assigned this on Apr 12, 2022
andygrove added this to the Apr 4 - Apr 15 milestone on Apr 12, 2022
@andygrove (Contributor) commented

> Yes, this appears to be caused by rapidsai/cudf#10622 and hitting the documented limitations we have when casting from string to float. Specifically, 1.7976931348623157e+308 being parsed as Inf.

Specifically, CPU is producing 1.7976931348623157e+308 and GPU is producing inf for at least one of these tests, which actually seems to be the opposite of the documented limitation. I am looking into this.
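
For context on why this exact value is fragile: 1.7976931348623157e+308 is the shortest decimal string that round-trips to DBL_MAX, the largest finite double, and the next representable value above DBL_MAX is infinity. A parser whose intermediate arithmetic rounds up by even one ulp therefore overflows to inf, while a correctly rounded parse stays finite. A quick illustration in plain Python (not the cuDF code path):

    import math
    import sys

    s = "1.7976931348623157e+308"

    parsed = float(s)                     # correctly rounded parse
    assert parsed == sys.float_info.max   # finite: this is DBL_MAX
    assert not math.isinf(parsed)

    # The next double above DBL_MAX is infinity (math.nextafter needs
    # Python 3.9+), so a one-ulp round-up inside a parser's intermediate
    # arithmetic tips the result over to inf.
    assert math.isinf(math.nextafter(parsed, math.inf))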

sameerz added the P0 (Must have for release) label and removed ? - Needs Triage (Need team to review and classify) on Apr 12, 2022
@sameerz (Collaborator) commented Apr 12, 2022

The fix in rapidsai/cudf#10622 addresses the handling of small floating-point values, which were being converted in a way that appeared to lose too much precision (1.0 -> 0.99999). We should be OK on the upper ranges of floats, where very large values, such as those in this test, are being parsed as Inf.

We should also add tests to confirm float conversion at the lower end of the decimal range, regardless of the epsilon tolerance our tests use; see the sketch below.
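
A rough sketch of what such a test could look like, reusing the helpers and fixtures visible in the traceback above. Note that 'small_float_values.csv' is a hypothetical fixture file (covering values near 1.0 such as 0.99999 and deep subnormals such as 4.9e-324), not an existing resource:

    @approximate_float
    @pytest.mark.parametrize('v1_enabled_list', ["", "csv"])
    def test_csv_read_small_floats(std_input_path, v1_enabled_list, spark_tmp_table_factory):
        # 'small_float_values.csv' is a placeholder; _double_schema,
        # copy_and_update, read_csv_df, and _enable_all_types_conf are the
        # csv_test.py helpers shown in the failure above.
        updated_conf = copy_and_update(_enable_all_types_conf, {
            'spark.sql.sources.useV1SourceList': v1_enabled_list})
        assert_gpu_and_cpu_are_equal_collect(
            read_csv_df(std_input_path + '/small_float_values.csv',
                        _double_schema, spark_tmp_table_factory, {'header': 'true'}),
            conf=updated_conf)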
