[BUG] dataset_info sequence unexpected behavior in README.md YAML #7137

ain-soph · 2024-09-05T06:06:06Z

Describe the bug

When writing the dataset_info YAML, I found that a data column of type list[dict[str, str]] cannot be encoded correctly.

My data looks like

{"answers":[{"text": "ADDRESS", "label": "abc"}]}

My dataset_info in README.md is:

dataset_info:
- config_name: default
  features:
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: label
      dtype: string

Error log:

pyarrow.lib.ArrowNotImplementedError: Unsupported cast from list<item: struct<text: string, label: string>> to struct using function cast_struct

Potential Reason

After some analysis, it turns out that my YAML config describes dict[str, list[str]] rather than list[dict[str, str]]. Loading works if I change my data to

{"answers":{"text": ["ADDRESS"], "label": ["abc", "def"]}}

The following two dataset_info definitions are actually equivalent.

dataset_info:
- config_name: default
  features:
  - name: answers
    dtype:
    - name: text
      sequence: string
    - name: label
      sequence: string

dataset_info:
- config_name: default
  features:
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: label
      dtype: string
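
To make the two data layouts concrete, here is a minimal sketch in plain Python (no datasets import) of the transposition from list[dict[str, str]] to dict[str, list[str]]. The helper name rows_to_columns is hypothetical and only illustrates the shape the loader currently accepts; it assumes every dict in the list has the same keys.

```python
from typing import Any


def rows_to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Transpose a list of dicts into a dict of lists.

    Assumes every row has the same keys as the first row.
    """
    if not rows:
        return {}
    keys = list(rows[0])
    return {key: [row[key] for row in rows] for key in keys}


# The record as I would like to store it (list of dicts).
record = {"answers": [{"text": "ADDRESS", "label": "abc"}]}

# The same record in the dict-of-lists layout that currently loads.
converted = {"answers": rows_to_columns(record["answers"])}
print(converted)
```

Rewriting the data this way is only a workaround on the data side; it does not change how the YAML is interpreted.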

Steps to reproduce the bug

# README.md
---
dataset_info:
- config_name: default
  features:
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: label
      dtype: string
configs:
- config_name: default
  default: true
  data_files:
  - split: train
    path:
    - "test.jsonl"
---



# test.jsonl

# expected but not working
{"answers":[{"text": "ADDRESS", "label": "abc"}]}

# unexpected but working
{"answers":{"text": ["ADDRESS"], "label": ["abc", "def"]}}
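
For convenience, the two files above can be generated with a short script. This is a sketch under assumptions: the helper names (write_repro_files, make_repro_dir, load_repro) are hypothetical, only the "expected but not working" row is written, and load_repro assumes that load_dataset accepts a local directory containing README.md and the data file.

```python
import json
import tempfile
from pathlib import Path

# Same front matter as the README.md shown above.
README = """\
---
dataset_info:
- config_name: default
  features:
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: label
      dtype: string
configs:
- config_name: default
  default: true
  data_files:
  - split: train
    path:
    - "test.jsonl"
---
"""


def write_repro_files(root: Path) -> None:
    """Write README.md and a one-line test.jsonl into root."""
    (root / "README.md").write_text(README, encoding="utf-8")
    row = {"answers": [{"text": "ADDRESS", "label": "abc"}]}
    (root / "test.jsonl").write_text(json.dumps(row) + "\n", encoding="utf-8")


def make_repro_dir() -> Path:
    """Create a temporary directory holding the reproduction files."""
    root = Path(tempfile.mkdtemp(prefix="dataset_info_repro_"))
    write_repro_files(root)
    return root


def load_repro(root: Path):
    """Try to load the directory; requires the datasets package."""
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset(str(root), split="train")
```

Calling load_repro(make_repro_dir()) is what should trigger the cast error from the log above in the environment listed below.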

Expected behavior

dataset_info:
- config_name: default
  features:
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: label
      dtype: string

should work with the following data format:

{"answers":[{"text":"ADDRESS", "label": "abc"}]}

Environment info

datasets version: 2.21.0
Platform: macOS-14.6.1-arm64-arm-64bit
Python version: 3.12.4
huggingface_hub version: 0.24.5
PyArrow version: 17.0.0
Pandas version: 2.2.2
fsspec version: 2024.6.1


ain-soph · 2024-09-05T06:10:08Z

The non-sequence case (dict[str, str] instead of list[dict[str, str]]) works as expected, which makes me believe this is a bug in the sequence handling and that the behavior I propose is the intended one.

dataset_info:
- config_name: default
  features:
  - name: answers
    dtype:
    - name: text
      dtype: string
    - name: label
      dtype: string


# data
{"answers": {"text": "ADDRESS", "label": "abc"}}

ain-soph changed the title ~~dataset_info sequence format unexpected behavior in README.md YAML~~ [BUG] dataset_info sequence unexpected behavior in README.md YAML Sep 9, 2024
