[BUG] dataset_info sequence unexpected behavior in README.md YAML #7137

Open
ain-soph opened this issue Sep 5, 2024 · 1 comment

ain-soph commented Sep 5, 2024

Describe the bug

When writing the dataset_info YAML, I found that a data column of type list[dict[str, str]] cannot be declared correctly.

My data looks like this:

{"answers":[{"text": "ADDRESS", "label": "abc"}]}

My dataset_info in README.md is:

dataset_info:
- config_name: default
  features:
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: label
      dtype: string

Error log:

pyarrow.lib.ArrowNotImplementedError: Unsupported cast from list<item: struct<text: string, label: string>> to struct using function cast_struct

Potential Reason

After some analysis, it turns out that my YAML config describes dict[str, list[str]] rather than list[dict[str, str]]. Loading works if I change my data to:

{"answers":{"text": ["ADDRESS"], "label": ["abc", "def"]}}
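As a stopgap, records in the list-of-dicts layout can be transposed into the dict-of-lists layout that the `sequence:` declaration accepts. The following is a minimal sketch using only the standard library; `rows_to_columns` is a hypothetical helper name, not part of `datasets`.

```python
import json


def rows_to_columns(items, keys):
    """Transpose a list of dicts into a dict of lists (hypothetical helper)."""
    return {key: [item[key] for item in items] for key in keys}


# Record in the layout that was expected to work.
record = json.loads('{"answers":[{"text": "ADDRESS", "label": "abc"}]}')

# Same record in the layout the `sequence:` declaration accepts.
converted = {"answers": rows_to_columns(record["answers"], keys=("text", "label"))}
print(json.dumps(converted))
# → {"answers": {"text": ["ADDRESS"], "label": ["abc"]}}
```

Note that the transposed layout loses the guarantee that every field has the same number of entries, which is why the working example above can carry two labels but only one text.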

The following two dataset_info definitions are actually equivalent:

dataset_info:
- config_name: default
  features:
  - name: answers
    dtype:
    - name: text
      sequence: string
    - name: label
      sequence: string

dataset_info:
- config_name: default
  features:
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: label
      dtype: string

Steps to reproduce the bug

# README.md
---
dataset_info:
- config_name: default
  features:
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: label
      dtype: string
configs:
- config_name: default
  default: true
  data_files:
  - split: train
    path:
    - "test.jsonl"
---



# test.jsonl

# expected but not working
{"answers":[{"text": "ADDRESS", "label": "abc"}]}

# unexpected but working
{"answers":{"text": ["ADDRESS"], "label": ["abc", "def"]}}

Expected behavior

dataset_info:
- config_name: default
  features:
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: label
      dtype: string

The dataset_info above should work with the following data format:

{"answers":[{"text":"ADDRESS", "label": "abc"}]}

Environment info

  • datasets version: 2.21.0
  • Platform: macOS-14.6.1-arm64-arm-64bit
  • Python version: 3.12.4
  • huggingface_hub version: 0.24.5
  • PyArrow version: 17.0.0
  • Pandas version: 2.2.2
  • fsspec version: 2024.6.1

ain-soph commented Sep 5, 2024

The non-sequence case works as expected (dict[str, str] instead of list[dict[str, str]]), which makes me believe that the sequence behavior is a bug and that my proposed behavior is what should be expected.

dataset_info:
- config_name: default
  features:
  - name: answers
    dtype:
    - name: text
      dtype: string
    - name: label
      dtype: string


# data
{"answers": {"text": "ADDRESS", "label": "abc"}}

ain-soph changed the title from "dataset_info sequence format unexpected behavior in README.md YAML" to "[BUG] dataset_info sequence unexpected behavior in README.md YAML" on Sep 9, 2024