
load_dataset with multiple jsonlines files interprets datastructure too early #7092

Open
Vipitis opened this issue Aug 6, 2024 · 5 comments

@Vipitis

Vipitis commented Aug 6, 2024

Describe the bug

likely related to #6460

using datasets.load_dataset("json", data_dir= ... ) with multiple .jsonl files will raise an error if one of the files (maybe the first file?) contains a column consisting entirely of empty data.

Steps to reproduce the bug

real world example:
data is available in this PR-branch. Because my files are chunked by month, some months contain all-empty data for some columns just by chance - these are []. Otherwise the structure is identical across files.

from datasets import load_dataset
ds = load_dataset("json", data_dir="./data/annotated/api")

You get a long error trace; somewhere in the middle it says something like:

TypeError: Couldn't cast array of type struct<id: int64, src: string, ctype: string, channel: int64, sampler: struct<filter: string, wrap: string, vflip: string, srgb: string, internal: string>, published: int64> to null

toy example: (on request)
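For illustration, a minimal sketch of what such a toy example might look like (hypothetical file names and contents, not the author's actual data; whether it errors can depend on which file is read first):

import os
import json
import datasets

# two month-chunked files: in the first one the "inputs" column is always [],
# so its type is inferred as null; the richer values in the second file can
# then no longer be cast to that inferred type
os.makedirs("toy_data", exist_ok=True)
files = {
    "2024-01.jsonl": [{"id": 1, "inputs": []}, {"id": 2, "inputs": []}],
    "2024-02.jsonl": [{"id": 3, "inputs": [{"channel": 0, "src": "tex.png"}]}],
}
for name, rows in files.items():
    with open(os.path.join("toy_data", name), "w") as f:
        f.write("\n".join(json.dumps(row) for row in rows) + "\n")

# expected to fail with something like:
# TypeError: Couldn't cast array of type struct<channel: int64, src: string> to null
ds = datasets.load_dataset("json", data_dir="toy_data")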

Expected behavior

Some suggestions

  1. give a better error message to the user
  2. consider all files before deciding on a data structure for a given column.
  3. if you encounter a new structure that can't be cast to null, replace the null hypothesis with the richer type (maybe something for pyarrow).

As a workaround I have lazily implemented the following (essentially suggestion 2):

import os
import jsonlines
import datasets

# collect every .jsonl file in the directory
api_files = os.listdir("./data/annotated/api")
api_files = [f"./data/annotated/api/{f}" for f in api_files]

# read all records into memory so the schema is inferred over the full data
api_file_contents = []
for f in api_files:
    with jsonlines.open(f) as reader:
        for obj in reader:
            api_file_contents.append(obj)

ds = datasets.Dataset.from_list(api_file_contents)

This works fine for my use case, but it is potentially slower and less memory-efficient for really large datasets (where this issue is unlikely to occur in the first place).

Environment info

  • datasets version: 2.20.0
  • Platform: Windows-10-10.0.19041-SP0
  • Python version: 3.9.4
  • huggingface_hub version: 0.23.4
  • PyArrow version: 16.1.0
  • Pandas version: 2.2.2
  • fsspec version: 2023.10.0
@hvaara

hvaara commented Aug 6, 2024

I’ll take a look

@hvaara

hvaara commented Aug 6, 2024

Possible definitions of done for this issue:

  1. A fix so you can load your dataset specifically
  2. A general fix for datasets similar to this in the datasets library

Option 1 is trivial. I think option 2 requires significant changes to the library.

Since you outlined something akin to option 2 in Expected behavior, I'm assuming that's what you'd like to see done. Is that right?

In the meantime, here's a solution for option 1:

import datasets

data_dir = './data/annotated/api'

features = datasets.Features({'id': datasets.Value(dtype='string'),
 'name': datasets.Value(dtype='string'),
 'author': datasets.Value(dtype='string'),
 'description': datasets.Value(dtype='string'),
 'tags': datasets.Sequence(feature=datasets.Value(dtype='string'), length=-1),
 'likes': datasets.Value(dtype='int64'),
 'viewed': datasets.Value(dtype='int64'),
 'published': datasets.Value(dtype='int64'),
 'date': datasets.Value(dtype='string'),
 'time_retrieved': datasets.Value(dtype='string'),
 'image_code': datasets.Value(dtype='string'),
 'image_inputs': [{'channel': datasets.Value(dtype='int64'),
   'ctype': datasets.Value(dtype='string'),
   'id': datasets.Value(dtype='int64'),
   'published': datasets.Value(dtype='int64'),
   'sampler': {'filter': datasets.Value(dtype='string'),
    'internal': datasets.Value(dtype='string'),
    'srgb': datasets.Value(dtype='string'),
    'vflip': datasets.Value(dtype='string'),
    'wrap': datasets.Value(dtype='string')},
   'src': datasets.Value(dtype='string')}],
 'common_code': datasets.Value(dtype='string'),
 'sound_code': datasets.Value(dtype='string'),
 'sound_inputs': [{'channel': datasets.Value(dtype='int64'),
   'ctype': datasets.Value(dtype='string'),
   'id': datasets.Value(dtype='int64'),
   'published': datasets.Value(dtype='int64'),
   'sampler': {'filter': datasets.Value(dtype='string'),
    'internal': datasets.Value(dtype='string'),
    'srgb': datasets.Value(dtype='string'),
    'vflip': datasets.Value(dtype='string'),
    'wrap': datasets.Value(dtype='string')},
   'src': datasets.Value(dtype='string')}],
 'buffer_a_code': datasets.Value(dtype='string'),
 'buffer_a_inputs': [{'channel': datasets.Value(dtype='int64'),
   'ctype': datasets.Value(dtype='string'),
   'id': datasets.Value(dtype='int64'),
   'published': datasets.Value(dtype='int64'),
   'sampler': {'filter': datasets.Value(dtype='string'),
    'internal': datasets.Value(dtype='string'),
    'srgb': datasets.Value(dtype='string'),
    'vflip': datasets.Value(dtype='string'),
    'wrap': datasets.Value(dtype='string')},
   'src': datasets.Value(dtype='string')}],
 'buffer_b_code': datasets.Value(dtype='string'),
 'buffer_b_inputs': [{'channel': datasets.Value(dtype='int64'),
   'ctype': datasets.Value(dtype='string'),
   'id': datasets.Value(dtype='int64'),
   'published': datasets.Value(dtype='int64'),
   'sampler': {'filter': datasets.Value(dtype='string'),
    'internal': datasets.Value(dtype='string'),
    'srgb': datasets.Value(dtype='string'),
    'vflip': datasets.Value(dtype='string'),
    'wrap': datasets.Value(dtype='string')},
   'src': datasets.Value(dtype='string')}],
 'buffer_c_code': datasets.Value(dtype='string'),
 'buffer_c_inputs': [{'channel': datasets.Value(dtype='int64'),
   'ctype': datasets.Value(dtype='string'),
   'id': datasets.Value(dtype='int64'),
   'published': datasets.Value(dtype='int64'),
   'sampler': {'filter': datasets.Value(dtype='string'),
    'internal': datasets.Value(dtype='string'),
    'srgb': datasets.Value(dtype='string'),
    'vflip': datasets.Value(dtype='string'),
    'wrap': datasets.Value(dtype='string')},
   'src': datasets.Value(dtype='string')}],
 'buffer_d_code': datasets.Value(dtype='string'),
 'buffer_d_inputs': [{'channel': datasets.Value(dtype='int64'),
   'ctype': datasets.Value(dtype='string'),
   'id': datasets.Value(dtype='int64'),
   'published': datasets.Value(dtype='int64'),
   'sampler': {'filter': datasets.Value(dtype='string'),
    'internal': datasets.Value(dtype='string'),
    'srgb': datasets.Value(dtype='string'),
    'vflip': datasets.Value(dtype='string'),
    'wrap': datasets.Value(dtype='string')},
   'src': datasets.Value(dtype='string')}],
 'cube_a_code': datasets.Value(dtype='string'),
 'cube_a_inputs': [{'channel': datasets.Value(dtype='int64'),
   'ctype': datasets.Value(dtype='string'),
   'id': datasets.Value(dtype='int64'),
   'published': datasets.Value(dtype='int64'),
   'sampler': {'filter': datasets.Value(dtype='string'),
    'internal': datasets.Value(dtype='string'),
    'srgb': datasets.Value(dtype='string'),
    'vflip': datasets.Value(dtype='string'),
    'wrap': datasets.Value(dtype='string')},
   'src': datasets.Value(dtype='string')}],
 'thumbnail': datasets.Value(dtype='string'),
 'access': datasets.Value(dtype='string'),
 'license': datasets.Value(dtype='string'),
 'functions': datasets.Sequence(feature=datasets.Sequence(feature=datasets.Value(dtype='int64'), length=-1), length=-1),
 'test': datasets.Value(dtype='string')})

datasets.load_dataset('json', data_dir=data_dir, features=features)

@albertvillanova
Member

As pointed out by @hvaara, you can define explicit features so that you avoid the datasets library having to infer them (from the first few samples).

Note that the feature inference is done from the first few samples of JSON-Lines on purpose, so that the entire data does not need to be parsed twice (it would be inefficient for very large datasets).

@Vipitis
Author

Vipitis commented Aug 8, 2024

I understand this. But could there be a solution that doesn't require the end user to write this schema by hand (in my case, some fields contain a nested structure)?

Maybe offer an option to infer the schema automatically before loading the dataset, or perhaps trigger such a method when this error arises?

Is this "first few files" heuristics accessible via kwargs perhaps. Maybe an error that says
`Cloud not cast some structure into feature shema, consider increasing shema_files to a large number or all".

There might be efficient implementations to solve this problem for larger datasets.
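
One possible shape for that (not an existing datasets feature; a minimal sketch that assumes pyarrow's schema unification can promote the null-typed columns, which may depend on pyarrow version and nesting depth) would be to scan the files once for their schemas only, and then pass the unified result as features:

import os
import pyarrow as pa
import pyarrow.json as paj
import datasets

data_dir = "./data/annotated/api"
files = [os.path.join(data_dir, f) for f in os.listdir(data_dir)]

# one pass over every file, keeping only the per-file schemas
schemas = [paj.read_json(f).schema for f in files]

# unify them; null-typed fields should be promoted to the richer type
unified = pa.unify_schemas(schemas)

features = datasets.Features.from_arrow_schema(unified)
ds = datasets.load_dataset("json", data_dir=data_dir, features=features)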

@hvaara

hvaara commented Aug 8, 2024

@Vipitis raised a good point on the HF Discord regarding the use of a dataset script to provide the schema during initialization. Using this approach requires setting trust_remote_code=True, which is not allowed in certain evaluation frameworks.

For cases where using a dataset script is acceptable, would it be helpful to add functionality to the library (not necessarily in load_dataset) that can automatically discover the feature definitions and output them, so you don't have to manually define them?

Alternatively, for situations where features need to be known at load-time without using a dataset script, another option could be loading the dataset schema from a file format that doesn't require trust_remote_code=True.
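
As a rough sketch of that last idea (the file name and workflow are assumptions, not an existing datasets feature): the pyarrow schema behind a Features object can be serialized to a plain file and read back at load time, so no dataset script or trust_remote_code is involved:

import pyarrow as pa
import datasets

schema_path = "./data/annotated/api/schema.arrow"  # hypothetical location

# one-time step: persist a known-good schema, e.g. from the Features defined above
with open(schema_path, "wb") as f:
    f.write(features.arrow_schema.serialize())

# at load time: rebuild the Features from the file, no dataset script involved
with open(schema_path, "rb") as f:
    schema = pa.ipc.read_schema(pa.py_buffer(f.read()))
ds = datasets.load_dataset(
    "json",
    data_dir="./data/annotated/api",
    features=datasets.Features.from_arrow_schema(schema),
)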
