Update README
dobraczka committed Mar 13, 2024
1 parent 30d1e8a commit 7e19334
Showing 1 changed file with 8 additions and 0 deletions.
8 changes: 8 additions & 0 deletions README.md
@@ -49,13 +49,21 @@ print(ds.intra_ent_links[0])
print(ds.intra_ent_links[1])
```

Alternatively, this dataset (among others) is available in [`sylloge`](https://github.com/dobraczka/sylloge).

# Dataset structure
There are 3 entity resolution tasks in this repository: imdb-tmdb, imdb-tvdb, tmdb-tvdb, all contained in the `data` folder.
The data structure mainly follows the structure used in [OpenEA](https://github.com/nju-websoft/OpenEA).
Each folder contains the knowledge graph data (`attr_triples_*`, `rel_triples_*`) and the gold standard of entity links between the datasets (`ent_links`). The triples are labeled with `1` and `2`; for example, in imdb-tmdb `1` refers to imdb and `2` to tmdb. The folder `721_5fold` contains pre-split entity link folds with a 70-20-10 ratio for testing, training, and validation.
Furthermore, there exists a file for each dataset with intra-dataset links called `*_intra_ent_links`.
For the binary cases, each dataset has a `cluster` file in the respective folder. Each line is one cluster, written as its comma-separated members. This includes intra- and inter-dataset links.
For the multi-source setting, you can use the `multi_source_cluster` file in the `data` folder.
Using [`sylloge`](https://github.com/dobraczka/sylloge) you can also easily load this dataset as a multi-source task:

```
from sylloge import MovieGraphBenchmark
ds = MovieGraphBenchmark(graph_pair='multi')
```
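
If you prefer to read the `cluster` files directly, the following is a minimal sketch rather than part of this repository's API. It relies only on the format described above (one cluster per line, members separated by commas) and assumes that entity identifiers themselves contain no commas. The function names `parse_clusters`, `read_clusters` and `entity_to_cluster`, as well as the example path in the comment, are hypothetical.

```python
def parse_clusters(lines):
    """Turn an iterable of text lines into a list of clusters.

    Each non-empty line is one cluster; members are separated by commas.
    Assumption: entity identifiers do not contain commas themselves.
    """
    clusters = []
    for line in lines:
        members = [m.strip() for m in line.strip().split(",") if m.strip()]
        if members:  # skip blank lines
            clusters.append(members)
    return clusters


def read_clusters(path):
    """Read a cluster file from disk (the path is supplied by the caller)."""
    with open(path, encoding="utf-8") as handle:
        return parse_clusters(handle)


def entity_to_cluster(clusters):
    """Map every entity identifier to the index of its cluster."""
    return {
        member: idx
        for idx, cluster in enumerate(clusters)
        for member in cluster
    }


# Hypothetical usage; adjust the path to your local checkout:
# clusters = read_clusters("data/imdb-tmdb/cluster")
# lookup = entity_to_cluster(clusters)
```

The lookup dictionary makes it cheap to check whether two entities are linked: they match if both map to the same cluster index.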

# Citing
This dataset was first presented in this paper: