Commit

Add missing transitive links and adapt to highlight multi-source use-case (#9)

* Add missing links and make adaptations

* Remove cluster id

* Add cluster files for binary

* Update README

* Add new fold data and adapt loading from release

* Adapt tests!

* Update README

* Test downloading

* Adapt changelog
dobraczka authored Mar 13, 2024
1 parent cfd4f9c commit 1c9dee7
Showing 63 changed files with 78,789 additions and 37,412 deletions.
15 changes: 14 additions & 1 deletion CHANGELOG.md
@@ -7,9 +7,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [1.1.0] - 2024-03-13

### Fixed

- Remove intra-dataset links from ent_links

### Added

- intra_ent_links to ERDataset
- cluster files
- multi_source_cluster file

## [1.0.2] - 2023-03-24

### Changed

- Loosen dependency restrictions to allow every Python >=3.7.1

@@ -24,6 +36,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

Make packagable and enable loading of datasets

[1.1.0]: https://github.com/ScaDS/MovieGraphBenchmark/releases/tag/v1.1.0
[1.0.2]: https://github.com/ScaDS/MovieGraphBenchmark/releases/tag/v1.0.2
[1.0.1]: https://github.com/ScaDS/MovieGraphBenchmark/releases/tag/v1.0.1
[1.0.0]: https://github.com/ScaDS/MovieGraphBenchmark/releases/tag/v1.0.0
26 changes: 24 additions & 2 deletions README.md
@@ -1,3 +1,10 @@
# !! Update 2024-02-24 (fixed in 1.1.0) !!
We found that `ent_links` in some cases contained intra-dataset links, which is not immediately noticeable to the user.
After another round of clerical review, previously missed transitive links were added, and the `ent_links` files now contain only entity links _between_ the datasets. The `721_5fold` directories have been adapted accordingly.
The intra-dataset links are now in `{dataset_name}_intra_ent_links` for each of the three datasets.
It might also not be immediately obvious that this dataset can be used as a multi-source entity resolution task.
We therefore provide a `multi_source_cluster` file in which each line consists of a cluster id and the comma-separated cluster members from the three datasets; a cluster can include multiple entries from a single dataset.
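
To illustrate what "transitive links" means here, the following is a minimal sketch (not the procedure used to build the benchmark) that groups entities into clusters by taking the transitive closure of pairwise links with a union-find structure. The function name `clusters_from_links` and the short identifiers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def clusters_from_links(links: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group entities into clusters via the transitive closure of the links."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Register unseen entities as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two clusters

    groups: Dict[str, Set[str]] = defaultdict(set)
    for entity in list(parent):
        groups[find(entity)].add(entity)
    return list(groups.values())


# Hypothetical identifiers, not taken from the dataset.
links = [("imdb:a", "tmdb:a"), ("tmdb:a", "tvdb:a"), ("imdb:b", "tmdb:b")]
clusters = clusters_from_links(links)
```

In this toy input, `imdb:a` and `tvdb:a` end up in the same cluster although no direct link between them was given, which is the kind of link a purely pairwise review can miss.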

# Dataset License
Due to licensing we are not allowed to distribute the IMDB datasets (more info on their license can be found [here](https://help.imdb.com/article/imdb/general-information/can-i-use-imdb-data-in-my-software/G5JTRESSHJBBHTGX?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3aefe545-f8d3-4562-976a-e5eb47d1bb18&pf_rd_r=2TNAA9FRS3TJWM3AEQ2X&pf_rd_s=center-1&pf_rd_t=60601&pf_rd_i=interfaces&ref_=fea_mn_lk1#))
What we can do is let you build the IMDB side of the entity resolution datasets yourself. Please be aware that the mentioned license applies to the IMDB data you produce.
@@ -36,12 +43,27 @@ print(ds.rel_triples_2)
print(ds.ent_links)
for fold in ds.folds:
    print(fold)

# the intra-dataset links are stored as a tuple of dataframes
print(ds.intra_ent_links[0])
print(ds.intra_ent_links[1])
```
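
The distinction between `ent_links` (inter-dataset) and `intra_ent_links` (intra-dataset) can be sketched without the package by looking at the entity URIs. This sketch assumes that every URI follows the pattern `.../resource/<SOURCE>/<id>`, as the IMDB URIs in `imdb_intra_ent_links` do; the `TMDB` segment and the ids in the second pair are hypothetical, and `source_of` and `split_links` are illustrative helpers, not part of the library.

```python
from typing import Iterable, List, Tuple

Link = Tuple[str, str]


def source_of(uri: str) -> str:
    """Return the path segment that precedes the entity id, e.g. 'IMDB'."""
    return uri.rstrip("/").split("/")[-2]


def split_links(links: Iterable[Link]) -> Tuple[List[Link], List[Link]]:
    """Separate links into (inter-dataset, intra-dataset) lists."""
    inter: List[Link] = []
    intra: List[Link] = []
    for left, right in links:
        target = intra if source_of(left) == source_of(right) else inter
        target.append((left, right))
    return inter, intra


base = "https://www.scads.de/movieBenchmark/resource"
links = [
    # This pair appears in imdb_intra_ent_links.
    (f"{base}/IMDB/nm3361496", f"{base}/IMDB/nm3361266"),
    # Hypothetical ids and an assumed TMDB path segment.
    (f"{base}/IMDB/tt0000001", f"{base}/TMDB/movie1"),
]
inter, intra = split_links(links)
```

A check like this can be used to confirm that a link file contains only links between two different sources.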

Alternatively this dataset (among others) is also available in [`sylloge`](https://github.com/dobraczka/sylloge).

# Dataset structure
There are 3 entity resolution tasks in this repository: imdb-tmdb, imdb-tvdb, tmdb-tvdb, all contained in the `data` folder.
The data structure follows the structure used in [OpenEA](https://github.com/nju-websoft/OpenEA).
Each folder contains the information of the knowledge graphs (`attr_triples_*`,`rel_triples_*`) and the gold standard of entity links (`ent_links`). The triples are labeled with `1` and `2` where e.g. for imdb-tmdb `1` refers to imdb and `2` to tmdb. The folder 721_5fold contains pre-split entity link folds with 70-20-10 ratio for testing, training, validation.
The data structure mainly follows the structure used in [OpenEA](https://github.com/nju-websoft/OpenEA).
Each folder contains the information of the knowledge graphs (`attr_triples_*`,`rel_triples_*`) and the gold standard of entity links between the datasets (`ent_links`). The triples are labeled with `1` and `2` where e.g. for imdb-tmdb `1` refers to imdb and `2` to tmdb. The folder `721_5fold` contains pre-split entity link folds with 70-20-10 ratio for testing, training, validation.
Furthermore, there exists a file for each dataset with intra-dataset links called `*_intra_ent_links`.
For the binary cases each dataset has a `cluster` file in the respective folder. Each line here is a cluster with comma-separated members of the cluster. This includes intra- and inter-dataset links.
For the multi-source setting, you can use the `multi_source_cluster` file in the `data` folder.
Using [`sylloge`](https://github.com/dobraczka/sylloge) you can also easily load this dataset as a multi-source task:

```
from sylloge import MovieGraphBenchmark
ds = MovieGraphBenchmark(graph_pair='multi')
```
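
If you prefer to read a `cluster` file directly, a minimal sketch could look like the following. It assumes the format described above for the binary case (one cluster per line, members separated by commas, no leading cluster id); the helper names and the sample identifiers are hypothetical. Expanding each cluster into all member pairs recovers both intra- and inter-dataset links.

```python
from itertools import combinations
from typing import Iterable, List, Tuple


def parse_cluster_lines(lines: Iterable[str]) -> List[List[str]]:
    """Read one cluster per line; members are comma-separated."""
    clusters = []
    for line in lines:
        members = [m.strip() for m in line.split(",") if m.strip()]
        if members:  # skip empty lines
            clusters.append(members)
    return clusters


def pairwise_links(clusters: Iterable[List[str]]) -> List[Tuple[str, str]]:
    """Expand each cluster into all unordered member pairs."""
    return [pair for members in clusters for pair in combinations(members, 2)]


# Hypothetical sample lines; a real file would be passed as an open file handle.
sample = ["imdb:a,tmdb:a,tvdb:a\n", "imdb:b,tmdb:b\n", "\n"]
clusters = parse_cluster_lines(sample)
pairs = pairwise_links(clusters)
```

For an actual file, `parse_cluster_lines(open(path, encoding="utf-8"))` would work the same way, since the function only iterates over lines.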

# Citing
This dataset was first presented in this paper:
2,782 changes: 1,459 additions & 1,323 deletions data/imdb-tmdb/721_5fold/1/test_links

Large diffs are not rendered by default.

760 changes: 399 additions & 361 deletions data/imdb-tmdb/721_5fold/1/train_links

Large diffs are not rendered by default.

368 changes: 194 additions & 174 deletions data/imdb-tmdb/721_5fold/1/valid_links

Large diffs are not rendered by default.

2,777 changes: 1,456 additions & 1,321 deletions data/imdb-tmdb/721_5fold/2/test_links

Large diffs are not rendered by default.

761 changes: 400 additions & 361 deletions data/imdb-tmdb/721_5fold/2/train_links

Large diffs are not rendered by default.

368 changes: 194 additions & 174 deletions data/imdb-tmdb/721_5fold/2/valid_links

Large diffs are not rendered by default.

2,780 changes: 1,458 additions & 1,322 deletions data/imdb-tmdb/721_5fold/3/test_links

Large diffs are not rendered by default.

762 changes: 400 additions & 362 deletions data/imdb-tmdb/721_5fold/3/train_links

Large diffs are not rendered by default.

374 changes: 197 additions & 177 deletions data/imdb-tmdb/721_5fold/3/valid_links

Large diffs are not rendered by default.

2,775 changes: 1,455 additions & 1,320 deletions data/imdb-tmdb/721_5fold/4/test_links

Large diffs are not rendered by default.

764 changes: 401 additions & 363 deletions data/imdb-tmdb/721_5fold/4/train_links

Large diffs are not rendered by default.

365 changes: 193 additions & 172 deletions data/imdb-tmdb/721_5fold/4/valid_links

Large diffs are not rendered by default.

2,789 changes: 1,462 additions & 1,327 deletions data/imdb-tmdb/721_5fold/5/test_links

Large diffs are not rendered by default.

755 changes: 397 additions & 358 deletions data/imdb-tmdb/721_5fold/5/train_links

Large diffs are not rendered by default.

366 changes: 193 additions & 173 deletions data/imdb-tmdb/721_5fold/5/valid_links

Large diffs are not rendered by default.

2,201 changes: 2,201 additions & 0 deletions data/imdb-tmdb/cluster

Large diffs are not rendered by default.

228 changes: 211 additions & 17 deletions data/imdb-tmdb/ent_links

Large diffs are not rendered by default.

3,688 changes: 1,995 additions & 1,693 deletions data/imdb-tvdb/721_5fold/1/test_links

Large diffs are not rendered by default.

1,028 changes: 557 additions & 471 deletions data/imdb-tvdb/721_5fold/1/train_links

Large diffs are not rendered by default.

501 changes: 272 additions & 229 deletions data/imdb-tvdb/721_5fold/1/valid_links

Large diffs are not rendered by default.

3,693 changes: 1,997 additions & 1,696 deletions data/imdb-tvdb/721_5fold/2/test_links

Large diffs are not rendered by default.

1,031 changes: 559 additions & 472 deletions data/imdb-tvdb/721_5fold/2/train_links

Large diffs are not rendered by default.

507 changes: 275 additions & 232 deletions data/imdb-tvdb/721_5fold/2/valid_links

Large diffs are not rendered by default.

3,682 changes: 1,992 additions & 1,690 deletions data/imdb-tvdb/721_5fold/3/test_links

Large diffs are not rendered by default.

1,030 changes: 558 additions & 472 deletions data/imdb-tvdb/721_5fold/3/train_links

Large diffs are not rendered by default.

509 changes: 276 additions & 233 deletions data/imdb-tvdb/721_5fold/3/valid_links

Large diffs are not rendered by default.

3,689 changes: 1,995 additions & 1,694 deletions data/imdb-tvdb/721_5fold/4/test_links

Large diffs are not rendered by default.

1,026 changes: 556 additions & 470 deletions data/imdb-tvdb/721_5fold/4/train_links

Large diffs are not rendered by default.

502 changes: 273 additions & 229 deletions data/imdb-tvdb/721_5fold/4/valid_links

Large diffs are not rendered by default.

3,697 changes: 1,999 additions & 1,698 deletions data/imdb-tvdb/721_5fold/5/test_links

Large diffs are not rendered by default.

1,021 changes: 554 additions & 467 deletions data/imdb-tvdb/721_5fold/5/train_links

Large diffs are not rendered by default.

507 changes: 275 additions & 232 deletions data/imdb-tvdb/721_5fold/5/valid_links

Large diffs are not rendered by default.

1,483 changes: 1,483 additions & 0 deletions data/imdb-tvdb/cluster

Large diffs are not rendered by default.

4,831 changes: 2,631 additions & 2,200 deletions data/imdb-tvdb/ent_links

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/imdb_intra_ent_links
@@ -0,0 +1 @@
https://www.scads.de/movieBenchmark/resource/IMDB/nm3361496 https://www.scads.de/movieBenchmark/resource/IMDB/nm3361266
3,598 changes: 3,598 additions & 0 deletions data/multi_source_cluster

Large diffs are not rendered by default.
