Relation Classifier Encoding Strategies #3023

Merged
alanakbik merged 23 commits into master from relation-classifier-encoding-strategies on Dec 20, 2022

dobbersc
Collaborator

This PR implements several encoding strategies from this paper for the relation classifier. The original relation classifier was designed for the (typed) entity mask encoding strategy. Now, the model is more general, with the encoding strategy as a hyperparameter. The interface also enables us to easily add more encoding strategies.
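To illustrate what an encoding strategy does, here is a minimal, self-contained sketch that does not use the library. The function names, the span format, and the exact marker strings such as `[HEAD-PER]` are assumptions made for this example and may differ from what the PR implements.

```python
from typing import List, Tuple

# A span is (start token index, end token index exclusive, entity type).
# This representation is an assumption for the sketch, not the library's data model.
Span = Tuple[int, int, str]


def _replace_spans(tokens: List[str], replacements: List[Tuple[int, int, List[str]]]) -> List[str]:
    """Replace token ranges, working right to left so earlier indices stay valid."""
    out = list(tokens)
    for start, end, new_tokens in sorted(replacements, key=lambda r: r[0], reverse=True):
        out[start:end] = new_tokens
    return out


def typed_entity_mask(tokens: List[str], head: Span, tail: Span) -> List[str]:
    """Hide the entity text and keep only its role and type."""
    return _replace_spans(tokens, [
        (head[0], head[1], [f"[HEAD-{head[2]}]"]),
        (tail[0], tail[1], [f"[TAIL-{tail[2]}]"]),
    ])


def typed_entity_marker(tokens: List[str], head: Span, tail: Span) -> List[str]:
    """Keep the entity text and wrap it in typed opening and closing markers."""
    return _replace_spans(tokens, [
        (head[0], head[1], [f"[HEAD-{head[2]}]", *tokens[head[0]:head[1]], f"[/HEAD-{head[2]}]"]),
        (tail[0], tail[1], [f"[TAIL-{tail[2]}]", *tokens[tail[0]:tail[1]], f"[/TAIL-{tail[2]}]"]),
    ])


tokens = ["Larry", "Page", "founded", "Google", "."]
head, tail = (0, 2, "PER"), (3, 4, "ORG")
masked = typed_entity_mask(tokens, head, tail)
marked = typed_entity_marker(tokens, head, tail)
```

Treating the strategy as a swappable function like this is one way to read "the encoding strategy as a hyperparameter": the classifier sees only the encoded sentence, so a new strategy is just another function with the same signature.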

Collaborator

@alanakbik left a comment

Thanks @dobbersc for improving this!

In testing this PR, I noticed some discrepancies between the parameters of RelationExtractor and RelationClassifier. They relate to the (somewhat confusing) question of whether an RE dataset is fully annotated, like RE_ENGLISH_CONLL04, and thus usable without a prior "relation identification" step, or whether only a subset of entity pairs is annotated, as in TACRED or SemEval.

For RE_ENGLISH_CONLL04 one would instantiate the RelationClassifier like this:

model: RelationClassifier = RelationClassifier(
    embeddings=embeddings,
    label_dictionary=label_dictionary,
    label_type="relation",
    entity_label_types="ner",
    cross_augmentation=True,
)

and the RelationExtractor like this:

model: RelationExtractor = RelationExtractor(
    embeddings=embeddings,
    label_dictionary=label_dictionary,
    label_type="relation",
    entity_label_type="ner",
    train_on_gold_pairs_only=False,
)

So, the params entity_label_type / entity_label_types and train_on_gold_pairs_only / cross_augmentation have different names. Perhaps align the naming?

@dobbersc
Collaborator Author

Thanks, @alanakbik, for testing the PR.

So, the params entity_label_type / entity_label_types and train_on_gold_pairs_only / cross_augmentation have different names. Perhaps align the naming?

I agree that cross_augmentation is probably less descriptive for end users, and further from what they care about, than train_on_gold_pairs_only. We want to convey that this flag is relevant for datasets that are not fully annotated, where only the gold entity pairs should be used for training. How about we name it train_on_gold_entity_pairs or train_on_gold_relations then? We could also think about moving this parameter to transform_corpus and the related functions, since it is only required for model training and not for inference.
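To make the distinction concrete, here is a rough sketch of the two ways of building training pairs. The function `build_training_pairs`, its arguments, and the no-relation label "O" are hypothetical names for this example only, not the library's API.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def build_training_pairs(
    entities: List[str],
    gold_relations: Dict[Pair, str],
    cross_augmentation: bool,
    no_relation_label: str = "O",  # assumed label for unannotated pairs
) -> List[Tuple[str, str, str]]:
    """Return (head, tail, label) training examples for one sentence."""
    if not cross_augmentation:
        # Partially annotated data: train on the annotated (gold) pairs only.
        return [(h, t, label) for (h, t), label in gold_relations.items()]
    # Fully annotated data: every ordered entity pair becomes an example,
    # and pairs without a gold relation are labeled as "no relation".
    return [
        (h, t, gold_relations.get((h, t), no_relation_label))
        for h, t in permutations(entities, 2)
    ]


entities = ["Larry Page", "Google", "Stanford"]
gold = {("Larry Page", "Google"): "founder_of"}
augmented = build_training_pairs(entities, gold, cross_augmentation=True)
gold_only = build_training_pairs(entities, gold, cross_augmentation=False)
```

In this sketch, cross_augmentation=True plays the role that train_on_gold_pairs_only=False plays in the RelationExtractor example above, which is part of why a shared name would be easier to follow.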

The entity_label_types parameter is intentionally written with an s at the end, since the new RelationClassifier also allows multiple entity label types. The old RelationExtractor does not support this feature. The plural is consistent with other parameters, e.g. sentences in the predict function.
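As a small illustration of why the plural name fits, a parameter like this can accept either one label type or several and normalize both to a list. The helper below is hypothetical and only sketches the idea; the accepted types in the actual RelationClassifier may be broader.

```python
from typing import List, Sequence, Union


def normalize_entity_label_types(entity_label_types: Union[str, Sequence[str]]) -> List[str]:
    """Accept a single label type or several and always return a list."""
    if isinstance(entity_label_types, str):
        # A bare string is treated as one label type, not as a sequence of characters.
        return [entity_label_types]
    return list(entity_label_types)


single = normalize_entity_label_types("ner")
multiple = normalize_entity_label_types(["ner", "chunk"])
```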

@alanakbik
Collaborator

@dobbersc thanks, merging now!

@alanakbik merged commit 3b91e8c into master on Dec 20, 2022
@alanakbik deleted the relation-classifier-encoding-strategies branch on December 20, 2022, 15:03