
[src, egs] Support limiting max cluster size in agglomerative clustering #2971

Closed (wants to merge 3 commits)

Conversation

@dogancan (Contributor) commented Jan 6, 2019

@david-ryan-snyder @mmaciej2

This PR adds a new --max-spk-fraction option to agglomerative-cluster which can be used to limit the size of the largest cluster. This is useful when you know the number of speakers in a recording and you want to make sure no speaker cluster gets more than a predefined fraction of utterances. This threshold is active only when 1.0 / num_spk <= max-spk-fraction <= 1.0.
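A minimal sketch of that gating condition, with an illustrative helper name (this function is not part of the PR; it just restates the check in code):

    // The fraction is only honored when it is a meaningful bound: at least
    // 1.0 / num_spk (otherwise num_spk clusters of that size could not cover
    // all utterances) and at most 1.0 (anything above 1.0 is a no-op).
    bool MaxSpkFractionActive(double max_spk_fraction, int num_spk) {
      return 1.0 / num_spk <= max_spk_fraction && max_spk_fraction <= 1.0;
    }

For example, with num_spk = 4, any value in [0.25, 1.0] activates the limit, while 0.2 would be ignored because four clusters capped at 20% each could not cover the whole recording.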

One shortcoming of the current implementation is that max-spk-fraction value applies to all recordings. It is conceivable that the user would like to adjust this value based on the number of speakers in a recording. To support that use case, we can instead accept a reco2max_spk_fraction_rspecifier. I went with the simple global option because I didn't want to add another file input for a rare use case. Let me know what you guys think.

@danpovey (Contributor) commented Jan 6, 2019

@david-ryan-snyder and @mmaciej2 please comment.

@david-ryan-snyder (Contributor) commented:

@dogancan, is there any existing recipe that benefits from this change? E.g., the callhome_diarization recipe? Generally we prefer to add functionality when there's a recipe that uses it. If there's no recipe that benefits from this, can you spend some more time explaining the scenarios where this option is helpful?

> This is useful when you know the number of speakers in a recording and you want to make sure no speaker cluster gets more than a predefined fraction of utterances.

Let's say there's a recording with 2 speakers, and so we set the reco2num_spks to be 2 for that recording. Let's say that in reality, one speaker speaks about 80% of the time, and the other speaker speaks the remaining 20% of the time. Then, we set max_spk_fraction to equal, let's say 50%. Are we likely to end up with 3 clusters? Or 2 clusters with an equal amount of speech in each one (even though that is likely to be the wrong thing to do)?

        std::vector<int32> *assignments_out)
        : count_(0), costs_(costs), thresh_(thresh), min_clust_(min_clust),
          assignments_(assignments_out) {
      num_clusters_ = costs.NumRows();
      num_points_ = costs.NumRows();
      max_cluster_size_ = int(num_points_ * max_cluster_fraction);
@mmaciej2 (Contributor) commented on this diff:

I believe this needs to be rounded. Unless I'm mistaken, int() rounds towards zero by default, so for example if we have 5 points, a 0.5 max_cluster_fraction, and 2 speakers, the fraction will pass the check of 0.5 >= 1/2, but the max cluster size will end up being 2, and so the clustering algorithm will never be able to produce 2 clusters, as one would have to be of size 3.

@dogancan (Contributor, Author) replied:

I think you are right. This needs to be rounded up.
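A sketch of the agreed fix, reusing the member names from the quoted diff (this is the change being discussed, not necessarily the code as merged):

    #include <cmath>  // for std::ceil

    // Round up so that min_clust_ clusters of size <= max_cluster_size_ can
    // always cover all points: with 5 points and a 0.5 fraction,
    // ceil(5 * 0.5) = 3 permits a 3 + 2 split into two clusters, whereas
    // truncating to 2 would make two clusters impossible (2 + 2 < 5).
    max_cluster_size_ = static_cast<int32>(
        std::ceil(num_points_ * max_cluster_fraction));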

@dogancan (Contributor, Author) commented Jan 6, 2019

I'm not sure if there is an existing recipe that would directly benefit from this. This option is helpful when you know the number of speakers in the recording and you want to limit the number of utterances assigned to the largest cluster. The idea is not to force each cluster to be roughly the same size but to prevent the clustering algorithm from assigning almost all utterances to the same cluster. This kind of catastrophic failure happens if there are some outlier utterances that form their own small clusters and force other speaker clusters to be merged.

> Let's say there's a recording with 2 speakers, and so we set the reco2num_spks to be 2 for that recording. Let's say that in reality, one speaker speaks about 80% of the time, and the other speaker speaks the remaining 20% of the time. Then, we set max_spk_fraction to equal, let's say 50%. Are we likely to end up with 3 clusters? Or 2 clusters with an equal amount of speech in each one (even though that is likely to be the wrong thing to do)?

This option should not affect the number of clusters produced. I mean, that is the intention. I did not realize the code could produce a different number of clusters until Matthew pointed out the rounding bug. I'll fix that in a bit. In the scenario you are describing, you would end up with 2 clusters with an equal amount of speech, but that is not how the fraction should be set. If you believe one speaker might speak about 80% of the time, then you should set the fraction to something larger than that. This fraction is supposed to act as an upper limit beyond which clusters should not be merged. For instance, in this specific scenario you can set the max-spk-fraction to 85% to ensure that no cluster grows beyond that. In general this fraction should be set to a value in the middle of the range [1.0/num_spk, 1.0] rather than a value close to the extremes.

@david-ryan-snyder (Contributor) commented:

OK, thanks for the explanation... It would be nice to see if this can help in something like the callhome_diarization recipe. In that recipe, we evaluate both with a threshold for stopping and with a given number of speakers, specified by reco2num_spks. So this option has the potential to help in the latter case.

@dogancan (Contributor, Author) commented Jan 6, 2019

I think it would help as long as there are outliers in the data and a reasonable upper limit on cluster size can be estimated.

@danpovey (Contributor) commented Jan 6, 2019 via email

@david-ryan-snyder (Contributor) commented:

There are two recipes this can be tested on, dihard_2018 and callhome_diarization. @dogancan do you have access to the datasets to test either recipe?

@danpovey (Contributor) commented Jan 6, 2019

Maybe we can have @HuangZiliAndy test it out?

@dogancan (Contributor, Author) commented Jan 6, 2019

Unfortunately, I don't have access to either of those at the moment.

@mmaciej2 (Contributor) commented Jan 6, 2019

I've been thinking about this and I'm skeptical that this will ever "fix" an issue, but at best will do damage control.

In the case of only 2 speakers, if the normal clustering algorithm is throwing everything into one speaker due to outliers, you are just going to truncate your larger speaker to whatever proportion this parameter is set to and then assign the rest to the other speaker. I think that probably will result in a lower error rate but only because it would be doing some kind of naive smoothing of the diarization output.

The case of more than 2 speakers is a little more complicated, but I'm skeptical that it would even be possible to get the "smoothed" performance gains of the 2-speaker case with any reasonable frequency. If there are outliers causing a cluster to erroneously absorb other speakers, and you put a limit on how much of the recording that cluster can absorb, I think once it hits the limit some other cluster will just start erroneously absorbing other speakers instead. And I think tuning the threshold to prevent that would require a value so low that it would interfere with the actual diarization output.

This is all just a hypothesis, of course, so I'm not arguing that this shouldn't be tried out on some recipes to see if it can improve performance. I'm just skeptical this can improve diarization performance with 3+ speakers, and I think any gains with 2 speakers won't be particularly meaningful.

@dogancan (Contributor, Author) commented Jan 6, 2019

> I've been thinking about this and I'm skeptical that this will ever "fix" an issue, but at best will do damage control.

I agree.

> I'm just skeptical this can improve diarization performance with 3+ speakers, and I think any gains with 2 speakers won't be particularly meaningful.

I think gains in the case of 2 speakers will be statistically significant as long as there are outliers and a reasonable upper limit can be estimated. That is what I am seeing on my end. But you agree that setting this to a large enough value, e.g. in the range [0.8, 1.0], won't make things worse, right?

@HuangZiliAndy (Contributor) commented Jan 6, 2019 via email

@dogancan (Contributor, Author) commented Jan 6, 2019

Having thought a bit more, setting max cluster size might cause the algorithm to produce more clusters than what is provided in the min_clust argument. For instance, in the case of 2 speakers, if max-cluster-fraction is set to 0.6 and there are currently 3 clusters of roughly equal size, they will not be merged into 2 clusters. I am not sure whether it makes sense to try to avoid that situation by tweaking the algorithm though. Maybe we can print a warning to let the user know so they can set a larger max-cluster-fraction or increase the number of speakers. I think it is best to think of this option as another knob that can be used to determine the number of clusters.
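A small illustration of that situation (CanMerge is an invented helper, not code from the PR; it just captures the merge guard implied above):

    // A pair of clusters may only be merged if the combined size stays
    // within the cap implied by max-cluster-fraction.
    bool CanMerge(int size_a, int size_b, int max_cluster_size) {
      return size_a + size_b <= max_cluster_size;
    }

    // With 100 points and max-cluster-fraction = 0.6 the cap is 60, so three
    // clusters of sizes 34, 33, and 33 can never merge (33 + 33 = 66 > 60),
    // and clustering stops at 3 clusters even if min_clust asks for 2.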

@mmaciej2 (Contributor) commented Jan 6, 2019

> I think gains in the case of 2 speakers will be statistically significant as long as there are outliers and a reasonable upper limit can be estimated. That is what I am seeing on my end. But you agree that setting this to a large enough value, e.g. in the range [0.8, 1.0], won't make things worse, right?

I am not saying the gains won't be statistically significant, just that they won't be terribly meaningful. It's essentially a mechanism that tries to identify degenerate output and, if it finds it, says "give me the X% of the recording that is the most similar and treat that as one speaker". I think that's a perfectly reasonable thing to do, and it could even be useful in certain applications (for example speaker ID, where even if you can't find the secondary speaker, it might be a good idea to minimize the contamination of the dominant speaker).

I do agree that if the parameter is set to a large enough value, it is not likely to make things worse. Current diarization performance tends to break down when one of the speakers contributes very little speech, and this is particularly true when clustering to oracle number of speakers. If there is a case where one person was speaking 90% of the time and our threshold is set to 0.85, I think it's unlikely the performance without the threshold was particularly good to begin with.

@dogancan (Contributor, Author) commented Jan 6, 2019

Ok, we are in complete agreement on this :)

@HuangZiliAndy (Contributor) commented:

Hi all! I ran the experiments on the Callhome dataset, and the DER increased. https://docs.google.com/document/d/1Z23ZMWqy1Rxw-H3MaBQIKVFbozkTAmZtb3UfW3-cQhs/edit?usp=sharing

I think limiting the max cluster size might help in situations where the amount of speech from each speaker is fairly even. However, it requires too much prior knowledge, and there are clearly cases where such an operation will harm the prediction.

Best,
Zili

@david-ryan-snyder (Contributor) commented:

Thanks Zili! Any chance you have a dihard_2018 setup lying around that you could quickly test this on? If you only have one of the systems trained (e.g., x-vector) that's OK.

@HuangZiliAndy (Contributor) commented:

Hi David! Dihard 2018 results updated.

https://docs.google.com/document/d/1Z23ZMWqy1Rxw-H3MaBQIKVFbozkTAmZtb3UfW3-cQhs/edit?usp=sharing

The x-vector system improves a little when max_spk_fraction = 0.9, while i-vector performance declined. Thanks!

@david-ryan-snyder (Contributor) commented:

Thanks a lot @HuangZiliAndy for running these experiments!

In my opinion, unless we can show a tangible benefit in the existing examples, I don't think we should merge this feature.

My suggestion is to leave this unmerged for the time being, but keep it open for those who might want to try it on their setups. We might also decide to come back to this PR if a recipe is created that looks like it could benefit from this.

@dogancan (Contributor, Author) commented Jan 7, 2019

Thanks a lot Zili!

Having seen these results, I kind of agree with David. This functionality was helpful when dealing with some out-of-domain data I was diarizing, but I guess it is not really helpful when the system is well matched to the data.

@david-ryan-snyder (Contributor) commented:

@danpovey I believe this PR can be closed, since we merged PR #3058 which I think does the same thing.

@danpovey closed this on Dec 7, 2019