
[src, egs] Support limiting max cluster size in agglomerative clustering #2971

Closed (wants to merge 3 commits)

Conversation

@dogancan (Contributor) commented Jan 6, 2019

@david-ryan-snyder @mmaciej2

This PR adds a new --max-spk-fraction option to agglomerative-cluster which can be used to limit the size of the largest cluster. This is useful when you know the number of speakers in a recording and you want to make sure no speaker cluster gets more than a predefined fraction of utterances. This threshold is active only when 1.0 / num_spk <= max-spk-fraction <= 1.0.
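A minimal sketch of that gating condition, with an illustrative helper name (this function is not part of the PR; it just restates the check in code):

    // The fraction is only honored when it is a meaningful bound: at least
    // 1.0 / num_spk (otherwise num_spk clusters of that size could not cover
    // all utterances) and at most 1.0 (anything above 1.0 is a no-op).
    bool MaxSpkFractionActive(double max_spk_fraction, int num_spk) {
      return 1.0 / num_spk <= max_spk_fraction && max_spk_fraction <= 1.0;
    }

For example, with num_spk = 4, any value in [0.25, 1.0] activates the limit, while 0.2 would be ignored because four clusters capped at 20% each could not cover the whole recording.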

One shortcoming of the current implementation is that max-spk-fraction value applies to all recordings. It is conceivable that the user would like to adjust this value based on the number of speakers in a recording. To support that use case, we can instead accept a reco2max_spk_fraction_rspecifier. I went with the simple global option because I didn't want to add another file input for a rare use case. Let me know what you guys think.

@danpovey (Contributor) commented Jan 6, 2019

@david-ryan-snyder and @mmaciej2 please comment.

@david-ryan-snyder (Contributor) commented:

@dogancan, is there any existing recipe that benefits from this change? E.g., the callhome_diarization recipe? Generally we prefer to add functionality when there's a recipe that uses it. If there's no recipe that benefits from this, can you spend some more time explaining the scenarios where this option is helpful?

> This is useful when you know the number of speakers in a recording and you want to make sure no speaker cluster gets more than a predefined fraction of utterances.

Let's say there's a recording with 2 speakers, and so we set the reco2num_spks to be 2 for that recording. Let's say that in reality, one speaker speaks about 80% of the time, and the other speaker speaks the remaining 20% of the time. Then, we set max_spk_fraction to equal, let's say 50%. Are we likely to end up with 3 clusters? Or 2 clusters with an equal amount of speech in each one (even though that is likely to be the wrong thing to do)?

        std::vector<int32> *assignments_out)
        : count_(0), costs_(costs), thresh_(thresh), min_clust_(min_clust),
          assignments_(assignments_out) {
      num_clusters_ = costs.NumRows();
      num_points_ = costs.NumRows();
      max_cluster_size_ = int(num_points_ * max_cluster_fraction);
@mmaciej2 (Contributor) commented on this diff:

I believe this needs to be rounded. Unless I'm mistaken, int() rounds towards zero by default, so for example if we have 5 points, a 0.5 max_cluster_fraction, and 2 speakers, the fraction will pass the check of 0.5 >= 1/2, but the max cluster size will end up being 2, and so the clustering algorithm will never be able to produce 2 clusters, as one would have to be of size 3.

@dogancan (Contributor, Author) replied:

I think you are right. This needs to be rounded up.
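A sketch of the agreed fix, reusing the member names from the quoted diff (this is the change being discussed, not necessarily the code as merged):

    #include <cmath>  // for std::ceil

    // Round up so that min_clust_ clusters of size <= max_cluster_size_ can
    // always cover all points: with 5 points and a 0.5 fraction,
    // ceil(5 * 0.5) = 3 permits a 3 + 2 split into two clusters, whereas
    // truncating to 2 would make two clusters impossible (2 + 2 < 5).
    max_cluster_size_ = static_cast<int32>(
        std::ceil(num_points_ * max_cluster_fraction));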

@dogancan (Contributor, Author) commented Jan 6, 2019

I'm not sure if there is an existing recipe that would directly benefit from this. This option is helpful when you know the number of speakers in the recording and you want to limit the number of utterances assigned to the largest cluster. The idea is not to force each cluster to be roughly the same size but to prevent the clustering algorithm from assigning almost all utterances to the same cluster. This kind of catastrophic failure happens if there are some outlier utterances that form their own small clusters and force other speaker clusters to be merged.

> Let's say there's a recording with 2 speakers, and so we set the reco2num_spks to be 2 for that recording. Let's say that in reality, one speaker speaks about 80% of the time, and the other speaker speaks the remaining 20% of the time. Then, we set max_spk_fraction to equal, let's say 50%. Are we likely to end up with 3 clusters? Or 2 clusters with an equal amount of speech in each one (even though that is likely to be the wrong thing to do)?

This option should not affect the number of clusters produced. I mean, that is the intention. I did not realize the code could produce a different number of clusters until Matthew pointed out the rounding bug. I'll fix that in a bit. In the scenario you are describing, you would end up with 2 clusters with an equal amount of speech, but that is not how the fraction should be set. If you believe one speaker might speak about 80% of the time, then you should set the fraction to something larger than that. This fraction is supposed to act as an upper limit beyond which clusters should not be merged. For instance, in this specific scenario you can set the max-spk-fraction to 85% to ensure that no cluster grows beyond that. In general this fraction should be set to a value in the middle of the range [1.0/num_spk, 1.0] rather than a value close to the extremes.

@david-ryan-snyder (Contributor) commented:

OK, thanks for the explanation... It would be nice to see if this can help in something like the callhome_diarization recipe. In that recipe, we evaluate both with a threshold for stopping and with a given number of speakers, specified by reco2num_spks. So this option has the potential to help in the latter case.

@dogancan (Contributor, Author) commented Jan 6, 2019

I think it would help as long as there are outliers in the data and a reasonable upper limit on cluster size can be estimated.

@danpovey (Contributor) commented Jan 6, 2019 via email

@david-ryan-snyder (Contributor) commented:

There are two recipes this can be tested on, dihard_2018 and callhome_diarization. @dogancan do you have access to the datasets to test either recipe?

@danpovey (Contributor) commented Jan 6, 2019

Maybe we can have @HuangZiliAndy test it out?

@dogancan (Contributor, Author) commented Jan 6, 2019

Unfortunately, I don't have access to either of those at the moment.

@mmaciej2 (Contributor) commented Jan 6, 2019

I've been thinking about this and I'm skeptical that this will ever "fix" an issue, but at best will do damage control.

In the case of only 2 speakers, if the normal clustering algorithm is throwing everything into one speaker due to outliers, you are just going to truncate your larger speaker to whatever proportion this parameter is set to and then assign the rest to the other speaker. I think that probably will result in a lower error rate but only because it would be doing some kind of naive smoothing of the diarization output.

The case of more than 2 speakers is a little more complicated, but I'm skeptical that it would even be possible to get the "smoothed" performance gains of the 2-speaker case with any reasonable frequency. If there are outliers causing a cluster to erroneously absorb other speakers, and you put a limit on how much of the recording that cluster can absorb, I think once it hits the limit some other cluster will just start erroneously absorbing other speakers instead. And I think tuning the threshold to prevent that would require a value so low that it would interfere with the actual diarization output.

This is all just a hypothesis, of course, so I'm not arguing that this shouldn't be tried out on some recipes to see if it can improve performance. I'm just skeptical this can improve diarization performance with 3+ speakers, and I think any gains with 2 speakers won't be particularly meaningful.

@dogancan (Contributor, Author) commented Jan 6, 2019

> I've been thinking about this and I'm skeptical that this will ever "fix" an issue, but at best will do damage control.

I agree.

> I'm just skeptical this can improve diarization performance with 3+ speakers, and I think any gains with 2 speakers won't be particularly meaningful.

I think gains in the case of 2 speakers will be statistically significant as long as there are outliers and a reasonable upper limit can be estimated. That is what I am seeing on my end. But you agree that setting this to a large enough value, e.g. in the range [0.8, 1.0], won't make things worse, right?

@HuangZiliAndy (Contributor) commented Jan 6, 2019 via email

@dogancan (Contributor, Author) commented Jan 6, 2019

Having thought a bit more, setting max cluster size might cause the algorithm to produce more clusters than what is provided in the min_clust argument. For instance, in the case of 2 speakers, if max-cluster-fraction is set to 0.6 and there are currently 3 clusters of roughly equal size, they will not be merged into 2 clusters. I am not sure whether it makes sense to try to avoid that situation by tweaking the algorithm though. Maybe we can print a warning to let the user know so they can set a larger max-cluster-fraction or increase the number of speakers. I think it is best to think of this option as another knob that can be used to determine the number of clusters.
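A small illustration of that situation (CanMerge is an invented helper, not code from the PR; it just captures the merge guard implied above):

    // A pair of clusters may only be merged if the combined size stays
    // within the cap implied by max-cluster-fraction.
    bool CanMerge(int size_a, int size_b, int max_cluster_size) {
      return size_a + size_b <= max_cluster_size;
    }

    // With 100 points and max-cluster-fraction = 0.6 the cap is 60, so three
    // clusters of sizes 34, 33, and 33 can never merge (33 + 33 = 66 > 60),
    // and clustering stops at 3 clusters even if min_clust asks for 2.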

@mmaciej2 (Contributor) commented Jan 6, 2019

> I think gains in the case of 2 speakers will be statistically significant as long as there are outliers and a reasonable upper limit can be estimated. That is what I am seeing on my end. But you agree that setting this to a large enough value, e.g. in the range [0.8, 1.0], won't make things worse, right?

I am not saying the gains won't be statistically significant, just that they won't be terribly meaningful. It's essentially a mechanism that tries to identify degenerate output and, if it finds it, says "give me the X% of the recording that is the most similar and treat that as one speaker". I think that's a perfectly reasonable thing to do, and it could even be useful in certain applications (for example speaker ID, where even if you can't find the secondary speaker, it might be a good idea to minimize the contamination of the dominant speaker).

I do agree that if the parameter is set to a large enough value, it is not likely to make things worse. Current diarization performance tends to break down when one of the speakers contributes very little speech, and this is particularly true when clustering to oracle number of speakers. If there is a case where one person was speaking 90% of the time and our threshold is set to 0.85, I think it's unlikely the performance without the threshold was particularly good to begin with.

@dogancan (Contributor, Author) commented Jan 6, 2019

Ok, we are in complete agreement on this :)

@HuangZiliAndy (Contributor) commented:

Hi all! I ran the experiments on the Callhome dataset, and the DER increased. https://docs.google.com/document/d/1Z23ZMWqy1Rxw-H3MaBQIKVFbozkTAmZtb3UfW3-cQhs/edit?usp=sharing

I think limiting the max cluster size might help in situations where the amount of speech from each speaker is fairly even. However, it requires too much prior knowledge, and there are clearly cases where such an operation will harm the prediction.

Best,
Zili

@david-ryan-snyder (Contributor) commented:

Thanks Zili! Any chance you have a dihard_2018 setup lying around that you could quickly test this on? If you only have one of the systems trained (e.g., x-vector) that's OK.

@HuangZiliAndy (Contributor) commented:

Hi David! Dihard 2018 results updated.

https://docs.google.com/document/d/1Z23ZMWqy1Rxw-H3MaBQIKVFbozkTAmZtb3UfW3-cQhs/edit?usp=sharing

The x-vector system improves a little when max_spk_fraction = 0.9, while i-vector performance declined. Thanks!

@david-ryan-snyder (Contributor) commented:

Thanks a lot @HuangZiliAndy for running these experiments!

In my opinion, unless we can show a tangible benefit in the existing examples, I don't think we should merge this feature.

My suggestion is to leave this unmerged for the time being, but keep it open for those who might want to try it on their setups. We might also decide to come back to this PR if a recipe is created that looks like it could benefit from this.

@dogancan (Contributor, Author) commented Jan 7, 2019

Thanks a lot Zili!

Having seen these results, I kind of agree with David. This functionality was helpful when dealing with some out-of-domain data I was diarizing, but I guess it is not really helpful when the system is well matched to the data.

@david-ryan-snyder (Contributor) commented:

@danpovey I believe this PR can be closed, since we merged PR #3058 which I think does the same thing.

@danpovey closed this on Dec 7, 2019