
Use Silero VAD in Batched Mode #936

Open · MahmoudAshraf97 wants to merge 19 commits into master

Conversation

@MahmoudAshraf97 (Contributor) commented Jul 28, 2024

This PR tries to close the gap between Batched and Sequential versions
Summary of Changes:

  • Reimplemented the Silero model; inference is now 3x faster
  • The batched pipeline now uses Silero instead of pyannote VAD, which reduces the amount of code in the repo needed to handle two VAD models
  • Added a script to evaluate WER on the YouTube Commons ASR dataset (useful for long-form and batched evaluation)
  • Unified the batched and sequential transcribe functions as much as I could

WER Comparisons

Batched (without_timestamps=True, vad_filter=True, chunk_length=25) on YouTube Commons using distil-large-v3:
Before: WER: 13.910
After: WER: 13.712

VAD parameters are not completely tuned, but I don't have the resources to evaluate on multilingual datasets; contributions are welcome.
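For anyone who wants to reproduce these numbers, a minimal sketch of the evaluated configuration (the pipeline and parameter names follow the existing batched API and are assumptions; they may differ slightly in this PR):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

# distil-large-v3 on GPU, batched pipeline with the settings listed above
model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")
pipeline = BatchedInferencePipeline(model=model)

segments, info = pipeline.transcribe(
    "audio.wav",
    without_timestamps=True,
    vad_filter=True,
    chunk_length=25,
)
text = " ".join(segment.text for segment in segments)
```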

@hoonlight (Contributor) commented Jul 29, 2024

When I used the batch version, I got better transcription results compared to the sequential version. I'm not sure if this is due to pyannote VAD or if there is an additional process in the batch version that improves WER. Have you ever compared Silero VAD with pyannote VAD?

By the way, thank you for your contribution to improving faster-whisper. Even though it was a well-discussed and approved PR, anyone is entitled to have their opinion about it, but no one has the right to be rude.

@Jiltseb (Contributor) commented Jul 29, 2024

> When I used the batch version, I got better transcription results compared to the sequential version. I'm not sure if this is due to pyannote VAD or if there is an additional process in the batch version that improves WER. Have you ever compared Silero VAD with pyannote VAD?
>
> By the way, thank you for your contribution to improving faster-whisper. Even though it was a well-discussed and approved PR, anyone is entitled to have their opinion about it, but no one has the right to be rude.

It is indeed possible to get better results for long-form transcription in batched mode. This is because there is no context passing between batches, which prevents ambiguous text from a previous context from leaking into the next set of frames.

Thanks for your kind words regarding the batched PR.

@MahmoudAshraf97 I would suggest adding the numbers with pyannote VAD and silero VAD (WER and the speed-up) for completeness.

@MahmoudAshraf97 MahmoudAshraf97 marked this pull request as draft July 29, 2024 11:33
@zh-plus (Contributor) commented Jul 29, 2024

Is it better to let users choose between pyannote VAD and Silero VAD?

I get better VAD segments for Chinese and Japanese audio with pyannote than with Silero, even though I have tried hard to tune the VAD-related parameters for Silero.

Other users have also encountered this kind of issue: #925, #934.

@Jiltseb (Contributor) commented Jul 29, 2024

> Is it better to let users choose between pyannote VAD and Silero VAD?
>
> I get better VAD segments for Chinese and Japanese audio with pyannote than with Silero, even though I have tried hard to tune the VAD-related parameters for Silero.
>
> Other users have also encountered this kind of issue: #925, #934.

The pyannote model may be the superior VAD, but the extra dependency on pyannote and torch is a concern at the moment.

@MahmoudAshraf97 (Contributor Author)

@zh-plus it can be an option of course, but keeping pyannote would force us to keep PyTorch in the requirements, which we are trying to remove based on user feedback. I'm trying to think of a structure that makes the whole batching feature optional, with optional dependencies for those who want it.
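For example, a hypothetical sketch of how that could look in setup.py (the extra name and package list are purely illustrative, not a decision made in this PR):

```python
# setup.py sketch: batching-related packages become an optional extra,
# installed with `pip install faster-whisper[batched]`.
from setuptools import setup, find_packages

setup(
    name="faster-whisper",
    packages=find_packages(),
    install_requires=[
        "ctranslate2",
        "tokenizers",
        "onnxruntime",
    ],
    extras_require={
        # only pulled in by users who want the batched pipeline
        "batched": ["torch", "torchaudio", "pyannote.audio"],
    },
)
```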

@MahmoudAshraf97 (Contributor Author)

Performance numbers added. Tests are passing locally but failing on CI because torchaudio can't find a backend to use, since the backends are no longer installed after the removal of pyannote (along with 78 other packages, so I guess it's a win).
This PR should not be merged until we do one of the following:

  1. Include soundfile or sox in the requirements as a backend to torchaudio
  2. Revert to using PyAV and use manual garbage collection to avoid the resampler memory leak if needed; this would bring us one step closer to removing torch completely (see the sketch after this list)
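For option 2, a rough sketch of what PyAV decoding with explicit cleanup of the resampler might look like (an illustration of the idea only, not the PR's actual code; faster-whisper's earlier decode_audio was PyAV-based):

```python
import gc

import av
import numpy as np


def decode_audio(path, sampling_rate=16000):
    """Decode an audio file to a mono float32 waveform with PyAV."""
    resampler = av.AudioResampler(format="s16", layout="mono", rate=sampling_rate)
    raw = []
    with av.open(path, metadata_errors="ignore") as container:
        for frame in container.decode(audio=0):
            frame.pts = None
            for resampled in resampler.resample(frame):
                raw.append(resampled.to_ndarray())
    # Explicitly drop the resampler and force a collection to work around the
    # reported memory leak when many files are decoded in one process.
    del resampler
    gc.collect()
    audio = np.concatenate(raw, axis=1).flatten()
    return audio.astype(np.float32) / 32768.0
```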

@MahmoudAshraf97 MahmoudAshraf97 marked this pull request as ready for review July 30, 2024 18:31
* add onnx files to manifest
* change `merge_segments` to use audio indices directly
@ozancaglayan (Contributor) commented Aug 13, 2024

Thanks for the PR!

Could you add the script that exports the Silero V5 model to encoder and decoder ONNX files? Also, why does splitting the model into two ONNX sessions help performance?

min_speech_duration_ms: int = 250    # previous default
onset: float = 0.5                   # new
offset: float = onset - 0.15         # new
min_speech_duration_ms: int = 0      # new default
Review comment (Contributor):

Can you maybe leave these options (threshold, onset, offset) as they were, i.e. not rename them, as it would break signature and parameter-passing APIs?

Why are you changing min_speech_duration_ms to 0? I think 250 ms is a sane default; otherwise you may end up with segments that are too small to contain speech, maybe even empty ones.

Reply (Contributor Author):

It's best to give users the freedom to tune the parameters as they wish. Previously, offset was fixed to threshold - 0.15, but now users can tune it without having to touch the code internals. It might not be backwards compatible, but it's a very minimal change to adapt to.

As for min_speech_duration_ms, benchmarks (YT Commons and LibriSpeech) showed that dropping it from 250 to 0 had a minimal positive effect or none on sequential inference, but it had a very positive impact on batched inference, because batched mode combines segments differently than the sequential one.
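To illustrate, a hypothetical call with the renamed fields (assuming the defaults land as shown in the diff above):

```python
from faster_whisper.vad import VadOptions

# Explicit offset instead of the old hard-coded `threshold - 0.15`;
# min_speech_duration_ms left at the new default of 0.
vad_options = VadOptions(onset=0.5, offset=0.35, min_speech_duration_ms=0)
```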

@MahmoudAshraf97 (Contributor Author)

> Thanks for the PR!
>
> Could you add the script that exports the Silero V5 model to encoder and decoder ONNX files? Also, why does splitting the model into two ONNX sessions help performance?

V4
V5

As for the reason: Silero models in general require the output of the previous sample to produce a correct output for the next sample, but that previous state is only needed in the decoder stage, which makes up a small share of the total computation cost. By splitting the model into an encoder and a decoder and batching the input to the encoder only, we gain a 3x speedup while still generating identical outputs.
For more details, check this discussion in the original repo.
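A rough sketch of the idea (file names, input/output names, and the state shape are illustrative, not the exact ones produced by the export script):

```python
import numpy as np
import onnxruntime as ort

encoder = ort.InferenceSession("silero_encoder.onnx")
decoder = ort.InferenceSession("silero_decoder.onnx")

# chunks: (num_windows, window_size) float32 audio windows from one file
chunks = np.random.randn(64, 512).astype(np.float32)

# The stateless encoder processes every window in a single batched call.
features = encoder.run(None, {"input": chunks})[0]

# The stateful decoder is cheap, so it runs window by window, carrying the
# recurrent state from one window to the next.
state = np.zeros((2, 1, 128), dtype=np.float32)
speech_probs = []
for feat in features:
    prob, state = decoder.run(None, {"input": feat[None], "state": state})
    speech_probs.append(float(prob.squeeze()))
```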

@ozancaglayan (Contributor) commented Aug 14, 2024

Thanks. Out of curiosity, did you find those reference implementations elsewhere, did you rewrite them based on the JIT'ed code, or is there a way to generate them automatically from JIT'ed models?

PS: OK, I think you can get the compiled graph from the .code variable, but that does not seem to be a pure Python implementation.

@MahmoudAshraf97 (Contributor Author)

> Thanks. Out of curiosity, did you find those reference implementations elsewhere, did you rewrite them based on the JIT'ed code, or is there a way to generate them automatically from JIT'ed models?
>
> PS: OK, I think you can get the compiled graph from the .code variable, but that does not seem to be a pure Python implementation.

I reimplemented it from scratch based on what I could understand from the JIT'ed code and mapped the weights manually using the state dictionary; both implementations are within 1e-5 tolerance of the original.
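The check could look roughly like this (SileroReimplementation and NAME_MAP are placeholders for the from-scratch model and the manual weight mapping, not names from this PR):

```python
import torch

jit_model = torch.jit.load("silero_vad.jit").eval()
jit_state = jit_model.state_dict()

# Map each parameter of the reimplementation onto its name in the JIT state dict.
my_model = SileroReimplementation()
my_model.load_state_dict(
    {my_name: jit_state[jit_name] for my_name, jit_name in NAME_MAP.items()}
)
my_model.eval()

audio = torch.randn(1, 512)
with torch.no_grad():
    reference = jit_model(audio, 16000)  # Silero's JIT forward takes (audio, sample_rate)
    output = my_model(audio)
assert torch.allclose(reference, output, atol=1e-5)
```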

@MahmoudAshraf97 (Contributor Author)

> Performance numbers added. Tests are passing locally but failing on CI because torchaudio can't find a backend to use, since the backends are no longer installed after the removal of pyannote (along with 78 other packages, so I guess it's a win). This PR should not be merged until we do one of the following:
>
>   1. Include soundfile or sox in the requirements as a backend to torchaudio
>   2. Revert to using PyAV and use manual garbage collection to avoid the resampler memory leak if needed; this would bring us one step closer to removing torch completely

Reverted back to PyAV in #961; once that PR is merged and then this one, we can get rid of the torch dependency.

@Jiltseb (Contributor) commented Aug 14, 2024

> Performance numbers added. Tests are passing locally but failing on CI because torchaudio can't find a backend to use, since the backends are no longer installed after the removal of pyannote (along with 78 other packages, so I guess it's a win). This PR should not be merged until we do one of the following:
>
>   1. Include soundfile or sox in the requirements as a backend to torchaudio
>   2. Revert to using PyAV and use manual garbage collection to avoid the resampler memory leak if needed; this would bring us one step closer to removing torch completely
>
> Reverted back to PyAV in #961; once that PR is merged and then this one, we can get rid of the torch dependency.

Nice. I have also re-implemented a numpy version to get rid of the torch dependency, but I will stick with this approach so that torch is removed in two steps. I will test the memory leakage and report in #961.

@MahmoudAshraf97 MahmoudAshraf97 changed the title Use Silero VAD in Batched Mode, Other Vad refactors in Sequential mode Use Silero VAD in Batched Mode Aug 20, 2024
@kenho211

Encountered another error for audio without speech, not the same one as in #973:
File "/home/ubuntu/.local/lib/python3.10/site-packages/faster_whisper/transcribe.py", line 362, in transcribe clip_timestamps = merge_segments(active_segments, vad_parameters) File "/home/ubuntu/.local/lib/python3.10/site-packages/faster_whisper/vad.py", line 315, in merge_segments curr_start = segments_list[0]["start"] IndexError: list index out of range

Can we just return an empty list in merge_segments if segment_list is empty?
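Something along these lines at the top of merge_segments would do it (a sketch, not necessarily the exact fix that was pushed):

```python
def merge_segments(segments_list, vad_options):
    # Audio with no detected speech yields an empty list; return early
    # instead of indexing segments_list[0] and raising IndexError.
    if not segments_list:
        return []
    curr_start = segments_list[0]["start"]
    ...
```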

@MahmoudAshraf97 (Contributor Author)

should be fixed now

@hobodrifterdavid commented Sep 8, 2024

Hi. I'm running a lot of audio through the batch transcribe function on this PR, getting a couple of exceptions on some files:

[two screenshots of exception tracebacks attached]

Appreciate the work guys.

@MahmoudAshraf97 (Contributor Author)

@hobodrifterdavid can you upload audio files that reproduce the two exceptions?

@hobodrifterdavid commented Sep 8, 2024

I don't have the clips on hand. I just added a check to make sure the audio clips I am sending are at least 5 s long (it's possible I was requesting transcription of some zero-length files), and I will improve the logging to record what is being processed when an error occurs. I'll let you know if I see the error again.

If the passed audio data has zero length, it might be wise to throw a specific error up front, e.g. 'Passed audio is zero samples long', if you don't already.
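For instance, a hypothetical up-front guard in the transcribe entry point:

```python
import numpy as np


def validate_audio(audio: np.ndarray) -> None:
    # Fail fast with a clear message instead of a confusing downstream error.
    if audio.size == 0:
        raise ValueError("Passed audio is zero samples long.")
```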

@carolinaxxxxx

@MahmoudAshraf97 good job, sir. However, I have a problem: it does not respect the --hotwords option at all, in either standard inference or batched mode. I tested on different materials. In the "old" version, before batched mode was introduced, --hotwords worked very well. I'm using your fork now.
