Another strong reduction of concatenations for a small optimization (-4% inference time) #1130
Reorganize embeddings and add some padding in such a way that we have only one call to concat (and no stack operation). The padding is preallocated to limit memory allocation. This gets a constant 1-second inference-time reduction on CoNLL 2003 on a 2080 Ti (25 s -> 24 s).
It had no measurable effect on my French dataset.
Good thing: the code is easier to read :-)
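A minimal sketch of the idea, in numpy for illustration (the real change operates on PyTorch tensors; names like `pad_with_single_concat` are made up for this example). Instead of stacking each sentence's token embeddings and padding it separately, we interleave views of one preallocated zero buffer and issue a single concatenate:

```python
import numpy as np

def pad_with_single_concat(sentences, max_len, dim, pad_buffer):
    """Pad variable-length sentence embeddings with ONE concat call.

    sentences: list of (length, dim) arrays of token embeddings.
    pad_buffer: preallocated (max_len, dim) zero array, reused across
    calls so no new padding memory is allocated per batch.
    """
    pieces = []
    for tokens in sentences:
        pieces.append(tokens)
        # slicing the buffer yields a view, not a copy
        pieces.append(pad_buffer[: max_len - len(tokens)])
    # single concatenate replaces per-sentence stack + pad operations
    flat = np.concatenate(pieces, axis=0)
    return flat.reshape(len(sentences), max_len, dim)

# usage: two sentences of lengths 2 and 3, embedding dim 4
pad = np.zeros((3, 4), dtype=np.float32)
batch = pad_with_single_concat(
    [np.ones((2, 4), dtype=np.float32), np.ones((3, 4), dtype=np.float32)],
    max_len=3, dim=4, pad_buffer=pad,
)
print(batch.shape)  # (2, 3, 4)
```

The same pattern works with `torch.cat` on GPU tensors, where collapsing many small stack/concat kernel launches into one is what saves the time.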
Nb: not related, but I made a mistake in my measurements... unfortunately memory transfer is not the main remaining bottleneck; I was actually measuring a synchronization operation that happened before the memory transfer. Now I set
CUDA_LAUNCH_BLOCKING=1
before running cProfile, instead of trying to be smart and calling synchronize manually here and there... Another thing: small functions called millions of times appear slower than they really are under the profiler.
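A sketch of that profiling setup (the `forward_pass` stand-in is hypothetical; in practice it would be the model inference call). Because CUDA kernel launches are asynchronous by default, a profiler otherwise bills GPU time to whatever later call happens to synchronize, such as a device-to-host transfer; setting `CUDA_LAUNCH_BLOCKING=1` before any CUDA work makes each launch block, so cProfile attributes time to the call that actually caused it:

```python
import cProfile
import os
import pstats

# Must be set before the first CUDA call takes effect; with it, every
# kernel launch blocks until completion, so cProfile timings land on
# the Python call that launched the kernel rather than on a later sync.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

def forward_pass():
    # stand-in for the model inference being profiled
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
result = profiler.runcall(forward_pass)

# print the few most expensive calls by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(3)
```

Equivalently, from the shell: `CUDA_LAUNCH_BLOCKING=1 python -m cProfile -s cumtime script.py`.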