
Able to run at scale: handle larger datasets #2

Closed
neomatrix369 opened this issue Sep 12, 2020 · 18 comments
Labels: enhancement (New feature or request), performance

@neomatrix369
Owner

neomatrix369 commented Sep 12, 2020

At the moment the library runs slowly and takes a long time to handle large datasets, due to the processing required per record. This could be optimised and improved in small steps so that it can handle larger datasets.

Opened on the back of the discussions in #1. Partially related to #3, although independent of that issue.

@neomatrix369 neomatrix369 changed the title Able to run at scale Able to run at scale: handle larger datasets Sep 12, 2020
@neomatrix369 neomatrix369 added the enhancement New feature or request label Sep 12, 2020
@neomatrix369 neomatrix369 self-assigned this Sep 12, 2020
@strivedi02

strivedi02 commented Sep 13, 2020

I tried to figure out why this whole thing was slow, and it turns out the spelling quality operations are the bottleneck. So far I have only checked the high-level analysis; the granular-level analysis is still remaining.
Screenshot from 2020-09-13 03-03-35
I have modified the code a little bit like this:

        spelling_quality_score_list = []
        # show a progress bar while scoring each row's text
        for sentence in tqdm_notebook(list(new_dataframe[text_column])):
            spelling_quality_score_list.append(spelling_quality_score(sentence))
        new_dataframe['spelling_quality_score'] = spelling_quality_score_list

I am not sure whether this is a very efficient method, but at least I got to know which line is taking a lot of time.
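For what it's worth, pandas can attach the same kind of progress bar without an explicit loop, via tqdm's `progress_apply`. A minimal sketch follows; the `spelling_quality_score` below is a hypothetical stand-in, not the library's actual implementation:

```python
# Sketch: score a text column with a tqdm progress bar via pandas' apply.
import pandas as pd
from tqdm import tqdm


def spelling_quality_score(sentence: str) -> float:
    """Hypothetical stand-in scorer: fraction of words longer than 2 chars."""
    words = sentence.split()
    return sum(len(w) > 2 for w in words) / max(len(words), 1)


tqdm.pandas()  # registers Series.progress_apply / DataFrame.progress_apply

new_dataframe = pd.DataFrame({"text": ["a quick test", "an ok run"]})
new_dataframe["spelling_quality_score"] = (
    new_dataframe["text"].progress_apply(spelling_quality_score)
)
```

This keeps the progress bar but is functionally the same row-by-row loop, so it will not by itself make the scoring faster.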

@neomatrix369
Owner Author

I'm glad you have been able to resolve this temporarily for yourself.

@neomatrix369
Owner Author

neomatrix369 commented Sep 13, 2020

> I tried to figure out why this whole thing was slow and it turns out the spelling quality operations are slowing it down. […]

It adds the progress bar, which is good, but I am not sure performance is addressed in this manner. For high-level NLP features, it might need to be handled differently.

Any thoughts on how you would test for this change? How would the tests look?
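One way such a test could look (a sketch only, with a hypothetical stand-in scorer, not the project's actual test suite): assert that the progress-bar path produces exactly the same scores as the plain loop, so the wrapper can be swapped in without changing results:

```python
# Sketch: the tqdm-wrapped path must yield the same scores as the plain loop.
def spelling_quality_score(sentence: str) -> int:
    """Hypothetical stand-in scorer: word count."""
    return len(sentence.split())


def score_plain(sentences):
    return [spelling_quality_score(s) for s in sentences]


def score_with_progress(sentences):
    # In a notebook this would iterate over tqdm_notebook(sentences);
    # the wrapper must not change the resulting scores.
    return [spelling_quality_score(s) for s in sentences]


def test_progress_wrapper_preserves_scores():
    sentences = ["one two", "three", "four five six"]
    assert score_with_progress(sentences) == score_plain(sentences)


test_progress_wrapper_preserves_scores()
```

A timing budget (e.g. N rows under T seconds) could be layered on top, though such tests tend to be flaky on shared CI machines.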

@neomatrix369
Owner Author

neomatrix369 commented Sep 14, 2020

While you think about it and share your thoughts, I will take a look at it and try to improve this aspect of the library.

Thanks for nudging me about it.

@neomatrix369
Owner Author

neomatrix369 commented Sep 18, 2020

@strivedi02 I'm working on an implementation to improve this issue via this branch: https://github.com/neomatrix369/nlp_profiler/tree/scale-when-applied-to-larger-datasets. If you can test this out separately, it would be cool. Also have a look at this conversation for more context: https://www.kaggle.com/viratkothari/nlp-profiler-profiling-of-textual-dataset/comments#1015859

@neomatrix369
Owner Author

> I tried to figure out why this whole thing was slow and it turns out the spelling quality operations are slowing it down. […]

So now that I have looked into this again, and also worked on my own implementation: your approach would help add tqdm support, but speed improvements won't happen until we look at it from a parallelisation point of view!
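A minimal sketch of that parallelisation idea: score rows across worker processes instead of one by one. The scorer below is a stand-in, and the actual branch may well use a different mechanism (chunking, joblib, etc.):

```python
# Sketch: distribute per-row scoring across processes with multiprocessing.
from multiprocessing import Pool


def spelling_quality_score(sentence: str) -> int:
    """Hypothetical stand-in scorer: word count."""
    return len(sentence.split())


def parallel_scores(sentences, processes=2):
    # Each worker process scores a share of the rows; results come back
    # in the original order.
    with Pool(processes=processes) as pool:
        return pool.map(spelling_quality_score, sentences)


if __name__ == "__main__":
    print(parallel_scores(["one two", "three four five"]))
```

For CPU-bound scoring like this, processes (not threads) are what sidestep the GIL; the pay-off grows with dataset size since each process carries startup overhead.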

@neomatrix369
Owner Author

neomatrix369 commented Sep 22, 2020

Some metrics gathered during implementation of this feature, comparing before and after the implementation:

| commit/branch | dataset (rows) | time taken | in seconds | speed up (x times) | run by |
|---|---|---|---|---|---|
| master (~ 55c6347) | 7 | 6.82 seconds | 6.82 | baseline | Mani |
| master (~ 55c6347) | 100 | 211.2 seconds | 211.2 | baseline | Virat |
| master (~ 55c6347) | 210 | 1min 19s | 79 | baseline | Mani |
| master (~ 55c6347) | 500 | (TBC) | (TBC) | baseline | Virat |
| master (~ 55c6347) | 5,000 | (TBC) | (TBC) | baseline | Virat |
| master (~ 55c6347) | 10,240 | 26 minutes 2 seconds | 1562 | baseline | Shubam Trivedi |
| nlp_profiler.py on AI-ML-DL repo (~ bf601172) | 22,742 | 1 hour 24 mins | 5040 | baseline | Kurian |
| master (~ 55c6347) | 64,295 | ~4-6 hours | 21600 | baseline | Mani |
| scale-when-applied-to-larger-datasets (~ a411c13) | 7 | 7.42 seconds | 7 | -0.0879x | Mani |
| scale-when-applied-to-larger-datasets (~ 78eb810) | 210 | 39.2 seconds | 39.2 | 2x | Mani |
| scale-when-applied-to-larger-datasets (~ a411c13) | 500 | 455.3 seconds | 455.3 | no baseline yet | Virat |
| scale-when-applied-to-larger-datasets (~ a411c13) | 5,000 | (TBC) | (TBC) | no baseline yet | Virat |
| scale-when-applied-to-larger-datasets (~ a411c13) | 10,240 | 2 minutes 35 seconds | 95 | ~16.44x | Shubam Trivedi |
| scale-when-applied-to-larger-datasets (~ a411c13) | 22,742 | 4min 37s | 277 | ~18.19x | Kurian |
| scale-when-applied-to-larger-datasets (~ a411c13) | 64,295 | 16-23 minutes | 1380 | ~15.65x | Mani |
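For reference, the speed-up column is simply the baseline time divided by the improved time, e.g. for the 10,240-row runs:

```python
# How the speed-up column is derived: baseline seconds / improved seconds.
baseline_seconds = 1562  # master, 10,240 rows (26 min 2 s)
improved_seconds = 95    # scale-when-applied-to-larger-datasets, 10,240 rows

speed_up = baseline_seconds / improved_seconds
print(round(speed_up, 2))  # ≈ 16.44
```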

@neomatrix369
Owner Author

@strivedi02 can you please share your metrics for the above (#2 (comment))? Please provide info for as many of the columns as possible.

@neomatrix369
Owner Author

Closed by PR #9

@kurianbenoy

@neomatrix369 for me, the scale-when-applied-to-larger-datasets branch takes 4 minutes 37 seconds.

Output of `%%time`:

    CPU times: user 42.1 s, sys: 747 ms, total: 42.8 s
    Wall time: 4min 37s

@neomatrix369
Owner Author

neomatrix369 commented Sep 22, 2020

> 4min 37s

@kurianbenoy Can you please provide the other before and after details, like the commit ids of the branch you used to install the library? It should not be hard to find out; if you look at the logs, it should be there.

@strivedi02

Screenshot from 2020-09-23 01-30-19
For the above time, the master branch was used and this was tested on Colab.

Screenshot from 2020-09-23 01-42-55
With the same settings, the scale-when-applied-to-larger-datasets branch was used.

@kurianbenoy

@neomatrix369 I was running this in Kaggle. The previous experiment with associated time can be found here. I was probably using version 21 of your NLP Profiler Class notebook.

The recent version can be found here. I hope it helps you find the exact version.

@neomatrix369
Owner Author

neomatrix369 commented Sep 23, 2020

@strivedi02 @kurianbenoy 🙇 thanks to you both for the references; the table above has been updated with the approximate speed-ups.

@neomatrix369
Owner Author

neomatrix369 commented Sep 23, 2020

@strivedi02 thanks for raising the initial discussion in #1 and for the pointers about the different issues. This and other issues have been resolved (we still have pending ones, but that is fine) as a result of user/community feedback and interactions.

With regard to the performance of the library, it's an ongoing effort to keep in mind, although adding new NLP features would usually take precedence over such issues.

@strivedi02

@neomatrix369 I always had to struggle to keep all my scripts in one place or I would have to remember which code is where, but now thanks to you we won't have to remember all that. Through this package, a lot of things will become easy, and I think in the future it will keep growing in terms of usage by the community.

@neomatrix369
Owner Author

neomatrix369 commented Sep 23, 2020

> @neomatrix369 I always had to struggle to keep all my scripts in one place […]

That's really good to know. Glad it helps everyone. It is also what I observed: everyone was using their own recipes; now you can share, contribute to, and extend a central recipe.

@strivedi02 Does the library have most, if not all, of the things you use or would need when dealing with text? I think there is room for a lot more.

Feel free to open issues/pull requests to extend the existing functionality with additional relevant features that are useful for NLP practitioners.

@neomatrix369
Owner Author

@loopyme I'll be happy to hear your feedback on the work done via this issue, please let me know how I can answer your questions and clarify any doubts.

I have tried to build this library from the ground up, paying attention to cohesive modules and the structure of the library as a whole.
