
Able to run at scale: handle larger datasets #2

Closed
neomatrix369 opened this issue Sep 12, 2020 · 18 comments
Labels: enhancement (New feature or request), performance

@neomatrix369
Owner

neomatrix369 commented Sep 12, 2020

At the moment the library runs slowly and takes a long time to handle large datasets, due to the processing required per record. This could be optimised and improved in small steps so that it can handle larger datasets.

Opened on the back of the discussions in #1. Partially related to #3, although independent of that issue.

@neomatrix369 neomatrix369 changed the title Able to run at scale Able to run at scale: handle larger datasets Sep 12, 2020
@neomatrix369 neomatrix369 added the enhancement New feature or request label Sep 12, 2020
@neomatrix369 neomatrix369 self-assigned this Sep 12, 2020
@strivedi02

strivedi02 commented Sep 13, 2020

I tried to figure out why this whole thing was slow, and it turns out the spelling quality operations are the bottleneck. So far I have only checked the high-level analysis; the granular-level analysis is still remaining.
Screenshot from 2020-09-13 03-03-35
I have modified the code a little bit like this:

        spelling_quality_score_list = []
        # show a progress bar while scoring each row's text
        for sentence in tqdm_notebook(list(new_dataframe[text_column])):
            spelling_quality_score_list.append(spelling_quality_score(sentence))
        new_dataframe['spelling_quality_score'] = spelling_quality_score_list

I am not sure whether this is a very efficient method, but at least I got to know which line is taking a lot of time.
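For what it's worth, pandas can attach the same kind of progress bar without an explicit loop, via tqdm's `progress_apply`. A minimal sketch follows; the `spelling_quality_score` below is a hypothetical stand-in, not the library's actual implementation:

```python
# Sketch: score a text column with a tqdm progress bar via pandas' apply.
import pandas as pd
from tqdm import tqdm


def spelling_quality_score(sentence: str) -> float:
    """Hypothetical stand-in scorer: fraction of words longer than 2 chars."""
    words = sentence.split()
    return sum(len(w) > 2 for w in words) / max(len(words), 1)


tqdm.pandas()  # registers Series.progress_apply / DataFrame.progress_apply

new_dataframe = pd.DataFrame({"text": ["a quick test", "an ok run"]})
new_dataframe["spelling_quality_score"] = (
    new_dataframe["text"].progress_apply(spelling_quality_score)
)
```

This keeps the progress bar but is functionally the same row-by-row loop, so it will not by itself make the scoring faster.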

@neomatrix369
Owner Author

I'm glad you have been able to resolve this temporarily for yourself.

@neomatrix369
Owner Author

neomatrix369 commented Sep 13, 2020

> I tried to figure out why this whole thing was slow and it turns out the spelling quality operations are slowing it down. […]

It adds the progress bar, which is good, but I am not sure performance is addressed in this manner. For high-level NLP features, it might need to be handled differently.

Any thoughts on how you would test for this change? How would the tests look?
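One way such a test could look (a sketch only, with a hypothetical stand-in scorer, not the project's actual test suite): assert that the progress-bar path produces exactly the same scores as the plain loop, so the wrapper can be swapped in without changing results:

```python
# Sketch: the tqdm-wrapped path must yield the same scores as the plain loop.
def spelling_quality_score(sentence: str) -> int:
    """Hypothetical stand-in scorer: word count."""
    return len(sentence.split())


def score_plain(sentences):
    return [spelling_quality_score(s) for s in sentences]


def score_with_progress(sentences):
    # In a notebook this would iterate over tqdm_notebook(sentences);
    # the wrapper must not change the resulting scores.
    return [spelling_quality_score(s) for s in sentences]


def test_progress_wrapper_preserves_scores():
    sentences = ["one two", "three", "four five six"]
    assert score_with_progress(sentences) == score_plain(sentences)


test_progress_wrapper_preserves_scores()
```

A timing budget (e.g. N rows under T seconds) could be layered on top, though such tests tend to be flaky on shared CI machines.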

@neomatrix369
Owner Author

neomatrix369 commented Sep 14, 2020

While you think about it and share your thoughts, I will take a look at it and try to improve this aspect of the library.

Thanks for nudging me about it.

@neomatrix369
Owner Author

neomatrix369 commented Sep 18, 2020

@strivedi02 I'm working on an implementation to improve this issue via this branch: https://github.com/neomatrix369/nlp_profiler/tree/scale-when-applied-to-larger-datasets. If you can test this out separately, it would be cool. Also have a look at this conversation for more context: https://www.kaggle.com/viratkothari/nlp-profiler-profiling-of-textual-dataset/comments#1015859

@neomatrix369
Owner Author

> I tried to figure out why this whole thing was slow and it turns out the spelling quality operations are slowing it down. […]

So now that I have looked into this again, and also worked on my own implementation: your approach would help add tqdm support, but speed improvements won't happen until we look at it from a parallelisation point of view!
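A minimal sketch of that parallelisation idea: score rows across worker processes instead of one by one. The scorer below is a stand-in, and the actual branch may well use a different mechanism (chunking, joblib, etc.):

```python
# Sketch: distribute per-row scoring across processes with multiprocessing.
from multiprocessing import Pool


def spelling_quality_score(sentence: str) -> int:
    """Hypothetical stand-in scorer: word count."""
    return len(sentence.split())


def parallel_scores(sentences, processes=2):
    # Each worker process scores a share of the rows; results come back
    # in the original order.
    with Pool(processes=processes) as pool:
        return pool.map(spelling_quality_score, sentences)


if __name__ == "__main__":
    print(parallel_scores(["one two", "three four five"]))
```

For CPU-bound scoring like this, processes (not threads) are what sidestep the GIL; the pay-off grows with dataset size since each process carries startup overhead.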

@neomatrix369
Owner Author

neomatrix369 commented Sep 22, 2020

Some metrics gathered during implementation of this feature, comparing before and after the implementation:

| commit/branch | dataset (rows) | time taken | in seconds | speed up (x times) | run by |
|---|---|---|---|---|---|
| master (~ 55c6347) | 7 | 6.82 seconds | 6.82 | baseline | Mani |
| master (~ 55c6347) | 100 | 211.2 seconds | 211.2 | baseline | Virat |
| master (~ 55c6347) | 210 | 1min 19s | 79 | baseline | Mani |
| master (~ 55c6347) | 500 | (TBC) | (TBC) | baseline | Virat |
| master (~ 55c6347) | 5,000 | (TBC) | (TBC) | baseline | Virat |
| master (~ 55c6347) | 10,240 | 26 minutes 2 seconds | 1562 | baseline | Shubam Trivedi |
| nlp_profiler.py on AI-ML-DL repo (~ bf601172) | 22,742 | 1 hour 24 mins | 5040 | baseline | Kurian |
| master (~ 55c6347) | 64,295 | ~4-6 hours | 21600 | baseline | Mani |
| scale-when-applied-to-larger-datasets (~ a411c13) | 7 | 7.42 seconds | 7 | -0.0879x | Mani |
| scale-when-applied-to-larger-datasets (~ 78eb810) | 210 | 39.2 seconds | 39.2 | 2x | Mani |
| scale-when-applied-to-larger-datasets (~ a411c13) | 500 | 455.3 seconds | 455.3 | no baseline yet | Virat |
| scale-when-applied-to-larger-datasets (~ a411c13) | 5,000 | (TBC) | (TBC) | no baseline yet | Virat |
| scale-when-applied-to-larger-datasets (~ a411c13) | 10,240 | 2 minutes 35 seconds | 95 | ~16.44x | Shubam Trivedi |
| scale-when-applied-to-larger-datasets (~ a411c13) | 22,742 | 4min 37s | 277 | ~18.19x | Kurian |
| scale-when-applied-to-larger-datasets (~ a411c13) | 64,295 | 16-23 minutes | 1380 | ~15.65x | Mani |
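For reference, the speed-up column is simply the baseline time divided by the improved time, e.g. for the 10,240-row runs:

```python
# How the speed-up column is derived: baseline seconds / improved seconds.
baseline_seconds = 1562  # master, 10,240 rows (26 min 2 s)
improved_seconds = 95    # scale-when-applied-to-larger-datasets, 10,240 rows

speed_up = baseline_seconds / improved_seconds
print(round(speed_up, 2))  # ≈ 16.44
```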

@neomatrix369
Owner Author

@strivedi02 can you please share your metrics for the above (#2 (comment))? Please provide info for as many of the columns as possible.

@neomatrix369
Owner Author

Closed by PR #9

@kurianbenoy

@neomatrix369 for me, the scale-when-applied-to-larger-datasets branch takes 4 minutes 37 seconds.

Output of `%%time`:

    CPU times: user 42.1 s, sys: 747 ms, total: 42.8 s
    Wall time: 4min 37s

@neomatrix369
Owner Author

neomatrix369 commented Sep 22, 2020

> 4min 37s

@kurianbenoy Can you please provide the other before and after details, like the commit ids of the branch you used to install the library? It should not be hard to find out; if you look at the logs, it should be there.

@strivedi02

Screenshot from 2020-09-23 01-30-19
For the above time, the master branch was used and this was tested on Colab.

Screenshot from 2020-09-23 01-42-55
With the same settings, the scale-when-applied-to-larger-datasets branch was used.

@kurianbenoy

@neomatrix369 I was running this in Kaggle. The previous experiment with associated time can be found here. I was probably using version 21 of your NLP Profiler Class notebook.

The recent version can be found here. I hope it helps you find the exact version.

@neomatrix369
Owner Author

neomatrix369 commented Sep 23, 2020

@strivedi02 @kurianbenoy 🙇 thanks to you both for the references; the table above has been updated with the approximate speed-ups.

@neomatrix369
Owner Author

neomatrix369 commented Sep 23, 2020

@strivedi02 thanks for raising the initial discussion in #1 and for the pointers about the different issues. This and other issues have been resolved (we still have pending ones, but that is fine) as a result of user/community feedback and interactions.

With regard to the performance of the library, it's an ongoing effort to keep in mind, although adding new NLP features would usually take precedence over such issues.

@strivedi02

@neomatrix369 I always had to struggle to keep all my scripts in one place or I would have to remember which code is where, but now thanks to you we won't have to remember all that. Through this package, a lot of things will become easy, and I think in the future it will keep growing in terms of usage by the community.

@neomatrix369
Owner Author

neomatrix369 commented Sep 23, 2020

> @neomatrix369 I always had to struggle to keep all my scripts in one place […]

That's really good to know. Glad it helps everyone. It is also what I observed: everyone was using their own recipes; now you can share, contribute to, and extend a central recipe.

@strivedi02 Does the library have most, if not all, of the things you use or would need when dealing with text? I think there is room for a lot more.

Feel free to open issues/pull requests to extend the existing functionality with additional relevant features that are useful for NLP practitioners.

@neomatrix369
Owner Author

@loopyme I'll be happy to hear your feedback on the work done via this issue, please let me know how I can answer your questions and clarify any doubts.

I have tried to build this library from the ground up, paying attention to cohesive modules and the structure of the library as a whole.
