
Commit

Merge pull request #73 from neomatrix369/reformating-code-and-minor-fixes

Refactor: reformatting python code across all the source files
neomatrix369 committed Mar 13, 2023
2 parents a3538c6 + def1ee8 commit f9cb2e6
Showing 59 changed files with 1,103 additions and 1,174 deletions.
2 changes: 1 addition & 1 deletion .github/PULL_REQUEST_TEMPLATE.md
@@ -14,7 +14,7 @@ Please check the options that you have completed and strike-out the options that
- [ ] you have read
- [ ] the [Contributing doc](https://github.com/neomatrix369/nlp_profiler/blob/master/CONTRIBUTING.md)
- [ ] the [Developer Guide](https://github.com/neomatrix369/nlp_profiler/blob/master/developer-guide.md)
- [ ] the pull request passes the tests (`./test-coverage "tests slow-tests"`) - this will also be visible via the Code coverage report and CI/CD task on the Pull Request
- [ ] the pull request passes the tests (`./test-coverage.sh "tests slow-tests"`) - this will also be visible via the Code coverage report and CI/CD task on the Pull Request
- [ ] you have performed some kind of smoke test by running your changes in an isolated environment i.e. Docker container, Google Colab, Kaggle, etc...
- [ ] the notebooks are updated (see `notebooks` folder, read the [Notebooks](./notebooks/README.md) docs)
- [ ] [CHANGELOG.md](https://github.com/neomatrix369/nlp_profiler/blob/master/CHANGELOG.md) has been updated (please follow the existing format)
10 changes: 8 additions & 2 deletions .github/workflows/end-to-end-flow.yml
@@ -57,15 +57,17 @@ jobs:
- name: install-line-profiler-on-windows-python-3.7
run: |
### https://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console
### https://github.com/conda/conda/issues/7445#issuecomment-774579800
set PYTHONIOENCODING="utf-8"
set PYTHONLEGACYWINDOWSSTDIO="utf-8"
pip install win-unicode-console
python -m pip install line-profiler@https://github.com/neomatrix369/nlp_profiler/releases/download/v0.0.2-dev/line_profiler-3.2.6-cp37-cp37m-win_amd64.whl
if: matrix.python-version == '3.7' && matrix.os == 'windows-latest'

- name: install-line-profiler-on-windows-python-3.8
run: |
### https://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console
set PYTHONIOENCODING="utf-8"
### https://github.com/conda/conda/issues/7445#issuecomment-774579800
pip install win-unicode-console
python -m pip install line-profiler@https://github.com/neomatrix369/nlp_profiler/releases/download/v0.0.2-dev/line_profiler-3.2.6-cp38-cp38-win_amd64.whl
if: matrix.python-version == '3.8' && matrix.os == 'windows-latest'
@@ -76,8 +78,12 @@ jobs:
# Runs a set of commands using the runners shell
- name: run-test-coverage-shell-script
shell: bash
env:
PYTHONUTF8: 1
PYTHONIOENCODING: utf-8
PYTHONLEGACYWINDOWSSTDIO: utf-8
run: |
./test-coverage.sh "tests slow-tests"
./test-coverage.sh "tests slow-tests"
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v1
if: matrix.python-version == '3.8' && matrix.os == 'ubuntu-18.04'
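The `env:` block added to the test step exports `PYTHONUTF8` and `PYTHONIOENCODING` so that Windows runners encode stdio as UTF-8 instead of the console code page, which otherwise breaks non-ASCII output such as emoji counts. A hedged, standalone sketch of a guard a test runner could use; the `is_utf8` helper is a hypothetical name for illustration, not part of the repo:

```python
import os
import sys

def is_utf8(encoding: str) -> bool:
    """True when the reported stdio encoding is some spelling of UTF-8."""
    # Normalise "UTF-8", "utf-8", "utf8" etc. to a single comparable form.
    return (encoding or "").lower().replace("-", "") == "utf8"

if __name__ == "__main__":
    # With the workflow's env block applied, both of these should line up.
    print(f"PYTHONUTF8={os.environ.get('PYTHONUTF8', '<unset>')}")
    print(f"stdout encoding ok: {is_utf8(sys.stdout.encoding)}")
```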
20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -205,4 +205,24 @@ Replaced language tool with Gingerit for faster calculations

---

### GitHub branch `revert-76-sourcery/revert-71-spelling_check` Granular features: reverted change made to spell checks

Implemented functionality via PR [#75](https://github.com/neomatrix369/nlp_profiler/pull/75) - details described in the body of the PR.

Reverting spell check functionality as it is not tested and tests change/break with new implementation.

[2cddf51](https://github.com/neomatrix369/nlp_profiler/commit/2cddf51a605b434d12604d2dba9457c415808bfe) [@neomatrix369](https://github.com/neomatrix369) _Mon Mar 13 02:56:40 2023 +0000_

---

### GitHub branch `reformating-code-and-minor-fixes` Reformatting code, refactoring as per Sourcery, minor fixes and test fixes

Implemented functionality via PR [#73](https://github.com/neomatrix369/nlp_profiler/pull/73) - details described in the body of the PR.

Reformatting code, refactoring as per Sourcery, minor fixes and test fixes. Bringing back the build system in order. Fixes old regressed tests.

[7caeb47](https://github.com/neomatrix369/nlp_profiler/commit/7caeb47e3795a22c1884731320a02d4b0c39ca0c) [@neomatrix369](https://github.com/neomatrix369) _Mon Mar 13 11:23:49 2023 +0000_

---

Return to [README.md](README.md)
2 changes: 1 addition & 1 deletion developer-guide.md
@@ -66,7 +66,7 @@ pip install --prefix .
Run all the tests with coverage information using the below command after all packages have been successfully installed:

```bash
./test-coverage tests slow-tests
./test-coverage.sh tests slow-tests
```

On the tests passing (or partially passing), these folders will be created:
2 changes: 1 addition & 1 deletion nlp_profiler/__init__.py
@@ -1 +1 @@
__version__ = "0.0.3"
__version__ = "0.0.3"
90 changes: 45 additions & 45 deletions nlp_profiler/constants.py
@@ -1,64 +1,64 @@
DEFAULT_PARALLEL_METHOD = 'default'
DEFAULT_PARALLEL_METHOD = "default"

SWIFTER_METHOD = 'using_swifter'
SWIFTER_METHOD = "using_swifter"

GRANULAR_OPTION = 'granular'
HIGH_LEVEL_OPTION = 'high_level'
GRAMMAR_CHECK_OPTION = 'grammar_check'
SPELLING_CHECK_OPTION = 'spelling_check'
EASE_OF_READING_CHECK_OPTION = 'ease_of_reading_check'
PARALLELISATION_METHOD_OPTION = 'parallelisation_method'
GRANULAR_OPTION = "granular"
HIGH_LEVEL_OPTION = "high_level"
GRAMMAR_CHECK_OPTION = "grammar_check"
SPELLING_CHECK_OPTION = "spelling_check"
EASE_OF_READING_CHECK_OPTION = "ease_of_reading_check"
PARALLELISATION_METHOD_OPTION = "parallelisation_method"
NOT_APPLICABLE = "N/A"

NaN = float('nan')
NaN = float("nan")

# --- Columns generated
# High-level
## Grammar check
GRAMMAR_CHECK_COL = 'grammar_check'
GRAMMAR_CHECK_SCORE_COL = 'grammar_check_score'
GRAMMAR_CHECK_COL = "grammar_check"
GRAMMAR_CHECK_SCORE_COL = "grammar_check_score"

## Spelling check
SPELLING_QUALITY_SCORE_COL = 'spelling_quality_score'
SPELLING_QUALITY_COL = 'spelling_quality'
SPELLING_QUALITY_SUMMARISED_COL = 'spelling_quality_summarised'
SPELLING_QUALITY_SCORE_COL = "spelling_quality_score"
SPELLING_QUALITY_COL = "spelling_quality"
SPELLING_QUALITY_SUMMARISED_COL = "spelling_quality_summarised"

## Sentiment analysis
SENTIMENT_POLARITY_SCORE_COL = 'sentiment_polarity_score'
SENTIMENT_POLARITY_COL = 'sentiment_polarity'
SENTIMENT_POLARITY_SUMMARISED_COL = 'sentiment_polarity_summarised'
SENTIMENT_SUBJECTIVITY_SCORE_COL = 'sentiment_subjectivity_score'
SENTIMENT_SUBJECTIVITY_COL = 'sentiment_subjectivity'
SENTIMENT_SUBJECTIVITY_SUMMARISED_COL = 'sentiment_subjectivity_summarised'
SENTIMENT_POLARITY_SCORE_COL = "sentiment_polarity_score"
SENTIMENT_POLARITY_COL = "sentiment_polarity"
SENTIMENT_POLARITY_SUMMARISED_COL = "sentiment_polarity_summarised"
SENTIMENT_SUBJECTIVITY_SCORE_COL = "sentiment_subjectivity_score"
SENTIMENT_SUBJECTIVITY_COL = "sentiment_subjectivity"
SENTIMENT_SUBJECTIVITY_SUMMARISED_COL = "sentiment_subjectivity_summarised"

## Spelling check
EASE_OF_READING_SCORE_COL = 'ease_of_reading_score'
EASE_OF_READING_COL = 'ease_of_reading_quality'
EASE_OF_READING_SUMMARISED_COL = 'ease_of_reading_summarised'
EASE_OF_READING_SCORE_COL = "ease_of_reading_score"
EASE_OF_READING_COL = "ease_of_reading_quality"
EASE_OF_READING_SUMMARISED_COL = "ease_of_reading_summarised"

# ---
# Granular
DATES_COUNT_COL = 'dates_count'
STOP_WORDS_COUNT_COL = 'stop_words_count'
PUNCTUATIONS_COUNT_COL = 'punctuations_count'
DATES_COUNT_COL = "dates_count"
STOP_WORDS_COUNT_COL = "stop_words_count"
PUNCTUATIONS_COUNT_COL = "punctuations_count"
REPEATED_PUNCTUATIONS_COUNT_COL = "repeated_punctuations_count"
NON_ALPHA_NUMERIC_COUNT_COL = 'non_alpha_numeric_count'
ALPHA_NUMERIC_COUNT_COL = 'alpha_numeric_count'
REPEATED_LETTERS_COUNT_COL = 'repeated_letters_count'
NON_ALPHA_NUMERIC_COUNT_COL = "non_alpha_numeric_count"
ALPHA_NUMERIC_COUNT_COL = "alpha_numeric_count"
REPEATED_LETTERS_COUNT_COL = "repeated_letters_count"
REPEATED_DIGITS_COUNT_COL = "repeated_digits_count"
WHOLE_NUMBERS_COUNT_COL = 'whole_numbers_count'
EMOJI_COUNT_COL = 'emoji_count'
DUPLICATES_COUNT_COL = 'duplicates_count'
COUNT_WORDS_COL = 'count_words'
SPACES_COUNT_COL = 'spaces_count'
CHARS_EXCL_SPACES_COUNT_COL = 'chars_excl_spaces_count'
REPEATED_SPACES_COUNT_COL = 'repeated_spaces_count'
WHITESPACES_COUNT_COL = 'whitespaces_count'
CHARS_EXCL_WHITESPACES_COUNT_COL = 'chars_excl_whitespaces_count'
REPEATED_WHITESPACES_COUNT_COL = 'repeated_whitespaces_count'
CHARACTERS_COUNT_COL = 'characters_count'
ENGLISH_CHARACTERS_COUNT_COL = 'english_characters_count'
NON_ENGLISH_CHARACTERS_COUNT_COL = 'non_english_characters_count'
SYLLABLES_COUNT_COL = 'syllables_count'
SENTENCES_COUNT_COL = 'sentences_count'
NOUN_PHRASE_COUNT_COL = 'noun_phrase_count'
WHOLE_NUMBERS_COUNT_COL = "whole_numbers_count"
EMOJI_COUNT_COL = "emoji_count"
DUPLICATES_COUNT_COL = "duplicates_count"
COUNT_WORDS_COL = "count_words"
SPACES_COUNT_COL = "spaces_count"
CHARS_EXCL_SPACES_COUNT_COL = "chars_excl_spaces_count"
REPEATED_SPACES_COUNT_COL = "repeated_spaces_count"
WHITESPACES_COUNT_COL = "whitespaces_count"
CHARS_EXCL_WHITESPACES_COUNT_COL = "chars_excl_whitespaces_count"
REPEATED_WHITESPACES_COUNT_COL = "repeated_whitespaces_count"
CHARACTERS_COUNT_COL = "characters_count"
ENGLISH_CHARACTERS_COUNT_COL = "english_characters_count"
NON_ENGLISH_CHARACTERS_COUNT_COL = "non_english_characters_count"
SYLLABLES_COUNT_COL = "syllables_count"
SENTENCES_COUNT_COL = "sentences_count"
NOUN_PHRASE_COUNT_COL = "noun_phrase_count"
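The constants.py hunk only switches quote style, but these string constants are the option keys that `apply_text_profiling` (core.py, below) overlays on its defaults. A minimal sketch of that merge pattern, with key names copied from this diff; `merge_params` itself is a hypothetical helper, not a function in the repo:

```python
# Option keys, copied from the constants.py hunk above.
GRANULAR_OPTION = "granular"
SPELLING_CHECK_OPTION = "spelling_check"
PARALLELISATION_METHOD_OPTION = "parallelisation_method"
DEFAULT_PARALLEL_METHOD = "default"

def merge_params(params: dict = None) -> dict:
    """Overlay caller-supplied options on the library defaults."""
    defaults = {
        GRANULAR_OPTION: True,
        SPELLING_CHECK_OPTION: True,
        PARALLELISATION_METHOD_OPTION: DEFAULT_PARALLEL_METHOD,
    }
    defaults.update(params or {})  # caller's choices win
    return defaults
```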
45 changes: 23 additions & 22 deletions nlp_profiler/core.py
@@ -18,24 +18,29 @@

import pandas as pd

from nlp_profiler.constants import \
PARALLELISATION_METHOD_OPTION, DEFAULT_PARALLEL_METHOD, GRANULAR_OPTION, HIGH_LEVEL_OPTION, \
GRAMMAR_CHECK_OPTION, SPELLING_CHECK_OPTION, EASE_OF_READING_CHECK_OPTION
from nlp_profiler.constants import (
PARALLELISATION_METHOD_OPTION,
DEFAULT_PARALLEL_METHOD,
GRANULAR_OPTION,
HIGH_LEVEL_OPTION,
GRAMMAR_CHECK_OPTION,
SPELLING_CHECK_OPTION,
EASE_OF_READING_CHECK_OPTION,
)
from nlp_profiler.generate_features import get_progress_bar
from nlp_profiler.granular_features import apply_granular_features
from nlp_profiler.high_level_features import apply_high_level_features
from nlp_profiler.high_level_features.grammar_quality_check \
import apply_grammar_check
from nlp_profiler.high_level_features.spelling_quality_check \
import apply_spelling_check
from nlp_profiler.high_level_features.ease_of_reading_check \
import apply_ease_of_reading_check
from nlp_profiler.high_level_features.grammar_quality_check import apply_grammar_check
from nlp_profiler.high_level_features.spelling_quality_check import apply_spelling_check
from nlp_profiler.high_level_features.ease_of_reading_check import apply_ease_of_reading_check


def apply_text_profiling(dataframe: pd.DataFrame,
text_column: str,
params: dict = {}) -> pd.DataFrame:
columns_to_drop = list(set(dataframe.columns) - set([text_column]))
def apply_text_profiling(dataframe: pd.DataFrame, text_column: str, params: dict = None) -> pd.DataFrame:
if params is None:
params = {}

# sourcery skip: dict-assign-update-to-union
columns_to_drop = list(set(dataframe.columns) - {text_column})
new_dataframe = dataframe.drop(columns=columns_to_drop, axis=1).copy()

default_params = {
@@ -44,7 +49,7 @@ def apply_text_profiling(dataframe: pd.DataFrame,
GRAMMAR_CHECK_OPTION: False, # default: False as slow process but can Enabled
SPELLING_CHECK_OPTION: True, # default: True although slightly slow process but can Disabled
EASE_OF_READING_CHECK_OPTION: True,
PARALLELISATION_METHOD_OPTION: DEFAULT_PARALLEL_METHOD
PARALLELISATION_METHOD_OPTION: DEFAULT_PARALLEL_METHOD,
}

default_params.update(params)
@@ -55,21 +60,17 @@
(HIGH_LEVEL_OPTION, "High-level features", apply_high_level_features),
(GRAMMAR_CHECK_OPTION, "Grammar checks", apply_grammar_check),
(SPELLING_CHECK_OPTION, "Spelling checks", apply_spelling_check),
(EASE_OF_READING_CHECK_OPTION, "Ease of reading check", apply_ease_of_reading_check)
(EASE_OF_READING_CHECK_OPTION, "Ease of reading check", apply_ease_of_reading_check),
]

for index, item in enumerate(actions_mappings.copy()):
for item in actions_mappings.copy():
(param, _, _) = item
if not default_params[param]:
actions_mappings.remove(item)

apply_profiling_progress_bar = get_progress_bar(actions_mappings)
for _, (param, action_description, action_function) in \
enumerate(apply_profiling_progress_bar):
for param, action_description, action_function in apply_profiling_progress_bar:
apply_profiling_progress_bar.set_description(action_description)
action_function(
action_description, new_dataframe,
text_column, default_params[PARALLELISATION_METHOD_OPTION]
)
action_function(action_description, new_dataframe, text_column, default_params[PARALLELISATION_METHOD_OPTION])

return new_dataframe
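The signature change from `params: dict = {}` to `params: dict = None` fixes a classic Python pitfall: a mutable default is evaluated once at definition time and then shared by every call that omits the argument. A standalone illustration (function names hypothetical):

```python
def buggy(params: dict = {}):
    # The same dict object is reused on every no-argument call,
    # so state leaks between invocations.
    params["calls"] = params.get("calls", 0) + 1
    return params["calls"]

def fixed(params: dict = None):
    if params is None:
        params = {}  # a fresh dict per call, as core.py now does
    params["calls"] = params.get("calls", 0) + 1
    return params["calls"]
```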
28 changes: 14 additions & 14 deletions nlp_profiler/generate_features/__init__.py
@@ -2,28 +2,28 @@
import swifter # noqa

from nlp_profiler.constants import DEFAULT_PARALLEL_METHOD, SWIFTER_METHOD
from nlp_profiler.generate_features.parallelisation_methods \
import get_progress_bar, using_joblib_parallel, using_swifter
from nlp_profiler.generate_features.parallelisation_methods import (
get_progress_bar,
using_joblib_parallel,
using_swifter,
)


def generate_features(main_header: str,
high_level_features_steps: list,
new_dataframe: pd.DataFrame,
parallelisation_method: str = DEFAULT_PARALLEL_METHOD):
def generate_features(
main_header: str,
high_level_features_steps: list,
new_dataframe: pd.DataFrame,
parallelisation_method: str = DEFAULT_PARALLEL_METHOD,
):
generate_feature_progress_bar = get_progress_bar(high_level_features_steps)

# Using swifter or Using joblib Parallel and delay method:
parallelisation_method_function = using_joblib_parallel
if parallelisation_method == SWIFTER_METHOD:
parallelisation_method_function = using_swifter

for _, (new_column, source_column, transformation_function) in \
enumerate(generate_feature_progress_bar):
for new_column, source_column, transformation_function in generate_feature_progress_bar:
source_field = new_dataframe[source_column]
generate_feature_progress_bar.set_description(
f'{main_header}: {source_column} => {new_column}'
)
generate_feature_progress_bar.set_description(f"{main_header}: {source_column} => {new_column}")

new_dataframe[new_column] = parallelisation_method_function(
source_field, transformation_function, new_column
)
new_dataframe[new_column] = parallelisation_method_function(source_field, transformation_function, new_column)
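Several hunks above replace `for _, (a, b, c) in enumerate(...)` with direct tuple unpacking: since the index was never used, iterating the sequence itself is equivalent and clearer. A tiny illustration with made-up step data:

```python
# Hypothetical feature steps in the (new_column, source_column, fn) shape
# used by generate_features above.
steps = [
    ("sentences_count", "text", str.split),
    ("characters_count", "text", len),
]

# Before: for _, (new_col, src_col, fn) in enumerate(steps): ...
# After: unpack each tuple directly, no throwaway index.
new_columns = [new_col for new_col, _src_col, _fn in steps]
```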
25 changes: 12 additions & 13 deletions nlp_profiler/generate_features/parallelisation_methods/__init__.py
@@ -10,7 +10,7 @@


def is_running_from_ipython():
return sys.argv[-1].endswith('json')
return sys.argv[-1].endswith("json")


PROGRESS_BAR_WIDTH = 900 if is_running_from_ipython() else None
@@ -27,26 +27,25 @@ def run_task(task_function, value: str): # pragma: no cover


def using_joblib_parallel(
source_field, apply_function, new_column: str,
source_field,
apply_function,
new_column: str,
) -> pd.DataFrame:
source_values_to_transform = get_progress_bar(source_field.values)
source_values_to_transform.set_description(new_column)

result = Parallel(n_jobs=-1)(
delayed(run_task)(
apply_function, each_value
) for _, each_value in enumerate(source_values_to_transform)
delayed(run_task)(apply_function, each_value)
for each_value in source_values_to_transform
)
source_values_to_transform.update()
return result


def using_swifter(
source_field, apply_function, new_column: str = None
) -> pd.DataFrame:
return source_field \
.swifter \
.set_dask_scheduler(scheduler="processes") \
.allow_dask_on_strings(enable=True) \
.progress_bar(enable=True, desc=new_column) \
def using_swifter(source_field, apply_function, new_column: str = None) -> pd.DataFrame:
return (
source_field.swifter.set_dask_scheduler(scheduler="processes")
.allow_dask_on_strings(enable=True)
.progress_bar(enable=True, desc=new_column)
.apply(apply_function, axis=1)
)
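The reshaped `using_joblib_parallel` keeps the same joblib pattern: `delayed` wraps each call lazily and `Parallel` dispatches the wrapped calls to workers. A minimal self-contained sketch of that pattern; `transform` is a stand-in for the repo's real per-value function, and `n_jobs=2` replaces the repo's `n_jobs=-1` (all cores) just to keep the sketch light:

```python
from joblib import Parallel, delayed

def transform(value: str) -> int:
    # Stand-in for the real per-value transformation function.
    return len(value)

values = ["alpha", "beta", "gamma"]
# delayed(transform)(v) builds a lazy call; Parallel runs the batch
# across worker processes and returns results in input order.
results = Parallel(n_jobs=2)(delayed(transform)(v) for v in values)
# results == [5, 4, 5]
```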
