
Commit

Merge pull request #73 from neomatrix369/reformating-code-and-minor-fixes

Refactor: reformatting python code across all the source files
neomatrix369 committed Mar 13, 2023
2 parents a3538c6 + def1ee8 commit f9cb2e6
Showing 59 changed files with 1,103 additions and 1,174 deletions.
2 changes: 1 addition & 1 deletion .github/PULL_REQUEST_TEMPLATE.md
@@ -14,7 +14,7 @@ Please check the options that you have completed and strike-out the options that
- [ ] you have read
- [ ] the [Contributing doc](https://github.com/neomatrix369/nlp_profiler/blob/master/CONTRIBUTING.md)
- [ ] the [Developer Guide](https://github.com/neomatrix369/nlp_profiler/blob/master/developer-guide.md)
- [ ] the pull request passes the tests (`./test-coverage "tests slow-tests"`) - this will also be visible via the Code coverage report and CI/CD task on the Pull Request
- [ ] the pull request passes the tests (`./test-coverage.sh "tests slow-tests"`) - this will also be visible via the Code coverage report and CI/CD task on the Pull Request
- [ ] you have performed some kind of smoke test by running your changes in an isolated environment i.e. Docker container, Google Colab, Kaggle, etc...
- [ ] the notebooks are updated (see `notebooks` folder, read the [Notebooks](./notebooks/README.md) docs)
- [ ] [CHANGELOG.md](https://github.com/neomatrix369/nlp_profiler/blob/master/CHANGELOG.md) has been updated (please follow the existing format)
10 changes: 8 additions & 2 deletions .github/workflows/end-to-end-flow.yml
@@ -57,15 +57,17 @@ jobs:
- name: install-line-profiler-on-windows-python-3.7
run: |
### https://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console
### https://github.com/conda/conda/issues/7445#issuecomment-774579800
set PYTHONIOENCODING="utf-8"
set PYTHONLEGACYWINDOWSSTDIO="utf-8"
pip install win-unicode-console
python -m pip install line-profiler@https://github.com/neomatrix369/nlp_profiler/releases/download/v0.0.2-dev/line_profiler-3.2.6-cp37-cp37m-win_amd64.whl
if: matrix.python-version == '3.7' && matrix.os == 'windows-latest'

- name: install-line-profiler-on-windows-python-3.8
run: |
### https://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console
set PYTHONIOENCODING="utf-8"
### https://github.com/conda/conda/issues/7445#issuecomment-774579800
pip install win-unicode-console
python -m pip install line-profiler@https://github.com/neomatrix369/nlp_profiler/releases/download/v0.0.2-dev/line_profiler-3.2.6-cp38-cp38-win_amd64.whl
if: matrix.python-version == '3.8' && matrix.os == 'windows-latest'
@@ -76,8 +78,12 @@ jobs:
# Runs a set of commands using the runners shell
- name: run-test-coverage-shell-script
shell: bash
env:
PYTHONUTF8: 1
PYTHONIOENCODING: utf-8
PYTHONLEGACYWINDOWSSTDIO: utf-8
run: |
./test-coverage.sh "tests slow-tests"
./test-coverage.sh "tests slow-tests"
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v1
if: matrix.python-version == '3.8' && matrix.os == 'ubuntu-18.04'
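The `env:` block added to the test step exports `PYTHONUTF8` and `PYTHONIOENCODING` so that Windows runners encode stdio as UTF-8 instead of the console code page, which otherwise breaks non-ASCII output such as emoji counts. A hedged, standalone sketch of a guard a test runner could use; the `is_utf8` helper is a hypothetical name for illustration, not part of the repo:

```python
import os
import sys

def is_utf8(encoding: str) -> bool:
    """True when the reported stdio encoding is some spelling of UTF-8."""
    # Normalise "UTF-8", "utf-8", "utf8" etc. to a single comparable form.
    return (encoding or "").lower().replace("-", "") == "utf8"

if __name__ == "__main__":
    # With the workflow's env block applied, both of these should line up.
    print(f"PYTHONUTF8={os.environ.get('PYTHONUTF8', '<unset>')}")
    print(f"stdout encoding ok: {is_utf8(sys.stdout.encoding)}")
```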
20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -205,4 +205,24 @@ Replaced language tool with Gingerit for faster calculations

---

### GitHub branch `revert-76-sourcery/revert-71-spelling_check` Granular features: reverted change made to spell checks

Implemented functionality via PR [#75](https://github.com/neomatrix369/nlp_profiler/pull/75) - details described in the body of the PR.

Reverting spell check functionality as it is not tested and tests change/break with new implementation.

[2cddf51](https://github.com/neomatrix369/nlp_profiler/commit/2cddf51a605b434d12604d2dba9457c415808bfe) [@neomatrix369](https://github.com/neomatrix369) _Mon Mar 13 02:56:40 2023 +0000_

---

### GitHub branch `reformating-code-and-minor-fixes` Reformatting code, refactoring as per Sourcery, minor fixes and test fixes

Implemented functionality via PR [#73](https://github.com/neomatrix369/nlp_profiler/pull/73) - details described in the body of the PR.

Reformatting code, refactoring as per Sourcery, minor fixes and test fixes. Bringing back the build system in order. Fixes old regressed tests.

[7caeb47](https://github.com/neomatrix369/nlp_profiler/commit/7caeb47e3795a22c1884731320a02d4b0c39ca0c) [@neomatrix369](https://github.com/neomatrix369) _Mon Mar 13 11:23:49 2023 +0000_

---

Return to [README.md](README.md)
2 changes: 1 addition & 1 deletion developer-guide.md
@@ -66,7 +66,7 @@ pip install --prefix .
Run all the tests with coverage information using the below command after all packages have been successfully installed:

```bash
./test-coverage tests slow-tests
./test-coverage.sh tests slow-tests
```

On the tests passing (or partially passing), these folders will be created:
2 changes: 1 addition & 1 deletion nlp_profiler/__init__.py
@@ -1 +1 @@
__version__ = "0.0.3"
__version__ = "0.0.3"
90 changes: 45 additions & 45 deletions nlp_profiler/constants.py
@@ -1,64 +1,64 @@
DEFAULT_PARALLEL_METHOD = 'default'
DEFAULT_PARALLEL_METHOD = "default"

SWIFTER_METHOD = 'using_swifter'
SWIFTER_METHOD = "using_swifter"

GRANULAR_OPTION = 'granular'
HIGH_LEVEL_OPTION = 'high_level'
GRAMMAR_CHECK_OPTION = 'grammar_check'
SPELLING_CHECK_OPTION = 'spelling_check'
EASE_OF_READING_CHECK_OPTION = 'ease_of_reading_check'
PARALLELISATION_METHOD_OPTION = 'parallelisation_method'
GRANULAR_OPTION = "granular"
HIGH_LEVEL_OPTION = "high_level"
GRAMMAR_CHECK_OPTION = "grammar_check"
SPELLING_CHECK_OPTION = "spelling_check"
EASE_OF_READING_CHECK_OPTION = "ease_of_reading_check"
PARALLELISATION_METHOD_OPTION = "parallelisation_method"
NOT_APPLICABLE = "N/A"

NaN = float('nan')
NaN = float("nan")

# --- Columns generated
# High-level
## Grammar check
GRAMMAR_CHECK_COL = 'grammar_check'
GRAMMAR_CHECK_SCORE_COL = 'grammar_check_score'
GRAMMAR_CHECK_COL = "grammar_check"
GRAMMAR_CHECK_SCORE_COL = "grammar_check_score"

## Spelling check
SPELLING_QUALITY_SCORE_COL = 'spelling_quality_score'
SPELLING_QUALITY_COL = 'spelling_quality'
SPELLING_QUALITY_SUMMARISED_COL = 'spelling_quality_summarised'
SPELLING_QUALITY_SCORE_COL = "spelling_quality_score"
SPELLING_QUALITY_COL = "spelling_quality"
SPELLING_QUALITY_SUMMARISED_COL = "spelling_quality_summarised"

## Sentiment analysis
SENTIMENT_POLARITY_SCORE_COL = 'sentiment_polarity_score'
SENTIMENT_POLARITY_COL = 'sentiment_polarity'
SENTIMENT_POLARITY_SUMMARISED_COL = 'sentiment_polarity_summarised'
SENTIMENT_SUBJECTIVITY_SCORE_COL = 'sentiment_subjectivity_score'
SENTIMENT_SUBJECTIVITY_COL = 'sentiment_subjectivity'
SENTIMENT_SUBJECTIVITY_SUMMARISED_COL = 'sentiment_subjectivity_summarised'
SENTIMENT_POLARITY_SCORE_COL = "sentiment_polarity_score"
SENTIMENT_POLARITY_COL = "sentiment_polarity"
SENTIMENT_POLARITY_SUMMARISED_COL = "sentiment_polarity_summarised"
SENTIMENT_SUBJECTIVITY_SCORE_COL = "sentiment_subjectivity_score"
SENTIMENT_SUBJECTIVITY_COL = "sentiment_subjectivity"
SENTIMENT_SUBJECTIVITY_SUMMARISED_COL = "sentiment_subjectivity_summarised"

## Spelling check
EASE_OF_READING_SCORE_COL = 'ease_of_reading_score'
EASE_OF_READING_COL = 'ease_of_reading_quality'
EASE_OF_READING_SUMMARISED_COL = 'ease_of_reading_summarised'
EASE_OF_READING_SCORE_COL = "ease_of_reading_score"
EASE_OF_READING_COL = "ease_of_reading_quality"
EASE_OF_READING_SUMMARISED_COL = "ease_of_reading_summarised"

# ---
# Granular
DATES_COUNT_COL = 'dates_count'
STOP_WORDS_COUNT_COL = 'stop_words_count'
PUNCTUATIONS_COUNT_COL = 'punctuations_count'
DATES_COUNT_COL = "dates_count"
STOP_WORDS_COUNT_COL = "stop_words_count"
PUNCTUATIONS_COUNT_COL = "punctuations_count"
REPEATED_PUNCTUATIONS_COUNT_COL = "repeated_punctuations_count"
NON_ALPHA_NUMERIC_COUNT_COL = 'non_alpha_numeric_count'
ALPHA_NUMERIC_COUNT_COL = 'alpha_numeric_count'
REPEATED_LETTERS_COUNT_COL = 'repeated_letters_count'
NON_ALPHA_NUMERIC_COUNT_COL = "non_alpha_numeric_count"
ALPHA_NUMERIC_COUNT_COL = "alpha_numeric_count"
REPEATED_LETTERS_COUNT_COL = "repeated_letters_count"
REPEATED_DIGITS_COUNT_COL = "repeated_digits_count"
WHOLE_NUMBERS_COUNT_COL = 'whole_numbers_count'
EMOJI_COUNT_COL = 'emoji_count'
DUPLICATES_COUNT_COL = 'duplicates_count'
COUNT_WORDS_COL = 'count_words'
SPACES_COUNT_COL = 'spaces_count'
CHARS_EXCL_SPACES_COUNT_COL = 'chars_excl_spaces_count'
REPEATED_SPACES_COUNT_COL = 'repeated_spaces_count'
WHITESPACES_COUNT_COL = 'whitespaces_count'
CHARS_EXCL_WHITESPACES_COUNT_COL = 'chars_excl_whitespaces_count'
REPEATED_WHITESPACES_COUNT_COL = 'repeated_whitespaces_count'
CHARACTERS_COUNT_COL = 'characters_count'
ENGLISH_CHARACTERS_COUNT_COL = 'english_characters_count'
NON_ENGLISH_CHARACTERS_COUNT_COL = 'non_english_characters_count'
SYLLABLES_COUNT_COL = 'syllables_count'
SENTENCES_COUNT_COL = 'sentences_count'
NOUN_PHRASE_COUNT_COL = 'noun_phrase_count'
WHOLE_NUMBERS_COUNT_COL = "whole_numbers_count"
EMOJI_COUNT_COL = "emoji_count"
DUPLICATES_COUNT_COL = "duplicates_count"
COUNT_WORDS_COL = "count_words"
SPACES_COUNT_COL = "spaces_count"
CHARS_EXCL_SPACES_COUNT_COL = "chars_excl_spaces_count"
REPEATED_SPACES_COUNT_COL = "repeated_spaces_count"
WHITESPACES_COUNT_COL = "whitespaces_count"
CHARS_EXCL_WHITESPACES_COUNT_COL = "chars_excl_whitespaces_count"
REPEATED_WHITESPACES_COUNT_COL = "repeated_whitespaces_count"
CHARACTERS_COUNT_COL = "characters_count"
ENGLISH_CHARACTERS_COUNT_COL = "english_characters_count"
NON_ENGLISH_CHARACTERS_COUNT_COL = "non_english_characters_count"
SYLLABLES_COUNT_COL = "syllables_count"
SENTENCES_COUNT_COL = "sentences_count"
NOUN_PHRASE_COUNT_COL = "noun_phrase_count"
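The constants.py hunk only switches quote style, but these string constants are the option keys that `apply_text_profiling` (core.py, below) overlays on its defaults. A minimal sketch of that merge pattern, with key names copied from this diff; `merge_params` itself is a hypothetical helper, not a function in the repo:

```python
# Option keys, copied from the constants.py hunk above.
GRANULAR_OPTION = "granular"
SPELLING_CHECK_OPTION = "spelling_check"
PARALLELISATION_METHOD_OPTION = "parallelisation_method"
DEFAULT_PARALLEL_METHOD = "default"

def merge_params(params: dict = None) -> dict:
    """Overlay caller-supplied options on the library defaults."""
    defaults = {
        GRANULAR_OPTION: True,
        SPELLING_CHECK_OPTION: True,
        PARALLELISATION_METHOD_OPTION: DEFAULT_PARALLEL_METHOD,
    }
    defaults.update(params or {})  # caller's choices win
    return defaults
```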
45 changes: 23 additions & 22 deletions nlp_profiler/core.py
@@ -18,24 +18,29 @@

import pandas as pd

from nlp_profiler.constants import \
PARALLELISATION_METHOD_OPTION, DEFAULT_PARALLEL_METHOD, GRANULAR_OPTION, HIGH_LEVEL_OPTION, \
GRAMMAR_CHECK_OPTION, SPELLING_CHECK_OPTION, EASE_OF_READING_CHECK_OPTION
from nlp_profiler.constants import (
PARALLELISATION_METHOD_OPTION,
DEFAULT_PARALLEL_METHOD,
GRANULAR_OPTION,
HIGH_LEVEL_OPTION,
GRAMMAR_CHECK_OPTION,
SPELLING_CHECK_OPTION,
EASE_OF_READING_CHECK_OPTION,
)
from nlp_profiler.generate_features import get_progress_bar
from nlp_profiler.granular_features import apply_granular_features
from nlp_profiler.high_level_features import apply_high_level_features
from nlp_profiler.high_level_features.grammar_quality_check \
import apply_grammar_check
from nlp_profiler.high_level_features.spelling_quality_check \
import apply_spelling_check
from nlp_profiler.high_level_features.ease_of_reading_check \
import apply_ease_of_reading_check
from nlp_profiler.high_level_features.grammar_quality_check import apply_grammar_check
from nlp_profiler.high_level_features.spelling_quality_check import apply_spelling_check
from nlp_profiler.high_level_features.ease_of_reading_check import apply_ease_of_reading_check


def apply_text_profiling(dataframe: pd.DataFrame,
text_column: str,
params: dict = {}) -> pd.DataFrame:
columns_to_drop = list(set(dataframe.columns) - set([text_column]))
def apply_text_profiling(dataframe: pd.DataFrame, text_column: str, params: dict = None) -> pd.DataFrame:
if params is None:
params = {}

# sourcery skip: dict-assign-update-to-union
columns_to_drop = list(set(dataframe.columns) - {text_column})
new_dataframe = dataframe.drop(columns=columns_to_drop, axis=1).copy()

default_params = {
@@ -44,7 +49,7 @@ def apply_text_profiling(dataframe: pd.DataFrame,
GRAMMAR_CHECK_OPTION: False, # default: False as slow process but can Enabled
SPELLING_CHECK_OPTION: True, # default: True although slightly slow process but can Disabled
EASE_OF_READING_CHECK_OPTION: True,
PARALLELISATION_METHOD_OPTION: DEFAULT_PARALLEL_METHOD
PARALLELISATION_METHOD_OPTION: DEFAULT_PARALLEL_METHOD,
}

default_params.update(params)
@@ -55,21 +60,17 @@
(HIGH_LEVEL_OPTION, "High-level features", apply_high_level_features),
(GRAMMAR_CHECK_OPTION, "Grammar checks", apply_grammar_check),
(SPELLING_CHECK_OPTION, "Spelling checks", apply_spelling_check),
(EASE_OF_READING_CHECK_OPTION, "Ease of reading check", apply_ease_of_reading_check)
(EASE_OF_READING_CHECK_OPTION, "Ease of reading check", apply_ease_of_reading_check),
]

for index, item in enumerate(actions_mappings.copy()):
for item in actions_mappings.copy():
(param, _, _) = item
if not default_params[param]:
actions_mappings.remove(item)

apply_profiling_progress_bar = get_progress_bar(actions_mappings)
for _, (param, action_description, action_function) in \
enumerate(apply_profiling_progress_bar):
for param, action_description, action_function in apply_profiling_progress_bar:
apply_profiling_progress_bar.set_description(action_description)
action_function(
action_description, new_dataframe,
text_column, default_params[PARALLELISATION_METHOD_OPTION]
)
action_function(action_description, new_dataframe, text_column, default_params[PARALLELISATION_METHOD_OPTION])

return new_dataframe
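The signature change from `params: dict = {}` to `params: dict = None` fixes a classic Python pitfall: a mutable default is evaluated once at definition time and then shared by every call that omits the argument. A standalone illustration (function names hypothetical):

```python
def buggy(params: dict = {}):
    # The same dict object is reused on every no-argument call,
    # so state leaks between invocations.
    params["calls"] = params.get("calls", 0) + 1
    return params["calls"]

def fixed(params: dict = None):
    if params is None:
        params = {}  # a fresh dict per call, as core.py now does
    params["calls"] = params.get("calls", 0) + 1
    return params["calls"]
```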
28 changes: 14 additions & 14 deletions nlp_profiler/generate_features/__init__.py
@@ -2,28 +2,28 @@
import swifter # noqa

from nlp_profiler.constants import DEFAULT_PARALLEL_METHOD, SWIFTER_METHOD
from nlp_profiler.generate_features.parallelisation_methods \
import get_progress_bar, using_joblib_parallel, using_swifter
from nlp_profiler.generate_features.parallelisation_methods import (
get_progress_bar,
using_joblib_parallel,
using_swifter,
)


def generate_features(main_header: str,
high_level_features_steps: list,
new_dataframe: pd.DataFrame,
parallelisation_method: str = DEFAULT_PARALLEL_METHOD):
def generate_features(
main_header: str,
high_level_features_steps: list,
new_dataframe: pd.DataFrame,
parallelisation_method: str = DEFAULT_PARALLEL_METHOD,
):
generate_feature_progress_bar = get_progress_bar(high_level_features_steps)

# Using swifter or Using joblib Parallel and delay method:
parallelisation_method_function = using_joblib_parallel
if parallelisation_method == SWIFTER_METHOD:
parallelisation_method_function = using_swifter

for _, (new_column, source_column, transformation_function) in \
enumerate(generate_feature_progress_bar):
for new_column, source_column, transformation_function in generate_feature_progress_bar:
source_field = new_dataframe[source_column]
generate_feature_progress_bar.set_description(
f'{main_header}: {source_column} => {new_column}'
)
generate_feature_progress_bar.set_description(f"{main_header}: {source_column} => {new_column}")

new_dataframe[new_column] = parallelisation_method_function(
source_field, transformation_function, new_column
)
new_dataframe[new_column] = parallelisation_method_function(source_field, transformation_function, new_column)
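Several hunks above replace `for _, (a, b, c) in enumerate(...)` with direct tuple unpacking: since the index was never used, iterating the sequence itself is equivalent and clearer. A tiny illustration with made-up step data:

```python
# Hypothetical feature steps in the (new_column, source_column, fn) shape
# used by generate_features above.
steps = [
    ("sentences_count", "text", str.split),
    ("characters_count", "text", len),
]

# Before: for _, (new_col, src_col, fn) in enumerate(steps): ...
# After: unpack each tuple directly, no throwaway index.
new_columns = [new_col for new_col, _src_col, _fn in steps]
```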
25 changes: 12 additions & 13 deletions nlp_profiler/generate_features/parallelisation_methods/__init__.py
@@ -10,7 +10,7 @@


def is_running_from_ipython():
return sys.argv[-1].endswith('json')
return sys.argv[-1].endswith("json")


PROGRESS_BAR_WIDTH = 900 if is_running_from_ipython() else None
@@ -27,26 +27,25 @@ def run_task(task_function, value: str): # pragma: no cover


def using_joblib_parallel(
source_field, apply_function, new_column: str,
source_field,
apply_function,
new_column: str,
) -> pd.DataFrame:
source_values_to_transform = get_progress_bar(source_field.values)
source_values_to_transform.set_description(new_column)

result = Parallel(n_jobs=-1)(
delayed(run_task)(
apply_function, each_value
) for _, each_value in enumerate(source_values_to_transform)
delayed(run_task)(apply_function, each_value)
for each_value in source_values_to_transform
)
source_values_to_transform.update()
return result


def using_swifter(
source_field, apply_function, new_column: str = None
) -> pd.DataFrame:
return source_field \
.swifter \
.set_dask_scheduler(scheduler="processes") \
.allow_dask_on_strings(enable=True) \
.progress_bar(enable=True, desc=new_column) \
def using_swifter(source_field, apply_function, new_column: str = None) -> pd.DataFrame:
return (
source_field.swifter.set_dask_scheduler(scheduler="processes")
.allow_dask_on_strings(enable=True)
.progress_bar(enable=True, desc=new_column)
.apply(apply_function, axis=1)
)
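The reshaped `using_joblib_parallel` keeps the same joblib pattern: `delayed` wraps each call lazily and `Parallel` dispatches the wrapped calls to workers. A minimal self-contained sketch of that pattern; `transform` is a stand-in for the repo's real per-value function, and `n_jobs=2` replaces the repo's `n_jobs=-1` (all cores) just to keep the sketch light:

```python
from joblib import Parallel, delayed

def transform(value: str) -> int:
    # Stand-in for the real per-value transformation function.
    return len(value)

values = ["alpha", "beta", "gamma"]
# delayed(transform)(v) builds a lazy call; Parallel runs the batch
# across worker processes and returns results in input order.
results = Parallel(n_jobs=2)(delayed(transform)(v) for v in values)
# results == [5, 4, 5]
```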
