feat(redash): add parallelism support for ingestion #5061

Merged

anshbansal
Collaborator

  • Add parallelism support for ingestion
  • Fix the removal of all datasets as inputs when any parsing issue occurred
  • Clean up logging
  • Add timing for various parts
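
As a rough illustration of the first and last items, the sketch below fetches pages on a thread pool and times the whole run. It is a minimal sketch, not the code from this PR: `fetch_page` and `fetch_all_pages` are hypothetical names, and the sleep stands in for a real API call. Only `ThreadPool` and `imap_unordered` appear in the diff itself.

```python
import time
from multiprocessing.pool import ThreadPool
from typing import Dict, List


def fetch_page(page: int) -> List[str]:
    # Hypothetical stand-in for one paginated API call.
    time.sleep(0.01)
    return [f"chart-{page}-{i}" for i in range(3)]


def fetch_all_pages(max_page: int, parallelism: int) -> Dict[str, object]:
    started = time.perf_counter()
    results: List[str] = []
    # ThreadPool runs the calls on `parallelism` worker threads;
    # imap_unordered yields each page's result as soon as it is ready,
    # so the arrival order is not guaranteed.
    with ThreadPool(parallelism) as pool:
        for page_items in pool.imap_unordered(fetch_page, range(1, max_page + 1)):
            results.extend(page_items)
    elapsed = time.perf_counter() - started
    # Sort so callers see a stable order regardless of completion order.
    return {"items": sorted(results), "seconds": elapsed}
```

Because results arrive in completion order, anything that depends on page order has to sort or key the results afterwards, as the sketch does.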

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@@ -348,6 +355,10 @@ def error(self, log: logging.Logger, key: str, reason: str) -> None:
        self.report.report_failure(key, reason)
        log.error(f"{key} => {reason}")

    def warn(self, log: logging.Logger, key: str, reason: str) -> None:
        self.report.report_warning(key, reason)
        log.warning(f"{key} => {reason}")
Contributor

Can you be more descriptive here, or is this how we do it in other places as well?

Collaborator Author

Yes, this is how we do it in other places too.

f"sql-parsing-query-{query_id}-datasource-{data_source_id}",
f"exception {e} in parsing {query}",
)
except Exception:
Contributor

Is there a reason we don't print out the actual error?

Collaborator Author

This error was moved up. Here the only problem we can hit is a wrong table name.

Collaborator Author

The table name is usually wrong because SQL parsing incorrectly picks up intermediate temp tables as actual tables.
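
The idea in this thread, that one bad table name should produce a warning without discarding the other inputs, can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `build_urn`, `collect_input_urns`, and the `urn:example:` format are all hypothetical.

```python
from typing import List, Tuple


def build_urn(table_name: str) -> str:
    # Hypothetical URN builder: rejects names without a schema part,
    # standing in for whatever validation the real builder performs.
    if "." not in table_name:
        raise ValueError(f"unqualified table name: {table_name}")
    return f"urn:example:{table_name}"


def collect_input_urns(table_names: List[str]) -> Tuple[List[str], List[str]]:
    urns: List[str] = []
    warnings: List[str] = []
    for name in table_names:
        try:
            urns.append(build_urn(name))
        except ValueError as e:
            # Record the problem and continue with the remaining tables,
            # so one bad name does not drop every input.
            warnings.append(f"{name} => {e}")
    return urns, warnings
```

Catching the failure per table, rather than around the whole loop, is what keeps the valid inputs.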

self.report.max_page_queries = max_page
chart_exec_pool = ThreadPool(self.config.parallelism)
for response in chart_exec_pool.imap_unordered(
self._process_query_response, range(1, max_page + 1)
Contributor

Why don't you generate a range here from 1 to min(max_page + 1, self.api_page_limit)?

Collaborator Author

self.api_page_limit is set to math.inf, which is a float, so it cannot be turned into an int. The min function does not support mixed operand types.
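
For reference, here is a small sketch of how plain Python treats these values at runtime. `last_page` is a hypothetical helper, not code from this PR. At runtime `min()` does accept an int and a float together, while `int(math.inf)` raises `OverflowError`. A static type checker may still object to passing a possibly-float value to `range`, which could be the concern here.

```python
import math


def last_page(max_page: int, api_page_limit: float = math.inf) -> int:
    # min() compares an int and a float without error at runtime. When the
    # int is smaller it is returned unchanged, so int() is only ever applied
    # to a finite value here.
    capped = min(max_page, api_page_limit)
    return int(capped)
```

With the default limit of `math.inf`, `last_page(max_page)` simply returns `max_page`, so `range(1, last_page(max_page) + 1)` covers every page.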

github-actions bot commented Jun 1, 2022

Unit Test Results (build & test)

  77 files  ±0    77 suites  ±0   2m 28s ⏱️ -8s
333 tests ±0  333 ✔️ ±0  0 💤 ±0  0 ❌ ±0

Results for commit 21e4ec6. ± Comparison against base commit 259a63a.

Contributor

@treff7es treff7es left a comment

lgtm

github-actions bot commented Jun 1, 2022

Unit Test Results (metadata ingestion)

       5 files         5 suites   1h 27m 44s ⏱️
   551 tests    547 ✔️     3 💤 1 ❌
2 470 runs  2 364 ✔️ 105 💤 1 ❌

For more details on these failures, see this check.

Results for commit 21e4ec6.

anshbansal merged commit f81ead3 into datahub-project:master Jun 1, 2022
anshbansal deleted the redash-chart-parallelism branch June 1, 2022 14:06
maggiehays pushed a commit to maggiehays/datahub that referenced this pull request Aug 1, 2022