
fix(ingest/pipeline): catch pipeline exceptions #10753

Merged
merged 4 commits into datahub-project:master on Jun 27, 2024

Conversation

pie1nthesky
Contributor

@pie1nthesky pie1nthesky commented Jun 20, 2024

Currently, unhandled exceptions are not reported properly.
When the pipeline fails with, e.g., a 'Connection timeout' exception during source processing,
it exits with final_status = 'unknown' and a cut-off log in the report.
That makes it impossible to troubleshoot ingestion issues from the ingestion report page.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Summary by CodeRabbit

  • New Features

    • Introduced detailed pipeline status enumeration to improve error handling and reporting.
  • Chores

    • Updated GitHub workflow with a timeout for job steps and renamed Docker containers for better logging.

@github-actions github-actions bot added the ingestion (PR or Issue related to the ingestion of metadata) and community-contribution (PR or Issue raised by member(s) of DataHub Community) labels on Jun 20, 2024
@@ -494,6 +492,10 @@ def run(self) -> None:
             self.final_status = "cancelled"
             logger.error("Caught error", exc_info=e)
             raise
+        except Exception as exc:
+            self.final_status = "pipeline_failure"
+            logger.error("pipeline run error: ", exc_info=exc)
Collaborator

I would've thought this log line would be redundant, since we log for any exception as part of entrypoints.py

Can you provide more details about this?

with cut off log in report.

Contributor Author

@pie1nthesky pie1nthesky Jun 23, 2024

Ingestion logs are put into the report by the self._notify_reporters_on_ingestion_completion() method in the finally clause.
So if we don't log the pipeline exception before that finally block runs, the exception that caused the pipeline failure is not present in the report.

It can be tested by mangling host_port in any recipe and checking the report on the ingestion page.
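
To make the timing issue concrete, here is a minimal, self-contained sketch of the control flow being described (MiniPipeline and the simulated ConnectionError are illustrative; _notify_reporters_on_ingestion_completion is the method named above, and the KeyboardInterrupt branch is only assumed to be what sets 'cancelled'):

import logging

logger = logging.getLogger(__name__)

class MiniPipeline:
    """Toy stand-in for the ingestion Pipeline, following the structure discussed above."""

    def __init__(self) -> None:
        self.final_status = "unknown"

    def _notify_reporters_on_ingestion_completion(self) -> None:
        # In the real pipeline this copies the ingestion log into the report.
        logger.info("report sent with final_status=%s", self.final_status)

    def run(self) -> None:
        try:
            raise ConnectionError("Connection timeout")  # simulate a source failure
        except KeyboardInterrupt as e:  # assumed trigger for the "cancelled" branch
            self.final_status = "cancelled"
            logger.error("Caught error", exc_info=e)
            raise
        except Exception as exc:
            # Log before the finally block runs, so the traceback is captured in the report.
            self.final_status = "pipeline_failure"
            logger.error("pipeline run error: ", exc_info=exc)
            raise
        finally:
            # Without the handler above, this snapshot would miss the failure entirely.
            self._notify_reporters_on_ingestion_completion()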

Contributor Author

@pie1nthesky pie1nthesky Jun 23, 2024

What do you think about handling the redundancy?

Should I wrap this exception in a PipelineRunError and handle it in entrypoints.py without calling logger.exception(f"Command failed: {exc}")?
I don't like importing from pipeline into entrypoints, because there is an intermediary module, ingest_cli.
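
A rough sketch of that first option, to make the trade-off concrete (PipelineRunError is the hypothetical wrapper discussed here, and the entrypoints.py handler below is assumed, not the actual DataHub code):

import logging
import sys

logger = logging.getLogger(__name__)

class PipelineRunError(Exception):
    """Hypothetical wrapper signalling that the pipeline already logged the failure."""

def run_pipeline() -> None:
    # Stand-in for Pipeline.run(); the ConnectionError simulates a source failure.
    try:
        raise ConnectionError("Connection timeout")
    except Exception as exc:
        logger.error("pipeline run error: ", exc_info=exc)
        raise PipelineRunError("pipeline run failed") from exc

def main() -> None:
    # Stand-in for the entrypoints.py exception handler.
    try:
        run_pipeline()
    except PipelineRunError:
        sys.exit(1)  # already logged inside the pipeline, so skip the duplicate traceback
    except Exception as exc:
        logger.exception(f"Command failed: {exc}")
        sys.exit(1)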

Another option is to cut the traceback:

except Exception as exc:
    self.final_status = "pipeline_failure"
    logger.error("pipeline run error: ", exc_info=exc.with_traceback(None))
    raise exc from None

Contributor

coderabbitai bot commented Jun 27, 2024

Walkthrough

Significant updates were made to the pipeline.py file to enhance status reporting and error handling using a new PipelineStatus enum with values like UNKNOWN, COMPLETED, PIPELINE_ERROR, and CANCELLED. Changes also include a timeout addition and container renaming in a GitHub workflow file for better process flow management and logging.

Changes

  • .../ingestion/run/pipeline.py: Introduced the PipelineStatus enum for more robust status handling and improved exception handling.
  • .github/workflows/docker-unified.yml: Added a 15-minute timeout to a job step and renamed a Docker container for clearer logging.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Pipeline
    participant Logger
    
    User->>Pipeline: Start pipeline
    Pipeline->>Pipeline: Set status to PipelineStatus.UNKNOWN
    Pipeline->>+Logger: Log initial status
    Pipeline->>Pipeline: Perform tasks
    
    alt Successful completion
        Pipeline->>Pipeline: Set status to PipelineStatus.COMPLETED
    else Exception raised
        Pipeline->>Pipeline: Set status to PipelineStatus.PIPELINE_ERROR
    end
    
    Pipeline->>Pipeline: Handle specific exceptions (set status to CANCELLED)
    Pipeline->>-Logger: Log final status
    Logger->>User: Provide status update

Poem

The pipeline now knows its fate,
With statuses to celebrate.
From unknown paths to tasks complete,
Errors faced and still it’s neat.
Timeout set, logs all precise,
Improvements made, oh, how nice!


@hsheth2
Collaborator

hsheth2 commented Jun 27, 2024

@pie1nthesky made some tweaks to this, hopefully it does what you're looking for

@hsheth2 hsheth2 added the merge-pending-ci (A PR that has passed review and should be merged once CI is green) label on Jun 27, 2024
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Outside diff range and nitpick comments (1)
metadata-ingestion/src/datahub/ingestion/run/pipeline.py (1)

Line range hint 541-543: Refactor: Simplify Boolean Expression

Instead of using a ternary operation for a boolean result, use the bool function directly for clarity.

-        has_errors: bool = (
-            True
-            if self.source.get_report().failures or self.sink.get_report().failures
-            else False
-        )
+        has_errors: bool = bool(self.source.get_report().failures or self.sink.get_report().failures)
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 190f09a and acc1e9c.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/run/pipeline.py (7 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/run/pipeline.py

98-99: Use a single if statement instead of nested if statements

(SIM102)


541-543: Use bool(...) instead of True if ... else False

Replace with `bool(...)`

(SIM210)

Comment on lines +120 to +124
class PipelineStatus(enum.Enum):
UNKNOWN = enum.auto()
COMPLETED = enum.auto()
PIPELINE_ERROR = enum.auto()
CANCELLED = enum.auto()
Contributor

Refactor: Simplify Enum Declaration

The use of enum.auto() is great for avoiding manual assignment of values, but it results in the enum values being integers, which might not be the expected behavior since the original status values were strings. Consider explicitly setting the enum values to their string counterparts.

class PipelineStatus(enum.Enum):
-    UNKNOWN = enum.auto()
-    COMPLETED = enum.auto()
-    PIPELINE_ERROR = enum.auto()
-    CANCELLED = enum.auto()
+    UNKNOWN = "unknown"
+    COMPLETED = "completed"
+    PIPELINE_ERROR = "pipeline_error"
+    CANCELLED = "cancelled"
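
A quick check of the difference, assuming the status value ends up interpolated into the report or logs:

import enum

class AutoStatus(enum.Enum):
    PIPELINE_ERROR = enum.auto()

class StrStatus(enum.Enum):
    PIPELINE_ERROR = "pipeline_error"

print(AutoStatus.PIPELINE_ERROR.value)  # 1 (an integer, unlike the old string statuses)
print(StrStatus.PIPELINE_ERROR.value)   # pipeline_error (matches the previous string values)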

Comment on lines +376 to +380
                     if self.final_status == PipelineStatus.CANCELLED
                     else "FAILURE"
                     if self.has_failures()
                     else "SUCCESS"
-                    if self.final_status == "completed"
+                    if self.final_status == PipelineStatus.COMPLETED
Contributor

Improve Readability: Simplify Conditional Logic

The nested ternary operators make this code hard to read. Consider refactoring this to use a more straightforward conditional structure.

-                    if self.final_status == PipelineStatus.CANCELLED
-                    else "FAILURE"
-                    if self.has_failures()
-                    else "SUCCESS"
-                    if self.final_status == PipelineStatus.COMPLETED
-                    else "UNKNOWN",
+                    status = "UNKNOWN"
+                    if self.final_status == PipelineStatus.CANCELLED:
+                        status = "CANCELLED"
+                    elif self.has_failures():
+                        status = "FAILURE"
+                    elif self.final_status == PipelineStatus.COMPLETED:
+                        status = "SUCCESS"
+                    report=self._get_structured_report(),
+                    ctx=self.ctx,

Committable suggestion was skipped due to low confidence.
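
For readability, the shape of the refactor being suggested, written out without the unrelated trailing lines (a sketch only; how the resulting status string feeds the surrounding reporter call is assumed):

if self.final_status == PipelineStatus.CANCELLED:
    status = "CANCELLED"
elif self.has_failures():
    status = "FAILURE"
elif self.final_status == PipelineStatus.COMPLETED:
    status = "SUCCESS"
else:
    status = "UNKNOWN"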

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between acc1e9c and 008e2ec.

Files selected for processing (1)
  • .github/workflows/docker-unified.yml (2 hunks)
Additional comments not posted (2)
.github/workflows/docker-unified.yml (2)

799-799: Approval for Timeout Addition

Adding a 15-minute timeout to the job step is a good practice. It helps in managing resources efficiently and prevents potential issues with jobs that may hang or stall.


1026-1026: Approval for Docker Container Renaming

Renaming the Docker container to datahub-datahub-upgrade-1 enhances clarity in identifying related logs and reports, which is beneficial for troubleshooting and monitoring.

This reverts commit 008e2ec.
@hsheth2 hsheth2 merged commit 5e9afc6 into datahub-project:master Jun 27, 2024
51 of 55 checks passed
@pie1nthesky
Contributor Author

@hsheth2
The HACK seems to be unnecessary, since logs are appended here and are readily available in the UI.

I would go with a split traceback like:

except Exception as exc:
    self.final_status = PipelineStatus.PIPELINE_ERROR
    logger.exception("Ingestion pipeline threw an uncaught exception")
    raise RuntimeError("Ingestion pipeline threw an uncaught exception") from None

No redundancy, no hacks, but the status and logs are available in the report.
Oh, well...

@hsheth2
Collaborator

hsheth2 commented Jun 27, 2024

@pie1nthesky it's a bit more tricky than that - I want the logs in the UI to closely match whatever is printed to the CLI. The reporting code you linked to only works for CLI ingestion, but UI-driven ingestion is based only on the stdout/stderr logs. Additionally, I wanted to change it so that pipeline.run() never throws. Finally, I want the error to show up somewhere in the structured report (to help with #10790).
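
Read that way, a rough sketch of the direction (logger use and the pipeline_errors field are assumptions based on this thread; PipelineStatus is the enum introduced by this PR; this is not the code that was merged):

import logging

from datahub.ingestion.run.pipeline import PipelineStatus

logger = logging.getLogger(__name__)

def run_without_raising(pipeline) -> None:
    """Sketch only: 'pipeline' stands in for the real Pipeline object."""
    try:
        pipeline.do_ingestion()  # hypothetical placeholder for the real source/sink work
        pipeline.final_status = PipelineStatus.COMPLETED
    except Exception as exc:
        pipeline.final_status = PipelineStatus.PIPELINE_ERROR
        # The same traceback goes to the CLI log and, via stdout/stderr capture,
        # to UI-driven ingestion, so both views stay consistent.
        logger.exception("Ingestion pipeline threw an uncaught exception")
        # Keep the error in the structured report instead of re-raising,
        # so run() itself never throws; pipeline_errors is an assumed field.
        pipeline.pipeline_errors.append(exc)
    finally:
        pipeline.notify_reporters_on_ingestion_completion()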

yoonhyejin pushed a commit that referenced this pull request Jul 16, 2024
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
aviv-julienjehannet pushed a commit to aviv-julienjehannet/datahub that referenced this pull request Jul 17, 2024
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>