
Add workaround to DBT ingestion to avoid recursion errors #10934

Closed

Conversation


@MatMoore MatMoore commented Jul 17, 2024

Our DBT setup contains some very long queries that break sqlglot's generator due to Python's recursion limit. I am looking into whether this problem can be solved upstream, but in the meantime this PR contains a couple of workarounds.

The problem

When Datahub's DBT source calls the code now in `_parse_cll`, we get a recursion error, which is handled and printed to the report:

INFO {datahub.ingestion.source.dbt.dbt_common:1161} - Failed to parse compiled code for snapshot.xxxx: maximum recursion depth exceeded

(Note: this message has since changed slightly in this refactor, but the logic seems unchanged)

Also, if `include_compiled_code` is enabled, we encounter a recursion error that is not handled, which breaks the ingestion.

So we have one code path that throws an exception and handles it, and another that throws an unhandled exception.

This was observed in v0.13.2.
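To illustrate the failure mode, here is a toy stand-in for sqlglot's generator (stdlib only, not sqlglot itself): a recursive walk over an AST can only go as deep as Python's recursion limit allows, so a sufficiently long query blows the stack.

```python
import sys

# Toy stand-in for sqlglot's recursive SQL generator: it walks an AST by
# recursing into each child node, so the maximum query depth it can handle
# is bounded by Python's recursion limit.
def generate(node):
    if node is None:
        return ""
    return generate(node.get("child")) + node["text"]

# Build a toy AST deeper than the recursion limit, mimicking a very long query.
tree = None
for _ in range(sys.getrecursionlimit() + 100):
    tree = {"text": "x", "child": tree}

try:
    generate(tree)
    outcome = "ok"
except RecursionError:
    outcome = "maximum recursion depth exceeded"

print(outcome)
```

This is the same error string that surfaces in the ingestion report above; the real generator just hits the limit inside sqlglot instead of a toy function.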

Workaround for the unhandled exception

We can bypass calling _parse_cll in cases where

  • include_column_lineage is off
  • the node schema has already been found in the graph or dbt catalog

In these cases the output, `inferred_schema_fields`, shouldn't be used, because as far as I can tell any schema information will be taken from either the DBT catalog or the existing graph.

This code can also be bypassed by switching off `infer_dbt_schemas` (true by default), but that would disable schema inference completely, whereas in our case we only want to avoid it in the few cases that happen to trigger the recursion error.

Since this is such a minor feature, I'm assuming it's not worth adding any guidance to metadata-ingestion/docs/sources/dbt/dbt.md.
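A minimal sketch of the bypass condition described above. The function name `should_parse_compiled_code` and the config dataclass are illustrative stand-ins based on this description, not DataHub's exact internals:

```python
from dataclasses import dataclass

@dataclass
class DBTConfig:
    # Defaults mirror the options discussed in the PR description.
    infer_dbt_schemas: bool = True
    include_column_lineage: bool = True

def should_parse_compiled_code(config: DBTConfig, schema_already_known: bool) -> bool:
    # Hypothetical guard: skip the sqlglot call entirely when its output
    # would be unused, i.e. column lineage is off and the node schema was
    # already found in the graph or the dbt catalog.
    if not config.infer_dbt_schemas:
        return False
    if not config.include_column_lineage and schema_already_known:
        return False
    return True
```

For example, `should_parse_compiled_code(DBTConfig(include_column_lineage=False), schema_already_known=True)` returns False, so the recursion-prone parse never runs, while all other configurations keep the existing behaviour.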

Workaround for handled exception leading to spurious logs

The second change adds an option, `reformat_compiled_code`, to skip running `compiled_code` through `try_format_query`. By default the behaviour is unchanged, but if disabled, we will use the formatting that is already present in the DBT output.
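Roughly how the new option gates the formatter. The names `DBTCommonConfig` and `try_format_query` match the ones in this PR, but the bodies below are simplified stand-ins, not DataHub's real implementations:

```python
from dataclasses import dataclass

@dataclass
class DBTCommonConfig:
    # Mirrors the PR's default: reformat unless explicitly disabled.
    reformat_compiled_code: bool = True

def try_format_query(sql: str) -> str:
    # Stand-in for DataHub's sqlglot-based formatter; the real one can raise
    # RecursionError on sufficiently complex queries.
    return " ".join(sql.split())

def view_definition(config: DBTCommonConfig, compiled_code: str) -> str:
    # When the flag is off, keep the formatting from the dbt manifest as-is.
    if config.reformat_compiled_code:
        return try_format_query(compiled_code)
    return compiled_code
```

With the flag disabled, the compiled SQL is emitted byte-for-byte as dbt produced it, so the recursive generator is never invoked for that node.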

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Summary by CodeRabbit

  • New Features

    • Introduced a new setting to control the reformatting of compiled code in DBT configurations.
  • Tests

    • Added new integration tests to ensure the correct functioning of the reformat compiled code setting.

If column lineage is disabled, we don't need to call _parse_cll to parse
`node.compiled_code` with sqlglot.

If this fails, we get spurious "Failed to generate any CLL lineage"
messages in the logs.

An exception is the case where `added_to_schema_resolver` is False, as
in this case the parsed code is used to infer the schema fields.
Calling `try_format_query` on `compiled_code` sometimes does not work, because sqlglot's generator is recursive, and sufficiently complex queries can exceed Python's recursion limit.

As a workaround, we can just skip the sqlglot formatting and take
the value as it is formatted in the DBT manifest.

Switching off this option is less obtrusive than switching off `ignore_compiled_code`.

coderabbitai bot commented Jul 17, 2024

Walkthrough

The recent update introduces a new boolean field reformat_compiled_code to the DBTCommonConfig in dbt_common.py. This field allows users to control whether compiled code should be reformatted before being included in emitted metadata. By default, this field is set to True. Correspondingly, test configurations were added to ensure the functionality works when this flag is set to False.

Changes

  • metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py — Added the `reformat_compiled_code` boolean field to `DBTCommonConfig` and updated the `_create_view_properties_aspect` method to respect this setting.
  • metadata-ingestion/tests/integration/dbt/test_dbt.py — Introduced a new test configuration, dbt-reformat_compiled_code_disabled, to test the `reformat_compiled_code` flag set to False.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DBTCommonConfig
    participant DBTCommon

    User->>DBTCommonConfig: Set reformat_compiled_code (True/False)
    DBTCommonConfig->>DBTCommon: Pass reformat_compiled_code flag
    DBTCommon->>DBTCommon: Check reformat_compiled_code
    alt reformat_compiled_code is True
        DBTCommon->>DBTCommon: Reformat compiled code
    else reformat_compiled_code is False
        DBTCommon->>DBTCommon: Use compiled code as is
    end
    DBTCommon->>User: Emit metadata with compiled code

Poem

In the world of data, where code does flow,
A new flag appears, to shape the glow.
With a twist and a turn, compiled code's fate,
Reformatted grace or raw, straight.
Thanks to this change, so clear and bright,
Metadata gleams in the user's sight. 🌟



@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Jul 17, 2024
@MatMoore MatMoore marked this pull request as ready for review July 17, 2024 13:18

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR (between 82e5a04 and 4a1f4db).

Files selected for processing (2)
  • metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py (3 hunks)
  • metadata-ingestion/tests/integration/dbt/test_dbt.py (1 hunks)
Additional comments not posted (3)
metadata-ingestion/tests/integration/dbt/test_dbt.py (1)

219-224: New Test Configuration Added

The new DbtTestConfig instance for testing with reformat_compiled_code set to False is correctly configured. This ensures that the new functionality is covered by integration tests, which is crucial for maintaining robustness.

metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py (2)

368-371: New Configuration Field Added

The addition of the reformat_compiled_code field in DBTCommonConfig with a default value of True is well-documented and follows the coding standards. This field allows control over whether compiled SQL code should be reformatted, addressing the recursion error issue by potentially bypassing the reformatting process.


1538-1543: Conditional Formatting Logic Implemented

The conditional logic in _create_view_properties_aspect that checks the reformat_compiled_code flag before deciding to format the compiled_code is correctly implemented. This change supports the new configuration's intended functionality, allowing users to bypass potentially problematic code formatting if needed.

@hsheth2

hsheth2 commented Jul 17, 2024

@MatMoore we recently had a related PR #10678 - is this still an issue despite that?

Either way, I do think having a format_compiled_code flag makes sense.

@MatMoore

> @MatMoore we recently had a related PR #10678 - is this still an issue despite that?
>
> Either way, I do think having a format_compiled_code flag makes sense.

Ah I didn't see that one. I haven't tested it directly but it looks like that will work and maximum recursion depth exceeded will only appear in the debug logs 👍🏻

We probably don't need the reformat_compiled_code flag in that case. Would you prefer I drop that commit, or leave it in? I expect there might be a performance benefit to skipping the parsing & reformatting but it's probably not a big saving unless there are a very large number of nodes to process.

We still need to handle or avoid the unhandled exception in the _parse_cll code though (first commit) - does this change look reasonable?

@hsheth2

hsheth2 commented Jul 18, 2024

@MatMoore do you have a stack trace from when _parse_cll throws an exception? The contract was supposed to be that _parse_cll never throws an exception, but instead returns it using SqlParsingResult.make_from_error(e)

@MatMoore

MatMoore commented Jul 23, 2024

> @MatMoore do you have a stack trace from when _parse_cll throws an exception? The contract was supposed to be that _parse_cll never throws an exception, but instead returns it using SqlParsingResult.make_from_error(e)

Ah, so it does. Now that I look at it again, I'm not sure we've actually run into it since upgrading to 0.13.3 and picking up #10553, so it's possible it was fixed as part of that refactor. Let me do some more testing and get back to you.

The stack trace from CLI version 0.13.2.4 is here. It seems the code was supposed to handle the error, but the handler threw another exception.

@hsheth2 hsheth2 closed this Jul 23, 2024