
Add workaround to DBT ingestion to avoid recursion errors #10934

Closed

Conversation


@MatMoore MatMoore commented Jul 17, 2024

Our DBT setup contains some very long queries that break sqlglot's generator due to Python's recursion limit. I am looking into whether this problem can be solved upstream, but in the meantime this PR contains a couple of workarounds.

The problem

When Datahub's DBT source calls the code now in `_parse_cll`, we get a recursion error, which is handled and printed to the report:

INFO {datahub.ingestion.source.dbt.dbt_common:1161} - Failed to parse compiled code for snapshot.xxxx: maximum recursion depth exceeded

(Note: this message has since changed slightly in this refactor, but the logic seems unchanged)

Also, if `include_compiled_code` is enabled, we encounter a recursion error that is not handled, which breaks the ingestion.

So we have one code path that throws an exception and handles it, and another that throws an unhandled exception.

This was observed in v0.13.2.
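To illustrate the failure mode, here is a toy stand-in for sqlglot's generator (stdlib only, not sqlglot itself): a recursive walk over an AST can only go as deep as Python's recursion limit allows, so a sufficiently long query blows the stack.

```python
import sys

# Toy stand-in for sqlglot's recursive SQL generator: it walks an AST by
# recursing into each child node, so the maximum query depth it can handle
# is bounded by Python's recursion limit.
def generate(node):
    if node is None:
        return ""
    return generate(node.get("child")) + node["text"]

# Build a toy AST deeper than the recursion limit, mimicking a very long query.
tree = None
for _ in range(sys.getrecursionlimit() + 100):
    tree = {"text": "x", "child": tree}

try:
    generate(tree)
    outcome = "ok"
except RecursionError:
    outcome = "maximum recursion depth exceeded"

print(outcome)
```

This is the same error string that surfaces in the ingestion report above; the real generator just hits the limit inside sqlglot instead of a toy function.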

Workaround for the unhandled exception

We can bypass calling _parse_cll in cases where

  • include_column_lineage is off
  • the node schema has already been found in the graph or dbt catalog

In these cases the output, `inferred_schema_fields`, shouldn't be used, because as far as I can tell any schema information will be taken from either the DBT catalog or the existing graph.

This code can also be bypassed by switching off `infer_dbt_schemas` (true by default), but that would disable schema inference completely, whereas in our case we only want to avoid it in the few cases that happen to trigger the recursion error.

Since this is such a minor feature, I'm assuming it's not worth adding any guidance to metadata-ingestion/docs/sources/dbt/dbt.md.
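A minimal sketch of the bypass condition described above. The function name `should_parse_compiled_code` and the config dataclass are illustrative stand-ins based on this description, not DataHub's exact internals:

```python
from dataclasses import dataclass

@dataclass
class DBTConfig:
    # Defaults mirror the options discussed in the PR description.
    infer_dbt_schemas: bool = True
    include_column_lineage: bool = True

def should_parse_compiled_code(config: DBTConfig, schema_already_known: bool) -> bool:
    # Hypothetical guard: skip the sqlglot call entirely when its output
    # would be unused, i.e. column lineage is off and the node schema was
    # already found in the graph or the dbt catalog.
    if not config.infer_dbt_schemas:
        return False
    if not config.include_column_lineage and schema_already_known:
        return False
    return True
```

For example, `should_parse_compiled_code(DBTConfig(include_column_lineage=False), schema_already_known=True)` returns False, so the recursion-prone parse never runs, while all other configurations keep the existing behaviour.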

Workaround for handled exception leading to spurious logs

The second change adds an option, `reformat_compiled_code`, to skip running `compiled_code` through `try_format_query`. By default the behaviour is unchanged, but if disabled, we will use the formatting that is already present in the DBT output.
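Roughly how the new option gates the formatter. The names `DBTCommonConfig` and `try_format_query` match the ones in this PR, but the bodies below are simplified stand-ins, not DataHub's real implementations:

```python
from dataclasses import dataclass

@dataclass
class DBTCommonConfig:
    # Mirrors the PR's default: reformat unless explicitly disabled.
    reformat_compiled_code: bool = True

def try_format_query(sql: str) -> str:
    # Stand-in for DataHub's sqlglot-based formatter; the real one can raise
    # RecursionError on sufficiently complex queries.
    return " ".join(sql.split())

def view_definition(config: DBTCommonConfig, compiled_code: str) -> str:
    # When the flag is off, keep the formatting from the dbt manifest as-is.
    if config.reformat_compiled_code:
        return try_format_query(compiled_code)
    return compiled_code
```

With the flag disabled, the compiled SQL is emitted byte-for-byte as dbt produced it, so the recursive generator is never invoked for that node.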

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Summary by CodeRabbit

  • New Features

    • Introduced a new setting to control the reformatting of compiled code in DBT configurations.
  • Tests

    • Added new integration tests to ensure the correct functioning of the reformat compiled code setting.

If column lineage is disabled, we don't need to call _parse_cll to parse
`node.compiled_code` with sqlglot.

If this fails, we get spurious "Failed to generate any CLL lineage"
messages in the logs.

An exception is the case where `added_to_schema_resolver` is False, as
in this case the parsed code is used to infer the schema fields.
Calling `try_format_query` on `compiled_code` sometimes does not work, because sqlglot's generator is recursive, and sufficiently complex queries can exceed Python's recursion limit.

As a workaround, we can just skip the sqlglot formatting and take
the value as it is formatted in the DBT manifest.

Switching off this option is less obtrusive than switching off `ignore_compiled_code`.

coderabbitai bot commented Jul 17, 2024

Walkthrough

The recent update introduces a new boolean field reformat_compiled_code to the DBTCommonConfig in dbt_common.py. This field allows users to control whether compiled code should be reformatted before being included in emitted metadata. By default, this field is set to True. Correspondingly, test configurations were added to ensure the functionality works when this flag is set to False.

Changes

  • metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py — Added the `reformat_compiled_code` boolean field to `DBTCommonConfig` and updated the `_create_view_properties_aspect` method to respect this setting.
  • metadata-ingestion/tests/integration/dbt/test_dbt.py — Introduced a new test configuration, dbt-reformat_compiled_code_disabled, to test the `reformat_compiled_code` flag set to False.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DBTCommonConfig
    participant DBTCommon

    User->>DBTCommonConfig: Set reformat_compiled_code (True/False)
    DBTCommonConfig->>DBTCommon: Pass reformat_compiled_code flag
    DBTCommon->>DBTCommon: Check reformat_compiled_code
    alt reformat_compiled_code is True
        DBTCommon->>DBTCommon: Reformat compiled code
    else reformat_compiled_code is False
        DBTCommon->>DBTCommon: Use compiled code as is
    end
    DBTCommon->>User: Emit metadata with compiled code

Poem

In the world of data, where code does flow,
A new flag appears, to shape the glow.
With a twist and a turn, compiled code's fate,
Reformatted grace or raw, straight.
Thanks to this change, so clear and bright,
Metadata gleams in the user's sight. 🌟



@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Jul 17, 2024
@MatMoore MatMoore marked this pull request as ready for review July 17, 2024 13:18

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR (between 82e5a04 and 4a1f4db).

Files selected for processing (2)
  • metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py (3 hunks)
  • metadata-ingestion/tests/integration/dbt/test_dbt.py (1 hunks)
Additional comments not posted (3)
metadata-ingestion/tests/integration/dbt/test_dbt.py (1)

219-224: New Test Configuration Added

The new DbtTestConfig instance for testing with reformat_compiled_code set to False is correctly configured. This ensures that the new functionality is covered by integration tests, which is crucial for maintaining robustness.

metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py (2)

368-371: New Configuration Field Added

The addition of the reformat_compiled_code field in DBTCommonConfig with a default value of True is well-documented and follows the coding standards. This field allows control over whether compiled SQL code should be reformatted, addressing the recursion error issue by potentially bypassing the reformatting process.


1538-1543: Conditional Formatting Logic Implemented

The conditional logic in _create_view_properties_aspect that checks the reformat_compiled_code flag before deciding to format the compiled_code is correctly implemented. This change supports the new configuration's intended functionality, allowing users to bypass potentially problematic code formatting if needed.

@hsheth2

hsheth2 commented Jul 17, 2024

@MatMoore we recently had a related PR #10678 - is this still an issue despite that?

Either way, I do think having a format_compiled_code flag makes sense.

@MatMoore

> @MatMoore we recently had a related PR #10678 - is this still an issue despite that?
>
> Either way, I do think having a format_compiled_code flag makes sense.

Ah I didn't see that one. I haven't tested it directly but it looks like that will work and maximum recursion depth exceeded will only appear in the debug logs 👍🏻

We probably don't need the reformat_compiled_code flag in that case. Would you prefer I drop that commit, or leave it in? I expect there might be a performance benefit to skipping the parsing & reformatting but it's probably not a big saving unless there are a very large number of nodes to process.

We still need to handle or avoid the unhandled exception in the _parse_cll code though (first commit) - does this change look reasonable?

@hsheth2

hsheth2 commented Jul 18, 2024

@MatMoore do you have a stack trace from when _parse_cll throws an exception? The contract was supposed to be that _parse_cll never throws an exception, but instead returns it using SqlParsingResult.make_from_error(e)

@MatMoore

MatMoore commented Jul 23, 2024

> @MatMoore do you have a stack trace from when _parse_cll throws an exception? The contract was supposed to be that _parse_cll never throws an exception, but instead returns it using SqlParsingResult.make_from_error(e)

Ah, so it does. Now that I look at it again, I'm not sure we've actually run into it since upgrading to 0.13.3 and picking up #10553, so it's possible it was fixed as part of that refactor. Let me do some more testing and get back to you.

The stack trace from CLI version 0.13.2.4 is here. It seems the code was supposed to handle the error, but the handler threw another exception.

@hsheth2 hsheth2 closed this Jul 23, 2024