Add workaround to DBT ingestion to avoid recursion errors #10934
If column lineage is disabled, we don't need to call _parse_cll to parse `node.compiled_code` with sqlglot. If this fails, we get spurious "Failed to generate any CLL lineage" messages in the logs. An exception is the case where `added_to_schema_resolver` is False, as in this case the parsed code is used to infer the schema fields.
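The rule described above can be sketched as a small predicate. This is only an illustration of the stated condition: the function name `needs_compiled_code_parse` is hypothetical, and the actual check in `dbt_common.py` may be structured differently.

```python
def needs_compiled_code_parse(
    include_column_lineage: bool, added_to_schema_resolver: bool
) -> bool:
    """Decide whether the compiled SQL needs to be parsed at all (hypothetical helper)."""
    if include_column_lineage:
        # Column lineage is requested, so the parse result is needed.
        return True
    # Lineage is off: parsing is only useful when the node was not added to the
    # schema resolver, because the parsed code is then used to infer schema fields.
    return not added_to_schema_resolver
```

With column lineage disabled and the node already in the schema resolver, the predicate returns `False`, so the recursive parse (and its log message on failure) would be skipped.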
Calling `try_format_query` on `compiled_code` sometimes fails because sqlglot's generator is recursive, and sufficiently complex queries can exceed Python's recursion limit. As a workaround, we can skip the sqlglot formatting and use the value as it is formatted in the DBT manifest. Switching off the new `reformat_compiled_code` option is less obtrusive than switching off `ignore_compiled_code`.
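A minimal sketch of the flag's effect, assuming the formatter is passed in as a plain callable. The names `view_logic` and `failing_formatter` are hypothetical; `formatter` stands in for `try_format_query`, whose real signature is not shown in this PR description.

```python
from typing import Callable


def view_logic(
    compiled_code: str,
    reformat_compiled_code: bool,
    formatter: Callable[[str], str],
) -> str:
    """Return the SQL to emit, reformatting only when the flag is enabled."""
    if reformat_compiled_code:
        return formatter(compiled_code)
    # Flag disabled: keep the formatting already present in the DBT manifest.
    return compiled_code


def failing_formatter(sql: str) -> str:
    """Hypothetical formatter that fails the way a too-deep query would."""
    raise RecursionError("maximum recursion depth exceeded")


sql = "select 1 as id"
# With the flag off, the formatter is never called, so no error is raised.
unformatted = view_logic(sql, False, failing_formatter)
```

The point of the design is that the failing call is never reached when the flag is off, rather than being caught after the fact.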
Walkthrough
The recent update introduces a new boolean field
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant DBTCommonConfig
participant DBTCommon
User->>DBTCommonConfig: Set reformat_compiled_code (True/False)
DBTCommonConfig->>DBTCommon: Pass reformat_compiled_code flag
DBTCommon->>DBTCommon: Check reformat_compiled_code
alt reformat_compiled_code is True
DBTCommon->>DBTCommon: Reformat compiled code
else reformat_compiled_code is False
DBTCommon->>DBTCommon: Use compiled code as is
end
DBTCommon->>User: Emit metadata with compiled code
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py (3 hunks)
- metadata-ingestion/tests/integration/dbt/test_dbt.py (1 hunks)
Additional comments not posted (3)
metadata-ingestion/tests/integration/dbt/test_dbt.py (1)
219-224: New Test Configuration Added
The new `DbtTestConfig` instance for testing with `reformat_compiled_code` set to `False` is correctly configured. This ensures that the new functionality is covered by integration tests, which is crucial for maintaining robustness.
metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py (2)
368-371: New Configuration Field Added
The addition of the `reformat_compiled_code` field in `DBTCommonConfig` with a default value of `True` is well-documented and follows the coding standards. This field allows control over whether compiled SQL code should be reformatted, addressing the recursion error issue by potentially bypassing the reformatting process.
1538-1543: Conditional Formatting Logic Implemented
The conditional logic in `_create_view_properties_aspect` that checks the `reformat_compiled_code` flag before deciding to format the `compiled_code` is correctly implemented. This change supports the new configuration's intended functionality, allowing users to bypass potentially problematic code formatting if needed.
Ah I didn't see that one. I haven't tested it directly but it looks like that will work and We probably don't need the We still need to handle or avoid the unhandled exception in the |
@MatMoore do you have a stack trace from when |
Ah, so it does. Now that I look at it again, I'm not sure we've actually run into it since upgrading to 0.13.3 and picking up #10553, so it's possible it was fixed as part of that refactor. Let me do some more testing and get back to you. The stack trace from CLI version 0.13.2.4 is here. It seems that it was supposed to handle the error, but the handler threw another exception.
Our DBT setup contains some very long queries that break sqlglot's generator, due to Python's recursion limit. I am looking into whether this problem can be solved upstream, but in the meantime, this PR contains a couple of workarounds for this.
The problem
When DataHub's DBT source calls the code now in `_parse_cll`, we get a recursion error, which is handled and printed to the report:
`INFO {datahub.ingestion.source.dbt.dbt_common:1161} - Failed to parse compiled code for snapshot.xxxx: maximum recursion depth exceeded`
(Note: this message has since changed slightly in this refactor, but the logic seems unchanged)
Also, if we have `include_compiled_code` enabled, then we encounter a recursion error that is not handled, breaking the ingestion.
So we have one code path that throws an exception and handles it, and one path that throws unhandled exceptions.
This was observed in v0.13.2.
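To illustrate why a recursive generator breaks on very deep queries, here is a toy sketch that does not use sqlglot at all. The expression tree, `render`, `build_nested`, and `try_render` are all hypothetical stand-ins: `render` uses one Python stack frame per nesting level, much as a recursive SQL generator would, and `try_render` shows the handled path.

```python
import sys


def render(expr):
    """Recursively render a toy expression tree (one stack frame per level)."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return f"({render(left)} {op} {render(right)})"
    return str(expr)


def build_nested(depth):
    """Build a left-nested expression iteratively, so construction never recurses."""
    expr = "x"
    for _ in range(depth):
        expr = ("+", expr, "1")
    return expr


def try_render(expr):
    """Handled path: return the rendered text, or None if the recursion limit is hit."""
    try:
        return render(expr)
    except RecursionError:
        return None


shallow = build_nested(3)
# Nest deeper than the interpreter's recursion limit to force the failure.
deep = build_nested(sys.getrecursionlimit() * 2)
```

Calling `render(deep)` directly corresponds to the unhandled path: the `RecursionError` propagates to the caller. Calling `try_render(deep)` corresponds to the handled path, which survives but produces no output.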
Workaround for the unhandled exception
We can bypass calling `_parse_cll` in cases where `include_column_lineage` is off.
In these cases the output, `inferred_schema_fields`, shouldn't be used, because as far as I can tell, any schema information will be taken from either the DBT catalog or the existing graph.
This code can also be bypassed by switching off `infer_dbt_schemas` (set to true by default), but that would switch off schema inference completely, when in our case we just want to avoid it in a few cases that happen to trigger the recursion error.
Since this is such a minor feature, I'm assuming it's not worth adding any guidance to `metadata-ingestion/docs/sources/dbt/dbt.md`.
Workaround for handled exception leading to spurious logs
The second change adds an option `reformat_compiled_code` to skip running `compiled_code` through `try_format_query`. By default, the behaviour is unchanged, but if disabled, then we will use the formatting that is already present in the DBT output.