feat(ingestion/lookml): support looker -- if comments #11113

Merged
merged 13 commits into from
Aug 16, 2024

Conversation

sid-acryl
Collaborator

@sid-acryl sid-acryl commented Aug 7, 2024

There are many transformations that we need to perform on the LookML view to make it suitable for metadata ingestion.

These transformations include:

  1. Evaluating Looker templates, such as -- if comments.
  2. Resolving Liquid templates.
  3. Removing ${} from derived view patterns (e.g., changing ${view_name.SQL_TABLE_NAME} to view_name.SQL_TABLE_NAME).
  4. Completing incomplete SQL fragments.
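
To make the first transformation concrete, here is a minimal sketch of evaluating `-- if` comments. It is not the code from this PR: the helper name `evaluate_if_comments`, the regex, and the assumption that each `-- if <env> --` marker sits on its own line are all illustrative.

```python
import re

# Hypothetical helper for illustration; not the transformer added in this PR.
IF_COMMENT = re.compile(r"--\s*if\s+(\w+)\s*--")


def evaluate_if_comments(text: str, environment: str) -> str:
    """Keep lines tagged for `environment`; drop lines tagged for another one."""
    kept = []
    for line in text.splitlines():
        match = IF_COMMENT.search(line)
        if match is None:
            kept.append(line)  # untagged line: always kept
        elif match.group(1) == environment:
            # Strip the marker so only the SQL fragment remains.
            kept.append(IF_COMMENT.sub("", line, count=1).strip())
        # Lines tagged for a different environment are skipped.
    return "\n".join(kept)


sql_table_name = "-- if prod -- prod_schema.orders\n-- if dev -- dev_schema.orders"
prod_value = evaluate_if_comments(sql_table_name, "prod")
dev_value = evaluate_if_comments(sql_table_name, "dev")
```

With the sample input, `prod_value` keeps only the production table and `dev_value` only the development one; the environment string would presumably come from the source configuration.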

The Python module looker_template_language.py handles all these transformations. To keep the code readable and extensible, we have added a transformer for each of the operations mentioned above. If we need to perform any additional transformations in the future, we can easily add a new transformer to handle that scenario.

Each transformer works on specific attributes of the LookML view. For example, the #4 transformation is only applicable to the view.derived.sql attribute, while the other transformations apply to both the view.sql_table_name and view.derived.sql attributes.
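
A rough sketch of transformations #3 and #4, applied to the attributes they target, might look like the following. The helper names, the regexes, and the choice to append `FROM <view name>` are assumptions made for illustration, not the PR's actual implementation.

```python
import re

# Hypothetical helpers for illustration; names and regexes are assumptions.
DERIVED_VIEW_PATTERN = re.compile(r"\$\{([^}]+)\}")


def drop_derived_view_pattern(sql: str) -> str:
    """Rewrite ${view_name.SQL_TABLE_NAME} as view_name.SQL_TABLE_NAME."""
    return DERIVED_VIEW_PATTERN.sub(r"\1", sql)


def complete_sql(sql: str, view_name: str) -> str:
    """Add a SELECT and/or FROM clause when the fragment lacks one."""
    if not re.search(r"\bselect\b", sql, flags=re.IGNORECASE):
        sql = f"SELECT {sql}"
    if not re.search(r"\bfrom\b", sql, flags=re.IGNORECASE):
        sql = f"{sql} FROM {view_name}"
    return sql


view = {
    "sql_table_name": "${orders.SQL_TABLE_NAME}",
    "derived_table": {"sql": "id, amount"},
}
# Pattern removal applies to sql_table_name (and derived SQL as well).
table_name = drop_derived_view_pattern(view["sql_table_name"])
# SQL completion is applied only to the derived table's SQL.
derived_sql = complete_sql(view["derived_table"]["sql"], view_name="orders")
```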

The class LookMLViewTransformer contains the logic to ensure that the transformer is applied to specific attributes and returns a dictionary containing the transformed data. For example, in cases #1 and #2, it returns:

transformed derived_table:

{
    "derived_table": {
        "datahub_transformed_sql": "<transformed value of derived_table.sql attribute>"
    }
}

Whereas the original was:

{
    "derived_table": {
        "sql": "<Sql text with liquid or lookml template language>"
    }
}

Each transformation generates a section of the transformed dictionary with a new attribute named datahub_transformed_<original-attribute-name>.
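
The pattern can be sketched as an abstract base class whose `transform` method visits the relevant attributes and emits the `datahub_transformed_` keys. The class names `ViewTransformerSketch` and `StripWhitespaceSketch` and the method `_apply` are hypothetical; the real `LookMLViewTransformer` may be structured differently.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class ViewTransformerSketch(ABC):
    """Hypothetical stand-in for the transformer base class described above."""

    @abstractmethod
    def _apply(self, value: str) -> str:
        """Transform one attribute value."""

    def transform(self, view: Dict[str, Any]) -> Dict[str, Any]:
        # Only the attributes this sketch cares about are visited.
        result: Dict[str, Any] = {}
        if "sql_table_name" in view:
            result["datahub_transformed_sql_table_name"] = self._apply(
                view["sql_table_name"]
            )
        derived = view.get("derived_table", {})
        if "sql" in derived:
            result["derived_table"] = {
                "datahub_transformed_sql": self._apply(derived["sql"])
            }
        return result


class StripWhitespaceSketch(ViewTransformerSketch):
    """Toy transformer: collapses runs of whitespace into single spaces."""

    def _apply(self, value: str) -> str:
        return " ".join(value.split())


output = StripWhitespaceSketch().transform(
    {"derived_table": {"sql": "SELECT  id\n FROM orders"}}
)
```

Adding a new transformation then means adding one more subclass that overrides `_apply`.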

The class TransformedLookMLView collects all such outputs to create a new transformed LookML view. It creates a copy of the original view dictionary and updates the copy with the transformed output. The deepmerge library is used because Python's dict.update function doesn't merge nested fields. The transformed LookML view will contain the following attributes:

{
    "derived_table": {
        "sql": "<original sql with looker template language>",
        "datahub_transformed_sql": "<transformed sql>"
    },

    dimensions .....
}
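
The difference between `dict.update` and a nested merge can be seen in a small standard-library sketch. The `deep_merge` function below is a hand-written stand-in used only for illustration; the PR itself relies on the deepmerge library, whose API is not reproduced here.

```python
import copy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with `override` merged in, recursing into dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


original = {"name": "orders", "derived_table": {"sql": "SELECT 1"}}
transformed = {"derived_table": {"datahub_transformed_sql": "SELECT 1"}}

# dict.update replaces the whole nested dict, so the original "sql" key is lost.
shallow = dict(original)
shallow.update(transformed)

# A nested merge keeps both the original and the transformed attribute.
merged = deep_merge(original, transformed)
```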

Contributor

coderabbitai bot commented Aug 7, 2024

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The changes made to the metadata-ingestion project enhance data handling and configuration capabilities. New dependencies were added for deep merging, and multiple constants were introduced to standardize SQL configurations. Significant refactoring occurred in the LookML processing logic, encapsulating configurations and improving code organization. Additionally, new LookML views were defined, enriching the data model with environment-specific datasets, all aimed at improving the ingestion process and adaptability of metadata in Looker.

Changes

Files | Change Summary
--- | ---
setup.py | Added dependency "deepmerge>=1.1.1" for enhanced data merging capabilities.
.../looker/looker_constant.py | Introduced new constants for SQL and derived table management, improving configuration standardization.
.../looker/looker_file_loader.py | Refactored class initialization to replace liquid_variable with source_config, updating method calls to reflect this change.
.../looker/looker_template_language.py | Added an abstract base class for LookML view transformations and several concrete transformer subclasses for specific behaviors.
.../looker/lookml_config.py | Introduced a new looker_environment field in LookMLSourceConfig to specify production or development contexts.
.../looker/lookml_source.py | Modified get_internal_workunits method to utilize self.source_config instead of self.source_config.liquid_variable.
.../vv-lineage-and-liquid-templates/data.model.lkml | Added new include statements and explore blocks to enrich the LookML model with new datasets.
.../vv-lineage-and-liquid-templates/employee_income_source_as_per_env.view.lkml | Defined a new LookML view for employee income data, allowing dynamic source selection based on conditions.
.../vv-lineage-and-liquid-templates/environment_activity_logs.view.lkml | Created a new LookML view for environment activity logs, facilitating data analysis based on environment context.
.../vv_lineage_liquid_template_golden.json | Updated dataset definitions to include new entries for environment_activity_logs and employee_income_source_as_per_env.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Looker
    participant DataHub

    User->>Looker: Request data
    Looker->>DataHub: Fetch metadata
    DataHub->>Looker: Return enriched metadata
    Looker->>User: Serve requested data

Poem

🐰 In fields where data hops and plays,
New constants bloom in joyful ways.
With merges deep and views so bright,
Looker gleams in morning light.
A dance of SQL, fresh and spry,
Metadata leaps, oh my, oh my! 🌼



@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Aug 7, 2024
IncompleteSqlTransformer(
source_config=source_config
), # complete any incomplete sql
]
Collaborator

interesting - so we're pushing some of our transformation logic into the jinja / lkml parse layer?

@sid-acryl sid-acryl marked this pull request as ready for review August 8, 2024 15:29
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Outside diff range, codebase verification and nitpick comments (2)
metadata-ingestion/src/datahub/ingestion/source/looker/looker_file_loader.py (1)

33-39: Occurrences of liquid_variable found in source files.

The transition from liquid_variable to source_config is not fully complete, as there are still references to liquid_variable in the source code. Please review the following locations to ensure that all necessary updates have been made:

  • metadata-ingestion/src/datahub/ingestion/source/looker/looker_template_language.py
  • metadata-ingestion/src/datahub/ingestion/source/looker/looker_liquid_tag.py
  • metadata-ingestion/src/datahub/ingestion/source/looker/lookml_config.py

Consider checking if these references are intentional or need to be updated to use source_config.

Analysis chain

Encapsulation of configuration details.

The constructor now accepts source_config instead of liquid_variable, encapsulating configuration details more effectively. This change should improve code clarity and maintainability. Verify that all dependencies on liquid_variable have been updated to use source_config.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the update from `liquid_variable` to `source_config`.

# Test: Search for any remaining usage of `liquid_variable`. Expect: No occurrences.
rg --type python $'liquid_variable'

Length of output: 65


Script:

#!/bin/bash
# Description: Verify the update from `liquid_variable` to `source_config`.

# Test: Search for any remaining usage of `liquid_variable`. Expect: No occurrences.
rg --glob '*.py' 'liquid_variable'

Length of output: 4831

metadata-ingestion/src/datahub/ingestion/source/looker/lookml_config.py (1)

177-182: Add tests for looker_environment functionality

The looker_environment field is used in the codebase but does not appear to have direct test coverage. It's important to add tests to ensure that the environment-specific logic behaves correctly in both "prod" and "dev" settings.

  • Consider adding tests in relevant test files to cover scenarios for both environments.
Analysis chain

Addition of looker_environment enhances configuration flexibility.

The looker_environment field allows for specifying the Looker environment, enabling environment-specific logic in LookML files. This addition is well-aligned with the PR's objectives and should be thoroughly tested to ensure correct behavior in both "prod" and "dev" environments.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage and testing of `looker_environment`.

# Test: Search for usage and tests of `looker_environment`. Expect: Correct application and coverage in tests.
rg --type python $'looker_environment'

Length of output: 68


Script:

#!/bin/bash
# Description: Search for the usage of `looker_environment` across the codebase.

# Search for occurrences of `looker_environment` in Python files.
rg 'looker_environment' --glob '*.py' -A 5

Length of output: 2969


Script:

#!/bin/bash
# Description: Search for tests related to `looker_environment` across the codebase.

# Search for occurrences of `looker_environment` in test files.
rg 'looker_environment' --glob '*test*.py' -A 5

Length of output: 47

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 900c259 and b2c2c58.

Files selected for processing (11)
  • metadata-ingestion/setup.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/looker/looker_constant.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/looker/looker_file_loader.py (3 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/looker/looker_template_language.py (2 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/looker/lookml_concept_context.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/looker/lookml_config.py (2 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/looker/lookml_source.py (1 hunks)
  • metadata-ingestion/tests/integration/lookml/vv-lineage-and-liquid-templates/data.model.lkml (2 hunks)
  • metadata-ingestion/tests/integration/lookml/vv-lineage-and-liquid-templates/employee_income_source_as_per_env.view.lkml (1 hunks)
  • metadata-ingestion/tests/integration/lookml/vv-lineage-and-liquid-templates/environment_activity_logs.view.lkml (1 hunks)
  • metadata-ingestion/tests/integration/lookml/vv_lineage_liquid_template_golden.json (1 hunks)
Additional comments not posted (18)
metadata-ingestion/src/datahub/ingestion/source/looker/looker_constant.py (1)

2-10: LGTM! Verify the usage of new constants.

The introduction of these constants improves the standardization of SQL configurations. Ensure that these constants are utilized correctly throughout the codebase.

metadata-ingestion/tests/integration/lookml/vv-lineage-and-liquid-templates/environment_activity_logs.view.lkml (1)

1-12: LGTM! Ensure environment-specific configurations are tested.

The use of conditional SQL table names for different environments is a useful feature. Ensure that these configurations are thoroughly tested to avoid issues in different environments.

Verification successful

Environment-specific configurations are present in LookML files. The -- if prod -- and -- if dev -- patterns are correctly used for SQL table names and other configurations. Ensure these configurations are thoroughly tested in their respective environments to prevent issues during deployment.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the correctness of environment-specific configurations in LookML files.

# Test: Check for the presence of environment-specific configurations. Expect: Correct configurations for each environment.
rg --type lookml '-- if prod' -A 2
rg --type lookml '-- if dev' -A 2

Length of output: 897


Script:

#!/bin/bash
# Description: Verify the correctness of environment-specific configurations in LookML files.

# Test: Check for the presence of environment-specific configurations. Expect: Correct configurations for each environment.
rg -e '-- if prod' -A 2 --type lookml
rg -e '-- if dev' -A 2 --type lookml

Length of output: 140


Script:

#!/bin/bash
# Description: Verify the correctness of environment-specific configurations in LookML files.

# Find all .lkml files and search for environment-specific configurations.
fd --extension lkml --exec rg -e '-- if prod' -A 2 {}
fd --extension lkml --exec rg -e '-- if dev' -A 2 {}

Length of output: 884

metadata-ingestion/tests/integration/lookml/vv-lineage-and-liquid-templates/data.model.lkml (1)

Line range hint 9-36:
LGTM! Verify the integration of new datasets.

The inclusion of new datasets and explore blocks enhances the data model. Ensure that these datasets are correctly integrated and utilized in the LookML model.

Verification successful

Integration of New Datasets Verified Successfully

The include and explore statements for the new datasets environment_activity_logs and employee_income_source_as_per_env are correctly integrated in the LookML model.

  • include: "environment_activity_logs.view.lkml"
  • include: "employee_income_source_as_per_env.view.lkml"
  • explore: environment_activity_logs
  • explore: employee_income_source_as_per_env
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the integration of new datasets in the LookML model.

# Test: Check for the inclusion and exploration of new datasets. Expect: Correct integration of each dataset.
rg --type lookml 'include: "environment_activity_logs.view.lkml"'
rg --type lookml 'include: "employee_income_source_as_per_env.view.lkml"'
rg --type lookml 'explore: environment_activity_logs {'
rg --type lookml 'explore: employee_income_source_as_per_env {'

Length of output: 617


Script:

#!/bin/bash
# Description: Verify the integration of new datasets in the LookML model.

# Test: Check for the inclusion and exploration of new datasets. Expect: Correct integration of each dataset.
rg 'include: "environment_activity_logs.view.lkml"' --glob '*.lkml'
rg 'include: "employee_income_source_as_per_env.view.lkml"' --glob '*.lkml'
rg 'explore: environment_activity_logs \{' --glob '*.lkml'
rg 'explore: employee_income_source_as_per_env \{' --glob '*.lkml'

Length of output: 829

metadata-ingestion/tests/integration/lookml/vv-lineage-and-liquid-templates/employee_income_source_as_per_env.view.lkml (2)

14-14: Clarify the source table selection logic.

The comment -- if dev -- dev_income_source -- if prod -- prod_income_source suggests conditional logic for selecting the source table. Ensure that this logic is implemented correctly in the Looker environment, possibly using the newly introduced looker_environment configuration.


6-12: Ensure correct handling of environment-specific logic.

The use of {% if %} and {% elsif %} for selecting tables based on environment-specific flags is a good approach for dynamic SQL generation. Verify that dw_eff_dt_date, finance_dw_eff_dt_date, dw_eff_dt_week, and finance_dw_eff_dt_week are correctly defined in the Looker environment and that their _is_selected attributes are properly set.

Verification successful

Environment-specific logic is correctly implemented and tested.

The flags dw_eff_dt_date, finance_dw_eff_dt_date, dw_eff_dt_week, and finance_dw_eff_dt_week are defined and their _is_selected attributes are used appropriately in the codebase, as evidenced by their presence in LookML files and the test file test_lookml.py.

  • metadata-ingestion/tests/integration/lookml/test_lookml.py: Contains definitions and tests for _is_selected attributes.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the definition and usage of environment-specific flags.

# Test: Search for the flags in the codebase. Expect: Definitions and usage of `_is_selected` attributes.
rg --type lkml $'dw_eff_dt_date|finance_dw_eff_dt_date|dw_eff_dt_week|finance_dw_eff_dt_week'

Length of output: 123


Script:

#!/bin/bash
# Description: Verify the definition and usage of environment-specific flags.

# Test: Search for the flags in the codebase. Expect: Definitions and usage of `_is_selected` attributes.
rg 'dw_eff_dt_date|finance_dw_eff_dt_date|dw_eff_dt_week|finance_dw_eff_dt_week'

Length of output: 4843

metadata-ingestion/src/datahub/ingestion/source/looker/looker_file_loader.py (1)

10-10: Refactor aligns with improved template processing.

The replacement of resolve_liquid_variable_in_view_dict with process_lookml_template_language suggests a more robust approach to handling LookML templates. Ensure that the new function covers all necessary template processing scenarios.

Verification successful

Refactor aligns with improved template processing.

The process_lookml_template_language function is implemented in metadata-ingestion/src/datahub/ingestion/source/looker/looker_template_language.py. It uses a series of transformers to process LookML templates, indicating a comprehensive approach to handling template processing scenarios. This aligns with the review comment's suggestion of a more robust method.

  • metadata-ingestion/src/datahub/ingestion/source/looker/looker_template_language.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation and coverage of `process_lookml_template_language`.

# Test: Search for the function implementation. Expect: Comprehensive handling of LookML templates.
ast-grep --lang python --pattern $'def process_lookml_template_language($_, $_):\n    $$$'

Length of output: 90


Script:

#!/bin/bash
# Description: Search for the function `process_lookml_template_language` without specifying parameters to find any implementation.

rg 'def process_lookml_template_language' --glob '*.py' -A 10

Length of output: 1353

metadata-ingestion/src/datahub/ingestion/source/looker/looker_template_language.py (7)

111-164: Good use of abstract base class pattern.

The LookMLViewTransformer class is well-structured, effectively using the abstract base class pattern to enforce implementation of the _apply_transformation method in subclasses. The transform method is central to the class's functionality and is well-implemented.


166-175: LGTM!

The LiquidVariableTransformer class correctly implements the transformation logic for resolving liquid variables.


178-204: Well-implemented SQL completion logic.

The IncompleteSqlTransformer class effectively handles incomplete SQL fragments by adding necessary SELECT and FROM clauses.


207-219: Efficient pattern removal.

The DropDerivedViewPatternTransformer class uses regular expressions efficiently to remove ${} patterns from SQL table names.


222-254: Well-constructed regex for conditional comments.

The LookMlIfCommentTransformer class effectively evaluates Looker -- if -- comments using well-constructed regular expressions.


257-286: Effective transformation management.

The TransformedLookMlView class is well-structured and effectively manages the transformation process using a sequence of transformers.


289-318: Modular and extensible transformation process.

The process_lookml_template_language function is well-designed, applying transformations in a prioritized manner. The use of transformers enhances modularity and extensibility.

metadata-ingestion/src/datahub/ingestion/source/looker/lookml_concept_context.py (1)

318-318: Improved string handling.

The modification to the datahub_transformed_sql_table_name method enhances robustness by ensuring the output is free from excess whitespace and surrounding spaces.

metadata-ingestion/setup.py (1)

174-174: New dependency added: deepmerge.

The addition of the deepmerge>=1.1.1 dependency enhances the project's capability to handle deep merging of data structures.

metadata-ingestion/src/datahub/ingestion/source/looker/lookml_source.py (1)

672-672: Verify usage of self.source_config in get_internal_workunits.

The change from self.source_config.liquid_variable to self.source_config allows broader access to configuration properties. Ensure that the method logic is updated to handle this broader scope and that any specific logic previously relying on liquid_variable is correctly adjusted.

Verification successful

Usage of self.source_config in get_internal_workunits is correctly handled.

The method now accesses multiple properties of self.source_config, such as base_folder, project_name, and model_pattern. This indicates that the broader configuration scope is being utilized effectively, and the logic appears consistent with this change. No issues found.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `self.source_config` within the `get_internal_workunits` method.

# Test: Search for the usage of `self.source_config` in the method. Expect: Correct handling of the broader configuration scope.
ast-grep --lang python --pattern $'def get_internal_workunits(self) -> Iterable[MetadataWorkUnit]:\n  $$$\n  self.source_config\n  $$$'

Length of output: 133


Script:

#!/bin/bash
# Description: Use ripgrep to search for `self.source_config` usage within the `get_internal_workunits` method.

# Test: Search for `self.source_config` in the file to understand its usage context.
rg --context 5 'def get_internal_workunits' -A 50 metadata-ingestion/src/datahub/ingestion/source/looker/lookml_source.py | rg 'self\.source_config'

Length of output: 426

metadata-ingestion/tests/integration/lookml/vv_lineage_liquid_template_golden.json (2)

1583-1768: Verify conditional comments in viewProperties for environment_activity_logs.

The viewProperties aspect uses conditional comments to define SQL logic. Verify that these comments are correctly interpreted and executed in the Looker environment.


1769-2048: Verify conditional logic in viewProperties for employee_income_source_as_per_env.

The viewProperties aspect uses conditional logic to dynamically select data sources. Verify that this logic is correctly implemented and functions as intended in different environments.

Verification successful

Conditional logic in viewProperties is correctly implemented.

The viewLogic aspect for employee_income_source_as_per_env includes conditional logic to dynamically select data sources, confirming its correct implementation. No further action is required.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the correctness of conditional logic in `viewProperties`.

# Test: Search for the usage of conditional logic in the dataset entry. Expect: Correct implementation and functionality.
rg --json -A 5 '"entityUrn": "urn:li:dataset:(urn:li:dataPlatform:looker,lkml_samples.view.employee_income_source_as_per_env,PROD)"' | rg '"viewLogic":'

Length of output: 154


Script:

#!/bin/bash
# Description: Re-verify the correctness of conditional logic in `viewProperties`.

# Test: Search for the `viewLogic` aspect within the dataset entry more effectively.
rg '"entityUrn": "urn:li:dataset:(urn:li:dataPlatform:looker,lkml_samples.view.employee_income_source_as_per_env,PROD)"' -A 50 metadata-ingestion/tests/integration/lookml/vv_lineage_liquid_template_golden.json | rg '"viewLogic":'

Length of output: 231


Script:

#!/bin/bash
# Description: Search for any instance of `viewLogic` in the entire JSON file to verify its existence and format.

# Test: Search for the `viewLogic` aspect in the entire file.
rg '"viewLogic":' metadata-ingestion/tests/integration/lookml/vv_lineage_liquid_template_golden.json

Length of output: 4472

@hsheth2 hsheth2 changed the title feat(ingestion/lookml): support looker -- if comments feat(ingestion/lookml): support looker -- if comments Aug 16, 2024
@hsheth2 hsheth2 merged commit cb33c0f into datahub-project:master Aug 16, 2024
58 checks passed