
feat(ingest/unity): Add usage extraction; add TableReference #7910

Merged: 5 commits into datahub-project:master on May 1, 2023

Conversation

asikowitz (Collaborator) commented Apr 26, 2023

Adds Unity Catalog usage extraction.

Problems encountered:

  • Parses SQL, since the Unity Catalog API's query history endpoint makes no mention of referenced or destination tables.
  • Attempts to guess the catalog/schema when they are not provided, since the API specifies neither the current catalog nor the current schema at the time a query was run, and does not provide fully qualified table names.

Uses the sqllineage parser, with a Spark SQL parser as a fallback.

Refactors:

  • Adds a UsageAggregator class to usage_common, as I've seen this same logic duplicated multiple times (see the sketch after this list).
  • Allows a customizable user_urn_builder in usage_common, as not all Unity Catalog users are emails. We build emails from a default email_domain config in other connectors like redshift and snowflake, which seems unnecessary now?
  • Creates a TableReference for Unity Catalog and adds it to the Table dataclass, for managing string references to tables. Replaces logic, especially in lineage extraction, with these references.
  • Creates gen_dataset_urn and gen_user_urn on the Unity source to reduce duplicated code.
  • Breaks up proxy.py into implementation and types modules.
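
A rough sketch of the UsageAggregator idea, for illustration only: the class name matches the PR, but the fields, method names, and day-bucketing logic here are assumptions rather than the actual usage_common API.

```python
from collections import defaultdict
from typing import Callable, Dict, Generic, List, TypeVar

T = TypeVar("T")  # connector-specific table reference type


class UsageAggregator(Generic[T]):
    """Buckets query events by (time window, table) so usage statistics
    can be emitted once per window instead of once per query."""

    DAY_MS = 24 * 60 * 60 * 1000

    def __init__(self, user_urn_builder: Callable[[str], str]):
        # Customizable so connectors whose users are not email addresses
        # (e.g. Unity Catalog service principals) can still build user URNs.
        self.user_urn_builder = user_urn_builder
        self._buckets: Dict[int, Dict[T, List[dict]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def aggregate_event(
        self, *, resource: T, start_time_ms: int, query: str, user: str
    ) -> None:
        bucket_start = start_time_ms - (start_time_ms % self.DAY_MS)
        self._buckets[bucket_start][resource].append(
            {"query": query, "user_urn": self.user_urn_builder(user)}
        )


# Example: Unity Catalog usernames are used directly; no email_domain needed.
aggregator: UsageAggregator[str] = UsageAggregator(
    user_urn_builder=lambda user: f"urn:li:corpuser:{user}"
)
aggregator.aggregate_event(
    resource="main.sales.orders",
    start_time_ms=1_682_000_000_000,
    query="SELECT * FROM main.sales.orders",
    user="service-principal-1",
)
```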

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a Usage Guide has been added for it.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) on Apr 26, 2023
```python
def spark_sql_parser(self):
    """Lazily initializes the Spark SQL parser."""
    if self._spark_sql_parser is None:
        spark_context = pyspark.SparkContext.getOrCreate()
```
Collaborator: Do we install the pyspark dependency for unity-catalog? It's not explicitly mentioned in setup.py. Is it indirectly installed via databricks-cli?

asikowitz (Author): Ah, good catch. It's part of delta-lake but not databricks.
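
A minimal sketch of the kind of fix implied here, assuming a setuptools extras layout; the package name, extra name, and dependency list are illustrative, not DataHub's actual setup.py.

```python
# Illustrative only: declare pyspark in the unity-catalog extra so the
# Spark SQL fallback parser is importable. Names are assumptions.
from setuptools import setup

setup(
    name="example-ingestion",
    extras_require={
        "unity-catalog": [
            "databricks-cli",
            "pyspark",  # needed by the Spark SQL fallback parser
        ],
    },
)
```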



```diff
-class UnityCatalogSourceConfig(StatefulIngestionConfigBase, DatasetSourceConfigMixin):
+class UnityCatalogSourceConfig(
+    StatefulIngestionConfigBase, BaseUsageConfig, DatasetSourceConfigMixin
+):
```
Collaborator: BaseUsageConfig has a lot of fields. If not all of them are supported, then we should change the base class that we're using.

asikowitz (Author): We use most of them, as they get used by usage_common. The only one that doesn't get used is include_read_operational_stats, which seems to only be used by bigquery... so if anything I'd say we remove it from this common config and put it only in the bigquery one.
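
For illustration, a simplified sketch of the split being suggested; the field types and defaults here are guesses, not the exact DataHub definitions.

```python
from pydantic import BaseModel


class BaseUsageConfig(BaseModel):
    # Common usage knobs consumed by usage_common (simplified sketch).
    bucket_duration: str = "DAY"
    top_n_queries: int = 10
    include_operational_stats: bool = True


class BigQueryUsageConfig(BaseUsageConfig):
    # Moved out of the common config, since only bigquery uses it.
    include_read_operational_stats: bool = False
```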

```python
table.upstreams.setdefault(table_ref, {}).setdefault(
    column.name, []
).append(item["name"])
```
Collaborator: Nice cleanup.
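
For readers of the snippet above: the chained setdefault builds a nested mapping of upstream table to downstream column to upstream columns. A standalone illustration, with types inferred from context rather than the actual classes:

```python
from typing import Dict, List

# upstream table ref -> downstream column name -> upstream column names
upstreams: Dict[str, Dict[str, List[str]]] = {}
upstreams.setdefault("main.raw.orders", {}).setdefault("amount", []).append(
    "order_amount"
)
assert upstreams == {"main.raw.orders": {"amount": ["order_amount"]}}
```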



```python
TableMap = Dict[str, List[TableReference]]
T = TypeVar("T", bound=object)
```
Collaborator: I don't think the bound is doing anything here.

```diff
-if not self.config.table_pattern.allowed(filter_table_name):
+if not self.config.table_pattern.allowed(table.ref.qualified_table_name):
```
Collaborator: The filter_table_name isn't quite the fully qualified table name, since it doesn't include the metastore name.

asikowitz (Author): Yeah, so I made TableReference.qualified_table_name not include the metastore name, while TableReference.__str__ includes it. I'm realizing that naming can be confusing, though. Any ideas for a better one?
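
To make the distinction concrete, a minimal sketch of the two accessors as described above; the field names are assumed, not the exact PR code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableReference:
    metastore: str
    catalog: str
    schema: str
    table: str

    def __str__(self) -> str:
        # Fully qualified, including the metastore.
        return f"{self.metastore}.{self.catalog}.{self.schema}.{self.table}"

    @property
    def qualified_table_name(self) -> str:
        # Catalog-qualified, without the metastore.
        return f"{self.catalog}.{self.schema}.{self.table}"


ref = TableReference("primary-metastore", "main", "sales", "orders")
assert str(ref) == "primary-metastore.main.sales.orders"
assert ref.qualified_table_name == "main.sales.orders"
```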


```python
def _parse_query_via_lineage_runner(self, query: str) -> Optional[StringTableInfo]:
    try:
        runner = LineageRunner(query)
```
Collaborator: Not a huge fan of the fact that this LineageRunner logic is copy-pasted across our codebase. Ideally we'd go through a unified parser interface, where you can select the underlying parser(s) to use with an enum or something.

Collaborator: We don't need to fix it here, though.

asikowitz (Author): We have our SqlLineageSQLParser, which creates SqlLineageSQLParserImpl and eventually calls LineageAnalyzer().analyze, but there was a lot of logic in there that didn't seem necessary. I thought this was simpler overall. Agreed that we should standardize this as much as possible, but I think it'll take a bit of work.
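
A hypothetical sketch of the unified interface being suggested; this design does not exist in the codebase, and only the sqllineage branch below calls a real API (LineageRunner.source_tables).

```python
from enum import Enum
from typing import List, Optional


class SqlParserBackend(Enum):
    SQLLINEAGE = "sqllineage"
    SPARK = "spark"


def parse_source_tables(
    query: str, backends: List[SqlParserBackend]
) -> Optional[List[str]]:
    """Try each backend in order, falling back to the next on failure."""
    for backend in backends:
        try:
            if backend is SqlParserBackend.SQLLINEAGE:
                from sqllineage.runner import LineageRunner

                return [str(t) for t in LineageRunner(query).source_tables()]
            # A SPARK backend would dispatch to the Spark SQL parser here.
        except Exception:
            continue  # parse failed; try the next backend
    return None
```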

```python
spark_context = pyspark.SparkContext.getOrCreate()
spark_session = pyspark.sql.SparkSession(spark_context)
self._spark_sql_parser = (
    spark_session._jsparkSession.sessionState().sqlParser()
)
```
Collaborator: Whoa, that's cool.
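
For context, a hedged sketch of how that JVM parser can be exercised; parsePlan is part of Spark's internal ParserInterface, so this relies on private APIs and may break across Spark versions.

```python
import pyspark

spark_context = pyspark.SparkContext.getOrCreate()
spark_session = pyspark.sql.SparkSession(spark_context)
parser = spark_session._jsparkSession.sessionState().sqlParser()

# Parse without executing: the result is a JVM logical plan whose string
# form shows UnresolvedRelation nodes, i.e. the tables the query references.
plan = parser.parsePlan("SELECT id FROM main.sales.orders")
print(plan.toString())
```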

@asikowitz merged commit 5b290c9 into datahub-project:master on May 1, 2023
@asikowitz deleted the unity-catalog-usage branch on May 1, 2023 at 18:30