feat(ingest/unity): Add usage extraction; add TableReference #7910
Conversation
def spark_sql_parser(self):
    """Lazily initializes the Spark SQL parser."""
    if self._spark_sql_parser is None:
        spark_context = pyspark.SparkContext.getOrCreate()
Do we install the pyspark dependency for unity-catalog? It's not explicitly mentioned in setup.py. Is it indirectly installed via databricks-cli?
Ah good catch, it's part of delta lake but not databricks
-class UnityCatalogSourceConfig(StatefulIngestionConfigBase, DatasetSourceConfigMixin):
+class UnityCatalogSourceConfig(
+    StatefulIngestionConfigBase, BaseUsageConfig, DatasetSourceConfigMixin
+):
BaseUsageConfig has a lot of fields; if not all of them are supported, then we should change the base class that we're using.
We use most of them, as they get used by usage_common. The only one that doesn't get used is include_read_operational_stats, which seems to only be used by bigquery... so if anything, I'd say we remove it from this common config and put it only in the bigquery one.
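For context, a minimal sketch of the inheritance pattern under discussion, assuming pydantic models with simplified, illustrative field names (the real BaseUsageConfig and mixins have more fields). It shows why inheriting a shared usage config pulls in every field, including unsupported ones like include_read_operational_stats:

from pydantic import BaseModel

class BaseUsageConfig(BaseModel):
    # Shared usage-extraction knobs; field names here are simplified.
    top_n_queries: int = 10
    include_operational_stats: bool = True
    include_read_operational_stats: bool = False  # the bigquery-only field discussed above

class DatasetSourceConfigMixin(BaseModel):
    env: str = "PROD"

class UnityCatalogSourceConfig(BaseUsageConfig, DatasetSourceConfigMixin):
    # Hypothetical connection fields, for illustration only.
    workspace_url: str
    token: str

config = UnityCatalogSourceConfig(workspace_url="https://example.databricks.com", token="...")
assert config.top_n_queries == 10  # inherited from BaseUsageConfig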
table.upstreams.setdefault(table_ref, {}).setdefault(
    column.name, []
).append(item["name"])
nice cleanup
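For readers unfamiliar with the pattern, a standalone sketch of the chained setdefault above, using hypothetical table and column names:

upstreams: dict = {}
edges = [("main.sales.orders", "amount", "staging.sales.orders_raw.amount")]
for table_ref, column_name, upstream in edges:
    # setdefault returns the existing value, or inserts and returns the default,
    # so the nested dict and list are created on first use with no key checks.
    upstreams.setdefault(table_ref, {}).setdefault(column_name, []).append(upstream)
# {'main.sales.orders': {'amount': ['staging.sales.orders_raw.amount']}}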
TableMap = Dict[str, List[TableReference]]
T = TypeVar("T", bound=object)
don't think the bound is doing anything here
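To illustrate: object is already the implicit upper bound of every TypeVar, so spelling it out adds no constraint.

from typing import TypeVar

T_bounded = TypeVar("T_bounded", bound=object)  # bound=object is the implicit default
T_plain = TypeVar("T_plain")                    # accepts exactly the same types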
-if not self.config.table_pattern.allowed(filter_table_name):
+if not self.config.table_pattern.allowed(table.ref.qualified_table_name):
the filter_table_name isn't quite the fully qualified table name, since it doesn't include the metastore name
Yeah, so I made TableReference.qualified_table_name not include the metastore name, while TableReference.__str__ includes the metastore name. I'm realizing that naming can be confusing though. Any ideas on a better one?
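A minimal sketch of the distinction being described, with field names assumed from the discussion rather than taken from the actual implementation:

from dataclasses import dataclass

@dataclass(frozen=True)
class TableReference:
    metastore: str
    catalog: str
    schema: str
    table: str

    @property
    def qualified_table_name(self) -> str:
        # Excludes the metastore: "catalog.schema.table".
        return f"{self.catalog}.{self.schema}.{self.table}"

    def __str__(self) -> str:
        # Includes the metastore: "metastore.catalog.schema.table".
        return f"{self.metastore}.{self.qualified_table_name}"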
def _parse_query_via_lineage_runner(self, query: str) -> Optional[StringTableInfo]:
    try:
        runner = LineageRunner(query)
not a huge fan of the fact that this LineageRunner stuff is copy-pasted across our codebase
Ideally we'd go through a unified parser interface, where you can select the underlying parser(s) to use with an enum or something
don't need to fix it here though
We have our SqlLineageSQLParser, which creates SqlLineageSQLParserImpl, which eventually calls LineageAnalyzer().analyze, but there was a lot of logic in there that didn't seem necessary. I thought overall this was simpler. Agreed that we should standardize this as much as possible, but I think it'll take a bit of work.
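For comparison, a minimal sketch of the direct LineageRunner usage this approach takes, assuming the sqllineage package; the return shape here is illustrative:

from typing import List, Tuple
from sqllineage.runner import LineageRunner

def parse_tables(query: str) -> Tuple[List[str], List[str]]:
    # source_tables()/target_tables() return sqllineage Table objects;
    # unqualified tables stringify with a "<default>." schema prefix.
    runner = LineageRunner(query)
    return (
        [str(t) for t in runner.source_tables()],
        [str(t) for t in runner.target_tables()],
    )

# parse_tables("INSERT INTO out_tbl SELECT a, b FROM in_tbl")
# -> (["<default>.in_tbl"], ["<default>.out_tbl"])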
spark_context = pyspark.SparkContext.getOrCreate()
spark_session = pyspark.sql.SparkSession(spark_context)
self._spark_sql_parser = (
    spark_session._jsparkSession.sessionState().sqlParser()
whoa that's cool
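A hedged sketch of what this gives you once initialized: parsePlan comes from Catalyst's ParserInterface, and the _jsparkSession access is a private API that may change between Spark versions.

import pyspark
from pyspark.sql import SparkSession

spark_context = pyspark.SparkContext.getOrCreate()
spark_session = SparkSession(spark_context)
parser = spark_session._jsparkSession.sessionState().sqlParser()

# parsePlan returns a Catalyst LogicalPlan as a py4j JavaObject, without
# touching the catalog or executing anything.
plan = parser.parsePlan("SELECT id, amount FROM main.sales.orders")
print(plan.toString())  # pretty-printed, unresolved logical plan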
Adds unity catalog usage extraction.

Problems encountered: SQL query parsing. Uses the sqllineage parser, with a Spark SQL parser as a fallback.

Refactors:
- Moves the UsageAggregator class to usage_common, as I've seen this same logic multiple times.
- Adds user_urn_builder to usage_common, as not all unity users are emails (see the sketch after this list). We create emails with a default email_domain config in other connectors like redshift and snowflake, which seems unnecessary now?
- Creates TableReference for unity catalog and adds it to the Table dataclass, for managing string references to tables. Replaces logic, especially in lineage extraction, with these references.
- Adds gen_dataset_urn and gen_user_urn on the unity source to reduce duplicate code.
- Splits proxy.py into implementation and types.
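A minimal sketch of the user_urn_builder idea from the list above, with an assumed signature: sources whose users aren't emails can pass an identity-style builder instead of synthesizing emails from an email_domain.

from typing import Callable

UserUrnBuilder = Callable[[str], str]  # hypothetical alias: raw user -> URN

def email_user_urn_builder(user: str) -> str:
    # For sources that report emails: drop the domain.
    return f"urn:li:corpuser:{user.split('@')[0]}"

def identity_user_urn_builder(user: str) -> str:
    # For Unity Catalog, where users aren't always emails: use the name as-is.
    return f"urn:li:corpuser:{user}"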