DAAS-9403 AzureSQL Profiling Pipeline #20

Open
wants to merge 3 commits into
base: main
1 change: 1 addition & 0 deletions README.md
@@ -261,6 +261,7 @@ when there is a conditional date format.
* [dcsazure_Snowflake_to_Snowflake_prof_pl](./documentation/pipelines/dcsazure_Snowflake_to_Snowflake_prof_pl.md)
* [dcsazure_adls_to_adls_mask_pl](./documentation/pipelines/dcsazure_adls_to_adls_mask_pl.md)
* [dcsazure_adls_to_adls_prof_pl](./documentation/pipelines/dcsazure_adls_to_adls_prof_pl.md)
* [dcsazure_AzureSQL_to_AzureSQL_prof_pl](./documentation/pipelines/dcsazure_AzureSQL_to_AzureSQL_prof_pl.md)


## Contribution
80 changes: 80 additions & 0 deletions dcsazure_AzureSQL_to_AzureSQL_prof_pl/README.md
@@ -0,0 +1,80 @@
# dcsazure_AzureSQL_to_AzureSQL_prof_pl
## Delphix Compliance Services (DCS) for Azure - AzureSQL to AzureSQL Profiling Pipeline

This pipeline performs automated sensitive data discovery on your AzureSQL instance.

### Prerequisites
1. Configure the hosted metadata database and associated Azure SQL service (version `V2024.01.01.0`+).
1. Configure the DCS for Azure REST service.
1. Configure the AzureSQL linked service.

### Importing
Several linked services must be selected in order to profile your AzureSQL instance.

These linked service types are needed for the following steps:

`AzureSQL` (source) - Linked service associated with unmasked AzureSQL data. This will be used for the following
steps:
* Schema Discovery From AzureSQL (Copy data activity)
* dcsazure_AzureSQL_to_AzureSQL_source_ds (AzureSQL dataset)
* dcsazure_AzureSQL_to_AzureSQL_prof_df/AzureSQLSource1MillRowDataSampling (dataFlow)

`Azure SQL` (metadata) - Linked service associated with your hosted metadata store. This will be used for the following
steps:
* dcsazure_AzureSQL_to_AzureSQL_metadata_prof_ds (Azure SQL Database dataset)
* dcsazure_AzureSQL_to_AzureSQL_prof_df/MetadataStoreRead (dataFlow)
* dcsazure_AzureSQL_to_AzureSQL_prof_df/WriteToMetadataStore (dataFlow)

`REST` (DCS for Azure) - Linked service associated with calling DCS for Azure. This will be used for the following
steps:
* dcsazure_AzureSQL_to_AzureSQL_prof_df (dataFlow)

### How It Works

* Schema Discovery From AzureSQL
  * Query metadata from AzureSQL `information_schema` to identify tables and columns in the AzureSQL instance
* Select Discovered Tables
  * After persisting the metadata to the metadata store, collect the list of discovered tables
* For Each Discovered Table
  * Call the `dcsazure_AzureSQL_to_AzureSQL_prof_df` data flow
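
The schema discovery step can be illustrated with a minimal Python sketch. It assumes the step reads
`INFORMATION_SCHEMA.COLUMNS`; the query text, the `?` placeholder style, the helper name `group_columns_by_table`, and
the sample rows are all hypothetical and are not taken from the pipeline definition.

```python
from collections import defaultdict

# Assumption: discovery reads INFORMATION_SCHEMA.COLUMNS. The query actually
# used by the pipeline's Copy data activity may select different columns.
DISCOVERY_QUERY = (
    "SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE "
    "FROM INFORMATION_SCHEMA.COLUMNS "
    "WHERE TABLE_SCHEMA = ? "
    "ORDER BY TABLE_NAME, ORDINAL_POSITION"
)


def group_columns_by_table(rows):
    """Group (schema, table, column, data_type) rows by (schema, table)."""
    tables = defaultdict(list)
    for schema, table, column, data_type in rows:
        tables[(schema, table)].append((column, data_type))
    return dict(tables)


# Hypothetical rows, shaped like the result of DISCOVERY_QUERY.
sample_rows = [
    ("dbo", "customers", "id", "int"),
    ("dbo", "customers", "email", "nvarchar"),
    ("dbo", "orders", "id", "int"),
]
discovered = group_columns_by_table(sample_rows)
print(sorted(discovered))
```

Each key of `discovered` corresponds to one table that the later "For Each Discovered Table" step would profile.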

```mermaid
sequenceDiagram
participant AzureSQL as Azure SQL Database
participant SchemaDiscovery as Schema Discovery
participant MetadataStore as Metadata Store
participant DataFlow as dcsazure_AzureSQL_to_AzureSQL_prof_df


SchemaDiscovery->>AzureSQL: Query metadata (information_schema)
AzureSQL-->>SchemaDiscovery: Return metadata (tables and columns)

SchemaDiscovery->>MetadataStore: Persist metadata
MetadataStore-->>SchemaDiscovery: Confirmation

SchemaDiscovery->>MetadataStore: Collect discovered tables
MetadataStore-->>SchemaDiscovery: Return list of tables

SchemaDiscovery->>SchemaDiscovery: For Each Discovered Table
loop For each table
SchemaDiscovery->>DataFlow: Call data flow (dcsazure_AzureSQL_to_AzureSQL_prof_df)
DataFlow-->>MetadataStore: Store profiling results
end
```
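
The control flow in the diagram can also be read as a short Python sketch. It is only an illustration of the ordering
of the stages: the function `run_profiling` and its four callables are hypothetical stand-ins for the pipeline
activities, and the loop here is sequential, which may differ from how the pipeline schedules its iterations.

```python
def run_profiling(discover_schema, persist_metadata, select_tables, profile_table):
    """Run the pipeline stages in order, using caller-supplied callables."""
    metadata = discover_schema()      # Schema Discovery From AzureSQL
    persist_metadata(metadata)        # persist metadata to the metadata store
    tables = select_tables()          # Select Discovered Tables
    for table in tables:              # For Each Discovered Table
        profile_table(table)          # stands in for the data flow call
    return tables


# Hypothetical in-memory stand-ins for the source and the metadata store.
store = []
profiled = []
result = run_profiling(
    discover_schema=lambda: [("dbo", "customers", "email"), ("dbo", "orders", "id")],
    persist_metadata=store.extend,
    select_tables=lambda: sorted({(schema, table) for schema, table, _ in store}),
    profile_table=profiled.append,
)
print(profiled)
```

Note that the table list is read back from the store after persisting, mirroring the "Collect discovered tables"
step rather than reusing the discovery result directly.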

### Variables

If you have configured your database using the metadata store scripts, these variables will not need editing. If you
have customized your metadata store, then these variables may need editing.

* `METADATA_SCHEMA` - This is the schema used in the self-hosted AzureSQL database for storing metadata
  (default `dbo`)
* `METADATA_RULESET_TABLE` - This is the table to be used for storing the discovered ruleset
(default `discovered_ruleset`)
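
As a rough illustration of how the two variables relate, the sketch below combines them into a schema-qualified
table name. The helpers `quote_identifier` and `ruleset_table` are hypothetical, and the pipeline may assemble the
name differently; only the default values come from the list above.

```python
def quote_identifier(name):
    """Wrap a SQL Server identifier in brackets, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def ruleset_table(metadata_schema="dbo", metadata_ruleset_table="discovered_ruleset"):
    """Return the schema-qualified ruleset table name (hypothetical helper)."""
    return f"{quote_identifier(metadata_schema)}.{quote_identifier(metadata_ruleset_table)}"


print(ruleset_table())                      # defaults from the variables above
print(ruleset_table("meta", "my_ruleset"))  # a customized metadata store
```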

### Parameters

* `P_SOURCE_DATABASE` - String - This is the database in AzureSQL that contains data we wish to profile
* `P_SOURCE_SCHEMA` - String - This is the schema within the above source database that we will profile
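
A minimal sketch of assembling these parameters before triggering a run is shown below. The helper
`build_parameters` and the example values `sales_db` and `dbo` are hypothetical; how the resulting dictionary is
passed to the pipeline depends on how you trigger it.

```python
REQUIRED_PARAMETERS = ("P_SOURCE_DATABASE", "P_SOURCE_SCHEMA")


def build_parameters(source_database, source_schema):
    """Return the pipeline parameters, checking that both are non-empty strings."""
    params = {
        "P_SOURCE_DATABASE": source_database,
        "P_SOURCE_SCHEMA": source_schema,
    }
    for name in REQUIRED_PARAMETERS:
        value = params[name]
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} must be a non-empty string")
    return params


# Hypothetical values: profile the `dbo` schema of a database named `sales_db`.
params = build_parameters("sales_db", "dbo")
print(params)
```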