CORE DQ Hub – Data Quality Workflows

Data Quality Workflows

Select a workflow. Download your CSV, open the Databricks link, run the notebook, and get your results.

Vendor / Customer

CVI + CDQ

Paste IDs → download the input CSV → run the CVI notebook in Databricks → download the output → upload to CDQ for duplicate matching.

Open workflow

Email Cleaning

Manual Steps

Paste partner IDs → download the input CSV → run the Email Cleaning notebook in Databricks → download the output report.

Open workflow

Contact

Manual Steps

Paste partner IDs → download the input CSV → run the Contact notebook in Databricks → download the output report.

Open workflow

Full Process Guide

Reference

End-to-end reference: Databricks extraction → General Mapping → CVI linkage → CDQ duplicate matching and download rules.

View guide

Vendor / Customer

Fill in your IDs below → download the input CSV → run the notebook in Databricks → download CVI output → upload to CDQ.

Business Input Step 1

Enter IDs below and click Download Input CSV. Then go to Databricks, upload the CSV, run the notebook, and download the output.
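For readers who prepare the input file outside the hub, the following is a minimal sketch of turning pasted IDs into a one-column CSV. The column name `partner_id` and the function name `ids_to_csv` are hypothetical; the header the notebook actually expects is not stated here, so check it before use. IDs are kept as text so that leading zeros survive.

```python
import csv
import io
import re


def ids_to_csv(pasted: str, column: str = "partner_id") -> str:
    """Turn pasted IDs (separated by newlines, commas, semicolons, or spaces) into CSV text.

    The default column name is an assumption, not the hub's documented header.
    """
    seen = set()
    ids = []
    for token in re.split(r"[\s,;]+", pasted.strip()):
        # Skip empty tokens and repeats; keep the order in which IDs first appear.
        if token and token not in seen:
            seen.add(token)
            ids.append(token)

    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([column])
    writer.writerows([i] for i in ids)
    return buf.getvalue()


print(ids_to_csv("0001000234, 0001000235\n0001000234"))
```

The IDs are never converted to numbers, which keeps values such as `0001000234` intact.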

Next Steps Steps 2–4

2

Upload CSV to Databricks & Run Notebook

Log into Databricks. Go to your workspace, upload the downloaded CSV to the FileStore, and run the CVI Extraction notebook.

Open Databricks Workspace

3

Download CVI Output from FileStore

After the notebook finishes, go to Databricks FileStore and download the CVI output CSV file.

Open Databricks FileStore

4

Upload CVI output to CDQ

Go to CDQ → Data Mirror Management → Upload using General Mapping. Then run Duplicate Matching and download Duplicate Consolidation.

Open CDQ Portal

CDQ Duplicate Matching Steps 4–5

1

Upload to CDQ Data Mirror Management

Go to CDQ → Data Mirror Management → Upload using General Mapping. Include NULL values where applicable; do not replace them with blank strings.

2

Configure Duplicate Matching

Go to Duplicate Matching → Upload Configuration File. Adjust the fuzzy matching thresholds if the results over-match or under-match.

3

Select matching mode

Self-match: select one data source. Cross-system (PMD vs P08): select Pattern + Candidate sources.

4

US data rule

For any US records, leave Tax Number 1–5 NULL / empty in the upload template before running matching.

5

Download → Duplicate Consolidation

Always select Duplicate Consolidation as the download option. Validate and block/delete confirmed duplicates before migration.
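One reading of the US data rule above is that Tax Number 1–5 values are cleared for US records before upload. The sketch below illustrates that reading on plain Python dictionaries. The column names (`TAX_NUMBER_1` to `TAX_NUMBER_5`, `COUNTRY`) and the country code `"US"` are assumptions and should be replaced with the names used in the actual upload template.

```python
# Hypothetical column names; adjust to the real General Mapping template.
TAX_COLUMNS = [f"TAX_NUMBER_{i}" for i in range(1, 6)]


def clear_us_tax_numbers(rows, country_column="COUNTRY", us_code="US"):
    """Return copies of the rows with Tax Number 1-5 set to None for US records.

    Non-US rows are copied unchanged; the input list is not modified.
    """
    cleaned = []
    for row in rows:
        row = dict(row)  # work on a copy
        if row.get(country_column) == us_code:
            for col in TAX_COLUMNS:
                row[col] = None  # None stands for NULL, not an empty string
        cleaned.append(row)
    return cleaned


sample = [
    {"COUNTRY": "US", "TAX_NUMBER_1": "12-3456789"},
    {"COUNTRY": "DE", "TAX_NUMBER_1": "DE123456789"},
]
for row in clear_us_tax_numbers(sample):
    print(row["COUNTRY"], row["TAX_NUMBER_1"])
```

How `None` is serialized in the exported CSV depends on the export step and is not covered by this sketch.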

Email Address Cleaning

Enter partner IDs → download the input CSV → run the Email Cleaning notebook in Databricks → download the output.

Enter Partner IDs Step 1

Enter the BP / Partner numbers whose email addresses need cleaning, then download the CSV.

Run & Download Steps 2–4

2

Upload CSV & Run Email Notebook

Log into Databricks. Upload the CSV to FileStore and run the Email notebook in your workspace.

Open Databricks Workspace

3

Download Email Output

After the notebook completes, download the output CSV from Databricks FileStore.

Open Databricks FileStore

4

Validate & Action

Review the output. Block or merge duplicate email records in the source system before migration.

Contact Person

Enter partner IDs → download the input CSV → run the Contact notebook in Databricks → download the output.

Enter Partner IDs Step 1

Enter BP / Partner numbers to identify duplicate contact persons, then download the CSV.

Run & Download Steps 2–4

2

Upload CSV & Run Contact Notebook

Log into Databricks. Upload the CSV to FileStore and run the Contact notebook.

Open Databricks Workspace

3

Download Contact Output

After the notebook finishes, download the output CSV from Databricks FileStore.

Open Databricks FileStore

4

Validate & Action

Review the output. Delete or merge duplicate contacts in the source system. Document findings before migration.

Full Process Guide

End-to-end reference — Databricks extraction through CDQ validation.

End-to-End Steps Reference

1

Load Golden List into Databricks

Upload the Excel golden list to Databricks FileStore. Load with Pandas → convert to Spark DataFrame.

2

Map to General Mapping Template (CDQ Schema)

Cast and rename columns: Name (CONCAT NAME1–4), Country, City, Postal Code (STRING), Tax Numbers 1–5 (STRING — empty for US data), VAT Number.

3

CVI Linkage — Join Tables

Join PMD_but000_view + PMD_cvi_cust_link_view + PMD_cvi_vend_link_view. Always filter: OPtype ≠ 'D' to exclude deleted records.

4

Generate External ID

Formula: BP_Number + '_C' + Customer_Number + '_V' + Vendor_Number. Use COALESCE on all fields.

5

Create Final Template View

UNION ALL of Customer_View (KNA1) and Vendor_View (LFA1) into Final_Template.

6

Export CSV & Upload to CDQ

Export Final_Template as CSV. Upload via CDQ → Data Mirror Management → General Mapping.

Open CDQ Portal

7

Configure Duplicate Matching

Upload Configuration File. Set thresholds. Mode: self-match (one source) or linkage (Pattern + Candidate).

8

Download Results

Always select Duplicate Consolidation. Validate and block/delete duplicates before migration.
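Steps 3 and 4 above can be sketched in SQL. The example below runs the join in an in-memory SQLite database purely for illustration; in Databricks the same logic would be written in Spark SQL. The view names come from this guide, but the column names (`PARTNER`, `PARTNER_GUID`, `CUSTOMER`, `VENDOR`, `OPTYPE`), the choice of LEFT JOIN, and the sample rows are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE PMD_but000_view (PARTNER TEXT, PARTNER_GUID TEXT, OPTYPE TEXT);
CREATE TABLE PMD_cvi_cust_link_view (PARTNER_GUID TEXT, CUSTOMER TEXT, OPTYPE TEXT);
CREATE TABLE PMD_cvi_vend_link_view (PARTNER_GUID TEXT, VENDOR TEXT, OPTYPE TEXT);

-- Made-up sample data: partner 3000 and one customer link are flagged as deleted.
INSERT INTO PMD_but000_view VALUES
    ('1000', 'G1', 'I'), ('2000', 'G2', 'I'), ('3000', 'G3', 'D');
INSERT INTO PMD_cvi_cust_link_view VALUES
    ('G1', '500100', 'I'), ('G2', '500200', 'D');
INSERT INTO PMD_cvi_vend_link_view VALUES
    ('G1', '700900', 'I');
""")

# The OPtype filter sits in each ON clause so that a partner without a live
# customer or vendor link is still returned. Note that `<> 'D'` also drops
# rows whose OPTYPE is NULL.
QUERY = """
SELECT bp.PARTNER  AS BP_Number,
       c.CUSTOMER  AS Customer_Number,
       v.VENDOR    AS Vendor_Number,
       COALESCE(bp.PARTNER, '') || '_C' || COALESCE(c.CUSTOMER, '')
           || '_V' || COALESCE(v.VENDOR, '') AS External_ID
FROM PMD_but000_view AS bp
LEFT JOIN PMD_cvi_cust_link_view AS c
       ON c.PARTNER_GUID = bp.PARTNER_GUID AND c.OPTYPE <> 'D'
LEFT JOIN PMD_cvi_vend_link_view AS v
       ON v.PARTNER_GUID = bp.PARTNER_GUID AND v.OPTYPE <> 'D'
WHERE bp.OPTYPE <> 'D'
ORDER BY bp.PARTNER
"""

rows = con.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Without COALESCE, concatenating a NULL customer or vendor number would make the whole External ID NULL, which is why the formula wraps every field.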

Key Rules

OPtype filter

Always exclude records with OPtype = 'D' (deleted records) in all CVI joins.

US Tax Data

Tax Number 1–5 must be NULL / empty for all US records before CDQ upload.

Data types

Cast Postal Code and all Tax Numbers to STRING — no numeric formatting.

Name format

CONCAT(COALESCE(NAME1,''), ' ', COALESCE(NAME2,''), ' ', COALESCE(NAME3,''), ' ', COALESCE(NAME4,''))

External ID NULLs

Always use COALESCE to handle NULL values in the External ID concat formula.

Download option

Always select "Duplicate Consolidation" when downloading results from CDQ.
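The name-format and data-type rules above can be checked on a few values with a small Python sketch. The helper names (`coalesce`, `build_name`, `as_string`) are hypothetical and only mirror the SQL expressions in the rules; they are not part of any notebook described here.

```python
def coalesce(value, default=""):
    """Return default when value is None, like SQL COALESCE with two arguments."""
    return default if value is None else value


def build_name(name1, name2, name3, name4):
    """Mirror CONCAT(COALESCE(NAME1,''), ' ', ..., COALESCE(NAME4,''))."""
    return " ".join(coalesce(n) for n in (name1, name2, name3, name4))


def as_string(value):
    """Cast a value to text while keeping None (NULL) as None."""
    return None if value is None else str(value)


# The separators are always emitted, so missing name parts leave extra spaces.
print(repr(build_name("Acme", "GmbH", None, None)))

# A postal code that is already text keeps its leading zero.
print(repr(as_string("01234")))
```

Because the formula always emits the three separators, a record with only NAME1 and NAME2 ends with two trailing spaces. Casting to string also cannot restore a leading zero that was lost when a postal code was read as a number earlier, so such columns are best read as text from the start.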