Fixes #26670: Resolve QuickSight column aliases to correct upstream column in lineage#27301
Conversation
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
There was a problem hiding this comment.
Pull request overview
Resolves incorrect QuickSight CustomSql column-level lineage when projected columns use aliases (e.g., SELECT id AS relation_id) by leveraging LineageParser.column_lineage mappings instead of falling back to name-based matching.
Changes:
- Added
_build_column_lineage_from_parser()to buildColumnLineageentries from SQL parser column mappings with parent-table filtering and fallback behavior. - Updated
_yield_lineage_from_query()to use the new parser-based column lineage builder. - Added unit tests covering alias resolution, multi-table filtering, and fallback behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
ingestion/src/metadata/ingestion/source/dashboard/quicksight/metadata.py |
Builds column-level lineage using SQL parser alias mappings and uses it during lineage emission. |
ingestion/tests/unit/topology/dashboard/test_quicksight.py |
Adds unit tests to validate alias resolution, multi-table filtering, and fallback behavior. |
| # Multi-table safety: filter by parent table to avoid resolving | ||
| # a shared column name (e.g. 'id') to the wrong upstream table. | ||
| if src_col._parent: | ||
| # _parent may be qualified: '<default>.table' or 'schema.table' | ||
| parent_str = str(src_col._parent).replace("<default>.", "") | ||
| # Compare only the table name portion (last segment) | ||
| parent_table = parent_str.split(".")[-1].lower() | ||
| entity_table = from_entity.name.root.lower() | ||
| if parent_table != entity_table: | ||
| continue |
There was a problem hiding this comment.
src_col._parent is treated as a single value and coerced with str(src_col._parent), but LineageParser.column_lineage can propagate _parent as an iterable of parents (the lineage utilities iterate for parent in source._parent). This can cause incorrect filtering (or even exceptions if _parent/raw_name aren’t present for some parser outputs), leading to an empty column_lineage and falling back to the old name-based matching (reintroducing the bug). Consider using getattr(src_col, "_parent", None) and handling the iterable case by checking whether any parent table name matches from_entity.name.root before accepting the pair.
There was a problem hiding this comment.
Fixed. Replaced direct src_col._parent access with getattr(src_col, "_parent", None) and normalised to a list using hasattr(parent, "__iter__") check. Now handles both single Table and iterable of Tables. Used any() to accept the pair when at least one parent matches from_entity.name.root.
| # Column from the correct upstream table | ||
| src_col_correct = MagicMock() | ||
| src_col_correct.raw_name = "id" | ||
| src_col_correct._parent = MagicMock() | ||
| src_col_correct._parent.__str__ = MagicMock( | ||
| return_value="relation_table" | ||
| ) | ||
|
|
||
| tgt_col_correct = MagicMock() | ||
| tgt_col_correct.raw_name = "relation_id" | ||
|
|
||
| # Column from a DIFFERENT table with same name 'id' | ||
| src_col_wrong = MagicMock() | ||
| src_col_wrong.raw_name = "id" | ||
| src_col_wrong._parent = MagicMock() | ||
| src_col_wrong._parent.__str__ = MagicMock(return_value="other_table") | ||
|
|
There was a problem hiding this comment.
The new tests mock src_col._parent as a single object with a __str__, but in real parser output _parent can be an iterable of parent tables. Adding a test that covers an iterable _parent (e.g., a list/set of mocked parents) would help ensure the filtering logic doesn’t regress and that alias resolution still works when multiple parents are reported.
There was a problem hiding this comment.
Fixed. Added test_build_column_lineage_from_parser_iterable_parent (order 12) that mocks src_col._parent as a list [parent_table_mock] and verifies alias resolution still works correctly when _parent is iterable.
ingestion/src/metadata/ingestion/source/dashboard/quicksight/metadata.py
Outdated
Show resolved
Hide resolved
… also fix bot comments
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
All bot review comments from Copilot and Gitar Bot have been addressed in commit 106c664:
@harshach @pmbrull could you please add the |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
Code Review ✅ Approved 1 resolved / 1 findingsResolves QuickSight column aliases to correct upstream columns in lineage by fixing parent-table filtering when _parent is None in multi-table SQL queries. No issues found. ✅ 1 resolved✅ Edge Case: No parent-table filter when _parent is None in multi-table SQL
OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
| if not column_lineage: | ||
| # Fallback: parser returned no column lineage or all pairs | ||
| # were filtered out. Use name-based matching as before. | ||
| columns = [col.name.root for col in data_model_entity.columns] | ||
| return self._get_column_lineage( | ||
| from_entity, data_model_entity, columns | ||
| ) | ||
|
|
There was a problem hiding this comment.
The fallback to _get_column_lineage is triggered when column_lineage is empty after filtering for the current from_entity. In multi-table queries where lineage_parser.column_lineage exists but has no pairs for a given source table (e.g., a joined table used only in predicates), this will reintroduce the original name-based mis-linking by emitting column lineage for that table anyway. Suggestion: only fall back when lineage_parser.column_lineage is empty/None globally; if the parser has column lineage but none match the current from_entity, return an empty list (or None) so we don't manufacture incorrect column lineage for unrelated tables. Adding a unit test for this case would prevent regressions.
| @pytest.mark.order(12) | ||
| def test_build_column_lineage_from_parser_iterable_parent(self): | ||
| """ | ||
| When src_col._parent is an iterable of parent tables (as some | ||
| parser outputs produce), _build_column_lineage_from_parser must |
There was a problem hiding this comment.
The new tests use @pytest.mark.order(12) and then later @pytest.mark.order(11), which makes the ordered test sequence harder to follow and can be confusing when diagnosing failures. Consider keeping the order markers monotonically increasing in the file (swap these two order values, or reorder the tests to match).
Fixes #26670
Description
I worked on this because QuickSight datasets using CustomSql with column
aliases (e.g.
SELECT id AS relation_id) were producing incorrectcolumn-level lineage. OpenMetadata was matching by column name instead
of tracing the alias back through the SQL expression — so
relation_idwould get linked to any other upstream column named
relation_idin thecatalog, rather than the actual source column
id.The fix was simpler than it looks —
LineageParserwas already beinginstantiated with the CustomSql query in
_yield_lineage_from_query(),but its
column_lineageproperty was never used for column-level lineage.The code was falling back to name-based matching instead.
What I did:
_build_column_lineage_from_parser()method that useslineage_parser.column_lineageto resolve alias mappings correctlysrc_col._parentto filter by parent table — prevents wronglineage when multiple upstream tables share the same column name
raw_namevia.split(".")[-1]to handlefully-qualified column names from the parser
len(col_pair) < 2)no column lineage (SQL too complex or parsing failed)
get_column_fqnimport which was only in the base class beforeType of change:
Checklist:
I have read the CONTRIBUTING document.
My PR title is
Fixes #26670: Resolve QuickSight column aliases to correct upstream column in lineageI have commented on my code, particularly in hard-to-understand areas.
For JSON Schema changes: I updated the migration scripts or explained why it is not needed.
I have added a test that covers the exact scenario we are fixing.
Tests added (
test_quicksight.py):test_build_column_lineage_from_parser_resolves_alias— verifies alias resolutiontest_build_column_lineage_from_parser_multi_table_filters_correctly— verifies parent table filtering prevents wrong lineage in multi-table joinstest_build_column_lineage_from_parser_falls_back_when_empty— verifies graceful fallback to name-based matching