Interface Changelog
Connectors implement this minimal interface to expose change data. Spark handles post-processing (carry-over removal, update detection, net change computation) based on the properties declared by the connector.
The columns returned by columns() must include the following metadata columns:
- _change_type (STRING): the kind of change: insert, delete, update_preimage, or update_postimage
- _commit_version (connector-defined type, e.g. LONG): the version containing this change
- _commit_timestamp (TIMESTAMP): the timestamp of the commit
Since:
    4.2.0
Method Summary

Column[] columns()
    Returns the columns of this changelog, including data columns and the required metadata columns (_change_type, _commit_version, _commit_timestamp).
boolean containsCarryoverRows()
    Whether the raw change data may contain identical insert/delete carry-over pairs produced by copy-on-write file rewrites.
boolean containsIntermediateChanges()
    Whether the raw change data may contain multiple intermediate states per row identity within the requested changelog range (across all commit versions in the range).
String name()
    A name to identify this changelog.
ScanBuilder newScanBuilder(CaseInsensitiveStringMap options)
    Returns a new ScanBuilder for reading the change data.
boolean representsUpdateAsDeleteAndInsert()
    Whether updates in the raw change data are represented as delete+insert pairs rather than fully materialized update_preimage and update_postimage entries.
default NamedReference[] rowId()
    Returns the columns that uniquely identify a row, used for carry-over removal, update detection, and net change computation.
default NamedReference rowVersion()
    Returns the column that holds the row version: the commit version at which the row's content was last modified.
Method Details
name
String name()
    A name to identify this changelog.
columns
Column[] columns()
    Returns the columns of this changelog, including data columns and the required metadata columns (_change_type, _commit_version, _commit_timestamp).
containsCarryoverRows
boolean containsCarryoverRows()
    Whether the raw change data may contain identical insert/delete carry-over pairs produced by copy-on-write file rewrites.
    When true and the CDC query's deduplicationMode is not none, Spark will remove carry-over pairs from the raw change data. If false, the connector guarantees that no carry-over pairs are present in the raw change data, and Spark will skip carry-over removal entirely.
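The removal Spark applies can be sketched on a simplified row shape. This is a hedged illustration, not Spark's actual implementation: the Change record, its single id/value columns, and the removeCarryovers helper are stand-ins.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of carry-over removal: an insert and a delete in the same commit with
// identical row id and identical content cancel each other out. Copy-on-write
// file rewrites emit such pairs for rows that were merely copied, not changed.
public class CarryoverRemovalSketch {
    // Simplified change row: one id column, one data column.
    public record Change(String changeType, int id, String value, long commitVersion) {}

    public static List<Change> removeCarryovers(List<Change> raw) {
        // Count inserts and deletes per (id, value, commitVersion) key.
        Map<List<Object>, Integer> inserts = new HashMap<>();
        Map<List<Object>, Integer> deletes = new HashMap<>();
        for (Change c : raw) {
            List<Object> key = List.of(c.id(), c.value(), c.commitVersion());
            if (c.changeType().equals("insert")) inserts.merge(key, 1, Integer::sum);
            if (c.changeType().equals("delete")) deletes.merge(key, 1, Integer::sum);
        }
        // Each matched insert/delete pair is a carry-over: drop both halves.
        Map<List<Object>, Integer> dropInserts = new HashMap<>();
        Map<List<Object>, Integer> dropDeletes = new HashMap<>();
        for (List<Object> key : inserts.keySet()) {
            int pairs = Math.min(inserts.get(key), deletes.getOrDefault(key, 0));
            dropInserts.put(key, pairs);
            dropDeletes.put(key, pairs);
        }
        List<Change> out = new ArrayList<>();
        for (Change c : raw) {
            List<Object> key = List.of(c.id(), c.value(), c.commitVersion());
            Map<List<Object>, Integer> drop =
                c.changeType().equals("insert") ? dropInserts
                : c.changeType().equals("delete") ? dropDeletes : null;
            if (drop != null && drop.getOrDefault(key, 0) > 0) {
                drop.merge(key, -1, Integer::sum); // cancel one half of a pair
            } else {
                out.add(c);
            }
        }
        return out;
    }
}
```

A connector returning false from containsCarryoverRows() promises Spark can skip this pass entirely.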
containsIntermediateChanges
boolean containsIntermediateChanges()
    Whether the raw change data may contain multiple intermediate states per row identity within the requested changelog range (across all commit versions in the range).
    When true and the CDC query's deduplicationMode is netChanges, Spark will collapse multiple changes per row identity into the net effect. If false, the connector guarantees at most one change per row identity across the entire changelog range, and Spark will skip net change computation.
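Net change computation can be sketched as follows. This is a simplified, hedged sketch, not Spark's code: rows are modeled as insert/delete changes only (as a connector with representsUpdateAsDeleteAndInsert() = true would emit them), already sorted by commit version, and the Change/Net records are stand-ins.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of deduplicationMode = netChanges: all intermediate states of a row
// identity within the range collapse into a single net effect.
public class NetChangesSketch {
    public record Change(String changeType, int id, String value, long commitVersion) {}
    public record Net(String changeType, int id, String value) {}

    public static List<Net> netChanges(List<Change> sortedByVersion) {
        // Group the per-row change chains, preserving commit-version order.
        Map<Integer, List<Change>> byId = new LinkedHashMap<>();
        for (Change c : sortedByVersion) {
            byId.computeIfAbsent(c.id(), k -> new ArrayList<>()).add(c);
        }
        List<Net> out = new ArrayList<>();
        for (Map.Entry<Integer, List<Change>> e : byId.entrySet()) {
            List<Change> chain = e.getValue();
            Change first = chain.get(0);
            Change last = chain.get(chain.size() - 1);
            // A leading delete means the row existed before the range; a
            // trailing insert means it still exists after the range.
            boolean existedBefore = first.changeType().equals("delete");
            boolean existsAfter = last.changeType().equals("insert");
            if (existedBefore && existsAfter) {
                out.add(new Net("update", e.getKey(), last.value()));
            } else if (existsAfter) {
                out.add(new Net("insert", e.getKey(), last.value()));
            } else if (existedBefore) {
                out.add(new Net("delete", e.getKey(), first.value()));
            }
            // inserted and then deleted within the range: no net change emitted
        }
        return out;
    }
}
```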
representsUpdateAsDeleteAndInsert
boolean representsUpdateAsDeleteAndInsert()
    Whether updates in the raw change data are represented as delete+insert pairs rather than fully materialized update_preimage and update_postimage entries.
    When true and the CDC query's computeUpdates option is enabled, Spark will derive update_preimage/update_postimage rows from insert/delete pairs in the raw change data. If false, the connector guarantees that update pre/post-images are already present in the raw change data.
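The derivation can be sketched on a single commit's rows. A hedged sketch, not Spark's implementation: it assumes carry-over pairs were already removed, and the Change record with its single id/value columns is a stand-in.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of computeUpdates: within one commit, a delete and an insert that
// share a row id are rewritten as an update_preimage/update_postimage pair.
public class UpdateMaterializationSketch {
    public record Change(String changeType, int id, String value, long commitVersion) {}

    public static List<Change> materializeUpdates(List<Change> singleCommit) {
        Map<Integer, Change> deletesById = new HashMap<>();
        for (Change c : singleCommit) {
            if (c.changeType().equals("delete")) deletesById.put(c.id(), c);
        }
        List<Change> out = new ArrayList<>();
        for (Change c : singleCommit) {
            if (c.changeType().equals("delete")) continue; // paired below, or kept at the end
            Change d = c.changeType().equals("insert") ? deletesById.remove(c.id()) : null;
            if (d != null) {
                // delete+insert on the same row id within one commit: an update
                out.add(new Change("update_preimage", d.id(), d.value(), d.commitVersion()));
                out.add(new Change("update_postimage", c.id(), c.value(), c.commitVersion()));
            } else {
                out.add(c);
            }
        }
        out.addAll(deletesById.values()); // deletes with no matching insert remain deletes
        return out;
    }
}
```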
newScanBuilder
ScanBuilder newScanBuilder(CaseInsensitiveStringMap options)
    Returns a new ScanBuilder for reading the change data.
    Parameters:
        options - read options (case-insensitive string map)
rowId
default NamedReference[] rowId()
    Returns the columns that uniquely identify a row, used for carry-over removal, update detection, and net change computation.
    The default implementation throws UnsupportedOperationException. Connectors must override this method when any of containsCarryoverRows(), representsUpdateAsDeleteAndInsert(), or containsIntermediateChanges() returns true. Each referenced column must be non-nullable.
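The non-nullability requirement can be illustrated with a small sketch: Spark groups change rows by their row-identity key, and a null in any id column would make row identities ambiguous. The Map-based row shape and the identityKey helper below are hypothetical, not Spark API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: build the grouping key for a change row from the rowId() columns.
public class RowIdKeySketch {
    public static List<Object> identityKey(Map<String, Object> row, List<String> rowIdColumns) {
        List<Object> key = new ArrayList<>();
        for (String col : rowIdColumns) {
            Object v = row.get(col);
            if (v == null) {
                // A null id column would collide distinct rows in one group.
                throw new IllegalStateException("row id column '" + col + "' must be non-nullable");
            }
            key.add(v);
        }
        return key;
    }
}
```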
rowVersion
default NamedReference rowVersion()
    Returns the column that holds the row version: the commit version at which the row's content was last modified. The row version has these properties:
    - Assigned the current commit version when the row is initially inserted.
    - Bumped to the current commit version when the row's content is updated.
    - Preserved when the row is rewritten by a copy-on-write operation without a content change; it is NOT bumped to the current commit version.
    The row version differs from _commit_version: _commit_version identifies the commit that emitted this change row, while the row version identifies the commit that last wrote the row's content. For a delete+insert pair produced within a single commit, both halves share the same row version if the pair is a copy-on-write carry-over, and have different row versions (old on the delete, new on the insert) if the pair is a true update. Spark uses the row version to distinguish copy-on-write carry-over from updates without scanning data columns, for both carry-over removal and update detection.
    The default implementation throws UnsupportedOperationException. Connectors must override this method when containsCarryoverRows() or representsUpdateAsDeleteAndInsert() returns true. The referenced column must be non-nullable.
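The classification described above can be sketched in a few lines. A hedged illustration: the Row record and classifyPair are hypothetical stand-ins, not Spark API, and both rows are assumed to share the same row id and _commit_version.

```java
// Sketch: the row version lets Spark classify a delete+insert pair within one
// commit without reading any data columns.
public class RowVersionSketch {
    public record Row(String changeType, int id, long rowVersion) {}

    public static String classifyPair(Row deleteRow, Row insertRow) {
        // Equal row versions: the content was not modified in this commit, so
        // the pair is a copy-on-write carry-over. A newer row version on the
        // insert half means the content changed: a true update.
        return deleteRow.rowVersion() == insertRow.rowVersion() ? "carry-over" : "update";
    }
}
```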