Interface Changelog


@Evolving public interface Changelog
The central connector interface for Change Data Capture (CDC).

Connectors implement this minimal interface to expose change data. Spark handles post-processing (carry-over removal, update detection, net change computation) based on the properties declared by the connector.

The columns returned by columns() must include the following metadata columns:

  • _change_type (STRING) — the kind of change: insert, delete, update_preimage, or update_postimage
  • _commit_version (connector-defined type, e.g. LONG) — the version containing this change
  • _commit_timestamp (TIMESTAMP) — the timestamp of the commit
Since:
4.2.0
  • Method Summary

    Modifier and Type
    Method
    Description
    Returns the columns of this changelog, including data columns and the required metadata columns (_change_type, _commit_version, _commit_timestamp).
    boolean
    Whether the raw change data may contain identical insert/delete carry-over pairs produced by copy-on-write file rewrites.
    boolean
    Whether the raw change data may contain multiple intermediate states per row identity within the requested changelog range (across all commit versions in the range).
    A name to identify this changelog.
    Returns a new ScanBuilder for reading the change data.
    boolean
    Whether updates in the raw change data are represented as delete+insert pairs rather than fully materialized update_preimage and update_postimage entries.
    default NamedReference[]
    Returns the columns that uniquely identify a row, used for update detection and net change computation.
    Returns the column used for ordering changes within the same row identity, used for update detection.
  • Method Details

    • name

      String name()
      A name to identify this changelog.
    • columns

      Column[] columns()
      Returns the columns of this changelog, including data columns and the required metadata columns (_change_type, _commit_version, _commit_timestamp).
    • containsCarryoverRows

      boolean containsCarryoverRows()
      Whether the raw change data may contain identical insert/delete carry-over pairs produced by copy-on-write file rewrites.

      When true and the CDC query's deduplicationMode is not none, Spark will remove carry-over pairs from the raw change data. If false, the connector guarantees that no carry-over pairs are present in the raw change data and Spark will skip carry-over removal entirely.

    • containsIntermediateChanges

      boolean containsIntermediateChanges()
      Whether the raw change data may contain multiple intermediate states per row identity within the requested changelog range (across all commit versions in the range).

      When true and the CDC query's deduplicationMode is netChanges, Spark will collapse multiple changes per row identity into the net effect. If false, the connector guarantees at most one change per row identity across the entire changelog range, and Spark will skip net change computation.

    • representsUpdateAsDeleteAndInsert

      boolean representsUpdateAsDeleteAndInsert()
      Whether updates in the raw change data are represented as delete+insert pairs rather than fully materialized update_preimage and update_postimage entries.

      When true and the CDC query's computeUpdates option is enabled, Spark will derive update_preimage/update_postimage from insert/delete pairs in the raw change data. If false, the connector guarantees that update pre/post-images are already present in the raw change data.

    • newScanBuilder

      ScanBuilder newScanBuilder(CaseInsensitiveStringMap options)
      Returns a new ScanBuilder for reading the change data.
      Parameters:
      options - read options (case-insensitive string map)
    • rowId

      default NamedReference[] rowId()
      Returns the columns that uniquely identify a row, used for update detection and net change computation.

      The default implementation throws UnsupportedOperationException. Connectors that support update detection or net change computation must override this method.

    • rowVersion

      default NamedReference rowVersion()
      Returns the column used for ordering changes within the same row identity, used for update detection.

      The default implementation throws UnsupportedOperationException. Connectors that support update detection must override this method.