Incremental Sync - Append + Deduped
High-Level Context
This connector syncs data incrementally, which means that only new or modified data will be synced. In contrast with the Incremental Append mode, this mode updates rows that have been modified instead of adding a new version of the row with the updated data. Simply put, if you've synced a row before and it has since been updated, this mode will combine the two rows in the destination and use the most recent data. On the other hand, the Incremental Append mode would just add a new row with the updated data.
Overview
Airbyte supports syncing data in Incremental Append Deduped mode i.e:
- Incremental means syncing only replicate new or modified data. This prevents re-fetching data that you have already replicated from a source. If the sync is running for the first time, it is equivalent to a Full Refresh since all data will be considered as new.
- Append means that this incremental data is added to existing tables in your data warehouse.
- Deduped means that data in the final table will be unique per primary key (unlike Append modes). This is determined by sorting the data using the cursor field and keeping only the latest de-duplicated data row.
Records in the final destination can potentially be deleted as they are de-duplicated, and if your source supports emitting deleting records (e.g. an CDC database source). You should not find multiple copies of the same primary key as these should be unique in that table.
Definitions
A cursor
is the value used to track whether a record should be replicated in an incremental sync. A common example of a cursor
would be a timestamp from an updated_at
column in a database table.
A cursor field
is the field or column in the data where that cursor can be found. Extending the above example, the updated_at
column in the database would be the cursor field
, while the cursor
is the actual timestamp value used to determine if a record should be replicated.
We will refer to the set of records that the source identifies as being new or updated as a delta
.
A primary key
is one or multiple (called composite primary keys
) fields or columns that is used to identify the unique entities of a table. Only one row per primary key value is permitted in a database table. In the data warehouse, just like in incremental - Append, multiple rows for the same primary key can be found in the history table. The unique records per primary key behavior is mirrored in the final table with incremental deduped sync mode. The primary key is then used to refer to the entity which values should be updated.
Rules
As mentioned above, the delta from a sync will be appended to the existing history data in the data warehouse. In addition, it will update the associated record in the final table. Let's walk through a few examples.
Newly Created Record
Assume that updated_at
is our cursor_field
and name
is the primary_key
. Let's say the following data already exists into our data warehouse.
name | deceased | updated_at |
---|---|---|
Louis XVI | false | 1754 |
Marie Antoinette | false | 1755 |
In the next sync, the delta contains the following record:
name | deceased | updated_at |
---|---|---|
Louis XVI | false | 1785 |
At the end of this incremental sync, the data warehouse would now contain:
name | deceased | updated_at |
---|---|---|
Marie Antoinette | false | 1755 |
Louis XVI | false | 1785 |
Updating a Record
Let's assume that our warehouse contains all the data that it did at the end of the previous section. Now, unfortunately the king and queen lose their heads. Let's see that delta:
name | deceased | updated_at |
---|---|---|
Louis XVI | true | 1793 |
Marie Antoinette | true | 1793 |
In the final de-duplicated table:
name | deceased | updated_at |
---|---|---|
Louis XVI | true | 1793 |
Marie Antoinette | true | 1793 |