Ingestion Modes
Brief description of the various ingestion modes on Sprinkle
While adding datasets, the ingestion mode can be set to either Complete or Incremental, depending on the need.
Complete Ingestion
Complete ingestion fetches the entire data during each ingestion, irrespective of the pre-existing data. This takes a significant amount of time and also increases the load on the source database server.
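As a rough illustration only (not Sprinkle's actual implementation), a complete ingestion rereads the whole source table on every run and rebuilds the warehouse table from scratch. The table name, columns, and DB-API connections below are hypothetical:

```python
def complete_ingestion(source_conn, warehouse_conn):
    # Fetch every row from the source, regardless of what was ingested before.
    rows = source_conn.execute(
        "SELECT order_id, status, updated_at FROM orders"
    ).fetchall()

    # Drop and recreate the warehouse table, then load the full snapshot.
    warehouse_conn.execute("DROP TABLE IF EXISTS orders")
    warehouse_conn.execute(
        "CREATE TABLE orders (order_id INTEGER, status TEXT, updated_at TEXT)"
    )
    warehouse_conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    warehouse_conn.commit()
```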
Incremental Ingestion
In Incremental ingestion, only new and updated data is ingested from the data source, based on checkpoints. For Incremental ingestion, the "Time Column" (also known as the Updated_at Timestamp) needs to be specified, which forms the basis for the incremental ingestion.
First Fetch
In the first run, a full fetch, i.e. complete ingestion, is done. The complete data is loaded into the dataset, and the checkpoint for the fetch is updated. The checkpoint is the maximum Time Column value (timestamp) of the previous ingestion and is used in the following fetches to pick up only the new and updated rows.
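A minimal sketch of the first fetch, assuming the same hypothetical orders table with an updated_at Time Column; the checkpoint is simply the maximum updated_at value seen in the load:

```python
def first_fetch(source_conn, warehouse_conn):
    # First run: behaves like a complete ingestion.
    rows = source_conn.execute(
        "SELECT order_id, status, updated_at FROM orders"
    ).fetchall()
    warehouse_conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    warehouse_conn.commit()

    # Checkpoint = max(Time Column) of this ingestion; used by later fetches.
    return max((row[2] for row in rows), default=None)
```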
Following Fetches
In the fetches following the First Fetch, only incremental records are fetched. The table is updated with only the rows whose Time Column value is greater than or equal to the checkpoint.
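Continuing the same hypothetical sketch, a subsequent fetch filters the source on the Time Column using the stored checkpoint and appends the result:

```python
def incremental_fetch(source_conn, warehouse_conn, checkpoint):
    # Fetch only rows whose Time Column value is >= the previous checkpoint.
    rows = source_conn.execute(
        "SELECT order_id, status, updated_at FROM orders WHERE updated_at >= ?",
        (checkpoint,),
    ).fetchall()

    # Append the new and updated rows, then advance the checkpoint.
    warehouse_conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    warehouse_conn.commit()
    return max((row[2] for row in rows), default=checkpoint)
```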
Remove Duplicate Rows
This option can be enabled to remove duplicate row entries. It is generally required when rows get updated on the source, causing incremental ingestion to create duplicate rows.
For example, consider an e-commerce orders table. If the status of a particular order gets updated, the Time Column entry for that order ID is also updated with the current timestamp. In this scenario, the updated entry is ingested as an additional row, so queries on the final table may return duplicate rows for the same order.
Users can opt to remove duplicate rows by clicking Yes. When enabled, the tool removes the duplicate rows present in the table based on the unique key provided.
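A minimal sketch of the idea, keeping only the latest version of each row based on a unique key (here order_id) and the Time Column (here updated_at); this is an illustration, not Sprinkle's implementation:

```python
def remove_duplicates(rows, key_index=0, time_index=2):
    # Keep the most recent version of each row, identified by the unique key
    # and ordered by the Time Column.
    latest = {}
    for row in rows:
        key = row[key_index]
        if key not in latest or row[time_index] > latest[key][time_index]:
            latest[key] = row
    return list(latest.values())

rows = [
    (1, "PLACED",  "2024-01-01T10:00:00"),
    (2, "PLACED",  "2024-01-01T11:00:00"),
    (1, "SHIPPED", "2024-01-02T09:30:00"),  # order 1 updated on the source
]
print(remove_duplicates(rows))
# [(1, 'SHIPPED', '2024-01-02T09:30:00'), (2, 'PLACED', '2024-01-01T11:00:00')]
```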
Periodic Full Fetch
The Periodic Full Fetch option enables a periodic complete fetch of the data from the source. This can be done in order to keep the dataset properly in sync with the source data. On clicking "Yes", you get an option to select the periodicity in the "Full Fetch Interval" field. You would want a periodic full fetch if data gets deleted on the source and you want it deleted from the warehouse as well.
Users can opt for complete ingestion every night or every week. During complete ingestion, the complete table is dropped and recreated in the warehouse. Transactional data, where updates to row entries are expected, can be ingested with incremental settings and Periodic Full Fetch enabled.
By default, "Periodic Full Fetch" is set to No. Event or log-based data can be ingested without Periodic Full Fetch enabled, as no updates are expected on the row entries.
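A rough sketch of the scheduling decision, assuming a hypothetical per-dataset last_full_fetch timestamp and a "Full Fetch Interval" expressed in days:

```python
from datetime import datetime, timedelta

def should_full_fetch(last_full_fetch, interval_days=7):
    # True when the Full Fetch Interval has elapsed, so the next run drops and
    # recreates the table instead of doing an incremental, checkpoint-based fetch.
    if last_full_fetch is None:
        return True  # the very first run is always a complete ingestion
    return datetime.now() - last_full_fetch >= timedelta(days=interval_days)
```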
Best Practices
All event or log-based data can be ingested in Incremental mode with no periodic full fetch enabled, because there won't be any change in the historical data.
All transactional data with an updated_at timestamp can be ingested in Incremental mode with the option to remove duplicates.
Opting to remove duplicates is not required if only new rows get added on the source, e.g. history or audit data.
Enable Periodic Full Fetch in Incremental mode if records get deleted on the source and the warehouse should stay in sync; disable it otherwise.