AWS S3

Guide to integrating AWS S3 with Sprinkle

Datasource Concepts

Before setting up the datasource, learn about datasource concepts here

Step by Step Guide

STEP-1: Configure S3 Connection

To learn about Connection, refer here

Log into the Sprinkle application
Navigate to Datasources -> Connections Tab -> New Connection -> Select S3
Provide all the mandatory details
- Name: Name to identify this connection
- Access Key: Account -> My security credentials -> Access keys -> Create new access key -> Download key file -> Show access key. To know more, click here
- Secret Key: Account -> My security credentials -> Access keys -> Create new access key -> Download key file -> Show secret key. To know more, click here
- Region: The region where the storage bucket was created, for example ap-south-1
- Bucket Name
Test Connection
Create

STEP-2: Configure S3 datasource

To learn about datasource, refer here

Navigate to Datasources -> Datasources Tab -> Add -> Select S3
Provide the name -> Create
Connection Tab:
- From the drop-down, select the name of the connection created in STEP-1
- Update

STEP-3: Create Dataset

Datasets Tab: To learn about Dataset, refer here. Add a Dataset for each folder that you want to replicate, providing the following details:

File Type: Select the File Format
- JSON
- CSV
  - Select Delimiter - Comma, Tab, Pipe, Dash, Other Character
- Parquet
- ORC
Directory Path (Required): Provide the full path, for example s3a://test-sprinkle-a/s3Ingest/s3Ingest13
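To make the expected shape of the Directory Path concrete, here is a minimal Python sketch that splits such a path into its bucket and folder parts. The function name `split_s3_path` is hypothetical, and accepting only the `s3a` scheme is an assumption based on the example path above; it is not a description of how Sprinkle parses the field.

```python
from urllib.parse import urlparse


def split_s3_path(path):
    """Split an s3a:// path into (bucket, folder).

    Hypothetical helper: accepting only the "s3a" scheme is an assumption
    taken from the example path in this guide.
    """
    parsed = urlparse(path)
    if parsed.scheme != "s3a" or not parsed.netloc:
        raise ValueError(f"expected s3a://<bucket>/<folder>, got {path!r}")
    # netloc is the bucket name; the rest of the path is the folder inside it
    return parsed.netloc, parsed.path.strip("/")


bucket, folder = split_s3_path("s3a://test-sprinkle-a/s3Ingest/s3Ingest13")
print(bucket)  # test-sprinkle-a
print(folder)  # s3Ingest/s3Ingest13
```

A check like this can catch a missing scheme or bucket before the path is pasted into the form. The bucket part should match the Bucket Name given in the connection.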
Ingestion Mode (Required):
- Complete: The full folder is downloaded and ingested in every ingestion job run
- Incremental: Only the new files are ingested in every ingestion job run. Use this option if your folder is very large and receives new files continuously
  - Remove Duplicate Rows:
    Unique Key: Unique key of the table, used to deduplicate data across multiple ingestions
    Time Column Name: Column used to order data for deduplication
  - Max Job Runtime: Maximum time, in minutes, for which data should be downloaded. The ingestion job runs for at most the specified number of minutes and then updates the checkpoint. The next run continues from the checkpoint.
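The incremental options above can be pictured with a small Python sketch. It is an illustration only, not Sprinkle's implementation: the function names, the row and file shapes, the column names `order_id` and `updated_at`, and the rule that the row with the latest time value wins are all assumptions made for the example.

```python
def select_new_files(files, checkpoint):
    """Return (name, modified) pairs newer than the checkpoint, oldest first.

    Assumption: the checkpoint is the modification time of the last file
    ingested by the previous run.
    """
    return sorted((f for f in files if f[1] > checkpoint), key=lambda f: f[1])


def dedupe_rows(rows, unique_key, time_column):
    """Keep one row per unique key, preferring the latest time value.

    Assumption: "latest wins" is how the time column orders duplicates.
    """
    latest = {}
    for row in rows:
        key = row[unique_key]
        if key not in latest or row[time_column] >= latest[key][time_column]:
            latest[key] = row
    return list(latest.values())


# ISO-8601 strings in one fixed format compare correctly as plain strings.
files = [("a.json", "2024-01-01T00:00:00"), ("b.json", "2024-01-03T00:00:00")]
print(select_new_files(files, checkpoint="2024-01-02T00:00:00"))
# [('b.json', '2024-01-03T00:00:00')]

rows = [
    {"order_id": 1, "status": "placed", "updated_at": "2024-01-01T10:00:00"},
    {"order_id": 2, "status": "placed", "updated_at": "2024-01-01T11:00:00"},
    {"order_id": 1, "status": "shipped", "updated_at": "2024-01-02T09:00:00"},
]
for row in dedupe_rows(rows, unique_key="order_id", time_column="updated_at"):
    print(row["order_id"], row["status"])
# 1 shipped
# 2 placed
```

The sketch shows why both fields are needed: the Unique Key identifies which rows are the same record, and the Time Column Name decides which copy to keep when a record appears in more than one ingestion.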
Flatten Type (Required): Select One Level or Multi Level. With One Level, complex types are not flattened and are stored as strings. With Multi Level, complex types are flattened repeatedly until they become simple types.
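The difference between the two flatten types can be sketched in Python as follows. This is a hedged illustration: the function names are hypothetical, and the `_` separator in column names, the JSON string encoding, and the handling of lists (kept as JSON strings in both modes) are assumptions, not documented Sprinkle behaviour.

```python
import json


def flatten_one_level(record):
    """One Level: complex values are kept as strings (JSON text is assumed)."""
    return {
        key: json.dumps(value) if isinstance(value, (dict, list)) else value
        for key, value in record.items()
    }


def flatten_multi_level(record, prefix=""):
    """Multi Level: nested objects are expanded until only simple values remain.

    Assumptions: column names are joined with "_", and lists stay as JSON strings.
    """
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_multi_level(value, prefix=f"{column}_"))
        elif isinstance(value, list):
            flat[column] = json.dumps(value)
        else:
            flat[column] = value
    return flat


record = {"id": 7, "user": {"name": "Asha", "address": {"city": "Pune"}}}
print(flatten_one_level(record))
# {'id': 7, 'user': '{"name": "Asha", "address": {"city": "Pune"}}'}
print(flatten_multi_level(record))
# {'id': 7, 'user_name': 'Asha', 'user_address_city': 'Pune'}
```

One Level keeps the table narrow and leaves parsing of nested values to later queries, while Multi Level produces one column per nested field.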
Destination Schema (Required): Data warehouse schema into which the table will be ingested
Destination Table Name (Required): Name of the table to be created in the warehouse. If not given, Sprinkle will create a name like ds_<datasourcename>_<tablename>
Destination Create Table Clause: Provide additional clauses to warehouse-create table queries such as clustering, partitioning, and more, useful for optimizing DML statements. Learn more on how to use this field.
Create

STEP-4: Run and schedule Ingestion

In the Ingestion Jobs tab:

Trigger the job using the Run button
To schedule, enable Auto-Run. Change the frequency if needed


Last updated 1 year ago