AWS S3 External
Guide to integrating your S3 data as an external table in Athena or Redshift Spectrum
S3 External is a datasource connection that automatically creates an external table in Athena or Redshift Spectrum by inferring the schema of the data. The data is not loaded into the warehouse; instead, it is read from the source location itself when queries run on the data warehouse.
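For context, here is a minimal sketch of the kind of DDL such an integration runs, using boto3's Athena client; the bucket, schema, and columns are hypothetical placeholders, and Sprinkle generates and runs this automatically, so the exact statement may differ:

```python
import boto3

athena = boto3.client("athena", region_name="ap-south-1")

# The external table only points at the S3 location; no data is copied
# into the warehouse. Bucket, schema, and columns here are hypothetical.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_schema.events (
    id string,
    payload string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/events/'
"""

# Athena also needs an S3 location for its own query results.
athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```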
Before setting up the datasource, learn about datasource concepts
To learn about connections, refer to the Connection documentation
Log into the Sprinkle application
Navigate to Datasources -> Connections Tab -> New Connection -> Select S3 External
Provide all the mandatory details
Name: Name to identify this connection
Access Key: Account -> My security credentials -> Access keys -> Create new access key -> Download key file -> Show access key
Secret Key: Account -> My security credentials -> Access keys -> Create new access key -> Download key file -> Show secret key
Region: The region where the storage bucket was created, for example ap-south-1
Bucket Name: Name of the S3 bucket containing the data
Click Test Connection to validate the details; a minimal sketch of an equivalent check appears after these steps
Click Create
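As a rough illustration, the following sketch verifies the same details (access key, secret key, region, bucket) with boto3; the credentials and bucket name are hypothetical placeholders, and Sprinkle's actual connection test may differ:

```python
import boto3
from botocore.exceptions import ClientError

# Hypothetical credentials and bucket, for illustration only.
s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIA...",   # Access Key from the downloaded key file
    aws_secret_access_key="...",   # Secret Key from the same key file
    region_name="ap-south-1",      # Region where the bucket was created
)

try:
    # head_bucket fails if the bucket does not exist or is not accessible
    s3.head_bucket(Bucket="my-bucket")
    print("Connection OK")
except ClientError as err:
    print(f"Connection failed: {err}")
```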
Navigate to Datasources -> Datasources Tab -> Add -> Select S3 External
Provide the name -> Create
Connection Tab:
From the drop-down, select the name of the connection created above
Update
File Type: Select the File Format
JSON
CSV
Select Delimiter - Comma, Tab, Pipe, Dash, Other Character
Parquet
ORC
Compression Type (Required): Select from none, bzip2, gzip, snappy
Flatten Level (Required): Select One Level or Multi Level. With One Level, flattening is not applied to complex types; they are stored as strings. With Multi Level, flattening is applied recursively until complex types become simple types (see the sketch after these steps)
Destination Schema (Required): Data warehouse schema where the table will be created
Destination Table Name: The table name to be created on the warehouse. If not given, Sprinkle will create one like ds_<datasourcename>_<tablename>
Create
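To make the Flatten Level options concrete, here is a rough Python illustration of the difference on a nested JSON record; the underscore-joined column names are an assumption for the example, not necessarily the naming Sprinkle uses:

```python
import json

record = {"user": {"id": 1, "tags": ["a", "b"]}, "ts": "2024-01-01"}

def flatten_one_level(rec):
    # One Level: complex values are not expanded; they are kept as strings.
    return {k: json.dumps(v) if isinstance(v, (dict, list)) else v
            for k, v in rec.items()}

def flatten_multi_level(rec, prefix=""):
    # Multi Level: expand nested dicts until every value is a simple type.
    out = {}
    for k, v in rec.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten_multi_level(v, f"{key}_"))
        else:
            out[key] = json.dumps(v) if isinstance(v, list) else v
    return out

print(flatten_one_level(record))
# {'user': '{"id": 1, "tags": ["a", "b"]}', 'ts': '2024-01-01'}
print(flatten_multi_level(record))
# {'user_id': 1, 'user_tags': '["a", "b"]', 'ts': '2024-01-01'}
```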
In the Ingestion Jobs tab:
Trigger the job using the Run button
To schedule it, enable Auto-Run and change the frequency if needed
To learn more about datasources, refer to the datasource concepts documentation
Datasets Tab: To learn about Datasets, refer to the Dataset documentation. Add a Dataset for each folder that you want to replicate, providing the following details:
Directory Path (Required): Provide the full path to the folder in the bucket (see the illustration after this list)
Destination Create Table Clause: Provide additional clauses for the warehouse CREATE TABLE query, such as clustering or partitioning; these are useful for optimizing DML statements
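As a hypothetical illustration of these two fields (the exact path format and clause syntax Sprinkle expects may differ):

```python
# Hypothetical values, for illustration only.
directory_path = "s3://my-bucket/events/2024/"   # full S3 path to the folder

# An additional clause for the generated CREATE TABLE statement; here an
# Athena-style partition column, which lets queries prune the data scanned.
create_table_clause = "PARTITIONED BY (dt string)"

# Roughly how the pieces could combine in the generated DDL (sketch only):
ddl = f"""
CREATE EXTERNAL TABLE my_schema.ds_s3external_events (id string, payload string)
{create_table_clause}
LOCATION '{directory_path}'
"""
print(ddl)
```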