Google Cloud Storage
Guide to integrating your Google Cloud Storage with Sprinkle
Datasource Concepts
Before setting up the datasource, learn about datasource concepts here
Step by Step Guide
STEP-1: Configure Google Cloud Storage Connection
To learn about Connection, refer here
Log into Sprinkle application
Navigate to Datasources -> Connections Tab -> New Connection ->
Select Google Cloud Storage
Provide all the mandatory details
Name: Name to identify this connection
Bucket Name: Name of the Google Cloud Storage bucket to connect to
Private Key JSON: Service account credential specified in JSON format. To know more, click here.
Test Connection
Create
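Before pasting the Private Key JSON into Sprinkle, it can help to sanity-check that the file is a well-formed service account key. The sketch below (a local helper, not part of Sprinkle) checks for the fields Google includes in every service account key file:

```python
import json

# Fields present in every Google Cloud service account key file.
REQUIRED_FIELDS = {
    "type", "project_id", "private_key_id", "private_key",
    "client_email", "client_id", "token_uri",
}

def validate_service_account_key(raw: str) -> list:
    """Return a list of problems found in a service account key JSON string.

    An empty list means the key looks structurally valid.
    """
    try:
        key = json.loads(raw)
    except json.JSONDecodeError as exc:
        return ["not valid JSON: %s" % exc]

    problems = []
    missing = REQUIRED_FIELDS - set(key)
    if missing:
        problems.append("missing fields: %s" % sorted(missing))
    if key.get("type") != "service_account":
        problems.append("'type' should be 'service_account'")
    return problems
```

This only verifies the file's shape; Sprinkle's Test Connection step is what actually confirms the key grants access to the bucket.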
STEP-2: Configure Google Cloud Storage datasource
To learn about datasource, refer here
Navigate to Datasources -> Datasources Tab -> Add ->
Select Google Cloud Storage
Provide the name -> Create
Connection Tab:
From the drop-down, select the name of the connection created in STEP-1
Update
STEP-3: Create Dataset
Datasets Tab: To learn about Datasets, refer here. Add a Dataset for each folder you want to replicate, providing the following details:
File Type: Select the File Format
JSON
CSV
Select Delimiter - Comma, Tab, Pipe, Dash, Other Character
Skip before header - Specify the number of rows to skip before the header line. Do not skip the column header itself.
Exclude columns - Specify the columns to exclude as a comma-separated list. Example: column1,column2
Parquet
ORC
Directory Path (Required): Provide the full path like this: gs://test-sprinkle-bucket/sprinkle//bigquery/datasource/big
Ingestion Mode (Required):
Complete: The full folder is downloaded and ingested in every ingestion job run
Incremental: Only new files are ingested in every ingestion job run. Use this option if your folder is very large and you receive new files continuously
Remove Duplicate Rows:
Unique Key: Unique key column from the table, used to deduplicate data across multiple ingestions
Time Column Name: Column used to order the data when deduplicating
Max Job Runtime: Maximum time, in minutes, for which data should be downloaded. The ingestion job runs for at most the specified number of minutes and then updates a checkpoint; the next run continues from that checkpoint.
Flatten Level (Required): Select One Level or Multi Level. With One Level, flattening is not applied to complex types; they are stored as strings. With Multi Level, flattening is applied to complex types recursively until they become simple types.
Destination Schema (Required): Data warehouse schema into which the table will be ingested
Destination Table Name (Required): Table name to create in the warehouse. If not given, Sprinkle creates it as ds_<datasourcename>_<tablename>
Destination Create Table Clause: Provide additional clauses for the warehouse's create-table query, such as clustering and partitioning, useful for optimizing DML statements. Learn more on how to use this field.
Create
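The Remove Duplicate Rows options above (Unique Key plus Time Column Name) can be read as: for each unique-key value, keep only the row with the latest time value across all ingestions. The sketch below illustrates that semantics with hypothetical in-memory rows; Sprinkle's actual deduplication runs in the warehouse, not in client code:

```python
def dedup_rows(rows, unique_key, time_column):
    """Keep, for each unique-key value, the row with the greatest time value.

    rows: iterable of dicts; unique_key and time_column name dict keys.
    Illustrative only - a sketch of unique-key + time-column deduplication,
    not Sprinkle's actual implementation.
    """
    latest = {}
    for row in rows:
        key = row[unique_key]
        # Replace the stored row only if this one is newer.
        if key not in latest or row[time_column] > latest[key][time_column]:
            latest[key] = row
    return list(latest.values())
```

For example, if the same order id arrives in two ingestion runs, only the row with the later timestamp survives in the destination table.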
STEP-4: Run and schedule Ingestion
In the Ingestion Jobs tab:
Trigger the job using the Run button
To schedule it, enable Auto-Run and change the frequency if needed