Data ingestion through AWS Redshift

Redshift is a petabyte-scale data warehouse in the AWS cloud. A Redshift cluster is a set of nodes, each running the Redshift engine, and it can host multiple databases. This article covers ingesting streamed data into Redshift and exporting data from it.

Prerequisites:

  1. An AWS account.
  2. A way to connect to Redshift, such as the psql (PostgreSQL) client tool or SQL Workbench.
  3. AWS credentials (an IAM role): http://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-create-an-iam-role.html
  4. A launched cluster (http://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-launch-sample-cluster.html) with access authorized as described at http://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-authorize-cluster-access.html

Connecting to Redshift:

psql -h <redshift_cluster_endpoint> -U test -d test -p 5439

Creating tables:

Below is a simple table for the streams:

CREATE TABLE streams(id integer, stream_tz timestamp, data1 character varying(1024));

test=# \i create_streams.sql
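The create_streams.sql file is assumed here to hold just the CREATE TABLE statement above; a minimal sketch of the file, with an optional SORTKEY on the timestamp column (an assumed optimization, not something the load requires):

-- create_streams.sql (sketch; SORTKEY on stream_tz is an assumed tweak for time-ordered stream queries)
CREATE TABLE streams(
    id integer,
    stream_tz timestamp,
    data1 character varying(1024)
) SORTKEY (stream_tz);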

Data ingestion:

Redshift allows data ingestion in multiple ways; in the approaches below, either the stream data itself or a manifest file describing it has to be stored in AWS S3.

  1. If the stream data is already stored in S3, copy it from the S3 bucket as below (a quick sanity-check query is sketched after the manifest example further down):

COPY streams FROM 's3://mys3bucket/stream_data/stream_data.txt' credentials 'aws_access_key_id=<access_key>;aws_secret_access_key=<secret_key>' delimiter '|';

  2. If the data lives on your streaming servers, copy it over SSH using a JSON manifest file as below:

COPY streams FROM 's3://mys3bucket/stream_data/stream_data.json' credentials 'aws_access_key_id=<access_key>;aws_secret_access_key=<secret_key>' delimiter '|' ssh;

stream_data.json will look like this:

{
  "entries": [
    {"endpoint": "stream_data_server1",
     "command": "cat stream_data.txt",
     "mandatory": true,
     "username": "streamuser"},
    {"endpoint": "stream_data_server2",
     "command": "cat stream_data.txt",
     "mandatory": true,
     "username": "streamuser"}
  ]
}
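Whichever COPY variant is used, a quick sanity check confirms the load landed in the table; the query below is a minimal sketch against the streams table defined earlier.

-- rough post-load check: row count and the time range of the ingested stream
SELECT COUNT(*) AS loaded_rows,
       MIN(stream_tz) AS oldest_record,
       MAX(stream_tz) AS newest_record
FROM streams;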

Challenges when streamed data changes in a CICD environment:

The COPY commands above work fine as long as the fields in stream_data.txt do not change. In a CICD (Continuous Integration / Continuous Deployment) world, however, an additional field may be added to the stream data (and the matching column to the streams table); records produced with the old, shorter layout then make the COPY command fail, because Redshift expects every record to have the same number of fields. That can be avoided with the FILLRECORD option, which allows data files to be loaded when contiguous columns are missing at the end of some of the records:

COPY streams FROM 's3://mys3bucket/stream_data/stream_data.json' credentials 'aws_access_key_id=<access_key>;aws_secret_access_key=<secret_key>' delimiter '|' FILLRECORD ssh;
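Columns that FILLRECORD fills in arrive as NULLs (or zero-length strings for varchar columns), so a rough way to count the short records, assuming for illustration that data1 is the trailing column affected, is:

-- count records loaded without the trailing column (assumption: data1 is that column)
SELECT COUNT(*) AS short_records
FROM streams
WHERE data1 IS NULL OR data1 = '';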

Troubleshooting:

  1. Redshift provides the tables STL_LOAD_ERRORS and STL_LOADERROR_DETAIL to figure out errors while ingesting data. A query like the one below gives some details about the most recent error (a fuller query is sketched after this list): SELECT * FROM STL_LOAD_ERRORS ORDER BY starttime DESC LIMIT 1;
  2. STV_LOAD_STATE provides the current state of ongoing COPY commands.
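A fuller error query joins the two tables on the query ID to show which file, line, and column value caused the failure; this is a sketch using standard columns from the Redshift system tables:

SELECT le.starttime, le.filename, le.line_number, le.colname, le.err_reason, d.value
FROM stl_load_errors le
JOIN stl_loaderror_detail d ON d.query = le.query
ORDER BY le.starttime DESC
LIMIT 20;

For an in-flight COPY, STV_LOAD_STATE can be queried the same way, for example:

SELECT query, slice, current_file, pct_complete
FROM stv_load_state
ORDER BY query;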

Exporting data from Redshift:

Data ingested from the streams can be exported with the UNLOAD command below, run from the psql CLI:

UNLOAD ('select * from streams')
TO 's3://mys3bucket/stream_data/export_data/streams_' CREDENTIALS 'aws_access_key_id=<access_key>;aws_secret_access_key=<secret_key>' DELIMITER AS '|' GZIP;
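The GZIP files that UNLOAD writes can be loaded back with COPY if needed; the round-trip sketch below assumes a hypothetical table streams_restore with the same definition as streams.

-- reload the exported, compressed files into a restore table (streams_restore is an assumed name)
COPY streams_restore FROM 's3://mys3bucket/stream_data/export_data/streams_' credentials 'aws_access_key_id=<access_key>;aws_secret_access_key=<secret_key>' delimiter '|' GZIP;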

When we switched to AWS Redshift, our application could ingest 1 TB of data in 15-20 minutes.