We were tasked with building a data pipeline for a client. It was originally built like any other common ETL pipeline: extract the data, transform/cleanse it, load it into a staging table, and then load it into a few final tables.

With Redshift Spectrum, you can streamline the pipeline to ingest data directly into a data lake like S3 instead of loading it into staging tables first. Spectrum lets you query the data in S3 and generate insights before actually loading it into your warehouse tables, which is exactly what we needed, so we chose Redshift Spectrum.

Setting up Amazon Redshift Spectrum is fairly easy: you create an external schema and external tables. External tables are read-only and won't allow you to perform any modifications to the data. We have to make sure that the data files in S3 and the Redshift cluster are in the same AWS region before creating the external schema. As per its documentation, Amazon Redshift Spectrum supports the following formats: AVRO, PARQUET, TEXTFILE, SEQUENCEFILE, RCFILE, RegexSerDe, ORC, Grok, CSV, Ion, and JSON.

The Amazon documentation is very concise, and if you follow its four steps you can create an external schema and tables in no time, so I will not repeat what is already there in the documentation.

One step that is arguably missing from the documentation is creating a trust relationship that allows Amazon Redshift to access other AWS services on your behalf. After you create the IAM role in step 1 of the documentation, add the following JSON snippet in the Trust Relationships tab.
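```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```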

Then you create an external schema like this (the schema, database, and IAM role names below are placeholders; adjust them to your environment):
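```sql
-- Schema, catalog database, and role ARN are placeholders
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```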

Let us follow this by creating an external table as well; the sales columns and S3 location below are illustrative:
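```sql
-- Column list and bucket are illustrative placeholders
CREATE EXTERNAL TABLE spectrum_schema.sales (
    salesid  INTEGER,
    itemid   INTEGER,
    price    DECIMAL(8,2),
    saledate DATE
)
PARTITIONED BY (salesmonth CHAR(7))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://my-spectrum-bucket/sales/';
```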

You can see the CREATE command is fairly self-explanatory and descriptive: it just takes the schema, row format, delimiter, S3 bucket location, and any partition keys, and that's it. We will discuss partitioning a little later.

Once an external table is created, you can start querying the data as if it were a table in Redshift. You can also join external tables with the data in other, non-external tables.
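For example, against the sketch table above (the local public.items table in the join is hypothetical):

```sql
-- Query the external table directly
SELECT salesmonth, SUM(price) AS total_sales
FROM spectrum_schema.sales
GROUP BY salesmonth;

-- Join with a regular Redshift table (public.items is hypothetical)
SELECT i.itemname, SUM(s.price) AS total_sales
FROM spectrum_schema.sales s
JOIN public.items i ON s.itemid = i.itemid
GROUP BY i.itemname;
```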

All the external tables you have created can be seen in the SVV_EXTERNAL_TABLES system view. To view column information for these tables, query the SVV_EXTERNAL_COLUMNS system view, while SVV_EXTERNAL_PARTITIONS lets you list the partitions you have created on a table.
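For example:

```sql
-- List external tables and where they point
SELECT schemaname, tablename, location
FROM svv_external_tables;

-- Column names and types for our sales table
SELECT columnname, external_type
FROM svv_external_columns
WHERE schemaname = 'spectrum_schema' AND tablename = 'sales';

-- Registered partitions
SELECT *
FROM svv_external_partitions
WHERE schemaname = 'spectrum_schema' AND tablename = 'sales';
```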

Redshift creates external tables with the pseudocolumns $path and $size; selecting these two columns tells you which S3 file each row came from and the size of that file.
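The pseudocolumn names must be double-quoted:

```sql
SELECT "$path", "$size"
FROM spectrum_schema.sales
LIMIT 10;
```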

Redshift Spectrum lets us separate computation from storage: data is stored in S3, while Redshift does the computation through its Spectrum nodes, which use MPP (massively parallel processing) to distribute the work. It is therefore better to keep the files for different tables in separate folders on S3 so that the workload is evenly distributed among all nodes in the cluster, as in the layout sketched below. Another advantage of this is that you can create multiple Redshift clusters that access the same external tables, which lets you scale compute power when there is heavy demand for data reads.
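A layout like this (bucket and folder names are illustrative) keeps each table's files, and each partition's files, under their own prefix:

```
s3://my-spectrum-bucket/
    sales/
        salesmonth=2019-01/
        salesmonth=2019-02/
    customers/
```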

Redshift Spectrum also lets you partition data by one or more partition keys, like the salesmonth partition key in the sales table above. To add partitions to a table, you have to create a directory structure matching the S3 path used when creating the external table, meaning I should place my January and February sales data files in directories named salesmonth=2019-01 and salesmonth=2019-02 so that a query scans only that particular partition of the data. You can add a partition with the following ALTER command.
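A sketch against our example table (the bucket path is a placeholder):

```sql
ALTER TABLE spectrum_schema.sales
ADD IF NOT EXISTS PARTITION (salesmonth = '2019-01')
LOCATION 's3://my-spectrum-bucket/sales/salesmonth=2019-01/';

ALTER TABLE spectrum_schema.sales
ADD IF NOT EXISTS PARTITION (salesmonth = '2019-02')
LOCATION 's3://my-spectrum-bucket/sales/salesmonth=2019-02/';
```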

Now let's see an example of how to query the 2019-02 data using the newly added partition.
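```sql
SELECT salesid, itemid, price
FROM spectrum_schema.sales
WHERE salesmonth = '2019-02';
```

Because salesmonth is a partition column, the filter prunes the scan down to just the files under salesmonth=2019-02/.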

You can further improve table performance by using columnar data formats such as Apache Parquet together with partitioning: Spectrum then scans only the columns a query references, and only within the matching partitions.
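A sketch of the same table in Parquet (names are again placeholders):

```sql
CREATE EXTERNAL TABLE spectrum_schema.sales_parquet (
    salesid  INTEGER,
    itemid   INTEGER,
    price    DECIMAL(8,2),
    saledate DATE
)
PARTITIONED BY (salesmonth CHAR(7))
STORED AS PARQUET
LOCATION 's3://my-spectrum-bucket/sales_parquet/';
```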

When should you consider using Redshift Spectrum?

  • If you want to archive your huge tables but still be able to query them when needed.
  • If you already have data files in S3 and want to avoid spending time on ETL but still want to query that data.
  • If your data already sits in S3, querying it in place avoids the Redshift storage costs you would otherwise incur by moving it into Redshift tables.
  • If you want to simplify your data pipeline and avoid replicating data in staging tables.

Redshift also allows us to query nested data; we saw at the beginning that Spectrum supports the JSON format, but let us leave that for another topic.

Hope this information will help you get started with Redshift Spectrum.