3 mins Read

This article is continuation to Redshift loading strategies. You can refer previous article here

Problem Statement:

Consider we want to load  10 GB data (belongs to 20 tables) in AWS S3 into Redshift. Using the COPY statement mentioned in part 1 we can load the data in serial passion table by table. Job might get complete in 5-10 mins. How about loading 1 TB or 10 TB of data belongs to same 20 tables. If we try to load 10 TB of data job might run for many hours or a day in worst case. In future if we get 100 TB data then job will run for days to load the data into Redshift. Is it possible to reduce amount of time job takes to load the data into Redshift?

Solution:

We can divide the number of tables(equally in size of the data) into 2 or 3 batches and we will give each batch to a thread for loading so that all batches will load at same time i.e in parallel (Redshift will take care of parallelization of files belong to a table, we are trying to do parallelization at table level). Below is the sample code for the same. We will create as many connections as the number of batches and using the connections we will load data. In the below code we have hard coded the number of connections and batches, however we can automate the same. In future if the data size grows then we can increase the number of connections and achieve more parallelism.

 

Here we can add the flag to get the status of threads and can be validate the same at the end of the process to check whether load completed successfully in all threads or failed in one of the thread.