Loading CSV and JSON Data from S3 to Redshift
In this article, we will discuss how to load data from Amazon Simple Storage Service (S3) into Amazon Redshift using the COPY command. We will cover both CSV and JSON data formats and provide examples of how to escape special characters in these formats.
Understanding the Requirements
Before we begin, let’s review the requirements:
- We have data stored in an S3 bucket.
- The data is in a CSV or JSON format.
- We want to load this data into a Redshift database.
- The data will be loaded using the COPY command (an example target table is sketched just below).
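To make the later examples concrete, here is a minimal sketch of the target table. The table name and the columns (id, name, desc) are assumptions inferred from the commands below, not a schema from any real system; note that both table and desc are reserved words in Redshift, so they must be double-quoted.
CREATE TABLE "table" (        -- quoted because TABLE is a reserved word
    id     INTEGER,           -- assumed numeric key
    name   VARCHAR(64),       -- assumed short label
    "desc" VARCHAR(1024)      -- holds a JSON object as text; DESC is also reserved
);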
Loading CSV Data from S3 to Redshift
To load CSV data from S3, we can use the following command:
COPY "table"
FROM 's3://bucketname/manifest'
CREDENTIALS 'aws_access_key_id=xx;aws_secret_access_key=xxx'
DELIMITER ','
IGNOREHEADER 1
REMOVEQUOTES
IGNOREBLANKLINES
MANIFEST;
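Because the FROM clause points at a manifest rather than directly at a data file, the trailing MANIFEST keyword tells COPY to treat that S3 object as a list of files to load. A manifest is a small JSON document; the bucket and file names below are hypothetical placeholders:
{
  "entries": [
    {"url": "s3://bucketname/data/part-0000.csv", "mandatory": true},
    {"url": "s3://bucketname/data/part-0001.csv", "mandatory": true}
  ]
}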
This command is close, but our data needs one adjustment. The third field (the desc column) holds a JSON object, and a JSON object is full of double quotes, which collide with ordinary CSV quoting. Those inner double quotes have to be protected so COPY does not mangle the field.
Escaping Double Quotes in CSV Data
The approach used here is to surround the JSON value with single quotes instead of double quotes: REMOVEQUOTES strips the enclosing single quotes on load and leaves the inner double quotes untouched. (Alternatively, with COPY's CSV format option, embedded double quotes are escaped by doubling them, writing "" for ".)
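As a hypothetical data row (the values are illustrative, not taken from any real file), the desc field would be written like this:
1,first,'{"id":1,"name":"test"}'
(With COPY's CSV option instead of REMOVEQUOTES, the same field would be written "{""id"":1,""name"":""test""}".)
With the data escaped this way, the COPY command itself does not change: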
COPY "table"
FROM 's3://bucketname/manifest'
CREDENTIALS 'aws_access_key_id=xx;aws_secret_access_key=xxx'
DELIMITER ','
IGNOREHEADER 1
REMOVEQUOTES
IGNOREBLANKLINES
MANIFEST;
Loading JSON Data from S3 to Redshift
To load JSON data from S3, we can use the COPY command with the following options:
COPY "table"
FROM 's3://bucketname/manifest'
CREDENTIALS 'aws_access_key_id=xx;aws_secret_access_key=xxx'
FORMAT AS JSON 'auto'
MANIFEST;
With 'auto', COPY matches the top-level keys of each JSON object to the table's column names and loads the data directly into Redshift.
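For 'auto' to work, each object's keys must match the column names. A hypothetical source file for the table sketched earlier would look like this, one JSON object per line:
{"id": 1, "name": "first", "desc": "loaded from JSON"}
{"id": 2, "name": "second", "desc": "another row"}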
Parsing JSON Data in Redshift
To query the JSON data, Redshift provides functions such as JSON_PARSE, which converts a string into the SUPER type, and JSON_EXTRACT_PATH_TEXT, which pulls individual values out of a JSON string. Note that the PostgreSQL containment operator @> is not available in Redshift, and desc is a reserved word, so the column must be double-quoted. Here's an example:
SELECT *
FROM "table"
WHERE JSON_EXTRACT_PATH_TEXT("desc", 'id') = '1'
  AND JSON_EXTRACT_PATH_TEXT("desc", 'name') = 'test';
This returns all rows where the desc column contains a JSON object with the specified key-value pairs.
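If the column were declared as SUPER rather than varchar (an alternative design, not what the commands above use), the value could be stored with JSON_PARSE and navigated with PartiQL dot notation. A minimal sketch, assuming a hypothetical SUPER column named payload:
SELECT *
FROM "table"
WHERE payload.id = 1          -- dot notation navigates into the SUPER value
  AND payload.name = 'test';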
Using REMOVEQUOTES in CSV Data
When loading CSV data, we use the REMOVEQUOTES option to strip the surrounding quotation marks from each field. This is what makes the single-quote escaping above work: the enclosing single quotes are removed on load, while the double quotes (and commas) inside the JSON value are retained.
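For instance, a field written in the file as '{"id":1,"name":"test"}' arrives in the "desc" column as {"id":1,"name":"test"}. The full command, repeated for reference: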
COPY "table"
FROM 's3://bucketname/manifest'
CREDENTIALS 'aws_access_key_id=xx;aws_secret_access_key=xxx'
DELIMITER ','
IGNOREHEADER 1
REMOVEQUOTES
IGNOREBLANKLINES
MANIFEST;
Using IGNOREBLANKLINES in CSV Data
We also use the IGNOREBLANKLINES option to skip blank lines in the data. This is necessary because the data files listed in the manifest may contain blank lines, which would otherwise cause load errors.
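As an illustration (with hypothetical values again), a file like the following loads as exactly two rows: the header is skipped by IGNOREHEADER 1 and the empty line by IGNOREBLANKLINES.
id,name,desc
1,first,'{"id":1,"name":"test"}'

2,second,'{"id":2,"name":"demo"}'
Once more, the command: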
COPY "table"
FROM 's3://bucketname/manifest'
CREDENTIALS 'aws_access_key_id=xx;aws_secret_access_key=xxx'
DELIMITER ','
IGNOREHEADER 1
REMOVEQUOTES
IGNOREBLANKLINES
MANIFEST;
Conclusion
In this article, we discussed how to load CSV and JSON data from S3 into Redshift using the COPY command, covering the basic syntax along with options such as escaping special characters in CSV data and querying JSON data in Redshift.
By following these steps and examples, you should be able to load your data from S3 into Redshift successfully. Happy loading!