Azure Data Factory - SQL to Nested JSON
Introduction
Azure Data Factory (ADF) is a cloud-based data integration service that allows users to create, schedule, and manage data pipelines. One of the key features of ADF is its ability to transform and process data from various sources, including relational databases. In this article, we will explore how to use ADF to transform SQL data into nested JSON format.
Background
A Stack Overflow question describes a scenario where a user wants to use ADF to output SQL data in a nested JSON structure. The desired output is similar to the following example:
[
  {
    "staffid": "101",
    "firstname": "Donald",
    "lastname": "Duck",
    "appointments": [
      {
        "appointmentid": "201",
        "startdate": "2020-02-01T00:00:00",
        "enddate": "2020-04-29T23:00:00"
      },
      {
        "appointmentid": "202",
        "startdate": "2020-01-01T00:00:00",
        "enddate": "2020-01-31T00:00:00"
      }
    ]
  },
  {
    "staffid": "102",
    "firstname": "Mickey",
    "lastname": "Mouse",
    "appointments": [
      {
        "appointmentid": "203",
        "startdate": "2020-02-01T00:00:00",
        "enddate": "2020-04-29T23:00:00"
      },
      {
        "appointmentid": "204",
        "startdate": "2020-01-01T00:00:00",
        "enddate": "2020-01-31T00:00:00"
      }
    ]
  }
]
The user has tried using the Copy activity, but it produces flat JSON structures instead of the desired nested structure.
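To see why the flat output falls short, it helps to look at the regrouping step the Copy activity skips. The following sketch (plain Python, not part of ADF; the sample rows mirror the data above) shows how the flat rows of a staff/appointments join would need to be regrouped into the nested shape:

```python
import json
from itertools import groupby

# Flat rows, as a Copy activity over a plain SQL join would produce them:
# (staffid, firstname, lastname, appointmentid, startdate, enddate)
rows = [
    (101, "Donald", "Duck", 201, "2020-02-01T00:00:00", "2020-04-29T23:00:00"),
    (101, "Donald", "Duck", 202, "2020-01-01T00:00:00", "2020-01-31T00:00:00"),
    (102, "Mickey", "Mouse", 203, "2020-02-01T00:00:00", "2020-04-29T23:00:00"),
    (102, "Mickey", "Mouse", 204, "2020-01-01T00:00:00", "2020-01-31T00:00:00"),
]

# Group consecutive rows by the staff columns and nest the appointment columns
nested = []
for (staffid, first, last), appts in groupby(rows, key=lambda r: r[:3]):
    nested.append({
        "staffid": staffid,
        "firstname": first,
        "lastname": last,
        "appointments": [
            {"appointmentid": a, "startdate": s, "enddate": e}
            for _, _, _, a, s, e in appts
        ],
    })

print(json.dumps(nested, indent=1))
```

This is exactly the grouping that SQL Server's FOR JSON PATH clause can do for us on the database side, as shown in the rest of the article.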
Understanding FOR JSON PATH
To achieve the desired output, we need to understand SQL Server's FOR JSON PATH clause. FOR JSON PATH formats a query's result set as JSON, and when it is applied to a correlated subquery whose result is assigned to a column alias, such as
'appointments'
the subquery's rows are nested as a JSON array under that alias. This is what lets us attach each staff member's appointments to their own record.
Simulating the Sample Data
To demonstrate the solution, let's simulate some sample data in SQL Server. We'll create two tables: Staff and Appointments.
CREATE TABLE Staff (
staffid INT,
firstname VARCHAR(50),
lastname VARCHAR(50)
);
INSERT INTO Staff (staffid, firstname, lastname) VALUES
(101, 'Donald', 'Duck'),
(102, 'Mickey', 'Mouse');
CREATE TABLE Appointments (
appointmentid INT,
staffid INT,
startdate DATETIME,
enddate DATETIME
);
INSERT INTO Appointments (appointmentid, staffid, startdate, enddate) VALUES
(201, 101, '2020-02-01T00:00:00', '2020-04-29T23:00:00'),
(202, 101, '2020-01-01T00:00:00', '2020-01-31T00:00:00'),
(203, 102, '2020-02-01T00:00:00', '2020-04-29T23:00:00'),
(204, 102, '2020-01-01T00:00:00', '2020-01-31T00:00:00');
Using SQL in ADF
Now that we have our sample data, let’s use the Copy activity in ADF to transform it into nested JSON format.
First, we need the source query. In ADF, a query like this is typically set as the sqlReaderQuery on the Copy activity's SQL source; here it is shown as a source definition we'll call "StaffData":
{
  "name": "StaffData",
  "type": "SqlSource",
  "sqlReaderQuery": "SELECT s.staffid, s.firstname, s.lastname,
    'appointments' = (
      SELECT
        a.appointmentid AS 'appointmentid', a.startdate AS 'startdate', a.enddate AS 'enddate'
      FROM
        dbo.Appointments AS a
      WHERE a.staffid = s.staffid
      FOR JSON PATH)
    FROM dbo.Staff AS s
    FOR JSON PATH;"
}
In this query, the outer SELECT returns one row per staff member, and the correlated subquery collects the appointment rows that belong to that staff member. The FOR JSON PATH clause on the subquery formats those rows as a JSON array, and because the subquery is aliased as
'appointments'
that array is nested under an "appointments" property inside each staff member's record. The outer FOR JSON PATH then wraps the whole result set as a JSON array.
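SQLite (bundled with Python) has no FOR JSON PATH, but we can sketch the same correlated-subquery logic against the sample tables to check the shape it produces. This is only an illustration of the query's logic, not part of the ADF setup:

```python
import json
import sqlite3

# Recreate the sample tables in an in-memory database
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Staff (staffid INT, firstname TEXT, lastname TEXT);
INSERT INTO Staff VALUES (101, 'Donald', 'Duck'), (102, 'Mickey', 'Mouse');
CREATE TABLE Appointments (appointmentid INT, staffid INT, startdate TEXT, enddate TEXT);
INSERT INTO Appointments VALUES
  (201, 101, '2020-02-01T00:00:00', '2020-04-29T23:00:00'),
  (202, 101, '2020-01-01T00:00:00', '2020-01-31T00:00:00'),
  (203, 102, '2020-02-01T00:00:00', '2020-04-29T23:00:00'),
  (204, 102, '2020-01-01T00:00:00', '2020-01-31T00:00:00');
""")

# Outer query: one row per staff member; the inner query plays the role of
# the correlated FOR JSON PATH subquery that builds the appointments array.
result = []
for staffid, first, last in con.execute(
        "SELECT staffid, firstname, lastname FROM Staff ORDER BY staffid"):
    appts = [
        {"appointmentid": a, "startdate": s, "enddate": e}
        for a, s, e in con.execute(
            "SELECT appointmentid, startdate, enddate FROM Appointments "
            "WHERE staffid = ? ORDER BY appointmentid", (staffid,))
    ]
    result.append({"staffid": staffid, "firstname": first,
                   "lastname": last, "appointments": appts})

print(json.dumps(result, indent=1))
```

Printing the result shows the same nested layout that SQL Server produces in a single query with FOR JSON PATH.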
Running the Pipeline
Now that the source query is defined, let's create the pipeline in ADF. We'll call it "TransformStaffData". A sketch of the pipeline JSON follows; "StaffDataOut" is a placeholder name for an output JSON dataset pointing at your Blob Storage container, and the sqlReaderQuery placeholder stands for the query shown earlier:
{
  "name": "TransformStaffData",
  "properties": {
    "activities": [
      {
        "name": "TransformStaffDataActivity",
        "type": "Copy",
        "dependsOn": [],
        "policy": {
          "timeout": "7.00:00:00",
          "retry": 0,
          "retryIntervalInSeconds": 30
        },
        "typeProperties": {
          "source": {
            "type": "SqlSource",
            "sqlReaderQuery": "<the FOR JSON PATH query shown above>"
          },
          "sink": {
            "type": "BlobSink"
          }
        },
        "inputs": [ { "referenceName": "StaffData", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "StaffDataOut", "type": "DatasetReference" } ]
      }
    ]
  }
}
In this pipeline, the Copy activity runs the source query against SQL Server and writes the result to Azure Blob Storage. The policy block adds a timeout and retry settings.
Running the Pipeline and Verifying the Results
Now that we have our pipeline set up, let’s run it in ADF and verify the results.
Once the pipeline is complete, we should see a new file in Azure Blob Storage with the transformed data in nested JSON format:
[
  {
    "staffid": 101,
    "firstname": "Donald",
    "lastname": "Duck",
    "appointments": [
      {
        "appointmentid": 201,
        "startdate": "2020-02-01T00:00:00",
        "enddate": "2020-04-29T23:00:00"
      },
      {
        "appointmentid": 202,
        "startdate": "2020-01-01T00:00:00",
        "enddate": "2020-01-31T00:00:00"
      }
    ]
  },
  {
    "staffid": 102,
    "firstname": "Mickey",
    "lastname": "Mouse",
    "appointments": [
      {
        "appointmentid": 203,
        "startdate": "2020-02-01T00:00:00",
        "enddate": "2020-04-29T23:00:00"
      },
      {
        "appointmentid": 204,
        "startdate": "2020-01-01T00:00:00",
        "enddate": "2020-01-31T00:00:00"
      }
    ]
  }
]
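Once the file lands in Blob Storage, a quick sanity check with a few lines of Python (run locally against the downloaded file) confirms the nesting. The literal below stands in for the file contents:

```python
import json

# Stand-in for the JSON file downloaded from Blob Storage
output = json.loads("""
[
 {"staffid": 101, "firstname": "Donald", "lastname": "Duck",
  "appointments": [
   {"appointmentid": 201, "startdate": "2020-02-01T00:00:00", "enddate": "2020-04-29T23:00:00"},
   {"appointmentid": 202, "startdate": "2020-01-01T00:00:00", "enddate": "2020-01-31T00:00:00"}]},
 {"staffid": 102, "firstname": "Mickey", "lastname": "Mouse",
  "appointments": [
   {"appointmentid": 203, "startdate": "2020-02-01T00:00:00", "enddate": "2020-04-29T23:00:00"},
   {"appointmentid": 204, "startdate": "2020-01-01T00:00:00", "enddate": "2020-01-31T00:00:00"}]}
]
""")

# Every record should carry its own nested appointments array
for staff in output:
    assert {"staffid", "firstname", "lastname", "appointments"} <= staff.keys()
    assert all("appointmentid" in a for a in staff["appointments"])

print(f"{len(output)} staff records, "
      f"{sum(len(s['appointments']) for s in output)} appointments nested")
```

If any staff record came through flat (appointments missing or spread across duplicate rows), the assertions above would fail immediately.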
Conclusion
In this article, we've explored how to use ADF to transform SQL data into nested JSON format. We've used a correlated subquery with SQL Server's FOR JSON PATH clause to nest each staff member's appointments under an "appointments" array.
We’ve also demonstrated how to simulate sample data in SQL Server and run a pipeline in ADF to transform it into nested JSON format. With this knowledge, you can now use ADF to transform your own data into nested JSON format.
Note: This is just one way to achieve the desired output, and there may be other approaches that work for your specific use case.
Last modified on 2024-02-28