Azure Data Factory - SQL to Nested JSON
Introduction
Azure Data Factory (ADF) is a cloud-based data integration service that allows users to create, schedule, and manage data pipelines. One of the key features of ADF is its ability to transform and process data from various sources, including relational databases. In this article, we will explore how to use ADF to transform SQL data into nested JSON format.
Background
A Stack Overflow question describes a scenario where a user wants to use ADF to output SQL data in a nested JSON structure. The desired output is similar to the following example:
[
  {
    "staffid": "101",
    "firstname": "Donald",
    "lastname": "Duck",
    "appointments": [
      {
        "appointmentid": "201",
        "startdate": "2020-02-01T00:00:00",
        "enddate": "2020-04-29T23:00:00"
      },
      {
        "appointmentid": "202",
        "startdate": "2020-01-01T00:00:00",
        "enddate": "2020-01-31T00:00:00"
      }
    ]
  },
  {
    "staffid": "102",
    "firstname": "Mickey",
    "lastname": "Mouse",
    "appointments": [
      {
        "appointmentid": "203",
        "startdate": "2020-02-01T00:00:00",
        "enddate": "2020-04-29T23:00:00"
      },
      {
        "appointmentid": "204",
        "startdate": "2020-01-01T00:00:00",
        "enddate": "2020-01-31T00:00:00"
      }
    ]
  }
]
The user has tried using the Copy activity, but it produces flat JSON structures instead of the desired nested structure.
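To see why the flat output falls short, it helps to look at the regrouping step the Copy activity skips. The following sketch (plain Python, not part of ADF; the sample rows mirror the data above) shows how the flat rows of a staff/appointments join would need to be regrouped into the nested shape:

```python
import json
from itertools import groupby

# Flat rows, as a Copy activity over a plain SQL join would produce them:
# (staffid, firstname, lastname, appointmentid, startdate, enddate)
rows = [
    (101, "Donald", "Duck", 201, "2020-02-01T00:00:00", "2020-04-29T23:00:00"),
    (101, "Donald", "Duck", 202, "2020-01-01T00:00:00", "2020-01-31T00:00:00"),
    (102, "Mickey", "Mouse", 203, "2020-02-01T00:00:00", "2020-04-29T23:00:00"),
    (102, "Mickey", "Mouse", 204, "2020-01-01T00:00:00", "2020-01-31T00:00:00"),
]

# Group consecutive rows by the staff columns and nest the appointment columns
nested = []
for (staffid, first, last), appts in groupby(rows, key=lambda r: r[:3]):
    nested.append({
        "staffid": staffid,
        "firstname": first,
        "lastname": last,
        "appointments": [
            {"appointmentid": a, "startdate": s, "enddate": e}
            for _, _, _, a, s, e in appts
        ],
    })

print(json.dumps(nested, indent=1))
```

This is exactly the grouping that SQL Server's FOR JSON PATH clause can do for us on the database side, as shown in the rest of the article.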
Understanding FOR JSON PATH
To achieve the desired output, we need to understand SQL Server's FOR JSON PATH clause. FOR JSON PATH formats a query's result set as JSON, and when it is applied to a correlated subquery whose result is assigned to a column alias, such as
'appointments'
the subquery's rows are nested as a JSON array under that alias. This is what lets us attach each staff member's appointments to their own record.
Simulating the Sample Data
To demonstrate the solution, let's simulate some sample data in SQL Server. We'll create two tables: Staff and Appointments.
CREATE TABLE Staff (
staffid INT,
firstname VARCHAR(50),
lastname VARCHAR(50)
);
INSERT INTO Staff (staffid, firstname, lastname) VALUES
(101, 'Donald', 'Duck'),
(102, 'Mickey', 'Mouse');
CREATE TABLE Appointments (
appointmentid INT,
staffid INT,
startdate DATETIME,
enddate DATETIME
);
INSERT INTO Appointments (appointmentid, staffid, startdate, enddate) VALUES
(201, 101, '2020-02-01T00:00:00', '2020-04-29T23:00:00'),
(202, 101, '2020-01-01T00:00:00', '2020-01-31T00:00:00'),
(203, 102, '2020-02-01T00:00:00', '2020-04-29T23:00:00'),
(204, 102, '2020-01-01T00:00:00', '2020-01-31T00:00:00');
Using SQL in ADF
Now that we have our sample data, let’s use the Copy activity in ADF to transform it into nested JSON format.
First, we need the source query. In ADF, a query like this is typically set as the sqlReaderQuery on the Copy activity's SQL source; here it is shown as a source definition we'll call "StaffData":
{
  "name": "StaffData",
  "type": "SqlSource",
  "sqlReaderQuery": "SELECT s.staffid, s.firstname, s.lastname,
    'appointments' = (
      SELECT
        a.appointmentid AS 'appointmentid', a.startdate AS 'startdate', a.enddate AS 'enddate'
      FROM
        dbo.Appointments AS a
      WHERE a.staffid = s.staffid
      FOR JSON PATH)
    FROM dbo.Staff AS s
    FOR JSON PATH;"
}
In this query, the outer SELECT returns one row per staff member, and the correlated subquery collects the appointment rows that belong to that staff member. The FOR JSON PATH clause on the subquery formats those rows as a JSON array, and because the subquery is aliased as
'appointments'
that array is nested under an "appointments" property inside each staff member's record. The outer FOR JSON PATH then wraps the whole result set as a JSON array.
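SQLite (bundled with Python) has no FOR JSON PATH, but we can sketch the same correlated-subquery logic against the sample tables to check the shape it produces. This is only an illustration of the query's logic, not part of the ADF setup:

```python
import json
import sqlite3

# Recreate the sample tables in an in-memory database
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Staff (staffid INT, firstname TEXT, lastname TEXT);
INSERT INTO Staff VALUES (101, 'Donald', 'Duck'), (102, 'Mickey', 'Mouse');
CREATE TABLE Appointments (appointmentid INT, staffid INT, startdate TEXT, enddate TEXT);
INSERT INTO Appointments VALUES
  (201, 101, '2020-02-01T00:00:00', '2020-04-29T23:00:00'),
  (202, 101, '2020-01-01T00:00:00', '2020-01-31T00:00:00'),
  (203, 102, '2020-02-01T00:00:00', '2020-04-29T23:00:00'),
  (204, 102, '2020-01-01T00:00:00', '2020-01-31T00:00:00');
""")

# Outer query: one row per staff member; the inner query plays the role of
# the correlated FOR JSON PATH subquery that builds the appointments array.
result = []
for staffid, first, last in con.execute(
        "SELECT staffid, firstname, lastname FROM Staff ORDER BY staffid"):
    appts = [
        {"appointmentid": a, "startdate": s, "enddate": e}
        for a, s, e in con.execute(
            "SELECT appointmentid, startdate, enddate FROM Appointments "
            "WHERE staffid = ? ORDER BY appointmentid", (staffid,))
    ]
    result.append({"staffid": staffid, "firstname": first,
                   "lastname": last, "appointments": appts})

print(json.dumps(result, indent=1))
```

Printing the result shows the same nested layout that SQL Server produces in a single query with FOR JSON PATH.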
Running the Pipeline
Now that the source query is defined, let's create the pipeline in ADF. We'll call it "TransformStaffData". A sketch of the pipeline JSON follows; "StaffDataOut" is a placeholder name for an output JSON dataset pointing at your Blob Storage container, and the sqlReaderQuery placeholder stands for the query shown earlier:
{
  "name": "TransformStaffData",
  "properties": {
    "activities": [
      {
        "name": "TransformStaffDataActivity",
        "type": "Copy",
        "dependsOn": [],
        "policy": {
          "timeout": "7.00:00:00",
          "retry": 0,
          "retryIntervalInSeconds": 30
        },
        "typeProperties": {
          "source": {
            "type": "SqlSource",
            "sqlReaderQuery": "<the FOR JSON PATH query shown above>"
          },
          "sink": {
            "type": "BlobSink"
          }
        },
        "inputs": [ { "referenceName": "StaffData", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "StaffDataOut", "type": "DatasetReference" } ]
      }
    ]
  }
}
In this pipeline, the Copy activity runs the source query against SQL Server and writes the result to Azure Blob Storage. The policy block adds a timeout and retry settings.
Running the Pipeline and Verifying the Results
Now that we have our pipeline set up, let’s run it in ADF and verify the results.
Once the pipeline is complete, we should see a new file in Azure Blob Storage with the transformed data in nested JSON format:
[
  {
    "staffid": 101,
    "firstname": "Donald",
    "lastname": "Duck",
    "appointments": [
      {
        "appointmentid": 201,
        "startdate": "2020-02-01T00:00:00",
        "enddate": "2020-04-29T23:00:00"
      },
      {
        "appointmentid": 202,
        "startdate": "2020-01-01T00:00:00",
        "enddate": "2020-01-31T00:00:00"
      }
    ]
  },
  {
    "staffid": 102,
    "firstname": "Mickey",
    "lastname": "Mouse",
    "appointments": [
      {
        "appointmentid": 203,
        "startdate": "2020-02-01T00:00:00",
        "enddate": "2020-04-29T23:00:00"
      },
      {
        "appointmentid": 204,
        "startdate": "2020-01-01T00:00:00",
        "enddate": "2020-01-31T00:00:00"
      }
    ]
  }
]
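Once the file lands in Blob Storage, a quick sanity check with a few lines of Python (run locally against the downloaded file) confirms the nesting. The literal below stands in for the file contents:

```python
import json

# Stand-in for the JSON file downloaded from Blob Storage
output = json.loads("""
[
 {"staffid": 101, "firstname": "Donald", "lastname": "Duck",
  "appointments": [
   {"appointmentid": 201, "startdate": "2020-02-01T00:00:00", "enddate": "2020-04-29T23:00:00"},
   {"appointmentid": 202, "startdate": "2020-01-01T00:00:00", "enddate": "2020-01-31T00:00:00"}]},
 {"staffid": 102, "firstname": "Mickey", "lastname": "Mouse",
  "appointments": [
   {"appointmentid": 203, "startdate": "2020-02-01T00:00:00", "enddate": "2020-04-29T23:00:00"},
   {"appointmentid": 204, "startdate": "2020-01-01T00:00:00", "enddate": "2020-01-31T00:00:00"}]}
]
""")

# Every record should carry its own nested appointments array
for staff in output:
    assert {"staffid", "firstname", "lastname", "appointments"} <= staff.keys()
    assert all("appointmentid" in a for a in staff["appointments"])

print(f"{len(output)} staff records, "
      f"{sum(len(s['appointments']) for s in output)} appointments nested")
```

If any staff record came through flat (appointments missing or spread across duplicate rows), the assertions above would fail immediately.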
Conclusion
In this article, we've explored how to use ADF to transform SQL data into nested JSON format. We've used a correlated subquery with SQL Server's FOR JSON PATH clause to nest each staff member's appointments under an "appointments" array.
We’ve also demonstrated how to simulate sample data in SQL Server and run a pipeline in ADF to transform it into nested JSON format. With this knowledge, you can now use ADF to transform your own data into nested JSON format.
Note: This is just one way to achieve the desired output, and there may be other approaches that work for your specific use case.
Last modified on 2024-02-28