Loading Nested JSON Data in DuckDB: A Deep Dive Into Recursive Unnesting

Loading Nested JSON in DuckDB

DuckDB is a popular open-source relational database that allows users to interact with data using SQL. One of the unique features of DuckDB is its ability to handle nested JSON data, making it an attractive option for applications that work with complex data structures.

In this article, we will delve into the world of loading nested JSON in DuckDB and explore some of the key concepts, syntax, and best practices involved in working with nested JSON data.

Understanding Nested JSON Data

Nested JSON data is a type of data structure where one object or value contains other objects or values. In DuckDB, nested JSON data is represented using the struct keyword, which allows us to access individual fields within the nested structure.

For example, consider the following nested JSON data:

{
  "MainLevel": [
    {
      "More": {}
    },
    {
      "More": {
        "Level2": []
      }
    }
  ]
}

In this example, MainLevel is an array of objects, where each object contains a More field that references another object. The innermost object has a Level2 field that contains an empty array.

Loading Nested JSON Data into DuckDB

To load nested JSON data into DuckDB, we use the read_JSON_auto function, which allows us to read JSON files from disk and load their contents into a table. We can also specify additional parameters, such as the maximum object size, to control how much data is loaded.

Here’s an example of how we might load the nested JSON data above into DuckDB:

CREATE TABLE duckdbtest1.main.nested_JSON AS
SELECT 
    Level2.Field1,
FROM 
    (SELECT unnest(MainLevel) as MainLevel
     FROM read_JSON_auto('C:\\jsonfiles\\*.json', maximum_object_size = 999999999))
     as MainLevel,
    unnest(MainLevel.More.Level2) as Level2;

This query uses the unnest function to expand the nested JSON data into individual fields, and then selects only the Field1 field from the resulting Level2 array.

The Problem with Unnest

The problem with this query is that it doesn’t account for the case where the innermost object has no Level2 field. In DuckDB 1.0.0, this was handled by default, but in DuckDB 1.1.0, a change was made to the unnest function that breaks this behavior.

In DuckDB 1.1.0, the unnest function no longer inherits its name from the final identifier in the path. Instead, it returns a literal table with the same name as the original table.

Solution: Using Recursive Unnest

To solve this problem, we can use the recursive unnest function, which allows us to nest one level of indirection into the nested JSON data.

Here’s an updated query that uses recursive unnest:

CREATE TABLE duckdbtest1.main.nested_JSON AS
SELECT 
    Level2.Field1,
FROM 
    (SELECT unnest(MainLevel, recursive := true) as MainLevel
     FROM read_JSON_auto('C:\\jsonfiles\\*.json', maximum_object_size = 999999999))
     as MainLevel;

This query uses the recursive := true parameter to enable recursive unwrapping of the nested JSON data.

Table Representation

To get a better understanding of how the data is represented in DuckDB, we can use the SELECT * FROM ... AS syntax to print out the table structure.

Here’s an example:

SELECT *
FROM (SELECT unnest(MainLevel) as MainLevel
     FROM read_JSON_auto('C:\\jsonfiles\\*.json', maximum_object_size = 999999999))
     as MainLevel;

This query will print out the following table structure:

MainLevel
{‘More’: {‘Level2’: [{‘Field1’: ‘A’}]}}{‘Field1’: ‘A’}

As we can see, the unnest function has returned a literal table with the same name as the original table.

Conclusion

Loading nested JSON data into DuckDB is a powerful feature that allows us to work with complex data structures using SQL. However, the unnest function’s behavior has changed in recent versions of DuckDB, which can break our queries.

By understanding how to use recursive unnest, we can overcome this limitation and successfully load nested JSON data into DuckDB.


Last modified on 2024-08-31