Efficient Substring Matching in BigQuery using a Hash Table Approach

Matching records against a substring table can be a resource-intensive task in BigQuery. Traditional methods like using LIKE or CROSS JOIN can lead to performance issues due to the large number of rows involved. In this article, we’ll explore an alternative approach using a hash table-based solution to efficiently select records matching a substring in another table.

Background

BigQuery is designed to handle large-scale data processing and analysis tasks. However, when it comes to substring matching, traditional methods can lead to performance bottlenecks, as the Stack Overflow question that prompted this article illustrates.

The original query uses LIKE and CROSS JOIN, which are not optimal solutions for substring matching in BigQuery. We’ll delve into a more efficient approach using hash tables, which can significantly reduce execution time.

Hash Table-Based Approach

To build an efficient substring matching system, we need to understand how hash tables work and apply this concept to our problem. A hash table is a data structure that maps keys (in this case, substrings) to values (records). By leveraging this data structure, we can reduce the number of rows involved in the query and improve performance.
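As a minimal illustration of the idea in plain Python (not BigQuery code; the keys and records below are made up), probing a hash table is a constant-time operation:

```python
# A hash table maps keys (substrings) to values (matching records).
# Looking up a key is O(1) on average, unlike scanning every record.
index = {
    "alice": ["record 17"],
    "mary": ["record 2", "record 9"],
}

hit = index.get("mary", [])   # -> ['record 2', 'record 9']
miss = index.get("zoe", [])   # -> []
print(hit, miss)
```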

Here’s an outline of the steps:

  1. Preprocessing: First, we create two lookup tables: one for the record table and another for the fragment table.
  2. Building Lookup Tables: We populate these tables with SQL statements that lowercase each value, assign it a row number, and store the normalized substring as a lookup key.

Step 1: Preprocessing and Building Hash Tables

Let’s break down the solution into smaller steps:

Creating the Lookup Tables

# Build a temporary lookup table keyed on the lowercased record text
CREATE TEMP TABLE record_hash_table AS
WITH record AS (
    SELECT LOWER(text) AS name, ROW_NUMBER() OVER (ORDER BY text) AS row_num 
    FROM `bigquery-public-data.hacker_news.comments`
),
fragment AS (
    SELECT LOWER(name) AS name, ROW_NUMBER() OVER (ORDER BY name) AS row_num 
    FROM `bigquery-public-data.usa_names.usa_1910_current`
)
SELECT record.name AS record_name, fragment.name AS fragment_name 
FROM record 
JOIN fragment ON record.row_num = fragment.row_num * 10000000;

# Build a temporary lookup table keyed on the lowercased fragment name
CREATE TEMP TABLE fragment_hash_table AS
WITH fragment AS (
    SELECT LOWER(name) AS name, ROW_NUMBER() OVER (ORDER BY name) AS row_num 
    FROM `bigquery-public-data.usa_names.usa_1910_current`
),
record AS (
    SELECT LOWER(text) AS name, ROW_NUMBER() OVER (ORDER BY text) AS row_num 
    FROM `bigquery-public-data.hacker_news.comments`
)
SELECT fragment.name AS fragment_name, record.name AS record_name 
FROM fragment 
JOIN record ON fragment.row_num = record.row_num * 10000000;

The statements above create two temporary lookup tables (record_hash_table and fragment_hash_table) by pairing rows from the record and fragment tables. The materialized rows let us look up records matching a given substring without rescanning the raw tables each time.

Explaining the Hash Table Construction

In this step, we use CTEs (Common Table Expressions) to stage the data for the lookup tables. Assigning each substring a row number gives it a stable identifier, so rows from the two tables can be paired in a deterministic way.

To avoid an explosion in row counts, a scaling factor (row_num * 10000000) is applied in the join condition. Rather than pairing every record with every fragment, as a full cross join would, only every 10,000,000th record row is paired with a fragment row, which keeps the materialized table small.

Step 2: Matching Records Against Substrings

Now that we have our hash tables built, we can create an efficient query to match records against a given substring.

# Define the main matching query
WITH record AS (
    SELECT LOWER(text) AS name 
    FROM `bigquery-public-data.hacker_news.comments`
),
fragment AS (
    SELECT DISTINCT LOWER(name) AS name 
    FROM `bigquery-public-data.usa_names.usa_1910_current`
)
SELECT r.name, f.name as match_name
FROM record r 
JOIN fragment f ON r.name LIKE CONCAT('%', f.name, '%')

In the main matching query, the LIKE operator finds records that contain a given fragment. Pointing the same join at the smaller lookup tables built earlier, rather than at the raw tables, reduces the number of rows that must be compared.

However, this initial approach has limitations. A LIKE predicate cannot be evaluated as an equality join, so every record must still be compared against every fragment, and BigQuery's per-row size limits make it impractical to materialize very large intermediate results.

Step 3: Optimizing Substring Matching Using Efficient Join

To overcome these limitations, we’ll utilize efficient joins and reduce memory usage:

# Split each record into word tokens and join on equality,
# which BigQuery can execute as a hash join
WITH record AS (
    SELECT LOWER(text) AS name 
    FROM `bigquery-public-data.hacker_news.comments`
),
fragment AS (
    SELECT DISTINCT LOWER(name) AS name 
    FROM `bigquery-public-data.usa_names.usa_1910_current`
)
SELECT DISTINCT r.name, f.name AS match_name
FROM record r,
UNNEST(SPLIT(r.name, ' ')) AS token
JOIN fragment f ON f.name = token

Instead of repeating the LIKE comparison, this query splits each record into word tokens and joins on equality. An equality join lets BigQuery use a hash-join strategy, so each token is looked up in the fragment table in constant time rather than compared against every fragment. The trade-off is that it matches whole words, not arbitrary substrings.
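The way an engine typically executes such an efficient join is a hash join: build a hash table over the smaller input (the fragments), then probe it once per token of each record. Here is a conceptual Python sketch with toy data (an illustration of the idea, assuming whole-word matches; not BigQuery's actual implementation):

```python
# Hash-join sketch: build phase on the small side, probe phase on the large side.
records = ["alice went home", "bob called mary", "mary called back"]
fragments = ["alice", "bob", "mary"]

build_side = set(fragments)          # build: hash table over the smaller input

matches = []
for record in records:               # probe: one pass over the larger input
    for token in record.split():
        if token in build_side:      # O(1) probe instead of a scan per pair
            matches.append((record, token))

print(matches)
```

Each record is scanned exactly once, and the per-token set probe replaces the pairwise LIKE comparison of the naive approach.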

Using BigQuery’s REGEXP_EXTRACT_ALL Function for Extracting Substrings

Another approach to substring matching uses the built-in REGEXP_EXTRACT_ALL function. This function returns every substring of a string that matches a regular expression, and those matches can then be used to identify matching fragments.

# Aggregate all fragments into one alternation pattern, then extract
# every matching fragment from each record with REGEXP_EXTRACT_ALL
WITH record AS (
    SELECT LOWER(text) AS name 
    FROM `bigquery-public-data.hacker_news.comments`
),
pattern AS (
    SELECT STRING_AGG(DISTINCT LOWER(name), '|') AS regex 
    FROM `bigquery-public-data.usa_names.usa_1910_current`
)
SELECT r.name, match_name
FROM record r
CROSS JOIN pattern p,
UNNEST(REGEXP_EXTRACT_ALL(r.name, p.regex)) AS match_name

This approach combines all fragments into a single alternation pattern and uses REGEXP_EXTRACT_ALL to pull every matching fragment out of each record in one pass. With a very large fragment list the combined pattern can become unwieldy, so it is best suited to a modest number of fragments.
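The extraction step can be sketched locally with Python's re module; findall plays the role of REGEXP_EXTRACT_ALL here, and combining the fragments into one alternation pattern is an assumption of this sketch (toy data, not the tables used above):

```python
import re

records = ["alice went home", "bob called mary", "no names here"]
fragments = ["alice", "bob", "mary"]

# One alternation pattern over all fragments; re.escape guards metacharacters.
pattern = re.compile("|".join(map(re.escape, fragments)))

# findall returns every match in the string, like REGEXP_EXTRACT_ALL.
matches = {r: pattern.findall(r) for r in records}
print(matches)
```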

Comparison of Approaches

Each approach has its strengths and weaknesses:

  • The original query using LIKE and CROSS JOIN is straightforward but resource-intensive, since every record must be compared against every fragment.
  • Pre-built lookup tables offer significant improvements for large-scale workloads, at the cost of an extra preprocessing step.
  • Equality-based joins reduce memory usage and let BigQuery apply hash-join strategies, minimizing performance issues.
  • Regular expressions can extract all matches in a single pass, though very large fragment lists make the combined pattern unwieldy.

Conclusion

Substring matching in BigQuery can be made efficient with the right approach. By leveraging hash-style lookup tables, equality joins, and regular expressions, you can significantly reduce execution time and improve overall query performance.

While this article focuses on the technical aspects of substring matching in BigQuery, keep in mind that query optimization is often a complex process involving multiple factors. To achieve optimal performance, it’s essential to analyze your specific use case, test different approaches, and refine your queries accordingly.


Last modified on 2024-07-28