Querying Students Table: Get Row from Inner Select and by Group

Introduction

The problem at hand involves querying a large students table, which contains 500,000 to 1,000,000 rows. The goal is to retrieve specific rows based on two conditions:

The ID in each row does not exist as any reference ID (ref_id) in the table.
The name appears more than once.

We need to find a way to achieve this efficiently while minimizing the number of rows being processed.

Background

To understand the problem, let’s take a closer look at the structure of the students table:

id	name	ref_id
1	test	NULL
2	test	1
3	test3	1
4	test4	NULL

The table has three columns: id, name, and ref_id. The id column is the primary key and uniquely identifies each student. The name column stores the student’s name, and the ref_id column stores the ID of a reference person.
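To experiment with the queries in this article, the sample table can be recreated locally. The following is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the column types and the choice of SQLite are assumptions, since the article only names the columns and does not say which database engine is in use.

```python
import sqlite3

# In-memory SQLite database; the engine and column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE students (
        id     INTEGER PRIMARY KEY,  -- unique student ID
        name   TEXT,                 -- student name, may repeat
        ref_id INTEGER               -- ID of the referenced row, or NULL
    )
    """
)

# The four sample rows from the table above.
sample_rows = [
    (1, "test", None),
    (2, "test", 1),
    (3, "test3", 1),
    (4, "test4", None),
]
conn.executemany(
    "INSERT INTO students (id, name, ref_id) VALUES (?, ?, ?)", sample_rows
)

table_rows = conn.execute(
    "SELECT id, name, ref_id FROM students ORDER BY id"
).fetchall()
for row in table_rows:
    print(row)
```

Python maps SQL NULL to None, so the rows without a reference print with None in the last position.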

Querying Strategy

To achieve the desired result, we need to follow these steps:

First, get the list of IDs that do not exist as any reference ID in the table.
Then, get the list of IDs where the name appears more than once.
Finally, join these two queries together to retrieve the desired rows.

Step 1: Get List of IDs without Reference ID

To find the IDs that do not exist as any reference ID, we can use a subquery with the NOT EXISTS operator:

SELECT st.id FROM students st
WHERE NOT EXISTS (SELECT * FROM students stt WHERE stt.ref_id = st.id)

This query returns the IDs of students that no other row points to through ref_id.
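As a quick check of this step, here is a small sketch that runs a NOT EXISTS query against the sample table from the Background section, using an in-memory SQLite database (an assumption; the article does not name an engine). It also runs the tempting NOT IN variant to show why NOT EXISTS is the safer choice when ref_id can be NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, ref_id INTEGER)")
conn.executemany(
    "INSERT INTO students (id, name, ref_id) VALUES (?, ?, ?)",
    [(1, "test", None), (2, "test", 1), (3, "test3", 1), (4, "test4", None)],
)

# Anti-join: keep rows whose id is never used as a ref_id.
UNREFERENCED_SQL = """
SELECT st.id
FROM students st
WHERE NOT EXISTS (SELECT 1 FROM students stt WHERE stt.ref_id = st.id)
ORDER BY st.id
"""
unreferenced_ids = [row[0] for row in conn.execute(UNREFERENCED_SQL)]
print(unreferenced_ids)  # [2, 3, 4]

# Pitfall: NOT IN against a column containing NULL never evaluates to true,
# because "id <> NULL" is unknown under SQL's three-valued logic.
NOT_IN_SQL = """
SELECT id FROM students
WHERE id NOT IN (SELECT ref_id FROM students)
ORDER BY id
"""
not_in_ids = [row[0] for row in conn.execute(NOT_IN_SQL)]
print(not_in_ids)  # []
```

Only ID 1 is used as a ref_id in the sample, so IDs 2, 3, and 4 come back from the NOT EXISTS form, while the NOT IN form returns nothing because of the NULL values in ref_id.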

Step 2: Get List of IDs with Repeated Names

To find the IDs where the name appears more than once, we can use an inner join to compare each student’s name with every other student’s name:

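SELECT DISTINCT n1.id, n1.name FROM students n1
INNER JOIN students n2 ON n2.name = n1.name AND n2.id <> n1.id

This query returns the ID and name of every student who shares a name with at least one other row. Comparing with n2.id <> n1.id rather than n1.id < n2.id keeps every member of a duplicate group (the < form yields one row per pair, so selecting n1.id would drop the highest ID in each group), and DISTINCT collapses the repeats that arise when a name occurs three or more times.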
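An alternative to the self-join for this step is to group by name and keep the groups with more than one row. The sketch below is a swapped-in technique (GROUP BY with HAVING), not the article's original query, and it again assumes an in-memory SQLite database loaded with the Background sample. For comparison it also runs a pairwise self-join with n1.id < n2.id, which produces one row per pair of duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, ref_id INTEGER)")
conn.executemany(
    "INSERT INTO students (id, name, ref_id) VALUES (?, ?, ?)",
    [(1, "test", None), (2, "test", 1), (3, "test3", 1), (4, "test4", None)],
)

# Pairwise self-join: one row per (lower id, higher id) pair sharing a name.
PAIRS_SQL = """
SELECT n1.id, n2.id
FROM students n1
INNER JOIN students n2 ON n2.name = n1.name
WHERE n1.id < n2.id
ORDER BY n1.id, n2.id
"""
duplicate_pairs = conn.execute(PAIRS_SQL).fetchall()
print(duplicate_pairs)  # [(1, 2)]

# Grouped form: every id whose name occurs more than once.
GROUPED_SQL = """
SELECT id
FROM students
WHERE name IN (
    SELECT name FROM students GROUP BY name HAVING COUNT(*) > 1
)
ORDER BY id
"""
duplicate_name_ids = [row[0] for row in conn.execute(GROUPED_SQL)]
print(duplicate_name_ids)  # [1, 2]
```

The grouped form scans the table once to count names instead of pairing every row with every other row of the same name, which may matter when a single name is shared by many rows.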

Step 3: Join Queries and Retrieve Desired Rows

To get the final result, we need to join the two queries together:

SELECT t1.id FROM
(SELECT st.id FROM students st
WHERE NOT EXISTS (SELECT * FROM students stt WHERE stt.ref_id = st.id)) t1
INNER JOIN
(SELECT DISTINCT n1.id, n1.name FROM students n1
INNER JOIN students n2 ON n2.name = n1.name AND n2.id <> n1.id) t2 ON t1.id = t2.id

This query returns the IDs of students who meet both conditions: their ID does not exist as any reference ID, and their name appears more than once.
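The sketch below runs a runnable form of the combined query against the Background sample in an in-memory SQLite database. Because the table is described as large, it also creates two indexes so that the ref_id lookup and the name comparison need not scan the whole table; the index names idx_students_ref_id and idx_students_name are hypothetical, and whether a given engine actually uses them depends on its query planner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, ref_id INTEGER)")
conn.executemany(
    "INSERT INTO students (id, name, ref_id) VALUES (?, ?, ?)",
    [(1, "test", None), (2, "test", 1), (3, "test3", 1), (4, "test4", None)],
)

# Hypothetical index names; they support the two lookups used below.
conn.execute("CREATE INDEX idx_students_ref_id ON students (ref_id)")
conn.execute("CREATE INDEX idx_students_name ON students (name)")

COMBINED_SQL = """
SELECT t1.id
FROM (
    -- Step 1: ids never used as a ref_id
    SELECT st.id
    FROM students st
    WHERE NOT EXISTS (SELECT 1 FROM students stt WHERE stt.ref_id = st.id)
) t1
INNER JOIN (
    -- Step 2: ids whose name is shared with another row
    SELECT DISTINCT n1.id
    FROM students n1
    INNER JOIN students n2 ON n2.name = n1.name AND n2.id <> n1.id
) t2 ON t1.id = t2.id
ORDER BY t1.id
"""
matching_ids = [row[0] for row in conn.execute(COMBINED_SQL)]
print(matching_ids)  # [2]
```

On the Background sample, Step 1 yields IDs 2, 3, and 4 and Step 2 yields IDs 1 and 2, so the join leaves only ID 2.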

Example Use Case

Let’s say we have a table with 500,000 rows. We want to find all students whose ID does not appear as any reference ID and whose name is duplicated.

Suppose the input table looks like this:

id	name	ref_id
1	test	NULL
2	test	3
3	test3	4
4	test4	NULL

The query returns IDs 1 and 2, since both rows meet both conditions:

Neither ID appears as a reference ID (the only non-NULL ref_id values are 3 and 4).
Both rows have the name “test”, so that name appears more than once.
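To see how each row of this sample fares against the two conditions, the following sketch computes both checks per row in an in-memory SQLite database (an assumed engine). The column aliases is_referenced and name_count are illustrative names, not part of the original table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, ref_id INTEGER)")
conn.executemany(
    "INSERT INTO students (id, name, ref_id) VALUES (?, ?, ?)",
    [(1, "test", None), (2, "test", 3), (3, "test3", 4), (4, "test4", None)],
)

# For every row: is its id used as a ref_id, and how often does its name occur?
TRACE_SQL = """
SELECT s.id,
       s.name,
       EXISTS (SELECT 1 FROM students r WHERE r.ref_id = s.id) AS is_referenced,
       (SELECT COUNT(*) FROM students d WHERE d.name = s.name) AS name_count
FROM students s
ORDER BY s.id
"""
trace = conn.execute(TRACE_SQL).fetchall()
for row in trace:
    print(row)
# (1, 'test', 0, 2)
# (2, 'test', 0, 2)
# (3, 'test3', 1, 1)
# (4, 'test4', 1, 1)

# A row qualifies when it is unreferenced and its name occurs more than once.
matches = [
    sid
    for sid, _name, is_referenced, name_count in trace
    if not is_referenced and name_count > 1
]
print(matches)  # [1, 2]
```

Rows 3 and 4 fail both checks: each is referenced by another row's ref_id, and each has a unique name.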

Conclusion

In this article, we discussed how to query a large students table to retrieve specific rows based on two conditions. We followed a three-step strategy:

Get the list of IDs that do not exist as any reference ID.
Get the list of IDs where the name appears more than once.
Join these queries together to retrieve the desired rows.

We provided an example use case and explained each step in detail, including the SQL code used. This approach can be applied to similar problems involving large tables with multiple conditions.

Last modified on 2025-01-26