Section 10.3. Existence Queries and Correlation

10.3. Existence Queries and Correlation

Correlated queries are often written so that the question in the inner query is one of existence. For example, suppose you want to find the names of students who have taken a computer science (COSC) class and have earned a grade of B in that course. This query can be written in several ways. For example, you can write it as a noncorrelated subquery as follows:

SELECT  s.sname
FROM    Student s
WHERE   s.stno IN
 (SELECT   gr.student_number FROM Grade_report gr, Section
  WHERE    Section.section_id = gr.section_id
  AND         Section.course_num LIKE 'COSC%'
  AND         gr.grade = 'B')

This query produces the following output (17 rows):

sname
--------------------
Lineas
Mary
Brenda
Lujack
Reva
Harley
Chris
Lynette
Hillary
Phoebe
Holly
George
Cramer
Fraiser
Francis
Lindsay
Stephanie
 
(17 row(s) affected)

You can think of this query as first forming the set of student numbers of students who have made Bs in COSC coursesthe inner query result set. In the inner query, you must have both the Grade_report table (for the grades) and the Section table (for the course numbers). Once you form this set of student numbers (by completing the inner query), the outer query looks through the Student table and SELECTs only those students who are in the inner query set.

This query could also be done by creating a double-nested subquery containing two INs, or it could be written using a three-table join.

Had we chosen to write the query with an unnecessary correlation, it might look like this:

SELECT  s.sname
FROM    Student s
WHERE   s.stno IN
 (SELECT      gr.student_number
  FROM        Grade_report gr, Section
  WHERE       Section.section_id = gr.section_id
  AND           Section.course_num LIKE 'COSC%'
  AND           gr.student_number = s.stno
  AND           gr.grade = 'B')

The output of this query would be the same as the previous query. In this case, the use of the Student table in the subquery is unnecessary. Although correlation is unnecessary, this example is included to show the following:

When correlation is necessary
How to untangle unnecessarily correlated queries
How you might migrate your thought process toward correlation, should it be necessary

First, let's look at situations in which the correlation of a subquery is necessary, and introduce a new predicate: EXISTS.

10.3.1. Using EXISTS

In situations in which the correlation of a subquery is necessary, you can write the correlated subquery with the EXISTS predicate, which looks like this:

SELECT s.sname
FROM   Student s
WHERE EXISTS
 (SELECT 1 FROM Grade_report gr, Section
  WHERE Section.section_id = gr.section_id
  AND     Section.course_num LIKE 'COSC%'
  AND     gr.student_number  = s.stno
  AND     gr.grade = 'B')

The output of this query would be the same as the output (17 rows) of both of the previous queries.

Let's dissect this query. The EXISTS predicate says, "Choose the row from the Student table in the outer query if the subquery is true (that is, if a row in the subquery exists that satisfies the condition in the subquery WHERE clause)." Because no actual result set is formed, "SELECT 1" is used as a "dummy" result set to indicate that the subquery is true (1 is returned) or false (no rows are returned). In the noncorrelated case, we tied the student number in the Student table to the inner query by the IN predicate as follows:

SELECT   s.stno
FROM     Student s
WHERE    s.stno IN
   (SELECT "student number ...)

When using the EXISTS predicate, we do not use any column of the Student table, but rather are seeking only to find whether the subquery WHERE can be satisfied.

We have indicated that we are using EXISTS with (SELECT 1...). Using the EXISTS predicate, the subquery does not form a result set per se, but rather causes EXISTS to returns true or false. The use of SELECT * in the inner query is common among SQL programmers. However, from an "internal" standpoint, SELECT * causes the SQL engine to check the data dictionary unnecessarily. As the actual result of the inner query is not important, it is strongly suggested that you use SELECT 'X' (or SELECT 1 ...) instead of SELECT * ... so that a constant is SELECTed instead of some "sensible" entry. The SELECT 'X' .. or SELECT 1 ... is simply more efficient.

In the EXISTS case, we do not specify any columns to be SELECTed in the inner query's result set; rather, we use a dummy result--SELECT 'X' (or we could use SELECT 1). If the subquery WHERE is satisfied, it returns true, and if the inner query is not satisfied, it selects nothing, then the subquery returns false. The EXISTS predicate forces us to correlate the query. To illustrate that correlation is usually necessary with EXISTS, consider the following query:

SELECT   s.sname
FROM     Student s
WHERE EXISTS
 (SELECT 'X' FROM Grade_report gr, Section t
  WHERE   t.section_id = gr.section_id
  AND     t.course_num LIKE 'COSC%'
  AND     gr.grade = 'B')

This query produces 48 rows of output (of which we show the first 20 rows):

sname
--------------------
Lineas
Mary
Zelda
Ken
Mario
Brenda
Romona
Richard
Kelly
Lujack
Reva
Elainie
Harley
Donald
Chris
Jake
Lynette
Susan
Monica
Bill.
.
.
 
(48 row(s) affected)

This query uses EXISTS, but has no correlation. This syntax infers that for each student row, we test the joined Grade_report and Section tables to see whether there is a course number like COSC and a grade of B (which, of course, there is). We unnecessarily ask the subquery question over and over again. The result from this latter, uncorrelated EXISTS query is the same as the following:

SELECT  s.sname
FROM    Student s

The point is that the correlation is usually necessary when we use EXISTS.

Consider another example in which a correlation could be used. Suppose that we want to find the names of all students who have three or more Bs. A first pass at a query might be something like this:

SELECT  s.sname
FROM    Student s WHERE "something" IN
 (SELECT "something"
  FROM    Grade_report
  WHERE  "count of grade = 'B'" > 2)

This query can be done with a HAVING clause, as you saw previously (Chapter 9), but we want to show how to do this in yet another way. Suppose we arrange the subquery to use the student number (stno) from the Student table as a filter and count in the subquery only when a row in the Grade_report table correlates to that student. The query (this time with an implied EXISTS) looks like this:

SELECT   s.sname
FROM     Student s
WHERE 2 < (SELECT COUNT(*)
           FROM     Grade_report gr
           WHERE    gr.student_number = s.stno
           AND      gr.grade = 'B')

which results in the following output (8 rows):

sname
--------------------
Lineas
Mary
Lujack
Reva
Chris
Hillary
Phoebe
Holly
 
(8 row(s) affected)

Although there is no EXISTS in this query, it is implied. The syntax of the query does not allow an EXISTS, but the sense of the query is "WHERE EXISTS a COUNT of 2 which is less than..." In this correlated subquery, we have to examine the Grade_report table for each member of the Student table to see whether the student has more than two Bs. We test the entire Grade_report table for each student row in the outer query.

If it were possible, a subquery without the correlation would be more desirable, because it would appear simpler to understand. The overall query might be as follows:

SELECT  s.sname
FROM    Student s
WHERE   s.stno IN
 (subquery that defines a set of students who have made 3 Bs)

Therefore, we might attempt to write the following query:

SELECT   s.sname
FROM     Student s
WHERE    s.stno IN
  (SELECT   gr.student_number
    FROM    Grade_report gr
    WHERE   gr.grade = 'B')

However, as the following output (27 rows) shows, this query would give us only students who earned at least one B:

sname
--------------------
Lineas
Mary
Zelda
Ken
Mario
Brenda
Kelly
Lujack
Reva
Harley
Chris
Lynette
Hillary
Phoebe
Holly
Sadie
Jessica
Steve
Cedric
George
Cramer
Fraiser
Francis
Smithly
Sebastian
Lindsay
Stephanie
 
(27 row(s) affected)

To get a list of students who have earned at least three Bs, we could try the following query:

SELECT   s.sname
FROM     Student s
WHERE    s.stno IN
 (SELECT    gr.student_number, COUNT(*)
  FROM      Grade_report gr
  WHERE     gr.grade = 'B'
  GROUP BY  gr.student_number
  HAVING    COUNT(*) > 2)

However, this approach does not work, because the subquery cannot have two columns in its result set unless the main query has two columns in the WHERE .. IN.

Here, the subquery must have only gr.student_number to match s.stno. So, we might try to construct an inline view, as shown in the following query:

SELECT  s.sname
FROM    Student s
WHERE   s.stno IN
  (SELECT vi.student_number
   FROM  (SELECT    student_number, ct = COUNT(*)
                 FROM      Grade_report gr
                 WHERE     gr.grade = 'B'
                 GROUP BY student_number
                 HAVING COUNT(*) > 2) AS vi)

This is an example of the inline view, discussed in Chapter 6. This query succeeds in SQL Server , producing the following output (8 rows):

sname
--------------------
Lineas
Mary
Lujack
Reva
Chris
Hillary
Phoebe
Holly
 
(8 row(s) affected)

This query also works in Oracle, but it may fail in other SQL languages.

As you can see, several ways exist to query the database with SQL. In this case, the correlated subquery may be the easiest to see and perhaps the most efficient.

10.3.2. From IN to EXISTS

A simple example of converting from IN to EXISTS--uncorrelated to correlated (or vice versa)--would be to move the set test in the WHERE .. IN of the uncorrelated subquery to the WHERE of the EXISTS in the correlated query.

As an example, consider the following uncorrelated subquery:

SELECT *
FROM   Student s
WHERE  s.stno  IN
  (SELECT  g.student_number
   FROM    Grade_report g
   WHERE   grade = 'B')

The following is the same query written as a correlated subquery:

SELECT *
FROM   Student s
WHERE EXISTS
  (SELECT   g.student_number
   FROM     Grade_report g
   WHERE    grade = 'B'
    AND     s.stno = g.student_number)

This query produces 27 rows of output (of which we show the first 15 rows):

STNO   SNAME                MAJOR CLASS  BDATE
------ -------------------- ----- ------ -----------------------
2      Lineas               ENGL  1      1980-04-15 00:00:00
3      Mary                 COSC  4      1978-07-16 00:00:00
5      Zelda                COSC  NULL   1978-02-12 00:00:00
6      Ken                  POLY  NULL   1980-07-15 00:00:00
7      Mario                MATH  NULL   1980-08-12 00:00:00
8      Brenda               COSC  2      1977-08-13 00:00:00
13     Kelly                MATH  4      1980-08-12 00:00:00
14     Lujack               COSC  1      1977-02-12 00:00:00
15     Reva                 MATH  2      1980-06-10 00:00:00
19     Harley               POLY  2      1981-04-16 00:00:00
24     Chris                ACCT  4      1978-02-12 00:00:00
34     Lynette              POLY  1      1981-07-16 00:00:00
121    Hillary              COSC  1      1977-07-16 00:00:00
122    Phoebe               ENGL  3      1980-04-15 00:00:00
123        Holly                  POLY  4       1981-01-15 00:00:00.
.
.
 
(27 row(s) affected)

This example gives you a pattern to move from one kind of query to the other kind and to test the efficiency of both kinds of queries. Both of the preceding queries should produce the same output.

10.3.3. NOT EXISTS

As with the IN predicate, which has a NOT IN compliment, EXISTS may also be used with NOT. In some situations, the predicates EXISTS and NOT EXISTS are vital. For example, if we ask a "for all" question, it must be answered by "existence"--actually, the lack thereof (that is, "not existence"). In logic, the statement, "find x for all y" is logically equivalent to "do not find x where there does not exist a y." Or, there is no x for no y. Or, you cannot find an x when there is no y.

In SQL, there is no "for all" predicate. Instead, SQL uses the idea of "for all" logic with NOT EXISTS. (A word of caution, howeverSQL is not simply a logic exercise, as you will see.) In this section, we look at how EXISTS and NOT EXISTS work in SQL. In the following section, we address the "for all" problem.

Consider the following query:

SELECT   s.sname
FROM     Student s
WHERE EXISTS
  (SELECT 'X'
   FROM        Grade_report gr
   WHERE       s.stno = gr.student_number
   AND         gr.grade = 'C')

which produces the following output (24 rows):

sname
--------------------
Zelda
Ken
Mario
Brenda
Richard
Reva
Donald
Jake
Susan
Monica
Bill
Sadie
Jessica
Steve
Alan
Rachel
Smithly
Sebastian
Losmith
Genevieve
Thornton
Gus
Benny
Lionel
 
(24 row(s) affected)

For this correlated subquery, "student names" are SELECTed when:

The student is enrolled in a section (WHERE s.stno = gr.student_number)
The same student has a grade of C (note the correlation in the WHERE clause in the inner query)

Both statements must be true for the student row to be SELECTed. Recall that we use SELECT 1 or SELECT 'X' in our inner query, because we want the subquery to return something if the subquery is true. The actual value of the "something" does not matter. true means something is returned; false means nothing was returned from the subquery. Therefore, SELECT .. EXISTS "says" SELECT .. WHERE true. The inner query is true if any row is SELECTed in the inner query.

Now consider the preceding query with a NOT EXISTS in it instead of EXISTS for students who do not have a grade of C:

SELECT s.sname
FROM   Student s
WHERE NOT EXISTS
  (SELECT 'X'
   FROM    Grade_report gr
   WHERE   s.stno = gr.student_number
   AND     gr.grade = 'C')

This query produces the following output (24 rows):

sname
--------------------
Lineas
Mary
Romona
Kelly
Lujack
Elainie
Harley
Chris
Lynette
Smith
Hillary
Phoebe
Holly
Brad
Cedric
George
Jerry
Cramer
Fraiser
Harrison
Francis
Lindsay
Stephanie
Jake
 
(24 row(s) affected)

In this query, we are still SELECTing with the pattern SELECT .. WHERE true because all SELECTs with EXISTS work that way. But, the twist is that the subquery has to be false to be SELECTed with NOT EXISTS. If the subquery is false, then NOT EXISTS is true and the outer row is SELECTed.

Now, logic implies that if either s.stno <> gr.student_number or gr.grade <> 'C', then the subquery "fails"--that is, it is false for that student row. As the subquery is false, the NOT EXISTS would return a TRue for that row. Unfortunately, this logic is not quite what happens. Recall that we characterized the correlated subquery as follows:

LOOP1: For each row in Student  s  DO
      LOOP2: For each row in Grade_report DO
             IF (gr.student_number = s.stno) THEN
                     IF (gr.grade = 'C') THEN TRUE
      END LOOP2;
     IF TRUE, THEN student row is SELECTed
END LOOP1

Note that LOOP2 is completed before the next student is tested. In other words, just because a student number exists that is not equal, it will not cause the subquery to be false. Rather, the entire subquery table is parsed and the logic is more like this:

For the case .. WHERE EXISTS s.stno = gr.student_number ..., is there a gr.grade = 'C'? If, when the student numbers are equal, no C can be found, then the subquery returns no rowsit is false for that student row. So, with NOT EXISTS, we will SELECT students who have student numbers equal in the Grade_report and Student tables, but who have no C in the Grade_report table. The point about "no C in the Grade_report table" can be answered true only by looking at all the rows in the inner query and finding no C for that student.