As a part of our continuing study of SQL Server query plans, we’re going to take a look at some subqueries. This is particularly important, since subqueries are the subject of some fair amount of folklore, and it is important to separate fact from myth. It’s not uncommon to hear someone say “Joins are better than subqueries”, but to discern the facts we shall have to be very careful in our terminology and we must distinguish between the logical aspects of a SELECT statement and its physical mechanism. When we talk about the logical aspects of a query, we are talking about the SELECT statement that you sit down and write. As we have seen on numerous occasions, the physical query, that is to say the actual steps that SQL Server takes to satisfy your SQL request may be quite a bit different. We shall see that logical subqueries are generally not a problem, because the subquery disappears in the actual execution plan. This is why I sometimes refer to logical subqueries as apparent subqueries, since the actual plan does not involve subqueries at all. Physical subqueries, on the other hand, can often be a performance problem.
This is where subqueries got a bad reputation to begin with. Years ago, before the wonderful optimizing query engines we have today, the actual execution plan closely resembled the SELECT statement you wrote. So in those days, subqueries were very often bad performers and joins were almost always better. The query engines changed, but the folklore did not.
Let’s start by taking a look at an extremely simple subquery. In this example, run on the Northwind sample database, we wish to see if there are any Suppliers that happen to be located in the same city as at least one customer. There are lots of ways to write this query, we’ll use the EXISTS operator.
SELECT CompanyName, City
FROM suppliers s
WHERE EXISTS (SELECT * FROM customers c WHERE s.City = s.City)
If SQL Server were to execute this query as an actual physical subquery, it would query the Customers table once for every row in the Suppliers table. Of course, SQL Server is too smart for that. When we look at the query plan, we see that SQL Server has turned the subquery into a JOIN.
We also notice that a variation on the Nested Loops operator has appeared. A “Left Semi Join” refers to a left join where we are not getting any column values for the table on the right. The table on the right, in this case the Customers table, is being joined for the sole purpose of deciding which rows from the table on the left should be included in the resultset.
If we wanted to write the JOIN ourselves, our first reaction might be to try something like this, but our first reaction would be wrong.
SELECT s.CompanyName, s.City
FROM Suppliers s
JOIN Customers c
ON c.City = s.City
This query is logically distinct from the subquery above. If there were, say, 10 customers in the same city as a particular supplier, that supplier’s name would show up 10 times in the results, whereas with the subquery a given supplier’s name would appear at most once, which is what we want.
To ensure that there are no duplicates in our JOIN query, we must add the DISTINCT keyword. The DISTINCT keyword has the net effect of forcing the query engine to choose an inferior plan. What we want is a semi-join, but we are not allowed to type that! SQL Server reserves semi-joins for itself and doesn’t want to share.
We see that SQL Server does extra work to eliminate duplicates when it sees DISTINCT, and the net effect is to create an inferior execution plan, more than twice as bad as the plan produced from the subquery. At least in this case, the logical subquery has proved better than the JOIN.
Perhaps ironically, there is a warning on the nested loop operator on the subquery to inform us there is no join predicate.
The appearance of a subquery in a SELECT statement often does not result in an execution plan containing a subquery. In the next section, we will look at a more complex subquery and observe the tricks that SQL Server falls back on to avoid a subquery in the physical plan.