Ranking Functions and Performance in SQL Server 2005

By Alex Kozak
20 April 2006

Ranking functions, introduced in SQL Server 2005, are a great
enhancement to Transact-SQL. Many tasks, such as creating arrays,
generating sequential numbers, and finding ranks, required many lines
of code in pre-2005 versions. They can now be implemented much more
easily and run faster.

Let's look at the syntax of ranking functions:

ROW_NUMBER () OVER ([<partition_by_clause>] <order_by_clause>)
RANK () OVER ([<partition_by_clause>] <order_by_clause>)
DENSE_RANK () OVER ([<partition_by_clause>] <order_by_clause>)
NTILE (integer_expression) OVER ([<partition_by_clause>]
    <order_by_clause>)

All four functions accept "partition by" and "order by" clauses, which
makes them very flexible and useful. However, one nuance in the syntax
deserves your attention: the "order by" clause is not optional.
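
To make the two clauses concrete, here is a minimal sketch. The table
variable @Sales and its columns are hypothetical, invented only for
this illustration. PARTITION BY restarts the numbering for each
region, and ORDER BY decides the numbering inside a region.

```sql
-- Hypothetical sample data (not part of the article's test table).
DECLARE @Sales TABLE (region varchar(10) NOT NULL, amount int NOT NULL);

INSERT INTO @Sales (region, amount)
SELECT 'East', 100 UNION ALL
SELECT 'East', 250 UNION ALL
SELECT 'West',  75 UNION ALL
SELECT 'West', 300;

-- Numbering restarts at 1 for each region; within a region,
-- the highest amount gets row number 1.
SELECT region, amount,
       ROW_NUMBER() OVER (PARTITION BY region
                          ORDER BY amount DESC) AS rowNum
FROM @Sales;
```

In each region the larger amount receives rowNum 1 and the smaller one
rowNum 2. The order in which the rows come back is not guaranteed
unless the outer query has its own ORDER BY.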

Why should you worry about the "order by" clause?

Well, as a DBA or database programmer you know that sorting is a
fairly expensive operation in terms of time and resources. If you
were forced to sort every time, even in situations where you didn't
need to, you could expect performance to degrade, especially in large
databases.

Is it possible to avoid sorting in ranking functions? If so, how much
would it improve performance?

Let's try to answer these questions.


How to Avoid Sorting in Ranking Functions

Create a sample table (Listing 1):

-- Listing 1. Create a sample table.
CREATE TABLE RankingFunctions(orderID int NOT NULL);
INSERT INTO RankingFunctions VALUES(7);
INSERT INTO RankingFunctions VALUES(11);
INSERT INTO RankingFunctions VALUES(4);
INSERT INTO RankingFunctions VALUES(21);
INSERT INTO RankingFunctions VALUES(15);

Run the next query with the ROW_NUMBER() function:

SELECT ROW_NUMBER () OVER (ORDER BY orderID) AS rowNum, orderID
FROM RankingFunctions;

If you check the execution plan for that query (see Figure 1), you will
find that the Sort operator is very expensive: it accounts for 78
percent of the estimated query cost.
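
If you prefer text output to the graphical plan, SET SHOWPLAN_TEXT is
one way to look for the Sort operator. This is a sketch of an
alternative, not necessarily the way the figures in this article were
produced.

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch.
SET SHOWPLAN_TEXT ON;
GO
-- The query is not executed; SQL Server returns the estimated
-- plan as rows of text instead of the query result.
SELECT ROW_NUMBER() OVER (ORDER BY orderID) AS rowNum, orderID
FROM RankingFunctions;
GO
SET SHOWPLAN_TEXT OFF;
GO
```

Look for a Sort operator in the returned plan text. The text plan does
not show cost percentages; SET SHOWPLAN_ALL, used the same way, adds
estimated cost columns such as TotalSubtreeCost.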

Run the same query, leaving the OVER() clause blank:

SELECT ROW_NUMBER () OVER () AS rowNum, orderID
FROM RankingFunctions;

You will get an error:

Msg 4112, Level 15, State 1, Line 1
The ranking function "row_number" must have an ORDER BY clause.

Since the parser doesn't allow you to avoid the "order by" clause,
maybe you can force the query optimizer to stop using the Sort
operator. For example, you could create a computed column that
consists of a simple integer, 1, and then use that virtual column in the
"order by" clause (Listing 2):

-- Listing 2. ORDER BY computed column.
-- Query 1: Using derived table.
SELECT ROW_NUMBER () OVER (ORDER BY const) AS rowNum, orderID
FROM (SELECT orderID, 1 as const
      FROM RankingFunctions) t1
GO
-- Query 2: Using common table expression (CTE).
WITH OriginalOrder AS
(SELECT orderID, 1 as const
 FROM RankingFunctions)
SELECT ROW_NUMBER () OVER (ORDER BY const) AS rowNum, orderID
FROM OriginalOrder;

If you check the execution plans now (see Figure 2), you will find that
the query optimizer doesn't use the Sort operator anymore. Both
queries generate the row numbers and, in this test, return the orderID
values in the original order.

RowNum orderID
1 7
2 11
3 4
4 21
5 15

There is a small problem with the queries in Listing 2: they need
time (resources) to create and populate the virtual column. As a
result, the performance gains that you achieve by avoiding the sort
operation may disappear when you populate the computed column. Is
there any other way to skip the sort operation?

Let's try to answer this question.

The "order by" clause accepts expressions. An expression can be as
simple as a constant, a variable, or a column, and simple expressions
can be combined into complex ones.

What if you talk to the query optimizer in the language of
expressions? For example, try using a subquery as the expression:

SELECT ROW_NUMBER () OVER (ORDER BY (SELECT orderID
                                     FROM RankingFunctions)) AS rowNum,
       orderID
FROM RankingFunctions;

No, that doesn't get past SQL Server either. You will get an error:

Msg 512, Level 16, State 1, Line 1
Subquery returned more than 1 value. This is not permitted
when the subquery follows =, !=, <, <= , >, >= or when the
subquery is used as an expression.

O-o-o-p-s, here's the hint! The expression (or in our case, the
subquery) has to produce a single value.

This should work:

SELECT ROW_NUMBER () OVER (ORDER BY (SELECT MAX(orderID)
                                     FROM RankingFunctions)) AS rowNum,
       orderID
FROM RankingFunctions;

Bingo! That query works exactly as you wanted: no Sort operator has
been used.

Now you can write an expression in the "order by" clause that returns
a single value, forcing the query optimizer to refrain from using a sort
operation.

By the way, the solutions in Listing 2 worked because the integer
value in the computed column is the same in every row and is therefore
treated as a single value.

Here are some more examples of expression usage in an "order by"
clause (Listing 3):

-- Listing 3. Using an expression in an ORDER BY clause.
SELECT ROW_NUMBER () OVER (ORDER BY (SELECT 1 FROM sysobjects
                                     WHERE 1<>1)) AS rowNum,
       orderID
FROM RankingFunctions;
GO
SELECT ROW_NUMBER () OVER (ORDER BY (SELECT 1)) AS rowNum, orderID
FROM RankingFunctions;
GO
DECLARE @i as bit;
SELECT @i = 1;
SELECT ROW_NUMBER () OVER (ORDER BY @i) AS rowNum, orderID
FROM RankingFunctions;

Figure 3 shows the execution plans for the queries in Listing 3.


RANK(), DENSE_RANK() and NTILE() Functions with Expressions in an ORDER BY Clause

Before we move forward, we should check the correctness of the
solutions for the rest of the ranking functions.

Let's create a few duplicates in the RankingFunctions table and start
testing the RANK() and DENSE_RANK() functions:

-- Listing 4. RANK() and DENSE_RANK() functions with
-- expressions in an ORDER BY clause.
-- Create duplicates in table RankingFunctions.
INSERT INTO RankingFunctions VALUES(11);
INSERT INTO RankingFunctions VALUES(4);
INSERT INTO RankingFunctions VALUES(4);
GO
-- Query 1: (ORDER BY orderID).
SELECT RANK () OVER (ORDER BY orderID) AS rankNum,
       DENSE_RANK () OVER (ORDER BY orderID) AS denseRankNum,
       orderID
FROM RankingFunctions;
GO
-- Query 2: (ORDER BY expression).
SELECT RANK () OVER (ORDER BY (SELECT 1)) AS rankNum,
       DENSE_RANK () OVER (ORDER BY (SELECT 1)) AS denseRankNum,
       orderID
FROM RankingFunctions;
GO

If you check the execution plans (see Figure 4), you will find that the
first query in Listing 4 requires a lot of resources for sorting. The
second query doesn't have a Sort operator. So the queries behave as
expected.

However, when you run the queries, the second result will be wrong:

Query 1 retrieves the correct result:

rankNum  denseRankNum  orderID
1        1             4
1        1             4
1        1             4
4        2             7
5        3             11
5        3             11
7        4             15
8        5             21

Query 2 retrieves the wrong result:

rankNum  denseRankNum  orderID
1        1             7
1        1             11
1        1             4
1        1             21
1        1             15
1        1             11
1        1             4
1        1             4

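Even though expressions in the "order by" clause help to skip
sorting, they can't be applied to the RANK() and DENSE_RANK()
functions. With a constant ordering expression every row ties with
every other row, so both functions return 1 for all rows. These
ranking functions must have input sorted on the real ranking column to
produce the correct result.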

Now let's look at the NTILE() function:

-- Listing 5. NTILE() function with expressions in an ORDER
-- BY clause.
-- Query 1: ORDER BY orderID.
SELECT NTILE(3) OVER (ORDER BY orderID) AS NTileNum, orderID
FROM RankingFunctions;
GO
-- Query 2: ORDER BY expression.
SELECT NTILE(3) OVER (ORDER BY (SELECT 1)) AS NTileNum, orderID
FROM RankingFunctions;

Analyzing the execution plans for both queries (see Figure 5), you will
find that:

• The second query skips sorting, meaning the solution is working.

• The results of both queries are correct.

• The optimizer is using Nested Loops, which in some situations
  can be heavy.
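
One quick way to check the second point for the unsorted query is to
count the rows that land in each group. This is a sketch that assumes
the table still holds the eight rows from Listings 1 and 4.

```sql
-- Count how many rows NTILE(3) assigns to each group when the
-- ORDER BY is a constant expression.
SELECT NTileNum, COUNT(*) AS rowsInGroup
FROM (SELECT NTILE(3) OVER (ORDER BY (SELECT 1)) AS NTileNum
      FROM RankingFunctions) AS t
GROUP BY NTileNum
ORDER BY NTileNum;
```

With eight rows and three groups, NTILE() puts the larger groups
first, so the counts should be 3, 3, and 2. Which rows fall into which
group depends on the order in which the rows are read.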

Performance of Ranking Functions

Now that you know how to avoid sorting in ranking functions, you can
test their performance.

Let's insert more rows into the RankingFunctions table (Listing 6):

-- Listing 6. Insert more rows into the RankingFunctions
-- table.
IF EXISTS (SELECT * FROM sys.objects
           WHERE object_id = OBJECT_ID(N'[dbo].[RankingFunctions]')
             AND type in (N'U'))
    DROP TABLE RankingFunctions

SET NOCOUNT ON
CREATE TABLE RankingFunctions(orderID int NOT NULL);
INSERT INTO RankingFunctions VALUES(7);
INSERT INTO RankingFunctions VALUES(11);
INSERT INTO RankingFunctions VALUES(4);
INSERT INTO RankingFunctions VALUES(21);
INSERT INTO RankingFunctions VALUES(15);

DECLARE @i as int, @LoopMax int, @orderIDMax int;
SELECT @i = 1, @LoopMax = 19;
WHILE (@i <= @LoopMax)
BEGIN
    SELECT @orderIDMax = MAX(orderID) FROM RankingFunctions;
    INSERT INTO RankingFunctions(OrderID)
    SELECT OrderID + @orderIDMax FROM RankingFunctions;
    SELECT @i = @i + 1;
END

SELECT COUNT(*) FROM RankingFunctions;
-- 2,621,440.

UPDATE RankingFunctions
SET orderID = orderID/5
WHERE orderID%5 = 0;

The INSERT and SELECT parts of the INSERT…SELECT statement use the
same RankingFunctions table, so each loop iteration doubles the
number of rows.

The number of generated rows can be calculated as:

generated rows number = initial rows number * power(2,
    number of loop iterations)

Since RankingFunctions initially has 5 rows and @LoopMax = 19, the
number of generated rows will be:

5 * POWER(2,19) = 2,621,440
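
The arithmetic can be checked directly in T-SQL. The column aliases
below are made up for this sketch.

```sql
-- Expected row count: 5 initial rows doubled 19 times.
SELECT 5 * POWER(2, 19) AS expectedRows;   -- 2621440

-- Compare with the actual count once Listing 6 has run.
SELECT COUNT(*) AS actualRows FROM RankingFunctions;
```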

To increase the entropy in the row order, I updated the orderID
values in the rows where orderID is divisible by 5 without a
remainder.

Then I tested the INSERT and DELETE commands, using ranking
functions with and without sorting (Listing 7 and Listing 8).

-- Listing 7. Performance tests 1 (Inserts, using SELECT...INTO).
-- Query 1: Using ORDER BY orderID.
IF EXISTS (SELECT * FROM sys.objects
           WHERE object_id = OBJECT_ID(N'[dbo].RankingFunctionsInserts')
             AND type in (N'U'))
    DROP TABLE RankingFunctionsInserts;
GO
SELECT ROW_NUMBER () OVER (ORDER BY OrderID) AS rowNum, OrderID
INTO RankingFunctionsInserts
FROM RankingFunctions;

-- Drop table RankingFunctionsInserts and run Query 2.
-- Query 2: Without sorting.
SELECT ROW_NUMBER () OVER (ORDER BY (SELECT 1)) AS rowNum, OrderID
INTO RankingFunctionsInserts
FROM RankingFunctions;

-- Drop table RankingFunctionsInserts and run Query 3.
-- Query 3: Using a pre-2005 solution.
SELECT IDENTITY(int,1,1) AS rowNum, orderID
INTO RankingFunctionsInserts
FROM RankingFunctions;

Each of the three queries in Listing 7 inserts the generated row
number and orderID into the RankingFunctionsInserts table, using the
SELECT…INTO statement. (This technique is very helpful when you are
trying to create pseudo-arrays in SQL.)

For the sake of curiosity, I tested a solution with an IDENTITY column
(Query 3). That solution is very common in pre-2005 versions of SQL
Server.
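
The article does not say how the run times were measured. One simple
possibility, shown here only as an assumption, is to wrap each query
from Listing 7 in a pair of GETDATE() calls. The variable @start and
the alias elapsedMs are names chosen for this sketch.

```sql
-- Assumes RankingFunctionsInserts does not exist yet
-- (drop it first if it does, as in Listing 7).
DECLARE @start datetime;
SET @start = GETDATE();

SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS rowNum, orderID
INTO RankingFunctionsInserts
FROM RankingFunctions;

-- Elapsed wall-clock time in milliseconds. The clock is coarse,
-- but adequate for runs that take several seconds.
SELECT DATEDIFF(ms, @start, GETDATE()) AS elapsedMs;
```

Wall-clock timings like this vary with caching and other activity on
the server, so it is worth repeating each run a few times.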

-- Listing 8. Performance tests 2 (Delete every fifth row
-- in the RankingFunctions table).
-- Query 1: Without sorting.
-- Run the script from Listing 6 to insert 2,621,440 rows
-- into RankingFunctions.
WITH originalOrder AS
(SELECT ROW_NUMBER ( ) OVER (ORDER BY (SELECT 1)) AS rowNum, OrderID
 FROM RankingFunctions)
DELETE originalOrder WHERE rowNum%5 = 0;

-- Query 2: With ORDER BY OrderID.
-- Run the script from Listing 6 to insert 2,621,440 rows
-- into RankingFunctions.
WITH originalOrder AS
(SELECT ROW_NUMBER ( ) OVER (ORDER BY OrderID) AS rowNum, OrderID
 FROM RankingFunctions)
DELETE originalOrder WHERE rowNum%5 = 0;

Deleting every Nth row, or deleting duplicates, is a common task for
a DBA or database programmer. In Listing 8, I used a CTE to delete
every fifth row in the RankingFunctions table.
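
The same CTE pattern can be adapted to the duplicate-removal task
mentioned above. The following is a sketch rather than one of the
tested queries; the names numbered and dupNum are made up for the
example.

```sql
-- Sketch: keep one row per orderID value and delete the rest.
-- PARTITION BY restarts the numbering for every distinct orderID,
-- so any row numbered above 1 is a duplicate.
WITH numbered AS
(SELECT ROW_NUMBER() OVER (PARTITION BY orderID
                           ORDER BY (SELECT 1)) AS dupNum,
        orderID
 FROM RankingFunctions)
DELETE FROM numbered WHERE dupNum > 1;
```

The constant expression only removes the ordering inside each
partition. Grouping the rows by orderID may still require a sort
unless an index already supplies that order.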

Test Results

Here are the results that I got on a regular Pentium 4 desktop
computer with 512 MB RAM running Windows 2000 Server and
Microsoft SQL Server 2005 Developer Edition:

                         ROW_NUMBER()  RANK()   DENSE_RANK()  NTILE(3)
INSERT 2,621,440 rows
  without sorting        5 sec.        N/A      N/A           35 sec.
  with sorting           14 sec.       14 sec.  14 sec.       40 sec.
  with IDENTITY          8 sec.        N/A      N/A           N/A

DELETE each 5th row
  without sorting        5 sec.
  with sorting           24 sec.

As you can see, the ROW_NUMBER() function works much faster
without sorting. It also performs better than the IDENTITY solution,
which is unsorted as well.

The RANK() and DENSE_RANK() functions, as we found earlier, don't
work properly without sorting. NTILE() shows a very small
improvement (35 seconds versus 40, a little over 10 percent). This
can be explained.

As I mentioned earlier, the optimizer uses Nested Loops to
implement the NTILE() function. For large data sets without
indexes (as in our case), Nested Loops can be very inefficient.
However, you will find that they are inexpensive in the execution plan
of the sorted query (see Figure 6), because sorting helps to make the
Nested Loops lighter. When sorting is missing (see Figure 7), the
Nested Loops become much heavier and almost "eat" the performance
gains that you achieve by avoiding sorting.

How Indexes Can Help

As you know, all the pages of non-clustered indexes, and the
intermediate-level pages of clustered indexes, are linked together and
sorted in key sequence order. The leaf level of a clustered index
consists of data pages that are physically sorted in the same key
sequence order as the clustered index key. This means that you
already store some part(s) of your table's data in a particular order.
If your query can use that sorted data (and this is what happens when
you have a covering index), you will increase the performance of your
query dramatically.

Take any table with many columns and rows (or create and populate
one using the technique from Listing 6). Then create different indexes
and test the ranking functions. You will find that for covered queries
the optimizer won't use a Sort operator. This is what makes a ranking
function with an ordinary "order by" column as fast as, or even faster
than, the same function with an expression in the "order by" clause.
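
As a starting point for such a test, here is a minimal sketch on the
article's own table. The index name IX_RankingFunctions_orderID is
hypothetical.

```sql
-- A clustered index keeps the rows in orderID order, so the
-- optimizer can read them already sorted.
CREATE CLUSTERED INDEX IX_RankingFunctions_orderID
    ON RankingFunctions (orderID);
GO
-- Re-check the execution plan for this query. With the index in
-- place, the expectation is that no Sort operator is needed.
SELECT ROW_NUMBER() OVER (ORDER BY orderID) AS rowNum, orderID
FROM RankingFunctions;
```

Building the index has its own cost on a table of this size, so the
comparison is most meaningful when the index would exist anyway.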

Conclusion

This article explains ranking functions and helps you understand how
they work. The techniques shown here can, in some situations, make
ranking functions roughly three to five times faster. In addition,
this article discusses some common trends in the behavior of an
ORDER BY clause with expressions.
