Ranking, DensRanking, NTILE Functions and Performance in SQL Server 2005

Ranking Functions and Performance in SQL Server 2005
By Alex Kozak
20 April 2006
Ranking functions, introduced in SQL Server 2005, are a great
enhancement to Transact-SQL. Many tasks, such as creating arrays,
generating sequential numbers, and finding ranks, which required many
lines of code in pre-2005 versions, can now be implemented much more
easily and run faster.
Let's look at the syntax of ranking functions:
ROW_NUMBER () OVER ([<partition_by_clause>] <order_by_clause>)
RANK () OVER ([<partition_by_clause>] <order_by_clause>)
DENSE_RANK () OVER ([<partition_by_clause>] <order_by_clause>)
NTILE (integer_expression) OVER ([<partition_by_clause>] <order_by_clause>)
All four functions have "partition by" and "order by" clauses, and that
makes these functions very flexible and useful. However, there is one
nuance in the syntax that deserves your attention: the "order by"
clause is not optional.
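As a minimal sketch of what the two clauses do, the batch below numbers rows once per partition and once over the whole set. The table variable @orders and its customerID column are hypothetical and exist only for this illustration.

```sql
-- Hypothetical sample data: two customers with a few orders each.
DECLARE @orders TABLE (customerID int NOT NULL, orderID int NOT NULL);
INSERT INTO @orders VALUES (1, 7);
INSERT INTO @orders VALUES (1, 4);
INSERT INTO @orders VALUES (2, 21);
INSERT INTO @orders VALUES (2, 15);
INSERT INTO @orders VALUES (2, 11);

SELECT customerID,
       orderID,
       -- Numbering restarts at 1 for every customerID.
       ROW_NUMBER() OVER (PARTITION BY customerID ORDER BY orderID) AS rowNumInCustomer,
       -- Numbering runs across the whole result set.
       ROW_NUMBER() OVER (ORDER BY orderID) AS rowNumOverall
FROM @orders;
```

The "partition by" clause can be left out, as in the second column, but the "order by" clause has to be present in both.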
Why should you worry about the "order by" clause?
Well, as a DBA or database programmer you know that sorting is a
fairly expensive operation in terms of time and resources. If you were
always forced to sort, even in situations where you didn't need to,
you could expect performance to degrade, especially in large
databases.
Is it possible to avoid sorting in ranking functions? If so, how much
would it improve performance?
Let's try to answer these questions.

How to Avoid Sorting in Ranking Functions
Create a sample table (Listing 1):
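-- Listing 1. Create a sample table.
-- The five values match the result set shown after Listing 2.
CREATE TABLE RankingFunctions(orderID int NOT NULL);
INSERT INTO RankingFunctions VALUES(7);
INSERT INTO RankingFunctions VALUES(11);
INSERT INTO RankingFunctions VALUES(4);
INSERT INTO RankingFunctions VALUES(21);
INSERT INTO RankingFunctions VALUES(15);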
Run the next query with the ROW_NUMBER() function:
SELECT ROW_NUMBER () OVER (ORDER BY orderID) AS rowNum,
orderID
FROM RankingFunctions;
If you check the execution plan for that query (see Figure 1), you will
find that the Sort operator is very expensive: it accounts for 78
percent of the estimated query cost.
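If you prefer a text plan to the graphical one, one option (a sketch, not necessarily how the figures in this article were produced) is SET SHOWPLAN_TEXT, which returns the plan instead of executing the query:

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch.
SET SHOWPLAN_TEXT ON;
GO
-- The query is compiled but not executed; look for a Sort operator in the output.
SELECT ROW_NUMBER () OVER (ORDER BY orderID) AS rowNum,
       orderID
FROM RankingFunctions;
GO
SET SHOWPLAN_TEXT OFF;
GO
```

The text plan lists the operators but not the cost percentages, so use the graphical plan when you need those.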
Run the same query, leaving the OVER() clause blank:
SELECT ROW_NUMBER () OVER () AS rowNum, orderID
FROM RankingFunctions;
You will get an error:
Msg 4112, Level 15, State 1, Line 1
The ranking function "row_number" must have an ORDER BY clause.
Since the parser doesn't allow you to avoid the "order by" clause,
maybe you can force the query optimizer to stop using the Sort
operator. For example, you could create a computed column that
consists of a simple integer, 1, and then use that virtual column in the
"order by" clause (Listing 2):
-- Listing 2. ORDER BY computed column.
-- Query 1: Using derived table.
SELECT ROW_NUMBER () OVER (ORDER BY const) AS rowNum,
orderID
FROM (SELECT orderID, 1 as const
FROM RankingFunctions) t1
GO
-- Query 2: Using common table expression (CTE).
WITH OriginalOrder AS
(SELECT orderID, 1 as const
FROM RankingFunctions)
SELECT ROW_NUMBER () OVER (ORDER BY const) AS rowNum,
orderID
FROM OriginalOrder;
If you check the execution plans now (see Figure 2), you will find that
the query optimizer no longer uses the Sort operator. Both queries
generate the row numbers and, in this test, return the orderID values
in their original (insertion) order. Keep in mind that SQL Server does
not guarantee any particular row order here.
rowNum  orderID
1       7
2       11
3       4
4       21
5       15
There is a small problem with the queries in Listing 2: they need
time (resources) to create and populate the virtual column. As a
result, the performance gains that you achieve by avoiding the sort
operation may disappear when you populate the computed column. Is
there any other way to skip the sort operation?
Let's try to answer this question.
The "order by" clause accepts expressions. An expression can be
simple: a constant, a variable, a column, and so on. Simple
expressions can be combined into complex ones.
What if you talk to the query optimizer in the language of
expressions? For example, try to use a subquery as an expression:
SELECT ROW_NUMBER () OVER (ORDER BY (SELECT orderID FROM
RankingFunctions)) AS rowNum, orderID
FROM RankingFunctions;
No, that doesn't work either. The statement parses, but at run time
you will get an error:
Msg 512, Level 16, State 1, Line 1
Subquery returned more than 1 value. This is not permitted
when the subquery follows =, !=, <, <= , >, >= or when the
subquery is used as an expression.
O-o-o-p-s, here's the hint! The expression (or in our case, the
subquery) has to produce a single value.
This should work:
SELECT ROW_NUMBER () OVER (ORDER BY (SELECT MAX(OrderID)
FROM RankingFunctions)) AS rowNum, orderID
FROM RankingFunctions;
Bingo! That query works exactly as you wanted: no Sort operator has
been used.
Now you can write an expression in the "order by" clause that returns
a single value, forcing the query optimizer to refrain from using a sort
operation.
By the way, the solutions in Listing 2 worked because the integer
value in the computed column is the same in every row and for that
reason was treated as a single value.
Here are some more examples of expression usage in an "order by"
clause (Listing 3):
-- Listing 3. Using an expression in an ORDER BY clause.
SELECT ROW_NUMBER () OVER (ORDER BY (SELECT 1 FROM
sysobjects WHERE 1<>1)) AS rowNum, orderID
FROM RankingFunctions;
GO
SELECT ROW_NUMBER () OVER (ORDER BY (SELECT 1)) AS rowNum,
orderID
FROM RankingFunctions;
GO
DECLARE @i as bit;
SELECT @i = 1;
SELECT ROW_NUMBER () OVER (ORDER BY @i) AS rowNum, orderID
FROM RankingFunctions;
Figure 3 shows the execution plans for the queries in Listing 3.

RANK(), DENSE_RANK() and NTILE() Functions with Expressions in an ORDER BY Clause
Before we move forward, we should check the correctness of the
solutions for the rest of the ranking functions.
Let's create a few duplicates in the RankingFunctions table and start
testing the RANK() and DENSE_RANK() functions:
-- Listing 4. RANK() and DENSE_RANK() functions with
-- expressions in an ORDER BY clause.
-- Create duplicates in table RankingFunctions.
-- (Values taken from the result sets shown below.)
INSERT INTO RankingFunctions VALUES(11);
INSERT INTO RankingFunctions VALUES(4);
INSERT INTO RankingFunctions VALUES(4);
GO
-- Query 1: (ORDER BY orderID).
SELECT RANK () OVER (ORDER BY orderID) AS rankNum,
DENSE_RANK () OVER (ORDER BY orderID) AS denseRankNum,
orderID
FROM RankingFunctions;
GO
-- Query 2: (ORDER BY expression).
SELECT RANK () OVER (ORDER BY (SELECT 1)) AS rankNum,
DENSE_RANK () OVER (ORDER BY (SELECT 1)) AS denseRankNum,
orderID
FROM RankingFunctions;
GO
If you check the execution plans (see Figure 4), you will find that the
first query in Listing 4 requires a lot of resources for sorting. The
second query doesn't have a Sort operator. So the queries behave as
expected.
However, when you run the queries, the second result will be wrong.
Query 1 retrieves the correct result:
rankNum  denseRankNum  orderID
1        1             4
1        1             4
1        1             4
4        2             7
5        3             11
5        3             11
7        4             15
8        5             21
Query 2 retrieves the wrong result:
rankNum  denseRankNum  orderID
1        1             7
1        1             11
1        1             4
1        1             21
1        1             15
1        1             11
1        1             4
1        1             4
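Even though an expression in the "order by" clause helps to skip
sorting, it can't be applied to the RANK() and DENSE_RANK()
functions. These functions derive each rank from the "order by" values
themselves: when the expression is the same for every row, all rows
tie and every row receives rank 1. They need a real ordering column
to produce a meaningful result.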
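The difference between the functions can be seen side by side in one query. This is only a sketch: the table variable @demo is hypothetical and simply repeats the eight sample values used above.

```sql
-- Hypothetical table variable holding the article's eight sample values.
DECLARE @demo TABLE (orderID int NOT NULL);
INSERT INTO @demo (orderID)
SELECT 7 UNION ALL SELECT 11 UNION ALL SELECT 4 UNION ALL SELECT 21
UNION ALL SELECT 15 UNION ALL SELECT 11 UNION ALL SELECT 4 UNION ALL SELECT 4;

SELECT orderID,
       -- Ties are broken arbitrarily, so every row still gets a distinct number.
       ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS rowNum,
       -- Every row ties on the constant expression, so every row gets rank 1.
       RANK()       OVER (ORDER BY (SELECT 1)) AS rankNum,
       DENSE_RANK() OVER (ORDER BY (SELECT 1)) AS denseRankNum
FROM @demo;
```

ROW_NUMBER() tolerates the constant expression because it only needs some sequence of rows, whereas the two rank functions report the ties that the expression creates.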
Now let's look at the NTILE() function:
-- Listing 5. NTILE() function with expressions in an ORDER
-- BY clause.
-- Query 1: ORDER BY orderID.
SELECT NTILE(3) OVER (ORDER BY orderID) AS NTileNum,
orderID
FROM RankingFunctions;
GO
-- Query 2: ORDER BY expression.
SELECT NTILE(3) OVER (ORDER BY (SELECT 1)) AS NTileNum,
orderID
FROM RankingFunctions;
Analyzing the execution plans for both queries (see Figure 5), you will
find that:
• The second query skips sorting, meaning the solution is working.
• The results of both queries are correct.
• The optimizer is using Nested Loops, which in some situations
can be heavy.
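One likely reason for the extra operators is that NTILE() has to know the total number of rows before it can assign a tile to the first row. The sketch below is an assumption-laden illustration, not the optimizer's actual implementation: it emulates NTILE(3) with ROW_NUMBER() and COUNT(*) OVER (), using a hypothetical table variable @demo filled with the eight sample values.

```sql
-- Hypothetical table variable holding the article's eight sample values.
DECLARE @demo TABLE (orderID int NOT NULL);
INSERT INTO @demo (orderID)
SELECT 7 UNION ALL SELECT 11 UNION ALL SELECT 4 UNION ALL SELECT 21
UNION ALL SELECT 15 UNION ALL SELECT 11 UNION ALL SELECT 4 UNION ALL SELECT 4;

DECLARE @tiles int;
SET @tiles = 3;

WITH numbered AS
(SELECT orderID,
        ROW_NUMBER() OVER (ORDER BY orderID) AS rn,   -- position of the row
        COUNT(*) OVER () AS total                     -- total rows, needed up front
 FROM @demo)
SELECT orderID,
       NTILE(3) OVER (ORDER BY orderID) AS builtIn,
       CASE
         -- The first (total % @tiles) tiles receive one extra row each.
         WHEN rn <= (total % @tiles) * (total / @tiles + 1)
           THEN (rn - 1) / (total / @tiles + 1) + 1
         -- The remaining tiles receive (total / @tiles) rows each.
         ELSE (total % @tiles)
              + (rn - (total % @tiles) * (total / @tiles + 1) - 1) / (total / @tiles) + 1
       END AS emulated
FROM numbered;
```

With 8 rows and 3 tiles, the tile sizes are 3, 3 and 2, and both columns should agree for this data.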
Performance of Ranking Functions
Now that you know how to avoid sorting in ranking functions, you can
test their performance.
Let's insert more rows into the RankingFunctions table (Listing 6):
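-- Listing 6. Insert more rows into the RankingFunctions table.
IF EXISTS (SELECT * FROM sys.objects
WHERE object_id =
OBJECT_ID(N'[dbo].[RankingFunctions]') AND type in (N'U'))
DROP TABLE RankingFunctions;
SET NOCOUNT ON;
CREATE TABLE RankingFunctions(orderID int NOT NULL);
-- Reload the five initial rows (7, 11, 4, 21, 15).
INSERT INTO RankingFunctions VALUES(7);
INSERT INTO RankingFunctions VALUES(11);
INSERT INTO RankingFunctions VALUES(4);
INSERT INTO RankingFunctions VALUES(21);
INSERT INTO RankingFunctions VALUES(15);
DECLARE @i as int, @LoopMax int, @orderIDMax int;
SELECT @i = 1, @LoopMax = 19;
WHILE (@i <= @LoopMax)
BEGIN
    -- Each pass doubles the table; new values are shifted past the current maximum.
    SELECT @orderIDMax = MAX(orderID) FROM RankingFunctions;
    INSERT INTO RankingFunctions(OrderID)
    SELECT OrderID + @orderIDMax FROM RankingFunctions;
    SELECT @i = @i + 1;
END
SELECT COUNT(*) FROM RankingFunctions;
-- 2,621,440.
UPDATE RankingFunctions
SET orderID = orderID/5
WHERE orderID%5 = 0;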
The INSERT and SELECT parts of the INSERT…SELECT statement use the
same RankingFunctions table, so each loop iteration doubles the number
of rows.
The number of generated rows can be calculated as:
generated rows number = initial rows number * power(2, number of loop iterations)
Since RankingFunctions initially has 5 rows and @LoopMax = 19, the
number of generated rows will be:
5 * POWER(2,19) = 2,621,440
To increase the entropy in the row order, I updated the orderID values
in the rows where orderID is evenly divisible by 5.
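The row-count arithmetic can be checked directly in Transact-SQL. The variable names below are chosen for this sketch only.

```sql
-- Expected row count after the doubling loop: initial rows * 2^iterations.
DECLARE @initialRows int, @loopMax int;
SELECT @initialRows = 5, @loopMax = 19;

SELECT POWER(2, @loopMax) AS growthFactor,                 -- 524288
       @initialRows * POWER(2, @loopMax) AS expectedRows;  -- 2621440
```

Both values fit comfortably in an int, so no conversion to bigint is needed at this size.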
Then I tested the INSERT and DELETE commands, using ranking
functions with and without sorting (Listing 7 and Listing 8).
-- Listing 7. Performance tests 1 (Inserts, using SELECT...INTO).
-- Query 1: Using ORDER BY orderID.
IF EXISTS (SELECT * FROM sys.objects
WHERE object_id =
OBJECT_ID(N'[dbo].RankingFunctionsInserts') AND type in
(N'U'))
DROP TABLE RankingFunctionsInserts;
GO
SELECT ROW_NUMBER () OVER (ORDER BY OrderID) AS rowNum,
OrderID
INTO RankingFunctionsInserts
FROM RankingFunctions;
-- Drop table RankingFunctionsInserts and run Query 2.
-- Query 2: Without sorting.
SELECT ROW_NUMBER () OVER (ORDER BY (SELECT 1)) AS rowNum,
OrderID
INTO RankingFunctionsInserts
FROM RankingFunctions;
-- Drop table RankingFunctionsInserts and run Query 3.
-- Query 3: Using a pre-2005 solution.
SELECT IDENTITY(int,1,1) AS rowNum, orderID
INTO RankingFunctionsInserts
FROM RankingFunctions;
Each of the three queries in Listing 7 inserts the generated row
number and orderID into the RankingFunctionsInserts table, using the
SELECT…INTO statement. (This technique is very helpful when you are
trying to create pseudo-arrays in SQL.)
For the sake of curiosity, I tested a solution with an IDENTITY column
(Query 3). That solution is very common in pre-2005 versions of SQL
Server.
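The article does not say how the timings were taken. One way to time queries like those in Listing 7 yourself, offered here as a sketch, is SET STATISTICS TIME. The temporary table #RankingTimingTest is a hypothetical name used only for this example.

```sql
-- Report parse/compile and execution times in the Messages output.
SET STATISTICS TIME ON;

SELECT ROW_NUMBER () OVER (ORDER BY (SELECT 1)) AS rowNum,
       orderID
INTO #RankingTimingTest          -- hypothetical temporary table
FROM RankingFunctions;

SET STATISTICS TIME OFF;

-- Clean up so the test can be repeated.
DROP TABLE #RankingTimingTest;
```

Run each variant several times and compare the elapsed times, since the first run may include the cost of reading the table from disk.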
-- Listing 8. Performance tests 2 (Delete every fifth row
-- in the RankingFunctions table).
-- Query 1: Without sorting.
-- Run the script from Listing 6 to insert 2,621,440 rows
-- into RankingFunctions.
WITH originalOrder AS
(SELECT ROW_NUMBER ( ) OVER (ORDER BY (SELECT 1)) AS
rowNum, OrderID
FROM RankingFunctions)
DELETE originalOrder WHERE rowNum%5 = 0;
-- Query 2: With ORDER BY OrderID.
-- Run the script from Listing 6 to insert 2,621,440 rows
-- into RankingFunctions.
WITH originalOrder AS
(SELECT ROW_NUMBER ( ) OVER (ORDER BY OrderID) AS rowNum,
OrderID
FROM RankingFunctions)
DELETE originalOrder WHERE rowNum%5 = 0;
Deleting every Nth row or removing duplicates from a table is a common
task for a DBA or database programmer. In Listing 8, I used a CTE to
delete every fifth row in the RankingFunctions table.
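The same CTE pattern can be adapted to the other task mentioned here, removing duplicates. The following is a sketch under stated assumptions: the temporary table #dupDemo and the column alias dupNum are hypothetical, and the "partition by" clause most likely still requires the rows to be grouped, so this variant should not be expected to avoid sorting entirely.

```sql
-- Hypothetical temporary table with duplicated values.
CREATE TABLE #dupDemo (orderID int NOT NULL);
INSERT INTO #dupDemo (orderID)
SELECT 4 UNION ALL SELECT 4 UNION ALL SELECT 4 UNION ALL SELECT 7
UNION ALL SELECT 11 UNION ALL SELECT 11;

-- Number the rows within each group of equal orderID values,
-- then delete everything except the first row of each group.
WITH numbered AS
(SELECT ROW_NUMBER () OVER (PARTITION BY orderID ORDER BY (SELECT 1)) AS dupNum,
        orderID
 FROM #dupDemo)
DELETE FROM numbered WHERE dupNum > 1;

SELECT orderID FROM #dupDemo ORDER BY orderID;  -- one row each for 4, 7 and 11

DROP TABLE #dupDemo;
```

Because the duplicate rows are identical, it does not matter which copy in each group survives, which is why a constant expression is acceptable in the "order by" clause here.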
Test Results
Here are the results that I got on a regular Pentium 4 desktop
computer with 512 MB RAM running Windows 2000 Server and
Microsoft SQL Server 2005 Developer Edition:
Operation              Variant          ROW_NUMBER()  RANK()   DENSE_RANK()  NTILE(3)
INSERT 2,621,440 rows  without sorting  5 sec.        N/A      N/A           35 sec.
                       with sorting     14 sec.       14 sec.  14 sec.       40 sec.
                       with IDENTITY    8 sec.        N/A      N/A           N/A
DELETE each 5th row    without sorting  5 sec.
                       with sorting     24 sec.
As you can see, the ROW_NUMBER() function works much faster
without sorting. It also performs better than the IDENTITY solution,
which is unsorted as well.
The RANK() and DENSE_RANK() functions, as we found earlier, don't
work properly without sorting. NTILE() shows a very small
improvement, roughly 12 percent (35 seconds versus 40). This can be
explained.
As I mentioned earlier, the optimizer uses Nested Loops to
implement the NTILE() function. For large data sets without indexes
(as in our case), Nested Loops can be very inefficient. However, you
will find that they are inexpensive in the execution plan with sorting
(see Figure 6), because sorting helps to make Nested Loops lighter.
When sorting is missing (see Figure 7), the Nested Loops become
much heavier and almost "eat" the performance gains that you achieve
by avoiding sorting.
How Indexes Can Help
As you know, all the pages of non-clustered indexes, and the
intermediate-level pages of clustered indexes, are linked together and
sorted in key sequence order. The leaf level of a clustered index
consists of data pages that are physically sorted in the same key
sequence order as the clustered index key. All that means is that you
already store some part(s) of your table's data in a particular order. If
your query can use that sorted data (and this is what happens when
you have a covering index), you will increase the performance of your
query dramatically.
Take any table with many columns and rows (or create and populate
one using the technique from Listing 6). Then create different indexes
and test the ranking functions. You will find that for covered queries
the optimizer won't use a Sort operator. This is what makes the
ranking function as fast as, or even faster than, the functions with an
expression in an "order by" clause.
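As a starting point for such a test, the sketch below adds a clustered index to the RankingFunctions table from Listing 6 and reruns the sorted query. The index name IX_RankingFunctions_orderID is an assumption made for this example.

```sql
-- Hypothetical index name; the key matches the ORDER BY column of the ranking function.
CREATE CLUSTERED INDEX IX_RankingFunctions_orderID
ON RankingFunctions(orderID);
GO
-- Check the execution plan: with the rows already stored in orderID order,
-- the optimizer should be able to read them in sequence instead of sorting.
SELECT ROW_NUMBER () OVER (ORDER BY orderID) AS rowNum,
       orderID
FROM RankingFunctions;
GO
-- Remove the index to return the table to its original state.
DROP INDEX IX_RankingFunctions_orderID ON RankingFunctions;
GO
```

Building the index has its own cost, so this approach pays off mainly when the index already exists or will be reused by other queries.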
Conclusion
This article explains ranking functions and helps you understand how
they work. The techniques shown here, in some situations, can
increase the performance of ranking functions 3-5 times. In addition,
this article discusses some common trends in the behavior of an
ORDER BY clause with expressions.
