Overview
Analytic Functions, which have been available since Oracle 8.1.6, are designed to
address such problems as "Calculate a running total", "Find percentages within a
group", "Top-N queries", "Compute a moving average" and many more. Most of
these problems can be solved using standard PL/SQL; however, the performance is
often not what it should be. Analytic Functions add extensions to the SQL language
that not only make these operations easier to code but also make them faster than
could be achieved with pure SQL or PL/SQL. Comparable window functions have
since been included in the SQL standard (SQL:2003).
Analytic functions compute an aggregate value based on a group of rows. They differ
from aggregate functions in that they return multiple rows for each group. The group
of rows is called a window and is defined by the analytic clause. For each row, a
"sliding" window of rows is defined. The window determines the range of rows used
to perform the calculations for the "current row". Window sizes can be based on
either a physical number of rows or a logical interval such as time.
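The difference from a plain aggregate can be seen by computing the same department total both ways. This is a minimal sketch that assumes the standard EMP demo table used throughout this article; the alias dept_total is only illustrative.

```sql
-- Aggregate function: the rows of each department collapse into one row
SELECT deptno, SUM(sal) AS dept_total
FROM emp
GROUP BY deptno;

-- Analytic function: every employee row is kept,
-- and each row carries the total of its own department
SELECT deptno, ename, sal,
       SUM(sal) OVER (PARTITION BY deptno) AS dept_total
FROM emp;
```

The first statement returns one row per department, the second returns one row per employee.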
Analytic functions are the last set of operations performed in a query except for the
final ORDER BY clause. All joins and all WHERE, GROUP BY, and HAVING clauses are
completed before the analytic functions are processed. Therefore, analytic functions
can appear only in the select list or ORDER BY clause.
The Syntax
Analytic-Function(<Argument>,<Argument>,...)
OVER (
<Query-Partition-Clause>
<Order-By-Clause>
<Windowing-Clause>
)
← Analytic-Function
Specify the name of an analytic function. Oracle provides many analytic
functions such as AVG, CORR, COVAR_POP, COVAR_SAMP, COUNT, CUME_DIST,
DENSE_RANK, FIRST, FIRST_VALUE, LAG, LAST, LAST_VALUE, LEAD, MAX, MIN,
NTILE, PERCENT_RANK, PERCENTILE_CONT, PERCENTILE_DISC, RANK,
RATIO_TO_REPORT, STDDEV, STDDEV_POP, STDDEV_SAMP, SUM, VAR_POP,
VAR_SAMP, VARIANCE.
← Arguments
Analytic functions take 0 to 3 arguments.
← Query-Partition-Clause
The PARTITION BY clause logically breaks a single result set into N groups, according
to the criteria set by the partition expressions. The words "partition" and "group" are
used synonymously here. The analytic functions are applied to each group
independently; they are reset for each group.
← Order-By-Clause
The ORDER BY clause specifies how the data is sorted within each group (partition).
This affects the result of every analytic function whose value depends on row order.
← Windowing-Clause
The windowing clause gives us a way to define a sliding or anchored window of data,
on which the analytic function will operate, within a group. The window can be
any arbitrary sliding or anchored range of rows inside the group. More information
on windows can be found in the Windows section below.
This example shows the cumulative salary within a department row by row, with
each row including a summation of the prior rows' salaries.
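A query matching this description could be written as follows. It is a sketch that assumes the standard EMP table and is not necessarily the exact statement that produced the plan below; the aliases running_total, dept_total and seq are illustrative names.

```sql
SELECT ename, deptno, sal,
       -- running total over the entire ordered result set
       SUM(sal) OVER (ORDER BY deptno, ename) AS running_total,
       -- running total that restarts for each department
       SUM(sal) OVER (PARTITION BY deptno ORDER BY ename) AS dept_total,
       -- position of the row within its department
       ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY ename) AS seq
FROM emp
ORDER BY deptno, ename;
```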
Execution Plan
---------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 WINDOW (SORT)
2 1 TABLE ACCESS (FULL) OF 'EMP'
Statistics
---------------------------------------------------
0 recursive calls
0 db block gets
3 consistent gets
0 physical reads
0 redo size
1658 bytes sent via SQL*Net to client
503 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
14 rows processed
The example shows how to calculate a "Running Total" for the entire query. This is
done using the entire ordered result set, via SUM(sal) OVER (ORDER BY deptno,
ename).
Further, we were able to compute a running total within each department, a total
that would be reset at the beginning of the next department. The PARTITION BY
deptno in that SUM(sal) caused this to happen: a partitioning clause was specified in
the query in order to break the data up into groups.
The execution plan shows that the whole query is executed with one full scan of EMP,
a single window sort and only 3 consistent gets. A solution in standard SQL or PL/SQL
that queries the table again for each row would need considerably more work.
Top-N Queries
There are some problems with Top-N queries, however, mostly in the way people
phrase them. It is something to be careful about when designing reports. A
seemingly sensible request for the top three paid sales people in each department
becomes ambiguous as soon as several people earn the same salary.
Let's look at some examples; all use the well-known table EMP.
Example 1
Sort the sales people by salary from greatest to least. Give the first three rows. If
there are fewer than three people in a department, this will return fewer than three
records.
SELECT * FROM (
SELECT deptno, ename, sal, ROW_NUMBER()
OVER (
PARTITION BY deptno ORDER BY sal DESC
) Top3 FROM emp
)
WHERE Top3 <= 3
/
DEPTNO ENAME    SAL TOP3
------ ------ ----- ----
    20 SCOTT   3000    1
       FORD    3000    2
       JONES   2975    3
    30 BLAKE   2850    1
       ALLEN   1600    2
       TURNER  1500    3
9 rows selected.
Execution Plan
--------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 VIEW
2 1 WINDOW (SORT)
3 2 TABLE ACCESS (FULL) OF 'EMP'
This query works by sorting each partition (or group, which is the deptno) in
descending order, based on the salary column, and then assigning a sequential row
number to each row in the group as it is processed. A WHERE clause is then applied
to keep just the first three rows in each partition.
Example 2
Give me the set of sales people who make the top 3 salaries - that is, find the set of
distinct salary amounts, sort them, take the largest three, and give me everyone
who makes one of those values.
SELECT * FROM (
SELECT deptno, ename, sal,
DENSE_RANK()
OVER (
PARTITION BY deptno ORDER BY sal desc
) TopN FROM emp
)
WHERE TopN <= 3
ORDER BY deptno, sal DESC
/
DEPTNO ENAME    SAL TOPN
------ ------ ----- ----
    30 BLAKE   2850    1
       ALLEN   1600    2
       TURNER  1500    3
10 rows selected.
Execution Plan
--------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE
1 0 VIEW
2 1 WINDOW (SORT PUSHED RANK)
3 2 TABLE ACCESS (FULL) OF 'EMP'
Here the DENSE_RANK function was used to get the top three salaries. We
assigned the dense rank to the salary column and sorted it in descending order.
The DENSE_RANK function computes the rank of a row in an ordered group of rows.
The ranks are consecutive integers beginning with 1, and the largest rank value is the
number of unique values returned by the query. Rows with equal values for the
ranking criteria receive the same rank, and rank values are not skipped in the
event of ties. Hence, after the result set is built in the inline
view, we can simply select all of the rows with a dense rank of three or less. This
gives us everyone who makes one of the top three salaries in each department.
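To see how the ranking functions differ on ties, they can be placed side by side in one query. This is an illustrative sketch on the EMP table; RANK is not used in the examples above and the aliases rn, rnk and drnk are made up for this comparison.

```sql
SELECT deptno, ename, sal,
       -- always 1, 2, 3, ... ; ties are ordered arbitrarily
       ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY sal DESC) AS rn,
       -- ties share a rank, the following rank is skipped
       RANK()       OVER (PARTITION BY deptno ORDER BY sal DESC) AS rnk,
       -- ties share a rank, no rank is skipped
       DENSE_RANK() OVER (PARTITION BY deptno ORDER BY sal DESC) AS drnk
FROM emp
ORDER BY deptno, sal DESC;
```

In department 20, where SCOTT and FORD both earn 3000, JONES receives row number 3, rank 3 and dense rank 2.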
Windows
The windowing clause gives us a way to define a sliding or anchored window of data,
on which the analytic function will operate, within a group. The default window is
an anchored window that simply starts at the first row of a group and continues to the
current row.
We can set up windows based on two criteria: RANGES of data values or ROWS
offset from the current row. The existence of an ORDER BY in
an analytic function adds a default window clause of RANGE UNBOUNDED
PRECEDING. That says to get all rows in our partition that came before us as
specified by the ORDER BY clause, together with any rows that tie with the
current row on the ORDER BY value.
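The default window can be made visible by writing it out and comparing it with a ROWS window. The following is a sketch on the EMP table with illustrative aliases; the first two columns should always agree, while the third can differ on rows that tie on sal.

```sql
SELECT ename, sal,
       -- ORDER BY without a windowing clause: the default window applies
       SUM(sal) OVER (ORDER BY sal) AS default_window,
       -- the same window written out in full
       SUM(sal) OVER (ORDER BY sal
                      RANGE BETWEEN UNBOUNDED PRECEDING
                                AND CURRENT ROW) AS explicit_range,
       -- physical rows instead: tied rows are no longer summed together
       SUM(sal) OVER (ORDER BY sal
                      ROWS BETWEEN UNBOUNDED PRECEDING
                               AND CURRENT ROW) AS explicit_rows
FROM emp
ORDER BY sal;
```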
Let's look at an example with a sliding window within a group and compute the sum
of the current row's SAL column plus the previous 2 rows in that group. If we need a
report that shows the sum of the current employee's salary with the preceding two
salaries within a department, we can use a ROWS 2 PRECEDING window.
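A query for this report might look like the following sketch. It assumes the EMP table; the alias sliding_total is an illustrative name chosen to match the description.

```sql
SELECT deptno, ename, sal,
       -- current row plus the two rows before it, within the department
       SUM(sal) OVER (PARTITION BY deptno
                      ORDER BY ename
                      ROWS 2 PRECEDING) AS sliding_total
FROM emp
ORDER BY deptno, ename;
```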
The partition clause makes the SUM(sal) be computed within each department,
independent of the other groups. The SUM(sal) is "reset" as the department
changes. The ORDER BY ENAME clause sorts the data within each department by
ENAME; this allows the window clause, ROWS 2 PRECEDING, to access the 2
rows prior to the current row in a group in order to sum the salaries.
For example, the SLIDING TOTAL value for SMITH is 6775, which is
the sum of 800, 3000, and 2975. That is simply SMITH's row plus the salary from
the preceding two rows in the window.
Range Windows
Range windows collect rows together based on a condition on the ORDER BY value,
much like a WHERE clause. If I say "range 5 preceding", for example, this will
generate a sliding window that has the set of all
preceding rows in the group such that they are within 5 units of the current row.
These units may be either numeric comparisons or date comparisons; it is not
valid to use RANGE with an offset on datatypes other than numbers and dates.
Example
Count the employees who were hired within the 100 days preceding each employee's
own hiredate. The range window goes back 100 days from the current row's hiredate and
then counts the rows within this range. The solution is to use a window
specification of RANGE 100 PRECEDING over an ORDER BY on hiredate.
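Put into a complete statement, this could look like the sketch below. It assumes the EMP table; the alias cnt follows the "Cnt" column mentioned in the text and hiredate_pre is an illustrative helper column showing the start of the range.

```sql
SELECT ename, hiredate,
       hiredate - 100 AS hiredate_pre,
       -- rows whose hiredate lies within the 100 days up to the current row
       COUNT(*) OVER (ORDER BY hiredate ASC
                      RANGE 100 PRECEDING) AS cnt
FROM emp
ORDER BY hiredate ASC;
```

Because hiredate is a DATE, the numeric offset 100 is interpreted as a number of days.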
We ordered the single partition by hiredate ASC. If we look, for example, at the row
for CLARK, we can see that his hiredate was 09-JUN-81, and 100 days prior to that is
the date 01-MAR-81. If we look at who was hired between 01-MAR-81 and 09-JUN-81,
we find JONES (hired: 02-APR-81) and BLAKE (hired: 01-MAY-81). These are 3 rows
including the current row, which is what we see in the column "Cnt" of CLARK's row.
As another example, compute for each employee the average salary of the people
hired within the 100 days before that employee's hiredate, using the same window with AVG(sal).
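A sketch of such a query, again assuming the EMP table and using the illustrative alias avg_sal:

```sql
SELECT ename, hiredate, sal,
       -- average over the current row and all rows hired up to 100 days earlier
       AVG(sal) OVER (ORDER BY hiredate ASC
                      RANGE 100 PRECEDING) AS avg_sal
FROM emp
ORDER BY hiredate ASC;
```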
Look at CLARK again, since we understand his range window within the group. We
can see that the average salary of 2758 is equal to (2975+2850+2450)/3. This is
the average of the salaries for CLARK and the rows preceding CLARK, those of
JONES and BLAKE. For this window to look back in time, the data must be sorted
by hiredate in ascending order.
Row Windows
Row windows are physical units: a physical number of rows to include in the window.
For example, you can calculate the average salary of a given record with the (up to
5) employees hired before them or after them.
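One way to write this is sketched below, assuming the EMP table. The "after" window is obtained by reversing the sort order, so that ROWS 5 PRECEDING reaches the later hires; all four aliases are illustrative.

```sql
SELECT ename, hiredate, sal,
       -- current row and up to 5 rows hired before it
       AVG(sal) OVER (ORDER BY hiredate ASC  ROWS 5 PRECEDING) AS avg_5_before,
       COUNT(*) OVER (ORDER BY hiredate ASC  ROWS 5 PRECEDING) AS obs_before,
       -- current row and up to 5 rows hired after it
       AVG(sal) OVER (ORDER BY hiredate DESC ROWS 5 PRECEDING) AS avg_5_after,
       COUNT(*) OVER (ORDER BY hiredate DESC ROWS 5 PRECEDING) AS obs_after
FROM emp
ORDER BY hiredate;
```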
The window consists of up to 6 rows: the current row and five rows "in front of" this
row, where "in front of" is defined by the ORDER BY clause. With ROWS windows,
we do not have the limitation of RANGE windows - the data may be of any type and
the ORDER BY may include many columns. Notice that we selected a COUNT(*) as
well. This is useful just to demonstrate how many rows went into making up a given
average. We can see clearly that for ALLEN's record, the average salary computation
for people hired before him used only 2 records whereas the computation for salaries
of people hired after him used 6.
Frequently you want to access data not only from the current row but also from the
rows "in front of" or "behind" it. For example, let's say you need a report that
shows, by department, all of the employees; their hire date; how many days before
was the last hire; how many days after was the next hire.
Using straight SQL this query would be difficult to write. Not only that, but its
performance would once again be questionable. The approach I typically
took in the past was either to "select a select" or write a PL/SQL function that
would take some data from the current row and "find" the previous and next rows'
data. This worked, but introduced large overhead into both the development of the
query and the run-time execution of the query.
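With LAG and LEAD the report can be written in a single pass over EMP. The following is a sketch under the assumption that the standard EMP table is used; the aliases last_hire, days_last, next_hire and days_next are illustrative.

```sql
SELECT deptno, ename, hiredate,
       -- hire date of the previous hire in the same department
       LAG(hiredate)  OVER (PARTITION BY deptno ORDER BY hiredate) AS last_hire,
       hiredate - LAG(hiredate)
                      OVER (PARTITION BY deptno ORDER BY hiredate) AS days_last,
       -- hire date of the next hire in the same department
       LEAD(hiredate) OVER (PARTITION BY deptno ORDER BY hiredate) AS next_hire,
       LEAD(hiredate) OVER (PARTITION BY deptno ORDER BY hiredate)
                      - hiredate                                   AS days_next
FROM emp
ORDER BY deptno, hiredate;
```

For the first and last hire of each department the corresponding columns are NULL, because no default value is supplied.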
The LEAD and LAG routines could be considered a way to "index into your
partitioned group". Using these functions you can access any individual row. For
example, the record for KING includes the data from the prior row (LAST HIRE)
and the next row (NEXT HIRE). We can access the fields in records preceding or
following the current record in an ordered partition easily.
LAG
LAG provides access to more than one row of a table at the same time without a
self join. Given a series of rows returned from a query and a position of the cursor,
LAG provides access to a row at a given physical offset prior to that position.
If you do not specify offset, then its default is 1. The optional default value is
returned if the offset goes beyond the scope of the window. If you do not specify
default, then its default value is null.
The following example provides, for each person in the EMP table, the salary of the
employee hired just before:
SELECT ename,hiredate,sal,
LAG(sal, 1, 0)
OVER (ORDER BY hiredate) AS PrevSal
FROM emp
WHERE job = 'CLERK';
Ename Hired SAL PREVSAL
------ --------- ----- -------
SMITH 17-DEC-80 800 0
JAMES 03-DEC-81 950 800
MILLER 23-JAN-82 1300 950
ADAMS 12-JAN-83 1100 1300
LEAD
LEAD provides access to more than one row of a table at the same time without a
self join. Given a series of rows returned from a query and a position of the cursor,
LEAD provides access to a row at a given physical offset beyond that position.
If you do not specify offset, then its default is 1. The optional default value is
returned if the offset goes beyond the scope of the table. If you do not specify
default, then its default value is null.
The following example provides, for each employee in the EMP table, the hire date of
the employee hired just after:
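A sketch of such a query on the EMP table, with the illustrative alias next_hired, could be:

```sql
SELECT ename, hiredate,
       -- offset 1, no default: NULL for the most recently hired employee
       LEAD(hiredate, 1) OVER (ORDER BY hiredate) AS next_hired
FROM emp;
```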
The FIRST_VALUE and LAST_VALUE functions allow you to select the first and last
rows from a group. These rows are especially valuable because they are often used
as the baselines in calculations.
Example
The following example selects, for each employee in each department, the name of
the employee with the lowest salary.
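This could be written as in the following sketch, which assumes the EMP table; min_sal_has is an illustrative alias. If several employees share the lowest salary, which of them is returned is not defined by this ORDER BY.

```sql
SELECT deptno, ename, sal,
       -- first row of each department when sorted by ascending salary
       FIRST_VALUE(ename) OVER (PARTITION BY deptno
                                ORDER BY sal ASC) AS min_sal_has
FROM emp
ORDER BY deptno, ename;
```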
The following example selects, for each employee in each department, the name of
the employee with the highest salary.
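Two possible formulations are sketched below on the EMP table, with illustrative aliases. FIRST_VALUE can simply reverse the sort order. LAST_VALUE needs an explicit window that reaches to the end of the partition, because the default window stops at the current row.

```sql
SELECT deptno, ename, sal,
       -- highest salary first, so FIRST_VALUE picks the top earner
       FIRST_VALUE(ename) OVER (PARTITION BY deptno
                                ORDER BY sal DESC) AS max_sal_has,
       -- same idea with LAST_VALUE and a window spanning the whole partition
       LAST_VALUE(ename)  OVER (PARTITION BY deptno
                                ORDER BY sal ASC
                                ROWS BETWEEN UNBOUNDED PRECEDING
                                         AND UNBOUNDED FOLLOWING) AS max_sal_has_lv
FROM emp
ORDER BY deptno, ename;
```

Where two employees tie for the highest salary, the two columns may name different employees.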
The following example selects, for each employee in department 30, the name of the
employee with the lowest salary, using an inline view.
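A sketch of this variant, assuming the EMP table and the illustrative alias min_sal_has: the inline view restricts the rows to department 30 first, so no PARTITION BY is needed.

```sql
SELECT deptno, ename, sal,
       FIRST_VALUE(ename) OVER (ORDER BY sal ASC) AS min_sal_has
FROM (SELECT deptno, ename, sal
      FROM emp
      WHERE deptno = 30)
ORDER BY ename;
```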
Example
Let's say you want to show the top 3 salary earners in each department as columns.
The query needs to return exactly 1 row per department and the row would have 4
columns: the DEPTNO, the name of the highest paid employee in the department,
the name of the next highest paid, and so on. Using analytic functions this is almost
easy; without analytic functions it was virtually impossible.
SELECT deptno,
MAX(DECODE(seq,1,ename,null)) first,
MAX(DECODE(seq,2,ename,null)) second,
MAX(DECODE(seq,3,ename,null)) third
FROM (SELECT deptno, ename,
row_number()
OVER (PARTITION BY deptno
ORDER BY sal desc NULLS LAST) seq
FROM emp)
WHERE seq <= 3
GROUP BY deptno
/
Note the inner query, which assigns a sequence (seq) to each employee by
department number in order of salary.
The DECODE in the outer query keeps only rows with sequences 1, 2 or 3 and
assigns them to the correct "column". The GROUP BY gets rid of the redundant rows
and we are left with our collapsed result. It may be easier to understand if you see
the resultset without the aggregate function MAX grouped by deptno.
SELECT deptno,
DECODE(seq,1,ename,null) first,
DECODE(seq,2,ename,null) second,
DECODE(seq,3,ename,null) third
FROM (SELECT deptno, ename,
row_number()
OVER (PARTITION BY deptno
ORDER BY sal desc NULLS LAST) seq
FROM emp)
WHERE seq <= 3
/
The MAX aggregate function will be applied by the GROUP BY column DEPTNO. In
any given DEPTNO above, only one row will have a non-null value for FIRST; the
remaining rows in that group will always be NULL. The MAX function will pick out the
non-null row and keep that for us. Hence, the GROUP BY and MAX will collapse our
resultset, removing the NULL values from it and giving us what we want.
Conclusion
This new set of functionality holds some exciting possibilities. It opens up a whole
new way of looking at the data. It will remove a lot of procedural code and complex
or inefficient queries that would have taken a long time to develop, to achieve the
same result.
CREATE TABLE vote_count (
submit_date DATE NOT NULL,
num_votes NUMBER NOT NULL);
INSERT INTO vote_count VALUES (TRUNC(SYSDATE)-4, 100);
INSERT INTO vote_count VALUES (TRUNC(SYSDATE)-3, 150);
INSERT INTO vote_count VALUES (TRUNC(SYSDATE)-2, 75);
INSERT INTO vote_count VALUES (TRUNC(SYSDATE)-3, 25);
INSERT INTO vote_count VALUES (TRUNC(SYSDATE)-1, 50);
COMMIT;
SELECT * FROM vote_count;
SUBMIT_DATE  NUM_VOTES
-----------  ---------
10/25/2008         100
10/26/2008         150
10/27/2008          75
10/26/2008          25
10/28/2008          50
-- Running count of rows within each submit_date
SELECT submit_date, COUNT(*)
OVER(PARTITION BY submit_date ORDER BY submit_date ROWS UNBOUNDED
PRECEDING) NUM_RECS
FROM vote_count;
SUBMIT_DATE  NUM_RECS
-----------  --------
10/25/2008          1
10/26/2008          1
10/26/2008          2
10/27/2008          1
10/28/2008          1
-- Running average over all rows, ordered by submit_date
SELECT submit_date, num_votes, TRUNC(AVG(num_votes)
OVER(ORDER BY submit_date ROWS UNBOUNDED PRECEDING)) AVG_VOTE_PER_DAY
FROM vote_count
ORDER BY submit_date;
-- Running average that restarts for each submit_date
SELECT submit_date, num_votes, TRUNC(AVG(num_votes)
OVER(PARTITION BY submit_date ORDER BY submit_date ROWS UNBOUNDED
PRECEDING)) AVG_VOTE_PER_DAY
FROM vote_count
ORDER BY submit_date;
-- Plain aggregate for comparison: without an OVER clause AVG needs a GROUP BY
-- (selecting num_votes next to the ungrouped AVG raises ORA-00937)
SELECT submit_date, TRUNC(AVG(num_votes)) AVG_VOTE_PER_DAY
FROM vote_count
GROUP BY submit_date
ORDER BY submit_date;