ORACLE在数据仓库的应用上,有很多针对分析数据和报表查询的函数。分类如下:
Type Used For
Ranking |
Calculating ranks, percentiles, and n-tiles of the values in a result set. |
Windowing |
Calculating cumulative and moving aggregates.
Works with these functions:
and new statistical functions. Note that the
not supported in windowing functions except for
|
Reporting |
Calculating shares, for example, market share.
Works with these functions:
Note that the
functions that support
|
|
Finding a value in a row a specified number of rows from a current row. |
|
First or last value in an ordered group. |
Linear Regression |
Calculating linear regression and other statistics (slope, intercept, and so on). |
Inverse Percentile |
The value in a data set that corresponds to a specified percentile. |
Hypothetical Rank and Distribution |
The rank or percentile that a row would have if inserted into a specified data set. |
其中 Windowing analytic function 的主要功能是 Calculating cumulative and moving aggregates,即累计计算求和。
Reporting 只是省略了WIDNOW ,即对分组后的全部数据求和,没有累加效果,很适合对某个地区销售百分额的查询。
分析函数的语法格式如下:
analytic_function ::=
analytic_clause ::=
Processing order:
首先,SQL语句中JOINS,WHERE,GROUP BY HAVING Clause先执行,然后为分析函数生成一个结果集,分析函数在此基础上进行计算,
如果有最后ORDER BY Clause,再进行排序(分析函数内部如有ORDER BY子句是在计算之前执行的)。
Result set partitions:
对JOINS,WHERE,GROUP BY HAVING 字句后生成的结果集分组,如省略次子句,默认对partition by 所有列
Window(重点):
partition by后的每一行,都拥有一个视窗。可以把每一行想象成一个天真活泼的小男孩,视窗就是他们手中拿着的可以自动伸长缩短的魔法棍。
一个结果集就如同所有的小男孩排成一个长队。然后每个小男孩拿着魔法棍,往前或者往后或者前后同时,想打谁就打谁,只要输入口令。
如
UNBOUNDED
PRECEDING,就是从最前面到自己,UNBOUNDED
FOLLOWING,就是从自己到最后面。
BETWEEN '起点' and '终点',就是以自己为中心从起点到终点的范围。
但是魔法棒有个缺点,只能范围攻击,不能单独击中某一个自己相邻的除外。
打比方打到这里,
其实魔法棒不是最佳比喻,还是WINDOW视窗比较合适,正好是范围的滑动。
Current row:
就是当前行啦,它作为视窗的参考点,并包括在整个视窗的计算范围内。
详细语法如下:
analytic_function([ arguments ])
OVER (analytic_clause)
where analytic_clause =
[ query_partition_clause ]
[ order_by_clause [ windowing_clause ] ]
and query_partition_clause =
PARTITION BY
{ value_expr[, value_expr ]...
| ( value_expr[, value_expr ]... )
}
and windowing_clause =
{ ROWS | RANGE }
{ BETWEEN
{ UNBOUNDED PRECEDING
| CURRENT ROW
| value_expr { PRECEDING | FOLLOWING }
}
AND
{ UNBOUNDED FOLLOWING
| CURRENT ROW
| value_expr { PRECEDING | FOLLOWING }
}
| { UNBOUNDED PRECEDING
| CURRENT ROW
| value_expr PRECEDING
}
}
RANGE:
定义为Logical Offset,即前后相隔可以通过具体的逻辑表达式表示,
如 range INTERVAL '1' DAY AND INTERVAL '2' DAY 就是当前
行的ORDER BY 字段的时间前一天和后两天的范围。
用RANG的时候,RANG 后面可是NUMERIC INTERVAL和表达式 ORDER BY字段的类型必须和RANG后的类型一样。
并且用RANG的时候,只能ORDER BY一个字段。
ROWS:
定义为Physical Offsets,即前后相隔只能是具体的行,如rows between 2 and 2就是当前行的前两行和后两行。可以ORDER BY多个字段。
WINDOWS 子句只能在ORDER BY子句后出现,如果不写则默认为 RANGE
BETWEEN
UNBOUNDED
PRECEDING
AND
CURRENT
ROW.
即WINDOWS都是基于分析子句中的OREDR BY,无ORDER BY的情况下,功能相当于REPORTING FUNCTION。
分析函数只能出现在SELECT子句和ORDER BY子句中。
(这里插一句,跟本主题不相关,只是突然想把在想的写下来,不管对不对。ORACLE在SQL执行计划中SELECT都放在最后,
在解析语句的时候,就解析到了分析函数,当执行GROUP BY等后,
就会在RESULT SETS中根据解析到的分析函数语法,
对排序后的每一行数据进行计算,最后在FETCH所有的行显示出来。GROUP BY 后的统计函数也是,如解析出SUM后,
对分组后的结果集中每个组中的每一行数据依次FENTCH相加,如果在SUM中写CASE WHEN就
能根据条件选择每一行的加减乘除。
但GROUP 带不出所有的行,而分析函数是基于每一行的。
哎,有些只能意会不能言传,主要还是没有实际的搞懂,要是能看到源代码就好了:)
)
下面举具体的例子说明。
Windowing Functions with Logical Offset:
1.
SELECT t.time_id, TO_CHAR (SUM(amount_sold), '9,999,999,999') AS SALES, TO_CHAR(AVG(SUM(amount_sold)) OVER (ORDER BY t.time_id RANGE BETWEEN INTERVAL '1' DAY PRECEDING AND INTERVAL '1' DAY FOLLOWING), '9,999,999,999') AS CENTERED_3_DAY_AVG FROM sales s, times t WHERE s.time_id=t.time_id AND t. calendar_week_number IN (51) AND calendar_year=1999 GROUP BY t.time_id ORDER BY t.time_id;
TIME_ID SALES CENTERED_3_DAY
--------- ------------- --------------
20-DEC-99 134,337 106,676
21-DEC-99 79,015 102,539
22-DEC-99 94,264 85,342
23-DEC-99 82,746 93,322
24-DEC-99 102,957 82,937
25-DEC-99 63,107 87,062
26-DEC-99 95,123 79,115
每一行结果中带出 当前行的TIME_ID前后相隔一天的的SALES的值的平均值。第一行前面没有行,平均值为后一行加当前行除以2.当找不到行的时候不会把那一行算入到计算内。RANG后为INTERVAL,GROUP BY的字段必须为DATE类型。
2.
SELECT t_timekey, AVG(stock_price)
OVER (ORDER BY t_timekey RANGE
fn(t_timekey)
PRECEDING) av_price
FROM stock, time WHERE st_timekey = t_timekey
ORDER BY t_timekey;
RANG后可跟表达式,表达式可以为函数。
3.
SELECT time_id, daily_sum, SUM(daily_sum) OVER (ORDER BY time_id RANGE BETWEEN INTERVAL '10' DAY PRECEDING AND CURRENT ROW) AS current_group_sum FROM (SELECT time_id, channel_id, SUM(s.quantity_sold) AS daily_sum FROM customers c, sales s, countries WHERE c.cust_id=s.cust_id AND c.country_id = countries.country_id AND s.cust_id IN (638, 634, 753, 440 ) AND s.time_id BETWEEN '01-MAY-00' AND '13-MAY-00' GROUP BY time_id, channel_id);
TIME_ID DAILY_SUM CURRENT_GROUP_SUM
--------- ---------- -----------------
06-MAY-00 7 7 /* 7 */
10-MAY-00 1 9 /* 7 + (1+1) */
10-MAY-00 1 9 * 7 + (1+1) */
11-MAY-00 2 15 /* 7 + (1+1) + (2+4) */
11-MAY-00 4 15 /* 7 + (1+1) + (2+4) */
12-MAY-00 1 16 /* 7 + (1+1) + (2+4) + 1 */
13-MAY-00 2 23 /* 7 + (1+1) + (2+4) + 1 + (5+2) */
13-MAY-00 5 23 /* 7 + (1+1) + (2+4) + 1 + (5+2) */
因为是LOGIC OFFCET,当ORDER BY 字段中有相同值时,会把相同字段的分析函数中要分析的值相加后在根据WINDOW值累计。
注意这个特性。要想不要这种效果,也就是无论ORDER BY字段是否有相同的值,都按照WINDOW值相加就要用ROWS了。
Windowing Aggregate Functions With Physical Offsets
1.
SELECT t.time_id, TO_CHAR(amount_sold, '9,999,999,999') AS INDIV_SALE, TO_CHAR(SUM(amount_sold) OVER (PARTITION BY t.time_id ORDER BY t.time_id ROWS UNBOUNDED PRECEDING), '9,999,999,999') AS CUM_SALES FROM sales s, times t, customers c WHERE s.time_id=t.time_id AND s.cust_id=c.cust_id AND t.time_id IN (TO_DATE('11-DEC-1999'), TO_DATE('12-DEC-1999')) AND c.cust_id BETWEEN 6500 AND 6600 ORDER BY t.time_id;
TIME_ID INDIV_SALE CUM_SALES
--------- ---------- ---------
12-DEC-99 23 23
12-DEC-99 9 32
12-DEC-99 14 46
12-DEC-99 24 70
12-DEC-99 19 89
ROWS UNBOUNDED PRECEDING 表示从第一行到当前行的视窗范围,即使ORDER BY字段的值一样,也实现了第一行到当前行的求和。
2.
SELECT c.cust_id, t.calendar_month_desc, TO_CHAR (SUM(amount_sold), '9,999,999,999') AS SALES, TO_CHAR(AVG(SUM(amount_sold)) OVER (ORDER BY c.cust_id, t.calendar_month_desc ROWS 2 PRECEDING), '9,999,999,999') AS MOVING_3_MONTH_AVG FROM sales s, times t, customers c WHERE s.time_id=t.time_id AND s.cust_id=c.cust_id AND t.calendar_year=1999 AND c.cust_id IN (6510) GROUP BY c.cust_id, t.calendar_month_desc ORDER BY c.cust_id, t.calendar_month_desc;
CUST_ID CALENDAR SALES MOVING_3_MONTH
---------- -------- -------------- --------------
6510 1999-04 125 125
6510 1999-05 3,395 1,760
6510 1999-06 4,080 2,533
6510 1999-07 6,435 4,637
6510 1999-08 5,105 5,207
6510 1999-09 4,676 5,405
6510 1999-10 5,109 4,963
6510 1999-11 802 3,529
前两行到当前行求AVG,注意,如果行没有找到,将不计在内。如第二行的AVG为(第一行+第二行的SALES值)/2,
第三行则为(第一行+第二行+第三行的SALES值)/3
如果不写ORDER BY子句,便没有视窗概念,功能为REPORTING FUNCTION
SELECT prod_category, country_region, sales FROM (SELECT SUBSTR(p.prod_category,1,8) AS prod_category, co.country_region, SUM(amount_sold) AS sales, MAX(SUM(amount_sold)) OVER (PARTITION BY prod_category) AS MAX_REG_SALES FROM sales s, customers c, countries co, products p WHERE s.cust_id=c.cust_id AND c.country_id=co.country_id AND s.prod_id =p.prod_id AND s.time_id = TO_DATE('11-OCT-2001') GROUP BY prod_category, country_region) WHERE sales = MAX_REG_SALES;
查询结果
PROD_CAT COUNTRY_REGION SALES MAX_REG_SALES -------- -------------------- ---------- ------------- Electron Americas 581.92 581.92 Hardware Americas 925.93 925.93 Peripher Americas 3084.48 4290.38 Peripher Asia 2616.51 4290.38 Peripher Europe 4290.38 4290.38 Peripher Oceania 940.43 4290.38 Software Americas 4445.7 4445.7 Software Asia 1408.19 4445.7 Software Europe 3288.83 4445.7 Software Oceania 890.25 4445.7
参考资料ORACLE document《Data Warehousing Guide 》,《SQL Language Reference 》