In the previous article we covered the HYPERLOGLOG algorithm for approximate NDV. In this one we look at the other option, the ADAPTIVE SAMPLING algorithm.
execute DBMS_STATS.delete_table_stats('U1','SALES1');

BEGIN
   DBMS_STATS.set_table_prefs (ownname => 'U1',
                               tabname => 'sales1',
                               pname   => 'approximate_ndv_algorithm',
                               pvalue  => 'ADAPTIVE SAMPLING');
   DBMS_STATS.set_global_prefs ('trace', TO_CHAR (2048 + 32768 + 4 + 16));
   DBMS_STATS.gather_table_stats (ownname => 'U1', tabname => 'SALES1');
END;
As with the HyperLogLog algorithm, when Oracle computes column statistics with the AS algorithm it first gathers the partition-level statistics separately, and then derives the global-level statistics from the synopses.
DBMS_STATS: gather stats on partition SALES_Q2_1998: synopsis not gathered yet; not analyzed yet;
DBMS_STATS: Start gather_stats.. pfix:  ownname: U1  tabname: SALES1  pname: SALES_Q2_1998  spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: AS
DBMS_STATS: reporting_man_log_task: target: U1.SALES1.SALES_Q2_1998 objn: 74398 auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0
This shows that Oracle starts gathering statistics for the SALES_Q2_1998 partition, using the following SQL statement to compute them.
DBMS_STATS: select /*+ full(t) no_parallel(t) no_parallel_index(t) dbms_stats
       cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring
       xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad */
       to_char(count("PROD_ID")), substrb(dump(min("PROD_ID"),16,0,64),1,240), substrb(dump(max("PROD_ID"),16,0,64),1,240),
       to_char(count("CUST_ID")), substrb(dump(min("CUST_ID"),16,0,64),1,240), substrb(dump(max("CUST_ID"),16,0,64),1,240),
       to_char(count("TIME_ID")), substrb(dump(min("TIME_ID"),16,0,64),1,240), substrb(dump(max("TIME_ID"),16,0,64),1,240),
       to_char(count("CHANNEL_ID")), substrb(dump(min("CHANNEL_ID"),16,0,64),1,240), substrb(dump(max("CHANNEL_ID"),16,0,64),1,240),
       to_char(count("PROMO_ID")), substrb(dump(min("PROMO_ID"),16,0,64),1,240), substrb(dump(max("PROMO_ID"),16,0,64),1,240),
       to_char(count("QUANTITY_SOLD")), substrb(dump(min("QUANTITY_SOLD"),16,0,64),1,240), substrb(dump(max("QUANTITY_SOLD"),16,0,64),1,240),
       to_char(count("AMOUNT_SOLD")), substrb(dump(min("AMOUNT_SOLD"),16,0,64),1,240), substrb(dump(max("AMOUNT_SOLD"),16,0,64),1,240),
       count(rowidtochar(rowid))
  from "U1"."SALES1" t
 where TBL$OR$IDX$PART$NUM("U1"."SALES1",0,4,0,"ROWID") = :objn
/* SYN,NIL,NIL,SYN,NIL,NIL,STOPN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,SYN,NIL,NIL,RWID,B74396,U254,U254,U254,U254,U254,U254,U254U*/
The database then gathers statistics for the SALES_Q1_1998 partition.
DBMS_STATS: gather stats on partition SALES_Q1_1998: synopsis not gathered yet; not analyzed yet;
DBMS_STATS: Start gather_stats.. pfix:  ownname: U1  tabname: SALES1  pname: SALES_Q1_1998  spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: AS
DBMS_STATS: reporting_man_log_task: target: U1.SALES1.SALES_Q1_1998 objn: 74397 auto_stats: FALSE status: IN PROGRESS ctx.batching_coeff: 0
Finally, Oracle computes the global statistics for the SALES1 table from the information provided by the partition synopses.
DBMS_STATS: Start gather_stats.. pfix:  ownname: U1  tabname: SALES1  pname:  spname:  execution phase: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: AS
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: AS
DBMS_STATS: Synopsis Aggregation Degree: 1
DBMS_STATS: APPROX_NDV_ALGORITHM chosen: AS
DBMS_STATS: get_agg_colstats: AS only
DBMS_STATS: Deriving Global Histogram for TIME_ID
DBMS_STATS: derive global freq hist...
DBMS_STATS: Evaluating frequency histogram for col: "TIME_ID"
DBMS_STATS:   number of values = 181, max # of buckects = 254, pct = 100, ssize = 162703360
DBMS_STATS:   range_skew: YES
DBMS_STATS: csr.hreq: 0  Histogram gathering flags: 1037
DBMS_STATS: Derive global stats from partition synopses/stats for table SALES1
So how does Oracle use the synopsis data to compute the statistics? First, when statistics are gathered with the AS algorithm, the synopsis data is stored in both wri$_optstat_synopsis$ and wri$_optstat_synopsis_head$. Oracle inserts metadata into wri$_optstat_synopsis_head$; in particular, the split column is important. Note that the spare1 and spare2 columns are empty here, whereas with the HLL algorithm these two columns contain data.
SQL> select * from sys.wri$_optstat_synopsis_head$ where BO#=74396;

       BO#     GROUP#    INTCOL#  SYNOPSIS#      SPLIT ANALYZETI     SPARE1 SPARE2
---------- ---------- ---------- ---------- ---------- --------- ---------- --------------------
     74396     148794          1                     0 11-JUL-18
     74396     148794          2                     0 11-JUL-18
     74396     148794          3                     0 11-JUL-18
     74396     148794          4                     0 11-JUL-18
     74396     148794          5                     0 11-JUL-18
     74396     148794          6                     0 11-JUL-18
     74396     148794          7                     0 11-JUL-18
     74396     148796          1                     0 11-JUL-18
     74396     148796          2                     0 11-JUL-18
     74396     148796          3                     0 11-JUL-18
     74396     148796          4                     0 11-JUL-18
     74396     148796          5                     0 11-JUL-18
     74396     148796          6                     0 11-JUL-18
     74396     148796          7                     0 11-JUL-18
In addition, the synopsis hash values are stored in the hashvalue column of wri$_optstat_synopsis$. So, given the formula NDV = N * power(2, i), to compute the NDV we need two things for each column: the number of distinct hash values (N) and the number of splits (i).
First, let's determine the number of splits for each column.
SQL> SELECT (SELECT name
               FROM sys.col$
              WHERE obj# = 74396 AND col# = t.intcol#) column_name,
            MAX (split) maxsplit
       FROM sys.wri$_optstat_synopsis_head$ t
      WHERE t.bo# = 74396
   GROUP BY t.intcol#;

COLUMN_NAME            MAXSPLIT
-------------------- ----------
PROD_ID                       0
QUANTITY_SOLD                 0
CUST_ID                       0
CHANNEL_ID                    0
PROMO_ID                      0
TIME_ID                       0
AMOUNT_SOLD                   0
Next, count the number of distinct hash values per column.
SQL> SELECT (SELECT name
               FROM sys.col$
              WHERE obj# = 74396 AND col# = t.intcol#) column_name,
            COUNT (DISTINCT (hashvalue)) dhv
       FROM sys.wri$_optstat_synopsis$ t
      WHERE bo# = 74396
   GROUP BY intcol#;

COLUMN_NAME                 DHV
-------------------- ----------
PROD_ID                      60
CUST_ID                    4305
TIME_ID                     181
CHANNEL_ID                    4
PROMO_ID                      2
QUANTITY_SOLD                 1
AMOUNT_SOLD                 425
With these two values in hand, we can apply the formula NDV = N * power(2, i). For PROD_ID, for example, NDV = 60 * power(2, 0) = 60. Checking the data dictionary confirms the result. In fact, none of these columns have any splits, so the number of distinct hash values per column is exactly its NDV.
SQL> SELECT column_name, num_distinct
       FROM dba_tab_col_statistics
      WHERE table_name = 'SALES1';

COLUMN_NAME                              NUM_DISTINCT
---------------------------------------- ------------
PROD_ID                                            60
CUST_ID                                          4305
TIME_ID                                           181
CHANNEL_ID                                          4
PROMO_ID                                            2
QUANTITY_SOLD                                       1
AMOUNT_SOLD                                       425
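As a sketch, the two steps above can be folded into one query that derives the NDV directly from the synopsis tables. It reuses the object number 74396 and the NDV = N * power(2, i) formula from this example; the join on intcol# is simply how the two per-column aggregates are lined up here, not an official Oracle view.

-- Sketch: per-column NDV derived from adaptive-sampling synopses.
-- Split count per column comes from wri$_optstat_synopsis_head$,
-- distinct hash values from wri$_optstat_synopsis$.
SELECT (SELECT name
          FROM sys.col$
         WHERE obj# = 74396 AND col# = h.intcol#)   column_name,
       h.maxsplit,
       s.dhv,
       s.dhv * POWER (2, h.maxsplit)                derived_ndv
  FROM (SELECT intcol#, MAX (split) maxsplit
          FROM sys.wri$_optstat_synopsis_head$
         WHERE bo# = 74396
      GROUP BY intcol#) h,
       (SELECT intcol#, COUNT (DISTINCT hashvalue) dhv
          FROM sys.wri$_optstat_synopsis$
         WHERE bo# = 74396
      GROUP BY intcol#) s
 WHERE h.intcol# = s.intcol#;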
In Oracle 12c Release 1, a new SQL function, APPROX_COUNT_DISTINCT, was introduced for approximate NDV computation; it uses the HLL algorithm. In Oracle 12c Release 2, the HLL algorithm was also implemented inside the DBMS_STATS package for approximate NDV. So the database can use either of the two algorithms (ADAPTIVE SAMPLING / HYPERLOGLOG).
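For example, the function can be called directly in a query; the column and table below are the ones used throughout this article, and the exact count is included only for comparison.

-- Exact vs. approximate (HLL-based) distinct count on the demo table.
SELECT COUNT (DISTINCT cust_id)        exact_ndv,
       APPROX_COUNT_DISTINCT (cust_id) approx_ndv
  FROM u1.sales1;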
With ADAPTIVE SAMPLING, synopses are stored in two places (wri$_optstat_synopsis_head$ and wri$_optstat_synopsis$), whereas with the HyperLogLog algorithm the synopsis is stored only in wri$_optstat_synopsis_head$. When gathering partition statistics, the preference approximate_ndv_algorithm = 'REPEAT OR HYPERLOGLOG' lets us keep creating synopses by the adaptive sampling method; this is the default. If we change approximate_ndv_algorithm to 'HYPERLOGLOG' while INCREMENTAL_STALENESS is NULL (Oracle 12c introduced a new preference, INCREMENTAL_STALENESS, that controls when partition statistics are considered stale and can no longer be used to generate global statistics; by default it is NULL), all old-format synopses are dropped and new synopses are created for the existing and newly added table partitions. If approximate_ndv_algorithm = 'HYPERLOGLOG' and INCREMENTAL_STALENESS = ALLOW_MIXED_FORMAT, the old-format synopses are not dropped immediately but are replaced gradually.
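As a sketch, switching the example table to the HLL format while keeping the old-format synopses usable during the transition could look like this; U1.SALES1 and the preference names APPROXIMATE_NDV_ALGORITHM and INCREMENTAL_STALENESS are the ones discussed above.

BEGIN
   -- Gather HLL-format synopses from now on, but keep accepting the
   -- existing adaptive-sampling synopses so they can be replaced gradually.
   DBMS_STATS.set_table_prefs (ownname => 'U1',
                               tabname => 'SALES1',
                               pname   => 'APPROXIMATE_NDV_ALGORITHM',
                               pvalue  => 'HYPERLOGLOG');
   DBMS_STATS.set_table_prefs (ownname => 'U1',
                               tabname => 'SALES1',
                               pname   => 'INCREMENTAL_STALENESS',
                               pvalue  => 'ALLOW_MIXED_FORMAT');
END;
/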
As for performance, there is no obvious difference in statistics gathering time between the two algorithms, and the error of both is roughly 2%. The main difference is the memory required: HLL lets us compute the NDV with high accuracy using very little memory (and disk storage).