Executing dbms_workload_repository.create_snapshot hangs with wait event enq: WF - contention

Background:
On an HP platform running a 10.2.0.5 RAC database, a manually executed AWR snapshot hung.

Problem analysis:

For a hang, we normally start by querying v$session to see what event the session is waiting on. This case is slightly different, though: when we call DBMS_WORKLOAD_REPOSITORY to create a snapshot, our session delegates the work to an m000 slave process, and control only returns to our session once m000 has finished. What makes this case interesting is that our own session was waiting on enq: WF - contention, while the m000 process was waiting on gc cr request.
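As an aside, when the foreground session appears stuck like this, the slave doing the real work can be located by filtering gv$session on its program name; a hedged sketch (the LIKE pattern is our assumption of how the slave name shows up in the program column, not taken from the original case):

```sql
-- Find MMON slave (m000) sessions on all instances and what they are waiting on.
select inst_id, sid, serial#, program, event
from   gv$session
where  lower(program) like '%(m00%'   -- slave name appears in the program column
order by inst_id, sid;
```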

SQL> select sid,serial#,sql_id,event,blocking_instance,blocking_session from gv$session where sid=2347;

       SID    SERIAL# SQL_ID        EVENT                 BLOCKING_INSTANCE BLOCKING_SESSION
---------- ---------- ------------- --------------------- ----------------- ----------------
      2347       1048 bc7gjv3ppdtbz enq: WF - contention                  2             3369

SQL> select sid,serial#,event,blocking_session from v$session where sid=3369;

       SID    SERIAL# EVENT           BLOCKING_SESSION
---------- ---------- --------------- ----------------
      3369      17015 gc cr request
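Another quick check before pulling trace files: holders and waiters of the WF (AWR flush) enqueue can be listed cluster-wide from gv$lock. A sketch, not output captured from the original case:

```sql
-- lmode is the mode held (0 = none, 4 = shared, 6 = exclusive);
-- request is the mode being waited for.
select inst_id, sid, type, id1, id2, lmode, request
from   gv$lock
where  type = 'WF'
order by inst_id, sid;
```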

Because m000 could not run to completion, the process generated a trace of its own. Let's look at the m000 trace file.

Unix process pid: 21851, image: oracle@RX9800_2 (m000)

*** 2014-02-26 09:35:29.046
*** ACTION NAME:(Auto-Flush Slave Action) 2014-02-26 09:35:29.045
*** MODULE NAME:(MMON_SLAVE) 2014-02-26 09:35:29.045
*** SERVICE NAME:(SYS$BACKGROUND) 2014-02-26 09:35:29.045
*** SESSION ID:(3369.17015) 2014-02-26 09:35:29.045
kjbhistory[0xa2b0.30000,(pkey 8990.0)(where 1)]
*** 2014-02-26 09:35:29.046
GLOBAL CACHE ELEMENT DUMP (address: c0000001adfca2b8):
  id1: 0xa2b0 id2: 0x30000 pkey: OBJ#8990 block: (3/41648)
  lock: NC rls: 0x0000 acq: 0x0003 latch: 12
  flags: 0x41 fair: 0 recovery: 0 fpin: 'kdswh02: kdsgrp'
  bscn: 0x0.0 bctx: 0000000000000000 write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0
  lcp: c000000c2e6b1598 lnk: [c000000c2e6b15b8,c000000c2e6b15b8] lch: [c000000a74f77d48,c000000a74f77d48]
  seq: 27830 hist: 7 352 329 144:1 502 465 7 352 329 144:1
  LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT:
    flg: 0x00080000 state: READING mode: EXCL
      pin: 'kdswh02: kdsgrp'
      addr: c000000a74f77c38 obj: 8990 cls: DATA bscn: 0x0.0
 GCS CLIENT c0000001adfca328,4639 sq[0000000000000000,0000000000000000] resp[0000000000000000,0xa2b0.30000] pkey 8990.0
   grant 0 cvt 0 mdrole 0x0 st 0x20 GRANTQ rl LOCAL
   master 0 owner 1 sid 1 remote[0000000000000000,0] hist 0xcd0672cd
   history 0xd.0xb.0x27.0x1.0xd.0x3. cflag 0x0 sender 0 flags 0x0 replay# 0
   disk: 0x0000.00000000 write request: 0x0000.00000000
   pi scn: 0x0000.00000000 
   msgseq 0x0 updseq 0x0 reqids[4638,0,0] infop 0x0
   pkey 8990.0
   hv 69 [stat 0x0, 0->0, wm 32767, RMno 0, reminc 38, dom 0]
   kjga st 0x4, step 0.0.0, cinc 40, rmno 11, flags 0x0
   lb 0, hb 0, myb 37974, drmb 37974, apifrz 0

From this trace file we can see that our gc request is against object 8990, physically located at block 41648 of datafile 3. So why can't this block be shipped over? To dig further we took a hanganalyze and a systemstate dump; since this is a gc problem, both need to be taken globally across all instances. The output ends up in the diag trace files. Let's first look at node 1's diag file, which contains the following:
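For reference, cluster-wide hanganalyze and systemstate dumps of the kind taken here are usually driven through oradebug; a sketch of typical commands from a SYSDBA sqlplus session (levels 3 and 266 are common choices, not necessarily the exact ones used at the time):

```sql
SQL> oradebug setmypid
SQL> oradebug unlimit
-- -g all ships the dump request to every instance in the cluster
SQL> oradebug -g all hanganalyze 3
SQL> oradebug -g all dump systemstate 266
SQL> oradebug tracefile_name
```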

Hanganalyze Open chains found:
Chain 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
    <1/3369/17015/gc cr request>
-- <0/474/2770/0x4a9328f8/1061/enq: WF - contention>
-- <0/2347/1048/0x4a9250f8/1613/enq: WF - contention>

PROCESS 73:
----------------------------------------
SO: c000000c4a9250f8, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=73, calls cur/top: c000000c03bd1010/c000000c03bd1010, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 0
last post received-location: No post
last process to post me: none
last post sent: 0 0 94
last post sent-location: KJCS Post snd proxy to flush msg
last process posted by me: c000000c5193cb50 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: c000000c4aaa6108
O/S info: user: oracle, term: pts/1, ospid: 1613
OSD pid info: Unix process pid: 1613, image: oracle@RX9800_1 (TNS V1-V3)
Short stack dump:
ksdxfstk()+48<-ksdxcb()+1200<-sspuser()+368<-<-_pw_wait()+48<-pw_wait()+128<-sskgpwwait()+384<-skgpwwait()+208<-ksliwat()+1728<-kslwaitns_timed()+112<-kskthbwt()+400<-kslwait()+640<-kjusuc()+6896<-ksipgetctx()+2336<-$cold_ksqcmi()+34560<-ksqgtlctx()+4288<-ksqgelctx()+928<-kewrfsc_flush_snapshot_c()+896<-kewrpcs_create_snapshot_i()+960<-kewrpcs_create_snapshot()+1632<-$cold_spefcpfa()+1248<-spefmccallstd()+1008<-pextproc()+128<-peftrusted()+288<-psdexsp()+448<-rpiswu2()+976<-psdextp()+816<-pefccal()+1120<-pefcal()+432<-pevm_FCAL()+288<-pfrinstr_FCAL()+176<-pfrrun_no_tool()+192<-pfrrun()+1376<-plsql_run()+1328<-peicnt()+560<-kkxexe()+1008<-opiexe()+13520<-kpoal8()+7488<-opiodr()+2144<-ttcpip()+1680<-opitsk()+2368<-opiino()+1664<-opiodr()+2144<-opidrv()+1248<-sou2o()+240<-opimai_real()+496<-main()+240<-main_opd_entry()+80
----------------------------------------
SO: c000000c4cdf9e98, type: 4, owner: c000000c4a9250f8, flag: INIT/-/-/0x00
(session) sid: 2347 trans: 0000000000000000, creator: c000000c4a9250f8, flag: (8000041) USR/- BSY/-/-/-/-/-
DID: 0001-0049-0000000D, short-term DID: 0001-0049-0000000E
txn branch: 0000000000000000
oct: 47, prv: 0, sql: c000000bedba78b0, psql: c000000bec13fc98, user: 0/SYS
service name: SYS$USERS
O/S info: user: oracle, term: pts/1, ospid: 1612, machine: RX9800_1
program: sqlplus@RX9800_1 (TNS V1-V3)
application name: sqlplus@RX9800_1 (TNS V1-V3), hash value=1131274191
waiting for 'enq: WF - contention' wait_time=0, seconds since wait started=114614
name|mode=57460006, 0=0, 0=0
blocking sess=0x0000000000000000 seq=13
Dumping Session Wait History
for 'enq: WF - contention' count=1 wait_time=1.503917 sec
name|mode=57460006, 0=0, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498044 sec
name|mode=57460006, 0=0, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498043 sec
name|mode=57460006, 0=0, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498042 sec
name|mode=57460006, 0=0, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498043 sec
name|mode=57460006, 0=0, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498043 sec
name|mode=57460006, 0=0, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498047 sec
name|mode=57460006, 0=0, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498038 sec
name|mode=57460006, 0=0, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498044 sec
name|mode=57460006, 0=0, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498043 sec
name|mode=57460006, 0=0, 0=0
Sampled Session History of session 2347 serial 1048
---------------------------------------------------
----------------------------------------
SO: c000000c470d69a0, type: 5, owner: c000000c03bd1010, flag: INIT/-/-/0x00
(enqueue) WF-00000000-00000000  DID: 0001-0049-0000000D
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x3
req: X, lock_flag: 0x0, lock: 0xc000000c470d69c0, res: 0xc000000c492ccaa0

PROCESS 181:
----------------------------------------
SO: c000000c4a9328f8, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=181, calls cur/top: c000000c009736e0/c000000bf2d00570, flag: (2) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 140
last post received-location: kclrcvt
last process to post me: c000000c5193d350 1 6
last post sent: 0 0 94
last post sent-location: KJCS Post snd proxy to flush msg
last process posted by me: c000000c5193cb50 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: c000000c4aaa6108
O/S info: user: oracle, term: UNKNOWN, ospid: 1061
OSD pid info: Unix process pid: 1061, image: oracle@RX9800_1 (m000)
Short stack dump:
ksdxfstk <- ksdxcb <- sspuser <- kernel <- _pw_wait <- pw_wait <- sskgpwwait <- skgpwwait <- ksliwat <- kslwaitns_timed <- kskthbwt <- kslwait <- kjusuc <- ksipgetctx <- cold_ksqcmi <- ksqgtlctx <- ksqgelctx <- kewrgtl_get_table_lock <- kewrftec_flush_table_ehdlcx <- kewrfat_flush_all_tables <- kewrfsr_flush_snapshot_r <- kewrrfs_remote_flush_slave <- kebm_slave_main <- ksvrdp <- opirip <- opidrv <- sou2o <- opimai_real <- main
----------------------------------------
SO: c000000c51b644e8, type: 4, owner: c000000c4a9328f8, flag: INIT/-/-/0x00
(session) sid: 474 trans: 0000000000000000, creator: c000000c4a9328f8, flag: (100051) USR/- BSY/-/-/-/-/-
DID: 0001-00B5-00000005, short-term DID: 0001-00B5-00000006
txn branch: 0000000000000000
oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'enq: WF - contention' wait_time=0, seconds since wait started=111931
name|mode=57460006, 0=46, 0=0
blocking sess=0x0000000000000000 seq=477
Dumping Session Wait History
for 'enq: WF - contention' count=1 wait_time=0.957029 sec
name|mode=57460006, 0=46, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498030 sec
name|mode=57460006, 0=46, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498022 sec
name|mode=57460006, 0=46, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498049 sec
name|mode=57460006, 0=46, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498037 sec
name|mode=57460006, 0=46, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498027 sec
name|mode=57460006, 0=46, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498028 sec
name|mode=57460006, 0=46, 0=0
for 'enq: WF - contention' count=1 wait_time=0.497620 sec
name|mode=57460006, 0=46, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498468 sec
name|mode=57460006, 0=46, 0=0
for 'enq: WF - contention' count=1 wait_time=0.498046 sec
name|mode=57460006, 0=46, 0=0
Sampled Session History of session 474 serial 2770
---------------------------------------------------
----------------------------------------
SO: c000000c470d30b8, type: 5, owner: c000000c009736e0, flag: INIT/-/-/0x00
(enqueue) WF-00000000-00000000  DID: 0001-00B5-00000005
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x3
mode: S, lock_flag: 0x0, lock: 0xc000000c470d30d8, res: 0xc000000c492ccaa0
----------------------------------------
SO: c000000c470d35f8, type: 5, owner: c000000c2ea906b0, flag: INIT/-/-/0x00
(enqueue) WF-00000046-00000000  DID: 0001-00B5-00000005
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x3
req: X, lock_flag: 0x0, lock: 0xc000000c470d3618, res: 0xc000000c492eb940

From node 1's diag trace, the hang chain shows that session 3369 on node 2 is blocking sessions 474 and 2347 on node 1. The systemstate output tells us more: session 2347 is the sqlplus session we launched, and it is requesting the enqueue WF-00000000-00000000 in X (exclusive) mode, while that resource is held in S (shared) mode by session 474 on node 1. Session 474, in turn, is requesting another enqueue, WF-00000046-00000000, in X mode.

Now let's look at the systemstate dump from node 2:

PROCESS 76:
----------------------------------------
SO: c000000c4b91f638, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=76, calls cur/top: c000000c01ab9ee0/c000000bff0e1598, flag: (2) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 151
last post received-location: kclcomm
last process to post me: c000000c4b916e38 1 6
last post sent: 0 0 94
last post sent-location: KJCS Post snd proxy to flush msg
last process posted by me: c000000c4b916e38 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: c000000c4aaa6108
O/S info: user: oracle, term: UNKNOWN, ospid: 21851 
OSD pid info: Unix process pid: 21851, image: oracle@RX9800_2 (m000)
Short stack dump: 
ksdxfstk <- ksdxcb <- sspuser <- kernel <- _poll_sys
<- _poll <- sskgxp_select <- skgxppost <- skgxpwait <- ksxpwait
<- cold_ksliwat <- kslwaitns_timed <- kskthbwt <- kslwait <- ksxprcv
<- kclwcrs <- kclgclks <- cold_kcbzib <- kcbgtcr <- ktrgtc
<- kdsgrp <- kdsfbrcb <- qertbFetchByRowID <- qerjoFetch <- qerflFetch
<- opifch2 <- opifch <- opiodr <- kpoodrc <- rpiswu2
<- kpoodr <- upirtrc <- kpurcsc <- kpufch0 <- kpufch
<- OCIStmtFetch <- kewrfae_fetch_attribute_exists <- kewrgwxf1_gwrsql_exft_1 <- kewrgwxf_gwrsql_exft <- kewrfabt_flush_attribute_table
<- kewrfspl_flush_sqlplan <- kewrft_flush_table <- kewrftec_flush_table_ehdlcx <- kewrftsq_flush_topsql <- kewrft_flush_table
<- kewrftec_flush_table_ehdlcx <- kewrfat_flush_all_tables <- kewrfos_flush_onesnap <- kewrfsc_flush_snapshot_c <- kewrafs_auto_flush_slave
<- kebm_slave_main <- ksvrdp <- opirip <- opidrv <- sou2o

----------------------------------------
SO: c000000c4af18338, type: 4, owner: c000000c4b91f638, flag: INIT/-/-/0x00
(session) sid: 3369 trans: 0000000000000000, creator: c000000c4b91f638, flag: (100051) USR/- BSY/-/-/-/-/-
DID: 0002-004C-000000E9, short-term DID: 0002-004C-000000EA
txn branch: 0000000000000000
oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'gc cr request' wait_time=0, seconds since wait started=0
file#=3, block#=a2b0, class#=1
blocking sess=0x0000000000000000 seq=12259
Dumping Session Wait History
for 'cr request retry' count=1 wait_time=0.000008 sec
file#=3, block#=a2b0, =0
for 'gc cr block lost' count=1 wait_time=2.022422 sec
=3, =a2b0, =1
for 'cr request retry' count=1 wait_time=0.000072 sec
file#=3, block#=a2b0, =0
for 'gc cr block lost' count=1 wait_time=0.712588 sec
=3, =a2b0, =1
for 'cr request retry' count=1 wait_time=0.000066 sec
file#=3, block#=a2b0, =0
for 'gc cr block lost' count=1 wait_time=1.034585 sec
=3, =a2b0, =1
for 'cr request retry' count=1 wait_time=0.000079 sec
file#=3, block#=a2b0, =0
for 'gc cr block lost' count=1 wait_time=1.112753 sec
=3, =a2b0, =1
for 'cr request retry' count=1 wait_time=0.000008 sec
file#=3, block#=a2b0, =0
for 'gc cr block lost' count=1 wait_time=0.763256 sec
=3, =a2b0, =1
Sampled Session History of session 3369 serial 17015
---------------------------------------------------

----------------------------------------
SO: c000000c470d35f8, type: 5, owner: c000000c2ea50bd0, flag: INIT/-/-/0x00
(enqueue) WF-00000046-00000000	DID: 0002-004C-000000E9 
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x3
mode: X, lock_flag: 0x0, lock: 0xc000000c470d3618, res: 0xc000000c4930b928 SO: c000000c470d30b8, type: 5, owner: c000000bfcb27810, flag: INIT/-/-/0x00
(enqueue) WF-00000000-00000000	DID: 0002-004C-000000E9 lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x3
mode: S, lock_flag: 0x0, lock: 0xc000000c470d30d8, res: 0xc000000c4932ec00

Here we can see that session 3369 on node 2 holds resource WF-00000046-00000000 in X mode, and also holds WF-00000000-00000000 in S mode. Session 3369 itself is waiting on gc cr request, and its wait history shows cr request retry and gc cr block lost. So the picture is now clear. To summarize:

  • On node 1 we launched a sqlplus session (sid 2347) to call dbms_workload_repository to create a snapshot.
  • Node 1 and node 2 each spawned an m000 slave process to carry out the flush work: sid 474 on node 1, sid 3369 on node 2.
  • Node 1's m000 needs to acquire resource WF-00000046-00000000 in exclusive (X) mode, but that resource is held in X mode by node 2's m000.
  • Since node 1's m000 cannot obtain WF-00000046-00000000, it waits on enq: WF - contention. Node 2's m000, in turn, needs node 1 to ship it a gc block, so it waits on gc cr request; something went wrong during that transfer, which is why gc cr block lost waits appear.
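For completeness, the segment behind the contended block (obj 8990, datafile 3 block 41648, i.e. 0xa2b0 in the trace) can be identified from the data dictionary once the database is responsive again; a sketch:

```sql
-- Map the data object id from the trace back to a segment.
select owner, object_name, object_type
from   dba_objects
where  data_object_id = 8990;

-- Cross-check using the file#/block# (41648 decimal = 0xa2b0 hex).
select owner, segment_name, segment_type
from   dba_extents
where  file_id = 3
and    41648 between block_id and block_id + blocks - 1;
```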

For a problem like this, there are really only two possible causes:

1. We hit a bug.
2. There is a problem with interconnect (network heartbeat) transmission.

So the way forward is simple as well:

1. Search MOS with this information. I found a bug that looks similar, but its description does not quite match our situation: Bug 9627113 : DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT HANGS DUE TO 'GC CR BLOCK LOST'

2. The other possible cause is a network problem; for issues like this it is best to deploy OSWatcher for ongoing monitoring. For network diagnostics, refer to: Troubleshooting gc block lost and Poor Network Performance in a RAC Environment (Doc ID 563566.1)
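To quantify the lost-block symptom that Doc ID 563566.1 addresses, the gc loss statistics and the configured interconnect can also be checked from SQL; a sketch (statistic names as they appear in 10g):

```sql
-- Steadily increasing values here point at interconnect trouble.
select inst_id, name, value
from   gv$sysstat
where  name in ('gc cr blocks lost', 'gc current blocks lost')
order by inst_id, name;

-- Verify which interface the interconnect traffic is actually using.
select inst_id, name, ip_address, is_public, source
from   gv$cluster_interconnects;
```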
