ORA-00494: enqueue [CF] held for too long (more than 900 seconds)问题分析

数据库在下午13点05分左右爆发了ORA-00494: enqueue [CF] held for too long (more than 900 seconds)问题,最终导致数据库异常宕机,通过对日志的分析发现下列信息:

Errors in file /u01/oracle/admin/cppicids/bdump/cppicids_lgwr_168278.trc: ORA-00494: enqueue [CF] held for too long (more than 900 seconds) (more than 900 seconds) by inst 1, osid 270814')
Thu Jun 11 13:05:21 2015System State dumped to trace file /u01/oracle/admin/cppicids/bdump/cppicids_lgwr_168278.trcKilling enqueue blocker (pid=270814) on resource CF-00000000-00000000 by killing session 882.1Killing enqueue blocker (pid=270814) on resource CF-00000000-00000000 by terminating the processLGWR: terminating instance due to error 2103

从中我们可以看到进程270814持有cf锁达到900秒,最终引发lgwr终止了实例,并抛出了错误代码2103。我们进一步分析lgwr的trace,发现下列信息。

SO: 700000e74329590, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=5, calls cur/top: 700000e754de920/700000e754de920, flag: (6) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 24
              last post received-location: ksasnd
              last process to post me: 700000e74329d70 1 6
              last post sent: 0 0 24
              last post sent-location: ksasnd
              last process posted by me: 700000e74329d70 1 6
  (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: 700000e76341580
    O/S info: user: oracle, term: UNKNOWN, ospid: 270814 
    OSD pid info: Unix process pid: 270814, image: oracle@CXIDSDBBAK (DBW0)
Dump of memory from 0x0700000E7430F978 to 0x0700000E7430FB80
SO: 700000e75474ba0, type: 4, owner: 700000e74329590, flag: INIT/-/-/0x00
  (session) sid: 882 trans: 0, creator: 700000e74329590, flag: (51) USR/- BSY/-/-/-/-/-
            DID: 0001-0005-00000005, short-term DID: 0000-0000-00000000
            txn branch: 0
            oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
  waiting for 'kcbzps' blocking sess=0x0 seq=64993 wait_time=0 seconds since wait started=60
          =0, =0, =0
  Dumping Session Wait History
   for 'kcbzps' count=1 wait_time=1102055
          =0, =0, =0
   for 'kcbzps' count=1 wait_time=97666
          =0, =0, =0
   for 'kcbzps' count=1 wait_time=97670
          =0, =0, =0
   for 'kcbzps' count=1 wait_time=97666
          =0, =0, =0
   for 'kcbzps' count=1 wait_time=97666
          =0, =0, =0
   for 'kcbzps' count=1 wait_time=97666
          =0, =0, =0
   for 'kcbzps' count=1 wait_time=97666
          =0, =0, =0
   for 'kcbzps' count=1 wait_time=97666
          =0, =0, =0
   for 'kcbzps' count=1 wait_time=97671
          =0, =0, =0
   for 'kcbzps' count=1 wait_time=97666

我们可以看到270814进程是dbw0进程,也就是数据写入进程。它的历史和现在的等待事件都是’kcbzps’,等待的时间非常之长,也就是说dbw0进程一直在等待。
我们在看一下dbw0持有的队列信息。

    SO: 700000e7350ce88, type: 5, owner: 700000e754de920, flag: INIT/-/-/0x00
      (enqueue) CF-00000000-00000000    DID: 0001-0005-00000005
      lv: 02 0f dd 46 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2
      res: 0x700000e764b2458, mode: S, lock_flag: 0x0
      own: 0x700000e75474ba0, sess: 0x700000e75474ba0, proc: 0x700000e74329590, prv: 0x700000e764b2468

  SO: 700000e735123b0, type: 5, owner: 700000e000040d8, flag: INIT/-/-/0x00
      (enqueue) CF-00000000-00000000    DID: 0001-00B9-0000C7C5
      lv: 02 0f dd 46 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2
      res: 0x700000e764b2458, req: SSX, lock_flag: 0x0
      own: 0x700000e74468560, sess: 0x700000e74468560, proc: 0x700000e7433f7f0, prv: 0x700000e764b2478

我们可以看到我们的dbw0进程以s模式持有了CF-00000000-00000000锁,而另外一个进程想以X模式来持有它。我们来看看是什么进程。

 (session) sid: 659 trans: 0, creator: 700000e7433f7f0, flag: (c0000041) USR/- BSY/-/-/-/-/-
              DID: 0001-00B9-0000C7C5, short-term DID: 0000-0000-00000000
              txn branch: 0
              oct: 47, prv: 0, sql: 700000e47f92088, psql: 700000e47f92088, user: 0/SYS
    service name: SYS$USERS
    O/S info: user: oracle, term: , ospid: 1798860, machine: CXIDSDBBAK
              program: rman@CXIDSDBBAK (TNS V1-V3)
    application name: rman@CXIDSDBBAK (TNS V1-V3), hash value=3432807229
    action name: 0000001 FINISHED70, hash value=706507012
    waiting for 'enq: CF - contention' blocking sess=0x700000e75474ba0 seq=131 wait_time=0 seconds since wait started=0
                name|mode=43460005, 0=0, operation=0
    Dumping Session Wait History
     for 'enq: CF - contention' count=1 wait_time=2929718
                name|mode=43460005, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929703
                name|mode=43460005, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929700
                name|mode=43460005, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929699
                name|mode=43460005, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929699
                name|mode=43460005, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929709
                name|mode=43460005, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929702
                name|mode=43460005, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929700
                name|mode=43460005, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929700
                name|mode=43460005, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929701
                name|mode=43460005, 0=0, operation=0

通过trace我们发现是用户发起的rman进程需要以x模式获取cf锁,而此时我们的cf锁是被dbw0以s模式持有着,正常情况是能释放的,而dbw0进程因为等待’kcbzps’导致无法正常释放的。此时我们前台的rman进程的等待事件是enq: CF – contention。
一、 解决办法
根据错误号和相关的等待事件’kcbzps’我们可以发现文档ORA-00494: enqueue [CF] held for too long (more than 900 seconds); LGWR:Terminating Instance Due to Error 2103 (Doc ID 1089733.1)是比较符合我们的情况的
该问题提到了4中情况:
1) Starts with space error and file offline
2) Sessions are waiting for CF enqueue
3) Arguments ksliwat   kslwaitns_timed  kskthbwt  kslwait… match that in the call stack of bug
4) There is a wait on kcbzps and occurs with DBWR
我们和第四种情况匹配,为了进一步证实这个问题,我们在通过文档提到的堆栈来进行对比。
如下图所示,文档中提到的堆栈如下:
123qaz
我们trace中相关堆栈如下图:
123wsx
通过对比,堆栈是一致的。也就是印证了这个问题。
那么解决办法是安装Apply Patch 7411568。而这个数据库的PSU较高,最好先检查PSU中是否包含该补丁,如果包含此补丁,需要确认补丁相关的文件是否损坏,损坏了可以重新单独对这个补丁安装,如果没有可以立即安装这个补丁。

分享到: 更多