ORA-00600: 内部错误代码, 参数: [4442], [0x7000008A7C22EF8], [53]问题分析

数据库突然异常宕机。查看后台alert日志,PMON进程出现致命错误,报ORA-00600[4442]错误。

Thu Jan 22 11:04:46 2015 
Thread 1 advanced to log sequence 442095 
Current log# 8 seq# 442095 mem# 0: /p09bp_mm/mm_u01/mm/redo08_a.log 
Current log# 8 seq# 442095 mem# 1: /p09bp_mm/mm_u02/mm/redo08_b.log 
Thu Jan 22 11:07:27 2015 
Errors in file /u01/oracle/admin/mm/bdump/mm_pmon_1757288.trc: 
ORA-00600: 内部错误代码, 参数: [4442], [0x7000008A7C22EF8], [53], [], [], [], [], [] 
Thu Jan 22 11:07:30 2015 
Errors in file /u01/oracle/admin/mm/bdump/mm_pmon_1757288.trc: 
ORA-00600: 内部错误代码, 参数: [4442], [0x7000008A7C22EF8], [53], [], [], [], [], [] 
Thu Jan 22 11:07:30 2015 
PMON: terminating instance due to error 472 
Termination issued to instance processes. Waiting for the processes to exit 
Thu Jan 22 11:07:40 2015 
Instance termination failed to kill one or more processes 
Instance terminated by PMON, pid = 1757288 
Thu Jan 22 11:23:06 2015 
Starting ORACLE instance (normal) 
LICENSE_MAX_SESSION = 0 
LICENSE_SESSIONS_WARNING = 0 
Picked latch-free SCN scheme 3 
Autotune of undo retention is turned on.

为了进一步研究PMON进程为何突发异常,我们需要研究PMON的Trace文件。

PROCESS STATE 
------------- 
Process global information: 
process: 7000008ca28f5b8, call: 7000008cb3d54e8, xact: 0, curses: 7000008c93d0c10, usrses: 7000008c93d0c10 
---------------------------------------- 
SO: 7000008ca28f5b8, type: 2, owner: 0, flag: INIT/-/-/0x00 
(process) Oracle pid=2, calls cur/top: 7000008cb3d54e8/7000008cb3d54e8, flag: (e) SYSTEM 
int error: 0, call error: 0, sess error: 0, txn error 0 
(post info) last post received: 0 0 15 
last post received-location: ksupsc 
last process to post me: 7000008ca28fd98 1 6 
last post sent: 0 0 147 
last post sent-location: ktmpsm 
last process posted by me: 7000008cf28ba40 1 22 
(latch info) wait_event=0 bits=40 
holding (efd=4) 70000001001c670 transaction branch allocation level=6 
Location from where latch is held: ktcxdp: 
Context saved from call: 0 
state=busy, wlstate=free 
waiters [orapid (seconds since: put on list, posted, alive check)]: 
213 (0, 1421896047, 0) 
190 (0, 1421896047, 0) 
134 (0, 1421896047, 0) 
41 (0, 1421896047, 0) 
113 (0, 1421896047, 0) 
146 (0, 1421896047, 0) 
131 (0, 1421896047, 0) 
137 (0, 1421896047, 0) 
32 (0, 1421896047, 0) 
waiter count=9 
Process Group: DEFAULT, pseudo proc: 7000008ce30e4d0 
O/S info: user: oracle, term: UNKNOWN, ospid: 1757288 
OSD pid info: Unix process pid: 1757288, image: oracle@CXSFDB1 (PMON) 

可以看到PMON进程的trace显示Location from where latch is held: ktcxdp。也就是在执行函数ktcxdp的时候出现了latch的阻塞。导致pmon进程被堵塞。而pmon的进一步堵塞导致了9个会话被堵。

而Trace里面的函数堆栈显示:

----- Call Stack Trace ----- 
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              0FFFFFFFF ? 000000020 ?
ksedmp+0290          bl       ksedst               104811CB8 ?
ksfdmp+0018          bl       03F40AE0             
kgeriv+0108          bl       _ptrgl               
kgesiv+0080          bl       kgeriv               1048E451C ? 1048E451C ?
                                                   1048E451C ? 1048E451C ?
                                                   1048E451C ?
ksesic2+0060         bl       kgesiv               000000002 ? 054C0696F ?
                                                   000000001 ? 70000001001C670 ?
                                                   1104D3E80 ?
ktcxdp+0344          bl       ksesic2              115A0000115A ? 000000002 ?
                                                   7000008A7C22EF8 ? 000000000 ?
                                                   000000035 ? 00011D1D3 ?
                                                   000000000 ? 000000000 ?
ksucln+0938          bl       ktcxdp               
ksbrdp+03e0          bl       _ptrgl               
opirip+03fc          bl       01FAFCBC             
opidrv+0448          bl       opirip               11029E590 ? 41029FED0 ?
                                                   FFFFFFFFFFFF6D0 ?
sou2o+0090           bl       opidrv               32031840FC ? 4A0170928 ?
                                                   FFFFFFFFFFFF6D0 ?
opimai_real+0150     bl       01FAB1B4             
main+0098            bl       opimai_real          000000000 ? 000000000 ?
__start+0098         bl       main                 000000000 ? 000000000 ?

ksedst <- ksedmp <- ksfdmp <- kgeriv <- kgesiv 
<- ksesic2 <- ktcxdp <- ksucln <- ksbrdp <- opirip 
<- opidrv <- sou2o <- opimai_real <- main <- start 

通过metalink note进行搜索,找到了文档Database Crashed After ORA-600 [4442] (文档 ID 368877.1)里面的文档中的Call Stack和我们的类似。而文档中指出的Call Stack如下所示:

ORA-600 [4442] is signalled by PMON, with the stack trace:
ktcxdp <- ksucln <- ksbrdp <- opirip <- opidrv <- sou2o

因此我们可以确定命中了Bug 4891103 'INSTANCE DOWN OWING TO ORA-600 [4442]'. This is due to PMON cleaning up a distributed transaction incorrectly.那么这个问题就出在我们的PMON去清理分布式事务的时候触发的。为了进一步确认分布式事务引起的问题,我们还在PMON的trace中看到下列信息。

B0BB3C91:3BDB1F4E    12  1315 10254  71 KSBCTI: (RECO) : (interrupt action) : acnum=[1] comment=[KSB action for X-instance calls]
B0BB3C9D:3BDB1F4F    12  1315 10254  71 KSBCTI: (RECO) : (interrupt action) : acnum=[63] comment=[Scumnt mount lock]
B0BB3C9E:3BDB1F50    12  1315 10254  71 KSBCTI: (RECO) : (interrupt action) : acnum=[64] comment=[Poll system events broadcast channel]
B0BB3CA1:3BDB1F51    12  1315 10254  73 KSBCTI: (RECO) : (timeout action)   : acnum=[0] comment=[Monitor Cleanup]
B0BB3CD9:3BDB1F52    12  1315 10254  73 KSBCTI: (RECO) : (timeout action)   : acnum=[119] comment=[distributed recovery wakeup]
B0BB3E0F:3BDB1F53    12  1315 10254  71 KSBCTI: (RECO) : (interrupt action) : acnum=[1] comment=[KSB action for X-instance calls]

这里可以看到我们的reco的进程也是被唤醒过的,恢复进程recoverer process (RECO)用于分布式数据库结构,自动解决分布式事务的错误。但是因为数据库异常宕机,通过trace文件无法定位到具体的分布式SQL语句。
问题解决

  1. 建议安装单独的patch 4891103来规避这个问题.
  2. 目前版本是10.2.0.2,在10.2.0.3上已经修复该问题,建议升级到10.2.0.5以上的版本.
  3. 通过dba_2pc_pengding视图监控相关的分布式事务,找出引发异常2pc事务的源头。
  4. 文档126069.1给出了一个手动解决有问题的分布式事务的方法。NOTE:126069.1 - Manually Resolving In-Doubt Transactions: Different Scenarios
分享到: 更多