The database suddenly crashed. Checking the background alert log showed that the PMON process had hit a fatal error, reporting ORA-00600 [4442].
Thu Jan 22 11:04:46 2015
Thread 1 advanced to log sequence 442095
  Current log# 8 seq# 442095 mem# 0: /p09bp_mm/mm_u01/mm/redo08_a.log
  Current log# 8 seq# 442095 mem# 1: /p09bp_mm/mm_u02/mm/redo08_b.log
Thu Jan 22 11:07:27 2015
Errors in file /u01/oracle/admin/mm/bdump/mm_pmon_1757288.trc:
ORA-00600: internal error code, arguments: [4442], [0x7000008A7C22EF8], [53], [], [], [], [], []
Thu Jan 22 11:07:30 2015
Errors in file /u01/oracle/admin/mm/bdump/mm_pmon_1757288.trc:
ORA-00600: internal error code, arguments: [4442], [0x7000008A7C22EF8], [53], [], [], [], [], []
Thu Jan 22 11:07:30 2015
PMON: terminating instance due to error 472
Termination issued to instance processes. Waiting for the processes to exit
Thu Jan 22 11:07:40 2015
Instance termination failed to kill one or more processes
Instance terminated by PMON, pid = 1757288
Thu Jan 22 11:23:06 2015
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
To understand why the PMON process failed so abruptly, we need to examine the PMON trace file.
PROCESS STATE
-------------
Process global information:
     process: 7000008ca28f5b8, call: 7000008cb3d54e8, xact: 0, curses: 7000008c93d0c10, usrses: 7000008c93d0c10
  ----------------------------------------
  SO: 7000008ca28f5b8, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=2, calls cur/top: 7000008cb3d54e8/7000008cb3d54e8, flag: (e) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 15
              last post received-location: ksupsc
              last process to post me: 7000008ca28fd98 1 6
              last post sent: 0 0 147
              last post sent-location: ktmpsm
              last process posted by me: 7000008cf28ba40 1 22
    (latch info) wait_event=0 bits=40
      holding    (efd=4) 70000001001c670 transaction branch allocation level=6
        Location from where latch is held: ktcxdp: Context saved from call: 0
        state=busy, wlstate=free
          waiters [orapid (seconds since: put on list, posted, alive check)]:
           213 (0, 1421896047, 0)
           190 (0, 1421896047, 0)
           134 (0, 1421896047, 0)
           41 (0, 1421896047, 0)
           113 (0, 1421896047, 0)
           146 (0, 1421896047, 0)
           131 (0, 1421896047, 0)
           137 (0, 1421896047, 0)
           32 (0, 1421896047, 0)
          waiter count=9
    Process Group: DEFAULT, pseudo proc: 7000008ce30e4d0
    O/S info: user: oracle, term: UNKNOWN, ospid: 1757288
    OSD pid info: Unix process pid: 1757288, image: oracle@CXSFDB1 (PMON)
The PMON trace shows "Location from where latch is held: ktcxdp", meaning PMON acquired the 'transaction branch allocation' latch inside the function ktcxdp and then got stuck there. Because PMON never released the latch, nine other processes (waiter count=9) ended up blocked behind it.
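On a live, hung instance this kind of latch pile-up can usually also be confirmed from the dynamic performance views. The queries below are only an illustrative sketch: the latch name 'transaction branch allocation' is taken from the trace above, everything else is generic.

-- Who currently holds the latch reported in the trace?
SELECT h.pid, h.sid, h.laddr, h.name
  FROM v$latchholder h
 WHERE h.name = 'transaction branch allocation';

-- How contended is that latch overall?
SELECT l.name, l.gets, l.misses, l.sleeps
  FROM v$latch l
 WHERE l.name = 'transaction branch allocation';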
The call stack in the trace shows:
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst+001c          bl       ksedst1              0FFFFFFFF ? 000000020 ?
ksedmp+0290          bl       ksedst               104811CB8 ?
ksfdmp+0018          bl       03F40AE0
kgeriv+0108          bl       _ptrgl
kgesiv+0080          bl       kgeriv               1048E451C ? 1048E451C ?
                                                   1048E451C ? 1048E451C ?
                                                   1048E451C ?
ksesic2+0060         bl       kgesiv               000000002 ? 054C0696F ?
                                                   000000001 ? 70000001001C670 ?
                                                   1104D3E80 ?
ktcxdp+0344          bl       ksesic2              115A0000115A ? 000000002 ?
                                                   7000008A7C22EF8 ? 000000000 ?
                                                   000000035 ? 00011D1D3 ?
                                                   000000000 ? 000000000 ?
ksucln+0938          bl       ktcxdp
ksbrdp+03e0          bl       _ptrgl
opirip+03fc          bl       01FAFCBC
opidrv+0448          bl       opirip               11029E590 ? 41029FED0 ?
                                                   FFFFFFFFFFFF6D0 ?
sou2o+0090           bl       opidrv               32031840FC ? 4A0170928 ?
                                                   FFFFFFFFFFFF6D0 ?
opimai_real+0150     bl       01FAB1B4
main+0098            bl       opimai_real          000000000 ? 000000000 ?
__start+0098         bl       main                 000000000 ? 000000000 ?

ksedst <- ksedmp <- ksfdmp <- kgeriv <- kgesiv <- ksesic2 <- ktcxdp <- ksucln
 <- ksbrdp <- opirip <- opidrv <- sou2o <- opimai_real <- main <- start
A search on Metalink (MOS) turned up the note Database Crashed After ORA-600 [4442] (Doc ID 368877.1), whose call stack is similar to ours. The call stack given in the note is as follows:
ORA-600 [4442] is signalled by PMON, with the stack trace:
ktcxdp <- ksucln <- ksbrdp <- opirip <- opidrv <- sou2o
We can therefore conclude that we hit Bug 4891103 'INSTANCE DOWN OWING TO ORA-600 [4442]': "This is due to PMON cleaning up a distributed transaction incorrectly." In other words, the error was triggered while PMON was cleaning up a distributed transaction. To further confirm that a distributed transaction was involved, we also found the following entries in the PMON trace:
B0BB3C91:3BDB1F4E 12 1315 10254 71  KSBCTI: (RECO) : (interrupt action) : acnum=[1] comment=[KSB action for X-instance calls]
B0BB3C9D:3BDB1F4F 12 1315 10254 71  KSBCTI: (RECO) : (interrupt action) : acnum=[63] comment=[Scumnt mount lock]
B0BB3C9E:3BDB1F50 12 1315 10254 71  KSBCTI: (RECO) : (interrupt action) : acnum=[64] comment=[Poll system events broadcast channel]
B0BB3CA1:3BDB1F51 12 1315 10254 73  KSBCTI: (RECO) : (timeout action) : acnum=[0] comment=[Monitor Cleanup]
B0BB3CD9:3BDB1F52 12 1315 10254 73  KSBCTI: (RECO) : (timeout action) : acnum=[119] comment=[distributed recovery wakeup]
B0BB3E0F:3BDB1F53 12 1315 10254 71  KSBCTI: (RECO) : (interrupt action) : acnum=[1] comment=[KSB action for X-instance calls]
Here we can see that the RECO process had also been woken up ("distributed recovery wakeup"). The recoverer process (RECO) is used in distributed database configurations to automatically resolve failed distributed transactions. However, because the instance crashed, the trace files alone cannot pinpoint the specific distributed SQL statement involved.
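For reference, once the instance is back up, the state of the recoverer process and of distributed recovery can be checked and controlled from SQL. This is only a minimal sketch using standard views and commands, not something taken from this case's trace:

-- Confirm the recoverer process (RECO) is running
SELECT pid, spid, program
  FROM v$process
 WHERE program LIKE '%(RECO)%';

-- Distributed recovery can be paused/resumed while the 2PC problem is investigated
ALTER SYSTEM DISABLE DISTRIBUTED RECOVERY;
ALTER SYSTEM ENABLE DISTRIBUTED RECOVERY;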
Problem Resolution
- Install the one-off patch for Bug 4891103 to work around this problem.
- The current version is 10.2.0.2; the bug is fixed in 10.2.0.3, so upgrading to 10.2.0.5 or later is recommended.
- Monitor the relevant distributed transactions through the dba_2pc_pending view to track down the source of the problematic two-phase commit (2PC) transactions (see the sketch after this list).
- Note 126069.1 describes how to manually resolve problematic in-doubt distributed transactions: NOTE:126069.1 - Manually Resolving In-Doubt Transactions: Different Scenarios.
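To illustrate the last two recommendations, here is a minimal sketch of the kind of monitoring and manual resolution described in Note 126069.1. The transaction ID '1.23.456' is purely a placeholder, and whether to commit or roll back depends on the actual state of the in-doubt transaction:

-- List in-doubt distributed transactions
SELECT local_tran_id, global_tran_id, state, mixed, advice
  FROM dba_2pc_pending;

-- Force resolution of a specific in-doubt transaction (placeholder id)
ROLLBACK FORCE '1.23.456';
-- or: COMMIT FORCE '1.23.456';

-- After forcing, purge the entry from the pending tables
EXECUTE DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY('1.23.456');
COMMIT;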