一、问题背景
2015年1月16日早晨8点03分,数据库后台报ORA-00600[kddummy_blkchk]错误,系统显示SMON进程在恢复的死事务的过程中遇到了坏块,在多次尝试无效后,最终SMON出现致命错误导致系统宕机。
Fri Jan 16 08:03:19 2015 Errors in file /u01/oracle/admin/actdb/udump/actdb_ora_10355644.trc: ORA-00600: [kddummy_blkchk], [118], [639865], [6110], [], [], [], [] Fri Jan 16 08:03:19 2015 Corrupt Block Found TSN = 8, TSNAME = ACTUARY_DATA RFN = 118, BLK = 639865, RDBA = 495567737 OBJN = 1087274, OBJD = 1097591, OBJECT = TBL_TRAD_QID, SUBOBJECT = SEGMENT OWNER = ACTUARY, SEGMENT TYPE = Table Segment Fri Jan 16 08:34:25 2015 Errors in file /u01/oracle/admin/actdb/bdump/actdb_p006_65077728.trc: ORA-00600: internal error code, arguments: [kddummy_blkchk], [118], [639852], [6110], [], [], [], [] Fri Jan 16 08:34:27 2015 Doing block recovery for file 118 block 639852 Block recovery from logseq 90945, block 1260180 to scn 9106774485353 Fri Jan 16 08:35:03 2015 Errors in file /u01/oracle/admin/actdb/bdump/actdb_smon_45548262.trc: ORA-00600: internal error code, arguments: [kddummy_blkchk], [118], [639852], [6110], [], [], [], [] Fri Jan 16 08:35:04 2015 Errors in file /u01/oracle/admin/actdb/bdump/actdb_pmon_51053242.trc: ORA-00474: SMON process terminated with error
二、问题处理
出现该问题,首先可以看到SMON尝试对文件118的639852块做恢复。而在恢复的过程中会因为发现坏块,会导致宕库。所以建议首先设置10513事件去阻止SMON对死事务的恢复。
Fri Jan 16 09:37:45 2015 ALTER SYSTEM SET event='10513 trace name context forever,level 2' SCOPE=SPFILE;
设置完事件后重启数据库,发现SMON已经不在尝试去做恢复了,但是后台进程依然会报错。这里主要是检测到报检测的坏块的错误。
Fri Jan 16 09:42:28 2015 Corrupt Block Found TSN = 8, TSNAME = ACTUARY_DATA RFN = 118, BLK = 639852, RDBA = 495567724 OBJN = 1087274, OBJD = 1097591, OBJECT = TBL_TRAD_QID, SUBOBJECT = SEGMENT OWNER = ACTUARY, SEGMENT TYPE = Table Segment
于是考虑对该对象进行重建。在执行CTAS对表TBL_TRAD_QID进行重建的过程中无法完成,报如下错误:
Fri Jan 16 09:50:32 2015 Errors in file /u01/oracle/admin/actdb/udump/actdb_ora_42664692.trc: ORA-00600: internal error code, arguments: [kddummy_blkchk], [118], [639852], [6110], [], [], [], [] Fri Jan 16 09:54:39 2015 Errors in file /u01/oracle/admin/actdb/udump/actdb_ora_34472916.trc: ORA-00600: [kddummy_blkchk], [118], [639852], [6110], [], [], [], [] Fri Jan 16 09:54:45 2015 Corrupt Block Found TSN = 8, TSNAME = ACTUARY_DATA RFN = 118, BLK = 639852, RDBA = 495567724 OBJN = 0, OBJD = 1097591, OBJECT = , SUBOBJECT = SEGMENT OWNER = , SEGMENT TYPE = Invalid Type
此时可以看到错误发生在用户进程上面,而不是发生在SMON进程上,因此不会导致数据库宕机。但是因为这个表对于应用来说很重要,所以必须考虑重建进行修复。但是重建的时候还是会去读这个对象,读的过程同样会对表做逻辑检查,一旦做逻辑检查,就会出现上述错误,从而导致读取也会失败,报ORA-00600错误。所以为了能够正常的读取该对象,我们考虑关闭数据库的逻辑校验。修改下列两个参数。
SQL> alter system set db_block_checking=false; System altered. SQL> alter system set db_block_checksum=false; System altered.
修改参数成功后,再次运行CTAS重建表成功。把老的对象成功drop掉之后,将event 10513事件取消和两个校验参数重新设置回来后重启数据库仍然报错。
Fri Jan 16 11:18:05 2015 Errors in file /u01/oracle/admin/actdb/bdump/actdb_smon_51446784.trc: ORA-00600: internal error code, arguments: [kddummy_blkchk], [118], [639852], [6110], [], [], [], [] Doing block recovery for file 118 block 639852 Block recovery from logseq 90953, block 66187 to scn 9106774948071
此时SMON进程再一次尝试恢复该对象,但是这个对象在我们的数据库中已经被我们删除不存在了,经过分析发现表drop掉了,会放在回收站中,物理的对象仍然存在。为了彻底的解决这个问题,需要把对象彻底的purge删除。
purge table table_name
三、原因分析
从后台trace,我们可以看到是下列语句造成了该问题。
ksedmp: internal or fatal error ORA-00600: internal error code, arguments: [kddummy_blkchk], [118], [639852], [6110], [], [], [], [] ORA-00607: Internal error occurred while making a change to a data block ORA-00600: internal error code, arguments: [kddummy_blkchk], [118], [639865], [6110], [], [], [], [] Current SQL statement for this session: UPDATE Tbl_Trad_QID A SET QCCXRQ = (SELECT /*+PARALLEL(NT,3)*/ ACCTDATE FROM NREGCLM NT WHERE NT.BDH = A.BDH AND A.YXYQCBZW=1 AND ROWNUM = 1)
下列语句涉及到一组更新,在mos上查询下列错误能够找到下列bug。
ALERT: Bug 7662491 – Array Update can corrupt a row. ORA-600 [kghstack_free1] ORA-600 [kddummy_blkchk][6110/6129] (文档 ID 861965.1)
7662491: INSTANCE CRASH / ORA-600 [KDDUMMY_BLKCHK] HIT DURING RECOVER
根据文档的提示,能在trace中找到下列信息:
kdbchk: the amount of space used is not equal to block size used=7888 fsc=0 avsp=302 dtl=8064 rechecking block failed with error code 6110
同时文档还指出
The trace file shows that the error is produced by an UPDATE with OP:11.19 (Array Update) and check code [6110] ("kdbchk: the amount of space used is not equal to block size") or check code [6129] ("kdbchk: fsbo(<XX>) wrong, (hsz <YY>)"). Note that check codes are not limited to 6129/6110.
可以看到OP:11.19就是Array Update操作。我们在仔细的看Trace,我们发现下列信息。
CHANGE #1 TYP:0 CLS: 1 AFN:118 DBA:0x1d89c370 OBJ:1097591 SCN:0x0848.55de4efc SEQ: 2 OP:11.19
这里可以看到在Trace里面对象1097591做了OP:11.19(Array Update)的操作,命中了这个Bug。
强烈建议:安装补丁7662491。