ORA-27002 and ORA-00600: internal error code, arguments: [kcrrrfswda.11], [4], [368], [], [], [], [], []

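The environment in which this error occurred is as follows: the primary is a 10.2.0.4 RAC system, and the standby is a single-instance Logical Standby, also 10.2.0.4. The primary uses ASM for storage, while the standby uses a file system (FS). After the system has been running for a while, the primary reports ORA-27002 and the standby reports ORA-00600 [kcrrrfswda.11], [4], [368]. The errors look like this: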

1. Primary database
Sat Aug 18 21:32:51 2012
Errors in file /oracle/admin/gbps/bdump/gbps2_arc1_369590.trc:
ORA-00272: error writing archive log
SUCCESS: diskgroup ARCH was dismounted
FAL[server, ARC1]: FAL archive failed, see trace file.
Sat Aug 18 21:32:51 2012
Errors in file /oracle/admin/gbps/bdump/gbps2_arc1_369590.trc:
ORA-16055: FAL request rejected
ARCH: FAL archive failed. Archiver continuing
Sat Aug 18 21:32:51 2012
ORACLE Instance gbps2 - Archival Error. Archiver continuing.

On the standby, the following errors appear:

RFS[24]: Assigned to RFS process 389522
RFS[24]: Identified database type as 'logical standby'
Sat Aug 18 21:33:59 2012
RFS LogMiner: Client enabled and ready for notification
Sat Aug 18 21:33:59 2012
RFS LogMiner: RFS id [389522] assigned as thread [2] PING handler
Sat Aug 18 21:33:59 2012
Errors in file /home/oracle/app/admin/gbpsstd/udump/gbpsstd_rfs_389522.trc:
ORA-00600: internal error code, arguments: [kcrrrfswda.11], [4], [368], [], [], [], [], []
Redo Shipping Client Connected as PUBLIC

The standby's trace file shows the following:

*** ACTION NAME:() 2012-08-18 21:33:59.422
*** MODULE NAME:(oracle@p570b (TNS V1-V3)) 2012-08-18 21:33:59.422
*** SERVICE NAME:(gbpsstd) 2012-08-18 21:33:59.422
*** SESSION ID:(2179.157) 2012-08-18 21:33:59.422
RFS LogMiner [snc]: Encountered exception [604] while querying apply info.
Corrupt redo block 1247 detected: bad checksum

This tells us one thing: while the log was being applied, a "Corrupt redo block ... bad checksum" condition was detected. I also opened an SR with Oracle, and the reply was as follows:

The ORA-600 [kcrrrfswda.11] is a side effect of ora-00368 "checksum error in redo log block".
Oracle has detected an invalid checksum on a archived redo log transported from PRIMARY and reported the error.

This looks like an OS/network/hardware problem. The only reason Oracle raises an error is because of the checksum mismatch.
There is no evidence of Oracle (functionality) actually failing or causing the problem.

We should involve the OS/network/hardware vendor to investigate the problem. If you have a FIREWALL between PRIMARY and STANDBY, it should be reviewed.
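To make the SR's reasoning concrete, here is a minimal Python sketch of how a per-block checksum exposes damage picked up in transit. It uses CRC32 from `zlib` purely as a stand-in: it is not Oracle's redo checksum algorithm, and the block size, data, and helper names are assumptions made up for illustration.

```python
import zlib

BLOCK_SIZE = 512  # illustrative block size, not taken from the post


def checksum(block: bytes) -> int:
    """Stand-in checksum (CRC32); not the algorithm Oracle uses."""
    return zlib.crc32(block)


def find_corrupt_blocks(blocks, expected):
    """Return indexes of blocks whose checksum differs from the expected one."""
    return [i for i, (blk, chk) in enumerate(zip(blocks, expected))
            if checksum(blk) != chk]


# "Sender" side: four blocks and the checksums computed before shipping.
sent = [bytes([i]) * BLOCK_SIZE for i in range(4)]
sums = [checksum(blk) for blk in sent]

# "Receiver" side: a single bit in block 2 is flipped somewhere on the way.
received = list(sent)
damaged = bytearray(received[2])
damaged[10] ^= 0x01
received[2] = bytes(damaged)

print(find_corrupt_blocks(received, sums))  # [2]
```

The receiver can tell that block 2 changed, but not where the change happened, which is why a mismatch like this leads support to suspect the OS, network, or hardware along the path.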

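I also found a thread on OTN, https://forums.oracle.com/forums/thread.jspa?threadID=681766, describing exactly the same problem. From the reply above, Oracle believes this is probably an OS/network/hardware error, but I checked my OS, network, and hardware and found nothing wrong. Following the SR reply, I studied the note LOGMINER GENERATES CORRUPT REDO BLOCK DETECTED: BAD CHECKSUM [ID 751286.1]. Its solution section says that if LogMiner can successfully read a good copy of the block from an alternate log member, the LogMiner capture does not terminate, and the trace contains the following messages: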

Corrupt redo block <bno> detected: bad checksum
Rereading log member '<file_path>' (corruption)
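One way to check a trace file for this pair of messages is a short script like the one below. It is only a sketch: the regular expressions are derived from the two message templates above, the function name `scan_trace` is hypothetical, and it assumes that a "Rereading log member" line refers to the most recent corrupt block reported before it.

```python
import re

# Patterns built from the two message templates quoted from the note.
CORRUPT_RE = re.compile(r"Corrupt redo block (\d+) detected: bad checksum")
REREAD_RE = re.compile(r"Rereading log member '([^']*)' \(corruption\)")


def scan_trace(lines):
    """Return (block_number, reread_path_or_None) for each corrupt block found."""
    findings = []
    for line in lines:
        match = CORRUPT_RE.search(line)
        if match:
            findings.append([int(match.group(1)), None])
            continue
        match = REREAD_RE.search(line)
        # Assumption: a reread line belongs to the latest unmatched corrupt block.
        if match and findings and findings[-1][1] is None:
            findings[-1][1] = match.group(1)
    return [tuple(item) for item in findings]


# Lines taken from the standby trace shown earlier in this post.
sample = [
    "RFS LogMiner [snc]: Encountered exception [604] while querying apply info.",
    "Corrupt redo block 1247 detected: bad checksum",
]
print(scan_trace(sample))  # [(1247, None)]
```

A result of `None` in the second position means no reread message followed the corruption report. To scan a real file, pass an open file object, for example `scan_trace(open(path))`, where `path` is the trace file location.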

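Unfortunately, I did not find the "Rereading log member" message in my trace. The good news is that the business users told me that, despite the ORA-00600 and ORA-27002 errors, no business data was lost. So this remains an unsolved case. Later I made some small adjustments to the Logical Data Guard parameters, which greatly reduced how often the problem occurred; specifically, I tuned MAX_SGA, MAX_SERVERS, APPLY_SERVERS, and related parameters listed in dba_logstdby_parameters. The issue is still being followed up, and the customer is currently checking the firewall.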
