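This error occurred in the following environment: the primary is a 10.2.0.4 RAC system, and the standby is a single-instance Logical Standby, also on 10.2.0.4. The primary stores its files on ASM, while the standby uses an ordinary file system (FS). After the system has been running for a while, the primary reports ORA-27002 and the standby reports ORA-00600 [kcrrrfswda.11], [4], [368]. The errors look like this: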
1. primary database
Sat Aug 18 21:32:51 2012
Errors in file /oracle/admin/gbps/bdump/gbps2_arc1_369590.trc:
ORA-00272: error writing archive log
SUCCESS: diskgroup ARCH was dismounted
FAL[server, ARC1]: FAL archive failed, see trace file.
Sat Aug 18 21:32:51 2012
Errors in file /oracle/admin/gbps/bdump/gbps2_arc1_369590.trc:
ORA-16055: FAL request rejected
ARCH: FAL archive failed. Archiver continuing
Sat Aug 18 21:32:51 2012
ORACLE Instance gbps2 - Archival Error. Archiver continuing.
On the standby, the following errors appeared:
RFS[24]: Assigned to RFS process 389522
RFS[24]: Identified database type as 'logical standby'
Sat Aug 18 21:33:59 2012
RFS LogMiner: Client enabled and ready for notification
Sat Aug 18 21:33:59 2012
RFS LogMiner: RFS id [389522] assigned as thread [2] PING handler
Sat Aug 18 21:33:59 2012
Errors in file /home/oracle/app/admin/gbpsstd/udump/gbpsstd_rfs_389522.trc:
ORA-00600: internal error code, arguments: [kcrrrfswda.11], [4], [368], [], [], [], [], []
Redo Shipping Client Connected as PUBLIC
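When searching Oracle Support for an ORA-00600, the bracketed arguments are usually what you search on, so it can help to pull them out of the alert log mechanically. The following is only a minimal sketch: the function names `ora_codes` and `ora600_args` are made up for illustration, and the sample text is copied from the alert log excerpt above.

```python
import re

# Matches any five-digit ORA- error code, e.g. ORA-00600.
ORA_RE = re.compile(r"ORA-(\d{5})")
# Captures the rest of the line after the standard ORA-00600 prefix.
ORA600_ARGS_RE = re.compile(r"ORA-00600: internal error code, arguments: (.*)")


def ora_codes(text):
    """Return the distinct ORA- codes in order of first appearance."""
    seen = []
    for code in ORA_RE.findall(text):
        name = "ORA-" + code
        if name not in seen:
            seen.append(name)
    return seen


def ora600_args(text):
    """Return the non-empty bracketed arguments of the first ORA-00600 line."""
    match = ORA600_ARGS_RE.search(text)
    if not match:
        return []
    return [arg for arg in re.findall(r"\[([^\]]*)\]", match.group(1)) if arg]


# Sample lines taken from the standby alert log shown above.
sample = (
    "Errors in file /home/oracle/app/admin/gbpsstd/udump/gbpsstd_rfs_389522.trc:\n"
    "ORA-00600: internal error code, arguments: "
    "[kcrrrfswda.11], [4], [368], [], [], [], [], []\n"
)

print(ora_codes(sample))
print(ora600_args(sample))
```

In practice you would read the whole alert log from disk and pass its text to these functions; the empty `[]` placeholders are dropped so that only the meaningful arguments remain.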
Looking at the trace file on the standby, we can see the following:
*** ACTION NAME:() 2012-08-18 21:33:59.422
*** MODULE NAME:(oracle@p570b (TNS V1-V3)) 2012-08-18 21:33:59.422
*** SERVICE NAME:(gbpsstd) 2012-08-18 21:33:59.422
*** SESSION ID:(2179.157) 2012-08-18 21:33:59.422
RFS LogMiner [snc]: Encountered exception [604] while querying apply info.
Corrupt redo block 1247 detected: bad checksum
The trace tells us one thing: while the log was being applied, a "Corrupt redo block ... bad checksum" condition was detected. I also opened an SR with Oracle, and the reply was as follows:
The ORA-600 [kcrrrfswda.11] is a side effect of ora-00368 "checksum error in redo log block". Oracle has detected an invalid checksum on a archived redo log transported from PRIMARY and reported the error. This looks like an OS/network/hardware problem. The only reason Oracle raises an error is because of the checksum mismatch. There is no evidence of Oracle (functionality) actually failing or causing the problem. We should involve the OS/network/hardware vendor to investigate the problem. If you have a FIREWALL between PRIMARY and STANDBY, it should be reviewed.
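I also found a thread on OTN, https://forums.oracle.com/forums/thread.jspa?threadID=681766, that describes exactly the same problem. From the reply above, Oracle believes this is probably an OS/network/hardware fault, but I checked my OS, network, and hardware and found nothing wrong. Following up on the SR reply, I studied the note LOGMINER GENERATES CORRUPT REDO BLOCK DETECTED: BAD CHECKSUM [ID 751286.1]. According to its solution section, if LogMiner can successfully read another member of the log group and finds a good copy of the block there, the LogMiner capture does not terminate, and the trace contains the following messages: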
Corrupt redo block <bno> detected: bad checksum
Rereading log member '<file_path>' (corruption)
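To check a trace file for this pattern without reading it by eye, something like the sketch below could be used. It assumes the two messages appear in the trace exactly in the form quoted above; the function name `corruption_report` is made up for illustration, and the sample text is taken from the standby trace shown earlier.

```python
import re

# The two message formats quoted from the note above.
CORRUPT_RE = re.compile(r"Corrupt redo block (\d+) detected: bad checksum")
REREAD_RE = re.compile(r"Rereading log member '([^']*)'")


def corruption_report(trace_text):
    """Return (corrupt_block_numbers, reread_log_members) found in the text."""
    blocks = [int(bno) for bno in CORRUPT_RE.findall(trace_text)]
    members = REREAD_RE.findall(trace_text)
    return blocks, members


# Lines from the standby trace file shown earlier in this post.
trace = (
    "RFS LogMiner [snc]: Encountered exception [604] while querying apply info.\n"
    "Corrupt redo block 1247 detected: bad checksum\n"
)

blocks, members = corruption_report(trace)
print("corrupt blocks:", blocks)
print("reread members:", members)
```

An empty second list means a corrupt block was reported but no "Rereading log member" message followed it, so LogMiner did not fall back to another log member.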
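Unfortunately, I found no such "Rereading log member" message in my trace. The good news is that the business users told me that, despite the ORA-600 and ORA-27002 errors, no business data was lost. So this remains an unsolved case. Later I made some small adjustments to the Logical Data Guard parameters, which greatly reduced how often the problem occurred. Specifically, I adjusted MAX_SGA, MAX_SERVERS, APPLY_SERVERS, and similar parameters listed in DBA_LOGSTDBY_PARAMETERS. The issue is still being followed up, and the customer is currently checking the firewall.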