又遇ORA-04021: timeout occurred while waiting to lock object ,这次是DBRM进程

版权声明:本文为Buddy Yuan原创文章,未经允许不得转载。原文地址:又遇ORA-04021: timeout occurred while waiting to lock object,这次是DBRM进程

第二次在ADG上遇到这个错误了,上一次遇到这个错误的传送门,但是这一次的错误并不是LGWR进程引起的,这一次是另外一套11.2.0.4的DG备库,是DBRM进程导致的实例宕机。我们先来看下Trace。

Errors in file /oracle/app/product/diag/rdbms/dghbyydba/hbyydba2/trace/hbyydba2_dbrm_12256282.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-04021: timeout occurred while waiting to lock object 
DBRM (ospid: 12256282): terminating the instance due to error 604
Wed Oct 10 01:20:08 2018
System state dump requested by (instance=2, osid=12256282 (DBRM)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/app/product/diag/rdbms/dghbyydba/hbyydba2/trace/hbyydba2_diag_11470468_20181010012008.trc
Dumping diagnostic data in directory=[cdmp_20181010012008], requested by (instance=2, osid=12256282 (DBRM)), summary=[abnormal instance termination].
Instance terminated by DBRM, pid = 12256282

Alert的trace报错很清楚,是DBRM进程最后终止了实例,我们在看一下DBRM进程的Trace。

*** 2018-10-10 00:35:41.489
PQQ: Active Services changed
PQQ: Old service table
SvcIdx  SvcId Active ActDop
PQQ: New service table
SvcIdx  SvcId Active ActDop
     1      1      1      0
     2      2      1      0
error 604 detected in background process

*** 2018-10-10 01:20:08.823
ORA-00604: error occurred at recursive SQL level 1
ORA-04021: timeout occurred while waiting to lock object
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+240<-kjzdssdmp()+240<-kjzduptcctx()+228<-kjzdicrshnfy()+120<-ksuitm()+1532<-ksbrdp()+4696<-opirip()+1620<-opidrv()+608<-sou2o()+136<-opimai_real()+188<-ssthrdmain()+276<-main()+20
4<-__start()+112
----- End of Abridged Call Stack Trace -----

*** 2018-10-10 01:20:08.824
DBRM (ospid: 12256282): terminating the instance due to error 604
ksuitm: waiting up to [5] seconds before killing DIAG(11470468)

这里有堆栈信息,根据上述错误,我们在MOS中很快就定位了这个问题。根据文档:Bug 18101696 Database Resource Manager Crashes Standby Instance with ORA-604 / ORA-1489 After Turning Apply On。文档中描述下列信息。我们这里报了ORA-00604的递归错误,同时我们的堆栈也包含了”kjzdicrshnfy<-ksuitm<-ksbrdp“这三个。

Problem Description:
-------------------------
Database Resource Manager (DBRM) terminating with:
 
      ORA-604: error occurred at recursive SQL level 1
      ORA-1489: result of string concatenation is too long
Instance terrmination is likely to follow. The call stack would include:

kjzdicrshnfy<-ksuitm<-ksbrdp
但是疑问就是我们这里没有报ORA-1489,而是报了ORA-04021: timeout occurred while waiting to lock object,其实这个我们在上一篇文章介绍过,当通过ADG中的恢复,LGWR将DB INSTANCE状态对象锁定为独占模式。这样的结果是LGWR可以阻止SQL的解析,而SQL的解析也能阻止LGWR。而这里不是LGWR进程而是DBRM进程,这说明了另外一点,DBRM进程也会将一些对象锁定,从而导致超时问题。针对这个超时问题,还是建议将隐含参数”_adg_parselock_timeout”设置成500。
而MOS上的根本workground是关闭resource manager,但是奇怪的地方就是resource manager plan是没有开启的。而DBRM进程主要作用是为数据库实例配置资源计划。那么这可能是一个ADG上的BUG,但是如果我们安装补丁18101696是会出问题的,他会导致出现另外一个错误:ORA-604 AND ORA-904 AFTER INSTALLING PATCH 18101696 (文档 ID 2072853.1)。这个补丁安装之后就增加了一个新函数rm $ _get_mappings_hash,以及调用该函数的代码。而如果要使用它就必须运行postinstall.sql,但是问题就在于ADG是只读的。所以打这个补丁需要在主库上也打,并且在主库上运行该脚本,然后同步到物理备库。不然就会出现ORA-00904报错。而还有一个办法就是安装补丁:Bug 25188350 – DBRM terminating the instance due to error 604 (文档 ID 25188350.8),这个补丁有一个说明就是“This fix supersedes the fix of bug 18101696, installing patch 25188350 resolves both issues.”。也就是补丁18101696已经被取代了,安装补丁25188350能解决这所有的问题。

因此,我认为这个问题的终极solution就是:
1、设置参数”_adg_parselock_timeout”,防止lock object而导致的超时。
2、安装补丁25188350,阻止DBRM进程运行调用SYS.DBMS_RMIN”出现的递归错误引起的ORA-604。

分享到: 更多

Post a Comment

Your email is never published nor shared. Required fields are marked *