SMON遇到ORA-21779: duration not active错误

问题背景 

客户一套10.2.0.4的RAC数据库,因为要进行11g升级,所以最近部署了SPA采集任务,结果该任务部署没多久,SMON就爆发了ORA-21779错误。导致数据库运行非常缓慢。作为运维厂商,我们立马对数据库出错原因进行了分析。我们首先来看一下具体的SMON报的错误信息。

*** 2014-04-26 13:47:28.987
SMON: following errors trapped and ignored:
ORA-21779: duration not active
ORA-06512: at line 1
Sat Apr 26 13:48:06 2014
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=36
System State dumped to trace file /oracle/app/oracle/admin/wcrma/bdump/wcrma2_smon_3610.trc

wcrma2_smon_3610.trc
name=SPA.SYSTP9+MPUyJ8/wLgRAAUT0SBEA== 
Drop transient type: SYSTP9+MPUyJ8/wLgRAAUT0SBEA==m 

call stack:ksdxfstk()+36<-ksdxdocmdmult()+6788<-ksudmp_proc()+1008<-ksudss()+2996<-kqrigt()+1856<-kqrpre1()+4832<-kqrpre()+40<-kturax()+632<-ktprbeg()+404<-ktmmon()+5984<-ksbrdp()+976<-opirip()+824<-opidrv()+1200<-sou2o()+80<-opimai_real()+268<-main()+152<-_start()+380

row cache parent object: address=142ff1d7b8 cid=3(dc_rollback_segments)
LIBRARY OBJECT HANDLE: handle=1349991ab0 mtx=1349991be0(1) cdp=1 
name=select o.name, u.name from obj$ o, type$ t, user$ u where o.oid$ = t.tvoid and u.user#=o.owner# and bitand(t.properties,8388608) = 8388608 and (sysdate-o.ctime) > 0.0007
hash=f8aafde8df1d0384b3c6849b889a14c9 timestamp=04-26-2014 12:48:31 
LIBRARY OBJECT HANDLE: handle=1431525e88 mtx=1431525fb8(0) cdp=0

我们仔细的来看,SMON在清理一个transient type的时候,报了这个错误,也就是清理过程中出现了问题。而transient type是一种临时的type。它是在动态执行的时候产生在内存里面的。例如应用程序在语句里面定义了自定义的Type,在运行过程中,运行到这一步它就会转化成transient type。根据文档SMON: Following Errors Trapped And Ignored ORA-21779 (文档 ID 988663.1)的描述。这个问题会引起alert日志大量的增长,产生大量的trace,但是不会影响到数据库正常使用,但是在实际情况下,我们发现并非如此,因为我们的SMON进程是非常关键的进程,它在删除某些东西的时候需要申请到锁,而我们从这一次的堆栈中可以看到下列函数。我们主要来看下kqrigt函数。该函数在谷歌上搜索可以查到下列内容,可以看到这个函数里面需要申请row cache instance lock锁,如果申请不到,就会报WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK。而这里它需要申请的是dc_rollback_segments。

* Function kqrigt

May call kslfre (...)
May call kqrprl (...)
May call ksesec0 (0x0FB5)
May call ksbasend (dword ptr ds:kqrglk_+4, offset kqrglk_, 0x10, ...)
May call kslgetl (?, 1, 0, offset kqrpl10_)
May call kslwait (?, ds:kqrelk_, 0, ...)
May call kslgetl (?, 1, ?, offset kqrpl10_)
May call ksu_dispatch_tac ()
May call kskchk ()
May call ksuitr ()
May call ksesec0 (...)
May call kgeasnmierr (ds:ksmgpp_, ds:ksefac_, "kqrigt-1", 0)
May call kqrchk ()
May call ksurea ()
May call ksdwrf ("Failed to get row cache instance lock (post=%d,res=%lx)\n")
May call kqrdrs (...)
May call ksesec0 (0x259)
May call ksesec0 (0x2BF)
May call ksbcic (0x1E, ?, ?, 1)
May call kslgetl (?, 1, ?, offset kqrpl43_)
May call kgeasnmierr (ds:ksmgpp_, ds:ksefac_, "kqrgplt", 2)
May call kgeasnmierr (ds:ksmgpp_, ds:ksefac_, "kqrigt-2", 0)
May call ksdwra (">>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=%d")
May call ksdwrf (">>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! <<<\n")
May call ksdwrf ("row cache enqueue: session: %p, mode: %c, request: %c\n")
May call ksudss (0x102)
May call ksedst (0)

解决办法

那么这个问题怎么解决了?我们可以参考文档Receiving ORA-21780 Continuously in the Alert Log and SMON Trace Reports "Drop transient type". (文档 ID 1081950.1)的解决办法。首先针对这个transient type,smon大概会12个小时去清理一次。我们可以去手动进行清理,或者设置内部事件22834去阻止smon清理这类的type。

1.直接删除

Step 1
Make sure you have a good full backup.

Step 2
spool obj.lis
alter session set nls_date_format='DD-MON-YYYY HH24:MI:SS';
set pagesize 1000
select o.* from obj$ o, type$ t
where o.oid$ = t.tvoid and
bitand(t.properties,8388608) = 8388608 and (sysdate-o.ctime) > 0.0007;
spool off

Step 3
Find the object owner:

Step 4 
Drop the objects.
DROP TYPE "SYSTPf/r2wN4keX7gQKjA3AFMSw==" FORCE;

2.设置内部事件22834去阻止SMON清理这类型的Type,但是可能会消耗更多的内存。

SQL> alter system set event='22834 trace name context forever, level 1' scope=spfile;
System altered.

SQL> alter system set event='22834 trace name context forever, level off' scope=spfile;
System altered.

当然这个事件也可以使用oradebug来进行设置。这样就不需要进行重启的操作。

参考文档:

Receiving ORA-21780 Continuously in the Alert Log and SMON Trace Reports "Drop transient type". (文档 ID 1081950.1)

SMON: Following Errors Trapped And Ignored ORA-21779 (文档 ID 988663.1)

分享到: 更多