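At around 11:02, the host of the customer's database node 1 hit a fault and the node went down. The VIP failed over and a large number of client connections flooded into node 2, causing a short spike in node 2's CPU run queue (the R queue). At the same time, node 2 had to perform a Reconfiguration because node 1 had crashed. At around 11:10, node 2 also went down abnormally. After receiving the alert, we logged on to the host and restarted the node 2 instance at 11:13, which restored the database.
Node 1 went down because of a host hardware problem. Node 2 should then have taken over, but its instance crashed in the middle of the takeover.
The alert log from that time shows that the instance was terminated by the LMON process.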
Fri Sep 25 11:11:15 2020
ORA-1092 : opitsk aborting process
Fri Sep 25 11:11:17 2020
Termination issued to instance processes. Waiting for the processes to exit
Fri Sep 25 11:11:17 2020
ORA-1092 : opitsk aborting process
Instance terminated by LMON, pid = 213044

Looking further into the LMON trace file, we find the following:
*** 2020-09-25 11:05:06.721
2020-09-25 11:05:06.721325 : * Begin lmon rcfg step KJGA_RCFG_TIMERQ
2020-09-25 11:05:06.721606 : * Begin lmon rcfg step KJGA_RCFG_DDQ
2020-09-25 11:05:06.723678 : * Begin lmon rcfg step KJGA_RCFG_SETMASTER
2020-09-25 11:05:06.941692 : Set master node info
2020-09-25 11:05:06.942669 : * Begin lmon rcfg step KJGA_RCFG_ENQREPLAY
*** 2020-09-25 11:05:07.128
2020-09-25 11:05:07.128144 : Submitted all remote-enqueue requests
2020-09-25 11:05:07.129277 : * Begin lmon rcfg step KJGA_RCFG_ENQDUBIOUS
Dwn-cvts replayed, VALBLKs dubious
2020-09-25 11:05:07.386873 : * Begin lmon rcfg step KJGA_RCFG_ENQGRANT
All grantable enqueues granted
2020-09-25 11:05:07.527865 : * Begin lmon rcfg step KJGA_RCFG_PCMREPLAY
2020-09-25 11:05:07.724845 :
2020-09-25 11:05:07.724928 : Post SMON to start 1st pass IR
*** 2020-09-25 11:10:34.000
2020-09-25 11:10:34.000754 : * kjfclmsync: waited 327 secs for lmses to finish parallel rcfg work, terminating instance
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+465<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+63<-ksuitm()+5594<-kjfclmsync()+941<-kjfcrfg()+78119<-kjfcln()+8349<-ksbrdp()+1045<-opirip()+623<-opidrv()+603<-sou2o()+103<-opimai_real()+250<-ssthrdmain()+265<-main()+201<-__libc_start_main()+253
----- End of Abridged Call Stack Trace -----
*** 2020-09-25 11:10:34.017
LMON (ospid: 213044): terminating the instance due to error 481

The LMON trace shows that the rcfg steps, which represent the Reconfiguration, were already under way by 11:05. Then, at 11:10, LMON hit: kjfclmsync: waited 327 secs for lmses to finish parallel rcfg work, terminating instance.
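To make the stall visible, the timestamps in the trace can be compared directly. The following is a minimal Python sketch, not an Oracle tool: the function names `parse_events` and `largest_gap` are hypothetical, and the regular expression only assumes the `timestamp : message` layout seen in the excerpt above. It finds the longest silence between two consecutive timestamped trace lines.

```python
import re
from datetime import datetime

# Timestamped lines copied from the LMON trace excerpt above.
TRACE_LINES = [
    "2020-09-25 11:05:06.721325 : * Begin lmon rcfg step KJGA_RCFG_TIMERQ",
    "2020-09-25 11:05:06.721606 : * Begin lmon rcfg step KJGA_RCFG_DDQ",
    "2020-09-25 11:05:06.723678 : * Begin lmon rcfg step KJGA_RCFG_SETMASTER",
    "2020-09-25 11:05:06.942669 : * Begin lmon rcfg step KJGA_RCFG_ENQREPLAY",
    "2020-09-25 11:05:07.129277 : * Begin lmon rcfg step KJGA_RCFG_ENQDUBIOUS",
    "2020-09-25 11:05:07.386873 : * Begin lmon rcfg step KJGA_RCFG_ENQGRANT",
    "2020-09-25 11:05:07.527865 : * Begin lmon rcfg step KJGA_RCFG_PCMREPLAY",
    "2020-09-25 11:05:07.724928 : Post SMON to start 1st pass IR",
    "2020-09-25 11:10:34.000754 : * kjfclmsync: waited 327 secs for lmses "
    "to finish parallel rcfg work, terminating instance",
]

# Assumed layout: "<YYYY-MM-DD HH:MM:SS.ffffff> : <message>".
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) :\s*(.*)$")


def parse_events(lines):
    """Return (timestamp, message) pairs for lines matching the assumed layout."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            events.append((stamp, match.group(2).strip()))
    return events


def largest_gap(events):
    """Return (seconds, message_before, message_after) for the longest silence."""
    best = None
    for (t0, m0), (t1, m1) in zip(events, events[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, m0, m1)
    return best


if __name__ == "__main__":
    gap, before, after = largest_gap(parse_events(TRACE_LINES))
    print(f"{gap:.1f}s of silence between '{before}' and '{after}'")
```

On this excerpt the longest silence is about 326 seconds, between "Post SMON to start 1st pass IR" and the kjfclmsync message, which is consistent with the 327 seconds LMON reports waiting. All of the rcfg steps before that point completed within roughly one second.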


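We recommend running high-availability switchover drills on a regular basis from now on, and using them to assess whether a load spike at the moment of switchover can cause this kind of instance crash.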