At 08:28 in the morning, node 2 of the Ticket database ran into a problem and the instance went down.
From the node 2 alert log we can see the following:
2020-10-04T06:00:00.394539+08:00
Setting Resource Manager plan SCHEDULER[0x4AC5]:DEFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
2020-10-04T08:28:18.782135+08:00
LCK1 (ospid: 102585) waits for event 'libcache interrupt action by LCK' for 72 secs.
2020-10-04T08:28:18.782264+08:00
LCK1 (ospid: 102585) is hung in an acceptable location (libcache 0x41.02).
2020-10-04T08:30:38.830872+08:00
LCK1 (ospid: 102585) waits for event 'libcache interrupt action by LCK' for 212 secs.
2020-10-04T08:30:38.830927+08:00
LCK1 (ospid: 102585) is hung in an acceptable location (libcache 0x41.02).
2020-10-04T08:31:19.073614+08:00
LMS0 (ospid: 102556) has detected no messaging activity from instance 1
2020-10-04T08:31:19.075489+08:00
LMS3 (ospid: 102562) has detected no messaging activity from instance 1
2020-10-04T08:31:19.077201+08:00
USER (ospid: 102556) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2020-10-04T08:31:19.077949+08:00
Communications reconfiguration: instance_number 1 by ospid 102556
2020-10-04T08:31:19.079198+08:00
USER (ospid: 102562) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2020-10-04T08:31:19.083950+08:00
LMS2 (ospid: 102560) has detected no messaging activity from instance 1
USER (ospid: 102560) issues an IMR to resolve the situation
Please check USER trace file for more detail.
2020-10-04T08:31:19.087934+08:00
LMON (ospid: 102552) drops the IMR request from LMS3 (ospid: 102562) because IMR is in progress and inst 1 is marked bad.
LMON (ospid: 102552) drops the IMR request from LMS2 (ospid: 102560) because IMR is in progress and inst 1 is marked bad.
Detected an inconsistent instance membership by instance 1
Errors in file /u01/app/oracle/diag/rdbms/ticket/ticket2/trace/ticket2_lmon_102552.trc (incident=1246889):
ORA-29740: evicted by instance number 1, group incarnation 10
Incident details in: /u01/app/oracle/diag/rdbms/ticket/ticket2/incident/incdir_1246889/ticket2_lmon_102552_i1246889.trc
2020-10-04T08:31:19.413794+08:00
Received an instance abort message from instance 1
2020-10-04T08:31:19.413839+08:00
Received an instance abort message from instance 1
2020-10-04T08:31:19.413908+08:00
Received an instance abort message from instance 1
2020-10-04T08:31:19.413927+08:00
Received an instance abort message from instance 1
From the node 2 log we can see that the first process to run into trouble was the LCK1 process on node 2: it was waiting on 'libcache interrupt action by LCK', which is related to library cache locks.
Continuing with node 2's LCK1 trace:

Why was node 2 waiting? In RAC, every node may hold reference information for a given table in its row cache (dictionary cache). When the table's definition is changed on any node, all objects related to that table must be marked invalid on every node. So, on top of the traditional library cache lock, the LCK process on each node holds a shared-mode IV (invalidation) instance lock on the objects in its local library cache. A session that wants to change an object's definition must acquire the IV lock in exclusive mode, which asks the local LCK process to release its shared-mode lock; before releasing it, the local LCK process notifies the LCK processes on the other nodes, and when they receive that message they invalidate all related objects in their local library caches. In other words, Oracle broadcasts these invalidations so that a change on any node reaches every node. It is because of this RAC mechanism that the LCK process has to coordinate the release with the user processes holding row cache entries and service library cache requests.
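To see whether this invalidation traffic is piling up, one option is to look for IV enqueue, library cache lock and row cache lock waits on both instances. The following is only a minimal sketch: the column list and the event filter are illustrative and not taken from the case data.

select inst_id, sid, program, event, seconds_in_wait,
       blocking_instance, blocking_session
from   gv$session
where  event like 'enq: IV%'                              -- cross-instance invalidation enqueue
   or  event in ('library cache lock', 'row cache lock')  -- local dictionary/library cache waits
order  by seconds_in_wait desc;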
Next, look at node 1's wait events at 08:26:40. They are concentrated on row cache lock and gc current request.
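If you want to reproduce that node 1 picture from ASH yourself, a query along these lines works; the time window used here is an assumption based on the timestamps above.

select event, count(*) as samples
from   gv$active_session_history
where  inst_id = 1
and    sample_time between timestamp '2020-10-04 08:26:00'
                       and timestamp '2020-10-04 08:32:00'
and    event in ('row cache lock', 'gc current request')
group  by event
order  by samples desc;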

A similar issue exists in 12.1 and 11.2.0.4, namely Bug 17052702 - LCK hang with 'enq: IV - contention' (Doc ID 2410520.1), but our environment is 12.2 and we have not yet found a matching bug for this release.
At the earliest point of the failure, the library cache lock waits all occurred under the HL_QUERY user, and so far ASH has not shown which DDL caused the library cache invalidation. HL_QUERY runs a number of scheduler jobs, and these jobs are not tied to a specific instance, so they may run on either node 1 or node 2. Even though the application only ever connects to one node, this still produced fairly severe GC waits, precisely because the jobs were not pinned to a particular instance.
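To confirm that the HL_QUERY jobs really do move between instances, the scheduler run history can be checked. This is a sketch that simply assumes the jobs are owned by HL_QUERY and looks at the last week of runs.

select job_name, instance_id, count(*) as runs
from   dba_scheduler_job_run_details
where  owner = 'HL_QUERY'
and    actual_start_date > systimestamp - interval '7' day
group  by job_name, instance_id
order  by job_name, instance_id;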
Given the mechanism described above, we recommend the following three steps:
1. The database is currently 12.2.0.1 with the 20190716 DB RU; upgrade to the second-latest RU.
2. Pin the jobs to node 1 so that they no longer run on either node at random and generate large numbers of GC waits (see the sketch after this list).
3. Set the parameter cursor_invalidation to DEFERRED so that cursors are invalidated on a rolling basis; this avoids mass cursor invalidation after a DDL and spreads the recompilation work out over time (also shown in the sketch after this list).
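A sketch of how steps 2 and 3 might be implemented; HL_QUERY.SOME_JOB is a hypothetical job name used only for illustration, and both changes should be verified on a test system first.

-- Step 2: pin a scheduler job to instance 1 (job name is hypothetical)
begin
  dbms_scheduler.set_attribute(
    name      => 'HL_QUERY.SOME_JOB',
    attribute => 'instance_id',
    value     => 1);
end;
/

-- Step 3: switch to deferred (rolling) cursor invalidation on both instances
alter system set cursor_invalidation = deferred scope=both sid='*';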