Fault Analysis: All Four RAC Node Instances Restarted Abnormally

A fault occurred today in which all four RAC node instances restarted at the same time, something we rarely see. The analysis process follows.

Node 1 alert log

2020-05-12T04:50:13.955295+08:00
Errors in file /u01/app/oracle/diag/rdbms/dwdb/dwdb1/trace/dwdb1_ora_118868.trc  (incident=4418543):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
skgxpvfynet: mtype: 61 process 123980 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
opidrv aborting process M003 ospid (123980) as a result of ORA-603
2020-05-12T05:07:29.342931+08:00
Reconfiguration started (old inc 8, new inc 10)
List of instances (total 1) :
1
Dead instances (total 3) :
2 3 4
My inst 1
2020-05-12T05:07:31.305816+08:00
NOTE: ASMB process state dumped to trace file /u01/app/oracle/diag/rdbms/dwdb/dwdb1/trace/dwdb1_gen0_178160.trc
2020-05-12T05:07:32.229320+08:00
skgxpvfynet: mtype: 61 process 124166 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
opiodr aborting process unknown ospid (124166) as a result of ORA-603
2020-05-12T05:09:27.659099+08:00
skgxpvfynet: mtype: 61 process 124818 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
2020-05-12T05:10:08.457843+08:00
Process m003 died, see its trace file
2020-05-12T05:10:34.164320+08:00
DDE: Problem Key 'ORA 603' was completely flood controlled (0x6)
Further messages for this problem key will be suppressed for up to 10 minutes
2020-05-12T05:10:39.931465+08:00
skgxpvfynet: mtype: 61 process 125146 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
2020-05-12T05:10:39.931458+08:00
opiodr aborting process unknown ospid (124818) as a result of ORA-603
2020-05-12T05:10:39.931454+08:00
GCR0[178417]: LMHB process succesfully killed
2020-05-12T05:11:40.473632+08:00
opiodr aborting process unknown ospid (125146) as a result of ORA-603
2020-05-12T05:18:01.842419+08:00
ksxp_exafusion_enabled_dcf: ipclw_enabled=0
Starting ORACLE instance (normal) (OS id: 108120)

Node 2 alert log

2020-05-12T04:47:56.141213+08:00
Incremental checkpoint up to RBA [0x254f.3988e5.0], current log tail at RBA [0x254f.39cc3a.0]
2020-05-12T05:05:18.334836+08:00
NOTE: ASMB0 registering with ASM instance as Flex client 0x10006 (reg:3205650803) (reconnect)
2020-05-12T05:06:11.234166+08:00
Dumping diagnostic data in directory=[cdmp_20200512050610], requested by (instance=4, osid=605758 (GEN0)), summary=[abnormal instance termination].
2020-05-12T05:06:12.197007+08:00
ERROR: terminating instance because ASMB is stuck for 241 seconds
GEN0 (ospid: 110374): terminating the instance due to error 15082
2020-05-12T05:06:18.242304+08:00
Instance terminated by GEN0, pid = 110374
2020-05-12T05:15:33.418428+08:00
ksxp_exafusion_enabled_dcf: ipclw_enabled=0

Node 3 alert log

2020-05-12T05:05:49.290986+08:00
LGWR (ospid: 364374) waits for event 'enq: CF - contention' for 74 secs.
2020-05-12T05:05:49.291033+08:00
LGWR (ospid: 364374) is hung in an acceptable location (cfio 0x11.00).
2020-05-12T05:06:05.433427+08:00
NOTE: ASMB0 registering with ASM instance as Flex client 0x10007 (reg:4119066454) (reconnect)
2020-05-12T05:06:11.239521+08:00
Dumping diagnostic data in directory=[cdmp_20200512050610], requested by (instance=4, osid=605758 (GEN0)), summary=[abnormal instance termination].
2020-05-12T05:06:16.340150+08:00
Dumping diagnostic data in directory=[cdmp_20200512050613], requested by (instance=2, osid=110374 (GEN0)), summary=[abnormal instance termination].
2020-05-12T05:06:34.967473+08:00
ERROR: terminating instance because ASMB is stuck for 242 seconds
GEN0 (ospid: 364027): terminating the instance due to error 15082
2020-05-12T05:06:38.182059+08:00
System state dump requested by (instance=3, osid=364027 (GEN0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/dwdb/dwdb3/trace/dwdb3_diag_364051_20200512050638.trc
2020-05-12T05:06:40.996565+08:00
Instance terminated by GEN0, pid = 364027
2020-05-12T05:15:36.490770+08:00
ksxp_exafusion_enabled_dcf: ipclw_enabled=0 
Starting ORACLE instance (normal) (OS id: 10317)
2020-05-12T05:15:36.499203+08:00
CLI notifier numLatches:131 maxDescs:19888
2020-05-12T05:15:36.501049+08:00
**********************************************************************
2020-05-12T05:15:36.501093+08:00
Dump of system resources acquired for SHARED GLOBAL AREA (SGA)

Node 4 alert log

2020-05-12T05:04:12.064862+08:00
CKPT (ospid: 605951) waits for event 'enq: CF - contention' for 63 secs.
2020-05-12T05:04:12.064910+08:00
CKPT (ospid: 605951) is hung in an acceptable location (cfio 0x11.00).
2020-05-12T05:05:18.151878+08:00
NOTE: ASMB0 registering with ASM instance as Flex client 0x10008 (reg:2344414256) (reconnect)
2020-05-12T05:05:25.774637+08:00
CKPT (ospid: 605951) waits for event 'enq: CF - contention' for 137 secs.
2020-05-12T05:05:25.774701+08:00
CKPT (ospid: 605951) is hung in an acceptable location (inwait 0x201.00).
2020-05-12T05:06:10.637714+08:00
ERROR: terminating instance because ASMB is stuck for 242 seconds
GEN0 (ospid: 605758): terminating the instance due to error 15082
2020-05-12T05:06:10.646458+08:00
DWPUB(3):opiodr aborting process unknown ospid (220276) as a result of ORA-1092
2020-05-12T05:06:10.646808+08:00
DWPUB(3):opiodr aborting process unknown ospid (330270) as a result of ORA-1092
2020-05-12T05:06:10.647256+08:00
DWPUB(3):opiodr aborting process unknown ospid (371195) as a result of ORA-1092
2020-05-12T05:06:10.651213+08:00
DWPUB(3):opiodr aborting process unknown ospid (220280) as a result of ORA-1092
2020-05-12T05:06:11.233314+08:00
System state dump requested by (instance=4, osid=605758 (GEN0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/dwdb/dwdb4/trace/dwdb4_diag_605768_20200512050611.trc
2020-05-12T05:06:16.690258+08:00
Instance terminated by GEN0, pid = 605758

From the logs above we can see that node 4 went down first, then node 2, then node 3, and node 1 went down last. In the end node 1 considered nodes 2, 3 and 4 dead:

List of instances (total 1) :
 1
Dead instances (total 3) :
 2 3 4
My inst 1

Continuing with node 1: before it went down, its alert log had been reporting the following errors continuously.

ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2

Combined with the OSWatcher memory chart for node 1, we can see that free memory was extremely low during the problem window.
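As a rough sketch, the same free-memory trend can also be pulled out of the OSWatcher archive from the command line; the archive path and file-name pattern below are illustrative and vary by installation:

$ cd /u01/app/oswbb/archive/oswmeminfo
$ grep -h '^MemFree' *_meminfo_20.05.12.*.dat | awk '{printf "%.0f MB\n", $2/1024}'

Each output line is one OSWatcher sample of MemFree, converted from kB to MB.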

Further analysis of the top output at the times free memory was low shows that the system had spawned a large number of parallel execution processes. Combined with the ADDM report for that period, the application was running CREATE TABLE statements with a very high degree of parallelism.
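As a sketch of how to confirm this from the database side (standard GV$ views, run as SYSDBA on any surviving node), the number of active parallel execution slaves per instance can be counted like this:

$ sqlplus -s / as sysdba
SQL> -- one row per instance with the current number of PX slave processes
SQL> SELECT inst_id, COUNT(*) AS px_slaves FROM gv$px_process GROUP BY inst_id;

A persistently high count here, taken together with the ADDM findings, points at the CREATE TABLE workload.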

The memory shortage is what triggered the ORA- errors above; see the MOS note:
“ORA-27301: OS failure message: No buffer space available” occurs on OPC RAC (Doc ID 2397062.1)

The parameter vm.min_free_kbytes needs to be set.
When memory is requested and enough free memory is available, the allocation succeeds; when there is not enough, the operation blocks and has to wait for the system to reclaim memory before the allocation can proceed. Some kernel-level atomic allocations, however, cannot block: if they cannot get memory immediately, they simply fail. To avoid such atomic allocation failures, the kernel must keep a block of memory in reserve, and vm.min_free_kbytes is the parameter that sets the size of that reserve.
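A minimal sketch of checking and setting this reserve on a Linux node is shown below; the 512 MB value is purely illustrative, and as noted in the recommendations the real value should be assessed by the host engineers.

On each node, as root user:
# sysctl vm.min_free_kbytes
# sysctl -w vm.min_free_kbytes=524288
# echo 'vm.min_free_kbytes = 524288' >> /etc/sysctl.conf
# sysctl -p

The first command shows the current reserve, the second changes it for the running kernel, and the last two persist the setting across reboots.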
Node 4 OSWatcher memory chart

Here we can see that node 4's free memory also kept dropping.


The top output shows that node 4's situation is different from node 1's: the osysmond process's memory usage keeps growing while free memory keeps shrinking. Searching MOS shows that the osysmond process has a known memory leak in 12c: Bug 18701017 - Memory Leak with osysmond.bin (Doc ID 18701017.8).
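A quick way to confirm the leak on a node is to sample the resident size of osysmond.bin over time, for example:

# ps -eo pid,rss,vsz,cmd | grep '[o]sysmond.bin'

If the RSS column keeps growing between samples while free memory shrinks, the node is hitting this bug.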

Optimization recommendations
1. Set the parameter vm.min_free_kbytes on all nodes. A block of memory must be kept in reserve; this parameter lets the kernel avoid atomic allocation failures. The exact value should be assessed by the host engineers.
2. The CREATE TABLE statements on node 1 use too high a degree of parallelism; keep the number of parallel processes for a single statement within 160 (see the sketch after the commands below).

3. Disable the System Monitor Service (osysmond) on all cluster nodes. osysmond sends each node's resource usage to the cluster logger service, which gathers the data from every node and stores it in the CHM repository. Since OSWatcher is already installed, this service can be turned off.

On each node, as root user:
# /bin/crsctl stop res ora.crf -init
# /bin/crsctl modify res ora.crf -attr ENABLED=0 -init
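For recommendation 2, a minimal sketch of capping the degree of parallelism cluster-wide is shown below. The limit of 160 comes from the recommendation above, and note that PARALLEL_DEGREE_LIMIT only constrains statements whose DOP is computed automatically (PARALLEL_DEGREE_POLICY = AUTO or LIMITED); statements carrying an explicit PARALLEL clause or hint still have to be fixed in the application.

$ sqlplus -s / as sysdba
SQL> ALTER SYSTEM SET parallel_degree_limit = 160 SID='*';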