Diagnosing a RAC Cluster Failure: network and VIP Resources Fail to Start

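IBM host engineers needed to perform maintenance on the servers. After the maintenance was finished and the cluster was restarted on the database side, the database listener had problems after starting, and neither the cluster network resource nor the VIP resource could be started, as shown in the figure below: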

The error here indicates that the connection failed.
Looking further at the cluster resource status shows the following:

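As you can see, the resources ora.ons, ora.net1.network, ora.pdzwdb2.vip, and the listener resource ora.listener.lsnr are all in an abnormal OFFLINE state.
Because these are cluster resources, we went straight to the cluster logs. Under agent/crsd/orarootagent_root in the cluster log directory we found the following messages: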

2019-02-21 04:25:27.266: [ora.net1.network][2315]{1:20222:2} [check] Checking if en6 Interface is fine
2019-02-21 04:25:27.272: [    AGFW][2057]{1:20222:2} Agent sending reply for: RESOURCE_START[ora.net1.network pdzwdb1 1] ID 4098:217
2019-02-21 04:25:27.298: [ora.net1.network][2315]{1:20222:2} [check] ifname=en6
2019-02-21 04:25:27.298: [ora.net1.network][2315]{1:20222:2} [check] subnetmask=255.255.255.0
2019-02-21 04:25:27.298: [ora.net1.network][2315]{1:20222:2} [check] subnetnumber=10.25.4.0
2019-02-21 04:25:27.328: [ora.net1.network][2315]{1:20222:2} [check] ifname=en6
2019-02-21 04:25:27.328: [ora.net1.network][2315]{1:20222:2} [check] subnetmask=255.255.255.0
2019-02-21 04:25:27.328: [ora.net1.network][2315]{1:20222:2} [check] subnetnumber=10.25.4.0
2019-02-21 04:25:27.329: [ora.net1.network][2315]{1:20222:2} [check] CRS-5008: Invalid attribute value: en6 for the network interface

2019-02-21 04:25:27.329: [ora.net1.network][2315]{1:20222:2} [check] NetworkAgent::init exit }
2019-02-21 04:25:27.329: [ora.net1.network][2315]{1:20222:2} [check] ioctl Error
2019-02-21 04:25:27.330: [ora.net1.network][2315]{1:20222:2} [check] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 6, other:
2019-02-21 04:25:27.330: [    AGFW][2057]{1:20222:2} ora.net1.network pdzwdb1 1 state changed from: STARTING to: OFFLINE
2019-02-21 04:25:27.330: [    AGFW][2057]{1:20222:2} Switching online monitor to offline one

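The log shows that during cluster startup the attributes of the en6 network interface were checked, and when the check finished it reported "Invalid attribute value: en6 for the network interface". As a result the resource went OFFLINE and could not start.
For this error we first checked the host. The errors logged on the host all dated from before the reboot; none had been logged since the reboot completed. However, the log above pointed to a problem with the attributes of the en6 interface, followed by OS error 6, so we asked the IBM hardware vendor to help investigate.
Both hosts were then rebooted once more. The error persisted after the reboot, and this time node 1 also began reporting it. Meanwhile we searched MOS (My Oracle Support) for the error and, after going through several notes, found one that resembled our case:
Unable To Start ora.net2.network with CRS-2672 CRS-2674 and CRS-5008: Invalid attribute value: en4 for the network interface (Doc ID 1548049.1)
The note explains that the subnet mask stored in the CRS resource did not match the one on the host operating system. We immediately checked this in detail, using crsctl to display the full profile of the ora.net1.network resource.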
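
The agent log is verbose, so it can help to filter it down to the lines that matter for one resource. The following is only a sketch: the function name scan_agent_log is made up for illustration, and the three patterns are simply the ones that appear in the excerpt above (a CRS error code, the ioctl failure, and the state change).

```shell
# Hypothetical helper, not part of any Oracle tooling.
# $1 = path to the orarootagent log file
# $2 = resource name, e.g. ora.net1.network
# Prints only the lines for that resource that carry a CRS-nnnn error code,
# mention ioctl, or record a state change.
scan_agent_log() {
    grep -F -- "$2" "$1" | grep -E 'CRS-[0-9]+:|ioctl|state changed'
}
```

Called as scan_agent_log "$AGENT_LOG" ora.net1.network, where AGENT_LOG is assumed to hold the path of the log file under agent/crsd/orarootagent_root, this narrows the excerpt above to the CRS-5008 line, the two ioctl lines, and the STARTING to OFFLINE transition.
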

grid@pdzwdb1:/home/grid>crsctl status resource ora.net1.network -p
NAME=ora.net1.network
TYPE=ora.network.type
ACL=owner:root:rwx,pgrp:system:r-x,other::r--,group:dba:r-x,user:grid:r-x
ACTION_FAILURE_TEMPLATE=
ACTION_SCRIPT=
AGENT_FILENAME=%CRS_HOME%/bin/orarootagent%CRS_EXE_SUFFIX%
ALIAS_NAME=
AUTO_START=restore
CHECK_INTERVAL=1
DEFAULT_TEMPLATE=
DEGREE=1
DESCRIPTION=Oracle Network resource
ENABLED=1
LOAD=1
LOGGING_LEVEL=1
NLS_LANG=
NOT_RESTARTING_TEMPLATE=
OFFLINE_CHECK_INTERVAL=60
PROFILE_CHANGE_TEMPLATE=
RESTART_ATTEMPTS=5
SCRIPT_TIMEOUT=60
START_DEPENDENCIES=
START_TIMEOUT=0
STATE_CHANGE_TEMPLATE=
STOP_DEPENDENCIES=
STOP_TIMEOUT=0
TYPE_VERSION=2.2
UPTIME_THRESHOLD=1d
USR_ORA_AUTO=static
USR_ORA_ENV=
USR_ORA_IF=
USR_ORA_NETMASK=255.255.255.128
USR_ORA_SUBNET=10.25.4.0
VERSION=11.2.0.4.0

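Here we found that the netmask of this resource on node 1 was 255.255.255.128, whereas both the cluster and the host showed a subnet mask of 255.255.255.0.
The interface information as seen by the cluster (oifcfg):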

grid@pdzwdb1:/home/grid>oifcfg iflist -p -n
en0  9.181.63.0  PUBLIC  255.255.255.0
en6  10.25.4.0  PUBLIC  255.255.255.0
en7  100.1.1.0  PUBLIC  255.255.255.0

The information on the host operating system:

grid@pdzwdb1:/home/grid>ifconfig -a
en0: flags=1e084863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
        inet 9.181.63.12 netmask 0xffffff00 broadcast 9.181.63.255
         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
en6: flags=1e084863,8c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
        inet 10.25.4.25 netmask 0xffffff00 broadcast 10.25.4.255
        inet 10.25.4.30 netmask 0xffffff00 broadcast 10.25.4.255
         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
en7: flags=1e084963,8c0<UP,BROADCAST,NOTRAILERS,RUNNING,PROMISC,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
        inet 100.1.1.36 netmask 0xffffff00 broadcast 100.1.1.255
        inet 169.254.142.107 netmask 0xffff0000 broadcast 169.254.255.255
         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
lo0: flags=e08084b,c0<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,LARGESEND,CHAIN>
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
        inet6 ::1%1/0
         tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1
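
ifconfig on AIX prints netmasks in hexadecimal (0xffffff00), while crsctl and oifcfg print them in dotted decimal, so comparing the two by eye is error-prone. A minimal conversion sketch follows; the function name hex_mask_to_dotted is an assumption made for this example, and it expects exactly eight hex digits after the 0x prefix.

```shell
# Hypothetical helper: convert an AIX-style hexadecimal netmask
# (for example 0xffffff00) into dotted-decimal notation.
hex_mask_to_dotted() {
    hex=${1#0x}                            # drop the leading "0x"
    o1=$(printf '%s' "$hex" | cut -c1-2)   # first octet, two hex digits
    o2=$(printf '%s' "$hex" | cut -c3-4)
    o3=$(printf '%s' "$hex" | cut -c5-6)
    o4=$(printf '%s' "$hex" | cut -c7-8)
    # printf %d accepts C-style hex constants such as 0xff
    printf '%d.%d.%d.%d\n' "0x$o1" "0x$o2" "0x$o3" "0x$o4"
}
```

For the en6 interface above, hex_mask_to_dotted 0xffffff00 prints 255.255.255.0, which is the value to hold against the USR_ORA_NETMASK attribute of the network resource. A /25 mask would appear in ifconfig as 0xffffff80 and convert to 255.255.255.128.
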

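In other words, the interface configuration known to CRS, the interface configuration of the network resource in CRS, and the interface configuration of the host must all agree. We switched to the root account and ran srvctl to correct the configuration of the network resource, after which the cluster and the database recovered.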

# srvctl modify network -k 1 -S 10.25.4.0/255.255.255.0/en6
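
After a change like this, the stored netmask can be re-read from the resource profile and compared with the mask configured on the host. The sketch below shows one way to do that with plain text processing. Both function names (crs_attr and masks_agree) are hypothetical, and the host-side mask is assumed to be supplied already in dotted-decimal form.

```shell
# crs_attr NAME
# Reads "crsctl status resource <resource> -p" output on stdin and prints
# the value of the attribute NAME (the text after "NAME=").
crs_attr() {
    sed -n "s/^$1=//p"
}

# masks_agree CRS_MASK HOST_MASK
# Succeeds only when the two dotted-decimal masks are identical.
masks_agree() {
    if [ "$1" = "$2" ]; then
        echo "OK: netmask $1 matches"
    else
        echo "MISMATCH: CRS has $1, host has $2"
        return 1
    fi
}
```

A typical use would be to capture the value with mask=$(crsctl status resource ora.net1.network -p | crs_attr USR_ORA_NETMASK) and then run masks_agree "$mask" 255.255.255.0. With the profile shown earlier (255.255.255.128) the check reports a mismatch, which is the condition that kept ora.net1.network offline.
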

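The root cause was that the public IP subnet and subnet mask were not consistent across the host, the cluster, and the network resource. Whenever the host or cluster configuration is changed, make sure all three stay consistent.
Reference: Unable To Start ora.net2.network with CRS-2672 CRS-2674 and CRS-5008: Invalid attribute value: en4 for the network interface (Doc ID 1548049.1)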
