ORA-00603,ORA-27501,ORA-27300,ORA-27301,ORA-27302故障分析

背景介绍

这是一套windows的rac系统。数据库后台日志报ORA-00474:SMON process terminated with error。接着报ORA-00603,ORA-27501,ORA-27300,ORA-27301,ORA-27302等错误。

问题分析:

1.数据库alert日志

Fri Feb 28 04:12:09 2014
Reconfiguration started (old inc 32, new inc 34)
List of nodes:
 0
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE 
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Fri Feb 28 04:12:09 2014
 LMS 2: 0 GCS shadows cancelled, 0 closed
Fri Feb 28 04:12:10 2014
 LMS 1: 0 GCS shadows cancelled, 0 closed
Fri Feb 28 04:12:11 2014
 LMS 3: 0 GCS shadows cancelled, 0 closed
Fri Feb 28 04:12:12 2014
 LMS 0: 2 GCS shadows cancelled, 2 closed
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Fri Feb 28 04:12:12 2014
Instance recovery: looking for dead threads
Fri Feb 28 04:12:12 2014
 LMS 3: 10102 GCS shadows traversed, 0 replayed
Fri Feb 28 04:12:12 2014
 LMS 1: 10083 GCS shadows traversed, 0 replayed
Fri Feb 28 04:12:13 2014
Beginning instance recovery of 1 threads
Fri Feb 28 04:12:14 2014
Errors in file d:\oracle\product\10.2.0\admin\uaprac\bdump\uaprac1_smon_6456.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27501: IPC error creating a port
ORA-27300: OS system dependent operation:IPC_CreateNamedSocket failed with status: 10049
ORA-27301: OS failure message: The requested address is not valid in its context.
ORA-27302: failure occurred at: initport_1

从数据库的alert日志,可以看到rac实例发生了重新配置(另外一个节点重启),节点1的SMON进程开始恢复事务,在恢复事务的过程中创建ipc端口失败,Oracle数据库在创建端口的这一个过程中需要调用操作系统的一些函数,在调用操作系统的函数过程中遇到错误并返回错误代码10049。我们进一步看SMON trace的日志内容

*** 2014-02-28 04:10:26.269
*** SERVICE NAME:(SYS$BACKGROUND) 2014-02-28 04:10:26.238
*** SESSION ID:(1091.1) 2014-02-28 04:10:26.238
IPCSendMsg: could not initiate send on conn 0x1f157b40 to node [uap : 4920 : 6200 : 223515], err 10054
IPCGetRequestInfo: failed a request rqh(0x13582080), type(6), status(2), bytes(0)
*** 2014-02-28 04:12:12.928
Start recovery for domain 0, valid = 0, flags = 0x0

IPC_CreateNamedSocket: bind failed for [10.0.0.3 : 11716], err(10049)

IPCInitPort: could not create listening socket err 10049

在这个trace里面,我们能够更加清楚的看到一些信息。比如bind failed for [10.0.0.3:11716],err(10049)。这个是什么意思呢?要了解这一块需要对socket编程有一定的了解。在socket编程里面有一个非常重要的函数叫bind。在进行网络通信的时候,必须把一个套接字和一个地址相关联。这个过程就是地址绑定的过程。在许多时候内核会为我们绑定一个地址,然而有时候用户可能需要自己来完成这个绑定的过程,以满足实际的需要。下面是bind函数的定义。

#include
int bind(int sockfd, struct sockaddr* addr, socklen_t addrlen)

这里不一一介绍这些参数的含义,我们可以看到我们的trace中显示:bind函数在尝试绑定10.0.0.3和11716这个端口上出了问题,报了10049。那么操作系统的10049又是什么错误了。这里我们需要输入

C:\>net helpmsg 10049
The requested address is not valid in its context.

这里我们看到请求的地址在它的上下文中无效。所以从这个地方推测,我们请求的IP地址10.0.0.3应该发生了变化。出现这种情况一般是网卡出现了问题或者是交换机路由器出现了问题。导致IP地址失效。

2.CRS日志

[    CSSD]2014-02-28 04:10:12.800 [5928] >TRACE:   clssgmPeerDeactivate: node 2 (uap-sgs-db10), death 0, state 0x80000001 connstate 0xf
[    CSSD]2014-02-28 04:10:12.800 [6080] >WARNING: clssnmeventhndlr: Receive failure with node 2 (uap-sgs-db10), state 3, con(00000000022453A0), probe(0000000000000000), rc=11
[    CSSD]2014-02-28 04:10:12.800 [6080] >TRACE:   clssnmDiscHelper: uap-sgs-db10, node(2) connection failed, con (00000000022453A0), probe(0000000000000000)
[    CSSD]2014-02-28 04:10:35.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 50 1.135363e-320artbeat fatal, eviction in 29.594 seconds
[    CSSD]2014-02-28 04:10:35.801 [5940] >TRACE:   clssnmPollingThread: node uap-sgs-db10 (2) is impending reconfig, flag 1, misstime 30406
[    CSSD]2014-02-28 04:10:35.801 [5940] >TRACE:   clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[    CSSD]2014-02-28 04:10:36.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 50 1.138327e-320artbeat fatal, eviction in 28.594 seconds
[    CSSD]2014-02-28 04:10:50.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 75 1.145244e-320artbeat fatal, eviction in 14.594 seconds
[    CSSD]2014-02-28 04:10:51.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 75 1.146232e-320artbeat fatal, eviction in 13.594 seconds
[    CSSD]2014-02-28 04:10:59.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 90 1.151173e-320artbeat fatal, eviction in 5.594 seconds
[    CSSD]2014-02-28 04:11:00.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 90 1.152161e-320artbeat fatal, eviction in 4.594 seconds
[    CSSD]2014-02-28 04:11:01.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 90 1.153149e-320artbeat fatal, eviction in 3.594 seconds
[    CSSD]2014-02-28 04:11:02.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 90 1.154137e-320artbeat fatal, eviction in 2.594 seconds
[    CSSD]2014-02-28 04:11:03.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 90 1.155125e-320artbeat fatal, eviction in 1.594 seconds
[    CSSD]2014-02-28 04:11:04.801 [5940] >WARNING: clssnmPollingThread: node uap-sgs-db10 (2) at 90 1.158090e-320artbeat fatal, eviction in 0.594 seconds
[    CSSD]2014-02-28 04:11:05.411 [5940] >TRACE:   clssnmPollingThread: Eviction started for node uap-sgs-db10 (2), flags 0x0001, state 3, wt4c 0
[    CSSD]2014-02-28 04:11:05.411 [5920] >TRACE:   clssnmDoSyncUpdate: Initiating sync 109
[    CSSD]2014-02-28 04:11:05.411 [5920] >TRACE:   clssnmDoSyncUpdate: diskTimeout set to (57000)ms
[    CSSD]2014-02-28 04:11:05.411 [5920] >TRACE:   clssnmSetupAckWait: Ack message type (11) 
[    CSSD]2014-02-28 04:11:05.411 [5920] >TRACE:   clssnmSetupAckWait: node(1) is ALIVE
[    CSSD]2014-02-28 04:11:05.411 [5920] >TRACE:   clssnmSendSync: syncSeqNo(109)
[    CSSD]2014-02-28 04:11:05.411 [5920] >TRACE:   clssnmWaitForAcks: Ack message type(11), ackCount(1)
[    CSSD]2014-02-28 04:11:05.411 [6080] >TRACE:   clssnmHandleSync: diskTimeout set to (57000)ms
[    CSSD]2014-02-28 04:11:05.411 [6080] >TRACE:   clssnmHandleSync: Acknowledging sync: src[1] srcName[uap-sgs-db09] seq[1] sync[109]
[    CSSD]2014-02-28 04:11:05.411 [5956] >USER:    NMEVENT_SUSPEND [00][00][00][06]
[    CSSD]2014-02-28 04:11:05.411 [5920] >TRACE:   clssnmWaitForAcks: done, msg type(11)
[    CSSD]2014-02-28 04:11:05.411 [5920] >TRACE:   clssnmDoSyncUpdate: Terminating node 2, uap-sgs-db10, misstime(60015) state(5)
[    CSSD]2014-02-28 04:11:05.411 [5920] >TRACE:   clssnmSetupAckWait: Ack message type (13) 
[    CSSD]2014-02-28 04:11:05.411 [5920] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2014-02-28 04:11:05.411 [5920] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(1)
[    CSSD]2014-02-28 04:11:05.411 [6080] >TRACE:   clssnmSendVoteInfo: node(1) syncSeqNo(109)
[    CSSD]2014-02-28 04:11:05.411 [5920] >TRACE:   clssnmWaitForAcks: done, msg type(13)
[    CSSD]2014-02-28 04:11:05.411 [5920] >TRACE:   clssnmCheckDskInfo: Checking disk info...
[    CSSD]2014-02-28 04:11:05.411 [5920] >TRACE:   clssnmCheckDskInfo: node 2, uap-sgs-db10, state 5 with leader 2 has smaller cluster size 1; my cluster size 1 with leader 1
[    CSSD]2014-02-28 04:11:05.411 [5920] >TRACE:   clssnmEvict: Start
[    CSSD]2014-02-28 04:11:05.411 [5920] >TRACE:   clssnmEvict: Evicting node 2, uap-sgs-db10, birth 96, death 109, impendingrcfg 1, stateflags 0x1

从CRS日志上我们可以看到,这里发生了节点驱逐的现象。因为节点1发现节点2无法通信,此时节点2也发现无法与节点1通信,最终整个实例决定剔除节点2。这里我们可以看到我们的网络心跳的misscount时间是60秒。在50%左右的时候开始出现提示。CRS软件进一步确认是操作系统网络层面出现了问题。

3.操作系统日志

2014-2-28    4:10:12    Tcpip    信息    无    4202    N/A    UAP-SGS-DB09    
系统检测到网卡 Broadcom BCM5709C NetXtreme II GigE (NDIS VBD Client) #2 与网络断开, 而且网卡的网络配置已经释放。如果网卡没有断开,这可能意味着它出现故障。请与您的供应商联系以获得更新的驱动程序。
2014-2-28    4:10:06    l2nd    警告    无    4    N/A    UAP-SGS-DB09   
Broadcom BCM5709C: The network link is down.  Check to make sure the network cable is properly connected.

最终我们发现在操作系统上4:10:06秒的时候网卡出现了down的情况。而我们的CRS是在4:10:36秒的时候出现50%的心跳告警。而最终在4:11:05秒,也就是达到misscount设置的60秒的时候剔除了节点2。

结论

整个过程我们已经分析完毕,正常情况下,网卡down掉直接剔除一个节点就行了,而在这儿最大的问题是IP地址瞬间丢失,虽然导致节点2顺利剔除,但是节点1的SMON进程在进行恢复的过程中,仍然调用了socket编程中的bind函数,该函数需要对IP地址、端口进行重新绑定,在这个绑定的过程中,如果这个IP和前面的IP是不一样的(比如禁用网卡,IP地址从10.0.0.3会变成0.0.0.0)。最终会导致bind失败,bind失败导致smon恢复事务失败,最终导致节点1也宕机。所以这个案例告诉我们,心跳网卡最好做多方面的冗余措施,才能避免这种问题的发生。

分享到: 更多