Trouble Shooting 记一次CRSD无法启动案例

今天一大早被人叫起来搞远程,客户刚刚安装的Oracle 11g R2 RAC,重启了一次就再也无法启动起来了.我最讨厌大清早起来被人叫起来搞远程.但是也没有办法.谁叫我们选择干DBA这一行呢?好了,废话少说,RAC启动不起来,按照我的经验,首先我们执行crsctl check crs,看看是哪个进程无法启动,然后根据回显的结果去判断.当时执行完命令后,迅速发现是crsd进程无法启动.

image

那么先来看看Oracle 11g Release GI Processes的进程调用结构图.根据这个图的调用,我们先来看下orarootagent,这个代理进程是root用户来运行的,管理和维护着ora.crsd,ora.ctssd,ora.drivers.acfs还有diskmon这些资源,而crsd进程启动后,会去启动下面三个代理进程,分别是orarootagent,oraagent(grid),oraagent(oracle),对应着三个用户各自管理和维护的resource.可以看到root维护的都是网络方面的,包括GNS,Node VIP,SCAN VIP等等.而Grid则维护的资源是ASM,及Node监听,SCAN监听,Oracle的agent则简单许多,负责维护的资源是Database,Instance,还有Servcie.
分析了这么多,我还是先看了下crsd.log的日志,此时我发现里面没有记录当前时间点的任何信息.也就是crsd根本就没有启动起来,那么我们就应该顺藤摸瓜,去查找orarootagent的日志.在orarootagent里面我们看到了一些信息,如下所示:

2012-06-18 20:50:30.024: [ USRTHRD][2829] {0:0:2} checkCrsStat 2 CLSCRS_STAT ret: 184
2012-06-18 20:50:30.024: [ora.evmd][2829] {0:0:2} [start] clsn_agent::start }
2012-06-18 20:50:30.024: [    AGFW][2829] {0:0:2} Command: start for resource: ora.evmd 1 1 completed with status: SUCCESS
2012-06-18 20:50:30.026: [    AGFW][2314] {0:0:2} Agent sending reply for: RESOURCE_START[ora.evmd 1 1] ID 4098:331
2012-06-18 20:50:30.026: [ora.evmd][2057] {0:0:2} [check] clsdmc_respget return: status=0, ecode=0
2012-06-18 20:50:30.026: [ USRTHRD][2057] {0:0:2} Thread:[DaemonCheck:evmd]start {
2012-06-18 20:50:30.027: [ USRTHRD][2057] {0:0:2} Thread:[DaemonCheck:evmd]start }
2012-06-18 20:50:30.027: [ora.evmd][2057] {0:0:2} [check] DaemonAgent::check returned 0
2012-06-18 20:50:30.027: [ora.evmd][2057] {0:0:2} [check] Deep check returned 0
2012-06-18 20:50:30.028: [    AGFW][2314] {0:0:2} ora.evmd 1 1 state changed from: STARTING to: PARTIAL
2012-06-18 20:50:30.028: [    AGFW][2314] {0:0:2} Started implicit monitor for:ora.evmd 1 1
2012-06-18 20:50:30.028: [    AGFW][2314] {0:0:2} Agent sending last reply for: RESOURCE_START[ora.evmd 1 1] ID 4098:331
2012-06-18 20:50:30.029: [CLSFRAME][1] TM [MultiThread] is changing desired thread # to 4. Current # is 3
2012-06-18 20:50:35.452: [    AGFW][2314] {0:0:2} Agent received the message: RESOURCE_CLEAN[ora.asm 1 1] ID 4100:337
2012-06-18 20:50:35.452: [    AGFW][2314] {0:0:2} Preparing CLEAN command for: ora.asm 1 1
2012-06-18 20:50:35.452: [    AGFW][2314] {0:0:2} ora.asm 1 1 state changed from: UNKNOWN to: CLEANING
2012-06-18 20:50:35.454: [ora.asm][2829] {0:0:2} [clean] clsn_agent::clean {
2012-06-18 20:50:35.454: [ora.asm][2829] {0:0:2} [clean] __IS_HASD_AGENT = TRUE
2012-06-18 20:50:35.455: [ora.asm][2829] {0:0:2} [clean] AsmAgent::AsmAgent Constructor resName ora.asm compId +ASM
2012-06-18 20:50:35.455: [ora.asm][2829] {0:0:2} [clean] AsmAgent::refresh
2012-06-18 20:50:35.455: [ora.asm][2829] {0:0:2} [clean] AsmAgent::refresh ORACLE_HOME = /oracle/app/grid_home
2012-06-18 20:50:35.455: [ora.asm][2829] {0:0:2} [clean] AsmAgent:getOracleSid 3 oracle_sid = +ASM1
2012-06-18 20:50:35.455: [ora.asm][2829] {0:0:2} [clean] AsmAgent:getOracleSid oracle_sid = +ASM1
2012-06-18 20:50:35.455: [ora.asm][2829] {0:0:2} [clean] AsmAgent::refresh ORACLE_SID = +ASM1
2012-06-18 20:50:35.455: [ora.asm][2829] {0:0:2} [clean] getOracleHomeAttrib: oracle_home = /oracle/app/grid_home
2012-06-18 20:50:35.455: [ora.asm][2829] {0:0:2} [clean] getOracleHomeAttrib: oracle_home = /oracle/app/grid_home
2012-06-18 20:50:35.455: [ora.asm][2829] {0:0:2} [clean] ConnectionPool::ConnectionPool 2 m_oracleHome:/oracle/app/grid_home, m_oracleSid:+ASM1, m_usrOraEnv:
2012-06-18 20:50:35.456: [ora.asm][2829] {0:0:2} [clean] clean {
2012-06-18 20:50:35.456: [ora.asm][2829] {0:0:2} [clean] InstAgent::stop_option stop mode immediate option 1
2012-06-18 20:50:35.456: [ora.asm][2829] {0:0:2} [clean] InstAgent::stop {
2012-06-18 20:50:35.456: [ora.asm][2829] {0:0:2} [clean] InstAgent::stop original reason system do shutdown abort
2012-06-18 20:50:35.456: [ora.asm][2829] {0:0:2} [clean] Gimh::check OH /oracle/app/grid_home SID +ASM1
2012-06-18 20:50:35.456: [ora.asm][2829] {0:0:2} [clean] Gimh::check condition changes to (GIMH_NEXT_NUM) 0 exists
2012-06-18 20:50:35.456: [ora.asm][2829] {0:0:2} [clean] (:CLSN00006:)AsmAgent::check failed gimh state 0
2012-06-18 20:50:35.628: [ COMMCRS][6428]clsc_connect: (115de4db0) no listener at (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_UI_SOCKET))
2012-06-18 20:50:35.628: [ora.asm][2829] {0:0:2} [clean] checkCrsStat 2 CLSCRS_STAT ret: 184
2012-06-18 20:50:35.628: [ora.asm][2829] {0:0:2} [clean] clsnUtils::error Exception type=2 string=
2012-06-18 20:50:35.628: [ora.asm][2829] {0:0:2} [clean] AsmAgent::checkCbk: Exception UserErrorException
2012-06-18 20:50:35.628: [ora.asm][2829] {0:0:2} [clean]
2012-06-18 20:50:35.628: [ora.asm][2829] {0:0:2} [clean] InstAgent::check db/asm 2clsagfw_res_status 5 poolState 0
2012-06-18 20:50:35.628: [ora.asm][2829] {0:0:2} [clean] (:CLSN00006:)InstAgent::check: return unplanned offline
2012-06-18 20:50:35.628: [ USRTHRD][2829] {0:0:2} Gimh::destructor gimh_dest_query_ctx rc=0
2012-06-18 20:50:35.628: [ USRTHRD][2829] {0:0:2} Gimh::destructor gimh_dest_inst_ctx rc=0
2012-06-18 20:50:35.628: [ora.asm][2829] {0:0:2} [clean] ConnectionPool::stopConnection
2012-06-18 20:50:35.628: [ora.asm][2829] {0:0:2} [clean] ConnectionPool::removeConnection connection count 0
2012-06-18 20:50:35.629: [ora.asm][2829] {0:0:2} [clean] ConnectionPool::removeConnection freed 0
2012-06-18 20:50:35.629: [ora.asm][2829] {0:0:2} [clean] ConnectionPool::stopConnection sid +ASM1 status  1
2012-06-18 20:50:35.629: [ora.asm][2829] {0:0:2} [clean] InstAgent::check checkCounter 0 prev clsagfw_res_status 3 current clsagfw_res_status 1
2012-06-18 20:50:35.629: [ora.asm][2829] {0:0:2} [clean] InstAgent::stop  shutdown mode: 4
2012-06-18 20:50:35.629: [ora.asm][2829] {0:0:2} [clean] ConnectionPool::getConnection 160 ERROR sid +ASM1 pool status 1
2012-06-18 20:50:35.629: [ora.asm][2829] {0:0:2} [clean] ConnectionPool::getConnection 260 pConnxn 00000000
2012-06-18 20:50:35.629: [ora.asm][2829] {0:0:2} [clean] makeConnectStr = (DESCRIPTION=(ADDRESS=(PROTOCOL=beq)(PROGRAM=/oracle/app/grid_home/bin/oracle)(ARGV0=oracle+ASM1)(ENVS='ORACLE_HOME=/oracle/app/grid_home,ORACLE_SID=+ASM1')(ARGS='(DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))'))(CONNECT_DATA=(SID=+ASM1)))
2012-06-18 20:50:35.633: [ora.asm][2829] {0:0:2} [clean] InstConnection::connectInt: server not attached
2012-06-18 20:50:35.704: [ora.asm][2829] {0:0:2} [clean] ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
IBM AIX RISC System/6000 Error: 2: No such file or directory
Process ID: 0
Session ID: 0 Serial number: 0
2012-06-18 20:50:35.704: [ora.asm][2829] {0:0:2} [clean] InstConnection::connectInt (2) Exception OCIException
2012-06-18 20:50:35.704: [ora.asm][2829] {0:0:2} [clean] InstConnection:connect:excp OCIException OCI error 1034
2012-06-18 20:50:35.704: [ora.asm][2829] {0:0:2} [clean] InstAgent::stop: connect1 errcode 1034
2012-06-18 20:50:35.705: [ora.asm][2829] {0:0:2} [clean] InstAgent::stop: connect2 oracleHome /oracle/app/grid_home oracleSid +ASM1
2012-06-18 20:50:35.705: [ora.asm][2829] {0:0:2} [clean] InstConnection::connectInt: server not attached
2012-06-18 20:50:35.795: [ora.asm][2829] {0:0:2} [clean] ORA-01031: insufficient privileges
2012-06-18 20:50:35.795: [ora.asm][2829] {0:0:2} [clean] InstConnection::connectInt (2) Exception OCIException
2012-06-18 20:50:35.795: [ora.asm][2829] {0:0:2} [clean] InstConnection:connect:excp OCIException OCI error 1031
2012-06-18 20:50:35.796: [ora.asm][2829] {0:0:2} [clean] InstAgent::stop: connect2 errcode 1031
2012-06-18 20:50:35.796: [ora.asm][2829] {0:0:2} [clean] clsnUtils::error Exception type=2 string=
ORA-01031: insufficient privileges

在我们orarootagent日志中,我们发现首先是去尝试启动ora.asm资源,而在启动这个资源的时候,报了两个错误,一个是我们熟知的ORA-27001,一个是ORA-01031.错误,这两个错误说明了一个问题,ASM实例没启动,权限不够.那么我们尝试手动去启动ora.crsd进程看下输出结果.

# ./crsctl start res ora.crsd -init
CRS-2672: Attempting to start 'ora.asm' on 'zc-i780-db1'
ORA-01031: insufficient privileges
CRS-5017: The resource action "ora.asm start" encountered the following error:
ORA-01031: insufficient privilegesCRS-2674: Start of 'ora.asm' on 'zc-i780-db1' failed
CRS-2679: Attempting to clean 'ora.asm' on 'zc-i780-db1'
ORA-01031: insufficient privileges
CRS-2681: Clean of 'ora.asm' on 'zc-i780-db1' succeeded
CRS-4000: Command Start failed, or completed with errors.

这里同样是尝试启动ora.asm资源的时候报ORA-01031错误,由此可以判断是asm实例因为权限问题无法启动.于是询问客户是否做了什么安全设置,客户告诉我,在他们的系统中,设置了sqlnet.ora参数,操作系统认证设置了nts.把这行参数删除之后,顺利启动crsd,集群也恢复了正常.
其实这只是一个很简单的案例,还有许多曾经处理过的案例,我的方法是:
1.crsctl check crs检查,首先判断哪个服务无法启动.
2.查看Oracle 11g Release 2 GI Prosesses进程之间的调用,找到依赖关系.根据第一步查到的进程分析与其相关的进程是否启动.
3.确定好那个进程无法启动之后,我们可以查看文档:Troubleshoot Grid Infrastructure Startup Issues [ID 1050908.1].根据下面的Case在继续查找日志进行判断,最终确定引起问题的原因.
Troubleshoot Grid Infrastructure Startup Issues
Start up sequence:
Cluster status
Case 1: OHASD does not start
Case 2: OHASD Agents does not start
Case 3: OCSSD.BIN does not start
Case 4: CRSD.BIN does not start
Case 5: GPNPD.BIN does not start
Case 6: Various other daemons do not start
Case 7: CRSD Agents do not start
Network and Naming Resolution Verification
Log File Location, Ownership and Permission
Network Socket File Location, Ownership and Permission
Diagnostic file collection

分享到: 更多

Post a Comment

Your email is never published nor shared. Required fields are marked *