Copyright: this is an original article by Buddy Yuan and may not be reproduced without permission. Original post: Troubleshooting an ora.asm startup failure and a 'GPnP Get Item' hang
Last night I went to apply a DB PSU on an 11.2.0.3 RAC. I stopped the databases and the clusterware, installed the patch, and thought things were going quite smoothly, until I tried to restart the cluster and it would not come up. crsctl check crs showed that the crsd service could not start, yet crsd.log contained no entries at all; it stayed empty. ocrcheck simply hung, while crsctl query css votedisk returned quickly and looked fine. I then used kfed to read the ASM disks, and they were readable as well. Further checking showed that it was the ora.asm resource that was failing to start. Logging in to the ASM instance and looking at its alert log, I could see that the spfile was read and the instance itself started normally, but the diskgroups never reached the final step of a successful mount, as shown below:
SQL> ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */
NOTE: Diskgroups listed in ASM_DISKGROUPS are DG_FRA DG_DATA
NOTE: Diskgroup used for Voting files is: DG_OCR
Diskgroup with spfile:DG_OCR
Diskgroup used for OCR is:DG_OCR
NOTE: cache registered group DG_DATA number=1 incarn=0x50ec3256
NOTE: cache began mount (not first) of group DG_DATA number=1 incarn=0x50ec3256
NOTE: cache registered group DG_FRA number=2 incarn=0x50fc3257
NOTE: cache began mount (not first) of group DG_FRA number=2 incarn=0x50fc3257
NOTE: cache registered group DG_OCR number=3 incarn=0x50fc3258
NOTE: cache began mount (not first) of group DG_OCR number=3 incarn=0x50fc3258
NOTE: Assigning number (1,3) to disk (/dev/diskgroup/dg_data4)
NOTE: Assigning number (1,2) to disk (/dev/diskgroup/dg_data3)
NOTE: Assigning number (1,1) to disk (/dev/diskgroup/dg_data2)
NOTE: Assigning number (1,0) to disk (/dev/diskgroup/dg_data1)
NOTE: Assigning number (2,0) to disk (/dev/diskgroup/dg_fra)
NOTE: Assigning number (3,0) to disk (/dev/diskgroup/dg_ocr)
GMON querying group 1 at 4 for pid 23, osid 128640
NOTE: cache opening disk 0 of grp 1: DG_DATA_0000 path:/dev/diskgroup/dg_data1
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache opening disk 1 of grp 1: DG_DATA_0001 path:/dev/diskgroup/dg_data2
NOTE: cache opening disk 2 of grp 1: DG_DATA_0002 path:/dev/diskgroup/dg_data3
NOTE: cache opening disk 3 of grp 1: DG_DATA_0003 path:/dev/diskgroup/dg_data4
NOTE: cache mounting (not first) external redundancy group 1/0x50EC3256 (DG_DATA)
You can see that the first part of the log shows the disks being read normally; after some further operations, the following errors appear at the end:
NOTE: detached from domain 3
NOTE: cache dismounted group 3/0x50FC3258 (DG_OCR)
NOTE: cache ending mount (fail) of group DG_OCR number=3 incarn=0x50fc3258
NOTE: cache deleting context for group DG_OCR 3/0x50fc3258
GMON dismounting group 3 at 9 for pid 23, osid 128640
NOTE: Disk in mode 0x8 marked for de-assignment
ERROR: diskgroup DG_OCR was not mounted
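For reference, the quick health checks described at the beginning looked roughly like this (a sketch using the device names from this environment; the exact output will vary):

crsctl check crs                         # showed crsd was not running
crsctl query css votedisk                # returned quickly, voting disks listed fine
ocrcheck                                 # hung in this case (OCR is stored in DG_OCR)
kfed read /dev/diskgroup/dg_ocr | more   # the ASM disk header was still readable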
Next I connected to the ASM instances and checked the diskgroup state: on the healthy instance the diskgroups were MOUNTED, while on the problem instance they were still in MOUNTING state, i.e. the mount was in progress. The open question was why the mount never completed.
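A minimal way to compare the two instances is a query against gv$asm_diskgroup (my illustration, run from either ASM instance as SYSASM):

SELECT inst_id, name, state
  FROM gv$asm_diskgroup
 ORDER BY inst_id, name;

On the healthy node STATE showed MOUNTED, while on the stuck node it stayed at MOUNTING.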
To get to the bottom of this I ran a global hang analysis. It has to be cluster-wide; a local hang analysis would not reveal why the diskgroup cannot mount.
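A cluster-wide hang analysis can be taken from the ASM instance roughly as follows (a sketch; level 3 simply matches the oradebug_node_dump_level visible in the trace below):

sqlplus / as sysasm
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug -g all hanganalyze 3

The -g all form runs the dump on every instance, and the result is written to trace files under the ASM diagnostic trace directory, which is where output like the following excerpt appears.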
*** 2018-10-26 04:36:34.280
===============================================================================
HANG ANALYSIS:
  instances (db_name.oracle_sid): +asm.+asm2, +asm.+asm1
  oradebug_node_dump_level: 3
  analysis initiated by oradebug
  os thread scheduling delay history: (sampling every 1.000000 secs)
    0.000000 secs at [ 04:36:34 ]
      NOTE: scheduling delay has not been sampled for 0.184529 secs
    0.000000 secs from [ 04:36:30 - 04:36:35 ], 5 sec avg
    0.000000 secs from [ 04:35:34 - 04:36:35 ], 1 min avg
    0.000000 secs from [ 04:31:34 - 04:36:35 ], 5 min avg
===============================================================================
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'GPnP Get Item'<='DFS lock handle'
     Chain 1 Signature Hash: 0xe08712d6
 [b] Chain 2 Signature: 'GPnP Get Item'<='rdbms ipc reply'<='enq: DD - contention'
     Chain 2 Signature Hash: 0xec6b634f
 [c] Chain 3 Signature: 'GPnP Get Item'<='rdbms ipc reply'<='enq: DD - contention'
     Chain 3 Signature Hash: 0xec6b634f
===============================================================================
Non-intersecting chains:

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 2 (+asm.+asm2)
                   os id: 88049
              process id: 23, oracle@KSPX-DB2 (TNS V1-V3)
              session id: 1427
        session serial #: 3
    }
    is waiting for 'DFS lock handle' with wait info:
    {
                      p1: 'type|mode'=0x43490005
                      p2: 'id1'=0x3c
                      p3: 'id2'=0x2
            time in wait: 49 min 17 sec
           timeout after: never
                 wait id: 107
                blocking: 0 sessions
             current sql: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1876<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+160<-ksliwat()+1865<-kslwaitctx()+163<-kjusuc()+3619<-ksigeti()+3781<-ksbcic_int()+21066<-ksbcic()+12<-kfgbSendPeelWait()+53<-kfgFinalizeMount()+4736<-kfgscFinalize()+1405<-kfgForEachKfgsc()+193<-kfgsoFinalize()+135<-kfgFinalize()+396<-kfxdrvMount()+4781<-kfxdrvEntry()+2218<-opiexe()+20368<-opiosq0()+2948<-kpooprx()+274<-kpoal8()+829<-opiodr()+916<-ttcpip()+2242<-opitsk()+1673<-opiino()+966<-opiodr()+916<-opidr
            wait history:
              * time between current wait and wait #1: 0.000149 sec
              1.       event: 'rdbms ipc reply'
                  time waited: 0.000338 sec
                      wait id: 106
                           p1: 'from_process'=0x12
                           p2: 'timeout'=0x7fffffff
              * time between wait #1 and #2: 0.000054 sec
              2.       event: 'kfk: async disk IO'
                  time waited: 0.000182 sec
                      wait id: 105
                           p1: 'count'=0x1
                           p2: 'intr'=0x0
                           p3: 'timeout'=0xffffffff
              * time between wait #2 and #3: 0.000157 sec
              3.       event: 'rdbms ipc reply'
                  time waited: 0.000372 sec
                      wait id: 104
                           p1: 'from_process'=0x12
                           p2: 'timeout'=0x7fffffff
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (+asm.+asm1)
                   os id: 15643
              process id: 18, oracle@KSPX-DB1 (RBAL)
              session id: 1117
        session serial #: 1
    }
    which is waiting for 'GPnP Get Item' with wait info:
    {
            time in wait: 221831 min 33 sec
           timeout after: never
                 wait id: 124844244
                blocking: 44 sessions
             current sql:
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1876<-sspuser()+112<-__sighandler()<-__poll()+47<-sgipcwWaitHelper()+5271<-sgipcwWait()+475<-gipcWaitOsd()+281<-gipcInternalWait()+14965<-gipcWaitF()+3234<-gipcInternalRecvSync()+8749<-gipcRecvSyncF()+3343<-clsgpnpm_gipcGets()+297<-clsgpnpm_receiveMsg()+408<-clsgpnpm_exchange()+1059<-clsgpnp_profileCallUrlInt()+3568<-clsgpnp_getProfileEx()+325<-clsgpnp_dbmsGetItem()+263<-kggpnpAttrGet()+1272<-kfdParseProfileString()+885<-kfdDiscoverShallow()+2023<-kfgbDriver()+1745<-ksbabs()+
            wait history:
              * time between current wait and wait #1: 0.000007 sec
              1.       event: 'GPnP Initialization'
                  time waited: 0.018700 sec
                      wait id: 124844243
              * time between wait #1 and #2: 0.000207 sec
              2.       event: 'CSS initialization'
                  time waited: 0.000027 sec
                      wait id: 124844242
              * time between wait #2 and #3: 0.000062 sec
              3.       event: 'rdbms ipc message'
                  time waited: 0.002787 sec
                      wait id: 124844241
                           p1: 'timeout'=0x12c
    }
The global hang analysis shows that after the ASM instance on node 2 started, it issued "ALTER DISKGROUP ALL MOUNT", and that command never completed: it had been blocked for more than 40 minutes. The blocker is the RBAL process on node 1, which is itself stuck on the 'GPnP Get Item' wait event. With the cause identified, it was time to search MOS. The note Diskgroup Mount Hangs with RBAL Waiting on 'GPnP Get Item' and 'enq: DD - contention' (Doc ID 1375505.1) describes exactly this situation, caused by unpublished bug 12398300: the gpnpd process gets stuck, which in turn leaves RBAL waiting. The fix is simple: kill the gpnpd process on node 1 and the wait is released. Killing gpnpd does not restart the clusterware; only that daemon itself is restarted.
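A sketch of the fix on node 1 (gpnpd.bin is the standard daemon name in 11.2; the PID is a placeholder):

ps -ef | grep gpnpd.bin | grep -v grep   # find the stuck GPnP daemon
kill -9 <gpnpd_pid>
crsctl stat res ora.gpnpd -init -t       # ohasd respawns the daemon; verify it is ONLINE again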
After gpnpd was killed the hang cleared, and the clusterware on node 2 started up without further issues.
One takeaway from this case: whenever the ora.asm resource fails to start, a global hang analysis is worth running early on.