记一次ORA-15196:invalid ASM block header故障解决

背景介绍

客户数据库空间不够用,拟定增加磁盘,工程师向ASM磁盘组增加磁盘成功,结果在磁盘重新进行平衡的时候出现ORA-15196错误. ASM磁盘组能够短暂Mount,瞬间又会DisMount.磁盘组无法Mount导致对数据库任何操作都无法实施.整个数据库大概在5.6T左右,所有的数据文件和归档日志文件都存放在ASM磁盘组中,该系统是关键系统,如果不紧急处理,将造成巨大的影响.

Thu Dec 06 10:19:21 BEIST 2012
SQL> alter diskgroup DISKGROUP1 add disk '/dev/rhdisk14' rebalance power 11
Thu Dec 06 10:35:16 BEIST 2012
SQL> alter diskgroup DISKGROUP1 add disk '/dev/rhdisk14' rebalance power 11
Thu Dec 06 11:04:06 BEIST 2012
SQL> alter diskgroup DISKGROUP1 add disk '/dev/rhdisk14' rebalance power 11
Thu Dec 06 11:04:06 BEIST 2012
NOTE: reconfiguration of group 1/0xadd17f0c (DISKGROUP1), full=1
Thu Dec 06 11:04:06 BEIST 2012
NOTE: initializing header on grp 1 disk DISKGROUP1_0008
NOTE: cache opening disk 8 of grp 1: DISKGROUP1_0008 path:/dev/rhdisk14
NOTE: requesting all-instance disk validation for group=1
Thu Dec 06 11:04:06 BEIST 2012
NOTE: disk validation pending for group 1/0xadd17f0c (DISKGROUP1)
SUCCESS: validated disks for 1/0xadd17f0c (DISKGROUP1)
Thu Dec 06 11:04:08 BEIST 2012
NOTE: PST update: grp = 1
NOTE: requesting all-instance membership refresh for group=1
Thu Dec 06 11:04:08 BEIST 2012
NOTE: membership refresh pending for group 1/0xadd17f0c (DISKGROUP1)
SUCCESS: refreshed membership for 1/0xadd17f0c (DISKGROUP1)
Thu Dec 06 11:04:14 BEIST 2012
NOTE: starting rebalance of group 1/0xadd17f0c (DISKGROUP1) at power 11
Thu Dec 06 11:04:33 BEIST 2012
WARNING: cache read a corrupted block gn=1 dsk=4 blk=15 from disk 4
NOTE: a corrupted block was dumped to the trace file
ERROR: cache failed to read dsk=4  blk=15 from disk(s): 4
ORA-15196: invalid ASM block header [kfc.c:8281] [check_kfbh] [2147483652] [15] [2170720495 != 4107964284]
System State dumped to trace file /oracle/products/admin/+ASM/bdump/+asm3_arb5_258680.trc
NOTE: cache initiating offline of disk 4  group 1
WARNING: process 258680 initiating offline of disk 4.3038875643 (DISKGROUP1_0004) with mask 0x3 in group 1
WARNING: Disk 4 in group 1 in mode: 0x7,state: 0x2 will be taken offline
NOTE: PST update: grp = 1, dsk = 4, mode = 0x6
Thu Dec 06 11:04:33 BEIST 2012
ERROR: too many offline disks in PST (grp 1)
Thu Dec 06 11:04:33 BEIST 2012
ERROR: PST-initiated MANDATORY DISMOUNT of group DISKGROUP1
Thu Dec 06 11:04:33 BEIST 2012
WARNING: Disk 4 in group 1 in mode: 0x7,state: 0x2 was taken offline
Thu Dec 06 11:04:33 BEIST 2012
NOTE: halting all I/Os to diskgroup DISKGROUP1

从出错的Alert日志里可以看到在我们的ASM磁盘出现了坏块.坏块所在的位置在gn=1 dsk=4 blkn=5.也就是磁盘组disk group1,磁盘disk4, 块号15出现了损坏.后续工程师做了几次的mount磁盘组的操作,都出现错误.后续具体错误罗列如下:

===================================================================================
WARNING: cache read a corrupted block gn=1 dsk=4 blk=15 from disk 4
NOTE: a corrupted block was dumped to the trace file
ERROR: cache failed to read dsk=4  blk=15 from disk(s): 4
ORA-15196: invalid ASM block header [kfc.c:8281] [check_kfbh] [2147483652] [15] [2170720495 != 4107964284]
System State dumped to trace file /oracle/products/admin/+ASM/bdump/+asm3_arb3_148116.trc
NOTE: cache initiating offline of disk 4  group 1
WARNING: process 148116 initiating offline of disk 4.2080029767 (DISKGROUP1_0004) with mask 0x3 in group 1
WARNING: Disk 4 in group 1 in mode: 0x7,state: 0x2 will be taken offline
===================================================================================
ORA-00372: file 4 cannot be modified at this time
ORA-15196: invalid ASM block header [kfc.c:13709] [check_kfbh] [4] [10] [96019589 != 96284194]
===================================================================================
ORA-00372: file 4 cannot be modified at this time
ORA-15196: invalid ASM block header [kfc.c:13709] [check_kfbh] [4] [4] [95954078 != 95692632]
===================================================================================
ORA-00372: file 4 cannot be modified at this time
ORA-15196: invalid ASM block header [kfc.c:13709] [check_kfbh] [4] [12] [95692616 != 95823630]

处理过程

根据Alter日志报错现象,是I/O错误引起的ASM坏块.引起该类问题的原因有很多.如下:

  • 硬件的I/O错误
  • 操作系统的I/O错误
  • 缓冲问题
  • 内存或paging问题
  • 软件bug问题

因为当时时间比较紧急,必须全力将系统恢复,所以一边把日志和trace上传给Oracle Support进行分析,一边做进一步的尝试.

根据最先报错信息,我们可以看到

ORA-15196: invalid ASM block header [kfc.c:8281] [check_kfbh] [2147483652] [15] [2170720495 != 4107964284]

这个错误有5个参数有以下含义:

  • kfc.c:8281—代表程序文件异常的行号.此处代表kfc.c程序文件的8281行出错;
  • check_kfdh代表验证错误的字段.这里是结构体kfdh中的check值.
  • ASM object Number存储在block 2147483652;
  • ASN Block Number存储在block 15;
  • 验证异常的值是2170720495 正确的值是4107964284;

知道这些参数的含义之后,我们只需要修改ASM底层结构体的check值为正确的值就能解决这个错误.于是使用了ASM的Kfed工具.具体操作如下:

oracle@TJ_ORACLE_1:/oracle/amdu $ kfed read /dev/rhdisk6 blkn=15 | more
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                      15 ; 0x004: T=0 NUMB=0xf
kfbh.block.obj:              2147483652 ; 0x008: TYPE=0x8 NUMB=0x4
kfbh.check:                  2170720495 ; 0x00c: 0x816290ef
kfbh.fcn.base:                    52199 ; 0x010: 0x0000cbe7
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdatb.aunum:                      5824 ; 0x000: 0x000016c0
kfdatb.shrink:                      448 ; 0x004: 0x01c0
kfdatb.ub2pad:                    19216 ; 0x006: 0x4b10
kfdatb.auinfo[0].link.next:           8 ; 0x008: 0x0008
kfdatb.auinfo[0].link.prev:           8 ; 0x00a: 0x0008
kfdatb.auinfo[0].free:                0 ; 0x00c: 0x0000
kfdatb.auinfo[0].total:             448 ; 0x00e: 0x01c0
kfdatb.auinfo[1].link.next:          16 ; 0x010: 0x0010
kfdatb.auinfo[1].link.prev:          16 ; 0x012: 0x0010
kfdatb.auinfo[1].free:                0 ; 0x014: 0x0000
kfdatb.auinfo[1].total:               0 ; 0x016: 0x0000
kfdatb.auinfo[2].link.next:          24 ; 0x018: 0x0018
kfdatb.auinfo[2].link.prev:          24 ; 0x01a: 0x0018
kfdatb.auinfo[2].free:                0 ; 0x01c: 0x0000
kfdatb.auinfo[2].total:               0 ; 0x01e: 0x0000
kfdatb.auinfo[3].link.next:          32 ; 0x020: 0x0020
kfdatb.auinfo[3].link.prev:          32 ; 0x022: 0x0020
kfdatb.auinfo[3].free:                0 ; 0x024: 0x0000
kfdatb.auinfo[3].total:               0 ; 0x026: 0x0000
kfdate[0].discriminator:              1 ; 0x028: 0x00000001

这里我们看到了ASM底层kfbh结构体check的值为2170720495,我们先读出来block15的所有值,然后修改成正常的值4107964284,重新merge回去.在做这项操作之前,以防万一先做一次备份.4

dd if=/dev/rhdisk6 of=/oracle/amdu/file4bl15.dd bs=1m count=1 skip=14
dd if=/dev/rhdisk6 of=/oracle/amdu/file4b bs=1m count=1000
kfed read /dev/rhdisk6 blkn=15 > /oracle/amdu/15.txt

修改文本/oracle/amdu/15.txt中kfbh结构体的check值为4107964284,此处需要注意的是十六进制0x816290ef需要修改成0xf4da8f7c

kfed merge /dev/rhdisk6 blkn=15  /oracle/amdu/15.txt
oracle@TJ_ORACLE_1:/oracle/amdu $ kfed read /dev/rhdisk6 blkn=15 | more
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                      15 ; 0x004: T=0 NUMB=0xf
kfbh.block.obj:              2147483652 ; 0x008: TYPE=0x8 NUMB=0x4
kfbh.check:                  4107964284 ; 0x00c: 0xf4da8f7c
kfbh.fcn.base:                    52199 ; 0x010: 0x0000cbe7

做完这项操作后,磁盘组能够正常mount了修复后ASM日志如下所示:

oracle@TJ_ORACLE_1:/oracle/products/admin/+ASM/bdump $ tail -200f alert*.log
SQL> alter diskgroup diskgroup1 mount
Thu Dec 06 15:46:14 BEIST 2012
NOTE: cache registered group DISKGROUP1 number=1 incarn=0xfb2a4d04
Thu Dec 06 15:46:14 BEIST 2012
NOTE: Hbeat: instance first (grp 1)
Thu Dec 06 15:46:19 BEIST 2012
NOTE: start heartbeating (grp 1)
NOTE: cache opening disk 0 of grp 1: DISKGROUP1_0000 path:/dev/rhdisk10
Thu Dec 06 15:46:19 BEIST 2012
NOTE: F1X0 found on disk 0 fcn 0.8776559
NOTE: cache opening disk 1 of grp 1: DISKGROUP1_0001 path:/dev/rhdisk11
NOTE: cache opening disk 2 of grp 1: DISKGROUP1_0002 path:/dev/rhdisk12
NOTE: cache opening disk 3 of grp 1: DISKGROUP1_0003 path:/dev/rhdisk13
NOTE: cache opening disk 4 of grp 1: DISKGROUP1_0004 path:/dev/rhdisk6
NOTE: cache opening disk 5 of grp 1: DISKGROUP1_0005 path:/dev/rhdisk7
NOTE: cache opening disk 6 of grp 1: DISKGROUP1_0006 path:/dev/rhdisk8
NOTE: cache opening disk 7 of grp 1: DISKGROUP1_0007 path:/dev/rhdisk9
NOTE: cache opening disk 8 of grp 1: DISKGROUP1_0008 path:/dev/rhdisk14
NOTE: cache mounting (first) group 1/0xFB2A4D04 (DISKGROUP1)
* allocate domain 1, invalid = TRUE
kjbdomatt send to node 3
Thu Dec 06 15:46:19 BEIST 2012
NOTE: attached to recovery domain 1
Thu Dec 06 15:46:19 BEIST 2012
NOTE: starting recovery of thread=1 ckpt=368.5092 group=1
NOTE: advancing ckpt for thread=1 ckpt=368.5122
NOTE: cache recovered group 1 to fcn 0.87768160
Thu Dec 06 15:46:20 BEIST 2012
NOTE: LGWR attempting to mount thread 1 for disk group 1
NOTE: LGWR mounted thread 1 for disk group 1
NOTE: opening chunk 1 at fcn 0.87768160 ABA
NOTE: seq=369 blk=5123
Thu Dec 06 15:46:20 BEIST 2012
NOTE: cache mounting group 1/0xFB2A4D04 (DISKGROUP1) succeeded
SUCCESS: diskgroup DISKGROUP1 was mounted
Thu Dec 06 15:46:22 BEIST 2012
NOTE: recovering COD for group 1/0xfb2a4d04 (DISKGROUP1)
SUCCESS: completed COD recovery for group 1/0xfb2a4d04 (DISKGROUP1)
Thu Dec 06 15:46:22 BEIST 2012
NOTE: starting rebalance of group 1/0xfb2a4d04 (DISKGROUP1) at power 11
Starting background process ARB0
Starting background process ARB1
ARB0 started with pid=18, OS id=258622
Thu Dec 06 15:46:22 BEIST 2012
Starting background process ARB2

磁盘组顺利的mount起来之后,我们需要做一次完整的check all repair操作,看看还有那些坏块.

SQL> alter diskgroup diskgroup1 check all repair
Thu Dec 06 15:47:30 BEIST 2012
NOTE: starting check of diskgroup DISKGROUP1
ERROR: file +diskgroup1.256.727465361: F256 PX9971 => D4 A439 => F2 PX6636321: fnum mismatch
ERROR: file +diskgroup1.256.727465361: F256 PX9977 => D4 A440 => F0 PX0: fnum mismatch
ERROR: file +diskgroup1.256.727465361: F256 PX9985 => D4 A441 => F0 PX0: fnum mismatch
ERROR: file +diskgroup1.256.727465361: F256 PX9992 => D4 A442 => F17200 PX16777215: fnum mismatch
ERROR: file +diskgroup1.256.727465361: F256 PX9999 => D4 A443 => F208944 PI3420470: fnum mismatch
ERROR: file +diskgroup1.264.727465541: F264 PX28 => D4 A444 => F0 PX44716: fnum mismatch
ERROR: file +diskgroup1.264.727465541: F264 PX36 => D4 A445 => F409648 PX27304: fnum mismatch
ERROR: file +diskgroup1.264.727465541: F264 PX44 => D4 A446 => F900847 PX11386607: fnum mismatch
ERROR: file +diskgroup1.264.727465541: F264 PX50 => D4 A447 => F0 PX11386607: fnum mismatch
Thu Dec 06 15:51:29 BEIST 2012
ERROR: disk DISKGROUP1_0004, AT 13: D4 A6270 => F900847 X11386607: no fd block
ERROR: disk DISKGROUP1_0004, AT 13: free AU count mismatch [63] [55]
Thu Dec 06 15:51:56 BEIST 2012
WARNING: disk DISKGROUP1_0004, changing DD used AUs from 1130474 to 1130081
echo
Thu Dec 06 15:53:07 BEIST 2012
ERROR: check of diskgroup DISKGROUP1 found 11 total errors
Thu Dec 06 15:53:07 BEIST 2012
ORA-15049: diskgroup "DISKGROUP1" contains 11 error(s)

这里可以看到我们的ASM磁盘组还有11个坏块,从这些错误上我们可以看到是256和264号文件产生的.所以需要查看文件256和文件264有没有损坏.我们使用DBV进行校验.

oracle@TJ_ORACLE_1:/oracle/products/10.2/db/dbs $ dbv file=+DISKGROUP1/ULTRAIAM/DATAFILE/SYSTEM.256.727465361 blocksize=8192 userid=xj_exp_data/fred_2010
DBVERIFY: Release 10.2.0.5.0 - Production on Thu Dec 6 16:01:24 2012
Copyright (c) 1982, 2007, Oracle.  All rights reserved.
DBVERIFY - Verification starting : FILE = +DISKGROUP1/ULTRAIAM/DATAFILE/SYSTEM.256.727465361
DBVERIFY - Verification complete

Total Pages Examined         : 1280000
Total Pages Processed (Data) : 74819
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 18923
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 1968
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 1184290
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Highest block SCN            : 0 (0.0)

oracle@TJ_ORACLE_1:/oracle $ dbv file=+DISKGROUP1/ULTRAIAM/DATAFILE/UNDOTBS2.264.727465541 blocksize=8192 userid=xj_exp_data/fred_2010
DBVERIFY: Release 10.2.0.5.0 - Production on Thu Dec 6 17:59:18 2012
Copyright (c) 1982, 2007, Oracle.  All rights reserved.
DBVERIFY - Verification starting : FILE = +DISKGROUP1/ULTRAIAM/DATAFILE/UNDOTBS2.264.727465541
DBVERIFY - Verification complete

Total Pages Examined         : 4194302
Total Pages Processed (Data) : 0
Total Pages Failing   (Data) : 0
Total Pages Processed (Index): 0
Total Pages Failing   (Index): 0
Total Pages Processed (Other): 4194299
Total Pages Processed (Seg)  : 0
Total Pages Failing   (Seg)  : 0
Total Pages Empty            : 3
Total Pages Marked Corrupt   : 0
Total Pages Influx           : 0
Highest block SCN            : 0 (0.0)

这里值得庆幸的是我们的数据库文件没有发生损坏.至于那剩下的11个错误,是ASM的物理元数据损坏,是Allocation Table的问题.Allocation Table(AT):记录了asm file中所分配的au.用于跟踪磁盘中空间的分配情况.磁盘上的au都会在Allocation Table中有条allocate table entry记录.当我们重新添加或者删除磁盘的时候,进行重新平衡的时候,这个AT表就会发生变化.

后面在我们剔除或者增加磁盘的时候,由于AT的问题,导致重新平衡失败,一直报ORA-00600: internal error code, arguments: [kfdAuDealloc2], [441], [256], [9985], [], [], [], [].但是这个错误并不影响数据库的使用.只需要我们把参数power_limit修改成0,不对AT进行操作就能确保数据库安全使用.

因为这个系统的重要性,我们采取了保守措施,将power_limit修改为0,阻止AT发生变化引起后续的问题.然后我们找个合适的时间做一次RMAN copy diskgroup.然后切换diskgroup就可以彻底解决掉这个问题.

分享到: 更多

Post a Comment

Your email is never published nor shared. Required fields are marked *