Buddy Yuan's Personal Technical Blog

Passionate about digging into Oracle technology. Providing Oracle technical support and consulting services.

12c/11g: Full Recovery Walkthrough After ASMLib Causes OCR Loss

by buddy on January 22, 2015

I. Problem Background

A customer's 12c database could not start CRS. A quick look showed that the OCR disks had a problem. The customer used the ASMLIB package to present the physical disks as ASM disks that Oracle can recognize, and for redundancy used three 2 GB physical disks for the OCR. After the disks are stamped with ASMLIB, all three ASM disks should normally be visible, but for some unknown reason only one of them could be seen.

[10:21:17]grid@oracle12c01:/dev/oracleasm/disks> ls -l
[10:21:17]total 0
[10:21:17]brw-rw---- 1 grid asmadmin 8,  81 Jan 20 19:05 ASMDATA1
[10:21:17]brw-rw---- 1 grid asmadmin 8,  82 Jan 20 19:05 ASMDATA2
[10:21:17]brw-rw---- 1 grid asmadmin 8,  83 Jan 20 19:05 ASMDATA3
[10:21:17]brw-rw---- 1 grid asmadmin 8,  84 Jan 20 19:05 ASMDATA4
[10:21:17]brw-rw---- 1 grid asmadmin 8,  85 Jan 20 19:05 ASMDATA5
[10:21:17]brw-rw---- 1 grid asmadmin 8, 144 Jan 20 19:05 ASMOCR2G3

There should be three disks here, ASMOCR2G1, ASMOCR2G2 and ASMOCR2G3, but only ASMOCR2G3 remains. Could the disk headers have been damaged so that the disks can no longer be recognized?

[10:39:11]oracle12c01:~ # /oracle/app/12.1.0/grid/bin/kfed read /dev/mapper/emc2G01
[10:39:12]kfbh.endian:                          1 ; 0x000: 0x01
[10:39:12]kfbh.hard:                          130 ; 0x001: 0x82
[10:39:12]kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
[10:39:12]kfbh.datfmt:                          1 ; 0x003: 0x01
[10:39:12]kfbh.block.blk:                       0 ; 0x004: blk=0
[10:39:12]kfbh.block.obj:              2147483648 ; 0x008: disk=0

[10:39:37]oracle12c01:~ # /oracle/app/12.1.0/grid/bin/kfed read /dev/mapper/emc2G02
[10:39:37]kfbh.endian:                          1 ; 0x000: 0x01
[10:39:37]kfbh.hard:                          130 ; 0x001: 0x82
[10:39:37]kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
[10:39:37]kfbh.datfmt:                          1 ; 0x003: 0x01
[10:39:37]kfbh.block.blk:                       0 ; 0x004: blk=0
[10:39:37]kfbh.block.obj:              2147483649 ; 0x008: disk=1

[10:35:46]oracle12c01:~ # /oracle/app/12.1.0/grid/bin/kfed read /dev/mapper/emc2G03
[10:35:49]kfbh.endian:                          0 ; 0x000: 0x00
[10:35:49]kfbh.hard:                            0 ; 0x001: 0x00
[10:35:49]kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
[10:35:49]kfbh.datfmt:                          0 ; 0x003: 0x00
[10:35:49]kfbh.block.blk:                       0 ; 0x004: blk=0
[10:35:49]kfbh.block.obj:                       0 ; 0x008: file=0

The kfed tool shows that the disk headers of ASMOCR2G1 and ASMOCR2G2 are actually intact, while ASMOCR2G3, the one disk ASMLIB can still see, reports an invalid header. In my view a large part of the blame for this kind of problem falls on ASMLIB, and there is plenty of discussion online comparing its pros and cons with udev. What we care about here, though, is how to recover from it.

II. Fault Resolution

The first thing to consider is the existing backups: if an earlier OCR backup is good, we can simply restore it. The problem is that we do not know whether those backups are valid. If they turn out to be useless, we can still rebuild the OCR from scratch. Weighing risk against effort, let's try restoring from a backup first.
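If you want to sanity-check a backup before restoring it, one option is to dump it to a text file with ocrdump and eyeball the result; this is only a sketch and assumes the -backupfile option is available in this release:

oracle12c01:/oracle/app/12.1.0/grid/bin # ./ocrdump /tmp/backup00.dump -backupfile /oracle/app/12.1.0/grid/cdata/oracle1-cluster/backup00.ocr
oracle12c01:/oracle/app/12.1.0/grid/bin # head /tmp/backup00.dump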

[11:04:56]oracle12c01:/oracle/app/12.1.0/grid/bin # ./ocrconfig -showbackup
[11:05:02]PROT-26: Oracle Cluster Registry backup locations were retrieved from a local copy
[11:05:02]oracle12c01     2015/01/20 17:22:43     /oracle/app/12.1.0/grid/cdata/oracle1-cluster/backup00.ocr
[11:05:02]oracle12c01     2015/01/20 13:22:43     /oracle/app/12.1.0/grid/cdata/oracle1-cluster/backup01.ocr
[11:05:02]oracle12c01     2015/01/20 09:22:42     /oracle/app/12.1.0/grid/cdata/oracle1-cluster/backup02.ocr

[11:09:03]oracle12c01:/oracle/app/12.1.0/grid/bin # ./ocrconfig -restore  /oracle/app/12.1.0/grid/cdata/oracle1-cluster/backup00.ocr
[11:09:03]PROT-35: The configured OCR locations are not accessible

The restore fails straight away: the configured OCR locations are not accessible. This is because no disks are visible under ASMLIB, so we first have to get ASMLIB to recognize the disks again.

[11:32:32]oracle12c01:/oracle/app/12.1.0/grid/bin # dd if=/dev/mapper/emc2G01 of=/tmp/ocr01.ocr bs=8192 count=10000000
[11:36:32]oracle12c01:/oracle/app/12.1.0/grid/bin # dd if=/dev/zero of=/dev/mapper/emc2G01 bs=8192 count=100000000
[11:37:03]oracle12c01:/oracle/app/12.1.0/grid/bin #  /etc/init.d/oracleasm createdisk ASMOCR2G1 /dev/mapper/emc2G01
[11:37:03]Marking disk "ASMOCR2G1" as an ASM disk:                                                                                                                               done

Following the commands above, emc2G01, emc2G02 and emc2G03 were each backed up with dd, wiped, and then re-stamped successfully with ASMLIB (the same sequence for the remaining two disks is sketched below). After the disks were stamped again, the restore was retried and failed with the same error.
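For completeness, the same backup / wipe / re-stamp sequence for the other two disks would look roughly like this, a sketch that assumes the same device names and the block sizes and counts shown above:

oracle12c01:/oracle/app/12.1.0/grid/bin # dd if=/dev/mapper/emc2G02 of=/tmp/ocr02.ocr bs=8192 count=10000000
oracle12c01:/oracle/app/12.1.0/grid/bin # dd if=/dev/zero of=/dev/mapper/emc2G02 bs=8192 count=100000000
oracle12c01:/oracle/app/12.1.0/grid/bin # /etc/init.d/oracleasm createdisk ASMOCR2G2 /dev/mapper/emc2G02
oracle12c01:/oracle/app/12.1.0/grid/bin # dd if=/dev/mapper/emc2G03 of=/tmp/ocr03.ocr bs=8192 count=10000000
oracle12c01:/oracle/app/12.1.0/grid/bin # dd if=/dev/zero of=/dev/mapper/emc2G03 bs=8192 count=100000000
oracle12c01:/oracle/app/12.1.0/grid/bin # /etc/init.d/oracleasm createdisk ASMOCR2G3 /dev/mapper/emc2G03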

[11:46:42]oracle12c01:/oracle/app/12.1.0/grid/bin # ./ocrconfig -restore /oracle/app/12.1.0/grid/cdata/oracle1-cluster/backup01.ocr
[11:46:42]PROT-35: The configured OCR locations are not accessible

Still the same error as before. So let's search Metalink for how to restore in this situation. The relevant note is: How to restore ASM based OCR after complete loss of the CRS diskgroup on Linux/Unix systems (Doc ID 1062983.1).

1. Stop the CRS stack on both nodes

[13:44:11]oracle12c01:/oracle/app/12.1.0/grid/bin # ./crsctl stop crs -f

2. Start CRS in exclusive mode

[13:44:50]oracle12c01:/oracle/app/12.1.0/grid/bin # ./crsctl start crs -excl -nocrs

In the note, the step after this is to re-stamp the physical disks with the ASMLIB package, which we have already done above, so what remains is to create the disk group from sqlplus.

3. Create the CRS disk group with sqlplus

[14:12:53]grid@oracle12c01:~> sqlplus / as sysasm
[14:12:53]SQL*Plus: Release 12.1.0.1.0 Production on Wed Jan 21 14:16:54 2015
[14:12:53]Copyright (c) 1982, 2013, Oracle.  All rights reserved.
[14:12:53]Connected to:
[14:12:53]Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
[14:12:53]With the Real Application Clusters and Automatic Storage Management options
[14:17:38]SQL> create diskgroup OCRDG external redundancy disk '/dev/oracleasm/disks/ASMOCR2G1', '/dev/oracleasm/disks/ASMOCR2G2', '/dev/oracleasm/disks/ASMOCR2G3'  attribute 'COMPATIBLE.ASM' = '12.1.0.0.0';
[14:17:53]Diskgroup created.

Here we log in to sqlplus as the grid user to perform the operation. Since this is a 12.1.0.1.0 database, the ASM compatibility attribute is set to 12.1.0.0.0 as well.

4. Restore the OCR backup

Switch to the root user to run the OCR restore; this time it completes successfully.

[14:18:49]oracle12c01:/oracle/app/12.1.0/grid/bin # ./ocrconfig -restore  /oracle/app/12.1.0/grid/cdata/oracle1-cluster/backup01.ocr
[14:20:53]oracle12c01:/oracle/app/12.1.0/grid/bin # ./ocrcheck
[14:20:53]Status of Oracle Cluster Registry is as follows :
[14:20:53]         Version                  :          4
[14:20:53]         Total space (kbytes)     :     409568
[14:20:53]         Used space (kbytes)      :       1460
[14:20:53]         Available space (kbytes) :     408108
[14:20:53]         ID                       : 1063957750
[14:20:53]         Device/File Name         :     +OCRDG
[14:20:53]                                    Device/File integrity check succeeded
[14:20:53]                                    Device/File not configured
[14:20:53]                                    Device/File not configured
[14:20:53]                                    Device/File not configured
[14:20:53]                                    Device/File not configured
[14:20:53]         Cluster registry integrity check succeeded
[14:20:55]         Logical corruption check succeeded

5. Recreate the Voting Disk

[14:21:10]oracle12c01:/oracle/app/12.1.0/grid/bin # ./crsctl replace votedisk +OCRDG
[14:21:12]CRS-4602: Failed 27 to add voting file aa67b0c44b724f92bfc2e2a1d88f6b28.
[14:21:12]Failed to replace voting disk group with +OCRDG.
[14:21:12]CRS-4000: Command Replace failed, or completed with errors.

I had hoped it would be smooth sailing from here, but the votedisk replacement hit a CRS-4602 error. According to the note CRS-4256 CRS-4602 While Replacing Voting Disk (Doc ID 1475588.1), there are four possible causes:

  • The command must be executed as the root user;
  • The ASM diskgroup is not online or does not have enough space;
  • The ASM compatible attribute is not set correctly;
  • The asm_diskstring parameter is not set.

Causes one and three do not apply here; our problem is most likely cause two or cause four.

SQL> select group_number  "Group"
,disk_number   "Disk"
,header_status "Header"
,mode_status   "Mode"
,state         "State"
,redundancy    "Redundancy"
,total_mb      "Total MB"
,free_mb       "Free MB"
,name          "Disk Name"
,failgroup     "Failure Group"
,path          "Path"
from   v$asm_disk
order by group_number
,disk_number
/
[14:47:37]Group Disk Header Mode   State  Redundancy Total MB Free MB Disk Name  Failure Group Path
[14:47:37]----- ---- ------ ------ ------ ---------- -------- ------- ---------- ------------- ------------------------------
[14:47:37]    1    0 MEMBER ONLINE NORMAL UNKNOWN        1024     949 OCRDG_0000 OCRDG_0000    /dev/oracleasm/disks/ASMOCR2G1
[14:47:37]    1    1 MEMBER ONLINE NORMAL UNKNOWN        1024     947 OCRDG_0001 OCRDG_0001    /dev/oracleasm/disks/ASMOCR2G2
[14:47:37]    1    2 MEMBER ONLINE NORMAL UNKNOWN        1024     951 OCRDG_0002 OCRDG_0002    /dev/oracleasm/disks/ASMOCR2G3

As you can see, the disk group was just created and has plenty of free space. Checking asm_diskstring next, we find that it is empty.

[15:05:01]SQL> show parameter disk
[15:05:01]
[15:05:01]NAME                                 TYPE        VALUE
[15:05:01]------------------------------------ ----------- ------------------------------
[15:05:01]asm_diskgroups                       string      OCRDG
[15:05:01]asm_diskstring                       string

So the parameter has to be configured again. I pulled a historical set of startup parameters out of the ASM alert log and built a pfile (a sketch of which is shown below), then restarted the ASM instance with it.
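A minimal pfile for this purpose might look something like the following; this is only a sketch, and the actual values (memory settings, diskgroup list and so on) should be taken from the parameter listing in the ASM alert log rather than from here:

*.instance_type='asm'
*.asm_diskstring='/dev/oracleasm/disks/*'
*.asm_power_limit=1
*.remote_login_passwordfile='EXCLUSIVE'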

[15:11:55]grid@oracle12c01:/oracle/app/12.1.0/grid/dbs> sqlplus / as sysasm
[15:11:55]SQL*Plus: Release 12.1.0.1.0 Production on Wed Jan 21 15:15:57 2015
[15:11:55]Copyright (c) 1982, 2013, Oracle.  All rights reserved.
[15:11:55]Connected to an idle instance.
[15:12:36]SQL> startup pfile='/oracle/app/12.1.0/grid/dbs/initasm.ora';
[15:12:41]ASM instance started
[15:12:41]Total System Global Area 1135747072 bytes
[15:12:41]Fixed Size                  2297344 bytes
[15:12:41]Variable Size            1108283904 bytes
[15:12:41]ASM Cache                  25165824 bytes
[15:12:48]ASM diskgroups mounted
[15:12:48]ASM diskgroups volume enabled
[15:13:57]SQL> show parameter asm
[15:13:57]NAME                                 TYPE        VALUE
[15:13:57]------------------------------------ ----------- ------------------------------
[15:13:57]asm_diskgroups                       string
[15:13:57]asm_diskstring                       string      /dev/oracleasm/disks/*
[15:13:57]asm_power_limit                      integer     1
[15:13:57]asm_preferred_read_failure_groups    string

Running the votedisk replacement again now succeeds.

[15:14:30]grid@oracle12c01:/oracle/app/12.1.0/grid/dbs> crsctl replace votedisk +OCRDG
[15:14:33]Successful addition of voting disk 08d0bf0676724fe7bf189d454cdfe158.
[15:14:33]Successfully replaced voting disk group with +OCRDG.
[15:14:33]CRS-4266: Voting file(s) successfully replaced

After this step you can, as the Metalink note suggests, recreate the ASM spfile and store it in the OCR disk group. I did not do that here and found that things still work; the only difference is that the asm_diskstring parameter remains blank. The note marks this step as optional anyway. (A sketch of what it would look like follows.)
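If you do want to follow the note and put the ASM spfile into the new disk group, it would be roughly this, a sketch assuming the pfile built earlier is still at the path shown:

SQL> create spfile='+OCRDG' from pfile='/oracle/app/12.1.0/grid/dbs/initasm.ora';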

6. Restart CRS

With all of the above done, restart CRS in normal mode. It is a good idea to run an ASMLIB scandisks first, as the official note recommends.

[15:18:22]oracle12c01:/oracle/app/12.1.0/grid/bin # ./crsctl stop crs -f
[15:19:04]oracle12c01:/oracle/app/12.1.0/grid/bin # /usr/sbin/oracleasm scandisks
[15:19:05]Reloading disk partitions: done
[15:19:05]Cleaning any stale ASM disks...
[15:19:05]Scanning system for ASM disks...
[15:19:13]oracle12c01:/oracle/app/12.1.0/grid/bin # ./crsctl start crs
[15:19:21]CRS-4123: Oracle High Availability Services has been started.

[15:23:20]SQL> select name,state from v$asm_disk;
[15:23:20]
[15:23:20]NAME                           STATE
[15:23:20]------------------------------ --------
[15:23:20]OCRDG_0002                     NORMAL
[15:23:20]OCRDG_0001                     NORMAL
[15:23:20]OCRDG_0000                     NORMAL
[15:23:20]DATADG1_0001                   NORMAL
[15:23:20]DATADG2_0002                   NORMAL
[15:23:20]DATADG2_0001                   NORMAL
[15:23:20]DATADG2_0000                   NORMAL
[15:23:20]DATADG1_0000                   NORMAL

References:

How to restore ASM based OCR after complete loss of the CRS diskgroup on Linux/Unix systems (Doc ID 1062983.1)

CRS-4256 CRS-4602 While Replacing Voting Disk (Doc ID 1475588.1)

ORA-04030: out of process memory when trying to allocate 381096 bytes (kkoutlCreatePh,logdef* : kkoabr) - Analysis and Resolution

by buddy on January 20, 2015

I. Problem Background

Around 4 pm on January 20, 2015, the eupdb database became abnormally slow; even running commands on the host took a very long time. vmstat showed very high computational memory usage and severe page in/page out activity. A similar problem had also occurred on the evening of January 15. Analysis of the trace from that incident shows the following:

ORA-04030: out of process memory when trying to allocate 824504 bytes (pga heap,kco buffer)
ORA-04030: out of process memory when trying to allocate 381096 bytes (kkoutlCreatePh,logdef* : kkoabr)

Here we can see that the system could not allocate any more memory to the process. Further down in the trace file we find:

=======================================
TOP 10 MEMORY USES FOR THIS PROCESS
---------------------------------------
*** 2015-01-15 22:52:01.018
96%   15 GB, 56108 chunks: "permanent memory          "  SQL
         kkoutlCreatePh  ds=114165290  dsprt=1109b5b10                   <<<<< -- shows kkoutlCreatePh
 3%  495 MB, 42766 chunks: "free memory               "  
         top call heap   ds=110101460  dsprt=0
 0%   17 MB, 1660 chunks: "permanent memory          "  SQL
         kxs-heap-c      ds=1109b5b10  dsprt=110101460
 0%   11 MB, 96373 chunks: "optdef: qcopCreateOptInte "  
         TCHK^7e632a00   ds=1109d9580  dsprt=110d4c810
 0% 7904 KB, 3068 chunks: "free memory               "  SQL
         kkoutlCreatePh  ds=114165290  dsprt=1109b5b10
 0% 7226 KB, 48665 chunks: "opndef: qcopCreateOpnViaM "  
         TCHK^7e632a00   ds=1109d9580  dsprt=110d4c810
 0% 6774 KB, 96338 chunks: "logdef: qcopCreateLog     "  
         TCHK^7e632a00   ds=1109d9580  dsprt=110d4c810
 0% 5323 KB, 48641 chunks: "strdef: qcopCreateStr     "  
         TCHK^7e632a00   ds=1109d9580  dsprt=110d4c810
 0% 3834 KB, 97719 chunks: "chedef : qcuatc           "  
         TCHK^7e632a00   ds=1109d9580  dsprt=110d4c810
 0% 3218 KB,   6 chunks: "kgh stack                 "  
         pga heap        ds=110004990  dsprt=0

When an ORA-04030 occurs, Oracle automatically dumps the memory usage of the failing process. Here we can see that 15 GB of memory was consumed by SQL kkoutlCreatePh. Searching MOS turns up the following note:

A Query Fails With ORA-4030 On "kkoutlCreatePh,logdef* : kkoabr" (Doc ID 1474457.1) matches our trace very well: kkoutlCreatePh drives very high heap memory allocation. Following the bug reference in the note leads to Bug 12907522 : EXPLAIN PLAN FOR SQL WITH LARGE INLIST CAUSES EXCESSIVE MEMORY CONSUMPTION, which says that a SQL statement containing a very large INLIST, that is, an IN clause followed by a long list of values, can cause excessive memory consumption. Our statement is exactly of this type, with a huge number of values inside the IN list. See the attachment for details.

[Image: screenshot of the SQL statement with the large IN list]

Pasted into Word this SQL runs to more than 40 pages, with far too many values inside the IN clause, matching the description of bug 12907522. The note also gives the following stack:

Explain plan for a SQL with high inlist members runs indefinitely consuming
  huge PGA (over 4G and still not finishing). Heap dump show majority memory
  consumed in -
  Call Heap -> kxs-heap-c -> kkoutlCreatePh

We find similar information in our own trace: the heaps involved are likewise kxs-heap-c and kkoutlCreatePh.

PRIVATE HEAP SUMMARY DUMP
16 GB total:
    15 GB commented, 260 KB permanent
   496 MB free (0 KB in empty extents),
      16 GB,   1 heap:    "kxs-heap-c     "            495 MB free held
------------------------------------------------------
Summary of subheaps at depth 1
15 GB total:
    15 GB commented, 17 MB permanent
   182 KB free (16 KB in empty extents),
      15 GB,   1 heap:    "kkoutlCreatePh "   

II. Solution

Once the cause is known, the fix is straightforward.

1. Per the workaround in Doc 1474457.1, setting "_b_tree_bitmap_plans" to false avoids the problem, and the parameter can be changed dynamically. The risk is that with it set to false the optimizer can no longer generate bitmap conversion plans. We checked the system and found no bitmap indexes, so there is no impact for us; still, changing it with alter system means any future bitmap plans would also be suppressed, so the recommendation is to set the parameter through a hint instead:

SELECT /*+ OPT_PARAM('_b_tree_bitmap_plans' 'false') */ *
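To confirm that the system really has no bitmap indexes before touching this parameter, a dictionary query along these lines can be used (a sketch; filter the owners as appropriate for your environment):

SQL> select owner, index_name, table_name
     from   dba_indexes
     where  index_type like '%BITMAP%'
     and    owner not in ('SYS','SYSTEM');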

2. Per the description of Bug 12907522, the SQL statement should be rewritten so that the number of values in the IN list is reduced appropriately. (One common rewrite is sketched below.)
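One way to shrink a huge literal IN list, sketched here with hypothetical table and column names, is to load the values into a global temporary table once and join against it instead of inlining thousands of literals in the statement:

SQL> create global temporary table gtt_inlist_vals (val varchar2(100)) on commit preserve rows;
SQL> insert into gtt_inlist_vals values ('value1');   -- repeat (or array insert) for each value
SQL> select t.*
     from   target_table t
     where  t.some_col in (select val from gtt_inlist_vals);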

ORA-00600: internal error code, arguments: [kddummy_blkchk], [118], [639852], [6110] - Troubleshooting

by buddy on January 19, 2015

I. Problem Background

At 08:03 on January 16, 2015, the database reported an ORA-00600 [kddummy_blkchk] error in the background. The SMON process hit a corrupt block while recovering a dead transaction; after several failed retries SMON raised a fatal error and the instance went down.

Fri Jan 16 08:03:19 2015
Errors in file /u01/oracle/admin/actdb/udump/actdb_ora_10355644.trc:
ORA-00600: [kddummy_blkchk], [118], [639865], [6110], [], [], [], []
Fri Jan 16 08:03:19 2015
Corrupt Block Found
         TSN = 8, TSNAME = ACTUARY_DATA
         RFN = 118, BLK = 639865, RDBA = 495567737
         OBJN = 1087274, OBJD = 1097591, OBJECT = TBL_TRAD_QID, SUBOBJECT = 
         SEGMENT OWNER = ACTUARY, SEGMENT TYPE = Table Segment

Fri Jan 16 08:34:25 2015
Errors in file /u01/oracle/admin/actdb/bdump/actdb_p006_65077728.trc:
ORA-00600: internal error code, arguments: [kddummy_blkchk], [118], [639852], [6110], [], [], [], []
Fri Jan 16 08:34:27 2015
Doing block recovery for file 118 block 639852
Block recovery from logseq 90945, block 1260180 to scn 9106774485353
Fri Jan 16 08:35:03 2015
Errors in file /u01/oracle/admin/actdb/bdump/actdb_smon_45548262.trc:
ORA-00600: internal error code, arguments: [kddummy_blkchk], [118], [639852], [6110], [], [], [], []
Fri Jan 16 08:35:04 2015
Errors in file /u01/oracle/admin/actdb/bdump/actdb_pmon_51053242.trc:
ORA-00474: SMON process terminated with error

II. Handling the Problem

When this problem occurs, we can first see that SMON tries to recover block 639852 of file 118, and because it keeps hitting the corrupt block during recovery, the instance goes down. So the first recommendation is to set event 10513 to stop SMON from recovering the dead transaction.

Fri Jan 16 09:37:45 2015
ALTER SYSTEM SET event='10513 trace name context forever,level 2' SCOPE=SPFILE;

After setting the event and restarting the database, SMON no longer tries to perform the recovery, but the background processes still report errors, mainly the detection of the same corrupt block.

Fri Jan 16 09:42:28 2015
Corrupt Block Found
         TSN = 8, TSNAME = ACTUARY_DATA
         RFN = 118, BLK = 639852, RDBA = 495567724
         OBJN = 1087274, OBJD = 1097591, OBJECT = TBL_TRAD_QID, SUBOBJECT = 
         SEGMENT OWNER = ACTUARY, SEGMENT TYPE = Table Segment

So we decided to rebuild the object. The CTAS used to rebuild table TBL_TRAD_QID, however, could not complete and failed with the following errors:

Fri Jan 16 09:50:32 2015
Errors in file /u01/oracle/admin/actdb/udump/actdb_ora_42664692.trc:
ORA-00600: internal error code, arguments: [kddummy_blkchk], [118], [639852], [6110], [], [], [], []
Fri Jan 16 09:54:39 2015
Errors in file /u01/oracle/admin/actdb/udump/actdb_ora_34472916.trc:
ORA-00600:  [kddummy_blkchk], [118], [639852], [6110], [], [], [], []
Fri Jan 16 09:54:45 2015
Corrupt Block Found
         TSN = 8, TSNAME = ACTUARY_DATA
         RFN = 118, BLK = 639852, RDBA = 495567724
         OBJN = 0, OBJD = 1097591, OBJECT = , SUBOBJECT = 
         SEGMENT OWNER = , SEGMENT TYPE = Invalid Type

This time the error occurs in a user process rather than in SMON, so it no longer brings the database down. But the table is important to the application, so it still has to be rebuilt. The rebuild, however, reads the same object, and reading it triggers the same logical block check, which raises the ORA-00600 again and makes the read fail too. To be able to read the object at all, we temporarily disabled the database's logical block checking by changing the following two parameters:

SQL> alter system set db_block_checking=false;
System altered.
SQL> alter system set db_block_checksum=false;
System altered.

With the parameters changed, the CTAS rebuild succeeded. After dropping the old object, cancelling event 10513 and setting the two checking parameters back, the database was restarted and still reported errors.
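For reference, cancelling the event and reverting the parameters might be done roughly like this; this is a sketch, and the two checking parameters should go back to whatever values they had before the change:

SQL> alter system reset event scope=spfile sid='*';
SQL> -- then set db_block_checking / db_block_checksum back to their pre-change values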

Fri Jan 16 11:18:05 2015
Errors in file /u01/oracle/admin/actdb/bdump/actdb_smon_51446784.trc:
ORA-00600: internal error code, arguments: [kddummy_blkchk], [118], [639852], [6110], [], [], [], []
Doing block recovery for file 118 block 639852
Block recovery from logseq 90953, block 66187 to scn 9106774948071

SMON is once again trying to recover that object, even though it has already been dropped. Analysis shows that when the table is dropped it goes into the recycle bin, so the physical segment still exists. To resolve the problem completely, the object has to be purged from the recycle bin. (A sketch follows.)

purge table table_name
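Concretely, the dropped segment can be located in the recycle bin and purged like this, a sketch in which the owner and table name follow the alert log above and the purge is run as the owning user:

SQL> select owner, object_name, original_name from dba_recyclebin where original_name = 'TBL_TRAD_QID';
SQL> connect actuary/***
SQL> purge table TBL_TRAD_QID;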

III. Root Cause Analysis

From the background trace we can see that the following statement caused the problem:

ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kddummy_blkchk], [118], [639852], [6110], [], [], [], []
ORA-00607: Internal error occurred while making a change to a data block
ORA-00600: internal error code, arguments: [kddummy_blkchk], [118], [639865], [6110], [], [], [], []
Current SQL statement for this session:
UPDATE Tbl_Trad_QID A
                 SET QCCXRQ =
                     (SELECT /*+PARALLEL(NT,3)*/
                       ACCTDATE
                        FROM NREGCLM NT
                       WHERE NT.BDH = A.BDH
                         AND A.YXYQCBZW=1
                         AND ROWNUM = 1)

This statement performs an array update. Searching MOS for the error leads to the following bug:

ALERT: Bug 7662491 - Array Update can corrupt a row. ORA-600 [kghstack_free1] ORA-600 [kddummy_blkchk][6110/6129] (Doc ID 861965.1)

7662491: INSTANCE CRASH / ORA-600 [KDDUMMY_BLKCHK] HIT DURING RECOVER

Following the note's guidance, the following can be found in the trace:

kdbchk: the amount of space used is not equal to block size
        used=7888 fsc=0 avsp=302 dtl=8064
rechecking block failed with error code 6110

The note also points out:

The trace file shows that the error is produced by an UPDATE with OP:11.19 (Array Update) and check code [6110] ("kdbchk: the amount of space used is not equal to block size") or check code [6129] ("kdbchk: fsbo(<XX>) wrong, (hsz <YY>)").  Note that check codes are not limited to 6129/6110.

So OP:11.19 is the Array Update operation. Looking at the trace more carefully, we find the following:

CHANGE #1 TYP:0 CLS: 1 AFN:118 DBA:0x1d89c370 OBJ:1097591 SCN:0x0848.55de4efc SEQ:  2 OP:11.19

Here the trace shows object 1097591 being modified with an OP:11.19 (Array Update) operation, which is exactly the pattern of this bug.

Strong recommendation: install patch 7662491.

EXADATA Upgrade from 11.2.3.1.0 to 11.2.3.3.0 - (5) Applying Bundle Patch 23 and Upgrading the Switches

by buddy on September 30, 2014

Applying Bundle Patch 23

1. Download the latest OPatch, p6880880_112000_Linux-x86-64.zip, unzip it and verify that OPatch is now the latest version.

[oracle@gxx2db01 bp23]$ unzip p6880880_112000_Linux-x86-64.zip -d /u01/app/11.2.0.3/grid/
[oracle@gxx2db01 bp23]$ $ORACLE_HOME/OPatch/opatch version 
2. Create the OCM response file

% $ORACLE_HOME/OPatch/ocm/bin/emocmrsp
3. Verify the Oracle Inventory

% $ORACLE_HOME/OPatch/opatch lsinventory -detail -oh $ORACLE_HOME
4. Unzip BP23

% cd /u01/app/oracle/patches
% unzip p18835772_112030_Linux-x86-64.zip
% cd /u01/app/oracle/patches/18835772
# chown -R oracle:oinstall /u01/app/oracle/patches/18835772
5. Run the conflict check

For Grid Infrastructure Home, as home user:

% $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /18835772/18707883
% $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /18835772/18906063
For Database home, as home user:

% $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /18835772/18707883
% $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /18835772/18906063/custom/server/18906063
6. Run the CheckSystemSpace prerequisite, mainly to confirm that the ORACLE_HOME has enough free space.

For Grid Infrastructure Home, as home user:

% $ORACLE_HOME/OPatch/opatch prereq CheckSystemSpace -phBaseDir /18835772/18707883
% $ORACLE_HOME/OPatch/opatch prereq CheckSystemSpace -phBaseDir /18835772/18906063
For Database home, as home user:

% $ORACLE_HOME/OPatch/opatch prereq CheckSystemSpace -phBaseDir /18835772/18707883
% $ORACLE_HOME/OPatch/opatch prereq CheckSystemSpace -phBaseDir /18835772/18906063/custom/server/18906063
7. Apply the patch with opatch auto

# opatch auto /u01/app/oracle/patches/18835772
8. Run the post-install SQL script

% sqlplus / as sysdba
SQL> @rdbms/admin/catbundle.sql exa apply

Upgrading the InfiniBand Switches

The last step is to upgrade the InfiniBand switches. The switch software is shipped inside the cell media and is upgraded with the same patchmgr command and procedure as the cells. One thing to watch out for: the network interface configuration on each switch must match what is configured in the switch's /etc/hosts; if they differ, the upgrade fails during the precheck, as shown below:

[FAIL     ] Mismatch between address in ifcfg-eth[0,1] and /etc/hosts in gxx2sw-ib3. ACTION: Correct entry in /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1 or /etc/hosts

In that case log in to the InfiniBand switch host, compare the output of ifconfig -a with more /etc/hosts, fix your /etc/hosts file accordingly, and then re-run the precheck; it will then pass. (A sketch of consistent entries follows.)
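For illustration only, with made-up addresses: the precheck simply wants the IPADDR in ifcfg-eth0/eth1 and the /etc/hosts entry for the switch name to agree, roughly like this:

# /etc/sysconfig/network-scripts/ifcfg-eth0 (hypothetical values)
DEVICE=eth0
IPADDR=10.100.84.110
NETMASK=255.255.255.0

# /etc/hosts (must carry the same address for the switch name)
10.100.84.110   gxx2sw-ib3.gx.csg.cn   gxx2sw-ib3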

[root@gxx2db01 patch_11.2.3.3.0.131014.1]# ./patchmgr -ibswitches -upgrade -ibswitch_precheck

2014-09-07 14:10:25 +0800 1 of 1 :SUCCESS: DO: Initiate pre-upgrade validation check on InfiniBand switch(es).
 ----- InfiniBand switch update process started Sun Sep  7 14:10:25 CST 2014 -----
[NOTE     ] Log file at /var/log/cellos/upgradeIBSwitch.log

[INFO     ] List of InfiniBand switches for upgrade: ( gxx2sw-ib3 gxx2sw-ib2 )
[PROMPT   ] Use the default password for all switches? (y/n) [n]: y
[PROMPT   ] Updating only 2 switch(es). Are you sure you want to continue? (y/n) [n]: y
[SUCCESS  ] Verifying Network connectivity to gxx2sw-ib3
[SUCCESS  ] Verifying Network connectivity to gxx2sw-ib2
[SUCCESS  ] Validating verify-topology output
[INFO     ] Master Subnet Manager is set to gxx2sw-ib2 in all Switches

[INFO     ] ---------- Starting with IBSwitch gxx2sw-ib2
[SUCCESS  ] gxx2sw-ib2 is at 1.3.3-2. Meets minimal patching level 1.3.3-2
[SUCCESS  ] Verifying that /tmp has 120M in gxx2sw-ib2, found 249M
[SUCCESS  ] Verifying that / has 80M in gxx2sw-ib2, found 200M
[SUCCESS  ] Verifying that gxx2sw-ib2 has 120M free memory, found 408M
[SUCCESS  ] Verifying host details in /etc/hosts and /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1 for gxx2sw-ib2
[SUCCESS  ] Verifying that gxx2sw-ib2 has at least 1 NTP Server, found 1
[INFO     ] Manually validate the following entries Date:(YYYY-MM-DD) 2014-09-07 Time:(HH:MM:SS) 20:12:34
[SUCCESS  ] Pre-update validation on gxx2sw-ib2

[INFO     ] ---------- Starting with InfiniBand Switch gxx2sw-ib3
[SUCCESS  ] gxx2sw-ib3 is at 1.3.3-2. Meets minimal patching level 1.3.3-2
[SUCCESS  ] Verifying that /tmp has 120M in gxx2sw-ib3, found 249M
[SUCCESS  ] Verifying that / has 80M in gxx2sw-ib3, found 200M
[SUCCESS  ] Verifying that gxx2sw-ib3 has 120M free memory, found 410M
[SUCCESS  ] Verifying host details in /etc/hosts and /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1 for gxx2sw-ib3
[SUCCESS  ] Verifying that gxx2sw-ib3 has at least 1 NTP Server, found 1
[INFO     ] Manually validate the following entries Date:(YYYY-MM-DD) 2014-09-07 Time:(HH:MM:SS) 20:17:29
[SUCCESS  ] Pre-update validation on gxx2sw-ib3
[SUCCESS  ] Overall status

 ----- InfiniBand switch update process ended Sun Sep  7 14:11:07 CST 2014 -----
2014-09-07 14:11:07 +0800 1 of 1 :SUCCESS: DONE: Initiate pre-upgrade validation check on InfiniBand switch(es).

Once the precheck is done, the actual upgrade can proceed. The output of one upgrade run is shown below.

[root@gxx2db01 patch_11.2.3.3.0.131014.1]# ./patchmgr -ibswitches -upgrade

2014-09-07 14:11:34 +0800 1 of 1 :SUCCESS: DO: Initiate upgrade of InfiniBand switches to 2.1.3-4. Expect up to 15 minutes for each switch
 ----- InfiniBand switch update process started Sun Sep  7 14:11:35 CST 2014 -----
[NOTE     ] Log file at /var/log/cellos/upgradeIBSwitch.log

[INFO     ] List of InfiniBand switches for upgrade: ( gxx2sw-ib3 gxx2sw-ib2 )
[PROMPT   ] Use the default password for all switches? (y/n) [n]: y
[PROMPT   ] Updating only 2 switch(es). Are you sure you want to continue? (y/n) [n]: y
[SUCCESS  ] Verifying Network connectivity to gxx2sw-ib3
[SUCCESS  ] Verifying Network connectivity to gxx2sw-ib2
[SUCCESS  ] Validating verify-topology output
[INFO     ] Proceeding with upgrade of InfiniBand switches to version 2.1.3_4
[INFO     ] Master Subnet Manager is set to gxx2sw-ib2 in all Switches

[INFO     ] ---------- Starting with IBSwitch gxx2sw-ib2
[SUCCESS  ] Disable Subnet Manager on gxx2sw-ib2
[SUCCESS  ] Copy firmware packages to gxx2sw-ib2
[SUCCESS  ] gxx2sw-ib2 is at 1.3.3-2. Meets minimal patching level 1.3.3-2
[SUCCESS  ] Verifying that /tmp has 120M in gxx2sw-ib2, found 139M
[SUCCESS  ] Verifying that / has 80M in gxx2sw-ib2, found 200M
[SUCCESS  ] Verifying that gxx2sw-ib2 has 120M free memory, found 299M
[SUCCESS  ] Verifying host details in /etc/hosts and /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1 for gxx2sw-ib2
[SUCCESS  ] Verifying that gxx2sw-ib2 has at least 1 NTP Server, found 1
[INFO     ] Manually validate the following entries Date:(YYYY-MM-DD) 2014-09-07 Time:(HH:MM:SS) 20:14:12
[SUCCESS  ] Pre-update validation on gxx2sw-ib2
[INFO     ] Starting upgrade on gxx2sw-ib2 to 2.1.3_4. Please give upto 10 mins for the process to complete. DO NOT INTERRUPT or HIT CTRL+C during the upgrade
[SUCCESS  ] Load firmware 2.1.3_4 onto gxx2sw-ib2
[SUCCESS  ] Disable Subnet Manager on gxx2sw-ib2
[SUCCESS  ] Verify that /conf/configvalid is set to 1 on gxx2sw-ib2
[SUCCESS  ] Set SMPriority to 5 on gxx2sw-ib2
[INFO     ] Rebooting gxx2sw-ib2. Wait for 240 secs before continuing
[SUCCESS  ] Reboot gxx2sw-ib2
[SUCCESS  ] Restart Subnet Manager on gxx2sw-ib2
[INFO     ] Starting post-update validation on gxx2sw-ib2
[SUCCESS  ] Inifiniband switch gxx2sw-ib2 is at target patching level
[SUCCESS  ] Verifying host details in /etc/hosts and /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1 for gxx2sw-ib2
[SUCCESS  ] Verifying that gxx2sw-ib2 has at least 1 NTP Server, found 1
[INFO     ] Manually validate the following entries Date:(YYYY-MM-DD) 2014-09-07 Time:(HH:MM:SS) 12:27:27
[SUCCESS  ] Firmware verification on InfiniBand switch gxx2sw-ib2
[INFO     ] Post-check validation on IBSwitch gxx2sw-ib2
[SUCCESS  ] Update switch gxx2sw-ib2 to 2.1.3_4

[INFO     ] ---------- Starting with InfiniBand Switch gxx2sw-ib3
[SUCCESS  ] Disable Subnet Manager on gxx2sw-ib3
[SUCCESS  ] Copy firmware packages to gxx2sw-ib3
[SUCCESS  ] gxx2sw-ib3 is at 1.3.3-2. Meets minimal patching level 1.3.3-2
[SUCCESS  ] Verifying that /tmp has 120M in gxx2sw-ib3, found 139M
[SUCCESS  ] Verifying that / has 80M in gxx2sw-ib3, found 200M
[SUCCESS  ] Verifying that gxx2sw-ib3 has 120M free memory, found 300M
[SUCCESS  ] Verifying host details in /etc/hosts and /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1 for gxx2sw-ib3
[SUCCESS  ] Verifying that gxx2sw-ib3 has at least 1 NTP Server, found 1
[INFO     ] Manually validate the following entries Date:(YYYY-MM-DD) 2014-09-07 Time:(HH:MM:SS) 20:37:06
[SUCCESS  ] Pre-update validation on gxx2sw-ib3
[INFO     ] Starting upgrade on gxx2sw-ib3 to 2.1.3_4. Please give upto 10 mins for the process to complete. DO NOT INTERRUPT or HIT CTRL+C during the upgrade
[SUCCESS  ] Load firmware 2.1.3_4 onto gxx2sw-ib3
[SUCCESS  ] Disable Subnet Manager on gxx2sw-ib3
[SUCCESS  ] Verify that /conf/configvalid is set to 1 on gxx2sw-ib3
[SUCCESS  ] Set SMPriority to 5 on gxx2sw-ib3
[INFO     ] Rebooting gxx2sw-ib3. Wait for 240 secs before continuing
[SUCCESS  ] Reboot gxx2sw-ib3
[SUCCESS  ] Restart Subnet Manager on gxx2sw-ib3
[INFO     ] Starting post-update validation on gxx2sw-ib3
[SUCCESS  ] Inifiniband switch gxx2sw-ib3 is at target patching level
[SUCCESS  ] Verifying host details in /etc/hosts and /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1 for gxx2sw-ib3
[SUCCESS  ] Verifying that gxx2sw-ib3 has at least 1 NTP Server, found 1
[INFO     ] Manually validate the following entries Date:(YYYY-MM-DD) 2014-09-07 Time:(HH:MM:SS) 12:49:54
[SUCCESS  ] Firmware verification on InfiniBand switch gxx2sw-ib3
[INFO     ] Post-check validation on IBSwitch gxx2sw-ib3
[SUCCESS  ] Update switch gxx2sw-ib3 to 2.1.3_4
[INFO     ] InfiniBand Switches ( gxx2sw-ib3 gxx2sw-ib2 ) updated to 2.1.3_4
[SUCCESS  ] Overall status

 ----- InfiniBand switch update process ended Sun Sep  7 14:47:43 CST 2014 -----
2014-09-07 14:47:43 +0800 1 of 1 :SUCCESS: DONE: Upgrade InfiniBand switch(es) to 2.1.3-4.

Post-Upgrade Checks

After the whole upgrade succeeds there are still a few steps to perform: set the ASM disk_repair_time back to its original value, re-enable CRS, start dbfs, and check that the image on every node has been upgraded to 11.2.3.3.0.

----Set the ASM disk_repair_time back
SQL> alter diskgroup DATA_GXX2 set attribute 'disk_repair_time'='3.6h';
Diskgroup altered.

----Re-enable CRS
[root@gxx2db01 tmp]dcli -g dbs_group -l root "/u01/app/11.2.0.3/grid/bin/crsctl enable crs"

----Check the image information on the compute and storage nodes
[root@gxx2db01 mydbfs]# dcli -g /tmp/all_group -l root 'imagehistory'
gxx2db01: Version                              : 11.2.3.1.0.120304
gxx2db01: Image activation date                : 2002-05-03 22:47:44 +0800
gxx2db01: Imaging mode                         : fresh
gxx2db01: Imaging status                       : success
gxx2db01:
gxx2db01: Version                              : 11.2.3.3.0.131014.1
gxx2db01: Image activation date                : 2014-09-07 11:57:33 +0800
gxx2db01: Imaging mode                         : patch
gxx2db01: Imaging status                       : success
gxx2db01:
gxx2db02: Version                              : 11.2.3.1.0.120304
gxx2db02: Image activation date                : 2012-05-03 11:29:41 +0800
gxx2db02: Imaging mode                         : fresh
gxx2db02: Imaging status                       : success
gxx2db02:
gxx2db02: Version                              : 11.2.3.3.0.131014.1
gxx2db02: Image activation date                : 2014-09-06 22:03:14 +0800
gxx2db02: Imaging mode                         : patch
gxx2db02: Imaging status                       : success
gxx2db02:
gxx2cel01: Version                              : 11.2.2.3.5.110815
gxx2cel01: Image activation date                : 2011-10-19 16:15:42 -0700
gxx2cel01: Imaging mode                         : fresh
gxx2cel01: Imaging status                       : success
gxx2cel01:
gxx2cel01: Version                              : 11.2.3.1.0.120304
gxx2cel01: Image activation date                : 2012-05-03 03:00:13 -0700
gxx2cel01: Imaging mode                         : out of partition upgrade
gxx2cel01: Imaging status                       : success
gxx2cel01:
gxx2cel01: Version                              : 11.2.3.3.0.131014.1
gxx2cel01: Image activation date                : 2014-09-06 16:01:21 +0800
gxx2cel01: Imaging mode                         : out of partition upgrade
gxx2cel01: Imaging status                       : success
gxx2cel01:
gxx2cel02: Version                              : 11.2.2.3.5.110815
gxx2cel02: Image activation date                : 2011-10-19 16:26:30 -0700
gxx2cel02: Imaging mode                         : fresh
gxx2cel02: Imaging status                       : success
gxx2cel02:
gxx2cel02: Version                              : 11.2.3.1.0.120304
gxx2cel02: Image activation date                : 2012-05-03 02:59:52 -0700
gxx2cel02: Imaging mode                         : out of partition upgrade
gxx2cel02: Imaging status                       : success
gxx2cel02:
gxx2cel02: Version                              : 11.2.3.3.0.131014.1
gxx2cel02: Image activation date                : 2014-09-06 17:42:01 +0800
gxx2cel02: Imaging mode                         : out of partition upgrade
gxx2cel02: Imaging status                       : success
gxx2cel02:
gxx2cel03: Version                              : 11.2.2.3.5.110815
gxx2cel03: Image activation date                : 2011-10-19 16:26:59 -0700
gxx2cel03: Imaging mode                         : fresh
gxx2cel03: Imaging status                       : success
gxx2cel03:
gxx2cel03: Version                              : 11.2.3.1.0.120304
gxx2cel03: Image activation date                : 2012-05-03 02:58:38 -0700
gxx2cel03: Imaging mode                         : out of partition upgrade
gxx2cel03: Imaging status                       : success
gxx2cel03:
gxx2cel03: Version                              : 11.2.3.3.0.131014.1
gxx2cel03: Image activation date                : 2014-09-06 17:42:08 +0800
gxx2cel03: Imaging mode                         : out of partition upgrade
gxx2cel03: Imaging status                       : success

References:

Sun_Oracle_Database_Machine_Owner’s_Guide

How to backup / restore Exadata Database Server (Linux) - Community document

dbnodeupdate.sh: Exadata Database Server Patching using the DB Node Update Utility (Doc ID 1553103.1)

Exadata 11.2.3.3.0 release and patch (16278923) (Doc ID 1487339.1)

Information Center: Upgrading Oracle Exadata Database Machine [ID 1364356.2]

EXADATA Upgrade from 11.2.3.1.0 to 11.2.3.3.0 - (4) Reclaiming the Solaris Space and Upgrading the Compute Nodes

by buddy on September 30, 2014

Reclaiming the Solaris Space

Exadata ships from the factory with two operating systems installed, Linux and Solaris x86, in a dual-boot layout with the disks mirrored in RAID 1. If the Solaris disks are not reclaimed before upgrading the compute nodes, the upgrade fails with the following error:

ERROR: Solaris disks are not reclaimed. This needs to be done before the upgrade. See the Exadata Database Machine documentation to claim the Solaris disks

The factory-supplied script can be used to look at the local disk layout of a compute node. Here we can see four physical disks, RAID level 1 for the Linux logical drive, and a dual-boot installation.

[root@gxx2db01 oracle.SupportTools]# ./reclaimdisks.sh -check
[INFO] This is SUN FIRE X4170 M2 SERVER machine
[INFO] Number of LSI controllers: 1
[INFO] Physical disks found: 4 (252:0 252:1 252:2 252:3)
[INFO] Logical drives found: 3
[INFO] Dual boot installation: yes
[WARNING] Some lvm logical volume(s) resizes on other than /dev/sda device
[INFO] Linux logical drive: 0
[INFO] RAID Level for the Linux logical drive: 1
[INFO] Physical disks in the Linux logical drive: 2 (252:0 252:1)
[INFO] Dedicated Hot Spares for the Linux logical drive: 0
[INFO] Global Hot Spares: 0
[INFO] Valid dual boot configuration found for Linux: RAID1 from 2 disks

Reclaiming the Solaris operating system is simple: just run reclaimdisks.sh. I did hit a small snag, though. The script only recognizes the factory-default disks and volume groups, while the Nanning power grid site had configured a new VG of its own (the datavg used for backups mentioned earlier). Since we had already taken our backups, I removed that VG and re-ran the script, which then completed successfully. You could also modify the script and run it, but in our tests it will still wipe the configuration of any VG you created, so this operation is quite dangerous and a backup must be taken first. While it runs you can watch the log /var/log/cellos/reclaimdisks.bg.log to see exactly what it is doing. (A sketch of removing a custom VG is given below.)
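For reference, removing a custom volume group such as datavg before running reclaimdisks.sh might look roughly like this; the LV name, mount point and device name are assumptions, and this of course destroys the data on the VG, so only do it after a verified backup:

# vgcfgbackup datavg                      # save the LVM metadata, just in case
# umount /backup                          # unmount any filesystem on the VG (mount point is an assumption)
# lvremove /dev/datavg/lv_backup          # remove the logical volume(s)
# vgremove datavg                         # remove the volume group
# pvremove /dev/sdX1                      # clear the PV label from the underlying device (device name is an assumption)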

[root@gxx2db02 oracle.SupportTools]# ./reclaimdisks.sh -free -reclaim

Started from ./reclaimdisks.sh
[INFO] Free mode is set
[INFO] Reclaim mode is set
[INFO] This is SUN FIRE X4170 M2 SERVER machine
[INFO] Number of LSI controllers: 1
[INFO] Physical disks found: 4 (252:0 252:1 252:2 252:3)
[INFO] Logical drives found: 3
[INFO] Dual boot installation: yes
[INFO] Linux logical drive: 0
[INFO] RAID Level for the Linux logical drive: 1
[INFO] Physical disks in the Linux logical drive: 2 (252:0 252:1)
[INFO] Dedicated Hot Spares for the Linux logical drive: 0
[INFO] Global Hot Spares: 0
[INFO] Non-linux physical disks that will be reclaimed: 2 (252:2 252:3)
[INFO] Non-linux logical drives that will be reclaimed: 2 (1 2)
Remove logical drive 1

Adapter 0: Deleted Virtual Drive-1(target id-1)
Exit Code: 0x00
Remove logical drive 2

Adapter 0: Deleted Virtual Drive-2(target id-2)

Exit Code: 0x00
[INFO] Remove Solaris entries from /boot/grub/grub.conf
[INFO] Disk reclaiming started in the background with parent process id 17405.
[INFO] Check the log file /var/log/cellos/reclaimdisks.bg.log.
[INFO] This process may take about two hours.
[INFO] DO NOT REBOOT THE NODE.
[INFO] The node will be rebooted automatically upon completion.

Upgrading the Compute Nodes

Upgrading the Exadata compute nodes is straightforward. First shut down CRS and the databases on the compute nodes and set CRS to disabled, so that if the node reboots during the installation the cluster and databases are not started.

[root@gxx2db01 tmp]dcli -g dbs_group -l root "/u01/app/11.2.0.3/grid/bin/crsctl stop crs -f"
[root@gxx2db01 tmp]dcli -g dbs_group -l root "ps -ef | grep d.bin"
[root@gxx2db01 tmp]dcli -g dbs_group -l root "/u01/app/11.2.0.3/grid/bin/crsctl disable crs"

Here we use the DB Node Update Utility, the script delivered with patch 16486998. Details are in dbnodeupdate.sh: Exadata Database Server Patching using the DB Node Update Utility (Doc ID 1553103.1), which contains many usage examples. We used the ISO IMAGE method; an http repository can be used as well. Before the real upgrade it is best to run a precheck with the -v option (see the example below). Also note that if the Solaris disks have not been reclaimed, running the script will fail.
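The precheck run is simply the upgrade command with -v appended, for example (same ISO as used below):

[root@gxx2db02 u01]# ./dbnodeupdate.sh -u -l /u01/p17809253_112330_Linux-x86-64.zip -v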

The full upgrade run looks like this:

[root@gxx2db02 u01]# ./dbnodeupdate.sh -u -l /u01/p17809253_112330_Linux-x86-64.zip
##########################################################################################################################
#                                                                                                                        
# Guidelines for using dbnodeupdate.sh (rel. 3.55):                                                                      #                                                                                                              
# - Prerequisites for usage:                                                                                             #
#         1. Refer to dbnodeupdate.sh options. See MOS 1553103.1                                                         
#         2. Use the latest release of dbnodeupdate.sh. See patch 16486998                                               
#         3. Run the prereq check with the '-v' option.                                                                  #                                                                                                                  #
#   I.e.:  ./dbnodeupdate.sh -u -l /u01/my-iso-repo.zip -v                                                               #
#          ./dbnodeupdate.sh -u -l http://my-yum-repo -v                                                                 #
#                                                                                                                        
# - Prerequisite dependency check failures can happen due to customization:                                              #
#     - The prereq check detects dependency issues that need to be addressed prior to running a successful update.       
#     - Customized rpm packages may fail the built-in dependency check and system updates cannot proceed until resolved. 
#                                                                                                                        
#   When upgrading from releases later than 11.2.2.4.2 to releases before 11.2.3.3.0:                                    #
#      - Conflicting packages should be removed before proceeding the update.                                            #                                                                                                                     
#   When upgrading to releases 11.2.3.3.0 or later:                                                                      #
#      - When the 'exact' package dependency check fails 'minimum' package dependency check will be tried.               #
#      - When the 'minimum' package dependency check also fails,                                                         #
#        the conflicting packages should be removed before proceeding.                                                   #                                                                                                                       
# - As part of the prereq checks and as part of the update, a number of rpms will be removed.                            #
#   This removal is required to preserve Exadata functioning. This should not be confused with obsolete packages.        
#      - See /var/log/cellos/packages_to_be_removed.txt for details on what packages will be removed.                                                                                                                                     
# - In case of any problem when filing an SR, upload the following:                                                      #
#      - /var/log/cellos/dbnodeupdate.log                                                                                #
#      - /var/log/cellos/dbnodeupdate..diag                                                                       #
#      - where  is the unique number of the failing run.                                                          #
#                                                                                                                        #
##########################################################################################################################
Continue ? [y/n]
y
  (*) 2014-09-06 21:53:28: Unzipping helpers (/u01/dbupdate-helpers.zip) to /opt/oracle.SupportTools/dbnodeupdate_helpers
  (*) 2014-09-06 21:53:28: Initializing logfile /var/log/cellos/dbnodeupdate.log
  (*) 2014-09-06 21:53:28: Collecting system configuration details. This may take a while...
  (*) 2014-09-06 21:53:41: Validating system details for known issues and best practices. This may take a while...
  (*) 2014-09-06 21:53:41: Checking free space in /u01/iso.stage.060914215326
  (*) 2014-09-06 21:53:41: Unzipping /u01/p17809253_112330_Linux-x86-64.zip to /u01/iso.stage.060914215326, this may take a while
  (*) 2014-09-06 21:54:00: Original /etc/yum.conf moved to /etc/yum.conf.060914215326, generating new yum.conf
  (*) 2014-09-06 21:54:00: Generating Exadata repository file /etc/yum.repos.d/Exadata-computenode.repo

  Warning: Network routing configuration requires change before updating database server. See MOS 1306154.1

Continue ? [y/n]
y

  (*) 2014-09-06 21:54:17: Validating the specified source location.
  (*) 2014-09-06 21:54:18: Cleaning up the yum cache.
  (*) 2014-09-06 21:54:18: Preparing update for releases 11.2.3.3.0 and later
  (*) 2014-09-06 21:54:28: Performing yum package dependency check for 'exact' dependencies. This may take a while...
  (*) 2014-09-06 21:54:32: 'Exact'package dependency check succeeded.
  (*) 2014-09-06 21:54:32: 'Minimum' package dependency check succeeded.

Active Image version   : 11.2.3.1.0.120304
Active Kernel version  : 2.6.18-274.18.1.0.1.el5
Active LVM Name        : /dev/mapper/VGExaDb-LVDbSys1
Inactive Image version : n/a
Inactive LVM Name      : /dev/mapper/VGExaDb-LVDbSys2
Current user id        : root
Action                 : upgrade
Upgrading to           : 11.2.3.3.0.131014.1 (to exadata-sun-computenode-exact)
Baseurl                : file:///var/www/html/yum/unknown/EXADATA/dbserver/060914215326/x86_64/ (iso)
Iso file               : /u01/iso.stage.060914215326/repoimage.iso
Create a backup        : Yes
Shutdown stack         : No (Currently stack is down)
Hotspare exists        : Yes, but will NOT be reclaimed as part of this update)
                       : Raid reconstruction to add the hotspare to be done later when required
RPM exclusion list     : Not in use (add rpms to /etc/exadata/yum/exclusion.lst and restart dbnodeupdate.sh)
RPM obsolete list      : /etc/exadata/yum/obsolete.lst (lists rpms to be removed by the update)
                       : RPM obsolete list is extracted from exadata-sun-computenode-11.2.3.3.0.131014.1-1.x86_64.rpm
Exact dependencies     : No conflicts
Minimum dependencies   : No conflicts
Logfile                : /var/log/cellos/dbnodeupdate.log (runid: 060914215326)
Diagfile               : /var/log/cellos/dbnodeupdate.060914215326.diag
Server model           : SUN FIRE X4170 M2 SERVER
dbnodeupdate.sh rel.   : 3.55 (always check MOS 1553103.1 for the latest release of dbnodeupdate)
Note                   : After upgrading and rebooting run './dbnodeupdate.sh -c' to finish post steps.

The following known issues will be checked for and automatically corrected by dbnodeupdate.sh:
  (*) - Issue 1.7 - Updating database servers with customized partitions may remove partitions already in use
  (*) - Issue - 11.2.3.3.0 and 12.1.1.1.0 require disabling SDP APM settings. See MOS 1623834.1
  (*) - Issue - Incorrect validation name for ExaWatcher in /etc/cron.daily/cellos stops ExaWatcher
  (*) - Issue - tls_checkpeer and tls_crlcheck mis-configured in /etc/ldap.conf

The following known issues will be checked for but require manual follow-up:
  (*) - Issue - Database Server upgrades may hit network routing issues after the upgrade
  (*) - Issue - Yum rolling update requires fix for 11768055 when Grid Infrastructure is below 11.2.0.2 BP12
  (*) - Updates from releases earlier than 11.2.3.3.0 may hang during reboot. See MOS 1620826.1 for more details

Continue ? [y/n]
y
  (*) 2014-09-06 21:54:57: Verifying GI and DB's are shutdown
  (*) 2014-09-06 21:54:59: Collecting console history for diag purposes
  (*) 2014-09-06 21:55:15: Unmount of /boot successful
  (*) 2014-09-06 21:55:15: Check for /dev/sda1 successful
  (*) 2014-09-06 21:55:15: Mount of /boot successful
  (*) 2014-09-06 21:55:15: Disabling stack from starting
  (*) 2014-09-06 21:55:15: Performing filesystem backup to /dev/mapper/VGExaDb-LVDbSys2. Avg. 30 minutes (maximum 120) depends per environment.....
  (*) 2014-09-06 21:59:26: Backup successful
  (*) 2014-09-06 21:59:26: OSWatcher stopped successful
  (*) 2014-09-06 21:59:27: Validating the specified source location.
  (*) 2014-09-06 21:59:28: Cleaning up the yum cache.
  (*) 2014-09-06 21:59:28: Preparing update for releases 11.2.3.3.0 and later
  (*) 2014-09-06 21:59:32: Performing yum update. Node is expected to reboot when finished.
  (*) 2014-09-06 22:01:56: Waiting for post rpm script to finish. Sleeping another 60 seconds (60 / 900)

Remote broadcast message (Sat Sep  6 22:02:02 2014):

Exadata post install steps started.
It may take up to 5 minutes.
  (*) 2014-09-06 22:02:56: Waiting for post rpm script to finish. Sleeping another 60 seconds (120 / 900)
Remote broadcast message (Sat Sep  6 22:03:15 2014):
Exadata post install steps completed with success

The whole update takes 40 to 50 minutes and the node reboots several times. In between you may notice that after a reboot the node answers ping but ssh is not yet available; you have to wait until the final automatic reboot before ssh works again. Be patient during this period.

EXADATA Upgrade from 11.2.3.1.0 to 11.2.3.3.0 - (3) Upgrading the Storage Cells

by buddy on September 29, 2014

Before upgrading the storage cell image, the environment needs to be checked. A compute node is used as the main point of operation here.

1. Check root user SSH equivalence to the cell nodes

[root@gxx2db01 tmp]# dcli -g all_group -l root date
gxx2db01: Sat Sep  6 12:14:41 CST 2014
gxx2db02: Sat Sep  6 12:14:40 CST 2014
gxx2cel01: Sat Sep  6 12:14:41 CST 2014
gxx2cel02: Sat Sep  6 12:14:41 CST 2014
gxx2cel03: Sat Sep  6 12:14:41 CST 2014
[root@gxx2db01 tmp]# dcli -g cell_group -l root 'hostname -i'
gxx2cel01: 10.100.84.104
gxx2cel02: 10.100.84.105
gxx2cel03: 10.100.84.106
2. Check the disk group attribute disk_repair_time

[grid@gxx2db02 ~]$ sqlplus / as sysasm
SQL*Plus: Release 11.2.0.3.0 Production on Sat Sep 6 12:20:14 2014
Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
SQL> select dg.name,a.value from v$asm_diskgroup dg, v$asm_attribute a
where dg.group_number=a.group_number and a.name='disk_repair_time';  
NAME              VALUE
-------          -----
DATA_GXX2       3.6h
DBFS_DG         3.6h
RECO_GXX2       3.6h

The current value is 3.6 hours. We change it mainly to avoid the grid disks being dropped on the cells once the default 3.6 hours is exceeded during the upgrade; if grid disks do get dropped, they would have to be added back manually after the upgrade. Let's raise it to 24 hours for now.

SQL> alter diskgroup DATA_GXX2 set attribute 'disk_repair_time'='24h';
Diskgroup altered.

SQL> alter diskgroup DBFS_DG set attribute 'disk_repair_time'='24h';
Diskgroup altered.

SQL> alter diskgroup RECO_GXX2 set attribute 'disk_repair_time'='24h';
Diskgroup altered.

SQL> select dg.name,a.value from v$asm_diskgroup dg, v$asm_attribute a
where dg.group_number=a.group_number and a.name='disk_repair_time';  
NAME              VALUE
-------          -----
DATA_GXX2        24h
DBFS_DG          24h
RECO_GXX2        24h
3. Check the operating system kernel version

root@gxx2db01 tmp]# dcli -g all_group -l root 'uname -a'
gxx2db01: Linux gxx2db01.gx.csg.cn 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
gxx2db02: Linux gxx2db02.gx.csg.cn 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
gxx2cel01: Linux gxx2cel01.gx.csg.cn 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
gxx2cel02: Linux gxx2cel02.gx.csg.cn 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
gxx2cel03: Linux gxx2cel03.gx.csg.cn 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
4. Check the operating system release

[root@gxx2db01 tmp]# dcli -g all_group -l root 'cat /etc/oracle-release'
gxx2db01: Oracle Linux Server release 5.7
gxx2db02: Oracle Linux Server release 5.7
gxx2cel01: Oracle Linux Server release 5.7
gxx2cel02: Oracle Linux Server release 5.7
gxx2cel03: Oracle Linux Server release 5.7
5. Check the image versions

[root@gxx2db01 tmp]# dcli -g all_group -l root 'imageinfo'
gxx2db01:
gxx2db01: Kernel version: 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64
gxx2db01: Image version: 11.2.3.1.0.120304
gxx2db01: Image activated: 2002-05-03 22:47:44 +0800
gxx2db01: Image status: success
gxx2db01: System partition on device: /dev/mapper/VGExaDb-LVDbSys1
gxx2db01:
gxx2db02:
gxx2db02: Kernel version: 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64
gxx2db02: Image version: 11.2.3.1.0.120304
gxx2db02: Image activated: 2012-05-03 11:29:41 +0800
gxx2db02: Image status: success
gxx2db02: System partition on device: /dev/mapper/VGExaDb-LVDbSys1
gxx2db02:
gxx2cel01:
gxx2cel01: Kernel version: 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64
gxx2cel01: Cell version: OSS_11.2.3.1.0_LINUX.X64_120304
gxx2cel01: Cell rpm version: cell-11.2.3.1.0_LINUX.X64_120304-1
gxx2cel01:
gxx2cel01: Active image version: 11.2.3.1.0.120304
gxx2cel01: Active image activated: 2012-05-03 03:00:13 -0700
gxx2cel01: Active image status: success
gxx2cel01: Active system partition on device: /dev/md6
gxx2cel01: Active software partition on device: /dev/md8
gxx2cel01:
gxx2cel01: In partition rollback: Impossible
gxx2cel01:
gxx2cel01: Cell boot usb partition: /dev/sdm1
gxx2cel01: Cell boot usb version: 11.2.3.1.0.120304
gxx2cel01:
gxx2cel01: Inactive image version: 11.2.2.3.5.110815
gxx2cel01: Inactive image activated: 2011-10-19 16:15:42 -0700
gxx2cel01: Inactive image status: success
gxx2cel01: Inactive system partition on device: /dev/md5
gxx2cel01: Inactive software partition on device: /dev/md7
gxx2cel01:
gxx2cel01: Boot area has rollback archive for the version: 11.2.2.3.5.110815
gxx2cel01: Rollback to the inactive partitions: Possible
gxx2cel02:
gxx2cel02: Kernel version: 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64
gxx2cel02: Cell version: OSS_11.2.3.1.0_LINUX.X64_120304
gxx2cel02: Cell rpm version: cell-11.2.3.1.0_LINUX.X64_120304-1
gxx2cel02:
gxx2cel02: Active image version: 11.2.3.1.0.120304
gxx2cel02: Active image activated: 2012-05-03 02:59:52 -0700
gxx2cel02: Active image status: success
gxx2cel02: Active system partition on device: /dev/md6
gxx2cel02: Active software partition on device: /dev/md8
gxx2cel02:
gxx2cel02: In partition rollback: Impossible
gxx2cel02:
gxx2cel02: Cell boot usb partition: /dev/sdm1
gxx2cel02: Cell boot usb version: 11.2.3.1.0.120304
gxx2cel02:
gxx2cel02: Inactive image version: 11.2.2.3.5.110815
gxx2cel02: Inactive image activated: 2011-10-19 16:26:30 -0700
gxx2cel02: Inactive image status: success
gxx2cel02: Inactive system partition on device: /dev/md5
gxx2cel02: Inactive software partition on device: /dev/md7
gxx2cel02:
gxx2cel02: Boot area has rollback archive for the version: 11.2.2.3.5.110815
gxx2cel02: Rollback to the inactive partitions: Possible
gxx2cel03:
gxx2cel03: Kernel version: 2.6.18-274.18.1.0.1.el5 #1 SMP Thu Feb 9 19:07:16 EST 2012 x86_64
gxx2cel03: Cell version: OSS_11.2.3.1.0_LINUX.X64_120304
gxx2cel03: Cell rpm version: cell-11.2.3.1.0_LINUX.X64_120304-1
gxx2cel03:
gxx2cel03: Active image version: 11.2.3.1.0.120304
gxx2cel03: Active image activated: 2012-05-03 02:58:38 -0700
gxx2cel03: Active image status: success
gxx2cel03: Active system partition on device: /dev/md6
gxx2cel03: Active software partition on device: /dev/md8
gxx2cel03:
gxx2cel03: In partition rollback: Impossible
gxx2cel03:
gxx2cel03: Cell boot usb partition: /dev/sdm1
gxx2cel03: Cell boot usb version: 11.2.3.1.0.120304
gxx2cel03:
gxx2cel03: Inactive image version: 11.2.2.3.5.110815
gxx2cel03: Inactive image activated: 2011-10-19 16:26:59 -0700
gxx2cel03: Inactive image status: success
gxx2cel03: Inactive system partition on device: /dev/md5
gxx2cel03: Inactive software partition on device: /dev/md7
gxx2cel03:
gxx2cel03: Boot area has rollback archive for the version: 11.2.2.3.5.110815
gxx2cel03: Rollback to the inactive partitions: Possible

[root@gxx2db01 tmp]# dcli -g all_group -l root 'imagehistory'
gxx2db01: Version                              : 11.2.3.1.0.120304
gxx2db01: Image activation date                : 2002-05-03 22:47:44 +0800
gxx2db01: Imaging mode                         : fresh
gxx2db01: Imaging status                       : success
gxx2db01:
gxx2db02: Version                              : 11.2.3.1.0.120304
gxx2db02: Image activation date                : 2012-05-03 11:29:41 +0800
gxx2db02: Imaging mode                         : fresh
gxx2db02: Imaging status                       : success
gxx2db02:
gxx2cel01: Version                              : 11.2.2.3.5.110815
gxx2cel01: Image activation date                : 2011-10-19 16:15:42 -0700
gxx2cel01: Imaging mode                         : fresh
gxx2cel01: Imaging status                       : success
gxx2cel01:
gxx2cel01: Version                              : 11.2.3.1.0.120304
gxx2cel01: Image activation date                : 2012-05-03 03:00:13 -0700
gxx2cel01: Imaging mode                         : out of partition upgrade
gxx2cel01: Imaging status                       : success
gxx2cel01:
gxx2cel02: Version                              : 11.2.2.3.5.110815
gxx2cel02: Image activation date                : 2011-10-19 16:26:30 -0700
gxx2cel02: Imaging mode                         : fresh
gxx2cel02: Imaging status                       : success
gxx2cel02:
gxx2cel02: Version                              : 11.2.3.1.0.120304
gxx2cel02: Image activation date                : 2012-05-03 02:59:52 -0700
gxx2cel02: Imaging mode                         : out of partition upgrade
gxx2cel02: Imaging status                       : success
gxx2cel02:
gxx2cel03: Version                              : 11.2.2.3.5.110815
gxx2cel03: Image activation date                : 2011-10-19 16:26:59 -0700
gxx2cel03: Imaging mode                         : fresh
gxx2cel03: Imaging status                       : success
gxx2cel03:
gxx2cel03: Version                              : 11.2.3.1.0.120304
gxx2cel03: Image activation date                : 2012-05-03 02:58:38 -0700
gxx2cel03: Imaging mode                         : out of partition upgrade
gxx2cel03: Imaging status                       : success
gxx2cel03:
6. Check the ofa version

[root@gxx2db01 tmp]# dcli -g all_group -l root 'rpm -qa | grep ofa'
gxx2db01: ofa-2.6.18-274.18.1.0.1.el5-1.5.1-4.0.58
gxx2db02: ofa-2.6.18-274.18.1.0.1.el5-1.5.1-4.0.58
gxx2cel01: ofa-2.6.18-274.18.1.0.1.el5-1.5.1-4.0.58
gxx2cel02: ofa-2.6.18-274.18.1.0.1.el5-1.5.1-4.0.58
gxx2cel03: ofa-2.6.18-274.18.1.0.1.el5-1.5.1-4.0.58
7. Check the hardware model

[root@gxx2db01 tmp]# dcli -g all_group -l root 'dmidecode -s system-product-name'
gxx2db01: SUN FIRE X4170 M2 SERVER
gxx2db02: SUN FIRE X4170 M2 SERVER
gxx2cel01: SUN FIRE X4270 M2 SERVER
gxx2cel02: SUN FIRE X4270 M2 SERVER
gxx2cel03: SUN FIRE X4270 M2 SERVER
8. Check the alert history on the cell nodes

gxx2cel01: 36    2014-08-29T08:54:27+08:00       info            "This is a test trap"
gxx2cel02: 40_1  2014-08-28T20:01:24+08:00       warning         "Oracle Exadata Storage Server failed to auto-create cell disk and grid disks on the newly inserted physical disk. Physical Disk : 20:4  Status        : normal  Manufacturer  : SEAGATE  Model Number  : ST360057SSUN600G  Size          : 600G  Serial Number : E4CK7V  Firmware      : 0B25  Slot Number   : 4  "
gxx2cel02: 41    2014-08-29T08:54:04+08:00       info            "This is a test trap"
gxx2cel03: 27_3  2014-08-13T18:28:11+08:00       clear           "Hard disk replaced.  Status        : NORMAL  Manufacturer  : HITACHI  Model Number  : HUS1560SCSUN600G  Size          : 600G  Serial Number : K7UL6N  Firmware      : A700  Slot Number   : 11  Cell Disk     : CD_11_gxx2cel03  Grid Disk     : DATA_GXX2_CD_11_gxx2cel03, RECO_GXX2_CD_11_gxx2cel03, DBFS_DG_CD_11_gxx2cel03"
gxx2cel03: 28    2014-08-29T08:54:43+08:00       info            "This is a test trap"
9. Check for offline grid disks

[root@gxx2db01 tmp]# dcli -g cell_group -l root "cellcli -e \"LIST GRIDDISK ATTRIBUTES name WHERE asmdeactivationoutcome != 'Yes'\""

The command returns nothing here, which is what we want: an empty result means every grid disk can be deactivated safely, i.e. no grid disk is offline or would be left offline.
10. Verify that the cell network configuration matches cell.conf

[root@gxx2db01 tmp]# dcli -g cell_group -l root /opt/oracle.cellos/ipconf -verify
gxx2cel01: Verifying of Exadata configuration file /opt/oracle.cellos/cell.conf
gxx2cel01: Done. Configuration file /opt/oracle.cellos/cell.conf passed all verification checks
gxx2cel02: Verifying of Exadata configuration file /opt/oracle.cellos/cell.conf
gxx2cel02: Done. Configuration file /opt/oracle.cellos/cell.conf passed all verification checks
gxx2cel03: Verifying of Exadata configuration file /opt/oracle.cellos/cell.conf
gxx2cel03: Done. Configuration file /opt/oracle.cellos/cell.conf passed all verification checks
11. Stop CRS and the storage cell services

[root@gxx2db01 tmp]# dcli -g dbs_group -l root "/u01/app/11.2.0.3/grid/bin/crsctl stop crs -f"
[root@gxx2db01 tmp]# dcli -g dbs_group -l root "ps -ef | grep d.bin"
[root@gxx2db01 tmp]# dcli -g cell_group -l root "cellcli -e alter cell shutdown services all"
12. Unzip the patch media and the plugin

[root@gxx2db01 ExaImage]# unzip p16278923_112330_Linux-x86-64.zip
[root@gxx2db01 ExaImage]# unzip -d patch_11.2.3.3.0.131014.1/plugins/ p17938410_112330_Linux-x86-64.zip -x Readme.txt
[root@gxx2db01 ExaImage]# chmod +x patch_11.2.3.3.0.131014.1/plugins/*
13. Clean up the environment left by previous patchmgr runs

[root@gxx2db01 patch_11.2.3.3.0.131014.1]# ./patchmgr -cells /tmp/cell_group -reset_force
2014-09-06 13:48:44 +0800 DONE: reset_force

[root@gxx2db01 patch_11.2.3.3.0.131014.1]# ./patchmgr -cells  /tmp/cell_group -cleanup
2014-09-06 13:49:51 +0800 DONE: Cleanup
14. Pre-patch prerequisite check

[root@gxx2db01 patch_11.2.3.3.0.131014.1]# ./patchmgr -cells /tmp/cell_group -patch_check_prereq
2014-09-06 14:27:26 +0800        :Working: DO: Check cells have ssh equivalence for root user. Up to 10 seconds per cell ...
2014-09-06 14:27:27 +0800        :SUCCESS: DONE: Check cells have ssh equivalence for root user.
2014-09-06 14:27:27 +0800        :Working: DO: Initialize files, check space and state of cell services. Up to 1 minute ...
2014-09-06 14:27:49 +0800        :SUCCESS: DONE: Initialize files, check space and state of cell services.
2014-09-06 14:27:49 +0800        :Working: DO: Copy, extract prerequisite check archive to cells. If required start md11 mismatched partner size correction. Up to 40 minutes ...
2014-09-06 14:28:17 +0800 Wait correction of degraded md11 due to md partner size mismatch. Up to 30 minutes.

2014-09-06 14:28:18 +0800        :SUCCESS: DONE: Copy, extract prerequisite check archive to cells. If required start md11 mismatched partner size correction.
2014-09-06 14:28:18 +0800        :Working: DO: Check prerequisites on all cells. Up to 2 minutes ...
2014-09-06 14:29:01 +0800        :SUCCESS: DONE: Check prerequisites on all cells.
2014-09-06 14:29:01 +0800        :Working: DO: Execute plugin check for Patch Check Prereq ...
2014-09-06 14:29:01 +0800 :INFO: Patchmgr plugin start: Prereq check for exposure to bug 17854520 v1.1. Details in logfile /backup/ExaImage/patch_11.2.3.3.0.131014.1/patchmgr.stdout.
2014-09-06 14:29:01 +0800 :INFO: This plugin checks dbhomes across all nodes with oracle-user ssh equivalence, but only for those known to the local system. dbhomes that exist only on remote nodes must be checked manually.
2014-09-06 14:29:01 +0800 :SUCCESS: No exposure to bug 17854520 with non-rolling patching
2014-09-06 14:29:01 +0800        :SUCCESS: DONE: Execute plugin check for Patch Check Prereq.
15. Patch the storage cells

[root@gxx2db01 patch_11.2.3.3.0.131014.1]# ./patchmgr -cells /tmp/cell_group -patch
NOTE Cells will reboot during the patch or rollback process.
NOTE For non-rolling patch or rollback, ensure all ASM instances using
NOTE the cells are shut down for the duration of the patch or rollback.
NOTE For rolling patch or rollback, ensure all ASM instances using
NOTE the cells are up for the duration of the patch or rollback.

WARNING Do not start more than one instance of patchmgr.
WARNING Do not interrupt the patchmgr session.
WARNING Do not alter state of ASM instances during patch or rollback.
WARNING Do not resize the screen. It may disturb the screen layout.
WARNING Do not reboot cells or alter cell services during patch or rollback.
WARNING Do not open log files in editor in write mode or try to alter them.

NOTE All time estimates are approximate. Timestamps on the left are real.
NOTE You may interrupt this patchmgr run in next 60 seconds with control-c.


2014-09-06 14:32:49 +0800        :Working: DO: Check cells have ssh equivalence for root user. Up to 10 seconds per cell ...
2014-09-06 14:32:50 +0800        :SUCCESS: DONE: Check cells have ssh equivalence for root user.
2014-09-06 14:32:50 +0800        :Working: DO: Initialize files, check space and state of cell services. Up to 1 minute ...
2014-09-06 14:33:32 +0800        :SUCCESS: DONE: Initialize files, check space and state of cell services.
2014-09-06 14:33:32 +0800        :Working: DO: Copy, extract prerequisite check archive to cells. If required start md11 mismatched partner size correction. Up to 40 minutes ...
2014-09-06 14:34:00 +0800 Wait correction of degraded md11 due to md partner size mismatch. Up to 30 minutes.


2014-09-06 14:34:01 +0800        :SUCCESS: DONE: Copy, extract prerequisite check archive to cells. If required start md11 mismatched partner size correction.
2014-09-06 14:34:01 +0800        :Working: DO: Check prerequisites on all cells. Up to 2 minutes ...
2014-09-06 14:34:43 +0800        :SUCCESS: DONE: Check prerequisites on all cells.
2014-09-06 14:34:43 +0800        :Working: DO: Copy the patch to all cells. Up to 3 minutes ...
2014-09-06 14:35:15 +0800        :SUCCESS: DONE: Copy the patch to all cells.
2014-09-06 14:35:17 +0800        :Working: DO: Execute plugin check for Patch Check Prereq ...
2014-09-06 14:35:17 +0800 :INFO: Patchmgr plugin start: Prereq check for exposure to bug 17854520 v1.1. Details in logfile /backup/ExaImage/patch_11.2.3.3.0.131014.1/patchmgr.stdout.
2014-09-06 14:35:17 +0800 :INFO: This plugin checks dbhomes across all nodes with oracle-user ssh equivalence, but only for those known to the local system. dbhomes that exist only on remote nodes must be checked manually.
2014-09-06 14:35:17 +0800 :SUCCESS: No exposure to bug 17854520 with non-rolling patching
2014-09-06 14:35:18 +0800        :SUCCESS: DONE: Execute plugin check for Patch Check Prereq.
2014-09-06 14:35:18 +0800 1 of 5 :Working: DO: Initiate patch on cells. Cells will remain up. Up to 5 minutes ...
2014-09-06 14:35:30 +0800 1 of 5 :SUCCESS: DONE: Initiate patch on cells.
2014-09-06 14:35:30 +0800 2 of 5 :Working: DO: Waiting to finish pre-reboot patch actions. Cells will remain up. Up to 45 minutes ...
2014-09-06 14:36:30 +0800 Wait for patch pre-reboot procedures


2014-09-06 15:03:13 +0800 2 of 5 :SUCCESS: DONE: Waiting to finish pre-reboot patch actions.
2014-09-06 15:03:13 +0800        :Working: DO: Execute plugin check for Patching ...
2014-09-06 15:03:13 +0800        :SUCCESS: DONE: Execute plugin check for Patching.
2014-09-06 15:03:13 +0800 3 of 5 :Working: DO: Finalize patch on cells. Cells will reboot. Up to 5 minutes ...
2014-09-06 15:03:33 +0800 3 of 5 :SUCCESS: DONE: Finalize patch on cells.
2014-09-06 15:03:33 +0800 4 of 5 :Working: DO: Wait for cells to reboot and come online. Up to 120 minutes ...
2014-09-06 15:04:33 +0800 Wait for patch finalization and reboot

||||| Minutes left 076

2014-09-06 16:01:39 +0800 4 of 5 :SUCCESS: DONE: Wait for cells to reboot and come online.
2014-09-06 16:01:39 +0800 5 of 5 :Working: DO: Check the state of patch on cells. Up to 5 minutes ...
2014-09-06 16:02:14 +0800 5 of 5 :SUCCESS: DONE: Check the state of patch on cells.
2014-09-06 16:02:14 +0800        :Working: DO: Execute plugin check for Post Patch ...
2014-09-06 16:02:14 +0800 :INFO: /backup/ExaImage/patch_11.2.3.3.0.131014.1/plugins/001-post_11_2_3_3_0 - 17718598: Correct /etc/oracle-release.
2014-09-06 16:02:14 +0800 :INFO: /backup/ExaImage/patch_11.2.3.3.0.131014.1/plugins/001-post_11_2_3_3_0 - 17908298: Preserve password quality policies where applicable.
2014-09-06 16:02:15 +0800        :SUCCESS: DONE: Execute plugin check for Post Patch.

After the patch script is launched it prints a series of Working and SUCCESS messages on screen. If any step reports Failed, the upgrade stops there and the problem has to be resolved before continuing. The storage cells reboot automatically during the patch; on the compute node this shows up as the message "SUCCESS: DONE: Wait for cells to reboot and come online." The run driven from the compute node normally takes more than an hour and a half to finish, after which we can check the image version to confirm the upgrade succeeded. Because the upgrade is driven from a compute node, the network connection must stay up for the whole run, so it is best to launch it inside a VNC session so that a sudden terminal disconnect cannot cause unpredictable problems.
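
For example, once patchmgr reports success, the active image version and status can be re-checked from the compute node; a minimal sketch using the same cell_group file:

# every cell should now report the 11.2.3.3.0 image as active with status success
dcli -g /tmp/cell_group -l root "imageinfo" | grep -iE 'version|status'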

EXADATA Upgrade from 11.2.3.1.0 to 11.2.3.3.0 - (2) Backing Up the Environment and Upgrading the LSI Disk Array Controller Firmware

by buddy on September 29, 2014

1. Configure the NFS environment

To be able to roll back to the pre-upgrade state if anything goes wrong, we need to back up part of the Exadata environment. We chose NFS as the backup method: a Linux server on the internal LAN that the compute nodes can reach was set up as the NFS server, with 1 TB of space already mounted on it.

On the NFS server, edit /etc/exports and add the following entries:

/media/_data/  10.100.82.1(rw)
/media/_data/  10.100.82.2(rw)

Note: these IP addresses are the addresses the Exadata requests arrive from (the mapped addresses), not the physical IPs of the compute nodes. The actual request IPs can be seen on the NFS server in /var/log/messages, and only after those request IPs are added to /etc/exports does the export work. Because a firewall sits between the different network segments at this site, the NFS services also have to be pinned to fixed ports so they can be opened on the firewall. On the server, edit /etc/sysconfig/nfs and add the following port settings.

MOUNTD_PORT="4002"
STATD_PORT="4003"
LOCKD_TCPPORT="4004"
LOCKD_UDPPORT="4004"
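
After changing /etc/exports and /etc/sysconfig/nfs, the NFS services on the server normally need to be restarted and the exports re-published before the new settings take effect; a minimal sketch on an EL5-style system:

service portmap restart
service nfs restart
exportfs -ra     # re-export everything listed in /etc/exports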

The OS firewall on the server must be disabled as well.

service iptables stop
chkconfig iptables off

Check that NFS is configured correctly:

rpcinfo -p                    # run on the server: check that the fixed ports are registered
showmount -e                  # run on the server: list the exported NFS file systems
showmount -e <NFS server IP>  # run on the client: the exports should be visible from the client

Mount the NFS file system on both Exadata compute nodes:

mount -t nfs -o rw,intr,soft,proto=tcp,nolock 10.194.42.11:/media/_data /root/tar

2. Back up the existing environment

With NFS configured, we can back up the compute nodes: the operating system, the clusterware and database software, and the databases themselves. The storage cells do not need to be backed up, because they can be restored from the CELL BOOT USB Flash Drive.

2.1 Back up the compute node operating system
[root@gxx2db01 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
                       30G   14G   15G  49% /
/dev/sda1             502M   36M  441M   8% /boot
/dev/mapper/VGExaDb-LVDbOra1
                       99G   55G   39G  59% /u01
tmpfs                  81G   26M   81G   1% /dev/shm
/dev/mapper/datavg-lv_data
                      549G  355G  166G  69% /backup
dbfs-dbfs@dbfs:/      800G  4.9G  796G   1% /data
10.194.42.11:/media/_data
                      985G  199M  935G   1% /root/tar

The 1 TB NFS share is now mounted under /root/tar. The operating system lives on two LVs, /dev/mapper/VGExaDb-LVDbSys1 and /dev/mapper/VGExaDb-LVDbOra1, while datavg-lv_data is a volume we created ourselves for database backups. Backing up the operating system therefore means backing up those two LVs, which we do as follows.

[root@gxx2db01 ~]# lvcreate -L1G -s -n root_snap /dev/VGExaDb/LVDbSys1
  Logical volume "root_snap" created
[root@gxx2db01 ~]# e2label /dev/VGExaDb/root_snap DBSYS_SNAP
[root@gxx2db01 ~]# mkdir /root/mnt
[root@gxx2db01 ~]# mount /dev/VGExaDb/root_snap /root/mnt -t ext3

[root@gxx2db01 ~]# lvcreate -L5G -s -n u01_snap /dev/VGExaDb/LVDbOra1
  Logical volume "u01_snap" created
[root@gxx2db01 ~]# e2label /dev/VGExaDb/u01_snap DBORA_SNAP
[root@gxx2db01 ~]# mkdir -p /root/mnt/u01
[root@gxx2db01 ~]# mount /dev/VGExaDb/u01_snap /root/mnt/u01 -t ext3

[root@gxx2db01 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VGExaDb-LVDbSys1
                       30G   14G   15G  49% /
/dev/sda1             502M   36M  441M   8% /boot
/dev/mapper/VGExaDb-LVDbOra1
                       99G   55G   39G  59% /u01
tmpfs                  81G   26M   81G   1% /dev/shm
/dev/mapper/datavg-lv_data
                      549G  355G  166G  69% /backup
dbfs-dbfs@dbfs:/      800G  4.9G  796G   1% /data
10.194.42.11:/media/_data
                      985G  199M  935G   1% /root/tar
/dev/mapper/VGExaDb-root_snap
                       30G   14G   15G  49% /root/mnt
/dev/mapper/VGExaDb-u01_snap
                       99G   55G   39G  59% /root/mnt/u01

After these steps two additional LVs exist: snapshots of VGExaDb-LVDbSys1 and VGExaDb-LVDbOra1, mounted as file systems. Next we tar the snapshot file systems onto the NFS share.

[root@gxx2db01 ~]# cd /root/mnt
[root@gxx2db01 ~]#  tar -pjcvf /root/tar/mybackup.tar.bz2 * /boot --exclude \
tar/mybackup.tar.bz2 --exclude  /root/tar > \
/tmp/backup_tar.stdout 2> /tmp/backup_tar.stderr

After the tar completes, check the /tmp/backup_tar.stderr file for errors. If it is clean, unmount the snapshot mount points we just created and remove the snapshot LVs.
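
A quick sanity check before removing the snapshots, as a minimal sketch (tar warnings about sockets being ignored are normal):

grep -vi 'socket ignored' /tmp/backup_tar.stderr   # anything left here deserves a closer look
tar -tjf /root/tar/mybackup.tar.bz2 | head         # spot-check that the archive lists cleanly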

[root@gxx2db01 ~]# cd /
[root@gxx2db01 ~]# umount /root/mnt/u01
[root@gxx2db01 ~]# umount /root/mnt
[root@gxx2db01 ~]# /bin/rm -rf /root/mnt
[root@gxx2db01 ~]# lvremove /dev/VGExaDb/u01_snap
[root@gxx2db01 ~]# lvremove /dev/VGExaDb/root_snap

Perform the above steps on both compute nodes.

2.2 Back up the databases on the compute nodes

Three database instances run on the compute nodes: gxypdb, orcl and jjscpd. gxypdb and orcl are backed up with RMAN, while jjscpd is backed up with exp and the dump file sits on the dbfs file system of the compute nodes. For the RMAN-backed databases we use the script below, which writes the backups to /backup/orcl and /backup/gxypdb; copying those directories to the NFS share completes the database backup. For the exp backup we simply copy the dmp files from the dbfs file system to the NFS share.

---> Back up the database
export ORACLE_SID=orcl2
source /home/oracle/.bash_profile
$ORACLE_HOME/bin/rman log=/backup/log/full_`date +%Y%m%d%H%M`.log <<EOF
connect target /
run
{
# Backup Database full
BACKUP
     SKIP INACCESSIBLE
     TAG hot_db_bk_level
     FORMAT '/backup/orcl/bk_s%s_p%p_t%T'
    DATABASE
    INCLUDE CURRENT CONTROLFILE;
}
run
{
# Backup Archived Logs

sql 'alter system archive log current';
change archivelog all crosscheck;
BACKUP
    FORMAT '/backup/orcl/ar_s%s_p%p_t%T'
    ARCHIVELOG ALL;

# Control file backup
BACKUP
    FORMAT '/backup/orcl/cf_s%s_p%p_t%T'
    CURRENT CONTROLFILE;
}
delete noprompt archivelog until time "sysdate - 5";
crosscheck backup;
delete force noprompt expired backup;
allocate channel for maintenance type disk;
delete force noprompt obsolete device type disk;
list backup summary;
exit;
EOF
---> Copy the backup sets to NFS
[root@gxx2db01 ~]# cp  -rp /backup/orcl/ /root/tar
[root@gxx2db01 ~]# cp  -rp /backup/gxypdb/ /root/tar
[root@gxx2db01 ~]# cp  -rp /data/*.dmp  /root/tar
2.3 Back up the clusterware and database software on the compute nodes

The clusterware (GI) and database software on the compute nodes are backed up mainly so that we can roll back if installing the QUARTERLY DATABASE PATCH FOR EXADATA (BP 23), i.e. the GI and DB patch, fails in an unexpected way. The databases and the GI stack should be stopped before taking this backup.

[oracle@gxx2db01 ~]$ srvctl stop instance -i orcl1 -d orcl
[oracle@gxx2db01 ~]$ srvctl stop instance -i orcl2 -d orcl
[oracle@gxx2db01 ~]$ srvctl stop instance -i gxypdb1 -d gxypdb
[oracle@gxx2db01 ~]$ srvctl stop instance -i gxypdb2 -d gxypdb
[oracle@gxx2db01 ~]$ srvctl stop instance -i jjscpd1 -d jjscpd
[oracle@gxx2db01 ~]$ srvctl stop instance -i jjscpd2 -d jjscpd
[root@gxx2db01 ~]# /u01/app/11.2.0.3/grid/bin/crsctl stop crs -f
[root@gxx2db01 ~]# cd /root/tar
[root@gxx2db01 ~]# tar -cvf oraInventory.tar /u01/app/oraInventory 
[root@gxx2db01 ~]# tar -cvf grid.tar /u01/app/11.2.0.3/grid 
[root@gxx2db01 ~]# tar -cvf oracle.tar /u01/app/oracle/product/11.2.0.3/dbhome_1
2.4 Back up the ILOM configuration files

Log into any one of the ILOM management interfaces, for example gxx2db01-ilom at https://10.100.84.118. Click the Maintenance tab, then the Backup/Restore tab, set Operation to Backup and Method to Browser, enter a password in the Passphrase field and click Run; the browser then downloads an XML backup file of the configuration.

(screenshot: the ILOM Maintenance > Backup/Restore page)

3. Reset the ILOMs and reboot Exadata

To make the upgrade go smoothly, it is best to reboot the whole Exadata once beforehand. The order is: first go into each ILOM management interface and reset the SP, then stop the services on the cell nodes and reboot all the cells, and once they are back up, reboot the compute nodes. This Exadata has five ILOM interfaces in total, for the two compute nodes and the three storage cells. They are accessed over HTTPS, and because of the firewall the network administrator had to open the ports before they could be reached. In the web interface, go to Maintenance, choose Reset SP, wait a moment and then reconnect. The five ILOM addresses are listed below (a CLI alternative is sketched right after the list):

gxx2db01-ilom		https://10.100.84.118
gxx2db02-ilom		https://10.100.84.119
gxx2cel01-ilom		https://10.100.84.126
gxx2cel02-ilom		https://10.100.84.127
gxx2cel03-ilom       https://10.100.84.128
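
If a web interface cannot be reached, the SP can usually also be reset from the ILOM command line over ssh; a sketch, assuming root ssh access to the ILOM:

ssh root@gxx2db01-ilom
-> reset /SP      # resets only the service processor; the host itself keeps running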

For the storage cells, first stop the cell services by running the following command on every cell server:

cellcli -e alter cell shutdown services all

Once they have stopped, verify that all of the cell services are indeed down:

cellcli -e list cell attributes msstatus,cellsrvstatus,rsstatus

Reboot the storage cell hosts:

sync
reboot

Once the storage cells have finished rebooting, check that the cell services started successfully; if they did, the cells are fine and the compute nodes can be rebooted. The databases and clusterware were already stopped during the software backup earlier; if they had not been, stop the databases first, then the clusterware, and only then reboot the compute nodes.
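
That check can be run for all cells at once from a compute node, using the same status listing that appears later in this post:

dcli -g cell_group -l root "cellcli -e list cell attributes msstatus,cellsrvstatus,rsstatus"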

sync
reboot

4. Check the SSH trust relationships

To keep the upgrade running smoothly, make sure the compute node has SSH trust (user equivalence) with the storage cells. First create a file all_group under /tmp containing the host names of the two compute nodes and the three storage cells, and then a cell_group file containing only the three cell host names. Then run the commands below; if they return results without prompting for a password, the trust relationships are fine.
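
The group files can be put together like this, for example (host names as used in this environment):

cat > /tmp/all_group <<EOF
gxx2db01
gxx2db02
gxx2cel01
gxx2cel02
gxx2cel03
EOF
grep cel /tmp/all_group > /tmp/cell_group    # cells only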

 [root@gxx2db01 tmp]# dcli -g all_group -l root date
gxx2db01: Sat Sep  6 12:14:41 CST 2014
gxx2db02: Sat Sep  6 12:14:40 CST 2014
gxx2cel01: Sat Sep  6 12:14:41 CST 2014
gxx2cel02: Sat Sep  6 12:14:41 CST 2014
gxx2cel03: Sat Sep  6 12:14:41 CST 2014
[root@gxx2db01 tmp]# dcli -g cell_group -l root 'hostname -i'
gxx2cel01: 10.100.84.104
gxx2cel02: 10.100.84.105
gxx2cel03: 10.100.84.106

If the trust relationship is broken, rebuild it with the following commands:

ssh-keygen -t rsa              # generate an RSA key pair for root if one does not already exist
dcli -g cell_group -l root -k  # push the local root SSH key to every host in cell_group

5. Upgrade the LSI Disk Array Controller Firmware

The LSI Disk Array Controller Firmware can be installed in rolling or non-rolling mode. Since we had arranged a downtime window, we used the non-rolling mode.

1. Upload the installation media FW12120140.zip to the /tmp directory on every cell node.
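
With the SSH trust in place, the zip can be pushed to all cells from the first compute node with dcli, for example (the staging path below is an assumption about where the media was downloaded):

dcli -g cell_group -l root -f /backup/ExaImage/FW12120140.zip -d /tmp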

2. Unzip the FW12120140.zip file:

[root@gxx2db01 tmp]# unzip FW12120140.zip -d /tmp
[root@gxx2db01 tmp]# mkdir -p /tmp/firmware
[root@gxx2db01 tmp]# tar -pjxf  FW12120140.tbz -C /tmp/firmware

After extraction, /tmp/firmware should contain the following file (shown with its MD5 checksum):

12.12.0.0140_AF2108_FW_Image.rom 5ff5650dd92acd4e62530bf72aa9ea83
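
The checksum can be verified before flashing, for example:

cd /tmp/firmware
md5sum 12.12.0.0140_AF2108_FW_Image.rom    # should print 5ff5650dd92acd4e62530bf72aa9ea83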

3. Verify the FW12120140.sh script:

#!/bin/ksh
echo `date` > /tmp/manual_fw_update.log
logfile=/tmp/manual_fw_update.log
# work out the LSI silicon revision of the controller in this server and
# build the name of the matching firmware package
HWModel=`dmidecode --string system-product-name | tail -1 | sed -e 's/[ \t]\+$//g;s/ /_/g'`
silicon_ver_lsi_card="`lspci 2>/dev/null | grep 'RAID' | grep LSI | awk '{print $NF}' | sed -e 's/03)/B2/g;s/05)/B4/g;'`"
silicon_ver_lsi_card=`echo $silicon_ver_lsi_card | sed -e 's/B2/B4/g'`
lsi_card_firmware_file="SUNDiskControllerFirmware_${silicon_ver_lsi_card}"
echo $lsi_card_firmware_file
echo "`date '+%F %T'`: Now updating the disk controller firmware ..." | tee -a $logfile
echo "`date '+%F %T'`: Now disabling cache of the disk controller ..." | tee -a $logfile
sync
# flush the controller cache and force all virtual drives to WriteThrough
# so no dirty cache is lost while the firmware is being flashed
/opt/MegaRAID/MegaCli/MegaCli64 -AdpCacheFlush -aALL -NoLog | tee -a $logfile
/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WT -Lall -a0 -NoLog | tee -a $logfile
/opt/MegaRAID/MegaCli/MegaCli64 -AdpCacheFlush -aALL -NoLog | tee -a $logfile
/opt/MegaRAID/MegaCli/MegaCli64 -v | tee -a $logfile
# flash the new firmware image onto adapter 0
/opt/MegaRAID/MegaCli/MegaCli64 -AdpFwFlash -f /tmp/firmware/12.12.0.0140_AF2108_FW_Image.rom  -NoVerChk -a0 -Silent -AppLogFile /tmp/manual_fw_update.log
if [ $? -ne 0 ]; then
   echo "`date '+%F %T'`: [ERROR] Failed to update the Disk Controller firmware. Will continue anyway ..." | tee -a $logfile
else
   echo "`date '+%F %T'`: [INFO] Disk controller firmware update command completed successfully." | tee -a $logfile
fi

Make the script executable (mode 700):

chmod 700 /tmp/FW12120140.sh

4. Stop the databases and CRS:

[oracle@gxx2db01 ~]$ srvctl stop instance -i orcl1 -d orcl
[oracle@gxx2db01 ~]$ srvctl stop instance -i orcl2 -d orcl
[oracle@gxx2db01 ~]$ srvctl stop instance -i gxypdb1 -d gxypdb
[oracle@gxx2db01 ~]$ srvctl stop instance -i gxypdb2 -d gxypdb
[oracle@gxx2db01 ~]$ srvctl stop instance -i jjscpd1 -d jjscpd
[oracle@gxx2db01 ~]$ srvctl stop instance -i jjscpd2 -d jjscpd
[root@gxx2db01 ~]# /u01/app/11.2.0.3/grid/bin/crsctl stop crs -f
[root@gxx2db01 ~]# /u01/app/11.2.0.3/grid/bin/crsctl check crs

5. Stop the services on all storage cells:

[root@gxx2db01 ~]# dcli -l root -g cell_group "cellcli -e alter cell shutdown services all"

6. Create the DISABLE_HARDWARE_FIRMWARE_CHECKS file:

[root@gxx2db01 ~]# dcli -l root -g cell_group "touch /opt/oracle.cellos/DISABLE_HARDWARE_FIRMWARE_CHECKS"

7. Disable the exachkcfg service:

[root@gxx2db01 ~]# dcli -l root -g cell_group "chkconfig exachkcfg off"

8. Run the FW12120140.sh script on each cell node:

[root@gxx2cel01 tmp]# /tmp/FW12120140.sh
SUNDiskControllerFirmware_B4
2014-09-06 11:15:31: Now updating the disk controller firmware ...
2014-09-06 11:15:31: Now disabling cache of the disk controller ...

Cache Flush is successfully done on adapter 0.

Exit Code: 0x00
Set Write Policy to WriteThrough on Adapter 0, VD 0 (target id: 0) success
Set Write Policy to WriteThrough on Adapter 0, VD 1 (target id: 1) success
Set Write Policy to WriteThrough on Adapter 0, VD 2 (target id: 2) success
Set Write Policy to WriteThrough on Adapter 0, VD 3 (target id: 3) success
Set Write Policy to WriteThrough on Adapter 0, VD 4 (target id: 4) success
Set Write Policy to WriteThrough on Adapter 0, VD 5 (target id: 5) success
Set Write Policy to WriteThrough on Adapter 0, VD 6 (target id: 6) success
Set Write Policy to WriteThrough on Adapter 0, VD 7 (target id: 7) success
Set Write Policy to WriteThrough on Adapter 0, VD 8 (target id: 8) success
Set Write Policy to WriteThrough on Adapter 0, VD 9 (target id: 9) success
Set Write Policy to WriteThrough on Adapter 0, VD 10 (target id: 10) success
Set Write Policy to WriteThrough on Adapter 0, VD 11 (target id: 11) success

Exit Code: 0x00
Cache Flush is successfully done on adapter 0.
Exit Code: 0x00

      MegaCLI SAS RAID Management Tool  Ver 8.02.21 Oct 21, 2011
    (c)Copyright 2011, LSI Corporation, All Rights Reserved.
Exit Code: 0x00
95%   Completed2014-09-06 11:16:09: [INFO] Disk controller firmware update command completed successfully.

9. After the script completes successfully, the cell must be rebooted. Note that two reboots are required here.

[root@gxx2cel01 tmp]# sync
[root@gxx2cel01 tmp]# shutdown -fr now

10. After the reboots, check the LSI MegaRaid Disk Controller Firmware version:

[root@gxx2cel01 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -a0 -NoLog | grep 'FW Package Build'
FW Package Build: 12.12.0-0079
FW Version         : 2.120.203-1440
Current Size of FW Cache       : 399 MB

11. After the upgrade succeeds, remove the DISABLE_HARDWARE_FIRMWARE_CHECKS file:

[root@gxx2cel01 ~]# dcli -l root -g cell_group "rm -fr /opt/oracle.cellos/DISABLE_HARDWARE_FIRMWARE_CHECKS"

12. Re-enable the exachkcfg service:

[root@gxx2cel01 ~]# dcli -l root -g cell_group "chkconfig exachkcfg on"

13. Check the status of the cell services:

[root@gxx2cel01 ~]# dcli -l root -g cell_group "cellcli -e list cell attributes msstatus,cellsrvstatus,rsstatus"
   running         running         running

Starting from step 5, repeat the procedure above on the remaining storage cells. Once every cell is done and the LSI MegaRaid Disk Controller Firmware has been verified on each of them, restart the services on all the storage cells.
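
Bringing the cell services back up and confirming their state can again be done from a compute node, for example:

dcli -g cell_group -l root "cellcli -e alter cell startup services all"
dcli -g cell_group -l root "cellcli -e list cell attributes msstatus,cellsrvstatus,rsstatus"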