Oracle 19c新特性-RESTORE POINTS FROM PRIMARY TO STANDBY和AUTOMATIC FLASHBACK OF STANDBY DATABASE

版权声明:本文为Buddy Yuan原创文章,转载请注明出处。原文地址:Oracle 19c新特性-RESTORE POINTS FROM PRIMARY TO STANDBY和AUTOMATIC FLASHBACK OF STANDBY DATABASE
在Oracle 19c上的data guard,可以帮助我们自动复制主库上的还原点到备库上。同时还可以帮助我们在主库直接flashback database,open resetlogs之后,备库也能应用应用。这简化了RESETLOGS在主服务器上进行操作后的备用服务器管理。
在使用还原点这个新特性需要注意以下几点限制:
1.主数据库和备用数据库上的兼容参数都必须为19.0.0或更大
2.由于通过重做日志进行了还原点复制,因此主数据库必须是OPEN状态下的。而备用数据库上的MRP进程需要运行才能进行复制。
3.备用数据库上不应有任何同名的还原点。

我们来首先测试一下这些功能,首先在主库创建一个还原点test_respoint。

[oracle@primary ~]$ sqlplus / as sysdba

SQL*Plus: Release 19.0.0.0.0 - Production on Wed Jun 10 14:49:46 2020
Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle.  All rights reserved.


Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.3.0.0.0

SQL> select name,database_role from v$database;
NAME					 DATABASE_ROLE
---------------------------------------- ----------------
DB193H1 				 PRIMARY

SQL> create restore point test_respoint guarantee flashback database;
Restore point created.

SQL> select scn,name,time,replicated from v$restore_point;

       SCN NAME 			  TIME					   REP
---------- ------------------------------ ---------------------------------------- ---
   3379448 TEST_RESPOINT		  11-JUN-20 08.24.55.000000000 AM	   NO

在查看备库,可以发现test_respoint已经被复制过来。在v$restore_point视图中有个字段叫REPLICATED,主库是NO,备库显示YES。而且备库的还原点名字自动加上了“_PRIMARY”后缀。

SQL> set linesize 175 pagesize 1000
SQL> col name format a40
SQL> col time format a40

SQL> select name,database_role from v$database;
NAME					 DATABASE_ROLE
---------------------------------------- ----------------
DB193H1 				 PHYSICAL STANDBY

SQL> select scn,name,time,replicated from v$restore_point;

       SCN NAME 				    TIME				     REP
---------- ---------------------------------------- ---------------------------------------- ---
   3379448 TEST_RESPOINT_PRIMARY		    11-JUN-20 08.24.55.000000000 AM	     YES

接下来我们来试试flashback database的功能。首先我们需要看一下主库和备库的FLASHBACK是否都处于打开状态,如果不是我们可以先打开这个功能。

SQL> select FLASHBACK_ON from v$database;

FLASHBACK_ON
------------------
YES

停止我们的主库,执行lashback database to RESTORE POINT操作,open resetlogs一切顺利。

[oracle@primary admin]$ sqlplus / as sysdba

SQL*Plus: Release 19.0.0.0.0 - Production on Wed Jun 10 15:20:41 2020
Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle.  All rights reserved.


Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.3.0.0.0

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup mount;
ORACLE instance started.

Total System Global Area 2147481656 bytes
Fixed Size		    8898616 bytes
Variable Size		  486539264 bytes
Database Buffers	 1644167168 bytes
Redo Buffers		    7876608 bytes
Database mounted.

SQL> flashback database to RESTORE POINT test_respoint;
Flashback complete.

SQL> alter database open resetlogs;
Database altered.

此时查看备库一直报错ORA-19909。日志也没办法应用。通过查看ora-19909发现基本上都是data guard产生了新的化身(new incarnation),一般需这种错误是主库中的化身和备库不一致导致的,需要要我们到RMAN下面指定和主库一样的化身(incarnation),

2020-06-11T08:33:36.277971+08:00
Errors in file /u01/app/oracle/diag/rdbms/db193h1_stdby/DB193H1/trace/DB193H1_mrp0_7375.trc:
ORA-19909: datafile 1 belongs to an orphan incarnation
ORA-01110: data file 1: '/u02/oradata/DB193H1_STDBY/datafile/o1_mf_system_hg2y0646_.dbf'

[oracle@primary ~]$ rman target /
RMAN> list incarnation of database;

using target database control file instead of recovery catalog

List of Database Incarnations
DB Key  Inc Key DB Name  DB ID            STATUS  Reset SCN  Reset Time
------- ------- -------- ---------------- --- ---------- ----------
1       1       DB193H1  2407161464       PARENT  1          17-APR-19guangfang
2       2       DB193H1  2407161464       PARENT  1920977    03-JUN-20
3       3       DB193H1  2407161464       PARENT  3158799    10-JUN-20
4       4       DB193H1  2407161464       PARENT  3269801    10-JUN-20
5       5       DB193H1  2407161464       CURRENT 3379450    11-JUN-20

[oracle@standby onlinelog]$ rman target /
RMAN> list incarnation of database;

using target database control file instead of recovery catalog

List of Database Incarnations
DB Key  Inc Key DB Name  DB ID            STATUS  Reset SCN  Reset Time
------- ------- -------- ---------------- --- ---------- ----------
1       1       DB193H1  2407161464       PARENT  1          17-APR-19
2       2       DB193H1  2407161464       PARENT  1920977    03-JUN-20
3       3       DB193H1  2407161464       PARENT  3158799    10-JUN-20
4       4       DB193H1  2407161464       PARENT  3269801    10-JUN-20
5       5       DB193H1  2407161464       CURRENT 3379450    11-JUN-20

而此时怪异就在主库和备库的化身(incarnation)基本是一致的。于是我们又到官方文档上查了一下,发现文档中说“A standby database that is in a mounted state can automatically follow the primary database after a RESETLOGS operation on the primary. ”此时我们的主库不是mount状态。对啊,只有在mount的状态我们才能执行flashback database命令。我真是脑子不好使,于是我把备库停下来再启动到mount状态,神奇的事情发生了。观察日志上面的报错已经不报了,此时会显示Flashback Restore Start和Flashback Media Recovery Start。

2020-06-11T08:33:56.283705+08:00
MRP0 (PID:7375): Recovery coordinator performing automatic flashback of database to SCN:0x00000000003390f8 (3379448)
Flashback Restore Start
Flashback Restore Complete
Flashback Media Recovery Start
2020-06-11T08:33:56.508797+08:00
Setting recovery target incarnation to 4
2020-06-11T08:33:56.518677+08:00
 Started logmerger process
2020-06-11T08:33:56.556674+08:00
Parallel Media Recovery started with 2 slaves
2020-06-11T08:33:56.666413+08:00
stopping change tracking
2020-06-11T08:33:56.715869+08:00
Media Recovery Log /u01/app/oracle/DB193H1_STDBY/archivelog/2020_06_11/o1_mf_1_13_hg2ythg4_.arc
2020-06-11T08:33:56.875276+08:00
Incomplete Recovery applied until change 3379448 time 06/11/2020 08:24:55
2020-06-11T08:33:56.878279+08:00
Flashback Media Recovery Complete
2020-06-11T08:33:56.933695+08:00
stopping change tracking
2020-06-11T08:33:56.949951+08:00
Setting recovery target incarnation to 5
2020-06-11T08:33:56.964968+08:00
 Started logmerger process
2020-06-11T08:33:56.974818+08:00
PR00 (PID:7410): Managed Standby Recovery starting Real Time Apply
2020-06-11T08:33:57.009092+08:00
Parallel Media Recovery started with 2 slaves
2020-06-11T08:33:57.117122+08:00
Media Recovery start incarnation depth : 1, target inc# : 5, irscn : 3379449
stopping change tracking
2020-06-11T08:33:57.137816+08:00
TT02 (PID:7416): LOG_FILE_NAME_CONVERT is not defined, stop clearing ORLs
2020-06-11T08:33:57.282084+08:00
PR00 (PID:7410): Media Recovery Log /u01/app/oracle/DB193H1_STDBY/archivelog/2020_06_11/o1_mf_1_13_hg2ythg4_.arc
2020-06-11T08:33:57.394111+08:00
PR00 (PID:7410): Media Recovery Log /u01/app/oracle/DB193H1_STDBY/archivelog/2020_06_11/o1_mf_1_1_hg2ytgrw_.arc
2020-06-11T08:33:57.480722+08:00
PR00 (PID:7410): Media Recovery Log /u01/app/oracle/DB193H1_STDBY/archivelog/2020_06_11/o1_mf_1_2_hg2yyrwy_.arc
PR00 (PID:7410): Media Recovery Waiting for T-1.S-3 (in transit)
2020-06-11T08:33:57.678427+08:00
Recovery of Online Redo Log: Thread 1 Group 5 Seq 3 Reading mem 0
  Mem# 0: /u02/oradata/DB193H1_STDBY/onlinelog/o1_mf_5_hg2y142y_.log

成功之后,我在开启了实时应用和read only,此时在查询v$dataguard_stats,apply lag和transport lag都显示正常。

SQL> select * from v$dataguard_stats;
SOURCE_DBID SOURCE_DB_UNIQU NAME		      VALUE		   UNIT 			  TIME_COMPUTED 		 DATUM_TIME		  CON_ID
----------- --------------- ------------------------- -------------------- ------------------------------ ------------------------------ -------------------- ----------
 2407161464 DB193H1	    transport lag	      +00 00:00:00	   day(2) to second(0) interval   06/11/2020 08:41:19		 06/11/2020 08:41:18	       0
 2407161464 DB193H1	    apply lag		      +00 00:00:00	   day(2) to second(0) interval   06/11/2020 08:41:19		 06/11/2020 08:41:18	       0
 2407161464 DB193H1	    apply finish time	      +00 00:00:00.000	   day(2) to second(3) interval   06/11/2020 08:41:19					       0
	  0		    estimated startup time    8 		   second			  06/11/2020 08:41:19					       0

此时在查询DG Broker的状态,也显示正常。

DGMGRL> SHOW CONFIGURATION;

Configuration - db_broker_config

  Protection Mode: MaxPerformance
  Members:
  db193h1       - Primary database
    db193h1_stdby - Physical standby database 

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 30 seconds ago)

具体的细节可以参考官方文档https://docs.oracle.com/en/database/oracle/oracle-database/19/sbydb/managing-oracle-data-guard-physical-standby-databases.html#GUID-252097AC-3070-43B6-88D8-919AE27F97AD

Oracle 12c新特性 Attribute Clustering及其对Index查询带来的性能改善

版权声明:本文为Buddy Yuan原创文章,转载请注明出处。原文地址:Oracle 12c新特性 Attribute Clustering及其对Index查询带来的性能改善
Oracle 12c中有一个新功能,叫Attribute Clustering,这个特性可以帮助我们改善一些索引查询性能。之前我们学习过索引的聚簇因子,如果索引的聚簇因子很高的话,当我们使用索引扫描1000行的数据,可能需要访问不同的数据块中的每一行数据。这本来也没什么,但是如果这个语句频繁的执行,则极大的影响数据库的效率,会消耗更多的逻辑读和CPU。如果我们把表的聚簇。我们来做个小试验验证一下新的功能Attribute Clustering。

1、首先我们创建一个表,插入100万数据。并在id列上创建索引。

SQL> create table testcluster(id number,name varchar2(30));
Table created.

SQL> insert into testcluster select dbms_random.value(1,1000000),dbms_random.string('Y',trunc(dbms_random.value(1,10)))  from dual connect by level <=1000000; 1000000 rows created. 

SQL> commit;
Commit complete.

SQL> create index idx_t1 on testcluster(id);
Index created.

2.收集统计信息,查看聚簇因子。可以看到当前的聚簇因子是999809。

SQL>  exec dbms_stats.gather_table_stats(ownname=>'A1',tabname=>'TESTCLUSTER',method_opt=>'for all columns size auto',cascade=>true,estimate_percent=>100);

SQL> select index_name,clustering_factor from dba_indexes where index_name='IDX_T1';

INDEX_NAME		       CLUSTERING_FACTOR
------------------------------ -----------------
IDX_T1					  999809

3、执行索引扫描,检索1000行数据。此时我们可以看到逻辑读是1019,如果频繁执行这个SQL,将会消耗系统的CPU。

SQL> set autotrace traceonly  
SQL> select * from testcluster t where t.id between 11000 and 12000;

947 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 2344056994

---------------------------------------------------------------------------------------------------
| Id  | Operation			    | Name	  | Rows  | Bytes | Cost (%CPU)| Time	  |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		    |		  |  1002 | 28056 |  1009   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TESTCLUSTER |  1002 | 28056 |  1009   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN		    | IDX_T1	  |  1002 |	  |	7   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T"."ID">=11000 AND "T"."ID"<=12000)


Statistics
----------------------------------------------------------
	  9  recursive calls
	  6  db block gets
       1019  consistent gets
	  0  physical reads
       1000  redo size
      44135  bytes sent via SQL*Net to client
       1117  bytes received via SQL*Net from client
	 65  SQL*Net roundtrips to/from client
	  0  sorts (memory)
	  0  sorts (disk)
	947  rows processed

4.在12c之前,我们解决这个问题的做法是重新CTAS这张表,并在select的时候按照id进行排序,在创建索引。那么在12c当中我们就是可以简单的使用语法add clustering by linear,在move table online就能实现了。

SQL> alter table TESTCLUSTER add clustering by linear order(id);
Table altered.

SQL> alter table TESTCLUSTER move online;
Table altered.

SQL> select * from testcluster t where t.id between 11000 and 12000;

947 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 2344056994

---------------------------------------------------------------------------------------------------
| Id  | Operation			    | Name	  | Rows  | Bytes | Cost (%CPU)| Time	  |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		    |		  |  1002 | 28056 |    12   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TESTCLUSTER |  1002 | 28056 |    12   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN		    | IDX_T1	  |  1002 |	  |	7   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T"."ID">=11000 AND "T"."ID"<=12000)


Statistics
----------------------------------------------------------
	  0  recursive calls
	  0  db block gets
	141  consistent gets
	  0  physical reads
	  0  redo size
      44135  bytes sent via SQL*Net to client
       1117  bytes received via SQL*Net from client
	 65  SQL*Net roundtrips to/from client
	  0  sorts (memory)
	  0  sorts (disk)
	947  rows processed

我们可以发现逻辑读从1019下降到141,很大程度的改善了性能。大约提升了7.5倍的性能。此时我们在看下聚簇因子已经变成了4571。

SQL> select index_name,STATUS,CLUSTERING_FACTOR from dba_indexes where index_name='IDX_T1';

INDEX_NAME		       STATUS	CLUSTERING_FACTOR
------------------------------ -------- -----------------
IDX_T1			       VALID		     4571

如果我们用12c之前的方法来实现。我们重建创建表和索引,其实效果是差不多的。

SQL> create table TESTCLUSTER_1 as select * from TESTCLUSTER order by id;
Table created.

SQL> create index idx_t1_1 on testcluster_1(id);
Index created.

SQL> exec dbms_stats.gather_table_stats(ownname=>'A1',tabname=>'TESTCLUSTER',method_opt=>'for all columns size auto',cascade=>true,estimate_percent=>100);
PL/SQL procedure successfully completed.


SQL> select index_name,clustering_factor from dba_indexes where index_name='IDX_T1_1';
INDEX_NAME		       CLUSTERING_FACTOR
------------------------------ -----------------
IDX_T1_1				    4570

虽然这么做可以显著的帮助我们提升某些执行频次高的语句的性能,但是我们仍然要考虑其他列在表上查询的影响。当然我演示的只是linear ordering,,还有一种方式叫interleaved ordering。两者的区别可以参考官方文档的图。

当然我们还可以在建表的时候使用,如下所示:

 linear ordering
CREATE TABLE sales (
  prod_id        NUMBER(6) NOT NULL,
  cust_id        NUMBER NOT NULL,
  time_id        DATE NOT NULL,
  channel_id     CHAR(1) NOT NULL,
  promo_id       NUMBER(6) NOT NULL,
  quantity_sold  NUMBER(3) NOT NULL,
  amount_sold    NUMBER(10,2) NOT NULL
)
CLUSTERING
  BY LINEAR ORDER (cust_id, prod_id);

Interleaved Ordering
  CREATE TABLE sales (
  prod_id        NUMBER(6) NOT NULL,
  cust_id        NUMBER NOT NULL,
  time_id        DATE NOT NULL,
  channel_id     CHAR(1) NOT NULL,
  promo_id       NUMBER(6) NOT NULL,
  quantity_sold  NUMBER(3) NOT NULL,
  amount_sold    NUMBER(10,2) NOT NULL
)
CLUSTERING
  BY INTERLEAVED ORDER (time_id, prod_id);

如果我们要修改可以使用modify CLUSTERING,删除可以使用drop CLUSTERING。

ALTER TABLE sales MODIFY CLUSTERING YES ON DATA MOVEMENT;
ALTER TABLE sales DROP CLUSTERING;

更多详细的细节及操作方法可以参考官方文档:http://devel.hotpilot.cz/ora-12cR1-lin-64-inst/E50529_01/DWHSG/attcluster.htm#CHDBGCEE

ORACLE 19c新特性-Active Data Guard-DML Redirection

版权声明:本文为Buddy Yuan原创文章,转载请注明出处。原文地址:ORACLE 19c新特性-Active Data Guard-DML Redirection
要说19c版本,这个Active Data Guard-DML Redirection是一个不错的发明,例如在某些情况下,我们的报表应用程序需要在备库中更新一些表和少量数据时非常适用。如果大规模的更新,肯定是不合适的,因为都需要转发到主库上去执行,这很大程序上会产生锁和并发等一系列的问题。

整个Active Data Guard-DML Redirection 的工作原理,我们可以参考官方的介绍。总共分成以下几个步骤。
1.备库的客户端发起DML操作。
2.DML操作被重定向到主库。
3.DML在主库被实施。
4.生成更改的redo信息流回到备用数据库。
5.完成DML重定向,客户端显示修改后的数据信息。

我们可以通过设置ADG_REDIRECT_DML为true,开启这个功能,或者是使用alter session命令开启会话级别的DML重定向。接下来我们来实际演示一下这个功能。首先我们要打开Real Time Query,这里已经是打开的。

DGMGRL> SHOW CONFIGURATION;

Configuration - db_broker_config

  Protection Mode: MaxPerformance
  Members:
  db193h1       - Primary database
    db193h1_stdby - Physical standby database 

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 41 seconds ago)

DGMGRL> show database db193h1_stdby

Database - db193h1_stdby

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 0 seconds ago)
  Apply Lag:          0 seconds (computed 0 seconds ago)
  Average Apply Rate: 0 Byte/s
  Real Time Query:    ON
  Instance(s):
    DB193H1

Database Status:
SUCCESS

接下来修改主库和备库的参数ADG_REDIRECT_DML。

alter system set adg_redirect_dml = true scope = both sid = '*';

现在在主库创建一个表。

create table testdml (id number,name varchar2(50),start_date date);

然后在备库上执行。

SQL> insert into testdml values (1,'name',sysdate);
insert into testdml values (1,'name',sysdate)
            *
ERROR at line 1:
ORA-16397: statement redirection from Oracle Active Data Guard standby database
to primary database failed

[oracle@standby ~]$ oerr ora 16397
16397, 00000, "statement redirection from Oracle Active Data Guard standby database to primary database failed"
// *Cause:  The statement redirection failed because of one of the following reasons:
//          1. The primary database connect string was not established.
//          2. The primary database could not be reached.
//          3. The undo-mode or incarnation were not the same.
//          4. The current user and logged-in user were not the same.
//          5. Redirecting CREATE TABLE AS SELECT (CTAS) of the global temporary
//             table was not supported.
//          6. Redirecting PL/SQL execution having bind variable was not supported.
// *Action: Run the statement after fixing the condition that caused the failure.

报错研究了一番,这主要是不支持sys用户使用这个功能。这是一个限制。我重新创建了一个用户就执行成功了。

SQL> create user a1 identified by a1;
User created.

SQL> grant dba to a1;
Grant succeeded.

SQL> connect a1/a1
Connected.

SQL> create table testdml (id number,name varchar2(50),start_date date);
Table created.

再次在备库上执行dml操作。这次成功了。

SQL> connect a1/a1
Connected.

SQL> select database_role from v$database;
DATABASE_ROLE
----------------
PHYSICAL STANDBY

SQL> insert into testdml values (1,'name',sysdate);
1 row created.

SQL> commit;
Commit complete.

SQL> select * from testdml;
	ID NAME 					      START_DAT
---------- -------------------------------------------------- ---------
	 1 name 					      08-JUN-20

接下来,我们在主库上执行一个事务,然后在备库上也执行一个事务。我们来看一下锁机制。

主库上执行不提交事务。

SQL> update testdml set id=2 where name='name';
3 rows updated.

备库上执行。

update testdml set id=2 where name='name';

主库上查看锁的阻塞情况。这里可以看到会话49被阻塞,正在等待enq: TX – row lock contention,注意看machine,这里是standby主机。执行的sql语句文本我们也能看到。

SQL> select USERNAME, FINAL_BLOCKING_SESSION, FINAL_BLOCKING_INSTANCE, EVENT LOCKWAIT, STATUS, machine, service_name , sql_id, con_id from gv$session where username is not null;

USERNAME	FINAL_BLOCKING_SESSION FINAL_BLOCKING_INSTANCE LOCKWAIT 		      STATUS   MACHINE		    SERVICE_NAME		   SQL_ID	     CON_ID
--------------- ---------------------- ----------------------- ------------------------------ -------- -------------------- ------------------------------ ------------- ----------
SYS							       SQL*Net message from client    INACTIVE primary		    DB193H1						  0
SYS							       OFS idle 		      ACTIVE   primary		    SYS$BACKGROUND					  0
A1							       SQL*Net message to client      ACTIVE   primary		    SYS$USERS			   cssctts2u81n4	  0
PUBLIC							       SQL*Net message from client    INACTIVE standby		    DB193H1						  0
A1				    49			     1 enq: TX - row lock contention  ACTIVE   standby		    SYS$USERS			   du4ukzh4mjns1	  0

SQL> select sql_text from v$sql where sql_id = 'du4ukzh4mjns1';
SQL_TEXT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
UPDATE "TESTDML" "A1" SET "ID" = 2 WHERE "A1"."NAME"='name'

大约过了1分钟,在备库上就会出现下面的错误。分布式事务等待锁超时了。

SQL> update testdml set id=2 where name='name';
update testdml set id=2 where name='name'
*
ERROR at line 1:
ORA-02049: timeout: distributed transaction waiting for lock
ORA-02063: preceding line from ADGREDIRECT

Elapsed: 00:01:00.05

如果我们现在把这个反过来操作一下。看看会不会出现这个错误ORA-02049:。也就是在备库上执行事务不提交,在主库上更新。

备库上执行
update testdml set id=2 where name='name';

主库上执行
update testdml set id=2 where name='name';

这次是永远也没有发生分布式事务等待锁超时的情况。

当然官方文档还告诉我们可以在standby上执行Top-level PL/SQL操作,但是不能包含绑定变量。这个功能需要我们在会话级别执行ALTER SESSION ENABLE ADG_REDIRECT_PLSQL
Connected.
参考文档:Oracle (Active) Data Guard 19c Real-Time Data Protection and Availability

Oracle Database 19c使用Data Guard Broker进行Data Guard物理备库设置

版权声明:本文为Buddy Yuan原创文章,转载请注明出处。原文地址:Oracle Database 19c使用Data Guard Broker进行Data Guard物理备库设置
今天准备研究学习一下19c的data guard。由于想快速搭建,所以选择了vagrant去拖镜像,但是发现拖下来的东西还是没有实现dg(这里的情况不是很了解,感觉是执行脚本有个地方报错了),不过它里面的脚本我觉得没什么大问题。我自己凑合手动执行脚本搭了一个。

首先看一下先决条件。
primary主机,操作系统是RedHat 7.8,安装了oracle 19c数据库软件,创建了一个叫DB193H1的数据库。
standby主机,操作系统是RedHat 7.8,安装了oracle 19c数据库软件。

1.我们先来看一下vargant中主库需要执行的脚本primary_DB_setup.sh。创建完主库之后,需要修改一些相关参数。同时打开归档,添加STANDBY LOGFILE,并且打开FLASHBACK。由于两边数据文件保持路径一致。因此设置STANDBY_FILE_MANAGEMENT为AUTO即可。这里还把dg_broker_start打开了。

${DB_HOME}/bin/sqlplus / as sysdba 
ALTER SYSTEM SET db_create_file_dest='/u02/oradata';
ALTER SYSTEM SET db_create_online_log_dest_1='/u02/oradata';
ALTER SYSTEM SET local_listener='LISTENER';
ALTER SYSTEM SET db_recovery_file_dest_size=20G;
ALTER SYSTEM SET db_recovery_file_dest='/u01/app/oracle';
exit;


${DB_HOME}/bin/sqlplus / as sysdba 
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;

ALTER DATABASE FORCE LOGGING;
ALTER SYSTEM SWITCH LOGFILE;

ALTER DATABASE ADD STANDBY LOGFILE SIZE 50M;
ALTER DATABASE ADD STANDBY LOGFILE SIZE 50M;
ALTER DATABASE ADD STANDBY LOGFILE SIZE 50M;
ALTER DATABASE ADD STANDBY LOGFILE SIZE 50M;

ALTER DATABASE FLASHBACK ON;

ALTER SYSTEM SET STANDBY_FILE_MANAGEMENT=AUTO;
exit;



${DB_HOME}/bin/sqlplus / as sysdba 
ALTER SYSTEM SET dg_broker_start=TRUE;
exit;

2.主库设置完毕之后,就可以配置tns链接串了,这个配置需要在主库和备库均设置。设置如下。

DB193H1 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = primary)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = DB193H1)
    )
  )

DB193H1_STDBY =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = standby)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SID = DB193H1)
    )
  )

当然还需要我们配置一下静态注册。在监听的listener.ora文件中增加下列配置,这个配置需要在主库和备库均设置

SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = DB193H1_DGMGRL)
      (DB_HOME = /u01/app/oracle/product/19.3.0.0/dbhome_1)
      (SID_NAME = DB193H1)
    )
  )

3.接下来是备库的设置。我们可以按照standby_DB_setup.sh的脚本来设置。先创建相关文件夹(这里需要注意的是,如果你使用了CDB和PDB,就需要创建上面那pdbseed和pdb相关文件夹,没有则不需要),然后生成密码文件。这里的DB_NAME取名为DB193H1_STDBY

if [ "${CDB}" == "true" ]
then
  mkdir -p /u02/oradata/${DB_NAME}/pdbseed
  mkdir -p /u02/oradata/${DB_NAME}/pdb1
fi

mkdir -p ${DB_BASE}/fast_recovery_area/${DB_NAME}
mkdir -p ${DB_BASE}/admin/${DB_NAME}/adump

orapwd file=$ORACLE_HOME/dbs/orapw${DB_NAME} password=${SYS_PASSWORD} entries=10 format=12

4.上述配置完成后,创建参数文件,启动备库到nomount状态。

cat > /tmp/init_standby.ora 
*.db_name='${DB_NAME}'
*.local_listener='LISTENER'


echo "-----------------------------------------------------------------"
echo -e "${INFO}`date +%F' '%T`: Making auxillary instance"
echo "-----------------------------------------------------------------"
export ORACLE_SID=${DB_NAME}
${DB_HOME}/bin/sqlplus / as sysdba 
--SHUTDOWN IMMEDIATE;
STARTUP NOMOUNT PFILE='/tmp/init_standby.ora';
exit;

5.在主库上执行rman连接到备库实例,执行DUPLICATE操作,这里很简单我们的参数只需要修改一个db_unique_name即可,如果文件路径不一致,就需要在这里设置db_file_name_convert和log_file_name_convert

${DB_HOME}/bin/rman TARGET sys/${SYS_PASSWORD}@${DB_NAME} AUXILIARY sys/${SYS_PASSWORD}@${DB_NAME}_STDBY 
DUPLICATE TARGET DATABASE
  FOR STANDBY
  FROM ACTIVE DATABASE
  DORECOVER
  SPFILE
    SET db_unique_name='${DB_NAME}_STDBY' COMMENT 'Standby'
  NOFILENAMECHECK;
exit;

6.上述操作完成之后就可以进入到Data Guard Broker的配置了。先进入到备库设置参数。

ALTER SYSTEM SET dg_broker_start = true;

7.接下来在主库配置Data Guard Broker

${DB_HOME}/bin/dgmgrl sys/${SYS_PASSWORD}@${DB_NAME} 
CREATE CONFIGURATION db_broker_config AS PRIMARY DATABASE IS ${DB_NAME} CONNECT IDENTIFIER IS ${DB_NAME};
ADD DATABASE ${DB_NAME}_STDBY AS CONNECT IDENTIFIER IS ${DB_NAME}_STDBY MAINTAINED AS PHYSICAL;
ENABLE CONFIGURATION;

我按照这些步骤操作完成之后,执行发现报ORA-16810错误。

[oracle@primary ~]$ dgmgrl sys/welcome1@DB193H1
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Mon Jun 8 12:49:56 2020
Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected to "DB193H1"
Connected as SYSDBA.
DGMGRL> 
DGMGRL>   CREATE CONFIGURATION db_broker_config AS PRIMARY DATABASE IS  DB193H1 CONNECT IDENTIFIER IS  DB193H1;
Configuration "db_broker_config" created with primary database "db193h1"
DGMGRL> 
DGMGRL> ADD DATABASE DB193H1_STDBY AS CONNECT IDENTIFIER IS DB193H1_STDBY MAINTAINED AS PHYSICAL;
Database "db193h1_stdby" added
DGMGRL> ENABLE CONFIGURATION;

Enabled.
DGMGRL> DGMGRL> 
DGMGRL> 
DGMGRL> 
DGMGRL> SHOW CONFIGURATION;

Configuration - db_broker_config

  Protection Mode: MaxPerformance
  Members:
  db193h1       - Primary database
    db193h1_stdby - Physical standby database 
      Error: ORA-16810: multiple errors or warnings detected for the member

经过详细查看备库的状态,发现是ORA-16766: Redo Apply is stopped,原来是一直没有启动备库的redo apply。

DGMGRL> show database verbose db193h1_stdby

Database - db193h1_stdby

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 0 seconds ago)
  Apply Lag:          (unknown)
  Average Apply Rate: (unknown)
  Active Apply Rate:  (unknown)
  Maximum Apply Rate: (unknown)
  Real Time Query:    OFF
  Instance(s):
    DB193H1

  Database Error(s):
    ORA-16766: Redo Apply is stopped

  Database Warning(s):
    ORA-16854: apply lag could not be determined

  Properties:
    DGConnectIdentifier             = 'db19h1_stdby'
    ObserverConnectIdentifier       = ''
    FastStartFailoverTarget         = ''
    PreferredObserverHosts          = ''
    LogShipping                     = 'ON'
    RedoRoutes                      = ''
    LogXptMode                      = 'ASYNC'
    DelayMins                       = '0'
    Binding                         = 'optional'
    MaxFailure                      = '0'
    ReopenSecs                      = '300'
    NetTimeout                      = '30'
    RedoCompression                 = 'DISABLE'
    PreferredApplyInstance          = ''
    ApplyInstanceTimeout            = '0'
    ApplyLagThreshold               = '30'
    TransportLagThreshold           = '30'
    TransportDisconnectedThreshold  = '30'
    ApplyParallel                   = 'AUTO'
    ApplyInstances                  = '0'
    StandbyFileManagement           = ''
    ArchiveLagTarget                = '0'
    LogArchiveMaxProcesses          = '0'
    LogArchiveMinSucceedDest        = '0'
    DataGuardSyncLatency            = '0'
    LogArchiveTrace                 = '0'
    LogArchiveFormat                = ''
    DbFileNameConvert               = ''
    LogFileNameConvert              = ''
    ArchiveLocation                 = ''
    AlternateLocation               = ''
    StandbyArchiveLocation          = ''
    StandbyAlternateLocation        = ''
    InconsistentProperties          = '(monitor)'
    InconsistentLogXptProps         = '(monitor)'
    LogXptStatus                    = '(monitor)'
    SendQEntries                    = '(monitor)'
    RecvQEntries                    = '(monitor)'
    HostName                        = 'standby'
    StaticConnectIdentifier         = '(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=primary)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=DB193H1_STDBY_DGMGRL)(INSTANCE_NAME=DB193H1)(SERVER=DEDICATED)))'
    TopWaitEvents                   = '(monitor)'
    SidName                         = '(monitor)'

  Log file locations:
    Alert log               : /u01/app/oracle/diag/rdbms/db193h1_stdby/DB193H1/trace/alert_DB193H1.log
    Data Guard Broker log   : /u01/app/oracle/diag/rdbms/db193h1_stdby/DB193H1/trace/drcDB193H1.log

Database Status:
ERROR

切换到备库,执行应用,再次查看dg broker配置就正常了

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;

DGMGRL> SHOW CONFIGURATION;

Configuration - db_broker_config

  Protection Mode: MaxPerformance
  Members:
  db193h1       - Primary database
    db193h1_stdby - Physical standby database 

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 46 seconds ago)

DGMGRL> SHOW DATABASE db193h1

Database - db193h1

  Role:               PRIMARY
  Intended State:     TRANSPORT-ON
  Instance(s):
    DB193H1

Database Status:
SUCCESS

DGMGRL> SHOW DATABASE db193h1_stdby

Database - db193h1_stdby

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 1 second ago)
  Apply Lag:          0 seconds (computed 1 second ago)
  Average Apply Rate: 3.00 KByte/s
  Real Time Query:    ON
  Instance(s):
    DB193H1

Database Status:
SUCCESS

安装顺利成功,接下来我们就可以来研究19c dataguard的新特性了。感兴趣的小伙伴可以自行研究下vargant中的脚本。

一次enq:TX-index contention故障分析

版权声明:本文为Buddy Yuan原创文章,转载请注明出处。原文地址:一次enq:TX-index contention故障分析
这个故障是头一天晚上9点多开始,客户发现索引enq:TX-index contention分裂很严重,系统程序插入的时候有大量的enq:TX-index contention等待,同是伴随着还有gc方面的等待。在12点多的时候,重建了索引最后恢复了。现在需要我们分析根本原因。

首先通过当天晚上10点多的AWR报告可以发现,enq:TX-index contention分裂很严重,两个节点之间的网络延迟非常严重。


 

 

通过对比正常时间段和异常时间段的leaf node splits,发现每秒种索引分裂的个数并没有异常增长,也就侧面说明业务量没有增长。

 

 

我们做个相关插入sql语句的sqlhc报告,发现引起索引争用的语句。执行的频次没有发现很大的增长。

 

 
通过分析,我们可以认定这个插入SQL是被影响的。那么究竟是被什么影响的呢?

 
通过分析7天的awr情况发现在5月20日从早上11点开始,数据库CPU就一直增长,到下午18点已经达到了80%,最终一直涨到100%。继续查询AWR报告,发现占用CPU高的语句就一条。占比达到95%。

 
对该语句做SQLHC报告,发现该语句的执行计划并没变,执行次数也没有明显的差异,但是逻辑读在问题期间大幅增长了。从而导致执行语句需要更多的cpu。

 

 

 
分析该语句的执行计划,可以发现,走了全表扫描。而它使用的谓词条件上正好有索引,优化器可以使用索引INX_OWN_O_I_REQ_ORDER_NO,但是没有使用,通过查看报告,发现索引的order_no上缺乏列的统计信息。

 
此时处理的办法有两种,一种是对表进行统计信息收集,在收集的时候可以选择,method_opt=> ‘for all indexed columns’。第二种方法就是对SQL语句进行绑定。客户选择了第二种方式,我们将SQL绑定之后,系统消耗的cpu逐步降低,上面那条插入语句瞬间就好了很多,客户反映插入速度变快了。最终这个问题被定义为CPU资源繁忙导致网络软中断,引起了gc,最终导致enq:TX-index contention争用严重的问题。

一次”qmxdpls_subhea”内存耗尽的问题分析

版权声明:本文为Buddy Yuan原创文章,转载请注明出处。原文地址:一次”QMXDPLS_SUBHEA”内存耗尽的问题分析
一套RAC数据库其中的一个节点在17点05分左发生了重启。通过分析发现是内存耗尽导致了交换分区的使用,同时system cpu 100%导致的crash。首先观察2节点16点-16点30的awr报告,可以发现主机内存256GB,SGA使用了80GB,PGA使用了90GB左右。

从系统残余内存来看,仍然有256-80-90=76G可以用,而操作系统正常大概能占用50GB,应该还剩余26GB可用。但是问题就在于这个机器的大页设置不正确(当前设置为60GB)。导致一部分大页的内存是空闲没法使用的。

HugePages_Total: 31748
HugePages_Free: 5831
HugePages_Rsvd: 5828
HugePages_Surp: 0
Hugepagesize: 2048 kB

在从oswatch来看,8号11点还剩30GB内存,而到了9号11点则剩下16GB内存,到了下午的15点左右下降到了1.68GB。可见内存是持续缓慢的下降耗尽的。不是一个并发或者大的事务上来导致的。

zzz ***Sun Dec 8 11:00:03 CST 2019
MemTotal: 264410232 kB
MemFree: 31472452 kB

zzz ***Mon Dec 9 10:00:15 CST 2019
MemTotal: 264410232 kB
MemFree: 17231808 kB

zzz ***Mon Dec 9 16:00:19 CST 2019
MemTotal: 264410232 kB
MemFree: 1765052 kB

那么内存缓慢耗尽是什么原因呢?通过观察节点1没重启的进程情况,发现以下问题:

DEDICATED 7062 RM bea 405.396004 405383755 428595614 3145728 428595614
DEDICATED 7856 RM bea 431.021004 432892859 457824670 5505024 457824670
DEDICATED 13649 RM bea 491.771004 496860795 516807070 786432 516807070
DEDICATED 14834 RM bea 503.458504 509014107 531356062 3080192 531356062
DEDICATED 12960 RM bea 539.208504 548221595 570612126 4849664 570612126
DEDICATED 3089 RM bea 541.458504 548346875 572774814 4653056 572774814
DEDICATED 3619 RM bea 565.958504 575983915 599382430 5570560 599382430
DEDICATED 231 RM bea 666.458504 680193475 704633246 5439488 704633246
DEDICATED 21156 RM bea 703.333504 718921091 738646430 786432 738646430
DEDICATED 19747 RM bea 820.896004 839278371 861723038 589824 861723038

SQL> select sid,event,program,machine,sql_id,status from v$session where sid=19747;

SID EVENT PROGRAM MACHINE SQL_ID STATUS
---------- ------------------------------ ------------------------------------------------ ---------------------------------------- ------------- --------
19747 SQL*Net message from client JDBC Thin Client whrm_app6 INACTIVE

SQL> select sid,event,program,machine,sql_id,PREV_SQL_ID,status from v$session where sid=19747;

SID EVENT PROGRAM MACHINE SQL_ID PREV_SQL_ID STATUS
---------- ------------------------------ ------------------------------------------------ ---------------------------------------- ------------- ------------- --------
19747 SQL*Net message from client JDBC Thin Client whrm_app6 bunvx480ynf57 INACTIVE

SQL> select sid,event,program,machine,sql_id,PREV_SQL_ID,status from v$session where sid=21156;
SID EVENT PROGRAM MACHINE SQL_ID PREV_SQL_ID STATUS
---------- ------------------------------ ------------------------------------------------ ---------------------------------------- ------------- ------------- --------
21156 SQL*Net message from client JDBC Thin Client whrm_app8 bunvx480ynf57 INACTIVE

SQL> select sid,event,program,machine,sql_id,PREV_SQL_ID,status from v$session where sid=231;

SID EVENT PROGRAM MACHINE SQL_ID PREV_SQL_ID STATUS
---------- ------------------------------ ------------------------------------------------ ---------------------------------------- ------------- ------------- --------
231 SQL*Net message from client JDBC Thin Client whrm_app6 bunvx480ynf57 INACTIVE

可以看到相关进程占用的内存很高,已经是inactive了,但是内存并没发生回收。通过oradebug dump heapdump 命令dump其中的800MB的进程,发现进程分配了12000个extent,而这些extend中的chunk都是”qmxdpls_subhea“。

而关于这个内存区域的问题,通过mos搜索,可以发现出现大量的memory leak(内存泄露)的情况。

我们把这个情况反映给了应用人员,应用人员根据程序定位到是没关闭xmltype相关对象导致的内存泄露。对于XMLTYPE这类的操作,需要打开后执行free()进行关闭,释放内存。

参考文档:
Best Practises for XMLType Temporary LOB Usage (Doc ID 1955135.1)
How to Release Temporary LOB Segments without Closing the JDBC Connection (Doc ID 1384829.1)

四个节点实例异常重启故障分析

今天发生了一个故障,居然是四个RAC节点全部同一时间发生重启了,这种情况还是很少见的。以下是问题分析过程

节点1日志

2020-05-12T04:50:13.955295+08:00
Errors in file /u01/app/oracle/diag/rdbms/dwdb/dwdb1/trace/dwdb1_ora_118868.trc  (incident=4418543):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
skgxpvfynet: mtype: 61 process 123980 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
opidrv aborting process M003 ospid (123980) as a result of ORA-603
2020-05-12T05:07:29.342931+08:00
Reconfiguration started (old inc 8, new inc 10)
List of instances (total 1) :
1
Dead instances (total 3) :
2 3 4
My inst 1
2020-05-12T05:07:31.305816+08:00
NOTE: ASMB process state dumped to trace file /u01/app/oracle/diag/rdbms/dwdb/dwdb1/trace/dwdb1_gen0_178160.trc
2020-05-12T05:07:32.229320+08:00
skgxpvfynet: mtype: 61 process 124166 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
opiodr aborting process unknown ospid (124166) as a result of ORA-603
2020-05-12T05:09:27.659099+08:00
skgxpvfynet: mtype: 61 process 124818 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
2020-05-12T05:10:08.457843+08:00
Process m003 died, see its trace file
2020-05-12T05:10:34.164320+08:00
DDE: Problem Key 'ORA 603' was completely flood controlled (0x6)
Further messages for this problem key will be suppressed for up to 10 minutes
2020-05-12T05:10:39.931465+08:00
skgxpvfynet: mtype: 61 process 125146 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
2020-05-12T05:10:39.931458+08:00
opiodr aborting process unknown ospid (124818) as a result of ORA-603
2020-05-12T05:10:39.931454+08:00
GCR0[178417]: LMHB process succesfully killed
2020-05-12T05:11:40.473632+08:00
opiodr aborting process unknown ospid (125146) as a result of ORA-603
2020-05-12T05:18:01.842419+08:00
ksxp_exafusion_enabled_dcf: ipclw_enabled=0
Starting ORACLE instance (normal) (OS id: 108120)

节点2日志

2020-05-12T04:47:56.141213+08:00
Incremental checkpoint up to RBA [0x254f.3988e5.0], current log tail at RBA [0x254f.39cc3a.0]
2020-05-12T05:05:18.334836+08:00
NOTE: ASMB0 registering with ASM instance as Flex client 0x10006 (reg:3205650803) (reconnect)
2020-05-12T05:06:11.234166+08:00
Dumping diagnostic data in directory=[cdmp_20200512050610], requested by (instance=4, osid=605758 (GEN0)), summary=[abnormal instance termination].
2020-05-12T05:06:12.197007+08:00
ERROR: terminating instance because ASMB is stuck for 241 seconds
GEN0 (ospid: 110374): terminating the instance due to error 15082
2020-05-12T05:06:18.242304+08:00
Instance terminated by GEN0, pid = 110374
2020-05-12T05:15:33.418428+08:00
ksxp_exafusion_enabled_dcf: ipclw_enabled=0

节点3日志

2020-05-12T05:05:49.290986+08:00
LGWR (ospid: 364374) waits for event 'enq: CF - contention' for 74 secs.
2020-05-12T05:05:49.291033+08:00
LGWR (ospid: 364374) is hung in an acceptable location (cfio 0x11.00).
2020-05-12T05:06:05.433427+08:00
NOTE: ASMB0 registering with ASM instance as Flex client 0x10007 (reg:4119066454) (reconnect)
2020-05-12T05:06:11.239521+08:00
Dumping diagnostic data in directory=[cdmp_20200512050610], requested by (instance=4, osid=605758 (GEN0)), summary=[abnormal instance termination].
2020-05-12T05:06:16.340150+08:00
Dumping diagnostic data in directory=[cdmp_20200512050613], requested by (instance=2, osid=110374 (GEN0)), summary=[abnormal instance termination].
2020-05-12T05:06:34.967473+08:00
ERROR: terminating instance because ASMB is stuck for 242 seconds
GEN0 (ospid: 364027): terminating the instance due to error 15082
2020-05-12T05:06:38.182059+08:00
System state dump requested by (instance=3, osid=364027 (GEN0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/dwdb/dwdb3/trace/dwdb3_diag_364051_20200512050638.trc
2020-05-12T05:06:40.996565+08:00
Instance terminated by GEN0, pid = 364027
2020-05-12T05:15:36.490770+08:00
ksxp_exafusion_enabled_dcf: ipclw_enabled=0 
Starting ORACLE instance (normal) (OS id: 10317)
2020-05-12T05:15:36.499203+08:00
CLI notifier numLatches:131 maxDescs:19888
2020-05-12T05:15:36.501049+08:00
**********************************************************************
2020-05-12T05:15:36.501093+08:00
Dump of system resources acquired for SHARED GLOBAL AREA (SGA)

节点4日志

2020-05-12T05:04:12.064862+08:00
CKPT (ospid: 605951) waits for event 'enq: CF - contention' for 63 secs.
2020-05-12T05:04:12.064910+08:00
CKPT (ospid: 605951) is hung in an acceptable location (cfio 0x11.00).
2020-05-12T05:05:18.151878+08:00
NOTE: ASMB0 registering with ASM instance as Flex client 0x10008 (reg:2344414256) (reconnect)
2020-05-12T05:05:25.774637+08:00
CKPT (ospid: 605951) waits for event 'enq: CF - contention' for 137 secs.
2020-05-12T05:05:25.774701+08:00
CKPT (ospid: 605951) is hung in an acceptable location (inwait 0x201.00).
2020-05-12T05:06:10.637714+08:00
ERROR: terminating instance because ASMB is stuck for 242 seconds
GEN0 (ospid: 605758): terminating the instance due to error 15082
2020-05-12T05:06:10.646458+08:00
DWPUB(3):opiodr aborting process unknown ospid (220276) as a result of ORA-1092
2020-05-12T05:06:10.646808+08:00
DWPUB(3):opiodr aborting process unknown ospid (330270) as a result of ORA-1092
2020-05-12T05:06:10.647256+08:00
DWPUB(3):opiodr aborting process unknown ospid (371195) as a result of ORA-1092
2020-05-12T05:06:10.651213+08:00
DWPUB(3):opiodr aborting process unknown ospid (220280) as a result of ORA-1092
2020-05-12T05:06:11.233314+08:00
System state dump requested by (instance=4, osid=605758 (GEN0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/dwdb/dwdb4/trace/dwdb4_diag_605768_20200512050611.trc
2020-05-12T05:06:16.690258+08:00
Instance terminated by GEN0, pid = 605758

从上述日志可以看出,节点4是先宕机的,然后节点2接着宕机,节点3在宕机,最后是节点1最后宕机。节点1最后认为节点234都已经Dead。

List of instances (total 1) :
 1
Dead instances (total 3) :
 2 3 4
My inst 1

继续分析节点1,在宕机之前,节点1后台一直在报错。

ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2

配合节点1 oswatch的内存图,可以发现问题期间内存free值非常低。

进一步分析空闲内存低的时候top情况,发现系统开启大量的并行进程。结合当时的addm报告,可以发现应用当时create table的时候开了很多并行。

内存不足导致ORA-错误,通过mos文档。
“ORA-27301: OS failure message: No buffer space available” occurs on OPC RAC (Doc ID 2397062.1)

需要设置参数vm.min_free_kbytes。
当请求分配内存的时候,如果有足够的内存,则可以成功分配,当没有足够内存的时候,操作就会阻塞。他需要等待系统先去释放内存,再分配内存。而对于系统级别的一些原子性的请求,它是不能被阻塞的,如果分配不到内存的话,就会出现失败。内核为了避免原子请求失败,必须设置一块保留的内存。而这个就是通过这个参数来设置的。
节点4的oswwatch内存图

这里可以看到节点4的内存也是free持续变低。


从top进程中可以看到,节点4的情况和节点1不一样,osysmond进程占用内存逐步增加,内存free越来越少。通过mos查询,可以发现osysmond进程在12c版本存在内存泄露的情况,Bug 18701017 – Memory Leak with osysmond.bin (Doc ID 18701017.8)

优化建议
1.建议所有节点设置参数vm.min_free_kbytes,必须设置一块保留的内存,该参数可以让内核避免原子请求失败。具体值需要主机工程师评估。
2.节点1建表语句开启并行太高,建议将单个语句并行个数控制在160以内。

3.所有集群节点关闭System Monitor Service(osysmond)。osysmond会将每个节点的资源使用情况发送给cluster logger service,后者将会把所有节点的信息都接收并保存到CHM的资料库。由于我们已经安装oswatch,所以这个可以关闭。

On each node, as root user:
# /bin/crsctl stop res ora.crf -init
# /bin/crsctl modify res ora.crf -attr ENABLED=0 -init