Oracle 12c新特性 Attribute Clustering及其对Index查询带来的性能改善

版权声明:本文为Buddy Yuan原创文章,转载请注明出处。原文地址:Oracle 12c新特性 Attribute Clustering及其对Index查询带来的性能改善
Oracle 12c中有一个新功能,叫Attribute Clustering,这个特性可以帮助我们改善一些索引查询性能。之前我们学习过索引的聚簇因子,如果索引的聚簇因子很高的话,当我们使用索引扫描1000行的数据,可能需要访问不同的数据块中的每一行数据。这本来也没什么,但是如果这个语句频繁的执行,则极大的影响数据库的效率,会消耗更多的逻辑读和CPU。如果我们把表的聚簇。我们来做个小试验验证一下新的功能Attribute Clustering。

1、首先我们创建一个表,插入100万数据。并在id列上创建索引。

SQL> create table testcluster(id number,name varchar2(30));
Table created.

SQL> insert into testcluster select dbms_random.value(1,1000000),dbms_random.string('Y',trunc(dbms_random.value(1,10)))  from dual connect by level <=1000000; 1000000 rows created. 

SQL> commit;
Commit complete.

SQL> create index idx_t1 on testcluster(id);
Index created.

2.收集统计信息,查看聚簇因子。可以看到当前的聚簇因子是999809。

SQL>  exec dbms_stats.gather_table_stats(ownname=>'A1',tabname=>'TESTCLUSTER',method_opt=>'for all columns size auto',cascade=>true,estimate_percent=>100);

SQL> select index_name,clustering_factor from dba_indexes where index_name='IDX_T1';

INDEX_NAME		       CLUSTERING_FACTOR
------------------------------ -----------------
IDX_T1					  999809

3、执行索引扫描,检索1000行数据。此时我们可以看到逻辑读是1019,如果频繁执行这个SQL,将会消耗系统的CPU。

SQL> set autotrace traceonly  
SQL> select * from testcluster t where t.id between 11000 and 12000;

947 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 2344056994

---------------------------------------------------------------------------------------------------
| Id  | Operation			    | Name	  | Rows  | Bytes | Cost (%CPU)| Time	  |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		    |		  |  1002 | 28056 |  1009   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TESTCLUSTER |  1002 | 28056 |  1009   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN		    | IDX_T1	  |  1002 |	  |	7   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T"."ID">=11000 AND "T"."ID"<=12000)


Statistics
----------------------------------------------------------
	  9  recursive calls
	  6  db block gets
       1019  consistent gets
	  0  physical reads
       1000  redo size
      44135  bytes sent via SQL*Net to client
       1117  bytes received via SQL*Net from client
	 65  SQL*Net roundtrips to/from client
	  0  sorts (memory)
	  0  sorts (disk)
	947  rows processed

4.在12c之前,我们解决这个问题的做法是重新CTAS这张表,并在select的时候按照id进行排序,在创建索引。那么在12c当中我们就是可以简单的使用语法add clustering by linear,在move table online就能实现了。

SQL> alter table TESTCLUSTER add clustering by linear order(id);
Table altered.

SQL> alter table TESTCLUSTER move online;
Table altered.

SQL> select * from testcluster t where t.id between 11000 and 12000;

947 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 2344056994

---------------------------------------------------------------------------------------------------
| Id  | Operation			    | Name	  | Rows  | Bytes | Cost (%CPU)| Time	  |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		    |		  |  1002 | 28056 |    12   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TESTCLUSTER |  1002 | 28056 |    12   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN		    | IDX_T1	  |  1002 |	  |	7   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T"."ID">=11000 AND "T"."ID"<=12000)


Statistics
----------------------------------------------------------
	  0  recursive calls
	  0  db block gets
	141  consistent gets
	  0  physical reads
	  0  redo size
      44135  bytes sent via SQL*Net to client
       1117  bytes received via SQL*Net from client
	 65  SQL*Net roundtrips to/from client
	  0  sorts (memory)
	  0  sorts (disk)
	947  rows processed

我们可以发现逻辑读从1019下降到141,很大程度的改善了性能。大约提升了7.5倍的性能。此时我们在看下聚簇因子已经变成了4571。

SQL> select index_name,STATUS,CLUSTERING_FACTOR from dba_indexes where index_name='IDX_T1';

INDEX_NAME		       STATUS	CLUSTERING_FACTOR
------------------------------ -------- -----------------
IDX_T1			       VALID		     4571

如果我们用12c之前的方法来实现。我们重建创建表和索引,其实效果是差不多的。

SQL> create table TESTCLUSTER_1 as select * from TESTCLUSTER order by id;
Table created.

SQL> create index idx_t1_1 on testcluster_1(id);
Index created.

SQL> exec dbms_stats.gather_table_stats(ownname=>'A1',tabname=>'TESTCLUSTER',method_opt=>'for all columns size auto',cascade=>true,estimate_percent=>100);
PL/SQL procedure successfully completed.


SQL> select index_name,clustering_factor from dba_indexes where index_name='IDX_T1_1';
INDEX_NAME		       CLUSTERING_FACTOR
------------------------------ -----------------
IDX_T1_1				    4570

虽然这么做可以显著的帮助我们提升某些执行频次高的语句的性能,但是我们仍然要考虑其他列在表上查询的影响。当然我演示的只是linear ordering,,还有一种方式叫interleaved ordering。两者的区别可以参考官方文档的图。

当然我们还可以在建表的时候使用,如下所示:

 linear ordering
CREATE TABLE sales (
  prod_id        NUMBER(6) NOT NULL,
  cust_id        NUMBER NOT NULL,
  time_id        DATE NOT NULL,
  channel_id     CHAR(1) NOT NULL,
  promo_id       NUMBER(6) NOT NULL,
  quantity_sold  NUMBER(3) NOT NULL,
  amount_sold    NUMBER(10,2) NOT NULL
)
CLUSTERING
  BY LINEAR ORDER (cust_id, prod_id);

Interleaved Ordering
  CREATE TABLE sales (
  prod_id        NUMBER(6) NOT NULL,
  cust_id        NUMBER NOT NULL,
  time_id        DATE NOT NULL,
  channel_id     CHAR(1) NOT NULL,
  promo_id       NUMBER(6) NOT NULL,
  quantity_sold  NUMBER(3) NOT NULL,
  amount_sold    NUMBER(10,2) NOT NULL
)
CLUSTERING
  BY INTERLEAVED ORDER (time_id, prod_id);

如果我们要修改可以使用modify CLUSTERING,删除可以使用drop CLUSTERING。

ALTER TABLE sales MODIFY CLUSTERING YES ON DATA MOVEMENT;
ALTER TABLE sales DROP CLUSTERING;

更多详细的细节及操作方法可以参考官方文档:http://devel.hotpilot.cz/ora-12cR1-lin-64-inst/E50529_01/DWHSG/attcluster.htm#CHDBGCEE

ORACLE 19c新特性-Active Data Guard-DML Redirection

版权声明:本文为Buddy Yuan原创文章,转载请注明出处。原文地址:ORACLE 19c新特性-Active Data Guard-DML Redirection
要说19c版本,这个Active Data Guard-DML Redirection是一个不错的发明,例如在某些情况下,我们的报表应用程序需要在备库中更新一些表和少量数据时非常适用。如果大规模的更新,肯定是不合适的,因为都需要转发到主库上去执行,这很大程序上会产生锁和并发等一系列的问题。

整个Active Data Guard-DML Redirection 的工作原理,我们可以参考官方的介绍。总共分成以下几个步骤。
1.备库的客户端发起DML操作。
2.DML操作被重定向到主库。
3.DML在主库被实施。
4.生成更改的redo信息流回到备用数据库。
5.完成DML重定向,客户端显示修改后的数据信息。

我们可以通过设置ADG_REDIRECT_DML为true,开启这个功能,或者是使用alter session命令开启会话级别的DML重定向。接下来我们来实际演示一下这个功能。首先我们要打开Real Time Query,这里已经是打开的。

DGMGRL> SHOW CONFIGURATION;

Configuration - db_broker_config

  Protection Mode: MaxPerformance
  Members:
  db193h1       - Primary database
    db193h1_stdby - Physical standby database 

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 41 seconds ago)

DGMGRL> show database db193h1_stdby

Database - db193h1_stdby

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 0 seconds ago)
  Apply Lag:          0 seconds (computed 0 seconds ago)
  Average Apply Rate: 0 Byte/s
  Real Time Query:    ON
  Instance(s):
    DB193H1

Database Status:
SUCCESS

接下来修改主库和备库的参数ADG_REDIRECT_DML。

alter system set adg_redirect_dml = true scope = both sid = '*';

现在在主库创建一个表。

create table testdml (id number,name varchar2(50),start_date date);

然后在备库上执行。

SQL> insert into testdml values (1,'name',sysdate);
insert into testdml values (1,'name',sysdate)
            *
ERROR at line 1:
ORA-16397: statement redirection from Oracle Active Data Guard standby database
to primary database failed

[oracle@standby ~]$ oerr ora 16397
16397, 00000, "statement redirection from Oracle Active Data Guard standby database to primary database failed"
// *Cause:  The statement redirection failed because of one of the following reasons:
//          1. The primary database connect string was not established.
//          2. The primary database could not be reached.
//          3. The undo-mode or incarnation were not the same.
//          4. The current user and logged-in user were not the same.
//          5. Redirecting CREATE TABLE AS SELECT (CTAS) of the global temporary
//             table was not supported.
//          6. Redirecting PL/SQL execution having bind variable was not supported.
// *Action: Run the statement after fixing the condition that caused the failure.

报错研究了一番,这主要是不支持sys用户使用这个功能。这是一个限制。我重新创建了一个用户就执行成功了。

SQL> create user a1 identified by a1;
User created.

SQL> grant dba to a1;
Grant succeeded.

SQL> connect a1/a1
Connected.

SQL> create table testdml (id number,name varchar2(50),start_date date);
Table created.

再次在备库上执行dml操作。这次成功了。

SQL> connect a1/a1
Connected.

SQL> select database_role from v$database;
DATABASE_ROLE
----------------
PHYSICAL STANDBY

SQL> insert into testdml values (1,'name',sysdate);
1 row created.

SQL> commit;
Commit complete.

SQL> select * from testdml;
	ID NAME 					      START_DAT
---------- -------------------------------------------------- ---------
	 1 name 					      08-JUN-20

接下来,我们在主库上执行一个事务,然后在备库上也执行一个事务。我们来看一下锁机制。

主库上执行不提交事务。

SQL> update testdml set id=2 where name='name';
3 rows updated.

备库上执行。

update testdml set id=2 where name='name';

主库上查看锁的阻塞情况。这里可以看到会话49被阻塞,正在等待enq: TX – row lock contention,注意看machine,这里是standby主机。执行的sql语句文本我们也能看到。

SQL> select USERNAME, FINAL_BLOCKING_SESSION, FINAL_BLOCKING_INSTANCE, EVENT LOCKWAIT, STATUS, machine, service_name , sql_id, con_id from gv$session where username is not null;

USERNAME	FINAL_BLOCKING_SESSION FINAL_BLOCKING_INSTANCE LOCKWAIT 		      STATUS   MACHINE		    SERVICE_NAME		   SQL_ID	     CON_ID
--------------- ---------------------- ----------------------- ------------------------------ -------- -------------------- ------------------------------ ------------- ----------
SYS							       SQL*Net message from client    INACTIVE primary		    DB193H1						  0
SYS							       OFS idle 		      ACTIVE   primary		    SYS$BACKGROUND					  0
A1							       SQL*Net message to client      ACTIVE   primary		    SYS$USERS			   cssctts2u81n4	  0
PUBLIC							       SQL*Net message from client    INACTIVE standby		    DB193H1						  0
A1				    49			     1 enq: TX - row lock contention  ACTIVE   standby		    SYS$USERS			   du4ukzh4mjns1	  0

SQL> select sql_text from v$sql where sql_id = 'du4ukzh4mjns1';
SQL_TEXT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
UPDATE "TESTDML" "A1" SET "ID" = 2 WHERE "A1"."NAME"='name'

大约过了1分钟,在备库上就会出现下面的错误。分布式事务等待锁超时了。

SQL> update testdml set id=2 where name='name';
update testdml set id=2 where name='name'
*
ERROR at line 1:
ORA-02049: timeout: distributed transaction waiting for lock
ORA-02063: preceding line from ADGREDIRECT

Elapsed: 00:01:00.05

如果我们现在把这个反过来操作一下。看看会不会出现这个错误ORA-02049:。也就是在备库上执行事务不提交,在主库上更新。

备库上执行
update testdml set id=2 where name='name';

主库上执行
update testdml set id=2 where name='name';

这次是永远也没有发生分布式事务等待锁超时的情况。

当然官方文档还告诉我们可以在standby上执行Top-level PL/SQL操作,但是不能包含绑定变量。这个功能需要我们在会话级别执行ALTER SESSION ENABLE ADG_REDIRECT_PLSQL
Connected.
参考文档:Oracle (Active) Data Guard 19c Real-Time Data Protection and Availability

Oracle Database 19c使用Data Guard Broker进行Data Guard物理备库设置

版权声明:本文为Buddy Yuan原创文章,转载请注明出处。原文地址:Oracle Database 19c使用Data Guard Broker进行Data Guard物理备库设置
今天准备研究学习一下19c的data guard。由于想快速搭建,所以选择了vagrant去拖镜像,但是发现拖下来的东西还是没有实现dg(这里的情况不是很了解,感觉是执行脚本有个地方报错了),不过它里面的脚本我觉得没什么大问题。我自己凑合手动执行脚本搭了一个。

首先看一下先决条件。
primary主机,操作系统是RedHat 7.8,安装了oracle 19c数据库软件,创建了一个叫DB193H1的数据库。
standby主机,操作系统是RedHat 7.8,安装了oracle 19c数据库软件。

1.我们先来看一下vargant中主库需要执行的脚本primary_DB_setup.sh。创建完主库之后,需要修改一些相关参数。同时打开归档,添加STANDBY LOGFILE,并且打开FLASHBACK。由于两边数据文件保持路径一致。因此设置STANDBY_FILE_MANAGEMENT为AUTO即可。这里还把dg_broker_start打开了。

${DB_HOME}/bin/sqlplus / as sysdba 
ALTER SYSTEM SET db_create_file_dest='/u02/oradata';
ALTER SYSTEM SET db_create_online_log_dest_1='/u02/oradata';
ALTER SYSTEM SET local_listener='LISTENER';
ALTER SYSTEM SET db_recovery_file_dest_size=20G;
ALTER SYSTEM SET db_recovery_file_dest='/u01/app/oracle';
exit;


${DB_HOME}/bin/sqlplus / as sysdba 
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;

ALTER DATABASE FORCE LOGGING;
ALTER SYSTEM SWITCH LOGFILE;

ALTER DATABASE ADD STANDBY LOGFILE SIZE 50M;
ALTER DATABASE ADD STANDBY LOGFILE SIZE 50M;
ALTER DATABASE ADD STANDBY LOGFILE SIZE 50M;
ALTER DATABASE ADD STANDBY LOGFILE SIZE 50M;

ALTER DATABASE FLASHBACK ON;

ALTER SYSTEM SET STANDBY_FILE_MANAGEMENT=AUTO;
exit;



${DB_HOME}/bin/sqlplus / as sysdba 
ALTER SYSTEM SET dg_broker_start=TRUE;
exit;

2.主库设置完毕之后,就可以配置tns链接串了,这个配置需要在主库和备库均设置。设置如下。

DB193H1 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = primary)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = DB193H1)
    )
  )

DB193H1_STDBY =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = standby)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SID = DB193H1)
    )
  )

当然还需要我们配置一下静态注册。在监听的listener.ora文件中增加下列配置,这个配置需要在主库和备库均设置

SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = DB193H1_DGMGRL)
      (DB_HOME = /u01/app/oracle/product/19.3.0.0/dbhome_1)
      (SID_NAME = DB193H1)
    )
  )

3.接下来是备库的设置。我们可以按照standby_DB_setup.sh的脚本来设置。先创建相关文件夹(这里需要注意的是,如果你使用了CDB和PDB,就需要创建上面那pdbseed和pdb相关文件夹,没有则不需要),然后生成密码文件。这里的DB_NAME取名为DB193H1_STDBY

if [ "${CDB}" == "true" ]
then
  mkdir -p /u02/oradata/${DB_NAME}/pdbseed
  mkdir -p /u02/oradata/${DB_NAME}/pdb1
fi

mkdir -p ${DB_BASE}/fast_recovery_area/${DB_NAME}
mkdir -p ${DB_BASE}/admin/${DB_NAME}/adump

orapwd file=$ORACLE_HOME/dbs/orapw${DB_NAME} password=${SYS_PASSWORD} entries=10 format=12

4.上述配置完成后,创建参数文件,启动备库到nomount状态。

cat > /tmp/init_standby.ora 
*.db_name='${DB_NAME}'
*.local_listener='LISTENER'


echo "-----------------------------------------------------------------"
echo -e "${INFO}`date +%F' '%T`: Making auxillary instance"
echo "-----------------------------------------------------------------"
export ORACLE_SID=${DB_NAME}
${DB_HOME}/bin/sqlplus / as sysdba 
--SHUTDOWN IMMEDIATE;
STARTUP NOMOUNT PFILE='/tmp/init_standby.ora';
exit;

5.在主库上执行rman连接到备库实例,执行DUPLICATE操作,这里很简单我们的参数只需要修改一个db_unique_name即可,如果文件路径不一致,就需要在这里设置db_file_name_convert和log_file_name_convert

${DB_HOME}/bin/rman TARGET sys/${SYS_PASSWORD}@${DB_NAME} AUXILIARY sys/${SYS_PASSWORD}@${DB_NAME}_STDBY 
DUPLICATE TARGET DATABASE
  FOR STANDBY
  FROM ACTIVE DATABASE
  DORECOVER
  SPFILE
    SET db_unique_name='${DB_NAME}_STDBY' COMMENT 'Standby'
  NOFILENAMECHECK;
exit;

6.上述操作完成之后就可以进入到Data Guard Broker的配置了。先进入到备库设置参数。

ALTER SYSTEM SET dg_broker_start = true;

7.接下来在主库配置Data Guard Broker

${DB_HOME}/bin/dgmgrl sys/${SYS_PASSWORD}@${DB_NAME} 
CREATE CONFIGURATION db_broker_config AS PRIMARY DATABASE IS ${DB_NAME} CONNECT IDENTIFIER IS ${DB_NAME};
ADD DATABASE ${DB_NAME}_STDBY AS CONNECT IDENTIFIER IS ${DB_NAME}_STDBY MAINTAINED AS PHYSICAL;
ENABLE CONFIGURATION;

我按照这些步骤操作完成之后,执行发现报ORA-16810错误。

[oracle@primary ~]$ dgmgrl sys/welcome1@DB193H1
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Mon Jun 8 12:49:56 2020
Version 19.3.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected to "DB193H1"
Connected as SYSDBA.
DGMGRL> 
DGMGRL>   CREATE CONFIGURATION db_broker_config AS PRIMARY DATABASE IS  DB193H1 CONNECT IDENTIFIER IS  DB193H1;
Configuration "db_broker_config" created with primary database "db193h1"
DGMGRL> 
DGMGRL> ADD DATABASE DB193H1_STDBY AS CONNECT IDENTIFIER IS DB193H1_STDBY MAINTAINED AS PHYSICAL;
Database "db193h1_stdby" added
DGMGRL> ENABLE CONFIGURATION;

Enabled.
DGMGRL> DGMGRL> 
DGMGRL> 
DGMGRL> 
DGMGRL> SHOW CONFIGURATION;

Configuration - db_broker_config

  Protection Mode: MaxPerformance
  Members:
  db193h1       - Primary database
    db193h1_stdby - Physical standby database 
      Error: ORA-16810: multiple errors or warnings detected for the member

经过详细查看备库的状态,发现是ORA-16766: Redo Apply is stopped,原来是一直没有启动备库的redo apply。

DGMGRL> show database verbose db193h1_stdby

Database - db193h1_stdby

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 0 seconds ago)
  Apply Lag:          (unknown)
  Average Apply Rate: (unknown)
  Active Apply Rate:  (unknown)
  Maximum Apply Rate: (unknown)
  Real Time Query:    OFF
  Instance(s):
    DB193H1

  Database Error(s):
    ORA-16766: Redo Apply is stopped

  Database Warning(s):
    ORA-16854: apply lag could not be determined

  Properties:
    DGConnectIdentifier             = 'db19h1_stdby'
    ObserverConnectIdentifier       = ''
    FastStartFailoverTarget         = ''
    PreferredObserverHosts          = ''
    LogShipping                     = 'ON'
    RedoRoutes                      = ''
    LogXptMode                      = 'ASYNC'
    DelayMins                       = '0'
    Binding                         = 'optional'
    MaxFailure                      = '0'
    ReopenSecs                      = '300'
    NetTimeout                      = '30'
    RedoCompression                 = 'DISABLE'
    PreferredApplyInstance          = ''
    ApplyInstanceTimeout            = '0'
    ApplyLagThreshold               = '30'
    TransportLagThreshold           = '30'
    TransportDisconnectedThreshold  = '30'
    ApplyParallel                   = 'AUTO'
    ApplyInstances                  = '0'
    StandbyFileManagement           = ''
    ArchiveLagTarget                = '0'
    LogArchiveMaxProcesses          = '0'
    LogArchiveMinSucceedDest        = '0'
    DataGuardSyncLatency            = '0'
    LogArchiveTrace                 = '0'
    LogArchiveFormat                = ''
    DbFileNameConvert               = ''
    LogFileNameConvert              = ''
    ArchiveLocation                 = ''
    AlternateLocation               = ''
    StandbyArchiveLocation          = ''
    StandbyAlternateLocation        = ''
    InconsistentProperties          = '(monitor)'
    InconsistentLogXptProps         = '(monitor)'
    LogXptStatus                    = '(monitor)'
    SendQEntries                    = '(monitor)'
    RecvQEntries                    = '(monitor)'
    HostName                        = 'standby'
    StaticConnectIdentifier         = '(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=primary)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=DB193H1_STDBY_DGMGRL)(INSTANCE_NAME=DB193H1)(SERVER=DEDICATED)))'
    TopWaitEvents                   = '(monitor)'
    SidName                         = '(monitor)'

  Log file locations:
    Alert log               : /u01/app/oracle/diag/rdbms/db193h1_stdby/DB193H1/trace/alert_DB193H1.log
    Data Guard Broker log   : /u01/app/oracle/diag/rdbms/db193h1_stdby/DB193H1/trace/drcDB193H1.log

Database Status:
ERROR

切换到备库,执行应用,再次查看dg broker配置就正常了

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;

DGMGRL> SHOW CONFIGURATION;

Configuration - db_broker_config

  Protection Mode: MaxPerformance
  Members:
  db193h1       - Primary database
    db193h1_stdby - Physical standby database 

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 46 seconds ago)

DGMGRL> SHOW DATABASE db193h1

Database - db193h1

  Role:               PRIMARY
  Intended State:     TRANSPORT-ON
  Instance(s):
    DB193H1

Database Status:
SUCCESS

DGMGRL> SHOW DATABASE db193h1_stdby

Database - db193h1_stdby

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 1 second ago)
  Apply Lag:          0 seconds (computed 1 second ago)
  Average Apply Rate: 3.00 KByte/s
  Real Time Query:    ON
  Instance(s):
    DB193H1

Database Status:
SUCCESS

安装顺利成功,接下来我们就可以来研究19c dataguard的新特性了。感兴趣的小伙伴可以自行研究下vargant中的脚本。

一次enq:TX-index contention故障分析

版权声明:本文为Buddy Yuan原创文章,转载请注明出处。原文地址:一次enq:TX-index contention故障分析
这个故障是头一天晚上9点多开始,客户发现索引enq:TX-index contention分裂很严重,系统程序插入的时候有大量的enq:TX-index contention等待,同是伴随着还有gc方面的等待。在12点多的时候,重建了索引最后恢复了。现在需要我们分析根本原因。

首先通过当天晚上10点多的AWR报告可以发现,enq:TX-index contention分裂很严重,两个节点之间的网络延迟非常严重。


 

 

通过对比正常时间段和异常时间段的leaf node splits,发现每秒种索引分裂的个数并没有异常增长,也就侧面说明业务量没有增长。

 

 

我们做个相关插入sql语句的sqlhc报告,发现引起索引争用的语句。执行的频次没有发现很大的增长。

 

 
通过分析,我们可以认定这个插入SQL是被影响的。那么究竟是被什么影响的呢?

 
通过分析7天的awr情况发现在5月20日从早上11点开始,数据库CPU就一直增长,到下午18点已经达到了80%,最终一直涨到100%。继续查询AWR报告,发现占用CPU高的语句就一条。占比达到95%。

 
对该语句做SQLHC报告,发现该语句的执行计划并没变,执行次数也没有明显的差异,但是逻辑读在问题期间大幅增长了。从而导致执行语句需要更多的cpu。

 

 

 
分析该语句的执行计划,可以发现,走了全表扫描。而它使用的谓词条件上正好有索引,优化器可以使用索引INX_OWN_O_I_REQ_ORDER_NO,但是没有使用,通过查看报告,发现索引的order_no上缺乏列的统计信息。

 
此时处理的办法有两种,一种是对表进行统计信息收集,在收集的时候可以选择,method_opt=> ‘for all indexed columns’。第二种方法就是对SQL语句进行绑定。客户选择了第二种方式,我们将SQL绑定之后,系统消耗的cpu逐步降低,上面那条插入语句瞬间就好了很多,客户反映插入速度变快了。最终这个问题被定义为CPU资源繁忙导致网络软中断,引起了gc,最终导致enq:TX-index contention争用严重的问题。

一次”qmxdpls_subhea”内存耗尽的问题分析

版权声明:本文为Buddy Yuan原创文章,转载请注明出处。原文地址:一次”QMXDPLS_SUBHEA”内存耗尽的问题分析
一套RAC数据库其中的一个节点在17点05分左发生了重启。通过分析发现是内存耗尽导致了交换分区的使用,同时system cpu 100%导致的crash。首先观察2节点16点-16点30的awr报告,可以发现主机内存256GB,SGA使用了80GB,PGA使用了90GB左右。

从系统残余内存来看,仍然有256-80-90=76G可以用,而操作系统正常大概能占用50GB,应该还剩余26GB可用。但是问题就在于这个机器的大页设置不正确(当前设置为60GB)。导致一部分大页的内存是空闲没法使用的。

HugePages_Total: 31748
HugePages_Free: 5831
HugePages_Rsvd: 5828
HugePages_Surp: 0
Hugepagesize: 2048 kB

在从oswatch来看,8号11点还剩30GB内存,而到了9号11点则剩下16GB内存,到了下午的15点左右下降到了1.68GB。可见内存是持续缓慢的下降耗尽的。不是一个并发或者大的事务上来导致的。

zzz ***Sun Dec 8 11:00:03 CST 2019
MemTotal: 264410232 kB
MemFree: 31472452 kB

zzz ***Mon Dec 9 10:00:15 CST 2019
MemTotal: 264410232 kB
MemFree: 17231808 kB

zzz ***Mon Dec 9 16:00:19 CST 2019
MemTotal: 264410232 kB
MemFree: 1765052 kB

那么内存缓慢耗尽是什么原因呢?通过观察节点1没重启的进程情况,发现以下问题:

DEDICATED 7062 RM bea 405.396004 405383755 428595614 3145728 428595614
DEDICATED 7856 RM bea 431.021004 432892859 457824670 5505024 457824670
DEDICATED 13649 RM bea 491.771004 496860795 516807070 786432 516807070
DEDICATED 14834 RM bea 503.458504 509014107 531356062 3080192 531356062
DEDICATED 12960 RM bea 539.208504 548221595 570612126 4849664 570612126
DEDICATED 3089 RM bea 541.458504 548346875 572774814 4653056 572774814
DEDICATED 3619 RM bea 565.958504 575983915 599382430 5570560 599382430
DEDICATED 231 RM bea 666.458504 680193475 704633246 5439488 704633246
DEDICATED 21156 RM bea 703.333504 718921091 738646430 786432 738646430
DEDICATED 19747 RM bea 820.896004 839278371 861723038 589824 861723038

SQL> select sid,event,program,machine,sql_id,status from v$session where sid=19747;

SID EVENT PROGRAM MACHINE SQL_ID STATUS
---------- ------------------------------ ------------------------------------------------ ---------------------------------------- ------------- --------
19747 SQL*Net message from client JDBC Thin Client whrm_app6 INACTIVE

SQL> select sid,event,program,machine,sql_id,PREV_SQL_ID,status from v$session where sid=19747;

SID EVENT PROGRAM MACHINE SQL_ID PREV_SQL_ID STATUS
---------- ------------------------------ ------------------------------------------------ ---------------------------------------- ------------- ------------- --------
19747 SQL*Net message from client JDBC Thin Client whrm_app6 bunvx480ynf57 INACTIVE

SQL> select sid,event,program,machine,sql_id,PREV_SQL_ID,status from v$session where sid=21156;
SID EVENT PROGRAM MACHINE SQL_ID PREV_SQL_ID STATUS
---------- ------------------------------ ------------------------------------------------ ---------------------------------------- ------------- ------------- --------
21156 SQL*Net message from client JDBC Thin Client whrm_app8 bunvx480ynf57 INACTIVE

SQL> select sid,event,program,machine,sql_id,PREV_SQL_ID,status from v$session where sid=231;

SID EVENT PROGRAM MACHINE SQL_ID PREV_SQL_ID STATUS
---------- ------------------------------ ------------------------------------------------ ---------------------------------------- ------------- ------------- --------
231 SQL*Net message from client JDBC Thin Client whrm_app6 bunvx480ynf57 INACTIVE

可以看到相关进程占用的内存很高,已经是inactive了,但是内存并没发生回收。通过oradebug dump heapdump 命令dump其中的800MB的进程,发现进程分配了12000个extent,而这些extend中的chunk都是”qmxdpls_subhea“。

而关于这个内存区域的问题,通过mos搜索,可以发现出现大量的memory leak(内存泄露)的情况。

我们把这个情况反映给了应用人员,应用人员根据程序定位到是没关闭xmltype相关对象导致的内存泄露。对于XMLTYPE这类的操作,需要打开后执行free()进行关闭,释放内存。

参考文档:
Best Practises for XMLType Temporary LOB Usage (Doc ID 1955135.1)
How to Release Temporary LOB Segments without Closing the JDBC Connection (Doc ID 1384829.1)

四个节点实例异常重启故障分析

今天发生了一个故障,居然是四个RAC节点全部同一时间发生重启了,这种情况还是很少见的。以下是问题分析过程

节点1日志

2020-05-12T04:50:13.955295+08:00
Errors in file /u01/app/oracle/diag/rdbms/dwdb/dwdb1/trace/dwdb1_ora_118868.trc  (incident=4418543):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
skgxpvfynet: mtype: 61 process 123980 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
opidrv aborting process M003 ospid (123980) as a result of ORA-603
2020-05-12T05:07:29.342931+08:00
Reconfiguration started (old inc 8, new inc 10)
List of instances (total 1) :
1
Dead instances (total 3) :
2 3 4
My inst 1
2020-05-12T05:07:31.305816+08:00
NOTE: ASMB process state dumped to trace file /u01/app/oracle/diag/rdbms/dwdb/dwdb1/trace/dwdb1_gen0_178160.trc
2020-05-12T05:07:32.229320+08:00
skgxpvfynet: mtype: 61 process 124166 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
opiodr aborting process unknown ospid (124166) as a result of ORA-603
2020-05-12T05:09:27.659099+08:00
skgxpvfynet: mtype: 61 process 124818 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
2020-05-12T05:10:08.457843+08:00
Process m003 died, see its trace file
2020-05-12T05:10:34.164320+08:00
DDE: Problem Key 'ORA 603' was completely flood controlled (0x6)
Further messages for this problem key will be suppressed for up to 10 minutes
2020-05-12T05:10:39.931465+08:00
skgxpvfynet: mtype: 61 process 125146 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
2020-05-12T05:10:39.931458+08:00
opiodr aborting process unknown ospid (124818) as a result of ORA-603
2020-05-12T05:10:39.931454+08:00
GCR0[178417]: LMHB process succesfully killed
2020-05-12T05:11:40.473632+08:00
opiodr aborting process unknown ospid (125146) as a result of ORA-603
2020-05-12T05:18:01.842419+08:00
ksxp_exafusion_enabled_dcf: ipclw_enabled=0
Starting ORACLE instance (normal) (OS id: 108120)

节点2日志

2020-05-12T04:47:56.141213+08:00
Incremental checkpoint up to RBA [0x254f.3988e5.0], current log tail at RBA [0x254f.39cc3a.0]
2020-05-12T05:05:18.334836+08:00
NOTE: ASMB0 registering with ASM instance as Flex client 0x10006 (reg:3205650803) (reconnect)
2020-05-12T05:06:11.234166+08:00
Dumping diagnostic data in directory=[cdmp_20200512050610], requested by (instance=4, osid=605758 (GEN0)), summary=[abnormal instance termination].
2020-05-12T05:06:12.197007+08:00
ERROR: terminating instance because ASMB is stuck for 241 seconds
GEN0 (ospid: 110374): terminating the instance due to error 15082
2020-05-12T05:06:18.242304+08:00
Instance terminated by GEN0, pid = 110374
2020-05-12T05:15:33.418428+08:00
ksxp_exafusion_enabled_dcf: ipclw_enabled=0

节点3日志

2020-05-12T05:05:49.290986+08:00
LGWR (ospid: 364374) waits for event 'enq: CF - contention' for 74 secs.
2020-05-12T05:05:49.291033+08:00
LGWR (ospid: 364374) is hung in an acceptable location (cfio 0x11.00).
2020-05-12T05:06:05.433427+08:00
NOTE: ASMB0 registering with ASM instance as Flex client 0x10007 (reg:4119066454) (reconnect)
2020-05-12T05:06:11.239521+08:00
Dumping diagnostic data in directory=[cdmp_20200512050610], requested by (instance=4, osid=605758 (GEN0)), summary=[abnormal instance termination].
2020-05-12T05:06:16.340150+08:00
Dumping diagnostic data in directory=[cdmp_20200512050613], requested by (instance=2, osid=110374 (GEN0)), summary=[abnormal instance termination].
2020-05-12T05:06:34.967473+08:00
ERROR: terminating instance because ASMB is stuck for 242 seconds
GEN0 (ospid: 364027): terminating the instance due to error 15082
2020-05-12T05:06:38.182059+08:00
System state dump requested by (instance=3, osid=364027 (GEN0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/dwdb/dwdb3/trace/dwdb3_diag_364051_20200512050638.trc
2020-05-12T05:06:40.996565+08:00
Instance terminated by GEN0, pid = 364027
2020-05-12T05:15:36.490770+08:00
ksxp_exafusion_enabled_dcf: ipclw_enabled=0 
Starting ORACLE instance (normal) (OS id: 10317)
2020-05-12T05:15:36.499203+08:00
CLI notifier numLatches:131 maxDescs:19888
2020-05-12T05:15:36.501049+08:00
**********************************************************************
2020-05-12T05:15:36.501093+08:00
Dump of system resources acquired for SHARED GLOBAL AREA (SGA)

节点4日志

2020-05-12T05:04:12.064862+08:00
CKPT (ospid: 605951) waits for event 'enq: CF - contention' for 63 secs.
2020-05-12T05:04:12.064910+08:00
CKPT (ospid: 605951) is hung in an acceptable location (cfio 0x11.00).
2020-05-12T05:05:18.151878+08:00
NOTE: ASMB0 registering with ASM instance as Flex client 0x10008 (reg:2344414256) (reconnect)
2020-05-12T05:05:25.774637+08:00
CKPT (ospid: 605951) waits for event 'enq: CF - contention' for 137 secs.
2020-05-12T05:05:25.774701+08:00
CKPT (ospid: 605951) is hung in an acceptable location (inwait 0x201.00).
2020-05-12T05:06:10.637714+08:00
ERROR: terminating instance because ASMB is stuck for 242 seconds
GEN0 (ospid: 605758): terminating the instance due to error 15082
2020-05-12T05:06:10.646458+08:00
DWPUB(3):opiodr aborting process unknown ospid (220276) as a result of ORA-1092
2020-05-12T05:06:10.646808+08:00
DWPUB(3):opiodr aborting process unknown ospid (330270) as a result of ORA-1092
2020-05-12T05:06:10.647256+08:00
DWPUB(3):opiodr aborting process unknown ospid (371195) as a result of ORA-1092
2020-05-12T05:06:10.651213+08:00
DWPUB(3):opiodr aborting process unknown ospid (220280) as a result of ORA-1092
2020-05-12T05:06:11.233314+08:00
System state dump requested by (instance=4, osid=605758 (GEN0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/dwdb/dwdb4/trace/dwdb4_diag_605768_20200512050611.trc
2020-05-12T05:06:16.690258+08:00
Instance terminated by GEN0, pid = 605758

从上述日志可以看出,节点4是先宕机的,然后节点2接着宕机,节点3在宕机,最后是节点1最后宕机。节点1最后认为节点234都已经Dead。

List of instances (total 1) :
 1
Dead instances (total 3) :
 2 3 4
My inst 1

继续分析节点1,在宕机之前,节点1后台一直在报错。

ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2

配合节点1 oswatch的内存图,可以发现问题期间内存free值非常低。

进一步分析空闲内存低的时候top情况,发现系统开启大量的并行进程。结合当时的addm报告,可以发现应用当时create table的时候开了很多并行。

内存不足导致ORA-错误,通过mos文档。
“ORA-27301: OS failure message: No buffer space available” occurs on OPC RAC (Doc ID 2397062.1)

需要设置参数vm.min_free_kbytes。
当请求分配内存的时候,如果有足够的内存,则可以成功分配,当没有足够内存的时候,操作就会阻塞。他需要等待系统先去释放内存,再分配内存。而对于系统级别的一些原子性的请求,它是不能被阻塞的,如果分配不到内存的话,就会出现失败。内核为了避免原子请求失败,必须设置一块保留的内存。而这个就是通过这个参数来设置的。
节点4的oswwatch内存图

这里可以看到节点4的内存也是free持续变低。


从top进程中可以看到,节点4的情况和节点1不一样,osysmond进程占用内存逐步增加,内存free越来越少。通过mos查询,可以发现osysmond进程在12c版本存在内存泄露的情况,Bug 18701017 – Memory Leak with osysmond.bin (Doc ID 18701017.8)

优化建议
1.建议所有节点设置参数vm.min_free_kbytes,必须设置一块保留的内存,该参数可以让内核避免原子请求失败。具体值需要主机工程师评估。
2.节点1建表语句开启并行太高,建议将单个语句并行个数控制在160以内。

3.所有集群节点关闭System Monitor Service(osysmond)。osysmond会将每个节点的资源使用情况发送给cluster logger service,后者将会把所有节点的信息都接收并保存到CHM的资料库。由于我们已经安装oswatch,所以这个可以关闭。

On each node, as root user:
# /bin/crsctl stop res ora.crf -init
# /bin/crsctl modify res ora.crf -attr ENABLED=0 -init

一次ORA-12547: TNS:lost contact问题分析

版权声明:本文为Buddy Yuan原创文章,转载请注明出处。原文地址:一次ORA-12547: TNS:LOST CONTACT问题分析

今天无意中连一套数据库,连接的时候直接报ORA-12547: TNS:lost contact

这种问题也是很奇怪的,我随手就做了一个strace进行分析了一下。

[oracle@itwgbx2 admin]$ strace -f -o /tmp/trace.1.log $ORACLE_HOME/bin/sqlplus / as sysdba

SQL*Plus: Release 10.2.0.4.0 - Production on Fri May 29 20:30:17 2020

Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.

ERROR:
ORA-12547: TNS:lost contact

通过strace生成的trace文件进一步分析,如下所示:

可以看到这里产生了一个trace文件,打开trace文件,可以发现这里报”opiino: Attach failed! error=-1 “错误。

查看MOS文档”opiino: Attach failed! error=-1 ifvp=0″ written in Database UDUMP file (Doc ID 399727.1),里面说明了原因:
1.这个错误是由数据库在客户端进程的OPI / OCI(两次任务)上异常断开引起的。如果在数据库可以完成两次任务通信功能之前(异常)终止了客户端,则会发生这种情况。
2.如果Oracle二进制文件位于使用NFS挂载,使用了noac参数/选项,则也可能引发此错误。
3.另一个可能的原因,以及可能导致连接大量增加或断开的原因是,是10g引入的INBOUND_CONNECT_TIMEOUT参数,设置的比默认值还要低。另一个错误可能在数据库警报日志中报告的异常断开,特别是与11g中经常出现ORA-609 “opiodr aborting process unknown ospid。这两种情况非常类似,都是从“opi”数据库层生成的错误最终导致客户端异常。
4.另一个潜在(但罕见)的原因是,用于客户端访问的监听版本比正在访问的数据库的版本低。例如,使用9.2.x监听器访问未经认证或支持的10.x数据库,这将导致连接异常
上述几种错误的形式确实很多需要去验证,我基本上确认了2,3,4在我的系统中不可能,难道是1这种情况。1这种情况需要收集更多的信息来确认。
受影响的客户端和数据库之间的网络是否处于稳定状态?
连接活动是否突然增加并且异常增加,可能会使数据库连接数爆满?
客户端是否异常终止(例如机器故障)?
我又仔细的check了一遍,确认不存在上述问题。就在不知道如何在查下去的时候,我手动又把刚刚的strace文件又做了一遍grep,这次grep发现了一个新的方向。
[oracle@itwgbx2 tmp]$ cat trace.1.log | grep -i TNS-
26554 write(7, “TNS-12649: Unknown encryption or”…, 58
26553 write(8, “TNS-12547: TNS:lost contact\n”, 28
26553 write(8, “TNS-00517: Lost contact\n”, 24
可以看到,这里出现了一个错误”TNS-12649: Unknown encryption or”。这肯定是监听加密有问题。我查看了一下sqlnet.ora文件,确实上面设置了一些监听加密的参数,我们这些加密的参数全部取消掉,再次登录就恢复了正常。这波真是山穷水复疑无路,柳暗花明又一村。