High memory allocations of KGLHD引发故障,memory_imm_mode_without_autosga,进一步禁用SGA自动内存管理。

客户一套11.2.0.3的RAC数据库系统,运行一些很简单的查询语句,发现速度超级慢。然后进行查询数据字典的操作,也慢的难以忍受。这套库所在的机器配置很好,内存有100多个G,按照道理来说查询一个数据字典不能慢的如此离谱。大概50多秒才能出来。于是对数据字典进行了统计信息收集,发现速度依旧很慢。把语句的执行计划拿出来和正常数据库的执行计划进行比较,执行计划不变。最后通过10046进行观察,发现执行慢的过程中,出现大量的db file scattered read和db file sequential read等待事件。也就是在该数据库上执行查数据字典,物理读很高。这就非常值得怀疑了。看了一下buffer cache的当前大小,发现只有3G。而我们的数据库启动参数设置的buffer cache是50多G,查看alert日志,发现并没有对SGA组件大小的操作。这里需要介绍一下:该数据库memory_target,memory_max_targe,sga_target都设置为0。也就是禁用了11g AMM的特性。在SGA上对各个组件进行了详细的设置。这是一个完全禁用了自动内存管理的数据库。那么疑问是:既然设置了50G的buffer cache怎么只有3G大小。剩下的内存去哪儿了?通过检查发现Shared pool增长到了60G。

POOL         NAME                            BYTES
------------ -------------------------- ----------
java pool    free memory                1879048192
large pool   free memory                1879048192
shared pool  kscdnfyinitflags                    8
shared pool  parameter table block        22068760
shared pool  SQLP                         23932720
shared pool  KQR L PO                     26014192
shared pool  FileIdentificatonBlock       30523832
shared pool  KGLNA                        31756456
shared pool  gcs res hash bucket          33554432
shared pool  dirty object counts array    33554432
shared pool  dbwriter coalesce buffer     33587200
shared pool  ksunfy : SSO free list       37453824
shared pool  ges big msg buffers          39921208
shared pool  event statistics per sess    40108032
shared pool  KKSSP                        57755752
shared pool  KGLDA                        58894136
shared pool  object queue                101699520
shared pool  ges resource                107943072
shared pool  ASH buffers                 134217728
shared pool  dbktb: trace buffer         139575296
shared pool  ges enqueues                151274072
shared pool  gc name table               201326592
shared pool  db_block_hash_buckets       373297152
shared pool  state objects               506913016
shared pool  Checkpoint queue            536903680
shared pool  kglsim heap                 549089856
shared pool  FileOpenBlock               597301440
shared pool  gcs shadows                 819820736
shared pool  kglsim object batch        1040722032
shared pool  gcs resources              1112613856
shared pool  SQLA                       1129550336
shared pool  KGLH0                      1256899928
shared pool  free memory                1.1934E+10
shared pool  KGLHD                      4.2610E+10
streams pool spilled:kwqbm                     176
streams pool deqtree_kgqmctx                   304
streams pool recov_kgqmsub                  315032
streams pool free memory                 267720696
fixed_sga                     2228392
log_buffer                  170323968
buffer_cache               3489660928

通过观察,我们发现shared pool的KGLHD增长到了一个很大的值。达到了42个G。所以留给我们的疑问是:
1.在禁用了SGA自动调整后,为什么buffer cache会变小,而shared pool会变大?
2.为什么shared pool的KGLHD会增长到40多G?
要解释第一个问题,我们可以从文档SGA Re-Sizes Occurring Despite AMM/ASMM Being Disabled (MEMORY_TARGET/SGA_TARGET=0) [ID 1269139.1]上找到答案。这里面有一段话如下:

SGA re-sizes are occurring after upgrade to 11.2 despite the fact that automatic memory management (AMM/ASMM) is disabled via the MEMORY_TARGET and SGA_TARGET parameters being set to zero.  This typically appears as growth in the __SHARED_POOL_SIZE value and a reduction in the __DB_CACHE_SIZE value being used in the instance, such that __DB_CACHE_SIZE may be shrunk below the DB_CACHE_SIZE value specified in the init.ora/spfile. 
This is expected behavior in 11.2 for immediate memory allocation requests, which added this as a new feature when automatic memory management was disabled.  Enhancement Bug:13340694 has been logged to provide a warning message in the alertlog when such allocations occur to provide the DBA with an indication of a memory problem, as otherwise this can result in the buffer cache being shrunk below the specified minimum value, and hence lead to performance degradation. 
Ignore the resize operation which is performed to prevent an ORA-4031 error from being raised.

这里告诉我们,在11.2以后,数据库发现如果shared pool不够用,会爆发ORA-04031错误的话,它就会去“偷下”buffer cache的内存。防止ORA-04031问题的产生。虽然Oracle设想的比较好,但是我们发现如果出现Bug导致shared pool使用暴增的话,会一直”偷取”buffer cache的内存,最终导致buffer cache严重不足,引发更为严重的I/O问题。
这里不得不说下Oracle的巧妙设计,当我们设置MEMORY_TARGET和SGA_TARGET参数为0的时候,SGA自动调整将禁用DEFERRED mode的自动调整请求,但是它允许IMMEDIATE mode的自动请求,那么我们怎么分辨请求是DEFERRED mode or IMMEDIATE mode呢?通过查询V$MEMORY_RESIZE_OPS和V$SGA_RESIZE_OPS视图。

SQL> select * from V$MEMORY_RESIZE_OPS;

COMPONENT                OPER_TYPE     OPER_MODE PARAMETER             INITIAL_SIZE TARGET_SIZE FINAL_SIZE STATUS    START_TIM END_TIME
-------------------------------------- --------- ---------------------------------- ----------- ---------- --------- --------- ---------
shared pool              STATIC                  shared_pool_size                 0   255852544  255852544 COMPLETE  08-JUN-13 08-JUN-13
RECYCLE buffer cache     STATIC                  db_recycle_cache_size            0           0          0 COMPLETE  08-JUN-13 08-JUN-13
PGA Target               STATIC                  pga_aggregate_target             0   293601280  293601280 COMPLETE  08-JUN-13 08-JUN-13
large pool               STATIC                  large_pool_size                  0     4194304    4194304 COMPLETE  08-JUN-13 08-JUN-13
java pool                STATIC                  java_pool_size                   0     4194304    4194304 COMPLETE  08-JUN-13 08-JUN-13
streams pool             STATIC                  streams_pool_size                0     4194304    4194304 COMPLETE  08-JUN-13 08-JUN-13
SGA Target               STATIC                  sga_target                       0   549453824  549453824 COMPLETE  08-JUN-13 08-JUN-13
DEFAULT buffer cache     INITIALIZING            db_cache_size            272629760   272629760  272629760 COMPLETE  08-JUN-13 08-JUN-13
ASM Buffer Cache         STATIC                  db_cache_size                    0           0          0 COMPLETE  08-JUN-13 08-JUN-13
DEFAULT buffer cache     STATIC                  db_cache_size                    0   272629760  272629760 COMPLETE  08-JUN-13 08-JUN-13
DEFAULT 2K buffer cache  STATIC                  db_2k_cache_size                 0           0          0 COMPLETE  08-JUN-13 08-JUN-13
DEFAULT 4K buffer cache  STATIC                  db_4k_cache_size                 0           0          0 COMPLETE  08-JUN-13 08-JUN-13
DEFAULT 8K buffer cache  STATIC                  db_8k_cache_size                 0           0          0 COMPLETE  08-JUN-13 08-JUN-13
DEFAULT 16K buffer cache STATIC                  db_16k_cache_size                0           0          0 COMPLETE  08-JUN-13 08-JUN-13
DEFAULT 32K buffer cache STATIC                  db_32k_cache_size                0           0          0 COMPLETE  08-JUN-13 08-JUN-13
KEEP buffer cache        STATIC                  db_keep_cache_size               0           0          0 COMPLETE  08-JUN-13 08-JUN-13
DEFAULT buffer cache     SHRINK        DEFERRED  db_cache_size            272629760   260046848  260046848 COMPLETE  09-JUN-13 09-JUN-13
shared pool              GROW          DEFERRED  shared_pool_size         255852544   268435456  268435456 COMPLETE  09-JUN-13 09-JUN-13

这里我自己的测试机,我们可以看到后面两条信息,buffer cache拿出来了一些内存划到了shared pool里面。不过是DEFERRED模式的。因为我设置了memory_target的值,所以这种划是可以的。如果我们把其设置成0的话,就只能支持immediate模式的动态调整。
那么我们怎么阻止这种immediate模式的自动调整呢?这里MOS给出了一个隐含参数:_memory_imm_mode_without_autosga。通过设置该参数为false,我们就能够彻底的阻止immediate模式的SGA自动调整。

alter system set "_memory_imm_mode_without_autosga"=FALSE scope=both;

搞清楚了第一个问题,我们来看第二个问题,为什么我们shared pool中的KGLHD会增长到40多G?我们可以从文档High memory allocations of KGLHD in Shared Pool [ID 1435186.1]找到答案。这个问题是个bug,bug编号是:13250244。

Kgl handles and associated shared memory remained unfreed after flushing the library cache in cases of multiple shared pool subpools. 
The problem is easily triggered if stored outlines are used, but the use of stored outlines is not required to hit this problem.

产生这个bug的可能原因有两种:
1.使用了多个shared pool subpools,然后在flush library cache之后,Kgl handles和相关联的共享内存不释放。
2.使用了outline.在我们的环境中,我们并没有使用outline。
所以我们几乎可以断定是第一种原因引起的。所以正是因为这些内存不释放,导致Oracle认为将要出现ORA-04031错误,才会去偷buffer cache的内存。Oracle也给出两条SQL语句,让我们去监控KGLHD和outline引起的内存增长。如果我们使用了outline,我们将会看到kglhdnsp中类型为22的将高速的增长。

select count(*), kglhdnsp from x$kglob group by kglhdnsp order by kglhdnsp;
select inst_id, bytes/1024/1024 KGLHD from gv$sgastat where name='KGLHD' and pool='shared pool';

最终Oracle给出的workground是:
1.设置”_kghdsidx_count”=1,这样就只需要使用一个shared pool子池。但是这么设置之后,在高并发的环境下面,会引发一些shared pool和library cache latch的一些争用。其实为了保险起见,我建议先把这个参数往下调整,不要直接设置成1。然后慢慢观察。
2.使用11g新出的SPM来代替outline.
到此该问题已经水落石出。我们找到了根本的原因,我们只需要把_kghdsidx_count参数适当进行调整,避免KGLHD持续占用内存。就能有效的控制shared pool对buffer cache的“偷取”。
参考文档:
SGA Re-Sizes Occurring Despite AMM/ASMM Being Disabled (MEMORY_TARGET/SGA_TARGET=0) [ID 1269139.1]
High memory allocations of KGLHD in Shared Pool [ID 1435186.1]

分享到: 更多

Post a Comment

Your email is never published nor shared. Required fields are marked *