
Analyzing a RAC Listener Offline Problem on the AIX Platform

 

The call came in a little after 4 pm. The weather in Changsha was muggy and a thunderstorm was brewing outside. The customer was calling to ask why the listeners on both nodes of their cluster had gone offline by themselves; an Oracle engineer from another company on site had answered that it was probably an Oracle bug, which the customer did not find convincing, so they contacted me directly. From the information the customer gave me there was no way to conclude it was a bug. My first reaction was that the VIP had gone offline and dragged the listener down with it through their dependency; problems with this listener/VIP pair are hardly rare. Still, other bugs could not be ruled out: bugs that take a listener offline automatically are not numerous, but there are more of them than you can count on one hand. After getting the relevant logs from the customer, I started the analysis.

Analysis approach:

When you run into this kind of problem, the first logs to collect and review are:
1. The database alert logs (both instances)
2. The Clusterware alert log (actually not that important here)
3. The CRSD log
4. The RACG logs (VIP, ONS)
5. The operating system log

Since the listener went offline on both nodes, I first checked the alert logs on both nodes and found only the records left by the on-site engineer restarting the cluster, as follows:
Node 1

Wed Aug  6 10:59:19 2014
Thread 1 advanced to log sequence 227 (LGWR switch)
  Current log# 13 seq# 227 mem# 0: +DATA/orcl/onlinelog/redo13.log
Wed Aug  6 13:58:02 2014
Thread 1 advanced to log sequence 228 (LGWR switch)
  Current log# 18 seq# 228 mem# 0: +DATA/orcl/onlinelog/redo10.log
Wed Aug  6 14:47:15 2014
Shutting down instance: further logons disabled
Wed Aug  6 14:47:15 2014
Stopping background process CJQ0
Wed Aug  6 14:47:15 2014

Node 2

Wed Aug 6 08:17:37 2014
Thread 2 advanced to log sequence 205 (LGWR switch)
Current log# 14 seq# 205 mem# 0: +DATA/orcl/onlinelog/redo14.log
Wed Aug 6 11:00:24 2014
Thread 2 advanced to log sequence 206 (LGWR switch)
Current log# 15 seq# 206 mem# 0: +DATA/orcl/onlinelog/redo15.log
Wed Aug 6 14:48:12 2014
Reconfiguration started (old inc 12, new inc 14)
List of nodes:
1
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Wed Aug 6 14:48:13 2014
LMS 0: 0 GCS shadows cancelled, 0 closed
Wed Aug 6 14:48:13 2014
LMS 1: 0 GCS shadows cancelled, 0 closed
Wed Aug 6 14:48:13 2014
LMS 3: 0 GCS shadows cancelled, 0 closed
Wed Aug 6 14:48:13 2014
LMS 2: 0 GCS shadows cancelled, 0 closed
...

The alert logs did not give us anything useful, so, out of old habit, I went straight to the CRSD logs. The relevant entries and my analysis are as follows:

1. Node 1 log analysis

2014-08-06 14:13:57.778: [  CRSAPP][11095]32CheckResource error for ora.his-ora1.vip error code = 1
==> At 14:13:57 the check of the VIP resource fails
2014-08-06 14:13:57.818: [  CRSRES][11095]32In stateChanged, ora.his-ora1.vip target is ONLINE
2014-08-06 14:13:57.819: [  CRSRES][11095]32ora.his-ora1.vip on his-ora1 went OFFLINE unexpectedly
==> The clusterware marks the VIP resource as OFFLINE
2014-08-06 14:13:57.820: [  CRSRES][11095]32StopResource: setting CLI values
2014-08-06 14:13:57.825: [  CRSRES][11095]32Attempting to stop `ora.his-ora1.vip` on member `his-ora1`
2014-08-06 14:13:58.418: [  CRSRES][11095]32Stop of `ora.his-ora1.vip` on member `his-ora1` succeeded.
==> The clusterware stops the node 1 VIP resource
2014-08-06 14:13:58.419: [  CRSRES][11095]32ora.his-ora1.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2014-08-06 14:13:58.423: [  CRSRES][11095]32ora.his-ora1.vip failed on his-ora1 relocating.
==> ora.his-ora1.vip has failed on his-ora1 and CRS starts relocating it
2014-08-06 14:13:58.520: [  CRSRES][11095]32StopResource: setting CLI values
2014-08-06 14:13:58.525: [  CRSRES][11095]32Attempting to stop `ora.his-ora1.LISTENER_HIS-ORA1.lsnr` on member `his-ora1`
==> Because of its dependency on the VIP, CRS also stops the listener resource
2014-08-06 14:15:15.951: [  CRSRES][11095]32Stop of `ora.his-ora1.LISTENER_HIS-ORA1.lsnr` on member `his-ora1` succeeded.
==> Same as above: the listener stop completes
2014-08-06 14:15:16.022: [  CRSRES][11095]32Attempting to start `ora.his-ora1.vip` on member `his-ora2`
2014-08-06 14:15:17.882: [  CRSRES][11095]32Start of `ora.his-ora1.vip` on member `his-ora2` succeeded.
2014-08-06 14:15:24.997: [  CRSRES][11185]32startRunnable: setting CLI values

2. Node 2 log analysis

2014-07-14 17:35:59.469: [  CRSRES][11648]32Start of `ora.orcl.orcl2.inst` on member `his-ora2` succeeded.
2014-08-06 14:14:06.866: [  CRSAPP][11469]32CheckResource error for ora.his-ora2.vip error code = 1
==> Likewise, at 14:14:06 the check of the VIP resource fails, about 9 seconds after node 1
2014-08-06 14:14:06.879: [  CRSRES][11469]32In stateChanged, ora.his-ora2.vip target is ONLINE
2014-08-06 14:14:06.881: [  CRSRES][11469]32ora.his-ora2.vip on his-ora2 went OFFLINE unexpectedly
==> The clusterware marks the node 2 VIP resource as OFFLINE
2014-08-06 14:14:06.881: [  CRSRES][11469]32StopResource: setting CLI values
2014-08-06 14:14:06.886: [  CRSRES][11469]32Attempting to stop `ora.his-ora2.vip` on member `his-ora2`
2014-08-06 14:14:07.473: [  CRSRES][11469]32Stop of `ora.his-ora2.vip` on member `his-ora2` succeeded.
==> The clusterware stops the node 2 VIP resource
2014-08-06 14:14:07.474: [  CRSRES][11469]32ora.his-ora2.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2014-08-06 14:14:07.479: [  CRSRES][11469]32ora.his-ora2.vip failed on his-ora2 relocating.
==> ora.his-ora2.vip has failed on his-ora2 and CRS starts relocating it
2014-08-06 14:14:07.524: [  CRSRES][11469]32StopResource: setting CLI values
2014-08-06 14:14:07.528: [  CRSRES][11469]32Attempting to stop `ora.his-ora2.LISTENER_HIS-ORA2.lsnr` on member `his-ora2`
2014-08-06 14:15:16.008: [  CRSRES][11548]32startRunnable: setting CLI values
2014-08-06 14:15:24.982: [  CRSRES][11469]32Stop of `ora.his-ora2.LISTENER_HIS-ORA2.lsnr` on member `his-ora2` succeeded.
==> Because of its dependency on the VIP, CRS also stops the listener resource
2014-08-06 14:15:25.013: [  CRSRES][11469]32Attempting to start `ora.his-ora2.vip` on member `his-ora1`
2014-08-06 14:15:27.911: [  CRSRES][11469]32Start of `ora.his-ora2.vip` on member `his-ora1` succeeded.

The CRSD logs above show that the failure really was caused by the VIPs going offline unexpectedly, which in turn took the listeners offline. The trouble started just after 14:13: the VIP resources on the two nodes failed one after the other and were marked offline without being restarted (RESTART_ATTEMPTS=0), and while failing each VIP over, CRS stopped the corresponding LISTENER resource. The root cause of the listener resources going offline is therefore the VIPs, which is consistent with the judgment I had already formed on the phone with the customer.

So the problem is now fairly clear: the listeners went offline because the VIPs went offline abnormally. What, then, is the root cause of the VIP failures, given that the two VIPs went offline almost simultaneously? In my experience, the usual causes of a VIP going offline are:

1. Quite a few known bugs can take a VIP offline
2. Network fluctuation causing the ping check to time out
3. Network faults, such as switch, cabling, or NIC problems
4. A wrongly configured gateway, and so on

The clusterware logs had nothing further that would help locate the root cause of the VIP going offline, so the next step was to check the operating system for error reports. As this is AIX, I used errpt -a on both hosts to look for the most recent errors; the latest entries on both machines were from July 3, 2014, far too old to be relevant, and no similar failure had occurred in between, so OS-level problems could basically be ruled out. So was it a network problem after all, or a bug?
At this point two things can be done: on one hand, enable tracing of the VIP resources on both nodes with crsctl debug log resource ora.his-ora2.vip:1 and analyze the resulting trace; on the other hand, increase the ping timeout so that an occasional network fluctuation does not take the VIP resource offline. The commands for both are given below:

1. Debug the VIP resource

To increase the trace level, issue:
crsctl debug log resource <resource_name>:1    (e.g. crsctl debug log resource ora.his-ora1.vip:1)

To turn off this tracing, use:
crsctl debug log resource <resource_name>:0

2. Increase the VIP check ping timeout

Edit the racgvip script and change PING_TIMEOUT=" -c 1 -w 1" to PING_TIMEOUT=" -c 1 -w 3".

With nothing more to be gained from the logs, I called the customer to report the diagnosis so far and suggested, first, checking the network for any fluctuation or abnormality affecting the RAC, and second, tracing the current VIP resources, analyzing the trace files, and increasing the ping timeout. Fortunately, just as I was getting ready to set up the trace, the customer called back: at 14:13 that afternoon the switch serving the RAC had failed over, and the failover had taken a while to complete. During that window the gateway could not be pinged, the VIPs went offline, and the symptom the customer had reported followed. Not entirely at ease, I still enabled the trace on the VIP resources, planning to turn it off after a while and review the trace files.

During the follow-up checks I also noticed one issue: the gateway configured for the private interconnect is the same as the one for the public network. I do not recommend this setup in a RAC environment, as the two screenshots below show:
(Figures: node 1 public NIC configuration; node 1 private NIC configuration)


A quick script for checking recent redo generation and waits to decide whether the redo size needs to be increased

Normally, to decide whether a database's redo logs should be enlarged, we look at how frequently the online redo logs switch, check the alert log for "checkpoint not complete" messages, and check whether the log file waits in AWR look abnormal, and from that decide whether to increase the log size or the number of log groups. Today I found a very nice script that reports the average redo generated per hour over the past several days along with statistics on LGWR-related waits; it can also serve as an estimate of redo volume.
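For reference, the traditional switch-frequency check mentioned above can be done with a quick query against v$log_history; a minimal sketch of my own (not part of the Pythian script below):

-- Count online redo log switches per hour over the last 7 days;
-- a sustained rate of many switches per hour usually means the logs are too small.
select to_char(first_time, 'YYYY-MM-DD HH24') hour, count(*) switches
  from v$log_history
 where first_time > sysdate - 7
 group by to_char(first_time, 'YYYY-MM-DD HH24')
 order by 1;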

The output of the full script looks something like the figure below:
(Figure: sample output of the redo analysis script)

The "Redo to Add, MB" column is the suggested amount of redo to add; roughly speaking, the script computes it by multiplying the redo generation rate per second in each interval by the seconds spent waiting on "log file switch (checkpoint incomplete)" in that interval. The output also includes checkpoint waits, overall CPU usage, physical read and write counts, log-related wait events, process counts, and other statistics.

The full script is as follows:

 

set pagesize 50000 linesize 300

col instance_number             format 99                   head 'In|st'
col tim                                                     head 'Period end'
col cpu_sec                     format 999,999,999.9        head 'CPU used|sec'
col phy_reads                   format 999,999,999          head 'Physical|reads'
col phy_writes                  format 999,999,999          head 'Physical|writes'
col cr_served                   format 999,999,999          head 'CR blocks|served'
col current_served              format 999,999,999          head 'CUR blocks|served'
col redo_mb                     format 999,999,999.9        head 'Redo, MB'
col processes                   format 999,999              head 'Proc|esses'
col avg_df_seq                  format 9,999.9              head 'Avg 1|read'
col avg_df_scat                 format 9,999.9              head 'Avg N|read'
col redo_diff_to_md_pct         format 999,999              head 'Redo Diff|to median, %'
col avg_lfpw                    format 999.99               head 'Avg|LFPW'
col avg_log_sync                format 9,999.99             head 'Avg Log|Sync, ms'
col log_ckpt_sec                format 999,999              head 'CKPT|waits, s'
col redo_needed                 format 999,999              head 'Redo to|Add, MB'

compute max of cpu_sec          on instance_number
compute max of phy_reads        on instance_number
compute max of phy_writes       on instance_number
compute max of cr_served        on instance_number
compute max of current_served   on instance_number
compute max of phy_writes       on instance_number
compute max of redo_needed      on instance_number
compute max of log_ckpt_sec     on instance_number
compute max of avg_log_sync     on instance_number
compute max of avg_lfpw         on instance_number
compute max of redo_mb          on instance_number
compute max of processes        on instance_number
compute max of avg_df_seq       on instance_number
compute max of avg_df_scat      on instance_number

break on instance_number skip page

with t_interval as
(
 select /*+ inline */ sysdate-30 begin, sysdate as end from dual
)
select
  stats.dbid                                                                dbid,
  stats.instance_number                                                     instance_number,
  to_char(stats.begin_interval_time, 'YYYYMMDD HH24MI')                     tim,
  stats.cpu_used / 100                                                      cpu_sec,
  stats.phy_reads                                                           phy_reads,
  stats.phy_writes                                                          phy_writes,
  stats.cr_served                                                           cr_served,
  stats.current_served                                                      current_served,
  stats.redo_size / 1024 / 1024                                             redo_mb,
  procs.current_utilization                                                 processes,
--
  waits.df_seq_micro / 1000 / nullif(waits.df_seq_waits,0)                  avg_df_seq,
  waits.df_scat_micro / 1000 / nullif(waits.df_scat_waits,0)                avg_df_scat,
  (stats.redo_size - stats.md_redo_size) * 100 / stats.md_redo_size         redo_diff_to_md_pct,
  stats.redo_write_time*10/stats.redo_writes                                avg_lfpw,
  waits.log_sync_micro/nullif(waits.log_sync_waits, 0) / 1000               avg_log_sync,
  waits.log_ckpt_micro/1e6                                                  log_ckpt_sec,
  ( stats.redo_size /
     ( waits.snap_interval * 86400 ) ) *
   ( waits.log_ckpt_micro/1e6 ) / 1024 / 1024                               redo_needed,
  stats.is_restart
from
  (
   select
     snap_id,
     begin_interval_time,
     snap_interval,
     instance_number,
     dbid,
     log_sync_micro,
     log_sync_waits,
     log_ckpt_micro,
     log_ckpt_waits,
     df_seq_micro,
     df_seq_waits,
     df_scat_micro,
     df_scat_waits,
     direct_micro,
     direct_waits,
     median(log_sync_micro/nullif(log_sync_waits, 0)) over (partition by dbid, instance_number) md_log_sync_micro
   from
   (
      select
        snap_id,
        begin_interval_time,
        instance_number,
        dbid,
        max(snap_interval) snap_interval,
        max(decode(event_name, 'log file sync',                            wait_micro))    log_sync_micro,
        max(decode(event_name, 'log file sync',                            total_waits))   log_sync_waits,
        max(decode(event_name, 'log file switch (checkpoint incomplete)',  wait_micro))    log_ckpt_micro,
        max(decode(event_name, 'log file switch (checkpoint incomplete)',  total_waits))   log_ckpt_waits,
        max(decode(event_name, 'db file sequential read',                  wait_micro))    df_seq_micro,
        max(decode(event_name, 'db file sequential read',                  total_waits))   df_seq_waits,
        max(decode(event_name, 'db file scattered read',                   wait_micro))    df_scat_micro,
        max(decode(event_name, 'db file scattered read',                   total_waits))   df_scat_waits,
        max(decode(event_name, 'direct path read',                         wait_micro))    direct_micro,
        max(decode(event_name, 'direct path read',                         total_waits))   direct_waits
      from
      (
        select
          e.snap_id,
          e.instance_number,
          e.dbid,
          sn.begin_interval_time,
          cast(begin_interval_time as date) - cast(lag(begin_interval_time) over (partition by e.dbid, e.instance_number, e.event_name order by sn.begin_interval_time) as date) snap_interval,
          sn.startup_time,
          e.event_name,
          case when (sn.begin_interval_time >= sn.startup_time and lag(sn.begin_interval_time) over (partition by e.dbid, e.instance_number, e.event_name order by sn.begin_interval_time) < sn.startup_time)
            then e.time_waited_micro
            else e.time_waited_micro - lag(e.time_waited_micro) over (partition by e.dbid, e.instance_number, e.event_name order by sn.begin_interval_time)
          end wait_micro,
          case when (sn.begin_interval_time >= sn.startup_time and lag(sn.begin_interval_time) over (partition by e.dbid, e.instance_number, e.event_name order by sn.begin_interval_time) < sn.startup_time)
            then e.total_waits
            else e.total_waits - lag(e.total_waits) over (partition by e.dbid, e.instance_number, e.event_name order by sn.begin_interval_time)
          end total_waits
        from
          dba_hist_system_event e,
          dba_hist_snapshot     sn,
          t_interval            t
        where
          sn.snap_id = e.snap_id and
          sn.dbid = e.dbid and
          sn.instance_number = e.instance_number and
          sn.begin_interval_time between t.begin and t.end and
          e.event_name in (
            'log file sync',
            'log file switch (checkpoint incomplete)',
            'db file sequential read',
            'db file scattered read',
            'direct path read'
          )
      )
      group by dbid, instance_number, begin_interval_time, snap_id
    )
  ) waits,
  (
    select
      snap_id,
      begin_interval_time,
      instance_number,
      dbid,
      redo_size,
      redo_write_time,
      redo_writes,
      is_restart,
      cpu_used,
      phy_reads,
      phy_reads_cache,
      phy_writes,
      phy_writes_cache,
      cr_served,
      current_served,
      median(redo_size) over (partition by dbid, instance_number) md_redo_size
    from
    (
      select
        snap_id,
        begin_interval_time,
        instance_number,
        dbid,
        max(is_restart) is_restart,
        max(decode(stat_name, 'redo size',                       stat_diff)) redo_size,
        max(decode(stat_name, 'redo write time',                 stat_diff)) redo_write_time,
        max(decode(stat_name, 'redo writes',                     stat_diff)) redo_writes,
        max(decode(stat_name, 'CPU used by this session',        stat_diff)) cpu_used,
        max(decode(stat_name, 'physical read total IO requests', stat_diff)) phy_reads,
        max(decode(stat_name, 'physical reads cache',            stat_diff)) phy_reads_cache,
        max(decode(stat_name, 'physical write total IO requests',stat_diff)) phy_writes,
        max(decode(stat_name, 'physical writes from cache',      stat_diff)) phy_writes_cache,
        max(decode(stat_name, 'gc cr blocks served',             stat_diff)) cr_served,
        max(decode(stat_name, 'gc current blocks served',        stat_diff)) current_served
      from
      (
        select
          stats.snap_id,
          stats.instance_number,
          stats.dbid,
          sn.begin_interval_time,
          sn.startup_time,
          stats.stat_name,
          case when (sn.begin_interval_time >= sn.startup_time and lag(sn.begin_interval_time) over (partition by stats.dbid, stats.instance_number, stats.stat_id order by sn.begin_interval_time) < sn.startup_time)
            then stats.value
            else stats.value - lag(stats.value) over (partition by stats.dbid, stats.instance_number, stats.stat_id order by stats.snap_id)
          end stat_diff,
          case when (sn.begin_interval_time >= sn.startup_time and lag(sn.begin_interval_time) over (partition by stats.dbid, stats.instance_number, stats.stat_id order by sn.begin_interval_time) < sn.startup_time)
            then 'Yes'
          end is_restart
        from
          dba_hist_sysstat   stats,
          dba_hist_snapshot  sn,
          t_interval         t
        where
          sn.snap_id = stats.snap_id and
          sn.dbid = stats.dbid and
          sn.instance_number = stats.instance_number and
          sn.begin_interval_time between t.begin and t.end and
          stats.stat_name in (
            'redo size',
            'redo write time',
            'redo writes',
            'CPU used by this session',
            'physical read total IO requests',
            'physical reads cache',
            'physical write total IO requests',
            'physical writes from cache',
            'gc cr blocks served',
            'gc current blocks served'
          )
      )
      group by dbid, instance_number, begin_interval_time, snap_id
    )
  ) stats,
  (
    select
      stats.snap_id,
      stats.instance_number,
      stats.dbid,
      stats.resource_name,
      stats.current_utilization
    from
      dba_hist_resource_limit stats,
      dba_hist_snapshot       sn,
      t_interval              t
    where
      sn.snap_id = stats.snap_id and
      sn.dbid = stats.dbid and
      sn.instance_number = stats.instance_number and
      sn.begin_interval_time between t.begin and t.end and
      stats.resource_name = 'processes'
  ) procs
where
  waits.dbid = stats.dbid and
  waits.instance_number = stats.instance_number and
  waits.snap_id = stats.snap_id and
  waits.dbid = procs.dbid and
  waits.instance_number = procs.instance_number and
  waits.snap_id = procs.snap_id
order by
 stats.dbid, stats.instance_number, stats.begin_interval_time
;

This script comes from Pythian.

How to catch statements whose execution plans have changed when monitoring Oracle

In day-to-day database operations we often worry that some important SQL will suddenly perform much worse, dragging the whole database into poor performance and slow response, or even bringing it to a standstill, which does happen from time to time. The usual approach is to go back afterwards with the awrsqrpt.sql script and dig the statements whose execution changed out of the top SQL. Today it occurred to me to wonder whether a script could do the comparison by itself. Back home I started typing and testing, on an 11gR1 platform: pull the sql_id data out of the dba_hist_sqlstat dictionary view and compare it, and from the comparison obtain the affected SQL_IDs and the best execution plan for each. The query can be run on a fixed schedule, the results used for monitoring, and any statement that has drifted to a worse plan can then be promptly steered back to its best plan with a SQL_PROFILE or, in 11g, SPM.
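The core of the comparison is simply to look for sql_ids that have recorded more than one plan_hash_value in dba_hist_sqlstat; a minimal sketch of my own of that idea (the full script below adds the elapsed-time and snapshot details):

-- sql_ids that have run with more than one execution plan within the AWR retention window
select sql_id, count(distinct plan_hash_value) plan_count
  from dba_hist_sqlstat
 where plan_hash_value > 0
 group by sql_id
having count(distinct plan_hash_value) > 1
 order by plan_count desc;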

The full statement is as follows:

spool sql_have_logs_plan.txt
set lines 220 pages 9999 trimspool on
set numformat 999,999,999
column plan_hash_value format 99999999999999
column min_snap format 999999
column max_snap format 999999
column min_avg_ela format 999,999,999,999,999
column avg_ela format 999,999,999,999,999
column ela_gain format 999,999,999,999,999
select sql_id,
       min(min_snap_id) min_snap,
       max(max_snap_id) max_snap,
       max(decode(rw_num,1,plan_hash_value)) plan_hash_value,
       max(decode(rw_num,1,avg_ela)) min_avg_ela,
       avg(avg_ela) avg_ela,
       avg(avg_ela) - max(decode(rw_num,1,avg_ela)) ela_gain,
       -- max(decode(rw_num,1,avg_buffer_gets)) min_avg_buf_gets,
       -- avg(avg_buffer_gets) avg_buf_gets,
       max(decode(rw_num,1,sum_exec))-1 min_exec,
       avg(sum_exec)-1 avg_exec
from (
  select sql_id, plan_hash_value, avg_buffer_gets, avg_ela, sum_exec,
         row_number() over (partition by sql_id order by avg_ela) rw_num , min_snap_id, max_snap_id
  from
  (
    select sql_id, plan_hash_value , sum(BUFFER_GETS_DELTA)/(sum(executions_delta)+1) avg_buffer_gets,
    sum(elapsed_time_delta)/(sum(executions_delta)+1) avg_ela, sum(executions_delta)+1 sum_exec,
    min(snap_id) min_snap_id, max(snap_id) max_snap_id
    from dba_hist_sqlstat a
    where exists  (
       select sql_id from dba_hist_sqlstat b where a.sql_id = b.sql_id
         and  a.plan_hash_value != b.plan_hash_value
         and  b.plan_hash_value > 0)
    and plan_hash_value > 0
    group by sql_id, plan_hash_value
    order by sql_id, avg_ela
  )
  order by sql_id, avg_ela
  )
group by sql_id
having max(decode(rw_num,1,sum_exec)) > 1
order by 7 desc
/
spool off
clear columns
set numformat 9999999999

A sample execution result is shown below:

SQL> spool sql_with_more_than_1plan.txt
set lines 220 pages 9999 trimspool on
SQL> SQL> set numformat 999,999,999
SQL> column plan_hash_value format 99999999999999
SQL> column min_snap format 999999
SQL> column max_snap format 999999
SQL> column min_avg_ela format 999,999,999,999,999
SQL> column avg_ela format 999,999,999,999,999
SQL> column ela_gain format 999,999,999,999,999
SQL> select sql_id,
2 min(min_snap_id) min_snap,
3 max(max_snap_id) max_snap,
4 max(decode(rw_num,1,plan_hash_value)) plan_hash_value,
5 max(decode(rw_num,1,avg_ela)) min_avg_ela,
6 avg(avg_ela) avg_ela,
7 avg(avg_ela) - max(decode(rw_num,1,avg_ela)) ela_gain,
8 -- max(decode(rw_num,1,avg_buffer_gets)) min_avg_buf_gets,
9 -- avg(avg_buffer_gets) avg_buf_gets,
10 max(decode(rw_num,1,sum_exec))-1 min_exec,
11 avg(sum_exec)-1 avg_exec
12 from (
13 select sql_id, plan_hash_value, avg_buffer_gets, avg_ela, sum_exec,
14 row_number() over (partition by sql_id order by avg_ela) rw_num , min_snap_id, max_snap_id
15 from
16 (
17 select sql_id, plan_hash_value , sum(BUFFER_GETS_DELTA)/(sum(executions_delta)+1) avg_buffer_gets,
18 sum(elapsed_time_delta)/(sum(executions_delta)+1) avg_ela, sum(executions_delta)+1 sum_exec,
19 min(snap_id) min_snap_id, max(snap_id) max_snap_id
20 from dba_hist_sqlstat a
21 where exists (
22 select sql_id from dba_hist_sqlstat b where a.sql_id = b.sql_id
23 and a.plan_hash_value != b.plan_hash_value
24 and b.plan_hash_value > 0)
25 and plan_hash_value > 0
26 group by sql_id, plan_hash_value
27 order by sql_id, avg_ela
28 )
29 order by sql_id, avg_ela
30 )
31 group by sql_id
32 having max(decode(rw_num,1,sum_exec)) > 1
33 order by 7 desc
34 /
spool off
clear columns
set numformat 9999999999
SQL_ID MIN_SNAP MAX_SNAP PLAN_HASH_VALUE MIN_AVG_ELA AVG_ELA ELA_GAIN MIN_EXEC AVG_EXEC
------------- -------- -------- --------------- -------------------- -------------------- -------------------- ------------ ------------
gfjvxb25b773h 192 204 3801470479 6,368 171,152 164,784 1 1
7asddw0r2wykx 192 211 3660107835 10,311 16,448 6,137 57 30
7ng34ruy5awxq 192 209 2606284882 117 320 204 426 1,371
g278wv0zzpn5j 195 211 3142739690 0 189 189 72 63
1gu8t96d0bdmu 192 209 2035254952 67 154 87 826 1,134
bsa0wjtftg3uw 193 211 2020579421 0 18 18 224 191
0kugqg48477gf 201 202 3321035584 0 0 0 19 27

7 rows selected.

The query above returns 7 rows. MIN_SNAP and MAX_SNAP are the earliest and latest snapshots in which that sql_id appears; PLAN_HASH_VALUE is the best execution plan observed; and the ELA_GAIN column shows how much time per execution would be saved, compared with the average elapsed time, if the best plan were always used.
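Before forcing a plan it is usually worth pulling the competing plans for a flagged sql_id out of AWR and comparing them; a minimal sketch using DBMS_XPLAN.DISPLAY_AWR, taking 7asddw0r2wykx (the sql_id adjusted in the SPM example below) as the example:

-- Show every plan captured in AWR for this sql_id; pass the plan_hash_value
-- as the second argument to display just one of them.
select * from table(dbms_xplan.display_awr('7asddw0r2wykx'));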

Once the results are in, we can change the execution plan of the affected sql_id either with a SQL profile (see Lao Liu's article on verifying the execution plan of a sql_profile) or, in 11g, with dbms_spm.load_plans_from_cursor_cache. Taking 11g as the example, the following change pins the statement 7asddw0r2wykx found above to its best plan:

SQL> variable v_sqlid number
SQL> begin
2 :v_sqlid :=
3 dbms_spm.load_plans_from_cursor_cache(
4 sql_id=>'&sql_id',
5 plan_hash_value=>&plan_hash,
6 fixed=>'YES');
7 end;
8 /
Enter value for sql_id: 7asddw0r2wykx
old 4: sql_id=>'&sql_id',
new 4: sql_id=>'7asddw0r2wykx',
Enter value for plan_hash: 3660107835
old 5: plan_hash_value=>&plan_hash,
new 5: plan_hash_value=>3660107835,

PL/SQL procedure successfully completed.
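To confirm that the plan has been captured and fixed (a follow-up check of my own, not shown in the original output), query dba_sql_plan_baselines; the newly loaded baseline should show ACCEPTED = YES and FIXED = YES:

select sql_handle, plan_name, origin, enabled, accepted, fixed
  from dba_sql_plan_baselines
 order by created desc;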

Running into ORA-00600 [6711]

The errors are as follows:

 

Sat Jul 26 10:22:23 2014
Errors in file /oracle/admin/icdb/udump/icdb_ora_459112.trc:
ORA-00600: internal error code, arguments: [6711], [4257353], [1], [4255900], [0], [], [], []
Sat Jul 26 10:24:07 2014
Errors in file /oracle/admin/icdb/udump/icdb_ora_459112.trc:
ORA-00600: internal error code, arguments: [6711], [4257353], [1], [4255900], [0], [], [], []
Sat Jul 26 10:31:38 2014
Errors in file /oracle/admin/icdb/udump/icdb_ora_56557800.trc:
ORA-00600: internal error code, arguments: [6711], [4257353], [1], [4255900], [0], [], [], []
Sat Jul 26 10:31:59 2014
Errors in file /oracle/admin/icdb/udump/icdb_ora_56557800.trc:
ORA-00600: internal error code, arguments: [6711], [4257353], [1], [4255900], [0], [], [], []
Sat Jul 26 11:00:30 2014
Errors in file /oracle/admin/icdb/bdump/icdb_m000_56557774.trc:
ORA-00600: internal error code, arguments: [kcbz_check_objd_typ], [0], [0], [1], [], [], [], []
Sat Jul 26 11:01:00 2014
Errors in file /oracle/admin/icdb/udump/icdb_ora_57344088.trc:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
Sat Jul 26 12:00:39 2014
Errors in file /oracle/admin/icdb/bdump/icdb_m000_63307938.trc:
ORA-00600: internal error code, arguments: [kcbz_check_objd_typ], [0], [0], [1], [], [], [], []
Sat Jul 26 12:06:47 2014
Thread 1 advanced to log sequence 32908 (LGWR switch)
  Current log# 1 seq# 32908 mem# 0: /database/icdb/redo01.log
Sat Jul 26 12:09:41 2014
Thread 1 advanced to log sequence 32909 (LGWR switch)
  Current log# 2 seq# 32909 mem# 0: /database/icdb/redo02.log
Sat Jul 26 12:13:06 2014
Thread 1 advanced to log sequence 32910 (LGWR switch)
  Current log# 3 seq# 32910 mem# 0: /database/icdb/redo03.log
Sat Jul 26 12:16:03 2014
Thread 1 advanced to log sequence 32911 (LGWR switch)
  Current log# 1 seq# 32911 mem# 0: /database/icdb/redo01.log
Sat Jul 26 12:18:58 2014
Thread 1 advanced to log sequence 32912 (LGWR switch)
  Current log# 2 seq# 32912 mem# 0: /database/icdb/redo02.log
Sat Jul 26 13:00:41 2014
Errors in file /oracle/admin/icdb/bdump/icdb_m000_7668146.trc:
ORA-00600: internal error code, arguments: [kcbz_check_objd_typ], [0], [0], [1], [], [], [], []
Sat Jul 26 13:00:44 2014
Errors in file /oracle/admin/icdb/bdump/icdb_m000_7668146.trc:
ORA-00600: internal error code, arguments: [13013], [5001], [9001], [12637871], [26], [12613723], [3], []

It can be confirmed that the following two errors correspond to a known bug:
ORA-00600: internal error code, arguments: [qertbFetchByRowID]
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []

7329252 10.2.0.4.4, 10.2.0.5, 11.1.0.7.5, 11.2.0.1 ORA-8102/ORA-1499/OERI[kdsgrp1] Index corruption after rebuild index ONLINE

 

That leaves:

[kcbz_check_objd_typ], [13013], [6711]

Let's analyze the [6711] error first. Working through the full alert log:

[root@ludatou /]# cat alert_icdb.log | grep 6711 |wc -l
39
[root@ludatou /]# cat alert_icdb.log | grep 4257353 |wc -l
31
[root@ludatou /]# cat alert_icdb.log | grep 4255900 |wc -l
31
[root@ludatou /]# cat alert_icdb.log | grep -v 4257353 |wc -l
171179
[root@ludatou /]# cat alert_icdb.log | grep -v 4257353 |grep 6711 |wc -l
8
[root@ludatou /]# cat alert_icdb.log | grep -v 4257353 |grep 6711
Thread 1 cannot allocate new log, sequence 6711
Thread 1 advanced to log sequence 6711 (LGWR switch)
 Current log# 3 seq# 6711 mem# 0: /database/icdb/redo03.log
 Current log# 3 seq# 6711 mem# 0: /database/icdb/redo03.log
Thread 1 advanced to log sequence 16711 (LGWR switch)
 Current log# 1 seq# 16711 mem# 0: /database/icdb/redo01.log
Thread 1 advanced to log sequence 26711 (LGWR switch)
 Current log# 2 seq# 26711 mem# 0: /database/icdb/redo02.log

The analysis above shows that the [6711] errors are all reported against the same data block addresses:
[6711], [4257353], [1], [4255900]

Decoding the two data block addresses with dbms_utility gives the file and block numbers:

SQL> SELECT dbms_utility.data_block_address_block(4257353) "BLOCK",
 2 dbms_utility.data_block_address_file(4257353) "FILE" from dual;

 BLOCK FILE
---------- ----------
 63049 1

SQL> SELECT dbms_utility.data_block_address_block(4255900) "BLOCK",
 2 dbms_utility.data_block_address_file(4255900) "FILE" from dual;

 BLOCK FILE
---------- ----------
 61596 1

set linesize 200;
select segment_name, segment_type
from dba_extents
where relative_fno = 1
and (663049 between block_id and block_id + blocks or
61596 between block_id and block_id + blocks);
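Assuming dba_extents returns the owning segment, a common next step (my own sketch, not part of the original analysis; the segment name below is hypothetical) is to validate its structure, since the accompanying ORA-600 [kdsgrp1] and [qertbFetchByRowID] errors typically point at a table/index mismatch:

-- Replace SCOTT.T_DEMO with the segment actually returned by the dba_extents query above.
-- CASCADE also cross-checks the table against its indexes.
analyze table SCOTT.T_DEMO validate structure cascade;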