Analyzing a RAC Listener Offline Problem on AIX
The call came in a little after four in the afternoon; the weather in Changsha was muggy, with a thunderstorm brewing outside. The customer was calling to ask what could make the listeners on both cluster nodes go offline by themselves. An Oracle engineer from another company on site had answered that it was probably an Oracle bug; the customer found that unconvincing and contacted me directly. From the information the customer gave me, there was no basis for concluding it was a bug, but my first reaction was that a VIP had gone offline and dragged the listener down with it: the listener and the VIP are joined at the hip, and that pair has caused trouble often enough. Still, other bugs could not be ruled out; bugs that take a listener offline on their own are not numerous, but there are more than you can count on one hand. After getting the relevant logs from the customer, I set to work on the analysis.
Analysis approach:
For problems like this, the first logs to collect and review are the following (typical 10g locations are sketched after the list):
1. The database alert logs (both instances)
2. The clusterware alert log (actually not that important here)
3. The CRSD log
4. The RACG logs (VIP, ONS)
5. The system log
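A minimal sketch of where these usually live on a 10gR2 RAC; the paths below are defaults and assumptions (check background_dump_dest and your actual CRS home), not values taken from the customer's system:

# database alert log of each instance (default background_dump_dest)
ls $ORACLE_BASE/admin/$DB_NAME/bdump/alert_${ORACLE_SID}.log
# clusterware alert log
ls $ORA_CRS_HOME/log/`hostname`/alert`hostname`.log
# CRSD log
ls $ORA_CRS_HOME/log/`hostname`/crsd/crsd.log
# RACG logs (VIP, ONS)
ls $ORA_CRS_HOME/log/`hostname`/racg/
# AIX system error log
errpt | head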
Since the listener went offline on both nodes, I first checked the alert logs of both instances and found nothing except the records of the on-site engineer restarting the cluster, as follows:
Node 1
Wed Aug 6 10:59:19 2014
Thread 1 advanced to log sequence 227 (LGWR switch)
  Current log# 13 seq# 227 mem# 0: +DATA/orcl/onlinelog/redo13.log
Wed Aug 6 13:58:02 2014
Thread 1 advanced to log sequence 228 (LGWR switch)
  Current log# 18 seq# 228 mem# 0: +DATA/orcl/onlinelog/redo10.log
Wed Aug 6 14:47:15 2014
Shutting down instance: further logons disabled
Wed Aug 6 14:47:15 2014
Stopping background process CJQ0
Wed Aug 6 14:47:15 2014
Node 2
Wed Aug 6 08:17:37 2014
Thread 2 advanced to log sequence 205 (LGWR switch)
  Current log# 14 seq# 205 mem# 0: +DATA/orcl/onlinelog/redo14.log
Wed Aug 6 11:00:24 2014
Thread 2 advanced to log sequence 206 (LGWR switch)
  Current log# 15 seq# 206 mem# 0: +DATA/orcl/onlinelog/redo15.log
Wed Aug 6 14:48:12 2014
Reconfiguration started (old inc 12, new inc 14)
List of nodes:
 1
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Wed Aug 6 14:48:13 2014
 LMS 0: 0 GCS shadows cancelled, 0 closed
Wed Aug 6 14:48:13 2014
 LMS 1: 0 GCS shadows cancelled, 0 closed
Wed Aug 6 14:48:13 2014
 LMS 3: 0 GCS shadows cancelled, 0 closed
Wed Aug 6 14:48:13 2014
 LMS 2: 0 GCS shadows cancelled, 0 closed
...
The alert logs gave us nothing useful, so next, out of old habit, I went straight to the CRSD logs. The relevant entries and my reading of them follow:
1. Node 1 log analysis
2014-08-06 14:13:57.778: [ CRSAPP][11095]32CheckResource error for ora.his-ora1.vip error code = 1    ==> at 14:13:57 the VIP resource check fails
2014-08-06 14:13:57.818: [ CRSRES][11095]32In stateChanged, ora.his-ora1.vip target is ONLINE
2014-08-06 14:13:57.819: [ CRSRES][11095]32ora.his-ora1.vip on his-ora1 went OFFLINE unexpectedly    ==> the clusterware marks the VIP resource OFFLINE
2014-08-06 14:13:57.820: [ CRSRES][11095]32StopResource: setting CLI values
2014-08-06 14:13:57.825: [ CRSRES][11095]32Attempting to stop `ora.his-ora1.vip` on member `his-ora1`
2014-08-06 14:13:58.418: [ CRSRES][11095]32Stop of `ora.his-ora1.vip` on member `his-ora1` succeeded.    ==> the clusterware stops node 1's VIP resource
2014-08-06 14:13:58.419: [ CRSRES][11095]32ora.his-ora1.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2014-08-06 14:13:58.423: [ CRSRES][11095]32ora.his-ora1.vip failed on his-ora1 relocating.    ==> the VIP has failed on his-ora1 and is being relocated
2014-08-06 14:13:58.520: [ CRSRES][11095]32StopResource: setting CLI values
2014-08-06 14:13:58.525: [ CRSRES][11095]32Attempting to stop `ora.his-ora1.LISTENER_HIS-ORA1.lsnr` on member `his-ora1`    ==> because of the dependency, CRS takes the listener resource down as well
2014-08-06 14:15:15.951: [ CRSRES][11095]32Stop of `ora.his-ora1.LISTENER_HIS-ORA1.lsnr` on member `his-ora1` succeeded.    ==> listener stop completed
2014-08-06 14:15:16.022: [ CRSRES][11095]32Attempting to start `ora.his-ora1.vip` on member `his-ora2`
2014-08-06 14:15:17.882: [ CRSRES][11095]32Start of `ora.his-ora1.vip` on member `his-ora2` succeeded.
2014-08-06 14:15:24.997: [ CRSRES][11185]32startRunnable: setting CLI values
2. Node 2 log analysis
2014-07-14 17:35:59.469: [ CRSRES][11648]32Start of `ora.orcl.orcl2.inst` on member `his-ora2` succeeded.
2014-08-06 14:14:06.866: [ CRSAPP][11469]32CheckResource error for ora.his-ora2.vip error code = 1    ==> likewise, the VIP resource check fails at 14:14:06, nine seconds after node 1
2014-08-06 14:14:06.879: [ CRSRES][11469]32In stateChanged, ora.his-ora2.vip target is ONLINE
2014-08-06 14:14:06.881: [ CRSRES][11469]32ora.his-ora2.vip on his-ora2 went OFFLINE unexpectedly    ==> the clusterware marks ora2's VIP resource OFFLINE
2014-08-06 14:14:06.881: [ CRSRES][11469]32StopResource: setting CLI values
2014-08-06 14:14:06.886: [ CRSRES][11469]32Attempting to stop `ora.his-ora2.vip` on member `his-ora2`
2014-08-06 14:14:07.473: [ CRSRES][11469]32Stop of `ora.his-ora2.vip` on member `his-ora2` succeeded.    ==> the clusterware stops ora2's VIP resource
2014-08-06 14:14:07.474: [ CRSRES][11469]32ora.his-ora2.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2014-08-06 14:14:07.479: [ CRSRES][11469]32ora.his-ora2.vip failed on his-ora2 relocating.    ==> the VIP has failed on his-ora2 and is being relocated
2014-08-06 14:14:07.524: [ CRSRES][11469]32StopResource: setting CLI values
2014-08-06 14:14:07.528: [ CRSRES][11469]32Attempting to stop `ora.his-ora2.LISTENER_HIS-ORA2.lsnr` on member `his-ora2`
2014-08-06 14:15:16.008: [ CRSRES][11548]32startRunnable: setting CLI values
2014-08-06 14:15:24.982: [ CRSRES][11469]32Stop of `ora.his-ora2.LISTENER_HIS-ORA2.lsnr` on member `his-ora2` succeeded.    ==> because of the dependency, CRS takes the listener resource down as well
2014-08-06 14:15:25.013: [ CRSRES][11469]32Attempting to start `ora.his-ora2.vip` on member `his-ora1`
2014-08-06 14:15:27.911: [ CRSRES][11469]32Start of `ora.his-ora2.vip` on member `his-ora1` succeeded.
The CRSD logs show that the failure really did start with the VIPs going offline unexpectedly, which took the listeners offline with them. It began just after 14:13: the VIP check failed on both nodes within seconds of each other, the resources were marked OFFLINE and were not restarted (RESTART_ATTEMPTS=0), and while relocating each VIP, CRS stopped the local listener. The root of the listener offline is therefore the VIP, consistent with the judgment I had formed on the phone with the customer.
So the question is now well defined: the VIPs went offline abnormally, and both of them almost simultaneously. What was the root cause of that? In my experience, the usual causes of a VIP going offline are these (a quick check of the VIP definition itself is sketched after the list):
1. Quite a few known bugs can take a VIP offline
2. Network jitter making the ping check time out
3. A network fault, e.g. a switch, cabling, or NIC problem
4. A wrongly configured gateway, etc.
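For causes 3 and 4, it helps to confirm how each node's VIP is actually defined (address, netmask, interface). A minimal sketch using this cluster's node names; the exact output format varies by version:

srvctl config nodeapps -n his-ora1 -a     # -a prints the VIP configuration
srvctl config nodeapps -n his-ora2 -a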
But the clusterware logs held nothing further that could pin down the root cause of the VIP offline, so the next step was to look for anomalies at the OS level. On AIX that means errpt -a to review the most recent errors. The newest error entry on both hosts was from July 3, 2014, far too old to matter, and no similar fault had occurred in the meantime, so an OS problem could be essentially ruled out.
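For reference, that error-log check boils down to a few errpt invocations; the -s timestamp format (MMDDhhmmYY) is standard AIX, but the cutoff date below is only an example:

errpt | head              # one-line summary, newest entries first
errpt -a | more           # full detail for each entry
errpt -s 0701000014       # only entries logged since Jul 1 2014, 00:00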
So was it the network, or a bug after all? At this point two things could be done in parallel: enable tracing of the VIP resource on both nodes with crsctl debug log resource ora.his-ora2.vip:1 and analyze the resulting trace, and increase the ping timeout so that a transient network wobble does not take the VIP resource offline. The commands for both follow, with a concrete sketch for this cluster after them:
1. Trace the VIP resource
To increase the trace level, issue: crsctl debug log res <resource_name>:1
To turn the tracing off again, use: crsctl debug log res <resource_name>:0
2. Increase the VIP check timeout
Edit the racgvip script and change PING_TIMEOUT=" -c 1 -w 1" to PING_TIMEOUT=" -c 1 -w 3", i.e. wait up to 3 seconds for the ping reply instead of 1.
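Put together for this cluster, a sketch only, run as the clusterware owner on each node; the racgvip location assumes a standard 10g database home:

# 1. VIP resource tracing on node 1 (same idea with ora.his-ora2.vip on node 2)
crsctl debug log res ora.his-ora1.vip:1
crsctl debug log res ora.his-ora1.vip:0    # turn it back off once done

# 2. raise the ping timeout in racgvip (back the script up first)
cp $ORACLE_HOME/bin/racgvip $ORACLE_HOME/bin/racgvip.bak
vi $ORACLE_HOME/bin/racgvip                # PING_TIMEOUT=" -c 1 -w 1"  ->  " -c 1 -w 3"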
With nothing more to be had from the logs, I phoned the customer, described the diagnosis so far, and recommended checking the network for instability around the RAC, tracing the VIP resources, and increasing the timeout. Luckily, just as I was preparing the trace, the customer called back: at 14:13 that afternoon the switches serving the RAC had gone through a failover, and the whole process had taken a while. During that window the gateway could not be pinged, the VIPs went offline, and the symptom the customer reported followed. Not entirely at ease, I still traced the VIP resources for a while, then turned the tracing off and reviewed the trace files.
In the follow-up checks I also noticed that the private and the public interface share the same gateway on these nodes, a setup I do not recommend in a RAC environment; it shows up directly in the nodes' routing tables, as sketched below:
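A minimal sketch for AIX; the interface names are illustrative, not from the customer's hosts:

netstat -rn               # Route Tree: check the Gateway column of the default routes
ifconfig -a               # map interfaces (e.g. en0 public, en1 private) to addresses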
A quick script to check recent redo generation and waits, and to decide whether redo size needs increasing
We usually decide whether the online redo logs need enlarging, or the number of groups increasing, by watching how often the logs switch, checking the alert log for "Checkpoint not complete" messages, and checking whether the log file waits in AWR look normal. Today I came across a very good script: it analyzes the average redo generated per hour over the past days together with LGWR wait statistics, so it also serves as an estimate of redo volume.
The output ends with the suggested amount of redo to add, plus checkpoint waits, overall CPU usage, physical read and write volumes, log wait events, and process counts.
The script is as follows:
set pagesize 50000 linesize 300
col instance_number format 99 head 'In|st'
col tim head 'Period end'
col cpu_sec format 999,999,999.9 head 'CPU used|sec'
col phy_reads format 999,999,999 head 'Physical|reads'
col phy_writes format 999,999,999 head 'Physical|writes'
col cr_served format 999,999,999 head 'CR blocks|served'
col current_served format 999,999,999 head 'CUR blocks|served'
col redo_mb format 999,999,999.9 head 'Redo, MB'
col processes format 999,999 head 'Proc|esses'
col avg_df_seq format 9,999.9 head 'Avg 1|read'
col avg_df_scat format 9,999.9 head 'Avg N|read'
col redo_diff_to_md_pct format 999,999 head 'Redo Diff|to median, %'
col avg_lfpw format 999.99 head 'Avg|LFPW'
col avg_log_sync format 9,999.99 head 'Avg Log|Sync, ms'
col log_ckpt_sec format 999,999 head 'CKPT|waits, s'
col redo_needed format 999,999 head 'Redo to|Add, MB'
compute max of cpu_sec on instance_number
compute max of phy_reads on instance_number
compute max of phy_writes on instance_number
compute max of cr_served on instance_number
compute max of current_served on instance_number
compute max of redo_needed on instance_number
compute max of log_ckpt_sec on instance_number
compute max of avg_log_sync on instance_number
compute max of avg_lfpw on instance_number
compute max of redo_mb on instance_number
compute max of processes on instance_number
compute max of avg_df_seq on instance_number
compute max of avg_df_scat on instance_number
break on instance_number skip page

with t_interval as (
  select /*+ inline */ sysdate-30 begin, sysdate as end from dual
)
select stats.dbid dbid,
       stats.instance_number instance_number,
       to_char(stats.begin_interval_time, 'YYYYMMDD HH24MI') tim,
       stats.cpu_used / 100 cpu_sec,
       stats.phy_reads phy_reads,
       stats.phy_writes phy_writes,
       stats.cr_served cr_served,
       stats.current_served current_served,
       stats.redo_size / 1024 / 1024 redo_mb,
       procs.current_utilization processes,
       -- waits.df_seq_micro / 1000 / nullif(waits.df_seq_waits,0) avg_df_seq,
       waits.df_scat_micro / 1000 / nullif(waits.df_scat_waits,0) avg_df_scat,
       (stats.redo_size - stats.md_redo_size) * 100 / stats.md_redo_size redo_diff_to_md_pct,
       stats.redo_write_time*10/stats.redo_writes avg_lfpw,
       waits.log_sync_micro/nullif(waits.log_sync_waits, 0) / 1000 avg_log_sync,
       waits.log_ckpt_micro/1e6 log_ckpt_sec,
       ( stats.redo_size / ( waits.snap_interval * 86400 ) ) * ( waits.log_ckpt_micro/1e6 ) / 1024 / 1024 redo_needed,
       stats.is_restart
from
(
  select snap_id, begin_interval_time, snap_interval, instance_number, dbid,
         log_sync_micro, log_sync_waits,
         log_ckpt_micro, log_ckpt_waits,
         df_seq_micro, df_seq_waits,
         df_scat_micro, df_scat_waits,
         direct_micro, direct_waits,
         median(log_sync_micro/nullif(log_sync_waits, 0))
           over (partition by dbid, instance_number) md_log_sync_micro
  from (
    select snap_id, begin_interval_time, instance_number, dbid,
           max(snap_interval) snap_interval,
           max(decode(event_name, 'log file sync', wait_micro)) log_sync_micro,
           max(decode(event_name, 'log file sync', total_waits)) log_sync_waits,
           max(decode(event_name, 'log file switch (checkpoint incomplete)', wait_micro)) log_ckpt_micro,
           max(decode(event_name, 'log file switch (checkpoint incomplete)', total_waits)) log_ckpt_waits,
           max(decode(event_name, 'db file sequential read', wait_micro)) df_seq_micro,
           max(decode(event_name, 'db file sequential read', total_waits)) df_seq_waits,
           max(decode(event_name, 'db file scattered read', wait_micro)) df_scat_micro,
           max(decode(event_name, 'db file scattered read', total_waits)) df_scat_waits,
           max(decode(event_name, 'direct path read', wait_micro)) direct_micro,
           max(decode(event_name, 'direct path read', total_waits)) direct_waits
    from (
      select e.snap_id, e.instance_number, e.dbid, sn.begin_interval_time,
             cast(begin_interval_time as date)
               - cast(lag(begin_interval_time)
                        over (partition by e.dbid, e.instance_number, e.event_name
                              order by sn.begin_interval_time) as date) snap_interval,
             sn.startup_time,
             e.event_name,
             case when (sn.begin_interval_time >= sn.startup_time
                        and lag(sn.begin_interval_time)
                              over (partition by e.dbid, e.instance_number, e.event_name
                                    order by sn.begin_interval_time) < sn.startup_time)
                  then e.time_waited_micro
                  else e.time_waited_micro - lag(e.time_waited_micro)
                         over (partition by e.dbid, e.instance_number, e.event_name
                               order by sn.begin_interval_time)
             end wait_micro,
             case when (sn.begin_interval_time >= sn.startup_time
                        and lag(sn.begin_interval_time)
                              over (partition by e.dbid, e.instance_number, e.event_name
                                    order by sn.begin_interval_time) < sn.startup_time)
                  then e.total_waits
                  else e.total_waits - lag(e.total_waits)
                         over (partition by e.dbid, e.instance_number, e.event_name
                               order by sn.begin_interval_time)
             end total_waits
      from dba_hist_system_event e, dba_hist_snapshot sn, t_interval t
      where sn.snap_id = e.snap_id
        and sn.dbid = e.dbid
        and sn.instance_number = e.instance_number
        and sn.begin_interval_time between t.begin and t.end
        and e.event_name in ( 'log file sync',
                              'log file switch (checkpoint incomplete)',
                              'db file sequential read',
                              'db file scattered read',
                              'direct path read' )
    )
    group by dbid, instance_number, begin_interval_time, snap_id
  )
) waits,
(
  select snap_id, begin_interval_time, instance_number, dbid,
         redo_size, redo_write_time, redo_writes, is_restart,
         cpu_used, phy_reads, phy_reads_cache, phy_writes, phy_writes_cache,
         cr_served, current_served,
         median(redo_size) over (partition by dbid, instance_number) md_redo_size
  from (
    select snap_id, begin_interval_time, instance_number, dbid,
           max(is_restart) is_restart,
           max(decode(stat_name, 'redo size', stat_diff)) redo_size,
           max(decode(stat_name, 'redo write time', stat_diff)) redo_write_time,
           max(decode(stat_name, 'redo writes', stat_diff)) redo_writes,
           max(decode(stat_name, 'CPU used by this session', stat_diff)) cpu_used,
           max(decode(stat_name, 'physical read total IO requests', stat_diff)) phy_reads,
           max(decode(stat_name, 'physical reads cache', stat_diff)) phy_reads_cache,
           max(decode(stat_name, 'physical write total IO requests',stat_diff)) phy_writes,
           max(decode(stat_name, 'physical writes from cache', stat_diff)) phy_writes_cache,
           max(decode(stat_name, 'gc cr blocks served', stat_diff)) cr_served,
           max(decode(stat_name, 'gc current blocks served', stat_diff)) current_served
    from (
      select stats.snap_id, stats.instance_number, stats.dbid,
             sn.begin_interval_time, sn.startup_time, stats.stat_name,
             case when (sn.begin_interval_time >= sn.startup_time
                        and lag(sn.begin_interval_time)
                              over (partition by stats.dbid, stats.instance_number, stats.stat_id
                                    order by sn.begin_interval_time) < sn.startup_time)
                  then stats.value
                  else stats.value - lag(stats.value)
                         over (partition by stats.dbid, stats.instance_number, stats.stat_id
                               order by stats.snap_id)
             end stat_diff,
             case when (sn.begin_interval_time >= sn.startup_time
                        and lag(sn.begin_interval_time)
                              over (partition by stats.dbid, stats.instance_number, stats.stat_id
                                    order by sn.begin_interval_time) < sn.startup_time)
                  then 'Yes'
             end is_restart
      from dba_hist_sysstat stats, dba_hist_snapshot sn, t_interval t
      where sn.snap_id = stats.snap_id
        and sn.dbid = stats.dbid
        and sn.instance_number = stats.instance_number
        and sn.begin_interval_time between t.begin and t.end
        and stats.stat_name in ( 'redo size', 'redo write time', 'redo writes',
                                 'CPU used by this session',
                                 'physical read total IO requests', 'physical reads cache',
                                 'physical write total IO requests', 'physical writes from cache',
                                 'gc cr blocks served', 'gc current blocks served' )
    )
    group by dbid, instance_number, begin_interval_time, snap_id
  )
) stats,
(
  select stats.snap_id, stats.instance_number, stats.dbid,
         stats.resource_name, stats.current_utilization
  from dba_hist_resource_limit stats, dba_hist_snapshot sn, t_interval t
  where sn.snap_id = stats.snap_id
    and sn.dbid = stats.dbid
    and sn.instance_number = stats.instance_number
    and sn.begin_interval_time between t.begin and t.end
    and stats.resource_name = 'processes'
) procs
where waits.dbid = stats.dbid
  and waits.instance_number = stats.instance_number
  and waits.snap_id = stats.snap_id
  and waits.dbid = procs.dbid
  and waits.instance_number = procs.instance_number
  and waits.snap_id = procs.snap_id
order by stats.dbid, stats.instance_number, stats.begin_interval_time
;
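A usage note: the reporting window is set in the t_interval CTE at the top (sysdate-30, i.e. the last 30 days); shorten that expression to focus on recent days. Since everything comes from the DBA_HIST views, the script needs AWR data covering the window, and querying those views assumes the Diagnostics Pack is licensed.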
The script comes from Pythian.
Catching statements whose execution plan has changed, for Oracle monitoring
In day-to-day operations we constantly worry that some important SQL will suddenly perform worse, making the whole database sluggish or even bringing it to a standstill; usually this is handled after the fact, by running awrsqrpt.sql against the top SQL to analyze the statements whose plans changed. Today it occurred to me to ask whether a script could do the comparison by itself. Back home I started typing and testing, on an 11gR1 platform: pull the per-plan statistics for each SQL ID from the dba_hist_sqlstat dictionary view, compare them, and obtain the SQL_IDs that have run with more than one plan, along with each statement's best plan. Scheduled at a fixed interval, the query's successive outputs can be compared for monitoring, and any statement that has drifted away from its best plan can be promptly pinned back to it with a SQL profile or, in 11g, SPM.
The query is as follows:
spool sql_have_logs_plan.txt
set lines 220 pages 9999 trimspool on
set numformat 999,999,999
column plan_hash_value format 99999999999999
column min_snap format 999999
column max_snap format 999999
column min_avg_ela format 999,999,999,999,999
column avg_ela format 999,999,999,999,999
column ela_gain format 999,999,999,999,999
select sql_id,
       min(min_snap_id) min_snap,
       max(max_snap_id) max_snap,
       max(decode(rw_num,1,plan_hash_value)) plan_hash_value,
       max(decode(rw_num,1,avg_ela)) min_avg_ela,
       avg(avg_ela) avg_ela,
       avg(avg_ela) - max(decode(rw_num,1,avg_ela)) ela_gain,
       -- max(decode(rw_num,1,avg_buffer_gets)) min_avg_buf_gets,
       -- avg(avg_buffer_gets) avg_buf_gets,
       max(decode(rw_num,1,sum_exec))-1 min_exec,
       avg(sum_exec)-1 avg_exec
from (
  select sql_id, plan_hash_value, avg_buffer_gets, avg_ela, sum_exec,
         row_number() over (partition by sql_id order by avg_ela) rw_num,
         min_snap_id, max_snap_id
  from (
    select sql_id, plan_hash_value,
           sum(BUFFER_GETS_DELTA)/(sum(executions_delta)+1) avg_buffer_gets,
           sum(elapsed_time_delta)/(sum(executions_delta)+1) avg_ela,
           sum(executions_delta)+1 sum_exec,
           min(snap_id) min_snap_id, max(snap_id) max_snap_id
    from dba_hist_sqlstat a
    where exists (
            select sql_id from dba_hist_sqlstat b
            where a.sql_id = b.sql_id
              and a.plan_hash_value != b.plan_hash_value
              and b.plan_hash_value > 0)
      and plan_hash_value > 0
    group by sql_id, plan_hash_value
    order by sql_id, avg_ela
  )
  order by sql_id, avg_ela
)
group by sql_id
having max(decode(rw_num,1,sum_exec)) > 1
order by 7 desc
/
spool off
clear columns
set numformat 9999999999
A sample run looks like this:
SQL> spool sql_with_more_than_1plan.txt
SQL> set lines 220 pages 9999 trimspool on
SQL> set numformat 999,999,999
SQL> column plan_hash_value format 99999999999999
SQL> column min_snap format 999999
SQL> column max_snap format 999999
SQL> column min_avg_ela format 999,999,999,999,999
SQL> column avg_ela format 999,999,999,999,999
SQL> column ela_gain format 999,999,999,999,999
SQL> select sql_id,
  2  min(min_snap_id) min_snap,
  3  max(max_snap_id) max_snap,
  4  max(decode(rw_num,1,plan_hash_value)) plan_hash_value,
  5  max(decode(rw_num,1,avg_ela)) min_avg_ela,
  6  avg(avg_ela) avg_ela,
  7  avg(avg_ela) - max(decode(rw_num,1,avg_ela)) ela_gain,
  8  -- max(decode(rw_num,1,avg_buffer_gets)) min_avg_buf_gets,
  9  -- avg(avg_buffer_gets) avg_buf_gets,
 10  max(decode(rw_num,1,sum_exec))-1 min_exec,
 11  avg(sum_exec)-1 avg_exec
 12  from (
 13  select sql_id, plan_hash_value, avg_buffer_gets, avg_ela, sum_exec,
 14  row_number() over (partition by sql_id order by avg_ela) rw_num , min_snap_id, max_snap_id
 15  from
 16  (
 17  select sql_id, plan_hash_value , sum(BUFFER_GETS_DELTA)/(sum(executions_delta)+1) avg_buffer_gets,
 18  sum(elapsed_time_delta)/(sum(executions_delta)+1) avg_ela, sum(executions_delta)+1 sum_exec,
 19  min(snap_id) min_snap_id, max(snap_id) max_snap_id
 20  from dba_hist_sqlstat a
 21  where exists (
 22  select sql_id from dba_hist_sqlstat b where a.sql_id = b.sql_id
 23  and a.plan_hash_value != b.plan_hash_value
 24  and b.plan_hash_value > 0)
 25  and plan_hash_value > 0
 26  group by sql_id, plan_hash_value
 27  order by sql_id, avg_ela
 28  )
 29  order by sql_id, avg_ela
 30  )
 31  group by sql_id
 32  having max(decode(rw_num,1,sum_exec)) > 1
 33  order by 7 desc
 34  /
SQL> spool off
SQL> clear columns
SQL> set numformat 9999999999

SQL_ID        MIN_SNAP MAX_SNAP PLAN_HASH_VALUE          MIN_AVG_ELA              AVG_ELA             ELA_GAIN     MIN_EXEC     AVG_EXEC
------------- -------- -------- --------------- -------------------- -------------------- -------------------- ------------ ------------
gfjvxb25b773h      192      204      3801470479                6,368              171,152              164,784            1            1
7asddw0r2wykx      192      211      3660107835               10,311               16,448                6,137           57           30
7ng34ruy5awxq      192      209      2606284882                  117                  320                  204          426        1,371
g278wv0zzpn5j      195      211      3142739690                    0                  189                  189           72           63
1gu8t96d0bdmu      192      209      2035254952                   67                  154                   87          826        1,134
bsa0wjtftg3uw      193      211      2020579421                    0                   18                   18          224          191
0kugqg48477gf      201      202      3321035584                    0                    0                    0           19           27

7 rows selected.
The query returned seven rows. MIN_SNAP and MAX_SNAP are the earliest and latest snapshots in which the SQL_ID appears; PLAN_HASH_VALUE is the best execution plan; ELA_GAIN is how much elapsed time per execution would be saved, relative to the average across plans, by always using the best plan. For gfjvxb25b773h, for example, that is 171,152 - 6,368 = 164,784 microseconds per execution.
With the results in hand, we can fix a plan either through a SQL profile (see Lao Liu's "Verifying the execution plan of a sql_profile") or, in 11g, through dbms_spm.load_plans_from_cursor_cache. Taking 11g as the example, let's switch the 7asddw0r2wykx statement found above to its best plan:
SQL> variable v_sqlid number
SQL> begin
  2    :v_sqlid :=
  3      dbms_spm.load_plans_from_cursor_cache(
  4        sql_id=>'&sql_id',
  5        plan_hash_value=>&plan_hash,
  6        fixed=>'YES');
  7  end;
  8  /
Enter value for sql_id: 7asddw0r2wykx
old   4:        sql_id=>'&sql_id',
new   4:        sql_id=>'7asddw0r2wykx',
Enter value for plan_hash: 3660107835
old   5:        plan_hash_value=>&plan_hash,
new   5:        plan_hash_value=>3660107835,

PL/SQL procedure successfully completed.
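To confirm that the plan was captured as a fixed baseline, you can query dba_sql_plan_baselines afterwards; a minimal check, newest baselines first:

select sql_handle, plan_name, enabled, accepted, fixed
from   dba_sql_plan_baselines
order  by created desc;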
Running into ORA-00600 [6711]
The errors were as follows:
Sat Jul 26 10:22:23 2014
Errors in file /oracle/admin/icdb/udump/icdb_ora_459112.trc:
ORA-00600: internal error code, arguments: [6711], [4257353], [1], [4255900], [0], [], [], []
Sat Jul 26 10:24:07 2014
Errors in file /oracle/admin/icdb/udump/icdb_ora_459112.trc:
ORA-00600: internal error code, arguments: [6711], [4257353], [1], [4255900], [0], [], [], []
Sat Jul 26 10:31:38 2014
Errors in file /oracle/admin/icdb/udump/icdb_ora_56557800.trc:
ORA-00600: internal error code, arguments: [6711], [4257353], [1], [4255900], [0], [], [], []
Sat Jul 26 10:31:59 2014
Errors in file /oracle/admin/icdb/udump/icdb_ora_56557800.trc:
ORA-00600: internal error code, arguments: [6711], [4257353], [1], [4255900], [0], [], [], []
Sat Jul 26 11:00:30 2014
Errors in file /oracle/admin/icdb/bdump/icdb_m000_56557774.trc:
ORA-00600: internal error code, arguments: [kcbz_check_objd_typ], [0], [0], [1], [], [], [], []
Sat Jul 26 11:01:00 2014
Errors in file /oracle/admin/icdb/udump/icdb_ora_57344088.trc:
ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], []
Sat Jul 26 12:00:39 2014
Errors in file /oracle/admin/icdb/bdump/icdb_m000_63307938.trc:
ORA-00600: internal error code, arguments: [kcbz_check_objd_typ], [0], [0], [1], [], [], [], []
Sat Jul 26 12:06:47 2014
Thread 1 advanced to log sequence 32908 (LGWR switch)
  Current log# 1 seq# 32908 mem# 0: /database/icdb/redo01.log
Sat Jul 26 12:09:41 2014
Thread 1 advanced to log sequence 32909 (LGWR switch)
  Current log# 2 seq# 32909 mem# 0: /database/icdb/redo02.log
Sat Jul 26 12:13:06 2014
Thread 1 advanced to log sequence 32910 (LGWR switch)
  Current log# 3 seq# 32910 mem# 0: /database/icdb/redo03.log
Sat Jul 26 12:16:03 2014
Thread 1 advanced to log sequence 32911 (LGWR switch)
  Current log# 1 seq# 32911 mem# 0: /database/icdb/redo01.log
Sat Jul 26 12:18:58 2014
Thread 1 advanced to log sequence 32912 (LGWR switch)
  Current log# 2 seq# 32912 mem# 0: /database/icdb/redo02.log
Sat Jul 26 13:00:41 2014
Errors in file /oracle/admin/icdb/bdump/icdb_m000_7668146.trc:
ORA-00600: internal error code, arguments: [kcbz_check_objd_typ], [0], [0], [1], [], [], [], []
Sat Jul 26 13:00:44 2014
Errors in file /oracle/admin/icdb/bdump/icdb_m000_7668146.trc:
ORA-00600: internal error code, arguments: [13013], [5001], [9001], [12637871], [26], [12613723], [3], []
Two of these can be pinned down immediately:

ORA-00600: internal error code, arguments: [qertbFetchByRowID]
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []

are a known bug:

7329252 | 10.2.0.4.4, 10.2.0.5, 11.1.0.7.5, 11.2.0.1 | ORA-8102/ORA-1499/OERI[kdsgrp1] Index corruption after rebuild index ONLINE
That leaves [kcbz_check_objd_typ], [13013] and [6711]. Starting with the 6711 error, the analysis against the full alert log went as follows:

[root@ludatou /]# cat alert_icdb.log | grep 6711 | wc -l
39
[root@ludatou /]# cat alert_icdb.log | grep 4257353 | wc -l
31
[root@ludatou /]# cat alert_icdb.log | grep 4255900 | wc -l
31
[root@ludatou /]# cat alert_icdb.log | grep -v 4257353 | wc -l
171179
[root@ludatou /]# cat alert_icdb.log | grep -v 4257353 | grep 6711 | wc -l
8
[root@ludatou /]# cat alert_icdb.log | grep -v 4257353 | grep 6711
Thread 1 cannot allocate new log, sequence 6711
Thread 1 advanced to log sequence 6711 (LGWR switch)
Current log# 3 seq# 6711 mem# 0: /database/icdb/redo03.log
Current log# 3 seq# 6711 mem# 0: /database/icdb/redo03.log
Thread 1 advanced to log sequence 16711 (LGWR switch)
Current log# 1 seq# 16711 mem# 0: /database/icdb/redo01.log
Thread 1 advanced to log sequence 26711 (LGWR switch)
Current log# 2 seq# 26711 mem# 0: /database/icdb/redo02.log

So every ORA-600 [6711] occurrence carries the same arguments, [6711], [4257353], [1], [4255900]; the remaining matches for "6711" are just log-sequence lines. Decoding the two data block addresses:

SQL> SELECT dbms_utility.data_block_address_block(4257353) "BLOCK",
  2         dbms_utility.data_block_address_file(4257353) "FILE" from dual;

     BLOCK       FILE
---------- ----------
     63049          1

SQL> SELECT dbms_utility.data_block_address_block(4255900) "BLOCK",
  2         dbms_utility.data_block_address_file(4255900) "FILE" from dual;

     BLOCK       FILE
---------- ----------
     61596          1

Then locate the segment that owns those blocks:

set linesize 200;
select segment_name, segment_type
  from dba_extents
 where relative_fno = 1
   and (63049 between block_id and block_id + blocks
        or 61596 between block_id and block_id + blocks);
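Once dba_extents identifies the owning segment, a common next step (a sketch only, with hypothetical names; not part of the original analysis) is to validate the segment's structure to confirm or rule out logical corruption:

-- OWNER.SEGMENT_NAME / OWNER.INDEX_NAME below are placeholders for the query's output
analyze table OWNER.SEGMENT_NAME validate structure cascade online;
-- if that flags a corrupt index, rebuilding it is the usual fix
-- (given bug 7329252 above, prefer a plain rebuild over REBUILD ONLINE)
-- alter index OWNER.INDEX_NAME rebuild;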