Skip to content

Database - 63. page

ORA-00600: internal error code, arguments: [4193], [10837], [10841], [], [], [], [], []

大清早遭遇这个内部错误

Thu May 26 08:48:01 2011
Error 607 happened during db open, shutting down database
USER: terminating instance due to error 607
Thu May 26 08:48:01 2011
Errors in file d:oracleadmincqkfbdumpcqkf_smon_2904.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00607: Internal error occurred while making a change to a data block
ORA-00600: internal error code, arguments: [4193], [10837], [10841], [], [], [], [], []

600的4193错误是由于undo和redo块不一致造成的,一般伴随在2662之后

PURPOSE:           
  This article discusses the internal error “ORA-600 [4193]”, what
  it means and possible actions. The information here is only applicable
  to the versions listed and is provided only for guidance.

ERROR:             
  ORA-600 [4193] [a] [b]

VERSIONS:          
  versions 6.0 to 10.1

DESCRIPTION:       

  A mismatch has been detected between Redo records and Rollback (Undo)
  records.

  We are validating the Undo block sequence number in the undo block against
  the Redo block sequence number relating to the change being applied.

  This error is reported when this validation fails.

ARGUMENTS:
  Arg [a] Undo record seq number
  Arg [b] Redo record seq number

这里的解决办法很简单,重建表空间可以解决问题

1.设置undo为手动管理,然后设置参数_CORRUPTED_ROLLBACK_SEGMENTS=(_SYSSMU1$, _SYSSMU2$, _SYSSMU3$, _SYSSMU4$, _SYSSMU5$, _SYSSMU6$, _SYSSMU7$, _SYSSMU8$, _SYSSMU9$, _SYSSMU10$)

这些回滚段可以在日志中找到:

SMON: enabling cache recovery
Thu May 26 08:47:48 2011
Undo Segment 1 Onlined
Undo Segment 2 Onlined
Undo Segment 3 Onlined
Undo Segment 4 Onlined
Undo Segment 5 Onlined
Undo Segment 6 Onlined
Undo Segment 7 Onlined
Undo Segment 8 Onlined
Undo Segment 9 Onlined
Undo Segment 10 Onlined

………

然后启动数据库使用初始化参数文件,再重建回滚表空间,问题解决。

如果出现错误的回滚段存在与system表空间中,那么以上的方法无法解决,需要修改system回滚表空间的块信息才能够打开数据库,具体方法参考如下:
《使用bbed修改system undo segment修复4193》

ORA-00600[kddummy_blkchk][28][2756885][18019][][]

ORA-00600[kddummy_blkchk][28][2756885][18019][][]

 

 

 

—————————————————————-

解决这个故障的思路以及涉及知识点:

 

 1.db_block_checksum原理

 2.tablemove

 3.诊断事件

 4.回滚段管理

 5.rman and dbv

 

这个故障大概的情况是这样的,系统突然关闭(未知原因),再次startup后,稍微过几分钟smon进程自杀

 

等到有人通知时候,跟踪文件都膨胀到16m了,

 

大致内容:

 

Block recovery from logseq 45928, block 55 to scn 1767964025

Tue May 24 17:01:44 2011

Recovery of Online Redo Log: Thread 1 Group 6 Seq 45928 Reading mem 0

  Mem# 0: /oracle2/oracle/oradata/orcl/redo6.log

Block recovery completed at rba 45928.3128.16, scn 0.1767964026

Tue May 24 17:01:45 2011

Errors in file /oracle2/oracle/admin/orcl/bdump/orcl_smon_5751.trc:

ORA-00600: internal error code, arguments: [kddummy_blkchk], [28], [2756885], [18019], [], [], [], []

Tue May 24 17:02:26 2011

Doing block recovery for file 28 block 2756885

Block recovery from logseq 45928, block 55 to scn 1767964025

Tue May 24 17:02:26 2011

Recovery of Online Redo Log: Thread 1 Group 6 Seq 45928 Reading mem 0

  Mem# 0: /oracle2/oracle/oradata/orcl/redo6.log

Block recovery completed at rba 45928.3128.16, scn 0.1767964026

Tue May 24 17:02:26 2011

 

…………………………………….

 

在trc文件里就是smon不断尝试回滚的信息,但是多次尝试回滚失败后自杀

 

开始以为是bug,

经查阅有一个bug相似:

[ID 454909.1]

但是这里是

 

create table foo

      2  ( c1 number)

      3  tablespace default_users;

    create table foo

    *

    ERROR at line 1:

    ORA-00607: Internal error occurred while making a change to a data block

    ORA-00600: internal error code, arguments: [kddummy_blkchk], [36], [432394],

    [18019], [], [], [], []

 

而且出现的原因经oracle验证是9i升级到10203,而且是通过TTS升级方式在assm类型的表里才出现的

 

这里就失望了,这里版本是10204,可以把这个bug排除了

 

下来根据trc的信息看到smon一直在尝试不断回滚,那么就把回滚表空间设置为手动管理,并设置_corrupted_rollback_segments参数到_syssmu20$,追加到50也报错

 

重试后还是报这个错误。

 

然后尝试设置10513事件中止smon回滚,重试后依然报这个错误

 

这不是坏块,但是还是忍不住对28号文件做了一个DBV,但是检测并未发现坏块。

 

绝望之余,死马当活马医,尝试了blockrecover ,明明知道还是会报这个错。

 

然就就尝试了下面的办法:

这个办法来源blog

http://www.oracledatabase12g.com/message-board

通过设置DB_BLOCK_CHECKSUM为OFF来绕过,具体方法如下:

 

alter system set DB_BLOCK_CHECKSUM = OFF;

 

将corrupt block所在数据表,如你这里的NR_AIN_SMP_18_R 执行move操作 —- 报错的28是值得28号文件,2756885指的28号文件里的块号,根据块号通过dba_extents就可以找到对象了

 

alter table NARI.NR_AIN_SMP_18_R move;

 

之后重建该表上的所有索引

 

alter index XXX rebuild online tablespace ***;

 

问题解决。原来是块头的sum和记录的不符合,导致oracle在执行checksum的时候找不到对应的undo恢复而不断报这个块错误。

 

诊断事件:hang 诊断

2011 0428 /*luda*/
一. trc文件以及日志文件

在$ORACLE_BASE/admin/ORACLE_SID/下面有不同类别的跟踪文件
alert.log文件
system log文件 /var/adm/messages

二. hang situations
.what’s hang?
1.查看CPU的使用情况,作为参考指标,一般hang的cpu的使用率是不高的(至少oracle进程的cpu使用率很低)
2.查看进程是否在等待一些不存在的进程。
hang的诊断办法:

1.system/进程 诊断事件
2.从v$session_wait,v$lock,v$latch,v$latchholder
3.hanganalyze event

(1)
V$SESSION_WAITdisplays the resources or events for which active sessions are waiting.

在诊断数据库hang的时候,v$session_wait可以提供数据库级别有用的信息
从v$session_wait字段可以找到有用的信息

sid                  session id(v$session)
seq#                 sequeue number of the wait for this session
event                the event that the session is waiting for just finished for waiting for
wait_time            the time wait for the session
second_in_wait       the approximate time in seconds at the start of the wait state :
                                                                                  
                                                                                   0 – WAITING (the session is currently waiting)

                                                                                   -2 – WAITED UNKNOWN TIME (duration of last wait is unknown)

                                                                                   -1 – WAITED SHORT TIME (last wait <1/100th of a second)

                                                                                   >0 – WAITED KNOWN TIME (WAIT_TIME = duration of last wait)
示例:

SQL> set linesize 2000
SQL> select sid,seq#,event,wait_time,seconds_in_wait from v$session_wait;

       SID       SEQ# EVENT                                                             WAIT_TIME SECONDS_IN_WAIT
———- ———- —————————————————————- ———- —————
       138          1 jobq slave wait                                                           0              25
       149         68 Streams AQ: waiting for time management or cleanup tasks                  0          575891
       150         35 Streams AQ: qmn slave idle wait                                           0           65739
       151          6 Streams AQ: qmn coordinator idle wait                                     0         1863190
       155      43903 rdbms ipc message                                                         0              57
       156      51563 rdbms ipc message                                                         0             208
       159         31 SQL*Net message to client                                                -1               0
       160         10 rdbms ipc message                                                         0          634929
       161       5670 rdbms ipc message                                                         0            1552
       162      59394 rdbms ipc message                                                         0              26
       163         11 rdbms ipc message                                                         0          230558

       SID       SEQ# EVENT                                                             WAIT_TIME SECONDS_IN_WAIT
———- ———- —————————————————————- ———- —————
       164       5051 smon timer                                                                0            3419
       165      62808 rdbms ipc message                                                         0               0
       166      14978 rdbms ipc message                                                         0              21
       167      23334 rdbms ipc message                                                         0              21
       168         20 rdbms ipc message                                                         0           59201
       169      58551 rdbms ipc message                                                         0              24
       170          8 pmon timer                                                                0         1863202

(2)hanganaylyze event
hang诊断事件一般用在数据库hang住或者死锁的情况下
SQL> alter session set events ‘immediate trace name hanganalyze level 4’;

Session altered.

SQL> @gettrc

TRACE_FILE_NAME
—————————————————
/oracle/admin/znjtepp/udump/znjtepp_ora_18886.trc

马上就可以找到跟踪文件在udump目录下的 18886的文件
  –用hanganalyze 监听oracle – 60错误

    在init文件中加入
    event=“60 trace name hanganalyze level 5”
  –oracle debug的使用
  这里handanalyze有6个级别:
  10   全部的进程信息导出,伴随大量的数据,这个一般不被采用
  5    导出全部和等待事件有关的进程
  4    导出等待事件的分支节点
  3    导出被认为是hang的进程
  2    低限度的导出信息
  1    只导出少部分数据
 
loop的诊断办法

系统的诊断dump是非常有帮助于诊断oracle的loop
操作示范:
(1)系统状态转储
SQL> alter session set events ‘immediate trace name systemstate level 10’;

Session altered.
SQL> @gettrc

TRACE_FILE_NAME
——————————————————————————–
/oracle/admin/znjtepp/udump/znjtepp_ora_28508.trc

(2)进程状态转储
SQL> alter session set events ‘immediate trace name processstate level 10’;

Session altered.

SQL> @gettrc

TRACE_FILE_NAME
——————————————————————————–
/oracle/admin/znjtepp/udump/znjtepp_ora_28508.trc

或者

oradebug  setospid <pid>
oradebug dump systemstate 10
下来就是分析trc文件了,当然为了工作便捷,ass.awk可以快速列出当前的等待事件

诊断事件:10513 禁止smon回滚

禁用smon进行tx recovery(所谓tx recovery就是open后数据文件包含提交和未提交数据,数据不一致),不会造成数据库不一致,虽然禁用了smon自动恢复,但是当查询的时候还是会进行回滚从undo中读取 回滚数据(
等同于用到哪个对象回滚哪个对象,这种方式会带来压力,且若undo损坏就十分麻烦了,那么这将是另一个恢复问题select segment_name,status,tablespace_name from dba_rollback_segs看那个段损坏
使用隐藏参数_offline_rollback_segments 标记,然后drop rollback segment ‘xxx’ ,此时才会造成真正的数据不一致)

实验证明:
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area  943718400 bytes
Fixed Size                  2025400 bytes
Variable Size             247466056 bytes
Database Buffers          692060160 bytes
Redo Buffers                2166784 bytes
Database mounted.
Database opened.
SQL> show parameter event

NAME                                 TYPE        VALUE
———————————— ———– ——————————
event                                string
SQL> conn luda/luda
Connected.
SQL> drop table t1;
create table t1 (id number,id2 number);

declare
begin
for i in 1..100000 loop
insert into t1 values(i,i+1);
end loop;
end;
/
Table dropped.

SQL>
Table created.

SQL> SQL>   2    3    4    5    6    7 

PL/SQL procedure successfully completed.

SQL> conn / as sysdba
Connected.
SQL> shutdown abort
ORACLE instance shut down.
SQL> startup mount
ORACLE instance started.

Total System Global Area  943718400 bytes
Fixed Size                  2025400 bytes
Variable Size             251660360 bytes
Database Buffers          687865856 bytes
Redo Buffers                2166784 bytes
Database mounted.
SQL> alter system set events ‘10513 trace name context forever,level 2′;

System altered.

SQL> alter database open;

Database altered.

SQL> select usn,UNDOBLOCKSDONE,UNDOBLOCKSDONE from v$fast_start_transactions;

no rows selected

SQL> show parameter event

NAME                                 TYPE        VALUE
———————————— ———– ——————————
event                                string
SQL> alter session set events’10046 trace name context forever ,level 12’;

Session altered.

SQL> select * from t1;
select * from t1
              *
ERROR at line 1:
ORA-00942: table or view does not exist
SQL> select * from luda.t1;
SQL> alter system set events ‘10046 trace name context off’;

System altered.

在系统的10046的跟踪文件头部记录着:
“znjtepp_ora_26610.trc” 23617 lines, 2172756 characters
/oracle/admin/znjtepp/udump/znjtepp_ora_26610.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 – 64bit Production
With the Partitioning, OLAP and Data Mining options
ORACLE_HOME = /oracle/product/10.2/db_1
System name:    AIX
Node name:      aix53
Release:        3
Version:        5
Machine:        0009AFDA4C00
Instance name: znjtepp
Redo thread mounted by this instance: 1
Oracle process number: 15
Unix process pid: 26610, image: oracle@aix53 (TNS V1-V3)

*** 2011-05-25 11:46:13.172
*** SERVICE NAME:() 2011-05-25 11:46:13.163
*** SESSION ID:(159.3) 2011-05-25 11:46:13.163
Thread 1 checkpoint: logseq 39, block 8118, scn 3255055
  cache-low rba: logseq 39, block 8119
    on-disk rba: logseq 39, block 59764, scn 3255568
  start recovery at logseq 39, block 8119, scn 0
—– Redo read statistics for thread 1 —–
Read rate (ASYNC): 25823Kb in 0.28s => 90.06 Mb/sec
Total physical reads: 26624Kb
Longest record: 20Kb, moves: 0/101746 (0%)
Change moves: 0/2 (0%), moved: 0Mb
Longest LWN: 683Kb, moves: 25/94 (26%), moved: 16Mb
Last redo scn: 0x0000.0031ad12 (3255570)
———————————————-
—– Recovery Hash Table Statistics ———
Hash table buckets = 32768
Longest hash chain = 2
Average hash chain = 1251/1246 = 1.0
Max compares per lookup = 2
Avg compares per lookup = 202468/203698 = 1.0
———————————————-
*** 2011-05-25 11:46:13.452
KCRA: start recovery claims for 1251 data blocks
*** 2011-05-25 11:46:13.830
KCRA: blocks processed = 1251/1251, claimed = 1251, eliminated = 0
*** 2011-05-25 11:46:13.831
Recovery of Online Redo Log: Thread 1 Group 2 Seq 39 Reading mem 0
—– Recovery Hash Table Statistics ———
Hash table buckets = 32768
Longest hash chain = 2
…..略

证明在扫描T1表时候还是会使用undo的段进行回滚。

关于10513的设置以及oracle的解释

查询smon进程号:
luda_dba:/home/oracle$ sqlplus ‘/as sysdba’

SQL*Plus: Release 10.2.0.1.0 – Production on Wed May 25 10:28:34 2011

Copyright (c) 1982, 2005, Oracle.  All rights reserved.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 – 64bit Production
With the Partitioning, OLAP and Data Mining options

SQL> select pid,program from v$process where program like ‘%SMON%’;

       PID PROGRAM
———- ————————————————
         8 oracle@aix53 (SMON)

SQL> oradebug setorapid 8
Unix process pid: 22736, image: oracle@aix53 (SMON)
SQL> oradebug event 10513 trace name context forever,level 2
Statement processed.
SQL>

Oracle官方对于event 10513解释,该文件存放在
$ORACLE_HOJE/rdbms/mesg/oraus.msg文件中。

引用
10513, 00000, “turn off wrap source compression”
// *Cause:
// *Action: Set this event if you do not want source of wrapped PL/SQL
//          objects to be concatenated and stored multiple lines to a row.
// *Comment: This event should be set if you want the source of wrapped
//           PL/SQL objects to be stored one line per source row. The
//           new storage method is multiple lines per row, which
//           improves load and compile performance.