
Instance tuning

Hitting ORA-00600 [ktbair1] together with [25012] and [2662]

This error is obscure and rather puzzling. MOS does have an article about it, but it only names a bug and the affected versions.
Another document says it appears together with ORA-600 [6101], as logical corruption of an index, but that does not match what actually happened here.

My environment is 9.2.0.8 on Sun SPARC 64-bit. Since September the error has occurred three times, accompanied mainly by ORA-600 [2662] and [25012]. At first I suspected the two tables and their indexes referenced by the failing SQL in the trace file: these objects had never been defragmented, and one table plus its index had grown past 300 GB. Some time after the first two incidents I purged that table down to about 500 MB and rebuilt both the table and its indexes, yet three weeks later the same problem recurred, and MOS has no matching symptom or bug. The sequence is always the same: first ORA-600 [ktbair1] is raised, then the undo data and the blocks in memory become inconsistent and ORA-600 [2662] and [25012] follow; after a restart, SMON completes the repair on its own. RMAN checks of the database for both physical and logical corruption found nothing.

The errors reported:

ORA-600 errors reported in the alert log:

Wed Dec 17 12:58:10 2014
Errors in file /oracle/admin/boss/udump/boss_ora_19696.trc:
ORA-00600: internal error code, arguments: [ktbair1], [0], [6], [], [], [], [], []
Wed Dec 17 12:58:11 2014
Errors in file /oracle/admin/boss/udump/boss_ora_19696.trc:
ORA-07445: exception encountered: core dump [0000000100638F9C] [SIGBUS] [Invalid address alignment] [0x203C2290D2BFFEF] [] []
ORA-00600: internal error code, arguments: [ktbair1], [0], [6], [], [], [], [], []
Wed Dec 17 12:59:01 2014
Errors in file /oracle/admin/boss/udump/boss_ora_6438.trc:
ORA-00600: internal error code, arguments: [2662], [281], [2416071487], [513], [370935493], [3305242969], [], []
ORA-08103: object no longer exists
Wed Dec 17 12:59:01 2014
Errors in file /oracle/admin/boss/udump/boss_ora_6438.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-00600: internal error code, arguments: [2662], [281], [2416071487], [513], [370935493], [3305242969], [], []
ORA-08103: object no longer exists
Wed Dec 17 12:59:09 2014
Errors in file /oracle/admin/boss/udump/boss_ora_17808.trc:
ORA-00600: internal error code, arguments: [25012], [12], [606], [], [], [], [], []
.
.
Wed Dec 17 13:31:59 2014
Errors in file /oracle/admin/boss/udump/boss_ora_27048.trc:
ORA-00600: internal error code, arguments: [2662], [281], [2422679772], [513], [370935493], [3305242969], [], []
Wed Dec 17 13:36:55 2014
Shutting down instance: further logons disabled
Shutting down instance (immediate)
License high water mark = 648
Wed Dec 17 13:41:56 2014
Shutting down instance (abort)
License high water mark = 648
Instance terminated by USER, pid = 11139
Wed Dec 17 13:43:20 2014
Starting ORACLE instance (force)
.
.
Completed: ALTER DATABASE MOUNT
Wed Dec 17 13:43:57 2014
ALTER DATABASE OPEN
Wed Dec 17 13:43:57 2014
Beginning crash recovery of 1 threads
Wed Dec 17 13:43:57 2014
Started redo scan
Wed Dec 17 13:43:57 2014
Completed redo scan
788 redo blocks read, 64 data blocks need recovery
Wed Dec 17 13:43:57 2014
Started recovery at
Thread 1: logseq 323774, block 803834, scn 0.0
Wed Dec 17 13:43:57 2014
Recovery of Online Redo Log: Thread 1 Group 6 Seq 323774 Reading mem 0
Mem# 0 errs 0: /tzboss_sysdata/boss/redo06.log
Wed Dec 17 13:43:57 2014
Completed redo application
Wed Dec 17 13:43:57 2014
Ended recovery at
Thread 1: logseq 323774, block 804622, scn 281.2424682375
64 data blocks read, 64 data blocks written, 788 redo blocks read
Crash recovery completed successfully

From trace file boss_ora_19696.trc:

BH (0x58bfede00) file#: 64 rdba: 0x101a691a (64/1730842) class 1 ba: 0x58bd7c000
set: 80 dbwrid: 7 obj: 672167 objn: 66172
hash: [551f8af00,a67d5b010] lru: [52bfdfe68,5a4fcd768]
ckptq: [NULL] fileq: [NULL]
use: [a683b8eb8,a683b8eb8] wait: [NULL]
st: CR md: EXCL rsop: 0x0 tch: 0
cr:[[scn: 0x0119.8e356168],[xid: 0x000a.02c.008b183e],[uba: 0x2dc978b1.2a93.04],[cls: 0x0119.900211c1],[sfl: 0x1]]
flags: buffer_dirty mod_started only_sequential_access
change state: ACTIVE
buffer tsn: 15 rdba: 0x0203c229 (8/246313)
scn: 0x06c5.072c0007 seq: 0x02 flg: 0x03 tail: 0x128c0600
frmt: 0x04 chkval: 0x1f46 type: 0xc1=unknown
Hex dump of corrupt header 4 = CORRUPT

Objects and files involved:

SQL> col object_name format a30
SQL> select owner,object_name,object_type,subobject_name,status,temporary
  2  from dba_objects where data_object_id=672167;

no rows selected

SQL> select file#,rfile#,name,ts# from v$datafile where rfile# in (64,8);

     FILE#     RFILE# NAME                                                      TS#
---------- ---------- -------------------------------------------------- ----------
         8          8 /tzboss_oradata/boss/ms_01.dbf                              8
        64         64 /tzboss_oradata/boss/bm_02.dbf                             15
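The rdba values printed in the trace buffer header can also be decoded directly: a data block address packs the relative file# into the top 10 bits and the block# into the low 22 bits, and dbms_utility exposes that split. A minimal cross-check sketch using the two hex values from the trace above:

```sql
-- file# = dba >> 22, block# = dba AND 0x3FFFFF; dbms_utility does the math.
-- 0x101a691a was reported as (64/1730842), 0x0203c229 as (8/246313).
select dbms_utility.data_block_address_file(to_number('101a691a','xxxxxxxx'))  file#,
       dbms_utility.data_block_address_block(to_number('101a691a','xxxxxxxx')) block#
from   dual;

select dbms_utility.data_block_address_file(to_number('0203c229','xxxxxxxx'))  file#,
       dbms_utility.data_block_address_block(to_number('0203c229','xxxxxxxx')) block#
from   dual;
```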

Logical corruption validation results:

Recovery Manager: Release 9.2.0.8.0 - 64bit Production

Copyright (c) 1995, 2002, Oracle Corporation. All rights reserved.

connected to target database: BOSS (DBID=1157181921)

RMAN> run {
2> allocate channel d1 type disk;
3> allocate channel d2 type disk;
4> allocate channel d3 type disk;
5> allocate channel d4 type disk;
6> backup validate check logical database;
7> }
8>
using target database controlfile instead of recovery catalog
allocated channel: d1
channel d1: sid=240 devtype=DISK

allocated channel: d2
channel d2: sid=21 devtype=DISK

allocated channel: d3
channel d3: sid=277 devtype=DISK

allocated channel: d4
channel d4: sid=179 devtype=DISK

Starting backup at 05-NOV-14
channel d1: starting full datafile backupset
channel d1: specifying datafile(s) in backupset
input datafile fno=00009 name=/tzboss_oradata/boss/cs_01.dbf
input datafile fno=00028 name=/tzboss_index/boss/cs_idx_01.dbf
input datafile fno=00183 name=/tzboss_sysdata/boss/undotbs1_05.dbf
....
input datafile fno=00226 name=/tzboss_oradata/boss/ei_05.dbf
input datafile fno=00231 name=/tzboss_oradata/boss/tr_03.dbf
input datafile fno=00243 name=/tzboss_oradata/boss/bp_18.dbf
channel d2: backup set complete, elapsed time: 00:02:36
channel d2: starting full datafile backupset
channel d2: specifying datafile(s) in backupset
input datafile fno=00251 name=/tzboss_oradata/boss/pm_04.dbf
input datafile fno=00252 name=/tzboss_oradata/boss/ei_06.dbf
channel d4: backup set complete, elapsed time: 00:02:56
channel d1: backup set complete, elapsed time: 00:02:16
channel d2: backup set complete, elapsed time: 00:01:40
channel d3: backup set complete, elapsed time: 00:02:40
Finished backup at 06-NOV-14
released channel: d1
released channel: d2
released channel: d3
released channel: d4

Recovery Manager complete.

SQL> select count(*) from v$database_block_corruption;

COUNT(*)
----------
0
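Even with v$database_block_corruption empty, the file#/block# pairs from the trace can still be mapped back to whatever segment owns them now; a sketch using the values reported above (note that full dba_extents scans can be slow on a large database):

```sql
-- Which segment currently owns file 64, block 1730842 (from the trace)?
select owner, segment_name, segment_type, tablespace_name
from   dba_extents
where  file_id = 64
and    1730842 between block_id and block_id + blocks - 1;
```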

Diagnosis from the Oracle Severity 1 SR:

-- Search --

Note: This is INTERNAL ONLY research. No action should be taken by the customer on this information. This is research only, and may NOT be applicable to your specific situation. Searched on: Bug 6401576 : ORA-600 [KTBAIR1] / ORA-600 [KCBZPB_1] / CORRUPTION MESSAGES --> DB CRASH

However, ORA-600 [6101] has never appeared on this database, so by the analysis below Bug 6401576 does not match our situation. On the other hand, that bug can also surface as a ktbair1 error, and the overall flow is the same: ORA-600 [KTBAIR1] / ORA-600 [KCBZPB_1] / corruption messages --> DB crash. Further verification is under way. The current plan is to move the objects owning the DBAs reported in the errors to another tablespace and see whether the error can be avoided; if it persists, we will apply the patch for 6401576 on this release and keep observing. The fight with this error is not over yet.

Description of Bug 6401576 : ORA-600 [KTBAIR1] / ORA-600 [KCBZPB_1] / CORRUPTION MESSAGES --> DB CRASH:

Hdr: 6401576 9.2.0.8.0 RDBMS 9.2.0.8 RAM INDEX PRODID-5 PORTID-59 ORA-600
Abstract: ORA-600 [KTBAIR1] / ORA-600 [KCBZPB_1] / CORRUPTION MESSAGES --> DB CRASH

*** 01/16/08 08:10 am *** (ADD: Impact/Symptom->DATA CORRUPTION )
*** 01/16/08 08:11 am *** (ADD: Impact/Symptom->MEMORY CORRUPTION )
Rediscovery Information: If ORA-600 [6101] is encountered during the compaction
of a compressed index block, then it is this bug. There can also be other
internal errors such as ktbair1, kcbzpb_1 and 4519.

Workaround: none
Release Notes:
]] Internal errors could occur during the compaction of a compressed
]] index block.

The root cause of enq: TX - allocate ITL entry waits in Oracle 10g and later, and how to approach them

Yesterday a customer's core system had hung processes, with enq: TX - allocate ITL entry as the top wait event, which points to ITL contention. I have run plenty of tests on this in the past without ever writing them up, so here is a consolidated write-up to use as a diagnostic reference. It has three parts:

1. What the ITL is and how it works
2. Test: ITL contention when a data block cannot extend and allocate ITL slots (initrans)
3. Identifying ITL slot contention from recursive transactions in index blocks

 

1. How the ITL works

The ITL (Interested Transaction List) is a structure inside every Oracle data block that records the transactions that have touched the block. Each ITL entry can be thought of as a record: at any moment it describes one transaction, committed or not. Once that transaction has committed, the entry can be reused, and because it behaves like a record slot it is usually called an ITL slot. Every data block has one or more transaction slots, and every transaction concurrently modifying the block occupies one. For tables and indexes, INI_TRANS defaults to 1 and MAX_TRANS to 255; since Oracle 10g MAX_TRANS is ignored and effectively cannot be changed. As long as a transaction stays uncommitted it keeps its ITL slot, which holds the transaction information, the pointer to its undo, the transaction type and so on. Once the transaction commits, the slot retains the commit SCN. Dumping a block shows the ITL entries:

Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0006.002.0000158e  0x0080104d.00a1.6e  --U-  734  fsc 0x0000.6c9deff0
0x02   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000

On a highly concurrent system it used to pay (before 10g) to allocate enough ITL entries up front, since they waste little space, or to set PCTFREE high enough for the ITL to grow. But PCTFREE can be consumed by row data, for example an UPDATE may fill the remaining block space, so a block can still run out of room for new ITL entries and cause ITL waits. In practice, from 10g onwards ITL waits on data blocks are usually caused by lack of free space in the block, not by the number of transaction slots. Normally a 2 KB block can hold at most 41 ITL entries, a 4 KB block 83, and an 8 KB block 169 (at 24 bytes per ITL entry). Insufficient INITRANS is not an issue for index blocks: when there is no space left to allocate an ITL slot, the block, whether branch or leaf, simply splits (index block split).
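The settings involved can be read straight from the dictionary before deciding on a fix; a minimal sketch, with the owner and table name as placeholders:

```sql
-- Current ITL headroom for a table and its indexes (SCOTT/LUDA are placeholders).
select table_name, ini_trans, max_trans, pct_free
from   dba_tables
where  owner = 'SCOTT' and table_name = 'LUDA';

select index_name, ini_trans, max_trans, pct_free
from   dba_indexes
where  owner = 'SCOTT' and table_name = 'LUDA';
```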

There is one special case that also causes ITL waits: contention on the recursive-transaction ITL slot in indexes. A branch block has exactly one ITL slot, reserved for the recursive transaction (Recursive Transaction) that performs a node split. In a leaf block the first ITL slot is likewise reserved for split recursive transactions. Within one user transaction, each split is controlled by its own recursive transaction, except that when a lower-level split forces its parent node to split as well, both splits run under the same recursive transaction. When two transactions need to split the same branch or leaf block at the same time, or two children of one branch block are being split by two different transactions, this kind of ITL wait occurs.
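When such waits are happening live, the waiters and their blockers can be picked out of the dynamic views; a sketch, assuming a 10g+ v$session (which has the blocking_session column):

```sql
-- Who is stuck waiting for an ITL slot, and on which object?
select sid, serial#, blocking_session, row_wait_obj#, row_wait_file#, row_wait_block#
from   v$session
where  event = 'enq: TX - allocate ITL entry';

-- ITL waiters request the TX enqueue in shared mode (request = 4).
select sid, id1, id2, lmode, request
from   v$lock
where  type = 'TX' and request > 0;
```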

 

2. Test: ITL contention when a data block cannot extend and allocate ITL slots

Let's test what happens when a data block has no room left for ITL slot growth. We create a table luda with PCTFREE 0 and INITRANS 1, fill its blocks completely, then update rows in a full block to reproduce the ITL wait.
Create the table luda with PCTFREE 0 and INITRANS 1:

SQL> create table luda(a int) pctfree 0 initrans 1;

Table created.

 

Insert a large amount of data:

SQL> begin
2 for i in 1..20000 loop
3 insert into luda values(i);
4 end loop;
5 end;
6 /

PL/SQL procedure successfully completed.

SQL> commit ;

Commit complete.

 

SQL> select f,b,count(*) from (select dbms_rowid.rowid_relative_fno(rowid) f,dbms_rowid.rowid_block_number(rowid) b
2 from luda) group by f,b order by 3;

F B COUNT(*)
---------- ---------- ----------
1 61101 200
1 61093 734
1 61089 734
1 61095 734
1 61085 734
1 61099 734
1 61074 734
1 61077 734
1 61080 734
1 61092 734
1 61100 734
1 61083 734
1 61091 734
1 61097 734
1 61098 734
1 61075 734
1 61076 734
1 61078 734
1 61081 734
1 61084 734
1 61087 734
1 61096 734
1 61079 734
1 61094 734
1 61088 734
1 61090 734
1 61082 734
1 61086 734

 

After the insert we can see the table's rows spread over the blocks listed above, every one of them completely full except block 61101.
Next, dump one of the full blocks, 61074:

SQL> alter system dump datafile 1 block 61074;

System altered.

 

Block header dump: 0x0040ee92
Object id on Block? Y
seg/obj: 0xcb0a csc: 0x00.bb6a1 itc: 2 flg: - typ: 1 - DATA
fsl: 0 fnx: 0x0 ver: 0x01

Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0005.020.0000013b 0x0080078c.013c.3a --U- 734 fsc 0x0000.000bb765
0x02 0x0000.000.00000000 0x00000000.0000.00 ---- 0 fsc 0x0000.00000000
-- Note that even with INITRANS 1 the block has two transaction slots by default (itc: 2).
data_block_dump,data header at 0xd7fe45c
===============
tsiz: 0x1fa0
hsiz: 0x5ce
pbl: 0x0d7fe45c
bdba: 0x0040ee92
76543210
flag=--------
ntab=1
nrow=734
frre=-1
fsbo=0x5ce
fseo=0xbf8

avsp=0x4
tosp=0x4
0xe:pti[0] nrow=734 offs=0

 

Now test slot allocation with the block full. From the earlier query we know a block holds 734 rows (also visible as nrow=734 in the dump), so in this step three sessions update rows 100, 200 and 300 of block 61074 respectively.

Session 1 updates row 100:

SQL> update luda set a=a
2 where dbms_rowid.ROWID_BLOCK_NUMBER(rowid)=61074
3 and dbms_rowid.ROWID_ROW_NUMBER(rowid)=100;

1 row updated.

 

Session 2 updates row 200:

SQL> update luda set a=a
2 where dbms_rowid.ROWID_BLOCK_NUMBER(rowid)=61074
3 and dbms_rowid.ROWID_ROW_NUMBER(rowid)=200;

1 row updated.

 

Session 3 (SID 158) updates row 300 and hangs while executing:

SQL> select sid from v$mystat where rownum=1;

SID
----------
158

SQL> update luda set a=a
2 where dbms_rowid.ROWID_BLOCK_NUMBER(rowid)=61074
3 and dbms_rowid.ROWID_ROW_NUMBER(rowid)=300;
-- the session hangs at this point
 

alter system dump datafile 1 block 61074;

 

Block header dump: 0x0040ee92
Object id on Block? Y
seg/obj: 0xcb0a csc: 0x00.bb97e itc: 2 flg: - typ: 1 - DATA
fsl: 0 fnx: 0x0 ver: 0x01

Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0003.014.00000159 0x008006bb.01f1.12 ---- 1 fsc 0x0000.00000000
0x02 0x0005.00e.0000013c 0x0080083c.014b.20 ---- 1 fsc 0x0000.00000000
-- The dump now shows that both transaction slots, previously free, are occupied by uncommitted transactions.
data_block_dump,data header at 0xd77645c
===============
tsiz: 0x1fa0
hsiz: 0x5ce
pbl: 0x0d77645c
bdba: 0x0040ee92
76543210
flag=--------
ntab=1
nrow=734
frre=-1
fsbo=0x5ce
fseo=0xbf8
avsp=0x4
tosp=0x4
0xe:pti[0] nrow=734 offs=0

 

 
-- SID 158's wait event is the ITL-related enq: TX - allocate ITL entry
SQL> select sid,event from v$session where sid=158;

SID EVENT
---------- ----------------------------------------------------------------
158 enq: TX - allocate ITL entry

This confirms that when a block has insufficient free space, an ITL slot cannot be allocated and the enq: TX - allocate ITL entry wait appears.
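For a table that keeps hitting this, the usual remedy is to rebuild it with more headroom; a sketch against the test table (the values are illustrative, luda_idx is a placeholder name, and an ALTER alone only affects newly formatted blocks, so existing blocks need a MOVE, after which indexes must be rebuilt):

```sql
-- Raise ITL headroom; existing blocks only pick this up after a reorganization.
alter table luda initrans 10 pctfree 20;
alter table luda move;
-- any index on the table is left UNUSABLE by the move and must be rebuilt, e.g.:
-- alter index luda_idx rebuild initrans 10;
```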

Next, test a block that is not full. Of the 26 populated blocks of luda, block 61101 holds only 200 rows, using about 30% of the block. Four sessions update rows 10, 20, 30 and 40 of block 61101:

session 1:

 

SQL> update luda set a=a
2 where dbms_rowid.ROWID_BLOCK_NUMBER(rowid)=61101
3 and dbms_rowid.ROWID_ROW_NUMBER(rowid)=10;

1 row updated.

 

 

session2:

 

SQL> update luda set a=a
2 where dbms_rowid.ROWID_BLOCK_NUMBER(rowid)=61101
3 and dbms_rowid.ROWID_ROW_NUMBER(rowid)=20;

1 row updated.

 

session3:

 

SQL> update luda set a=a
2 where dbms_rowid.ROWID_BLOCK_NUMBER(rowid)=61101
3 and dbms_rowid.ROWID_ROW_NUMBER(rowid)=30;

1 row updated.

 

session4:

SQL> update luda set a=a
2 where dbms_rowid.ROWID_BLOCK_NUMBER(rowid)=61101
3 and dbms_rowid.ROWID_ROW_NUMBER(rowid)=40;

1 row updated.

 

None of the four sessions is blocked. Dumping block 61101 shows four ITL slots, two of them added automatically, confirming that when the block has enough free space, ITL slots are extended on demand.

SQL> alter system dump datafile 1 block 61101;

System altered.

 

--

Block header dump: 0x0040eead
Object id on Block? Y
seg/obj: 0xcb0a csc: 0x00.bc003 itc: 4 flg: O typ: 1 - DATA
fsl: 0 fnx: 0x0 ver: 0x01

Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x0003.014.00000159 0x008006bb.01f1.13 ---- 1 fsc 0x0000.00000000
0x02 0x0009.016.0000013b 0x0080075c.0143.37 ---- 1 fsc 0x0000.00000000
0x03 0x0005.00e.0000013c 0x0080083c.014b.21 ---- 1 fsc 0x0000.00000000
0x04 0x0002.026.00000139 0x00800130.00fc.09 ---- 1 fsc 0x0000.00000000
-- itc: 4; block 61101 now carries four uncommitted transactions, one for each of sessions 1-4 above.

data_block_dump,data header at 0xe17048c
===============
tsiz: 0x1f70
hsiz: 0x1a2
pbl: 0x0e17048c
bdba: 0x0040eead
76543210
flag=--------
ntab=1
nrow=200
frre=-1
fsbo=0x1a2
fseo=0x1957
avsp=0x16c6
tosp=0x16c6
0xe:pti[0] nrow=200 offs=0

 

These tests show that since Oracle 10g, ITL contention on data blocks is mainly governed by free space in the block: when many transactions operate on a block but it cannot allocate more transaction slots, ITL contention arises.
The next section distinguishes contention on index blocks from contention on table data blocks. http://www.ludatou.com/?p=1920
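As a head start on that distinction, segment-level statistics already separate the two; a sketch, assuming STATISTICS_LEVEL is at least TYPICAL on 10g:

```sql
-- Segments with ITL waits since startup; OBJECT_TYPE tells index vs table blocks.
select owner, object_name, object_type, value as itl_waits
from   v$segment_statistics
where  statistic_name = 'ITL waits' and value > 0
order  by value desc;
```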