Skip to content

关于光纤链路的排障

碰到过不少因为光纤链路的问题导致oracle rac集群故障,所以把一些这方面的排障思路方向汇一下,任何做过这方面排障的工程师都清楚这是一个比较繁冗的过程,因此知道从什么地方入手寻找故障非常重要。这里给出了一些最常见的光纤故障以及产生这些故障的可能因素,这些信息将有助于用户对网络故障进行有根据的猜测。
    
1.光纤断裂通常是由于外力物理挤压或过度弯折;
2.传输功率不足;
3.光纤铺设距离过长可能造成信号丢失;
4.连接器受损可能造成信号丢失;
5.光纤接头和连接器(connectors)故障可能造成信号丢失;
6.使用过多的光纤接头和连接器可能造成信号丢失;
7.光纤配线盘(patch panel)或熔接盘(splice tra)连接处故障。
8.通常而言,如果连接完全不通,那么很可能是光纤断裂。但如果连接时断时续,可能有以下原因:
9.结合处制作水平低劣或结合次数过多造成光纤衰减严重;
10.由于灰尘、指纹、擦伤、湿度等因素损伤了连接器;
11.传输功率过低;
12.在配线间连接器错误。

一般在碰到集群相关心跳出问题时候,在诊断光纤方面,应该尽量了解客户现场的案发时候的情形,比如是否动过光纤线?热拔插过否?有没有更换或者挪动等,也尽量去把每一方面涉及到的处理摸清技术手段搞清楚,以便在排障时候提升效率

有关Oracle database 11g大面积Client报错Ora-28001错误

这两年oracle 11g普及非常迅速,很多用户在近半年已经将oracle db的版本升级到11g,而且新上线的系统一般都采用了11g的数据库.这与最近一段时间遭遇到大量用户28001的错误有较大关系.

ORA-28001: the password has expired

原因是在于oracle database 11g后oracle新建用户默认的default profile的password_life_time为180天,具体可以查询如下:


SQL> select * from dba_profiles where profile='DEFAULT';

PROFILE RESOURCE_NAME RESOURCE LIMIT
------------------------------ -------------------------------- -------- ----------------------------------------
DEFAULT COMPOSITE_LIMIT KERNEL UNLIMITED
DEFAULT SESSIONS_PER_USER KERNEL UNLIMITED
DEFAULT CPU_PER_SESSION KERNEL UNLIMITED
DEFAULT CPU_PER_CALL KERNEL UNLIMITED
DEFAULT LOGICAL_READS_PER_SESSION KERNEL UNLIMITED
DEFAULT LOGICAL_READS_PER_CALL KERNEL UNLIMITED
DEFAULT IDLE_TIME KERNEL UNLIMITED
DEFAULT CONNECT_TIME KERNEL UNLIMITED
DEFAULT PRIVATE_SGA KERNEL UNLIMITED
DEFAULT FAILED_LOGIN_ATTEMPTS PASSWORD 10
DEFAULT PASSWORD_LIFE_TIME PASSWORD 180
DEFAULT PASSWORD_REUSE_TIME PASSWORD UNLIMITED
DEFAULT PASSWORD_REUSE_MAX PASSWORD UNLIMITED
DEFAULT PASSWORD_VERIFY_FUNCTION PASSWORD NULL
DEFAULT PASSWORD_LOCK_TIME PASSWORD UNLIMITED
DEFAULT PASSWORD_GRACE_TIME PASSWORD UNLIMITED

只需要执行如下命令即可避免类似错误


alter profile default limit password_life_time unlimited;

恢复被删除(损坏、丢失)的goldengate trailfile

恢复被删除(损坏、丢失)的goldengate trailfile

1.1恢复目标被删除(损坏、丢失)的trailfile

原理:根据目标端trailfile最后应用的scn来去源端找到准确位置后重传数据

测试方法:手动删除目标端的最后一个trail file后进行测试(并不是完全恢复这个文件,只是将删除文件中未应用的数据找出,产生包含丢失数据的文件而已)
关于trailfile存在源、目标方式:
1、源端、目标拥有独立的trailfile.extract抽取后经过datapump传输到目标端
2、源端、目标共享trailfile.extract抽取被replicat进程直接读取.这种不需要配置datapump,这种情形使用不如第一种情形多.

测试场景:
采用第一种,源端和目标端都是拥有独立的trailfile.是否想过源、目的trailfile文件大小或文件个数能否对应上,恢复直接从源端复制过来就可以使用?

具体如下.
Ex开头是源端,dp是目标端.初步对比来看个数以及文件大小都存在差异,初步设想不成功.

操作步骤:
1、先停止目标端的replicat进程.
2、源端:对一个空表插入100条数据.然后手动rename目标端未应用的trailfile一个.
验证源端数据:

SQL> select count(*) from leo_rows;
COUNT(*)
----------
100

验证目标端数据:

SQL> select count(*) from leo_rows;
COUNT(*)
----------
0

通过这个信息对比,发现2边数据不同步.
3、rename目标端未应用的trailfile。例如dp000014变成dp000014.old
mv dp000014 dp000014.old
4、根据目标端分析上次成功checkpoint信息位置找出目标端实际应用位置.
从checkpoint信息中查出实际应用到位置是extseqno=14、extrba=8751283.
但是此时dp000014文件已经丢失(只是rename而已).10g以后版本trailfile记录csn信息,不需要通过@genenv(“ORATRANSACTION”,”SCN”)强制写个token到trailfile.SCN是指Oracle transaction id的对应数值.可以通过Oracle v$transaction视图查询.查看进程的checkpoint里面记录了

CSN stateinformation:
CRC: 55-A2-8E-E4
Latest CSN: 1963641—这个就是对应Oraclescn.
Latest TXN: 10.23.846 –这个是transactionid
Latest CSN of finished TXNs: 1963641:这里面提示已经完成.不如此事务是不完整的,需要从重新同步这个事物.
Completed TXNs: 10.23.846
GGSCI (ludatou) 4> inforepleo showch
REPLICAT REPLEO Last Started 2014-04-22 20:27 Status STOPPED
Checkpoint Lag 00:00:00 (updated 00:09:17 ago)
Log Read Checkpoint File ./dirdat/dp000014
2014-04-22 20:32:05.801351 RBA 8751283
Current Checkpoint Detail:
Read Checkpoint #1
GGS Log Trail
Startup Checkpoint (starting position in thedata source):
Sequence #: 14
RBA: 8748614
Timestamp: 2014-04-22 13:01:47.962664
Extract Trail: ./dirdat/dp
Current Checkpoint (position of last recordread in the data source):
Sequence #: 14
RBA: 8751283
Timestamp: 2014-04-22 20:32:05.801351
Extract Trail: ./dirdat/dp
CSN state information:
CRC: 55-A2-8E-E4
Latest CSN: 1963641
Latest TXN: 10.23.846
Latest CSN of finished TXNs: 1963641
Completed TXNs: 10.23.846
Header:
Version = 2
Record Source = A
Type = 1
# Input Checkpoints = 1
# Output Checkpoints = 0
File Information:
Block Size = 2048
Max Blocks = 100
Record Length = 2048
Current Offset = 0
Configuration:
Data Source = 0
Transaction Integrity = -1
Task Type = 0
Database Checkpoint:
Checkpoint table = goldengate.ogg_checkpoint
Key = 296275922 (0x11a8cfd2)
Create Time = 2014-04-22 07:56:50
Status:
Start Time = 2014-04-22 20:27:58
Last Update Time = 2014-04-22 20:37:41
Stop Status = G
Last Result = 400

从上述看已经最后被应用的事务信息包括如下:
1、 csn=1963641
2、 TXN=10.23.846
根据如上信息登陆源端系统找到对应trailfile的rba信息.然后根据这个位置,重新修改datapump从这个位置传输trailfile到目标端.

开始恢复数据:
登陆源端主机:

[ludatou ~]$ cd $OGG
[ludatou112101]$cd dirdat
[ludatoudirdat]$ ls -lrt
total 126000
-rw-rw-rw- 1oracle oinstall 15198 Apr 27 12:24ex000000
-rw-rw-rw- 1oracle oinstall 9999850 Apr 27 12:28 ex000001
-rw-rw-rw- 1oracle oinstall 9999928 Apr 27 12:28 ex000002
-rw-rw-rw- 1oracle oinstall 9999824 Apr 27 12:33 ex000003
-rw-rw-rw- 1oracle oinstall 9999838 Apr 27 12:33 ex000005
-rw-rw-rw- 1oracle oinstall 9999495 Apr 27 12:33 ex000004
-rw-rw-rw- 1oracle oinstall 9999904 Apr 27 12:33 ex000007
-rw-rw-rw- 1oracle oinstall 9999928 Apr 27 12:33 ex000006
-rw-rw-rw- 1oracle oinstall 9999863 Apr 27 12:41 ex000010
-rw-rw-rw- 1oracle oinstall 9999714 Apr 27 12:41 ex000009
-rw-rw-rw- 1oracle oinstall 9999792 Apr 27 12:41 ex000008
-rw-rw-rw- 1oracle oinstall 9999867 Apr 27 12:44 ex000012
-rw-rw-rw- 1oracle oinstall 9999804 Apr 27 12:44 ex000011
-rw-rw-rw- 1oracle oinstall 8762807 Apr 27 21:01 ex000013

从根据目标段的时间来源端寻找时机最靠近的trailfile.即ex000013是第一选择,如果对应csn信息不在000013,则相前反向推找.

[ludatou dirdat]$ cd $OGG
[ludatou 112101]$./logdump –使用logdump对trailfile进行操作
Logdump 236>open ./dirdat/ex000013
Current LogTrailis /goldengate/112101/dirdat/ex000013
Logdump 237>ghdr on
Logdump 238>detail data
Logdump 239>usertoken detail
Logdump 240>ggstoken detail
Logdump 241>filter include csn = 1963641 --来自目标端的csn
Logdump 242>n --查看满足条件的数据
Scanned 10000 records, RBA 3160036, 2014/04/27 12:44:51.000.000
Scanned 20000 records, RBA 6320505, 2014/04/27 12:44:51.000.000
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x00)
UndoFlag : . (x00) BeforeAfter: A (x41)
RecLength : 1204 (x04b4) IO Time : 2014/04/27 20:53:21.000.000
IOType : 160 (xa0) OrigNode : 0 (x00)
TransInd : . (x03) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 0 AuditPos : 0
Continued : N (x00) RecCount : 1 (x01)
2014/04/2720:53:21.000.000 DDLOP Len 1204 RBA 8749251
Name:
After Image: Partition 0 G s
2c43 353d 2732 3533 3127 2c2c 4237 3d27 32353331 | ,C5='2531',,B7='2531
272c 2c42 323d 2727 2c2c 4233 3d27 4c45 4f272c2c | ',,B2='',,B3='LEO',,
4234 3d27 4c45 4f5f 524f 5753 272c 2c43 31323d27 | B4='LEO_ROWS',,C12='
272c 2c43 3133 3d27 272c 2c42 353d 2754 41424c45 | ',,C13='',,B5='TABLE
272c 2c42 363d 2743 5245 4154 4527 2c2c 42383d27 | ',,B6='CREATE',,B8='
474f 4c44 454e 4741 5445 2e47 4753 5f44 444c5f48 | GOLDENGATE.GGS_DDL_H
4953 5427 2c2c 4239 3d27 4c45 4f27 2c2c 43373d27 | IST',,B9='LEO',,C7='
GGS tokens:
TokenID x52 'R'ORAROWID Info x00 Length 20
4141 4156 5849 4141 4541 4141 4146 6741 41460001 | AAAVXIAAEAAAAFgAAF..
TokenID x44 'D'DDL Info x00 Length 36
4c45 4f00 4c45 4f5f 524f 5753 0031 312e 322e302e | LEO.LEO_ROWS.11.2.0.
342e 3000 3131 2e32 2e30 2e34 2e30 004e | 4.0.11.2.0.4.0.N
TokenID x4c 'L'LOGCSN Info x00 Length 7
3139 3633 3634 31 | 1963641 –与目标csn符合
TokenID x36 '6'TRANID Info x00 Length 9
3130 2e32 332e 3834 36 | 10.23.846 –与目标tranid符合
Filteringsuppressed 27661 records

根据上述信息已经找到目标端应用的数据信息.
Datapump需要从满足条件下一条记录开始传输.所以继续输入N捕获下一条数据信息的rba
但是输入N却发现没有任何数据返回,是到达文件结尾了吗?根据判断这个属于中间数据.经过分析发现,原来是filter include csn = 1963641搞的鬼.现在清楚这个过滤条件.根据下面输出可以获得正确rba信息是:

Logdump 243 >n
Filteringsuppressed 100 records
Logdump 244>filter clear
Logdump 246>pos 8749251 –来自上面的rba信息,是目标端已经应用的数据.
Reading forwardfrom RBA 8749251
Logdump 247>n –表示输出这个数据的详细信息.下n才是我们需要的数据
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x00)
UndoFlag : . (x00) BeforeAfter: A (x41)
RecLength : 1204 (x04b4) IO Time : 2014/04/27 20:53:21.000.000
IOType : 160 (xa0) OrigNode : 0 (x00)
TransInd : . (x03) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 0 AuditPos : 0
Continued : N (x00) RecCount : 1 (x01)
2014/04/2720:53:21.000.000 DDLOP Len 1204 RBA 8749251
Name:
After Image: Partition 0 G s
2c43 353d 2732 3533 3127 2c2c 4237 3d27 32353331 | ,C5='2531',,B7='2531
272c 2c42 323d 2727 2c2c 4233 3d27 4c45 4f272c2c | ',,B2='',,B3='LEO',,
4234 3d27 4c45 4f5f 524f 5753 272c 2c43 31323d27 | B4='LEO_ROWS',,C12='
272c 2c43 3133 3d27 272c 2c42 353d 2754 41424c45 | ',,C13='',,B5='TABLE
272c 2c42 363d 2743 5245 4154 4527 2c2c 4238 3d27| ',,B6='CREATE',,B8='
474f 4c44 454e 4741 5445 2e47 4753 5f44 444c5f48 | GOLDENGATE.GGS_DDL_H
4953 5427 2c2c 4239 3d27 4c45 4f27 2c2c 43373d27 | IST',,B9='LEO',,C7='
GGS tokens:
TokenID x52 'R'ORAROWID Info x00 Length 20
4141 4156 5849 4141 4541 4141 4146 6741 41460001 | AAAVXIAAEAAAAFgAAF..
TokenID x44 'D'DDL Info x00 Length 36
4c45 4f00 4c45 4f5f 524f 5753 0031 312e 322e302e | LEO.LEO_ROWS.11.2.0.
342e 3000 3131 2e32 2e30 2e34 2e30 004e | 4.0.11.2.0.4.0.N
TokenID x4c 'L'LOGCSN Info x00 Length 7
3139 3633 3634 31 | 1963641
TokenID x36 '6'TRANID Info x00 Length 9
3130 2e32 332e 3834 36 | 10.23.846

Logdump 248>n --这里是正确的rba信息,同时判断这个条记录是否开始位置.根据transind后面数字x00表示是开始位置,同时也确认我们找的rba是正确的.否则就是错误的.
TransInd : . (x00) 表示记录开始位置
TransInd : . (x01) 表示记录中间位置
TransInd : . (x02) 表示记录结尾位置
TransInd : . (x03) 表示这个事务只有1条记录
___________________________________________________________________
Hdr-Ind : E (x45) Partition : . (x04)
UndoFlag : . (x00) BeforeAfter: A (x41)
RecLength : 31 (x001f) IO Time : 2014/04/27 21:01:11.000.000
IOType : 5 (x05) OrigNode : 255 (xff)
TransInd : . (x00) FormatType : R (x52)
SyskeyLen : 0 (x00) Incomplete : . (x00)
AuditRBA : 161 AuditPos : 11343376
Continued : N (x00) RecCount : 1 (x01)
2014/04/2721:01:11.000.000 Insert Len 31 RBA 8750598
NameEO.LEO_ROWS
After Image: Partition 4 G b
0000 0005 0000 0001 3100 0100 0700 0000 036c656f | ........1........leo
0002 0007 0000 0003 3130 30 | ........100
Column 0 (x0000), Len 5 (x0005)
0000 0001 31 | ....1
Column 1 (x0001), Len 7 (x0007)
0000 0003 6c65 6f | ....leo
Column 2(x0002), Len 7 (x0007)
0000 0003 3130 30 | ....100
GGS tokens:
TokenID x52 'R'ORAROWID Info x00 Length 20
4141 4156 5976 4141 4541 4141 4146 3141 41410001 | AAAVYvAAEAAAAF1AAA..
TokenID x4c 'L'LOGCSN Info x00 Length 7
3139 3635 3636 33 | 1965663
TokenID x36 '6'TRANID Info x00 Length 8
382e 342e 3130 3638 | 8.4.1068

停止源端的pump,修改配置

Logdump 249>exit
[ludatou112101]$ ./ggsci
OracleGoldenGate Command Interpreter for Oracle
Version11.2.1.0.1 OGGCORE_11.2.1.0.1_PLATFORMS_120423.0230_FBO
Linux, x64,64bit (optimized), Oracle 11g on Apr 23 2012 08:32:14
Copyright (C)1995, 2012, Oracle and/or its affiliates. All rights reserved.
GGSCI (ludatou)1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING DPLEO 00:00:00 00:00:06
EXTRACT RUNNING EXTLEO 00:00:03 00:00:02
GGSCI (cifpay02)2> stop DPLEO
Sending STOPrequest to EXTRACT DPLEO ...
Requestprocessed.
GGSCI (ludatou)3> alter extract dpleo,extrba 8750598
EXTRACT altered.
GGSCI (cifpay02)4> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT STOPPED DPLEO 00:00:00 00:00:01
EXTRACT RUNNING EXTLEO 00:00:02 00:00:07
GGSCI (cifpay02)5> start dpleo
Sending STARTrequest to MANAGER ...
EXTRACT DPLEOstarting
GGSCI (ludatou)14> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT ABENDED DPLEO 00:00:00 00:02:38
EXTRACT RUNNING EXTLEO 00:00:02 00:00:07

从上面看到dumpabend,主要目标端dp000014被手动删除,导致dump传输数据无法lock trailfile,当目标端达到设置大小时候才会切换到下一个或者我们手动切换到下一个.下面手动切换dump从新文件开始写,即dp000015.其实dump指定写目标端extseqno.

GGSCI (cifpay02)15> alter DPLEO,etrollover
2014-04-2803:43:52 INFO OGG-01520 Rollover performed. For eachaffected output trail of Version 10 or higher format, after starting the sourceextract, issue ALTER EXTSEQNO for that trail's reader (either pump EXTRACT orREPLICAT) to move the reader's scan to the new trail file; it will not happen automatically.
EXTRACT altered.

目标端确认配置

GGSCI (ludatou)1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT STOPPED REPLEO 00:00:00 06:56:12

因为dp000014已经不存在了,修改从dp000015开始读取.

GGSCI (cifpay01)2> shell ls -lrt ./dirdat/*
-rw-rw-rw- 1oracle oinstall 13135 May 2 10:08 ./dirdat/dp000000
-rw-rw-rw- 1oracle oinstall 3120 May 2 12:03 ./dirdat/dp000001
-rw-rw-rw- 1oracle oinstall 9999708 May 2 12:07./dirdat/dp000002
-rw-rw-rw- 1oracle oinstall 9999967 May 2 12:07./dirdat/dp000003
-rw-rw-rw- 1oracle oinstall 9999491 May 2 12:13./dirdat/dp000005
-rw-rw-rw- 1oracle oinstall 9999677 May 2 12:13./dirdat/dp000004
-rw-rw-rw- 1oracle oinstall 9999879 May 2 12:13./dirdat/dp000006
-rw-rw-rw- 1oracle oinstall 9999975 May 2 12:13./dirdat/dp000007
-rw-rw-rw- 1oracle oinstall 9999943 May 2 12:13./dirdat/dp000008
-rw-rw-rw- 1oracle oinstall 9999749 May 2 12:21./dirdat/dp000009
-rw-rw-rw- 1oracle oinstall 9999908 May 2 12:21./dirdat/dp000011
-rw-rw-rw- 1oracle oinstall 9999743 May 2 12:21./dirdat/dp000010
-rw-rw-rw- 1oracle oinstall 9999841 May 2 12:24./dirdat/dp000012
-rw-rw-rw- 1oracle oinstall 9999908 May 2 12:24./dirdat/dp000013
-rw-rw-rw- 1 oracleoinstall 8763492 May 2 20:40./dirdat/dp000014.old
-rw-rw-rw- 1oracle oinstall 13269 May 3 03:23 ./dirdat/dp000015
GGSCI (ludatou)4> alter REPLEO,extseqno 15,extrba 0
REPLICATaltered.
GGSCI (ludatou)5> start repleo
Sending STARTrequest to MANAGER ...
REPLICAT REPLEOstarting
GGSCI (ludatou)6> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT RUNNING REPLEO 00:00:00 00:00:09

目标端验证数据是否同步
经过查看源端leo_rows表中100条数据已经成功同步到目标端.由此可以确认恢复被删除的trailfile成功.

GGSCI (ludatou)8> stats REPLEO
Sending STATSrequest to REPLICAT REPLEO ...
Start ofStatistics at 2014-05-03 03:36:13.
Replicating fromLEO.LEO_ROWS to LEO.LEO_ROWS:
*** Totalstatistics since 2014-05-03 03:35:23 ***
Total inserts 100.00
Total updates 0.00
Total deletes 0.00
Total discards 0.00
Total operations 100.00
*** Dailystatistics since 2014-05-03 03:35:23 ***
Total inserts 100.00
Total updates 0.00
Total deletes 0.00
Total discards 0.00
Total operations 100.00
*** Hourlystatistics since 2014-05-03 03:35:23 ***
Total inserts 100.00
Total updates 0.00
Total deletes 0.00
Total discards 0.00
Total operations 100.00
*** Lateststatistics since 2014-05-03 03:35:23 ***
Total inserts 100.00
Total updates 0.00
Total deletes 0.00
Total discards 0.00
Total operations 100.00
End ofStatistics.

至此目标端被删除的trailfile中数据已经恢复成功.

遭遇ora-00600 [2662]一例

遭遇2662错误,

Thu Jul 24 00:45:47 2014
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Thu Jul 24 00:45:47 2014
ARC0: Becoming the 'no FAL' ARCH
ARC0: Becoming the 'no SRL' ARCH
Thu Jul 24 00:45:47 2014
ARC1: Becoming the heartbeat ARCH
Thu Jul 24 00:45:47 2014
SMON: enabling cache recovery
Thu Jul 24 00:45:47 2014
Errors in file /ora10g/admin/lu10g/udump/lu10g_ora_16481.trc:
ORA-00600: internal error code, arguments: [2662], [0], [10480758], [0], [10480787], [4194313], [], []
Thu Jul 24 00:45:48 2014
Errors in file /ora10g/admin/lu10g/udump/lu10g_ora_16481.trc:
ORA-00600: internal error code, arguments: [2662], [0], [10480758], [0], [10480787], [4194313], [], []
Thu Jul 24 00:45:48 2014
Error 600 happened during db open, shutting down database
USER: terminating instance due to error 600
Instance terminated by USER, pid = 16481
ORA-1092 signalled during: alter database open resetlogs...

根据《谈谈adjust_scn的计算方法》的方式处理,使用adjust处理的计算过程如下:
因为c=0所以c*4=0,因为D=10480787 < 1073741824,所以WRAP=0+1,所以整个adjust的level 为1,则处理的语句为如下:
alter session set events ‘IMMEDIATE trace name ADJUST_SCN level 1’;

当然这里也可以通过oradebug poke的方式更改scn的wrap+1以达到一样的效果.

登录数据库执行处理如下:

SQL&gt; conn / as sysdba
Connected to an idle instance.
SQL&gt; startup mount
ORACLE instance started.
Total System Global Area 318767104 bytes
Fixed Size 1267260 bytes
Variable Size 100665796 bytes
Database Buffers 213909504 bytes
Redo Buffers 2924544 bytes
Database mounted.

SQL&gt; alter session set events 'IMMEDIATE trace name ADJUST_SCN level 1';

Session altered.

SQL&gt; alter database open;

Database altered.

至此数据库已经打开,此时应该用dump工具导出数据重建数据库,因为此时数据库运行会非常艰难,后台会不断报错,比如:
ORA-00600: internal error code, arguments: [kcsadjn1], [], [], [], [], [], [], []

所以建议还是需要做好备份以及容灾,避免出现各类问题。当然这里的scn还可以通过其他推进scn的方式来解决,比如在11204版本后,就需要使用其他手段来推进SCN,传统方式在110204已经不支持,具体参考以前写的手工推荐scn的5种方式。

遭遇ora-600[kjblpkeydrmqscchk:pkey]

平台为 linux x86_64
db版本 11.2.0.3

现象节点2,3,4自动重启.有session挂住集中出现等待事件gc buffer busy acquire.
诊断为bug 14409183

具体如下:

Fixed:

The fix for 14409183 is first included in
12.1.0.1 (Base Release)
11.2.0.4 (Server Patch Set)
11.2.0.3.4 Database Patch Set Update
11.2.0.3 Bundle Patch 10 for Exadata Database
11.2.0.3 Patch 10 on Windows Platforms

Interim patches may be available for earlier versions - click here to check.
Symptoms:

Related To:

Instance May Crash
Process May Dump (ORA-7445) / Abend / Abort
Hang (Process Hang)
Internal Error May Occur (ORA-600)
ORA-600 [kjblpkeydrmqscchk:pkey]
ORA-600 [kjbldrmrpst:pkey]
ORA-600 [kjbrchkpkeywait:timeout]
ORA-600 [kclcls_8]
ORA-600 [kjbmmchkinteg:from]
ORA-600 [kclantilock_5]
ORA-600 [kclchkinteg_8]
ORA-600 [kclexpand_9]
Waits for "gc buffer busy acquire"
RAC (Real Application Clusters) / OPS
_gc_read_mostly_locking
_gc_read_mostly_locking
Description

This bug is only relevant when using Real Application Clusters (RAC)
The fix for bug 13397104 addressed some potential ORA-600 issues in RAC
*BUT* the fix was faulty and can cause a hang scenario in RAC with sessions waiting on
"gc buffer busy acquire" READING buffer's pin is not cleaned up properly during a
previous exception(ORA-8103).

This fix supersedes the 13397104 and should be used instead of that fix
for interim patches. It addresses both the symptoms of bug 13397104 and the
follow on hang issue.

Rediscovery Notes:
If you have the fix for bug 13397104 present and see the following symptoms then
this bug should be suspected:
1. Fix 13397104 is present on the system,
2. RAC env,
3. Sessions wait for a READING buffer with a dangling pin.

If you do not have the fix for bug 13397104 but see the symptoms of that bug#
then this fix should be used. Symptoms of that issue include:
- Various ora-600's such as:
ora-600 [kjblpkeydrmqscchk:pkey]
ora-600 [kjbldrmrpst:pkey]
ora-600 [kjbrchkpkeywait:timeout]
in 11.2.0.3 where fix for bug 10170926 is present
- the LE flag includes KCLL_F_ANTI (0x080)
ie. flags: 0xa1
- ORA-600 [kclcls_8] due to a mismatch from a guess-dba

Workaround
The original problem in bug 13397104 can be avoided by
disabling read mostly locking via: _gc_read_mostly_locking=false

Getting a Fix
Use one of the "Fixed" versions listed above
(for Patch Sets / bundles use the latest version available as
contents are cumulative - the "Fixed" version listed above is
the first version where the fix is included)
or
Click here for suggestions on how to get a fix for this issue

Please note: The above is a summary description only. Actual symptoms can vary. Matching to any symptoms here does not confirm that you are encountering this problem. For questions about this bug please consult Oracle Support.