

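Common Causes of RAC Instance Crashes

Below are five common RAC instance crash problems, compiled from hands-on experience and from My Oracle Support (Metalink) notes.

Issue 1: ORA-29770, LMHB terminates the instance

Symptoms: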

LMON (ospid: 31216) waits for event 'control file sequential read' for 88 secs.
Errors in file /oracle/base/diag/rdbms/prod/prod3/trace/prod3_lmhb_31304.trc (incident=2329):
ORA-29770: global enqueue process LMON (OSID 31216) is hung for more than 70 seconds
LMHB (ospid: 31304) is terminating the instance.

or
LMON (ospid: 8594) waits for event 'control file sequential read' for 118 secs.
ERROR: LMON is not healthy and has no heartbeat.
ERROR: LMHB (ospid: 8614) is terminating the instance.

 

Possible causes:

LMON waits too long on a control file read, so LMHB crashes the instance.
Bug 11890804 LMHB crashes instance with ORA-29770 after long "control file sequential read" waits
Solution:

Bug 8888434 is fixed in 11.2.0.2 and later.
Bug 11890804 is fixed in 11.2.0.3 and later.
See Document 1197674.1, Document 8888434.8 and Document 11890804.8 for details.
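
The symptom lines above follow a fixed shape, so an alert log can be screened for long background-process waits before LMHB steps in. The following is a minimal Python sketch, not an Oracle tool: the function name `find_long_waits` and the 70-second default (taken from the ORA-29770 message text) are assumptions made for illustration.

```python
import re

# Matches alert-log lines such as:
#   LMON (ospid: 31216) waits for event 'control file sequential read' for 88 secs.
WAIT_RE = re.compile(
    r"^(?P<proc>\w+) \(ospid: (?P<ospid>\d+)\) waits for event "
    r"'(?P<event>[^']+)' for (?P<secs>\d+) secs\."
)


def find_long_waits(lines, min_secs=70):
    """Return (process, ospid, event, seconds) for waits of at least min_secs."""
    hits = []
    for line in lines:
        m = WAIT_RE.match(line.strip())
        if m and int(m.group("secs")) >= min_secs:
            hits.append(
                (m.group("proc"), int(m.group("ospid")),
                 m.group("event"), int(m.group("secs")))
            )
    return hits


# Sample lines copied from the symptoms shown in this article.
sample = [
    "LMON (ospid: 31216) waits for event 'control file sequential read' for 88 secs.",
    "ORA-29770: global enqueue process LMON (OSID 31216) is hung for more than 70 seconds",
    "LMHB (ospid: 31304) is terminating the instance.",
]

for hit in find_long_waits(sample):
    print(hit)
```

In practice the lines would come from the instance's alert log file rather than a hard-coded list.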

Issue 2: Instance crash caused by ORA-481

Symptoms:

1. PMON (ospid: 12585): terminating the instance due to error 481
The LMON trace file shows:
Begin DRM(107) (swin 0)
* drm quiesce

The LMS trace file shows:

2011-07-05 10:53:44.218905 : Start affinity expansion for pkey 81885.0
2011-07-05 10:53:44.498923 : Expand failed: pkey 81885.0, 229 shadows traversed, 153 replayed 1 retries

2. PMON (ospid: 4915562): terminating the instance due to error 481
Sat Oct 01 19:21:37 2011
System state dump requested by (instance=2, osid=4915562 (PMON)), summary=[abnormal instance termination].

Possible causes:

1. Bug 11875294 LMS gets stuck during DRM, Instance crashed with ORA-481
2. HAIP is offline on some cluster nodes, or HAIP is online on every node but communication over it fails (for example, ping fails).
Solution:

1. Bug 11875294 is fixed in 11.2.0.3. The workaround is to disable read mostly locking by setting
_gc_read_mostly_locking=FALSE.
See Document 11875294.8 for details.

2. Fix the HAIP problem; see Document 1383737.1.
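
To see which background process ended the instance and with which error number, the "terminating the instance due to error N" lines can be pulled out of the alert log. This is a rough Python sketch under the assumption that the messages look like the samples quoted in this article; `terminations` is a name invented here.

```python
import re

# Covers both message forms quoted in this article:
#   PMON (ospid: 12585): terminating the instance due to error 481
#   LMD0: terminating instance due to error 481
TERM_RE = re.compile(
    r"(?P<proc>[A-Z][A-Z0-9]*)(?: \(ospid: (?P<ospid>\d+)\))?: "
    r"terminating (?:the )?instance due to error (?P<err>\d+)"
)


def terminations(lines):
    """Return a list of (process_name, error_number) found in the given lines."""
    found = []
    for line in lines:
        m = TERM_RE.search(line)
        if m:
            found.append((m.group("proc"), int(m.group("err"))))
    return found


sample = [
    "PMON (ospid: 12585): terminating the instance due to error 481",
    "Begin DRM(107) (swin 0)",
    "LMD0: terminating instance due to error 481",
]
print(terminations(sample))
```

Error 481 points at this issue; other numbers (for example 484 in the 10.2.0.5 section of this article) point elsewhere.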

Issue 3: Instance crash caused by ORA-600 [kjbmprlst:shadow], ORA-600 [kjbrref:pkey], ORA-600 [kjbmocvt:rid], ORA-600 [kjbclose_remaster:!drm], or ORA-600 [kjbrasr:pkey]

Symptoms:

A RAC instance crashes with ORA-600 [kjbmprlst:shadow], ORA-600 [kjbrref:pkey], ORA-600 [kjbmocvt:rid], ORA-600 [kjbclose_remaster:!drm], or ORA-600 [kjbrasr:pkey].

Possible causes:

This group of ORA-600 errors is related to DRM (dynamic resource remastering) messages or to read mostly locking. Several bugs are involved, including:
Document 9458781.8 Missing close message to master leaves closed lock dangling crashing the instance with assorted Internal error
Document 9835264.8 ORA-600 [kjbrasr:pkey] / ORA-600 [kjbmocvt:rid] in RAC with dynamic remastering
Document 10200390.8 ORA-600[kjbclose_remaster:!drm] in RAC with fix for 9979039
Document 10121589.8 ORA-600 [kjbmprlst:shadow] can occur in RAC
Document 11785390.8 Stack corruption / incorrect behaviour possible in RAC
Document 12408350.8 ORA-600 [kjbrasr:pkey] in RAC with read mostly locking
Document 12834027.8 ORA-600 [kjbmprlst:shadow] / ORA-600 [kjbrasr:pkey] with RAC read mostly locking
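Solution:

Most of the bugs above are fixed in 11.2.0.3, so installing the 11.2.0.3 patch set should avoid them. The exception is Bug 12834027, which is to be fixed in 12.1. The workarounds for that bug are:

Disable DRM

Disable read mostly locking
For example: set "_gc_read_mostly_locking"=FALSE

See the documents listed above for the description and fix of each bug.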
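
Because several documents cover overlapping ORA-600 arguments, a small lookup keyed on the first argument can speed up triage. The Python sketch below only encodes the document titles listed in this section; the table and the helper `docs_for` are illustrative assumptions, and arguments that no listed title names explicitly (such as kjbrref:pkey) return an empty list.

```python
import re

# First ORA-600 argument -> documents whose titles (as listed above) name it.
DOCS_BY_ARG = {
    "kjbmprlst:shadow": ["10121589.8", "12834027.8"],
    "kjbrasr:pkey": ["9835264.8", "12408350.8", "12834027.8"],
    "kjbmocvt:rid": ["9835264.8"],
    "kjbclose_remaster:!drm": ["10200390.8"],
}

# Accepts "ORA-600 [arg]", "ORA-600[arg]" and
# "ORA-00600: internal error code, arguments: [arg], ...".
ORA600_RE = re.compile(r"ORA-0*600\b[^\[]*\[([^\]]+)\]")


def docs_for(line):
    """Return the document numbers associated with the ORA-600 argument in line."""
    m = ORA600_RE.search(line)
    if not m:
        return []
    return DOCS_BY_ARG.get(m.group(1).strip(), [])


print(docs_for("ORA-600 [kjbrasr:pkey]"))
```

Documents 9458781.8 and 11785390.8 are not tied to a single argument in their titles, so they are deliberately left out of the table.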

Issue 4: With flash cache enabled, dumps in kcldle/kclfplz/kcbbxsv_l2/kclfprm crash the instance

Symptoms:

The alert log reports ORA-7445 [kcldle]
ORA-7445 [kclfplz]
ORA-7445 [kcbbxsv_l2]
ORA-7445 [kclfprm]

Possible causes:

These errors come from different bugs, all of which trace back to the base bug, Bug 12337941 Dumps on kcldle / kclfplz / kcbbxsv_l2 / kclfprm using flash
Solution:

This bug is fixed in 11.2.0.3. Install the patch set, or work around the problem by disabling Flash Cache.
See Document 12337941.8 for details.

Issue 5: LMS hits ORA-600 [kclpdc_21] and the instance crashes

Symptoms:

The alert log reports ORA-600 [kclpdc_21]

Possible causes:

Document 10040035.8 LMS gets ORA-600 [kclpdc_21] and instance crashes
Solution:

This bug is fixed in 11.2.0.3.

Issues in 10.2.0.5

Symptoms:

1. An LMS process hits ORA-600 [kjccgmb:1] and crashes the instance: LMS: terminating instance due to error 484
2. The instance crashes with the following messages:
Received an instance abort message from instance 2 (reason 0x0)
Please check instance 2 alert and LMON trace files for detail.
LMD0: terminating instance due to error 481

Possible causes:

1. Bug 11893577 – LMD CRASHED WITH ORA-00600 [KJCCGMB:1]
2. Bug 9577274 – 1OFF:UNABLE TO VIEW REQUEST OUTPUT AND LOG AFTER APPLYING FIX TO ISSUE IN BUG 9400041
Solution:

1. For 10.2.0.5.0, install merge patch 12616787.
2. For 10.2.0.5.5, install merge patch 13470618.
At the time of writing, patches are available only for certain platforms. On any 10.2.0.5.x release you do not need to install both patches.

Script to Collect DRM Information

Reference: drmdiag.sql
Applies to: 11.2.0.1 to 12.1.0.1 [Release 11.2 to 12.1]

-- NAME: DRMDIAG.SQL
-- ------------------------------------------------------------------------
-- AUTHOR: Michael Polaski - Oracle Support Services
-- ------------------------------------------------------------------------
-- PURPOSE:
-- This script is intended to provide a user friendly guide to troubleshoot
-- drm (dynamic resource remastering) waits. The script will create a file
-- called drmdiag_<dbname>_<timestamp>.out in your local directory.

set echo off
set feedback off
column timecol new_value timestamp
column spool_extension new_value suffix
select to_char(sysdate,'Mondd_hh24mi') timecol,
'.out' spool_extension from sys.dual;
column output new_value dbname
select value || '_' output
from v$parameter where name = 'db_name';
spool drmdiag_&&dbname&&timestamp&&suffix
set trim on
set trims on
set lines 140
set pages 100
set verify off
alter session set optimizer_features_enable = '10.2.0.4';
set feedback on

PROMPT DRMDIAG DATA FOR &&dbname&&timestamp
PROMPT Important parameters:
PROMPT
PROMPT _gc_policy_minimum (default is 1500). Increasing this would cause DRMs to happen less frequently.
PROMPT Use the "OBJECT_POLICY_STATISTICS" section later in this report to see how active various objects are.
PROMPT
PROMPT _gc_policy_time (default to 10 (minutes)). Amount of time to evaluate policy stats. Use the
PROMPT "OBJECT_POLICY_STATISTICS" section later in this report to see how active various objects are for the
PROMPT _gc_policy_time. Usually not necessary to change this parameter.
PROMPT
PROMPT _gc_read_mostly_locking (default is TRUE). Setting this to FALSE would disable read mostly related DRMs.
PROMPT
PROMPT gcs_server_processes (default is derived from CPU count/4). May need to increase this above the
PROMPT default to add LMS processes to complete the work during a DRM but the default is usually adequate.
PROMPT
PROMPT _gc_element_percent (default is 110). May need to apply the fix for bug 14791477 and increase this to
PROMPT 140 if running out of lock elements. Usually not necessary to change this parameter.
PROMPT
PROMPT GC Related parameters set in this instance:
show parameter gc
PROMPT
PROMPT CPU count on this instance:
show parameter cpu_count

PROMPT
PROMPT SGA INFO FOR &&dbname&&timestamp
PROMPT
PROMPT Larger buffer caches (above 100 gig) may increase the cost of DRMs significantly.
set lines 120
set pages 100
column component format a40 tru
column current_size format 99999999999999999
column min_size format 99999999999999999
column max_size format 99999999999999999
column user_specified_size format 99999999999999999
select component, current_size, min_size, max_size, user_specified_size
from v$sga_dynamic_components
where current_size > 0;

PROMPT
PROMPT ASH THRESHOLD...
PROMPT
PROMPT This will be the threshold in milliseconds for total drm freeze
PROMPT times. This will be used for the next queries to look for the worst
PROMPT 'drm freeze' minutes. Any minutes whose total drm freeze wait
PROMPT time is greater than the threshold will be analyzed further.
column threshold_in_ms new_value threshold format 999999999.999
select decode(min(threshold_in_ms),null,0,min(threshold_in_ms)) threshold_in_ms
from (select inst_id, to_char(sample_time,'Mondd_hh24mi') minute,
sum(time_waited)/1000 threshold_in_ms
from gv$active_session_history
where event like '%drm freeze%'
group by inst_id,to_char(sample_time,'Mondd_hh24mi')
order by 3 desc)
where rownum <= 10;

PROMPT
PROMPT ASH WORST MINUTES FOR DRM FREEZE WAITS:
PROMPT
PROMPT APPROACH: These are the minutes where the avg drm freeze time
PROMPT was the highest (in milliseconds).
column event format a30 tru
column program format a35 tru
column total_wait_time format 999999999999.999
column avg_time_waited format 999999999999.999
select to_char(sample_time,'Mondd_hh24mi') minute, inst_id, event,
sum(time_waited)/1000 TOTAL_WAIT_TIME , count(*) WAITS,
avg(time_waited)/1000 AVG_TIME_WAITED
from gv$active_session_history
where event like '%drm freeze%'
group by to_char(sample_time,'Mondd_hh24mi'), inst_id, event
having sum(time_waited)/1000 > &&threshold
order by 1,2;

PROMPT
PROMPT ASH DRM BACKGROUND PROCESS WAITS DURING WORST MINUTES:
PROMPT
PROMPT APPROACH: What is LMS doing when 'drm freeze' waits
PROMPT are happening? LMD and LMON info may also be relevant
column inst format 999
column minute format a12 tru
column event format a50 tru
column program format a55 wra
select to_char(sample_time,'Mondd_hh24mi') minute, inst_id inst,
sum(time_waited)/1000 TOTAL_WAIT_TIME , count(*) WAITS,
avg(time_waited)/1000 AVG_TIME_WAITED,
program, event
from gv$active_session_history
where to_char(sample_time,'Mondd_hh24mi') in (select to_char(sample_time,'Mondd_hh24mi')
from gv$active_session_history
where event like '%drm freeze%'
group by to_char(sample_time,'Mondd_hh24mi'), inst_id
having sum(time_waited)/1000 > &&threshold and sum(time_waited)/1000 > 0.5)
and (program like '%LMS%' or program like '%LMD%' or
program like '%LMON%' or event like '%drm freeze%')
group by to_char(sample_time,'Mondd_hh24mi'), inst_id, program, event
order by 1,2,3,5 desc, 4;

PROMPT
PROMPT POLICY HISTORY INFO:
PROMPT See if you can correlate policy history events with minutes of high
PROMPT wait time.
select * from gv$policy_history
order by event_date;
PROMPT
PROMPT DYNAMIC_REMASTER_STATS
PROMPT This shows where time is spent during DRM operations.
set heading off
set lines 60
select 'Instance: '||inst_id inst, 'Remaster Ops: '||remaster_ops rops,
'Remaster Time: '||remaster_time rtime, 'Remastered Objects: '||remastered_objects robjs,
'Quiesce Time: '||quiesce_time qtime, 'Freeze Time: '||freeze_time ftime,
'Cleanup Time: '||cleanup_time ctime, 'Replay Time: '||replay_time rptime,
'Fixwrite Time: '||fixwrite_time fwtime, 'Sync Time: '||sync_time stime,
'Resources Cleaned: '||resources_cleaned rclean,
'Replayed Locks Sent: '||replayed_locks_sent rlockss,
'Replayed Locks Received: '||replayed_locks_received rlocksr,
'Current Objects: '||current_objects
from gv$dynamic_remaster_stats
order by 1;
set lines 120
set heading on

PROMPT
PROMPT OBJECT_POLICY_STATISTICS:
PROMPT The sum of the last 3 columns (sopens,xopens,xfers) decides whether the object
PROMPT will be considered for DRM (_gc_policy_minimum). The duration of the stats
PROMPT are controlled by _gc_policy_time (default is 10 minutes).
select object,node,sopens,xopens,xfers from x$object_policy_statistics;

PROMPT
PROMPT ACTIVE OBJECTS (OBJECT_POLICY_STATISTICS)
PROMPT These are the objects that are above the default _gc_policy_minimum (1500).
select object, node, sopens+xopens+xfers activity
from x$object_policy_statistics
where sopens+xopens+xfers > 1500
order by 3 desc;

PROMPT LWM FOR LE FREELIST
PROMPT This number should never get near zero, if it does consider the fix for bug 14791477
PROMPT and/or increasing _gc_element_percent.
select sum(lwm) from x$kclfx;

PROMPT
PROMPT GCSPFMASTER INFO WITH OBJECT NAMES
column objname format a120 tru
select o.name || ' - '|| o.subname objname, o.type#, h.*
from v$gcspfmaster_info h, obj$ o where h.data_object_id=o.dataobj#
order by data_object_id;

PROMPT
PROMPT ASH DETAILS FOR WORST MINUTES:
PROMPT
PROMPT APPROACH: If you cannot determine the problem from the data
PROMPT above, you may need to look at the details of what each session
PROMPT is doing during each 'bad' snap. Most likely you will want to
PROMPT note the times of the high drm freeze waits, look at what
PROMPT LMS, LMD0, LMON is doing at those times, and go from there...
set lines 140
column program format a45 wra
column sample_time format a25 tru
column event format a30 tru
column time_waited format 999999.999
column p1 format a40 tru
column p2 format a40 tru
column p3 format a40 tru
select sample_time, inst_id inst, session_id, program, event, time_waited/1000 TIME_WAITED,
p1text||': '||p1 p1,p2text||': '||p2 p2,p3text||': '||p3 p3
from gv$active_session_history
where to_char(sample_time,'Mondd_hh24mi') in (select
to_char(sample_time,'Mondd_hh24mi')
from gv$active_session_history
where event like '%drm freeze%'
group by to_char(sample_time,'Mondd_hh24mi'), inst_id
having sum(time_waited)/1000 > &&threshold)
and time_waited > 0.5
order by 1,2,3,4,5;

spool off

PROMPT
PROMPT OUTPUT FILE IS: drmdiag_&&dbname&&timestamp&&suffix
PROMPT
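
The "ASH THRESHOLD" and "worst minutes" queries in the script can be hard to follow in SQL. The Python sketch below mirrors that logic as I read it: sum the 'drm freeze' wait time per instance and minute (ASH reports time_waited in microseconds, hence the division by 1000), take the smallest of the ten largest totals as the threshold, and keep the minutes strictly above it. The function names and the tuple layout of the input are assumptions made for illustration.

```python
from collections import defaultdict


def minute_totals(samples):
    """samples: iterable of (inst_id, minute, time_waited_us).

    Returns {(inst_id, minute): total wait in milliseconds}.
    """
    totals = defaultdict(float)
    for inst_id, minute, time_waited_us in samples:
        totals[(inst_id, minute)] += time_waited_us / 1000.0
    return dict(totals)


def threshold_ms(totals, top_n=10):
    """Smallest total among the top_n largest totals; 0 when there is no data."""
    top = sorted(totals.values(), reverse=True)[:top_n]
    return min(top) if top else 0.0


def worst_minutes(totals, top_n=10):
    """Keys whose total is strictly greater than the threshold (as in HAVING ... > &&threshold)."""
    limit = threshold_ms(totals, top_n)
    return sorted(key for key, total in totals.items() if total > limit)


# Hypothetical samples: (inst_id, minute, time_waited in microseconds).
samples = [
    (1, "Apr15_1001", 2000), (1, "Apr15_1001", 3000),
    (1, "Apr15_1002", 3000),
    (2, "Apr15_1001", 1000),
]
totals = minute_totals(samples)
print(threshold_ms(totals, top_n=2), worst_minutes(totals, top_n=2))
```

One consequence of the strict comparison is that the minute that defines the threshold is itself excluded from the later reports.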

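Making Oracle Hang with strace: A Test

A database can hang for many reasons. Most of the cases I have run into came from processes freezing under memory thrashing, from bugs, or from things like a full archive destination or network congestion under Data Guard's highest protection mode. The case described here is an oddball that you will rarely meet. The test was run on 11.2.0.2 on a 64-bit Linux 2.6.18 kernel; it does not reproduce on 10g or 11gR1. It should probably count as an 11.2.0.2 bug.

Linux has a command called strace, which is commonly used to trace the system calls a process makes and the signals it receives.
Its options are as follows: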

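-c Count the time, calls, and errors for each system call and report a summary.
-d Print strace's own debugging output on standard error.
-f Trace child processes created by fork.
-ff When used with -o filename, write each process's trace to filename.pid, where pid is that process's ID.
-F Attempt to follow vfork calls. With -f alone, vfork is not followed.
-h Print a brief help message.
-i Print the instruction pointer at the time of each system call.
-q Suppress messages about attaching and detaching.
-r Print a relative timestamp on entry to each system call.
-t Prefix each output line with the time of day.
-tt Prefix each output line with the time of day, including microseconds.
-ttt Like -tt, but print the time as seconds since the epoch, including microseconds.
-T Show the time spent in each system call.
-v Print unabbreviated output. Calls involving environment variables, stat structures, I/O settings and the like are so frequent that they are abbreviated by default.
-V Print the strace version.
-x Print non-ASCII strings in hexadecimal.
-xx Print all strings in hexadecimal.
-a column
Set the column at which return values are aligned. The default is 40.
-e expr
A qualifying expression that controls what is traced and how. Its format is:
[qualifier=][!]value1[,value2]...
qualifier must be one of trace, abbrev, verbose, raw, signal, read, or write, and value is a symbol or number that restricts it. The default qualifier is trace. An exclamation mark negates the set of values. For example:
-eopen is equivalent to -e trace=open and traces only the open call, while -e trace=!open traces every call except open. There are two special values, all and none.
Note that some shells use ! for history expansion, so you may need to escape it with a backslash.
-e trace=set
Trace only the listed system calls. For example, -e trace=open,close,read,write traces only those four calls. The default is trace=all.
-e trace=file
Trace only system calls related to file operations.
-e trace=process
Trace only system calls related to process management.
-e trace=network
Trace all network-related system calls.
-e trace=signal
Trace all signal-related system calls.
-e trace=ipc
Trace all system calls related to inter-process communication.
-e abbrev=set
Set which system calls have their output abbreviated. -v is equivalent to abbrev=none. The default is abbrev=all.
-e raw=set
Print the arguments of the listed system calls in hexadecimal.
-e signal=set
Trace only the listed signals. The default is all. For example, signal=!SIGIO (or signal=!io) means SIGIO is not traced.
-e read=set
Dump the data read from the listed file descriptors. For example:
-e read=3,5
-e write=set
Dump the data written to the listed file descriptors.
-o filename
Write the strace output to the file filename.
-p pid
Trace the process with the given pid.
-s strsize
Set the maximum length of strings to print. The default is 32. File names are always printed in full.
-u username
Run the traced command with the UID and GID of username.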

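Anyone who has done development on Linux will know this command well. Enough small talk; back to the main topic. Today we mainly use the -p and -o options to trace Oracle processes.
First, I trace the lgwr process as follows: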

[root@ludatou ~]# ps -ef | grep lgwr
root     10136  9578  0 06:30 pts/1    00:00:00 grep lgwr
ora10g   32496     1  0 Apr15 ?        00:00:04 ora_lgwr_lu11r2
[root@ludatou ~]# export ORACLE_SID=lu11r2
[root@ludatou ~]# strace -p $(pgrep -fx ora_lgwr_$ORACLE_SID) -o /tmp/l.out -T &
[1] 10257
[root@ludatou ~]# Process 32496 attached - interrupt to quit

At this point the log file l.out looks roughly like this (my database is idle):

gettimeofday({1397602111, 401986}, NULL) = 0 <0.000032>
gettimeofday({1397602111, 402111}, NULL) = 0 <0.000032>
gettimeofday({1397602111, 402233}, NULL) = 0 <0.000032>
gettimeofday({1397602111, 402356}, NULL) = 0 <0.000031>
gettimeofday({1397602111, 402510}, NULL) = 0 <0.000032>
gettimeofday({1397602111, 402537}, NULL) = 0 <0.000011>
gettimeofday({1397602111, 402551}, NULL) = 0 <0.000007>
gettimeofday({1397602111, 402560}, NULL) = 0 <0.000005>
times(NULL)                             = 436324232 <0.000002>
gettimeofday({1397602111, 402570}, NULL) = 0 <0.000003>
gettimeofday({1397602111, 402573}, NULL) = 0 <0.000002>
semtimedop(131073, 0xbfa12088, 1, {3, 0}
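
With -T every completed call ends in its elapsed time in angle brackets, which makes the trace easy to summarize. The following Python sketch counts calls and adds up the time per system call; it assumes the simple one-line format shown above and skips unfinished lines such as the trailing semtimedop. The name `summarize` is made up for this example.

```python
import re
from collections import defaultdict

# One completed call per line, e.g.:
#   gettimeofday({1397602111, 401986}, NULL) = 0 <0.000032>
LINE_RE = re.compile(
    r"^(?P<name>\w+)\(.*\)\s+=\s+(?P<ret>\S+).*<(?P<secs>\d+\.\d+)>$"
)


def summarize(lines):
    """Return {syscall: (call_count, total_seconds)} for completed calls only."""
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # unfinished or unrecognized line
        entry = stats[m.group("name")]
        entry[0] += 1
        entry[1] += float(m.group("secs"))
    return {name: (count, total) for name, (count, total) in stats.items()}


# Lines copied from the l.out excerpt in this article.
sample = [
    "gettimeofday({1397602111, 401986}, NULL) = 0 <0.000032>",
    "gettimeofday({1397602111, 402111}, NULL) = 0 <0.000032>",
    "times(NULL)                             = 436324232 <0.000002>",
    "semtimedop(131073, 0xbfa12088, 1, {3, 0}",
]
for name, (count, total) in summarize(sample).items():
    print(name, count, f"{total:.6f}")
```

strace's own -c option produces a similar summary directly; the sketch is only useful when a -T trace has already been captured to a file.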

The database is running normally at this point: reads, writes and commits all work, as shown below:

SQL> conn luda/luda
Connected.
SQL> create table t1 as select * from dba_objects;

Table created.

SQL> alter table t1 add ludatou varchar2(200);

Table altered.

SQL> update t1 set ludatou=250;

50068 rows updated.

SQL> commit;

Commit complete.

Next, I kill the strace process that is tracing lgwr and check whether the database still runs normally:

1. Kill the tracing process

[root@ludatou ~]# kill %1
[root@ludatou ~]# Process 32496 detached
-- strace reports that it has detached from process 32496 (lgwr)

2. Run the following in the database
SQL> update t1 set ludatou=250;

50068 rows updated.

SQL> commit;

At this step you will find that the commit never returns. The database-wide wait event is log file sync. Checking the state of the lgwr process now shows:

[ora11g@ludatou ~]$ ps $(pgrep -fx ora_lgwr_$ORACLE_SID)
  PID TTY      STAT   TIME COMMAND
32496 ?        Ts     0:00 ora_lgwr_lu11r2

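Process 32496 is in state Ts, which means it is stopped or being controlled (traced) by another process, even though I have already stopped strace. Let's wake it up: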

kill -SIGCONT 32496
[ora11g@ludatou ~]$ ps $(pgrep -fx ora_lgwr_$ORACLE_SID)
  PID TTY      STAT   TIME COMMAND
32496 ?        Ss     0:00 ora_lgwr_lu11r2

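The process now shows state Ss and the database returns to normal. I am not sure whether this counts as a database bug or a Linux bug. I lean toward Linux: I had already exited strace, yet in effect strace still seemed to be attached to the lgwr process. I ran this test only on 11.2.0.2, 11.1.0.7 and 10.2.0.4, and so far only 11.2.0.2 shows the behavior.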
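
The check done by hand above (read the STAT column, then send SIGCONT) can be scripted. This is a hedged Python sketch: it parses `ps <pid>` output in the five-column layout shown in this article and treats a leading T (or t on newer procps) as "stopped or traced". The helper names are invented here, and `resume` is defined but deliberately not called.

```python
import os
import signal


def parse_ps(output):
    """Parse `ps <pid>` output with columns PID TTY STAT TIME COMMAND."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header line
        pid, tty, stat, cpu_time, command = line.split(None, 4)
        rows.append({"pid": int(pid), "stat": stat, "command": command})
    return rows


def is_stopped(stat):
    """True when the first STAT letter marks a stopped or traced process."""
    return stat[:1] in ("T", "t")


def resume(pid):
    """Equivalent of `kill -SIGCONT <pid>`; needs permission to signal the process."""
    os.kill(pid, signal.SIGCONT)


# Output copied from the session in this article.
sample = """  PID TTY      STAT   TIME COMMAND
32496 ?        Ts     0:00 ora_lgwr_lu11r2
"""
for row in parse_ps(sample):
    if is_stopped(row["stat"]):
        print("stopped:", row["pid"], row["command"])
```

Sending SIGCONT to a background process on a production system is a last resort; the sketch only shows how the state can be detected.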

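Diagnosing ORA-12545 / TNS-12545: A Brief Walkthrough

I recently ran into ORA-12545 right after a RAC installation, so I am taking the opportunity to write down how I diagnose it. The error has been fairly common since 10g. This post only shares some experience-based thoughts on the diagnostic approach, written mostly for fun.

The 12545 error message reads:

ORA-12545 / TNS-12545 Connect failed because target host or object does not exist

Let's dig in. This kind of error is usually caused by incorrect hostname-related configuration. For example: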

[ora10g@ludatou ~]$ tnsping lu10g
TNS Ping Utility for Linux: Version 10.2.0.4.0 - Production on 15-APR-2014 12:08:27
Copyright (c) 1997,  2007, Oracle.  All rights reserved.
Used parameter files:

Used TNSNAMES adapter to resolve the alias
Attempting to contact (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = ludatou)(PORT = 1523)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = lu10g)))
OK (10 msec)

[ora10g@ludatou ~]$ sqlplus luda/luda@lu10g
SQL*Plus: Release 10.2.0.4.0 - Production on Tue Apr 15 12:12:15 2014
Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

===== My tnsnames entry resolves the host name ludatou, and the login over the network succeeds
[ora10g@ludatou ~]$ cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
192.168.102.128         ludatou   luda
===== Now I change the host name ludatou in /etc/hosts to ludatouxx; the result is as follows
[ora10g@ludatou ~]$ cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
192.168.102.128         ludatouxx   luda
 

Logging in again remotely from a client (another host) fails with an error after a long wait:

[root@ludatou ~]# su - oracle
[oracle@ludatou ~]$ sqlplus luda/luda@lu10g
SQL*Plus: Release 11.1.0.7.0 - Production on Tue Apr 15 12:13:40 2014
Copyright (c) 1982, 2008, Oracle.  All rights reserved.

ERROR:
ORA-12545 Connect failed because target host or object does not exist
 

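The rest of this post describes how to diagnose this kind of failure, scenario by scenario.
Scenario 1:
When a connection does not go through the listener but uses the BEQ protocol instead, users connecting via BEQ hit ORA/TNS-12545 if the oracle executable under $ORACLE_HOME/bin is missing or corrupt, if the user lacks execute permission on it, or if ORACLE_HOME points to the wrong path (for example, when several versions are installed).
The BEQ protocol can be described as follows: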

This is the bequeath oracle process. The bequeath process starts first. Later all the processes related to a particular database are controlled by its bequeath process.

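Scenario 2:
This is the scenario I tested above: a client connects to the server-side Oracle database through the listener. If the host name in the client's tnsnames entry for the service does not match the server's host name, the connection fails with ORA-12545.
In this scenario you can use nslookup to look up the host. Following the example above, looking up ludatou gives: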

$ nslookup ludatou
Server: 192.168.102.128
Address: 192.168.1.102#53
** server can't find ludatou: ==>Indicates ludatou is not resolvable on the machine

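The fix in this scenario is to put the IP address in the host part of the client's tnsnames entry, so that a host name change on the server cannot break the configuration.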
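
The same lookup can be done from a script on the client, which is handy when many aliases need checking. The Python sketch below wraps the standard `socket.getaddrinfo` call; the `resolve` helper, the default port 1521 and the injectable `resolver` parameter (used only to make the function easy to test) are assumptions of this example.

```python
import socket


def resolve(host, port=1521, resolver=socket.getaddrinfo):
    """Return the sorted IP addresses for host, or [] when name resolution fails."""
    try:
        infos = resolver(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # "ludatou" is the host name used in this article's tnsnames entry.
    addresses = resolve("ludatou", 1523)
    print(addresses or "ludatou is not resolvable from this machine")
```

Unlike nslookup, `getaddrinfo` follows the system resolver order, so it also sees entries in /etc/hosts, which is closer to what the Oracle client itself does.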

Scenario 3:
See the post about hitting ORA-12545 right after a RAC deployment.

Cause: One of the hostnames (which corresponds to a public IP or VIP) is not reachable from this client machine.
When server-side load balancing is enabled in the RAC setup, the listener redirects the connection to the least loaded node. While doing so, the server sends the NSPTRD packet containing the hostname of the corresponding machine.

The Remote Service Handler value registered with the remote listener process via the REMOTE_LISTENER parameter is built from the LOCAL_LISTENER value on the local server. So it is necessary to check whether the local_listener / remote_listener information is reflected properly in the listener services output as well.

Diagnosis:
Enable Oracle sqlnet client tracing at support level and reproduce the issue. In the generated client trace, you would see the information below:
After NSPTCN Connect to the listener , listener sends

Scenario 4:
A wrong host name in listener.ora prevents the listener from finding the correct listening address and also produces ORA/TNS-12545. The listener log shows errors like:

Error listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=abc)(PORT=1522)))
TNS-12545: Connect failed because target host or object does not exist
TNS-12560: TNS:protocol adapter error
TNS-00515: Connect failed because target host or object does not exist

With sqlnet tracing enabled, you will see errors similar to the following:

nsc2addr: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=abc)(PORT=1522)))
nttbnd2addr: entry
snlinGetAddrInfo: entry
snlinGetAddrInfo: Invalid IP address string abc
snlinFreeAddrInfo: entry
snlinFreeAddrInfo: exit
snlinGetAddrInfo: exit
nttbnd2addr: looking up IP addr for host: abc
snlinGetAddrInfo: entry
snlinGetAddrInfo: Name resolution failed for abc
snlinFreeAddrInfo: entry
snlinFreeAddrInfo: exit
snlinGetAddrInfo: exit
nttbnd2addr: *** hostname lookup failure! ***
nttbnd2addr: exit
nserror: entry
nserror: nsres: id=0, op=78, ns=12545, ns2=12560; nt[0]=515, nt[1]=0, nt[2]=0; ora[0]=0, ora[1]=0, ora[2]=0

The fix for this kind of problem is to check that the listener configuration matches the host's IP address and hostname; a mismatch causes the 12545 error.
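
To compare a listener or tnsnames configuration against the host, it helps to first pull every HOST/PORT pair out of the descriptors. This Python sketch uses a simple regular expression and assumes that (HOST=...) is immediately followed by (PORT=...), as in all the descriptors quoted in this article; `endpoints` is a hypothetical helper name.

```python
import re

# Assumes the (HOST=...)(PORT=...) order used by the descriptors in this article.
ADDRESS_RE = re.compile(
    r"\(\s*HOST\s*=\s*([^)\s]+)\s*\)\s*\(\s*PORT\s*=\s*(\d+)\s*\)",
    re.IGNORECASE,
)


def endpoints(descriptor):
    """Return a list of (host, port) pairs found in a connect descriptor."""
    return [(host, int(port)) for host, port in ADDRESS_RE.findall(descriptor)]


# Descriptor copied from the listener log shown above.
listener_line = "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=abc)(PORT=1522)))"
print(endpoints(listener_line))
```

Each returned host name can then be checked with the operating system's name resolution to spot entries such as `abc` that do not resolve.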

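Scenario 5:
Operating system or hardware problems can also cause ORA/TNS-12545.
5.1 Windows 2000 with Service Pack 3 can cause ORA/TNS-12545
http://support.microsoft.com/default.aspx?scid=kb;en-us;329405
5.2 A misconfigured Solaris system can also cause ORA/TNS-12545

The scenarios above classify the causes of the 12545 error; the general diagnostic approach follows the same lines.

Of course, this covers only 12545, and there are many other Oracle network errors. When diagnosing them, my usual routine is:
1. Check listener.log for error messages
2. Check syslog for device or system errors
3. Check that $ORACLE_HOME/bin/oracle exists and has the correct permissions
4. Check that the tnsnames, sqlnet and listener configuration matches the host
5. In a RAC environment, first check connectivity between the nodes and the client and whether the network configuration files are correct; only then consider the effect of TAF and load balancing
6. Enable tracing when necessary to get more detail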

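ORA-12545 When Clients Connect to the Database After a RAC Installation

Recently a friend finished installing a RAC cluster, and clients connecting to the RAC database got this error:

ORA-12545: Connect failed because target host or object does not exist.

I won't explain how RAC listeners work here. ORA-12545 almost always comes down to the host setting in the connect descriptor in tnsnames. On a freshly installed RAC, the usual cause is that the node host names were used as the host in the descriptor, whereas clients, for continuity and redundancy, should use the VIP addresses or the VIP host names in their connect strings. The mismatch between the two produces 12545.

It can also happen that VIPs are configured but the client cannot reach a server VIP. With load balancing enabled, the listener hands new connection requests to the less loaded node, and if that node happens to be unreachable from the client, the result is again 12545.

That gives us a line of attack. When you hit this error on RAC, first check whether local_listener and remote_listener on the instance point to a wrong IP address or host name; on a newly installed RAC they usually point to localhost. Once the cause is clear, so is the fix: set each instance's local_listener to its VIP and have clients resolve through the VIPs. Also review the RAC configuration and whether TAF and LOAD BALANCE are enabled.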

This usually shows up right after a RAC installation. Tracing the failure at trace level 16 produces log entries like the following:

[05-APR-2014 11:32:55] nttbnd2addr: looking up IP addr for host: myhost.oracle.com

[05-APR-2014 11:32:55] nttbnd2addr: *** hostname lookup failure! ***
[05-APR-2014 11:32:55] nttbnd2addr: exit
[05-APR-2014 11:32:55] nserror: nsres: id=0, op=77, ns=12545, ns2=12560; nt[0]=515, nt[1]=145,
nt[2]=0; ora[0]=0, ora[1]=0, ora[2]=0
[05-APR-2014 11:32:55] nsmfr: 207 bytes at 0xf2a18
[05-APR-2014 11:32:55] nsmfr: 140 bytes at 0xef078
[05-APR-2014 11:32:55] nladtrm: entry
[05-APR-2014 11:32:55] nladtrm: exit
[05-APR-2014 11:32:55] nioqper: error from nscall
[05-APR-2014 11:32:55] nioqper: nr err code: 0
[05-APR-2014 11:32:55] nioqper: ns main err code: 12545
[05-APR-2014 11:32:55] nioqper: ns (2) err code: 12560
[05-APR-2014 11:32:55] nioqper: nt main err code: 515
[05-APR-2014 11:32:55] nioqper: nt (2) err code: 145
[05-APR-2014 11:32:55] nioqper: nt OS err code: 0
[05-APR-2014 11:32:55] niomapnserror: entry
[05-APR-2014 11:32:55] niqme: entry
[05-APR-2014 11:32:55] niqme: reporting NS-12545 error as ORA-12545
[05-APR-2014 11:32:55] niqme: exit
[05-APR-2014 11:32:55] niomapnserror: returning error 12545
[05-APR-2014 11:32:55] niomapnserror: exit
[05-APR-2014 11:32:55] niotns: Couldn't connect, returning 12545
...
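
In a long trace the `nserror: nsres:` line is the quickest pointer to the error stack: in the listener log quoted earlier in this article, ns=12545, ns2=12560 and nt[0]=515 line up with TNS-12545, TNS-12560 and TNS-00515. The Python sketch below extracts those numbers from one trace line; `nserror_codes` is a name invented for this example, and codes that wrap onto the next line of the trace are not handled.

```python
import re

CODE_RE = re.compile(r"\b(op|ns2?|nt\[\d\])=(\d+)")


def nserror_codes(line):
    """Return {'op': .., 'ns': .., 'ns2': .., 'nt[0]': ..} for an nserror line, else None."""
    if "nserror: nsres:" not in line:
        return None
    return {key: int(value) for key, value in CODE_RE.findall(line)}


# Line copied from the trace excerpt above.
line = ("[05-APR-2014 11:32:55] nserror: nsres: id=0, op=77, ns=12545, "
        "ns2=12560; nt[0]=515, nt[1]=145,")
codes = nserror_codes(line)
print(codes)
```

Scanning a whole trace file then reduces to calling the function on each line and keeping the non-None results.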

The correct local_listener settings look like this:

alter system set local_listener='(ADDRESS =(PROTOCOL=TCP)(HOST=vip_host1)(PORT=1521))' scope=both sid='luda1';
alter system set local_listener='(ADDRESS =(PROTOCOL=TCP)(HOST=vip_host2)(PORT=1521))' scope=both sid='luda2';

Clients then resolve through the VIPs:

LUDA =
  (DESCRIPTION =
     (ADDRESS = (PROTOCOL = TCP)(HOST = vip_host1)(PORT = 1521))
     (ADDRESS = (PROTOCOL = TCP)(HOST = vip_host2)(PORT = 1521))
     (LOAD_BALANCE=YES)
      (CONNECT_DATA=
        (SERVER=DEDICATED)
         (SERVICE_NAME=LUDA)
      )
  )