

Steps to Change the RAC Private Network Configuration, and Version Differences to Note (private network change)

Network information (the interface, subnet, and role of each interface) for Oracle Clusterware is managed by ‘oifcfg’, but the actual IP address of each interface is not; ‘oifcfg’ cannot update IP address information. ‘oifcfg getif’ can be used to find the interfaces currently configured in OCR:

% $CRS_HOME/bin/oifcfg getif
eth0 10.2.156.0 global public
eth1 192.168.0.0 global cluster_interconnect

On Unix/Linux systems, interface names are generally assigned by the OS, and the standard names vary by platform. For Windows systems, see the additional notes below. The example above shows that interface eth0 is currently used for the public network with subnet 10.2.156.0, and eth1 for the cluster_interconnect/private network with subnet 192.168.0.0.

The ‘public’ network is for database client communication (the VIP also uses this network, though it is stored in OCR as a separate entry), whereas the ‘cluster_interconnect’ network is for RDBMS/ASM cache fusion. Starting with 11gR2, the cluster_interconnect is also used for clusterware heartbeats; this is a significant change compared to prior releases, which used the private nodename specified at installation time for clusterware heartbeats.

If the subnet or interface name of the ‘cluster_interconnect’ interface is incorrect, it needs to be changed as the crs/grid user.

Case I. Changing private hostname

In pre-11.2 Oracle Clusterware, the private hostname is recorded in OCR and cannot be updated. Generally the private hostname does not need to change, and its associated IP can be changed. The only ways to change the private hostname are deleting/adding nodes, or reinstalling Oracle Clusterware.

In 11.2 Grid Infrastructure, the private hostname is no longer recorded in OCR and there is no dependency on it; it can be changed freely in /etc/hosts.

Case II. Changing private IP only without changing network interface, subnet and netmask

For example, the private IP is changed from 192.168.1.10 to 192.168.1.21 while the network interface name and subnet remain the same.

Simply shut down the Oracle Clusterware stack on the node where the change is required, make the IP modification at the OS layer (e.g. /etc/hosts, OS network configuration) for the private network, then restart the Oracle Clusterware stack to complete the task.
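As a sketch, the /etc/hosts edit in the middle step might look like the following. The helper name and addresses are illustrative, not from the original note, and this assumes the usual hosts-file layout where the address starts the line:

```shell
# Minimal sketch: swap a private IP for its new value in an
# /etc/hosts-style file. Run only while the Clusterware stack on
# this node is down (crsctl stop crs).
update_private_ip() {
  hosts_file=$1 old_ip=$2 new_ip=$3
  # Match the address only at the start of a line, keeping the whitespace
  # after it. (Dots in the IP act as regex wildcards; fine for a sketch.)
  sed "s/^${old_ip}\([[:space:]]\)/${new_ip}\1/" "$hosts_file" > "$hosts_file.tmp" &&
    mv "$hosts_file.tmp" "$hosts_file"
}
```

The surrounding procedure is unchanged: stop the stack, edit, restart the stack.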

Case III. Changing private network MTU only

For example, private network MTU is changed from 1500 to 9000 (enable jumbo frame), network interface name and subnet remain the same.

1. Shutdown Oracle Clusterware stack on all nodes
2. Make the required MTU size change at the OS network layer, and ensure the private network is available with the desired MTU size and that a ping with the desired packet size works across all cluster nodes
3. Restart Oracle Clusterware stack on all nodes
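For the verification in step 2, the usual check is a non-fragmenting ping whose payload exactly fills the new MTU: the ICMP payload is the MTU minus 28 bytes (20-byte IP header plus 8-byte ICMP header). A sketch, assuming Linux ping flags (`-M do` forbids fragmentation; other platforms use different options):

```shell
# ICMP payload that exactly fills one frame at a given MTU:
# MTU minus 20 bytes (IP header) minus 8 bytes (ICMP header).
icmp_payload() {
  echo $(( $1 - 28 ))
}

# Illustrative check for MTU 9000, run between cluster nodes:
#   ping -M do -s $(icmp_payload 9000) -c 3 <peer-private-ip>
```

If the oversized ping fails while a plain ping works, the jumbo-frame MTU is not configured end to end (host NICs and switches).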

Case IV. Changing private network interface name, subnet or netmask

Note: when the netmask is changed but the subnet ID does not change, for example:
The netmask is changed from 255.255.0.0 to 255.255.255.0 with a private IP like 192.168.0.x, so the subnet ID remains 192.168.0.0 and the network interface name is unchanged.
Please follow the same procedure as outlined in Case II.
When the netmask is changed, the associated subnet ID usually changes as well. Oracle stores only the network interface name and subnet ID in OCR, not the netmask. The oifcfg command can be used for such a change; oifcfg commands only need to be run on one of the cluster nodes, not all of them.
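Since OCR stores only the interface name and subnet ID, it helps to be able to derive the subnet ID from an IP address and netmask: it is simply the bitwise AND of the two, octet by octet. A small sketch (the function name is illustrative):

```shell
# Compute the subnet ID (network address) as the bitwise AND of an
# IP address and a netmask, octet by octet.
subnet_id() {
  ip=$1 mask=$2
  oldIFS=$IFS
  IFS=.
  set -- $ip
  i1=$1 i2=$2 i3=$3 i4=$4
  set -- $mask
  m1=$1 m2=$2 m3=$3 m4=$4
  IFS=$oldIFS
  echo "$(( i1 & m1 )).$(( i2 & m2 )).$(( i3 & m3 )).$(( i4 & m4 ))"
}
```

For example, with private IP 192.168.1.10, changing the netmask from 255.255.0.0 to 255.255.255.0 moves the subnet ID from 192.168.0.0 to 192.168.1.0 (the Case IV situation), whereas with 192.168.0.10 the subnet ID stays 192.168.0.0 (the Case II situation).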

A. For pre-11gR2 Oracle Clusterware

1. Use oifcfg to add the new private network information and delete the old private network information:

% $ORA_CRS_HOME/bin/oifcfg setif -global <if_name>/<subnet>:cluster_interconnect
% $ORA_CRS_HOME/bin/oifcfg delif -global <if_name>[/<subnet>]

For example:
% $ORA_CRS_HOME/bin/oifcfg setif -global eth3/192.168.2.0:cluster_interconnect
% $ORA_CRS_HOME/bin/oifcfg delif -global eth1/192.168.1.0

To verify the change
% $ORA_CRS_HOME/bin/oifcfg getif
eth0 10.2.166.0 global public
eth3 192.168.2.0 global cluster_interconnect

2. Shutdown Oracle Clusterware stack

As root user: # crsctl stop crs

3. Make the required network change at the OS level; the /etc/hosts file should be modified on all nodes to reflect the change.
Ensure the new network is available on all cluster nodes:

% ping
% ifconfig -a (on Unix/Linux)
or
% ipconfig /all (on Windows)

4. Restart the Oracle Clusterware stack

As root user: # crsctl start crs

Note: If running OCFS2 on Linux, one may also need to change the private IP address that OCFS2 is using to communicate with other nodes. For more information, please refer to Note 604958.1

B. For 11gR2 and higher

As of 11.2 Grid Infrastructure, the private network configuration is stored not only in OCR but also in the gpnp profile. If the private network is not available or its definition is incorrect, the CRSD process will not start and any subsequent changes to the OCR will be impossible. Therefore care needs to be taken when modifying the private network configuration, and it is important to perform the changes in the correct order. Please also note that manual modification of the gpnp profile is not supported.

Before proceeding, please take a backup of profile.xml on all cluster nodes as the grid user:

$ cd $GRID_HOME/gpnp//profiles/peer/
$ cp -p profile.xml profile.xml.bk

1. Ensure Oracle Clusterware is running on ALL cluster nodes in the cluster

2. As grid user:

Get the existing information. For example:
$ oifcfg getif
eth1 100.17.10.0 global public
eth0 192.168.0.0 global cluster_interconnect

Add the new cluster_interconnect information:

$ oifcfg setif -global <interface>/<subnet>:cluster_interconnect

For example:
a. add a new interface bond0 with the same subnet
$ oifcfg setif -global bond0/192.168.0.0:cluster_interconnect

b. add a new subnet with the same interface name, or a new interface name with its subnet
$ oifcfg setif -global eth0/192.65.0.0:cluster_interconnect
or
$ oifcfg setif -global eth3/192.168.1.96:cluster_interconnect

1. This can be done with the -global option even if the interface is not available yet, but it cannot be done with the -node option if the interface is not available, as that will lead to a node eviction.

2. If the interface is available on the server, its subnet address can be identified with the command:

$ oifcfg iflist

It lists each network interface and its subnet address. This command can be run even if Oracle Clusterware is not running. Please note that the subnet address might not be of the form x.y.z.0; it can be x.y.z.24, x.y.z.64, x.y.z.128, etc. For example:
$ oifcfg iflist
lan1 18.1.2.0
lan2 10.2.3.64 << this is the private network subnet address associated with the private network IP 10.2.3.86

3. If this is to add a second private network rather than replace the existing one, please ensure the MTU size of both interfaces is the same; otherwise instance startup will report the error:

ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if MTU failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpcini2
ORA-27303: additional information: requested interface lan1:801 has a different MTU (1500) than lan3:801 (9000), which is not supported. Check output from ifconfig command

Verify the change:

$ oifcfg getif

3. Shut down Oracle Clusterware on all nodes and disable it, as the root user:

# crsctl stop crs
# crsctl disable crs

4. Make the network configuration change at the OS level as required, and ensure the new interface is available on all nodes after the change:

$ ifconfig -a
$ ping

5. Enable Oracle Clusterware and restart Oracle Clusterware on all nodes as root user:

# crsctl enable crs
# crsctl start crs

6. Remove the old interface if required:

$ oifcfg delif -global <if_name>[/<subnet>]
e.g.:
$ oifcfg delif -global eth0/192.168.0.0

Something to note for 11gR2

1. If the underlying network configuration has been changed but oifcfg has not been run to make the matching change, then upon Oracle Clusterware restart CRSD will not be able to start.

The crsd.log will show:

2010-01-30 09:22:47.234: [ default][2926461424] CRS Daemon Starting
..
2010-01-30 09:22:47.273: [ GPnP][2926461424]clsgpnp_Init: [at clsgpnp0.c:837] GPnP client pid=7153, tl=3, f=0
2010-01-30 09:22:47.282: [ OCRAPI][2926461424]clsu_get_private_ip_addresses: no ip addresses found.
2010-01-30 09:22:47.282: [GIPCXCPT][2926461424] gipcShutdownF: skipping shutdown, count 2, from [ clsinet.c : 1732], ret gipcretSuccess (0)
2010-01-30 09:22:47.283: [GIPCXCPT][2926461424] gipcShutdownF: skipping shutdown, count 1, from [ clsgpnp0.c : 1021], ret gipcretSuccess (0)
[ OCRAPI][2926461424]a_init_clsss: failed to call clsu_get_private_ip_addr (7)
2010-01-30 09:22:47.285: [ OCRAPI][2926461424]a_init:13!: Clusterware init unsuccessful : [44]
2010-01-30 09:22:47.285: [ CRSOCR][2926461424] OCR context init failure. Error: PROC-44: Error in network address and interface operations Network address and interface operations error [7]
2010-01-30 09:22:47.285: [ CRSD][2926461424][PANIC] CRSD exiting: Could not init OCR, code: 44
2010-01-30 09:22:47.285: [ CRSD][2926461424] Done.
The above errors indicate a mismatch between the OS setting (oifcfg iflist) and the gpnp profile setting (profile.xml).

Workaround: restore the OS network configuration to its original state and start Oracle Clusterware, then follow the steps above to make the change again.

If the underlying network has not been changed but oifcfg setif was run with a wrong subnet address or interface name, the same issue will occur.

2. If any node in the cluster is down, the oifcfg command will fail with the error:

$ oifcfg setif -global bond0/192.168.0.0:cluster_interconnect
PRIF-26: Error in update the profiles in the cluster
Workaround: start Oracle Clusterware on the node where it is not running, and ensure Oracle Clusterware is up on all cluster nodes. If the node is down for an OS-level reason, please remove it from the cluster before performing the private network change.

3. If a user other than the Grid Infrastructure owner issues the above command, it will fail with the same error:

$ oifcfg setif -global bond0/192.168.0.0:cluster_interconnect
PRIF-26: Error in update the profiles in the cluster
Workaround: ensure you are logged in as the Grid Infrastructure owner to run the command.

4. From 11.2.0.2 onwards, attempting to delete the last private interface (cluster_interconnect) without adding a new one first results in the following error:

PRIF-31: Failed to delete the specified network interface because it is the last private interface
Workaround: add the new private interface before deleting the old one.

5. If Oracle Clusterware is down on the node, the following error is expected:

$ oifcfg getif
PRIF-10: failed to initialize the cluster registry
Workaround: start Oracle Clusterware on the node.

Notes for Windows Systems

The syntax for changing interfaces on Windows RAC clusters is the same as on Unix/Linux, but the interface names will differ. On Windows systems, the default names assigned to the interfaces generally look like:

Local Area Connection
Local Area Connection 1
Local Area Connection 2

If an interface name contains spaces, the name must be enclosed in quotes. Also be aware that names are case sensitive. For example, on Windows, to set the cluster_interconnect:

C:\oracle\product\10.2.0\crs\bin\oifcfg setif -global "Local Area Connection 1"/192.168.1.0:cluster_interconnect
However, it is best practice on Windows to rename the interfaces to something more meaningful, such as ‘ocwpublic’ and ‘ocwprivate’. If interfaces are renamed after Oracle Clusterware is installed, you will need to run ‘oifcfg’ to add the new interface and delete the old one, as described above.

You can view the available interface names on each node by running the command:

oifcfg iflist -p -n
This command must be run on each node to verify that the interface names are defined identically.

Ramifications of Changing Interface Names Using oifcfg

For the private interface, the database uses the interface stored in the OCR and defined as ‘cluster_interconnect’ for cache fusion traffic. The cluster_interconnect information is shown in the alert log at startup, after the parameter listing. For example:

For pre 11.2.0.2:
Cluster communication is configured to use the following interface(s) for this instance
192.168.1.1

For 11.2.0.2+ (the HAIP address appears in the alert log instead of the private IP):
Cluster communication is configured to use the following interface(s) for this instance
169.254.86.97
If this is incorrect, the instance must be restarted once the OCR entry is corrected. This applies to ASM instances and database instances alike. On Windows systems, after shutting down the instance it is also necessary to stop and restart the OracleService (or OracleASMService) before the OCR will be re-read.
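A quick way to confirm which address an instance actually picked up is to pull the line after that banner out of the alert log. A sketch (the function name and log path are illustrative):

```shell
# Print the address line(s) that follow the cache-fusion banner
# "Cluster communication is configured ..." in an alert log.
interconnect_ips() {
  awk '/Cluster communication is configured/ {grab=1; next}
       grab {print; grab=0}' "$1"
}

# Example: interconnect_ips /path/to/alert_ORCL1.log
```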

Oifcfg Usage

To see the full options of oifcfg, simply type:

$ $ORA_CRS_HOME/bin/oifcfg

 

ASM Instance Reports ORA-04031

ORA-04031 is an old chestnut: insufficient shared pool. The ASM instance has a shared pool too, and when it runs short of memory it also raises ORA-04031. What do you do when you hit this?

A customer ran into this error today. The default ASM instance SGA is around 272M, which is normally plenty for ASM, but this customer had deployed an ASM-monitoring script on the node, and after running for over a year the problem surfaced. The initial assessment is that the ASM shared pool had cached too many SQL cursors from that script (many parsed versions), leading to today's failure. There are two ways to handle the situation:

 

1. Increase the SGA size

In 10g, modify sga_max_size (requires an ASM instance restart).

In 11g, modify memory_max_target and memory_target (requires an ASM instance restart).

 

2. Flush the shared pool

Flushing the shared pool resolves ORA-04031, but it will have an impact on the database instances.

 

The command is:

alter system flush shared_pool;

For details, refer to MOS note 1370925.1.

A Brief Analysis of Rebootless Restart in 11.2.0.2

During cluster testing at a bank today we ran into a problem concerning the rebootless restart feature introduced in 11.2.0.2, and along the way the feature came to feel somewhat half-baked.
The problem: on several clusters we pulled the fibre channel cable, or the network cable, on a single node to test whether the host would reboot. On one cluster, after the fibre cable was pulled, the clusterware did not reboot the node. This stumped our whole technical team. Oracle's official DBAs, whom we pestered endlessly, could not give a reasonable explanation either; they eventually pushed the blame onto AIX, saying AIX had killed the hung ocssd process, which is why the other clusters rebooted, and that not rebooting is actually the correct behavior. I felt like I was being sold a story, so I analyzed the feature as follows; corrections are welcome.

Overview:

1. What is the rebootless restart feature in 11.2.0.2 for? Reducing the number of OS reboots. OK.
Let us first pose a few questions:
1.1 When is this feature triggered?
1.2 When is it ineffective (not triggered)?
1.3 Once triggered, if the feature cannot achieve its goal, will the OS be rebooted after all? And is there a limit on the number of attempts?

2. In releases prior to 11.2.0.2, typical reboot scenarios include:
2.1 The network heartbeat exceeds the misscount setting. If network heartbeat communication fails and misscount is exceeded, the cluster splits (split-brain). Two cases follow: if the sub-clusters have unequal node counts, the smaller sub-cluster reboots within reboottime; if the sub-clusters are equal in size, the sub-cluster containing the larger node id reboots within reboottime.
2.2 A node's voting disk I/O exceeds disktimeout (the behavior with 2 nodes differs slightly from more nodes); CSS then reboots that node's OS within reboottime.
2.3 A member kill escalates to a node kill.

Officially, all three scenarios above are supposed to be avoided in 11.2.0.2, with the cssd process doing the work: GI restarts the clusterware stack, and cssd notifies the other processes to stop the resources and the processes with I/O capability. Clearly, if the cssd process hangs, or the host is so starved of resources that the commands issued by ocssd.bin are severely delayed, the rebootless restart feature is seriously compromised.

3. Before restarting the clusterware stack, GI first performs a graceful shutdown of it. The basic steps are:

1. Stop all heartbeats on the local node (network heartbeat, disk heartbeat and local heartbeat).
2. Notify the cssd agent that ocssd.bin is about to stop.
3. Stop all processes registered with CSS that have I/O capability, for example lmon.
4. cssd notifies crsd to stop all resources; if crsd cannot stop all of them, a node reboot still occurs.
5. cssd waits for all I/O-capable processes to exit; if they do not all exit within the short I/O timeout, a node reboot still occurs.
6. Notify the cssd agent that all I/O-capable processes have exited.
7. Ohasd restarts the clusterware stack.
8. The local node notifies the other nodes to perform cluster reconfiguration.
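This graceful-shutdown path can be traced in the cluster alert log by its CRS message numbers: CRS-1652/CRS-1654 for the resource cleanup, CRS-1655/CRS-1660 for the CSS daemon shutdown, and CRS-1713 for the subsequent CSSD restart. A small sketch (the function name and log path are illustrative):

```shell
# Pull the graceful-shutdown markers out of a cluster alert log:
#   CRS-1652/CRS-1654 - CRSD resource cleanup started/finished
#   CRS-1655/CRS-1660 - CSS daemon shutdown started/completed
#   CRS-1713          - CSSD restarted by the stack
trace_graceful_shutdown() {
  grep -E 'CRS-1652|CRS-1654|CRS-1655|CRS-1660|CRS-1713' "$1"
}

# Example: trace_graceful_shutdown /grid/database/11.2.0/log/fxdb1/alertfxdb1.log
```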

4. Analysis

Analysis 1: the case where the interconnect network is cut (TEST14):
As described above, once misscount is exceeded the host is not rebooted; instead GI restarts the clusterware stack, which brings us back to section 3 and the graceful shutdown procedure.
In practically every case (based on our test results) some resources cannot be stopped, or some I/O-capable processes are not terminated successfully, so the node reboots anyway. (Across all of the bank's clusters, more than ten two-node clusters, pulling the interconnect cable always killed node 2; that was the pattern.)
And suppose the cleanup does succeed (which disconnects every client connection): GI then restarts the clusterware stack, but with the interconnect still unplugged the peer node remains unreachable, so presumably it just keeps retrying the heartbeat, i.e. the node never reboots. Isn't that also a form of service loss?

Analysis 2: the case where the fibre channel cable is cut (TEST18, 19):
In theory neither OCR nor the voting disks are accessible; with both unreachable, what does the clusterware do?
First, OCR becomes inaccessible. The crsd process on the node should then be terminated because it cannot access OCR (a document from abroad says crsd aborts when OCR is inaccessible, though reality differs slightly). Next, ocssd asks ohasd to retry starting crsd up to 10 times; what if it still fails? As described above, cssd has crsd stop all resources first, and when the resources cannot be stopped, the host is forced to reboot. If that were the final answer I would not have written this much, and here is the puzzling part: on node 1, crsd did not in fact abort; it was gracefully shut down by GI only after it had stopped the related resources. On node 2, however, the very same operation made crsd abort outright; ohasd's 10 attempts to restart crsd failed, the resources could not be stopped, and the node rebooted.

The detailed log analysis follows.
Next, voting disk access: once disktimeout is exceeded, GI attempts to restart the clusterware stack, terminating all I/O-capable processes and resource processes along the way. If the cleanup succeeds, GI restarts the stack and the cssd process repeatedly scans for the voting files. If the cleanup fails, the host reboots.

2013-09-24 11:06:57.229
[cssd(2031962)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/rhdiskpower0 will be considered not functional in 99149 milliseconds — an I/O error has occurred; the countdown to the timeout starts here
2013-09-24 11:07:33.995
[cssd(2031962)]CRS-1649:An I/O error occured for voting file: /dev/rhdiskpower0; details at (:CSSNM00059:) in /grid/database/11.2.0/log/fxdb1/cssd/ocssd.log.
2013-09-24 11:07:34.364
[/grid/database/11.2.0/bin/oraagent.bin(5046282)]CRS-5011:Check of resource “fxdb” failed: details at “(:CLSN00007:)” in “/grid/database/11.2.0/log/fxdb1/agent/crsd/oraagent_oracle/oraagent_oracle.log”
2013-09-24 11:07:35.000
[cssd(2031962)]CRS-1649:An I/O error occured for voting file: /dev/rhdiskpower0; details at (:CSSNM00060:) in /grid/database/11.2.0/log/fxdb1/cssd/ocssd.log.
2013-09-24 11:07:47.043
[cssd(2031962)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/rhdiskpower0 will be considered not functional in 49335 milliseconds
2013-09-24 11:08:17.318
[cssd(2031962)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file /dev/rhdiskpower0 will be considered not functional in 19060 milliseconds
2013-09-24 11:08:36.449
[cssd(2031962)]CRS-1604:CSSD voting file is offline: /dev/rhdiskpower0; details at (:CSSNM00058:) in /grid/database/11.2.0/log/fxdb1/cssd/ocssd.log.
2013-09-24 11:08:36.449
[cssd(2031962)]CRS-1606:The number of voting files available, 0, is less than the minimum number of voting files required, 1, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /grid/database/11.2.0/log/fxdb1/cssd/ocssd.log – the voting file going offline terminates the cssd process
2013-09-24 11:08:36.450
[cssd(2031962)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /grid/database/11.2.0/log/fxdb1/cssd/ocssd.log ————– the CSS daemon is being terminated
2013-09-24 11:08:36.501
[cssd(2031962)]CRS-1652:Starting clean up of CRSD resources. —————- cleanup of CRSD resources begins
2013-09-24 11:08:37.979
[/grid/database/11.2.0/bin/oraagent.bin(3080748)]CRS-5016:Process “/grid/database/11.2.0/opmn/bin/onsctli” spawned by agent “/grid/database/11.2.0/bin/oraagent.bin” for action “check” failed: details at “(:CLSN00010:)” in “/grid/database/11.2.0/log/fxdb1/agent/crsd/oraagent_grid/oraagent_grid.log”
2013-09-24 11:08:38.589
[/grid/database/11.2.0/bin/oraagent.bin(3080748)]CRS-5016:Process “/grid/database/11.2.0/bin/lsnrctl” spawned by agent “/grid/database/11.2.0/bin/oraagent.bin” for action “check” failed: details at “(:CLSN00010:)” in “/grid/database/11.2.0/log/fxdb1/agent/crsd/oraagent_grid/oraagent_grid.log”
2013-09-24 11:08:38.597
[cssd(2031962)]CRS-1654:Clean up of CRSD resources finished successfully. ——————– the CRSD resource cleanup is reported successful. What performed the cleanup? No crsd abort has been reported at this point, so it should still be the crsd process itself stopping the resources; Haonan's document says crsd would abort, but that has not been observed yet.
2013-09-24 11:08:38.598
[cssd(2031962)]CRS-1655:CSSD on node fxdb1 detected a problem and started to shutdown. ——— the shutdown begins
2013-09-24 11:08:38.684
[/grid/database/11.2.0/bin/oraagent.bin(3080748)]CRS-5822:Agent ‘/grid/database/11.2.0/bin/oraagent_grid’ disconnected from server. Details at (:CRSAGF00117:) {0:1:8} in /grid/database/11.2.0/log/fxdb1/agent/crsd/oraagent_grid/oraagent_grid.log.
2013-09-24 11:08:38.692
[/grid/database/11.2.0/bin/orarootagent.bin(5112062)]CRS-5822:Agent ‘/grid/database/11.2.0/bin/orarootagent_root’ disconnected from server. Details at (:CRSAGF00117:) {0:3:3784} in /grid/database/11.2.0/log/fxdb1/agent/crsd/orarootagent_root/orarootagent_root.log.
2013-09-24 11:08:38.778
[cssd(2031962)]CRS-1660:The CSS daemon shutdown has completed ——— the CSS daemon has shut down
2013-09-24 11:08:39.565
[ohasd(2950012)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb1’. —— the crsd process is reported failed; from here on the dependent processes begin to fail one after another
2013-09-24 11:08:40.052
[/grid/database/11.2.0/bin/oraagent.bin(3211384)]CRS-5011:Check of resource “+ASM” failed: details at “(:CLSN00006:)” in “/grid/database/11.2.0/log/fxdb1/agent/ohasd/oraagent_grid/oraagent_grid.log”
2013-09-24 11:08:40.225
[ohasd(2950012)]CRS-2765:Resource ‘ora.asm’ has failed on server ‘fxdb1’.
2013-09-24 11:08:40.248
[/grid/database/11.2.0/bin/oraagent.bin(3211384)]CRS-5011:Check of resource “+ASM” failed: details at “(:CLSN00006:)” in “/grid/database/11.2.0/log/fxdb1/agent/ohasd/oraagent_grid/oraagent_grid.log”
2013-09-24 11:08:40.503
[ohasd(2950012)]CRS-2765:Resource ‘ora.ctssd’ has failed on server ‘fxdb1’.
2013-09-24 11:08:40.772
[crsd(3277342)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in /grid/database/11.2.0/log/fxdb1/crsd/crsd.log. —— crsd goes down at this point. The picture is now clear: the earlier outline had the order of CRSD's resource cleanup wrong; in fact crsd goes down only after the cleanup.
2013-09-24 11:08:40.777
[ohasd(2950012)]CRS-2765:Resource ‘ora.evmd’ has failed on server ‘fxdb1’.
2013-09-24 11:08:41.558
[ohasd(2950012)]CRS-2765:Resource ‘ora.diskmon’ has failed on server ‘fxdb1’.
2013-09-24 11:08:41.572
[ctssd(4064022)]CRS-2402:The Cluster Time Synchronization Service aborted on host fxdb1. Details at (:ctss_css_init1:) in /grid/database/11.2.0/log/fxdb1/ctssd/octssd.log.
2013-09-24 11:08:41.621
[ohasd(2950012)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb1’.
2013-09-24 11:08:41.644
[ohasd(2950012)]CRS-2765:Resource ‘ora.cluster_interconnect.haip’ has failed on server ‘fxdb1’.
2013-09-24 11:08:42.125
[ohasd(2950012)]CRS-2765:Resource ‘ora.cssd’ has failed on server ‘fxdb1’.
2013-09-24 11:08:43.954
[cssd(2556520)]CRS-1713:CSSD daemon is started in clustered mode —— GI has restarted the clusterware stack; cssd starts working and goes looking for the voting files, which produces the messages below: it keeps searching, and the node no longer reboots.
2013-09-24 11:08:44.244
[cssd(2556520)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /grid/database/11.2.0/log/fxdb1/cssd/ocssd.log
2013-09-24 11:08:58.247
[ohasd(2950012)]CRS-2765:Resource ‘ora.cssdmonitor’ has failed on server ‘fxdb1’.

The disks were discovered at 11:32:37, after the fibre cable had been plugged back in. In other words, as long as the host does not reboot, GI keeps rescanning the disks for the voting files, provided the I/O-capable processes and resource processes were stopped successfully. The crsd log from that point:

2013-09-24 11:32:37.496: [ SKGFD][1286]OSS discovery with ::

2013-09-24 11:32:37.527: [ SKGFD][1286]Handle 11125b910 from lib :UFS:: for disk :/dev/rhdiskpower9:

2013-09-24 11:32:37.558: [ SKGFD][1286]Handle 11125d1b0 from lib :UFS:: for disk :/dev/rhdiskpower8:

2013-09-24 11:32:37.588: [ SKGFD][1286]Handle 111268970 from lib :UFS:: for disk :/dev/rhdiskpower7:

2013-09-24 11:32:37.618: [ SKGFD][1286]Handle 111268df0 from lib :UFS:: for disk :/dev/rhdiskpower6:

2013-09-24 11:32:37.648: [ SKGFD][1286]Handle 11146bf50 from lib :UFS:: for disk :/dev/rhdiskpower5:

2013-09-24 11:32:37.681: [ SKGFD][1286]Handle 1114e5c30 from lib :UFS:: for disk :/dev/rhdiskpower4:

2013-09-24 11:32:37.712: [ SKGFD][1286]Handle 1114e6390 from lib :UFS:: for disk :/dev/rhdiskpower3:

2013-09-24 11:32:37.743: [ SKGFD][1286]Handle 1114e6af0 from lib :UFS:: for disk :/dev/rhdiskpower2:

2013-09-24 11:32:37.773: [ SKGFD][1286]Handle 1114fa5d0 from lib :UFS:: for disk :/dev/rhdiskpower14:

2013-09-24 11:32:37.803: [ SKGFD][1286]Handle 111501490 from lib :UFS:: for disk :/dev/rhdiskpower13:

From 11:08 onward crsd was being started repeatedly; the crsd log shows:

2013-09-24 11:08:39
Changing directory to /grid/database/11.2.0/log/fxdb1/crsd
2013-09-24 11:08:39
CRSD REBOOT
Attempt to add duplicate debug module CRSUI (old description CRSUI Component, new description CRSUI Component.
Attempt to add duplicate debug module CRSCOMM (old description CRSCOMM Component, new description CRSCOMM Component.
Attempt to add duplicate debug module CRSRTI (old description CRSRTI Component, new description CRSRTI Component.
Attempt to add duplicate debug module CRSMAIN (old description CRSMAIN Component, new description CRSMAIN Component.
Attempt to add duplicate debug module CRSPLACE (old description CRSPLACE Component, new description CRSPLACE Component.
Attempt to add duplicate debug module CRSAPP (old description CRSAPP Component, new description CRSAPP Component.
Attempt to add duplicate debug module CRSRES (old description CRSRES Component, new description CRSRES Component.
Attempt to add duplicate debug module CRSTIMER (old description CRSTIMER Component, new description CRSTIMER Component.
Attempt to add duplicate debug module CRSEVT (old description CRSEVT Component, new description CRSEVT Component.
Attempt to add duplicate debug module CRSD (old description CRSD Component, new description CRSD Component.
Attempt to add duplicate debug module CLUCLS (old description CLUCLS Component, new description CLUCLS Component.
Attempt to add duplicate debug module CLSVER (old description CLSVER Component, new description CLSVER Component.
Attempt to add duplicate debug module COMMCRS (old description COMMCRS, new description COMMCRS.
Attempt to add duplicate debug module COMMNS (old description COMMNS, new description COMMNS.
CRSD exiting: Could not init the CSS context, error: 3
2013-09-24 11:20:05
Changing directory to /grid/database/11.2.0/log/fxdb1/crsd
2013-09-24 11:20:05
CRSD REBOOT
Attempt to add duplicate debug module CRSUI (old description CRSUI Component, new description CRSUI Component.
Attempt to add duplicate debug module CRSCOMM (old description CRSCOMM Component, new description CRSCOMM Component.
Attempt to add duplicate debug module CRSRTI (old description CRSRTI Component, new description CRSRTI Component.
Attempt to add duplicate debug module CRSMAIN (old description CRSMAIN Component, new description CRSMAIN Component.
Attempt to add duplicate debug module CRSPLACE (old description CRSPLACE Component, new description CRSPLACE Component.
Attempt to add duplicate debug module CRSAPP (old description CRSAPP Component, new description CRSAPP Component.
Attempt to add duplicate debug module CRSRES (old description CRSRES Component, new description CRSRES Component.
Attempt to add duplicate debug module CRSTIMER (old description CRSTIMER Component, new description CRSTIMER Component.
Attempt to add duplicate debug module CRSEVT (old description CRSEVT Component, new description CRSEVT Component.
Attempt to add duplicate debug module CRSD (old description CRSD Component, new description CRSD Component.
Attempt to add duplicate debug module CLUCLS (old description CLUCLS Component, new description CLUCLS Component.
Attempt to add duplicate debug module CLSVER (old description CLSVER Component, new description CLSVER Component.
Attempt to add duplicate debug module COMMCRS (old description COMMCRS, new description COMMCRS.
Attempt to add duplicate debug module COMMNS (old description COMMNS, new description COMMNS.
CRSD exiting: Could not init the CSS context, error: 3
2013-09-24 11:33:09
Changing directory to /grid/database/11.2.0/log/fxdb1/crsd
2013-09-24 11:33:09
CRSD REBOOT
Attempt to add duplicate debug module CRSUI (old description CRSUI Component, new description CRSUI Component.
Attempt to add duplicate debug module CRSCOMM (old description CRSCOMM Component, new description CRSCOMM Component.
Attempt to add duplicate debug module CRSRTI (old description CRSRTI Component, new description CRSRTI Component.
Attempt to add duplicate debug module CRSMAIN (old description CRSMAIN Component, new description CRSMAIN Component.
Attempt to add duplicate debug module CRSPLACE (old description CRSPLACE Component, new description CRSPLACE Component.
Attempt to add duplicate debug module CRSAPP (old description CRSAPP Component, new description CRSAPP Component.
Attempt to add duplicate debug module CRSRES (old description CRSRES Component, new description CRSRES Component.
Attempt to add duplicate debug module CRSTIMER (old description CRSTIMER Component, new description CRSTIMER Component.
Attempt to add duplicate debug module CRSEVT (old description CRSEVT Component, new description CRSEVT Component.
Attempt to add duplicate debug module CRSD (old description CRSD Component, new description CRSD Component.
Attempt to add duplicate debug module CLUCLS (old description CLUCLS Component, new description CLUCLS Component.
Attempt to add duplicate debug module CLSVER (old description CLSVER Component, new description CLSVER Component.
Attempt to add duplicate debug module COMMCRS (old description COMMCRS, new description COMMCRS.
Attempt to add duplicate debug module COMMNS (old description COMMNS, new description COMMNS.

Analyzing node 1's alert log essentially rules out Analysis 2 from the previous mail. The whole sequence is GI managing the clusterware stack, performing a graceful shutdown and startup, not a restart of GI itself. And crsd was terminated by GI (or aborted on its own) only after it had finished stopping the resources, which can be confirmed by monitoring the crsd process (e.g. with procstat). But this is only node 1's behavior; on node 2 the same operation produced completely different results.


Analysis 2 (revisited): the case where the fibre channel cable is cut (TEST18, 19):

In theory neither OCR nor the voting disks are accessible, so what does the clusterware do?
First, with OCR inaccessible:
On node 1, the crsd process was told to stop its resources and was then shut down by GI. The stack was gracefully shut down, and after cssd was restarted it repeatedly tried to read the voting disks.
But what happens if crsd aborts and ohasd cannot start it again, so the resources cannot be stopped? A reboot. That is what happened on node 2.

The analysis of node 2 after its fibre cable was pulled:

2013-09-24 11:38:50.475
[crsd(3735838)]CRS-2765:Resource ‘ora.fxdb.db’ has failed on server ‘fxdb2’.
2013-09-24 11:39:08.351
[cssd(3081164)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/rhdiskpower12 will be considered not functional in 99785 milliseconds ————- at this point node 2's fibre cable was pulled
2013-09-24 11:39:24.852
[cssd(3081164)]CRS-1649:An I/O error occured for voting file: /dev/rhdiskpower12; details at (:CSSNM00060:) in /grid/database/11.2.0/log/fxdb2/cssd/ocssd.log.
2013-09-24 11:39:58.549
[cssd(3081164)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/rhdiskpower12 will be considered not functional in 49587 milliseconds
2013-09-24 11:40:26.560
[crsd(3735838)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log. —————— OCR cannot be found; OCR is inaccessible
2013-09-24 11:40:26.561
[crsd(3735838)]CRS-1006:The OCR location is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log. —————— OCR is inaccessible; at this point GI begins the graceful shutdown of the clusterware stack
2013-09-24 11:40:26.882
[/grid/database/11.2.0/bin/oraagent.bin(3932520)]CRS-5822:Agent ‘/grid/database/11.2.0/bin/oraagent_grid’ disconnected from server. Details at (:CRSAGF00117:) {0:1:6} in /grid/database/11.2.0/log/fxdb2/agent/crsd/oraagent_grid/oraagent_grid.log.
2013-09-24 11:40:26.891
[/grid/database/11.2.0/bin/orarootagent.bin(4063734)]CRS-5822:Agent ‘/grid/database/11.2.0/bin/orarootagent_root’ disconnected from server. Details at (:CRSAGF00117:) {0:2:10662} in /grid/database/11.2.0/log/fxdb2/agent/crsd/orarootagent_root/orarootagent_root.log.
2013-09-24 11:40:26.895
[/grid/database/11.2.0/bin/scriptagent.bin(4456740)]CRS-5822:Agent ‘/grid/database/11.2.0/bin/scriptagent_grid’ disconnected from server. Details at (:CRSAGF00117:) {0:6:7} in /grid/database/11.2.0/log/fxdb2/agent/crsd/scriptagent_grid/scriptagent_grid.log.
2013-09-24 11:40:27.324
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. ——— the crsd process is reported failed
2013-09-24 11:40:28.768
[cssd(3081164)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file /dev/rhdiskpower12 will be considered not functional in 19369 milliseconds
2013-09-24 11:40:28.964
[crsd(9371958)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:29.005
[crsd(9371958)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage —–crsd abort
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:29.442
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. ——— the crsd process is reported failed; ohasd restarting CRSD, attempt 1?
2013-09-24 11:40:31.025
[crsd(5243254)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:31.061
[crsd(5243254)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:31.521
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. ——— the crsd process is reported failed; ohasd restarting CRSD, attempt 2?
2013-09-24 11:40:33.106
[crsd(4981632)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:33.151
[crsd(4981632)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:33.601
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. ——— crsd failed; ohasd restarts CRSD, attempt 3
2013-09-24 11:40:35.179
[crsd(4457080)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:35.216
[crsd(4457080)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:35.683
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. ——— crsd failed; ohasd restarts CRSD, attempt 4
2013-09-24 11:40:37.269
[crsd(3146492)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:37.306
[crsd(3146492)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:37.765
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. ——— crsd failed; ohasd restarts CRSD, attempt 5
2013-09-24 11:40:39.338
[crsd(4260124)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:39.375
[crsd(4260124)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:39.845
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. ——— crsd failed; ohasd restarts CRSD, attempt 6
2013-09-24 11:40:41.423
[crsd(2556490)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:41.456
[crsd(2556490)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:41.923
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. ——— crsd failed; ohasd restarts CRSD, attempt 7
2013-09-24 11:40:43.501
[crsd(6357962)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:43.538
[crsd(6357962)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:44.008
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. ——— crsd failed; ohasd restarts CRSD, attempt 8
2013-09-24 11:40:45.578
[crsd(3146246)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:45.615
[crsd(3146246)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:46.094
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. ——— crsd failed; ohasd restarts CRSD, attempt 9
2013-09-24 11:40:47.671
[crsd(6750230)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:47.709
[crsd(6750230)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:48.171
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. ——— crsd failed; ohasd restarts CRSD, attempt 10
2013-09-24 11:40:48.172
[ohasd(2949250)]CRS-2771:Maximum restart attempts reached for resource ‘ora.crsd’; will not restart. ——— crsd hit the restart limit: ohasd restarted CRSD 10 times without success, so the CRSD resources could not be shut down.
2013-09-24 11:40:48.878
[cssd(3081164)]CRS-1604:CSSD voting file is offline: /dev/rhdiskpower12; details at (:CSSNM00058:) in /grid/database/11.2.0/log/fxdb2/cssd/ocssd.log.
2013-09-24 11:40:48.878
[cssd(3081164)]CRS-1606:The number of voting files available, 0, is less than the minimum number of voting files required, 1, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /grid/database/11.2.0/log/fxdb2/cssd/ocssd.log
2013-09-24 11:40:48.878
[cssd(3081164)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /grid/database/11.2.0/log/fxdb2/cssd/ocssd.log
2013-09-24 11:40:48.930
[cssd(3081164)]CRS-1652:Starting clean up of CRSD resources.
2013-09-24 11:40:49.111
[cssd(3081164)]CRS-1653:The clean up of the CRSD resources failed. ——— with crsd already down, the resource cleanup was bound to fail; yet node 1 showed none of these restarts, and its resource cleanup succeeded.
2013-09-24 11:47:18.633
[ohasd(1966398)]CRS-2112:The OLR service started on node fxdb2.
2013-09-24 11:47:18.718
[ohasd(1966398)]CRS-1301:Oracle High Availability Service started on node fxdb2.
2013-09-24 11:47:18.768
[ohasd(1966398)]CRS-8011:reboot advisory message from host: fxdb2, component: cssagent, with time stamp: L-2013-09-24-11:40:50.570 ——- this fully confirms my initial analysis: the same operation, but the two nodes responded differently. Why? This is where all the questions lead.
[ohasd(1966398)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS

In the end we are left with a question: why did the same operation produce exactly opposite crsd behavior on node 1 and node 2? On every other cluster, every node behaves like node 2; so far, node 1 of this cluster is the only exception. My guess is that "rebootless restart" genuinely took effect there. The puzzle makes me lean toward a bug, but Oracle never replied and I did not trace it further, so that is where the question ends.

A careless DBA and SA cause "OCR initialization failed accessing OCR device"

During a fairly large upgrade project, an LV mirroring problem surfaced at the last minute. We tarred up the software, adjusted the relevant directories to build the LV mirror, and untarred the CRS and DATABASE homes. Afterward, CRS on node1 would no longer start, hanging at /etc/init.d/init.cssd startcheck. The newest CRS logs under /tmp showed the following error:

OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage Operating System error [No such device or address] [6]

The same error, with more detail, also appears in /crs/log/client/xhdb1/alert.log. At this point one of the DBAs panicked, first checking RAW device permissions, then the OS itself. After confirming the RAW permissions were fine, he planned to dd out the OCR contents for inspection. I stopped him there: after nearly 13 hours of overtime everyone was exhausted, and a slip of the fingers could dd over a data file. I logged on to the node to investigate myself. Problems like this usually come down to a handful of causes:

1) RAW device permissions were changed and the device is no longer readable
2) The RAW device configuration was modified (check the last-changed timestamp)
3) The VG containing the RAW device was never varied on (varyonvg)
4) A bug

Working through this checklist one item at a time will usually find the problem. Running lspv on node 1 showed datavg in the inactive state, and there it was. Further digging revealed that after node 1 rebooted, HA had not restarted with the OS. Node 2 was running the cluster with datavg in concurrent mode on that side, but because HA never started on node 1, DATAVG stayed offline there, so CRS could not read the OCR on DATAVG and threw the error.
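The checklist above can be sketched as a small script. This is a hedged illustration, not a supported tool: the device path is a placeholder you must replace with your cluster's actual OCR raw device, and the AIX-only lspv call is guarded so the sketch degrades gracefully elsewhere.

```shell
#!/bin/sh
# Sketch of the OCR raw-device checklist (items 1-3 above).

check_ocr_dev() {
    dev=$1
    # 1) device exists and is readable by the current (crs/oracle) user
    if [ ! -e "$dev" ]; then echo "missing";      return 1; fi
    if [ ! -r "$dev" ]; then echo "not-readable"; return 1; fi
    # 2) show ownership, permissions and timestamps to spot recent changes
    ls -l "$dev"
    echo "ok"
}

# 3) on AIX, confirm the VG holding the device is active (varied on);
#    an "inactive" datavg here was exactly the culprit in this case
if command -v lspv >/dev/null 2>&1; then
    lspv
fi

check_ocr_dev /dev/null   # placeholder device, for illustration only
```

On a real node you would pass the raw OCR device (e.g. the path reported by ocrcheck) instead of the placeholder.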

Reaching for dd on a production system, unprepared and on the spur of the moment, is irresponsible both to yourself and to the customer. Think three times before using it.
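If dd ever does have to be used to inspect an OCR device, the only safe pattern is strictly read-only: if= points at the device, of= points at a scratch file, with an explicit block size and count so nothing is ever written back to shared storage. The device path below is a placeholder standing in for a real raw device:

```shell
#!/bin/sh
# Read-only inspection: copy the first 1 MB of a (placeholder) OCR device
# into a scratch file under /tmp. Never put a device path after of= here.
OCR_DEV=${1:-/dev/zero}               # placeholder; a real raw OCR device in practice
DUMP=$(mktemp /tmp/ocr_head.XXXXXX)

dd if="$OCR_DEV" of="$DUMP" bs=1024 count=1024 2>/dev/null

ls -l "$DUMP"    # verify the dump, then inspect it with od/strings, not the device
```

Keeping the device exclusively on the if= side is what makes the command safe to type at 3 a.m.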

Upgrading 11.1.0.6 to 11.1.0.7 hits the error "dtexec dtfile dtsession dtterm dtwm java ksh rpc.ttdbserver ttsession"

Platform: AIX 61006
Version: ORACLE 11.1.0.6 REAL APPLICATION CLUSTER and DATABASE
Target upgrade versions:
ORACLE CRS 11.1.0.7.7
ORACLE DATABASE 11.1.0.12

The installation failed with the following error:

dtexec dtfile dtsession dtterm dtwm java ksh rpc.ttdbserver ttsession.

The trace log shows:

639014: execve(“/usr/sbin/fuser”, 0x0000000112BA7DD0, 0x000000011000E0D0) argc: 6
639014: argv: /usr/sbin/fuser -f -x -f -x
639014: /ora/oracle/product/11.1/db_1/bin/oracle

Investigation confirmed this is an AIX bug, fixed by the following patch:
IZ67400: FUSER GIVES INCORRECT PIDS WITH -X OPTION
APAR is a duplicate of IZ71207

Alternatively, without applying the OS patch, the problem can be worked around as follows.

As the root user:
1) Rename fuser:
mv /usr/sbin/fuser /usr/sbin/fuser_renamed
2) touch /usr/sbin/fuser
3) chmod +x /usr/sbin/fuser

As the oracle user:
1) Click "try again" in the installer GUI to retry the installation

After the installation succeeds, as the root user:
1) Rename fuser back to its original name:
mv /usr/sbin/fuser_renamed /usr/sbin/fuser
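The stub-and-restore sequence above can be rehearsed safely before touching the real system. This sketch runs the same three steps against a copy in a scratch directory rather than /usr/sbin, so nothing on the node is modified; the "real-fuser" script is a stand-in for the genuine binary.

```shell
#!/bin/sh
# Rehearse the fuser stub workaround in a scratch directory
# instead of the real /usr/sbin.
SBIN=$(mktemp -d)
printf '#!/bin/sh\necho real-fuser\n' > "$SBIN/fuser"   # stand-in for the real binary
chmod +x "$SBIN/fuser"

# 1) set the real binary aside
mv "$SBIN/fuser" "$SBIN/fuser_renamed"
# 2) drop in an empty executable stub so the installer's fuser calls
#    succeed without reporting any (incorrect) PIDs
touch "$SBIN/fuser"
chmod +x "$SBIN/fuser"
"$SBIN/fuser"            # the stub does nothing and exits 0

# ... the installer retry would run at this point ...

# 3) restore the original once the installation completes
mv "$SBIN/fuser_renamed" "$SBIN/fuser"
"$SBIN/fuser"            # prints: real-fuser
```

The empty stub works because OUI only needs fuser to run cleanly; it does not depend on its output during this check.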