
Handling Cluster Failures Caused by HAIP When Running Oracle RAC in a Cloud Environment

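Background:

We needed to install a 12.2 RAC in a cloud environment. Because of a network restriction, the HAIP addresses (169.254.*.*) on the private interconnect (heartbeat) network could not communicate between the two hosts. As a result, ASM and the database could each start on only one node, failing with the typical error: PMON ...: terminating the instance due to error 481.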

Resolution:

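1. We asked the cloud vendor to allow HAIP (169.254.*.*) traffic at the virtualization management layer, but this was never resolved.
2. We decided that the ASM/DB instances would not use HAIP and would fall back to the interconnect address mode of earlier versions. HAIP itself stays enabled at the cluster level and the 169.254.*.* virtual IPs still appear in ifconfig; only the ASM/DB instances are configured not to use them.
3. For problems caused by HAIP failures, see the MOS note:
ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481 (Doc ID 1383737.1)
To disable HAIP, see the "Disable HAIP" section of HOWTO: Remove/Disable HAIP on Exadata (Doc ID 2524069.1).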
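
With option 2, HAIP stays enabled at the cluster level, so the 169.254.*.* addresses remain configured on the private interface. The following is a minimal sketch for listing them. It assumes the one-line-per-address output of `ip -o -4 addr show` (interface name in field 2, `inet` in field 3, address/prefix in field 4); the function name `list_haip_addrs` is hypothetical and not part of any Oracle tooling.

```shell
# Hypothetical helper: read `ip -o -4 addr show` output on stdin and print
# "<interface> <address>" for every link-local (169.254.0.0/16) IPv4 address.
# Assumption: field 2 is the interface, field 3 is "inet", field 4 is addr/prefix.
list_haip_addrs() {
  awk '$3 == "inet" && $4 ~ /^169\.254\./ {
    split($4, parts, "/")   # strip the /prefix length
    print $2, parts[1]
  }'
}
```

On a cluster node this could be used as `ip -o -4 addr show | list_haip_addrs`. Pinging the address reported on the other node is then a quick way to confirm whether link-local traffic is actually blocked.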

Official method for disabling HAIP:

Disable the HAIP resource and remove the ora.asm dependencies on it:
crsctl modify res ora.cluster_interconnect.haip -attr "ENABLED=0" -init
crsctl modify res ora.asm -attr "START_DEPENDENCIES='hard(ora.cssd,ora.ctssd)pullup(ora.cssd,ora.ctssd)weak(ora.drivers.acfs)'" -init
crsctl modify res ora.asm -attr "STOP_DEPENDENCIES='hard(intermediate:ora.cssd)'" -init
Then restart the cluster.

===
Check the resource status (and start the resource manually if required):
crsctl stat res ora.cluster_interconnect.haip -init
crsctl start res ora.cluster_interconnect.haip -init
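
To check the status from a script, the output can be reduced to the two fields that matter. This is a rough sketch, assuming `crsctl stat res` prints its usual `KEY=value` lines (`NAME=`, `TYPE=`, `TARGET=`, `STATE=`); the helper name `haip_state` is hypothetical.

```shell
# Hypothetical helper: read `crsctl stat res <resource> -init` output on stdin
# and print the TARGET and STATE values on a single line.
# Assumption: the output consists of KEY=value lines, one per attribute.
haip_state() {
  awk -F= '
    $1 == "TARGET" { target = $2 }
    $1 == "STATE"  { state  = $2 }
    END { printf "target=%s state=%s\n", target, state }
  '
}
```

Example usage: `crsctl stat res ora.cluster_interconnect.haip -init | haip_state`. After HAIP has been disabled and the cluster restarted, the state would be expected to report OFFLINE rather than ONLINE.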

#############################
To restore the HAIP service, re-enable the resource and restart the cluster:
crsctl modify res ora.cluster_interconnect.haip -attr "ENABLED=1" -init

Oracle's recommended procedure:

1. Run "crsctl stop crs" on all nodes to stop the CRS stack.

2. On one node, run the following commands:
crsctl start crs -excl -nocrs
crsctl stop res ora.asm -init
crsctl modify res ora.cluster_interconnect.haip -attr "ENABLED=0" -init
crsctl modify res ora.asm -attr "START_DEPENDENCIES='hard(ora.cssd,ora.ctssd)pullup(ora.cssd,ora.ctssd)weak(ora.drivers.acfs)',STOP_DEPENDENCIES='hard(intermediate:ora.cssd)'" -init
crsctl stop crs

3. Repeat step 2 on the other nodes.

4. Run "crsctl start crs" on all nodes to restart the CRS stack.

 

The simplest fix, confirmed by testing

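There is no need to disable HAIP. It is enough to manually set the cluster_interconnects parameter of each ASM/DB instance to that node's own private interconnect IP. Because the parameter is set with scope=spfile, it takes effect at the next instance restart.
The steps are as follows:
DB:
SQL> alter system set cluster_interconnects='10.100.19.18' scope=spfile sid='bdcsq1';
SQL> alter system set cluster_interconnects='10.100.19.20' scope=spfile sid='bdcsq2';
ASM:
SQL> alter system set cluster_interconnects='10.100.19.18' scope=spfile sid='+ASM1';
SQL> alter system set cluster_interconnects='10.100.19.20' scope=spfile sid='+ASM2';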
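
On clusters with more nodes, typing these statements by hand invites quote and SID mistakes. A small sketch that generates them from `SID=IP` pairs is shown here; the function name `gen_interconnect_sql` and the `SID=IP` argument format are assumptions made for illustration, and the output is meant to be reviewed before being pasted into SQL*Plus.

```shell
# Hypothetical helper: print one ALTER SYSTEM statement per "SID=IP" argument.
# The function body runs in a subshell so its variables do not leak.
gen_interconnect_sql() (
  for pair in "$@"; do
    sid=${pair%%=*}   # text before the first "="
    ip=${pair#*=}     # text after the first "="
    printf "alter system set cluster_interconnects='%s' scope=spfile sid='%s';\n" "$ip" "$sid"
  done
)
```

For the two-node database in this article, `gen_interconnect_sql bdcsq1=10.100.19.18 bdcsq2=10.100.19.20` prints the two DB statements listed above; the ASM statements can be generated the same way with `+ASM1` and `+ASM2`.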

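Check the cluster_interconnects information reported at startup in the ASM and DB alert logs.
Immediately after the parameters are read, the startup log shows which interconnect network is in use, for example: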
2021-11-13T11:34:06.408938+08:00
Cluster Communication is configured to use IPs from: GPnP
IP: 10.100.19.18 Subnet: 10.100.19.0 ===>>> not using HAIP
KSIPC Loopback IP addresses(OSD):
127.0.0.1
KSIPC Available Transports: UDP:TCP
KSIPC: Client: KCL Transport: NONE
KSIPC: Client: DLM Transport: NONE

……………………
NOTE: remote asm mode is remote (mode 0x2; from cluster type)
2021-11-11T09:26:08.588753-05:00
Cluster Communication is configured to use IPs from: GPnP
IP: 169.254.253.252 Subnet: 169.254.0.0 ===>>> using HAIP
KSIPC Loopback IP addresses(OSD):
127.0.0.1
KSIPC Available Transports: UDP:TCP
KSIPC: Client: KCL Transport: UDP
KSIPC: Client: DLM Transport: UDP
KSIPC CAPABILITIES :IPCLW:GRPAM:TOPO:DLL
KSXP: ksxpsg_ipclwtrans: 2 UDP
cluster interconnect IPC version: [IPCLW over UDP(mode 3) ]
IPC Vendor 1 proto 2
Oracle instance running with ODM: Oracle Direct NFS ODM Library Version 4.0
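
The same check can be scripted instead of reading the log by eye. The sketch below assumes the `IP: <address> Subnet: <network>` line format shown in the samples above (the `===>>>` annotations are the author's notes and do not appear in a real log); the function name `classify_interconnect_ips` is hypothetical.

```shell
# Hypothetical helper: read an alert log on stdin and, for every
# "IP: <addr> Subnet: <net>" line, print the address followed by
# "HAIP" if it is link-local (169.254.0.0/16) or "non-HAIP" otherwise.
classify_interconnect_ips() {
  awk '
    $1 == "IP:" && $3 == "Subnet:" {
      kind = ($2 ~ /^169\.254\./) ? "HAIP" : "non-HAIP"
      print $2, kind
    }
  '
}
```

It could be run as `classify_interconnect_ips < alert_+ASM1.log`, where the file name is only an example. Any line reported as HAIP after the restart suggests that the instance did not pick up the cluster_interconnects setting.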