Overview
While a storage vendor engineer was provisioning LUN storage to the customer's EMR database hosts, improper operations left the database cluster in an abnormal state, after which database host node 2 was rebooted. Following the reboot, the GRID cluster software on node 2 failed to start, while node 1 continued to provide database service normally. We connected to the database environment remotely to troubleshoot; after in-depth analysis of the relevant logs, we identified the root cause, applied a fix, and restored the database cluster to normal operation.
Cluster and Database Log Analysis
1. Errors before the host reboot at 15:36
The logs show that at 15:36 the database instance reported I/O read/write errors while the cluster logs were reporting errors as well; the host was then rebooted.
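As a hedged illustration of how errors of this kind can be located, the following commands search the instance and Clusterware alert logs around 15:36. The database diag path and SID (emr1) are assumptions for illustration only; the Grid Infrastructure log path matches the excerpts shown below.

```
# Database instance alert log -- diag path and SID "emr1" are illustrative
grep -n "ORA-" /u01/app/oracle/diag/rdbms/emr/emr1/trace/alert_emr1.log | tail -50

# Clusterware alert log for node 2 (path format as seen in the excerpts below)
grep -n "2021-12-12 15:3" /u01/app/11.2.0/grid/log/cxemrdb02/alertcxemrdb02.log
```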
2. Logs after the node 2 host reboot
After the node 2 host was rebooted, checking the cluster startup status showed that startup stalled at the ASM instance. The cluster and ASM alert logs contained the following information:
Cluster alert log:
2021-12-12 15:48:34.081:
[client(10295)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.
[client(10405)]CRS-10001:12-Dec-19 15:48 ACFS-9391: Checking for existing ADVM/ACFS installation.
[client(10410)]CRS-10001:12-Dec-19 15:48 ACFS-9392: Validating ADVM/ACFS installation files for operating system.
[client(10412)]CRS-10001:12-Dec-19 15:48 ACFS-9393: Verifying ASM Administrator setup.
[client(10415)]CRS-10001:12-Dec-19 15:48 ACFS-9308: Loading installed ADVM/ACFS drivers.
[client(10418)]CRS-10001:12-Dec-19 15:48 ACFS-9154: Loading 'oracleoks.ko' driver.
[client(10446)]CRS-10001:12-Dec-19 15:48 ACFS-9154: Loading 'oracleadvm.ko' driver.
[client(10473)]CRS-10001:12-Dec-19 15:48 ACFS-9154: Loading 'oracleacfs.ko' driver.
[client(10560)]CRS-10001:12-Dec-19 15:48 ACFS-9327: Verifying ADVM/ACFS devices.
[client(10564)]CRS-10001:12-Dec-19 15:48 ACFS-9156: Detecting control device '/dev/asm/.asm_ctl_spec'.
[client(10568)]CRS-10001:12-Dec-19 15:48 ACFS-9156: Detecting control device '/dev/ofsctl'.
[client(10573)]CRS-10001:12-Dec-19 15:48 ACFS-9322: completed
2021-12-12 15:48:43.930:
[client(10374)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.
2021-12-12 15:48:53.698:
[client(10598)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.
2021-12-12 15:49:03.129:
[client(10905)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.
2021-12-12 15:50:00.511:
[client(12458)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.
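For reference, the stalled startup state described above can be confirmed with standard 11.2 Clusterware commands; a minimal sketch:

```
# Show the startup (init) resources and where startup has stalled
/u01/app/11.2.0/grid/bin/crsctl stat res -t -init

# Quick health check of the local CRS stack
/u01/app/11.2.0/grid/bin/crsctl check crs
```

In this case the ora.asm resource would not come online, consistent with the repeated CRS-1013 errors above.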
Node 2 ASM alert log:
Thu Dec 12 16:22:11 2021
MMNL started with pid=21, OS id=26248
lmon registered with NM - instance number 2 (internal mem no 1)
Thu Dec 12 16:24:08 2021
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Thu Dec 12 16:24:11 2021
PMON (ospid: 26206): terminating the instance due to error 481
Thu Dec 12 16:24:11 2021
ORA-1092 : opitsk aborting process
Thu Dec 12 16:24:13 2021
System state dump requested by (instance=2, osid=26206 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_diag_26216_20211212162413.trc
Dumping diagnostic data in directory=[cdmp_20211212162411], requested by (instance=2, osid=26206 (PMON)), summary=[abnormal instance termination].
Instance terminated by PMON, pid = 26206
Checking the corresponding LMON and DIAG trace files:
[grid@cxemrdb02 trace]$ more ./+ASM2_diag_26216.trc
Trace file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_diag_26216.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
ORACLE_HOME = /u01/app/11.2.0/grid
System name:    Linux
Node name:      cxemrdb02
Release:        2.6.32-431.el6.x86_64
Version:        #1 SMP Sun Nov 10 22:19:54 EST 2013
Machine:        x86_64
Instance name: +ASM2
Redo thread mounted by this instance: 0 <none>
Oracle process number: 6
Unix process pid: 26216, image: oracle@cxemrdb02 (DIAG)
*** 2021-12-12 16:22:10.788
*** SESSION ID:(373.1) 2021-12-12 16:22:10.788
*** CLIENT ID:() 2021-12-12 16:22:10.788
*** SERVICE NAME:() 2021-12-12 16:22:10.788
*** MODULE NAME:() 2021-12-12 16:22:10.788
*** ACTION NAME:() 2021-12-12 16:22:10.788
Node id: 1
List of nodes: 0, 1,
*** 2021-12-12 16:22:10.790
Reconfiguration starts [incarn=0]
*** 2021-12-12 16:22:10.790
I'm the voting node
Group reconfiguration cleanup
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).
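The repeated kjzdattdlm messages, together with the eviction notification from instance 1, point to inter-instance communication rather than storage. Per Doc ID 1383737.1, the private interconnect and the clusterware IPC socket files are the usual suspects; a sketch of the checks, where the interface name and peer address are illustrative:

```
# Which interfaces did Grid Infrastructure register for the interconnect?
/u01/app/11.2.0/grid/bin/oifcfg getif

# Is node 1's private address reachable over the private interface?
# (eth1 and 192.168.10.1 are illustrative values)
ping -c 3 -I eth1 192.168.10.1

# Inspect the clusterware IPC socket files left from before the reboot
ls -l /var/tmp/.oracle
```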
3. Problem analysis
Analysis of the cluster logs shows that the improper operation at 15:36 caused the database and the cluster to report errors and enter an abnormal state, at which point the database host was rebooted. After the reboot, the cluster logs showed that ASM could not start. Based on the ASM alert log and trace files, and on the Oracle support note "ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481" (Doc ID 1383737.1), the problem was judged to lie in cluster private-network (interconnect) communication. Analysis of the private-network configuration confirmed that the clusterware IPC socket files (/var/tmp/.oracle) had been left in an inconsistent state by the abnormal cluster shutdown, which prevented node 2 from starting after the reboot. Once the socket files were cleaned up, the cluster started normally.
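A minimal sketch of the cleanup that restored node 2, following the workaround described in Doc ID 1383737.1 (run as root on node 2; the backup directory name is illustrative):

```
# Make sure the CRS stack is completely down before touching the socket files
/u01/app/11.2.0/grid/bin/crsctl stop crs -f

# Back up, then clear, the stale clusterware IPC socket files
mkdir -p /root/oracle_socket_backup
cp -a /var/tmp/.oracle /root/oracle_socket_backup/
rm -rf /var/tmp/.oracle/*

# Restart the stack and watch the init resources come up
/u01/app/11.2.0/grid/bin/crsctl start crs
/u01/app/11.2.0/grid/bin/crsctl stat res -t -init
```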
Summary and Follow-up Recommendations
1. Problem summary
As the analysis above shows, the improper storage operation at 15:36 caused the database and cluster errors, after which node 2 was rebooted. The abnormal shutdown left the clusterware IPC socket files under /var/tmp/.oracle inconsistent, so after the reboot node 2 could not communicate with node 1 and its cluster stack failed to start, matching the symptoms described in Doc ID 1383737.1. Once the socket files were cleaned up, the cluster started normally and the database cluster returned to normal operation.
2. Recommendations for adding disks to the cluster
For a production database cluster, adding disks to ASM is an important and somewhat risky operation. Because all database data resides in ASM disk groups, adding a disk to an existing disk group incorrectly can cause read/write errors on the disk group or prevent it from mounting, disrupting normal database service. Based on Oracle's standard procedure for adding new disks to an existing diskgroup in a RAC cluster or standalone ASM environment (best practices, Doc ID 1946668.1), together with our years of database maintenance experience, we recommend the following steps when adding ASM disks (a command-level sketch follows the table):
| Step | Action |
| --- | --- |
| 1 | Check the current cluster state, ASM disk status, disk group usage, and disk sizes |
| 2 | Carve the new LUNs at the storage layer; size each LUN to match the existing disks in the target ASM disk group |
| 3 | Map the new LUNs to the OS on every node and confirm each host in the cluster can recognize and read/write them |
| 4 | Configure aliases or multipath mappings consistent with the existing ASM disks, and confirm the devices are readable with kfed |
| 5 | Create a temporary test disk group on the new LUNs to confirm ASM can use them |
| 6 | Drop the test disk group |
| 7 | Add the new LUNs to the target ASM disk group |
| 8 | Verify the ASM rebalance completes, then check cluster and database status |
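A command-level sketch of steps 4 through 8, assuming a new device /dev/asm-disk9, a target disk group DATA, and a throwaway test group TESTDG (all three names are illustrative):

```
# Step 4: confirm ASM can read the new device's header
/u01/app/11.2.0/grid/bin/kfed read /dev/asm-disk9

# Steps 5-8, as SYSASM on one node
sqlplus / as sysasm <<'EOF'
-- Step 5: prove the LUN is usable by building a temporary disk group
CREATE DISKGROUP TESTDG EXTERNAL REDUNDANCY DISK '/dev/asm-disk9';
-- Step 6: drop the test disk group
DROP DISKGROUP TESTDG INCLUDING CONTENTS;
-- Step 7: add the disk to the real disk group with an explicit rebalance power
ALTER DISKGROUP DATA ADD DISK '/dev/asm-disk9' REBALANCE POWER 4;
-- Step 8: the rebalance is finished when this query returns no rows
SELECT group_number, operation, state, est_minutes FROM v$asm_operation;
EOF
```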