
Oracle Troubleshooting, Configuration and Tuning

Overview

 

At 00:02 on 2021-03-07 the business application reported that it could not connect to its database. After the fault was reported at 00:04, the maintenance vendor's engineer connected to the database environment remotely to troubleshoot, and the issue was resolved by 00:08. This document summarizes the in-depth analysis of the relevant logs, the root cause found, and the resulting recommendations.

 

Analysis of the Connection Failure

1. Analysis of the Ningfan database connection failure

 

Logging in to the database host showed that the root filesystem was 100% full, so archived redo logs could not be switched and the application could not connect to the database. As an emergency measure some database logs were deleted manually and a log switch was forced, after which the application could connect normally.

[root@nf-dboracle ~]# df -h

Filesystem                          Size  Used Avail Use% Mounted on

/dev/mapper/vg_nfdboracle-LogVol00  476G  452G   16M 100% /

tmpfs                               7.8G  160K  7.8G   1% /dev/shm

/dev/sda1                          1008M   63M  895M   7% /boot

/dev/sr0                            3.6G  3.6G     0 100% /mnt

[root@nf-dboracle ~]# su - oracle

[oracle@nf-dboracle ~]$ rman target /

 

Recovery Manager: Release 11.2.0.3.0 - Production on Sun Mar 7 00:07:32 2021

 

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

 

RMAN-00571: ===========================================================

RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============

RMAN-00571: ===========================================================

RMAN-00554: initialization of internal recovery manager package failed

RMAN-04005: error from target database:

ORA-09945: Unable to initialize the audit trail file

Linux-x86_64 Error: 28: No space left on device
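In hindsight, a periodic free-space check would have caught the filling filesystem before the archiver hung. A minimal sketch in shell (the 90% threshold and the monitored mount point are assumptions for illustration, not values from this incident):

```shell
#!/bin/sh
# usage_pct MOUNT: read the Use% column for a mount point from POSIX df output.
usage_pct() {
    df -P "$1" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}

# check_usage PCT THRESHOLD: pure comparison, kept separate from df so it is testable.
check_usage() {
    if [ "$1" -ge "$2" ]; then
        echo "WARN"
    else
        echo "OK"
    fi
}

check_usage "$(usage_pct /)" 90
```

Run from cron, a WARN here could page an operator well before "No space left on device" (ORA-09945) stops the database.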

 

 

2. Why the disk filled up

Analysis of archived-log generation shows that since March 4 the daily volume has been roughly 15 to 20 times its previous level, currently about 120 GB per day. Under the existing backup policy archived logs are retained for two days, so the disk eventually filled up.
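The daily archive volume can be measured with a query along the following lines against v$archived_log (illustrative; run as a DBA user):

```sql
-- Daily archived-log volume in GB over the last two weeks
SELECT TRUNC(completion_time) AS day,
       ROUND(SUM(blocks * block_size) / 1024 / 1024 / 1024, 1) AS gb
FROM   v$archived_log
WHERE  completion_time > SYSDATE - 14
GROUP  BY TRUNC(completion_time)
ORDER  BY 1;
```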

 

 

 

3. What is generating so much archive log

At the database level, about 70% of the data changes land on the NFSYS_LOG_THIRD table owned by the KLB_MICROHOSP user, and the top SQL also shows frequent inserts into this table. Judging by the table name, a logging feature may have been newly enabled; the application vendor needs to confirm this.

INSERT INTO NFSYS_LOG_THIRD ( ID, TYPE, METHOD, MSG, DESCRIPTION, NAME, OPENID, THIRD_METHOD ) VALUES ( NFSYS_LOG_THIRD_S.nextval, :type0, :method0, :msg, :description, :name, :openid, :third_method ) RETURNING ID INTO :IDENTITY_PARAMETER
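The concentration of changes on this table can be cross-checked from segment statistics; a query of roughly this shape works on 11g (illustrative):

```sql
-- Top 10 segments by block changes since instance startup
SELECT * FROM (
    SELECT owner, object_name, value AS block_changes
    FROM   v$segment_statistics
    WHERE  statistic_name = 'db block changes'
    ORDER  BY value DESC
) WHERE ROWNUM <= 10;
```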

 

 


Summary and Follow-up Recommendations

Analysis of the database logs and related data leads to the following conclusions:

1. Root cause of the connection failure:

Logging in to the database host showed that the root filesystem was 100% full, so archived redo logs could not be switched and the application could not connect. After manually deleting some database logs and forcing a log switch, the application could connect again. (The call came in at 00:04 and the issue was resolved at 00:08, four minutes in total.)

2. Why the disk filled up:

Since March 4 the daily archived-log volume has been roughly 15 to 20 times its previous level, currently about 120 GB per day; with the original policy of retaining archived logs for two days, the disk eventually filled up. The backup and archive-log deletion policy has been temporarily adjusted to work around this.
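A temporary policy of this shape can be expressed in RMAN (illustrative only; the one-day window and disk destination are assumptions that must be matched to the real backup schedule):

```
RMAN> CONFIGURE ARCHIVELOG DELETION POLICY TO BACKED UP 1 TIMES TO DISK;
RMAN> BACKUP ARCHIVELOG ALL NOT BACKED UP 1 TIMES;
RMAN> DELETE NOPROMPT ARCHIVELOG ALL COMPLETED BEFORE 'SYSDATE - 1';
```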

3. The application statement generating the log volume:

From the database level, about 70% of the data changes land on the NFSYS_LOG_THIRD table owned by the KLB_MICROHOSP user, and the top SQL also shows frequent inserts into this table. Judging by the table name, a logging feature may have been newly enabled; the application vendor needs to confirm this.

INSERT INTO NFSYS_LOG_THIRD ( ID, TYPE, METHOD, MSG, DESCRIPTION, NAME, OPENID, THIRD_METHOD ) VALUES ( NFSYS_LOG_THIRD_S.nextval, :type0, :method0, :msg, :description, :name, :openid, :third_method ) RETURNING ID INTO :IDENTITY_PARAMETER

 

 

4. Next steps

The items that now need prompt follow-up are:

  1. The application vendor should use the SQL statement above to determine what upgrades or changes were made to the application on March 4.
  2. Whether the change is short-term or will run in this mode long-term; this determines how the database backup and archive-log deletion policy should be set going forward.
Failure Analysis: A Customer's Reporting Database Could Not Be Connected

Overview

 

At a customer site, a storage vendor engineer was mapping LUNs to the EMR database hosts. An improper operation during this work left the database cluster in an abnormal state, after which database host node 2 was rebooted. After the reboot the Grid Infrastructure (GRID) stack could not start on node 2, while node 1 continued to serve the database. We connected to the database environment remotely, analyzed the relevant logs in depth, found the root cause, fixed it, and restored the cluster to normal.

 

Cluster and Database Log Analysis

1. Errors before the host reboot at 15:36

The logs show that at 15:36 the database instance reported I/O read/write errors while the cluster logs were also reporting errors; the host was then rebooted.

 

2. Logs after node 2 was rebooted

After node 2 was rebooted, the cluster startup stalled at the ASM instance. The ASM alert log showed the following:

Cluster alert log:

2021-12-12 15:48:34.081:

[client(10295)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

[client(10405)]CRS-10001:12-Dec-19 15:48 ACFS-9391: Checking for existing ADVM/ACFS installation.

[client(10410)]CRS-10001:12-Dec-19 15:48 ACFS-9392: Validating ADVM/ACFS installation files for operating system.

[client(10412)]CRS-10001:12-Dec-19 15:48 ACFS-9393: Verifying ASM Administrator setup.

[client(10415)]CRS-10001:12-Dec-19 15:48 ACFS-9308: Loading installed ADVM/ACFS drivers.

[client(10418)]CRS-10001:12-Dec-19 15:48 ACFS-9154: Loading 'oracleoks.ko' driver.

[client(10446)]CRS-10001:12-Dec-19 15:48 ACFS-9154: Loading 'oracleadvm.ko' driver.

[client(10473)]CRS-10001:12-Dec-19 15:48 ACFS-9154: Loading 'oracleacfs.ko' driver.

[client(10560)]CRS-10001:12-Dec-19 15:48 ACFS-9327: Verifying ADVM/ACFS devices.

[client(10564)]CRS-10001:12-Dec-19 15:48 ACFS-9156: Detecting control device '/dev/asm/.asm_ctl_spec'.

[client(10568)]CRS-10001:12-Dec-19 15:48 ACFS-9156: Detecting control device '/dev/ofsctl'.

[client(10573)]CRS-10001:12-Dec-19 15:48 ACFS-9322: completed

2021-12-12 15:48:43.930:

[client(10374)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

2021-12-12 15:48:53.698:

[client(10598)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

2021-12-12 15:49:03.129:

[client(10905)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

2021-12-12 15:50:00.511:

[client(12458)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

2021-12-12 15:51:07.276:


Node 2 ASM alert log:

Thu Dec 12 16:22:11 2021

MMNL started with pid=21, OS id=26248

lmon registered with NM - instance number 2 (internal mem no 1)

Thu Dec 12 16:24:08 2021

LMON received an instance eviction notification from instance 1

The instance eviction reason is 0x20000000

The instance eviction map is 2

Thu Dec 12 16:24:11 2021

PMON (ospid: 26206): terminating the instance due to error 481

Thu Dec 12 16:24:11 2021

ORA-1092 : opitsk aborting process

Thu Dec 12 16:24:13 2021

System state dump requested by (instance=2, osid=26206 (PMON)), summary=[abnormal instance termination].

System State dumped to trace file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_diag_26216_20211212162413.trc

Dumping diagnostic data in directory=[cdmp_20211212162411], requested by (instance=2, osid=26206 (PMON)), summary=[abnormal instance termination].

Instance terminated by PMON, pid = 26206

The corresponding LMON and DIAG trace files show:

[grid@cxemrdb02 trace]$ more ./+ASM2_diag_26216.trc

Trace file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_diag_26216.trc

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production

With the Real Application Clusters and Automatic Storage Management options

ORACLE_HOME = /u01/app/11.2.0/grid

System name:    Linux

Node name:      cxemrdb02

Release:        2.6.32-431.el6.x86_64

Version:        #1 SMP Sun Nov 10 22:19:54 EST 2013

Machine:        x86_64

Instance name: +ASM2

Redo thread mounted by this instance: 0 <none>

Oracle process number: 6

Unix process pid: 26216, image: oracle@cxemrdb02 (DIAG)

*** 2021-12-12 16:22:10.788

*** SESSION ID:(373.1) 2021-12-12 16:22:10.788

*** CLIENT ID:() 2021-12-12 16:22:10.788

*** SERVICE NAME:() 2021-12-12 16:22:10.788

*** MODULE NAME:() 2021-12-12 16:22:10.788

*** ACTION NAME:() 2021-12-12 16:22:10.788

Node id: 1

List of nodes: 0, 1,

*** 2021-12-12 16:22:10.790

Reconfiguration starts [incarn=0]

*** 2021-12-12 16:22:10.790

I'm the voting node

Group reconfiguration cleanup

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

 

3. Analysis

 

Analysis of the cluster logs shows that the improper operation at 15:36 caused the database and cluster to error out and enter an abnormal state, at which point the database host was rebooted. After the reboot the cluster logs showed ASM failing to start. Based on the ASM alert log and trace files, combined with the Oracle support note "ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481 (Doc ID 1383737.1)", the problem was judged to be cluster private-network (interconnect) communication. Examination of the interconnect configuration eventually showed that the clusterware socket files under /var/tmp/.oracle had been left inconsistent by the abnormal shutdown, which prevented node 2 from starting after the reboot. After the socket files were cleaned up, the cluster started normally.

 

 

 

 

 

Summary and Follow-up Recommendations

1. Problem summary

 

The improper operation at 15:36 caused the database and cluster to error out, and the database host was then rebooted; afterwards ASM could not start on node 2. Based on the ASM alert log and trace files, combined with Oracle support note "ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481 (Doc ID 1383737.1)", the cause was identified as cluster interconnect communication: the clusterware socket files under /var/tmp/.oracle had been left inconsistent by the abnormal shutdown, so node 2 could not communicate with node 1 after the reboot and failed to start. After the socket files were cleaned up, the cluster started and the database cluster returned to normal.
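Per Doc ID 1383737.1, the cleanup amounts to removing the stale socket files while the stack is down on the affected node; roughly (run as root, and only with Grid Infrastructure fully stopped on that node):

```
crsctl stop crs -f          # ensure the GI stack is completely down on this node
rm -rf /var/tmp/.oracle/*   # remove the stale clusterware socket files
crsctl start crs            # the sockets are recreated cleanly on startup
```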

 

 

 

2. Recommendations for adding disks to the cluster

For a production database cluster, adding disks to ASM is an important and somewhat risky operation: all database data lives in the ASM disk groups, so adding a disk to an existing group incorrectly can cause read/write errors or leave the group unable to mount, disrupting normal database operation. Based on Oracle's standard procedure "How to Add a New Disk to an Existing Diskgroup in a RAC Cluster or Standalone ASM Environment (Best Practices)" (Doc ID 1946668.1) and our years of database maintenance experience,

we recommend following these steps when adding ASM disks:

  1. Check current cluster status, ASM disk status, space usage, and disk sizes.
  2. Carve the LUNs at the storage layer; the new LUN size should match the disks already in the target ASM disk group.
  3. Map the new LUNs to the operating system on every host, and confirm each cluster node can recognize, read, and write them.
  4. Configure aliases or multipath mappings consistent with the existing ASM disks, and verify the devices can be read with kfed.
  5. Create a temporary test disk group on the new LUNs to confirm ASM can use them.
  6. Drop the test disk group.
  7. Add the new LUNs to the target ASM disk group.
  8. Check that the ASM rebalance completes, then check cluster and database status.
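Steps 7 and 8 look roughly like this on the ASM instance (the disk group name and device path below are placeholders, not from this environment):

```sql
-- Step 7: add the new disk with an explicit rebalance power
ALTER DISKGROUP data ADD DISK '/dev/mapper/asm_disk05' REBALANCE POWER 4;

-- Step 8: the rebalance is complete when this returns no rows
SELECT group_number, operation, state, est_minutes
FROM   v$asm_operation;
```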
A Record of Handling an ASM Startup Failure Caused by an Improper ASM Disk Add

Overview

The hospital's database monitoring platform showed that connections to node 1 of the HIS database were abnormal. We connected to the database remotely, troubleshot the fault, and quickly restored normal service.

Problem and Log Analysis

1. Check the listener

The inspection report from 2021/05/18 showed the 1.80 IP address registered with the listener, but this check found the listener no longer had it.

 

2. Check cluster status

Checking the cluster showed multiple resources in the OFFLINE state:

[grid@cxhisdb02 ~]$ crsctl stat res -t

--------------------------------------------------------------------------------

NAME           TARGET  STATE        SERVER                   STATE_DETAILS

--------------------------------------------------------------------------------

Local Resources

--------------------------------------------------------------------------------

ora.DATA.dg

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.LISTENER.lsnr

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.OCR.dg

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.SSD.dg

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.asm

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02                Started

ora.gsd

OFFLINE OFFLINE      cxhisdb01

OFFLINE OFFLINE      cxhisdb02

ora.net1.network

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.ons

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.registry.acfs

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

--------------------------------------------------------------------------------

Cluster Resources

--------------------------------------------------------------------------------

ora.LISTENER_SCAN1.lsnr

1        ONLINE  ONLINE       cxhisdb02

ora.cvu

1        ONLINE  OFFLINE

ora.cxhisdb01.vip

1        ONLINE  OFFLINE

ora.cxhisdb02.vip

1        ONLINE  ONLINE       cxhisdb02

ora.hospital.db

1        ONLINE  OFFLINE

2        ONLINE  ONLINE       cxhisdb02                Open

ora.oc4j

1        ONLINE  OFFLINE

ora.scan1.vip

1        ONLINE  ONLINE       cxhisdb02

 

 

3. Check the cluster logs

Only the CRSD service was abnormal; the other cluster components were fine, so the database did not go down.

2021-05-26 13:23:46.059:

[crsd(145215)]CRS-1006:The OCR location +OCR is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:46.068:

[crsd(145215)]CRS-1006:The OCR location +OCR is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:56.293:

[/u01/app/11.2.0/grid/bin/oraagent.bin(66885)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/oraagent_grid’ disconnected from server. Details at (:CRSAGF00117:) {0:21:18} in

/u01/app/11.2.0/grid/log/cxhisdb01/agent/crsd/oraagent_grid/oraagent_grid.log.

2021-05-26 13:23:56.294:

[/u01/app/11.2.0/grid/bin/oraagent.bin(31320)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/oraagent_oracle’ disconnected from server. Details at (:CRSAGF00117:) {0:19:5060

3} in /u01/app/11.2.0/grid/log/cxhisdb01/agent/crsd/oraagent_oracle/oraagent_oracle.log.

2021-05-26 13:23:56.461:

[/u01/app/11.2.0/grid/bin/orarootagent.bin(145347)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/orarootagent_root’ disconnected from server. Details at (:CRSAGF00117:) {0:

5:1568} in /u01/app/11.2.0/grid/log/cxhisdb01/agent/crsd/orarootagent_root/orarootagent_root.log.

2021-05-26 13:23:56.485:

[/u01/app/11.2.0/grid/bin/scriptagent.bin(145549)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/scriptagent_grid’ disconnected from server. Details at (:CRSAGF00117:) {0:9:

68} in /u01/app/11.2.0/grid/log/cxhisdb01/agent/crsd/scriptagent_grid/scriptagent_grid.log.

2021-05-26 13:23:56.651:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:23:58.540:

[crsd(5795)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:58.548:

[crsd(5795)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:58.964:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:00.374:

[crsd(5834)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:00.382:

[crsd(5834)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:01.010:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:02.447:

[crsd(5886)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:02.455:

[crsd(5886)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:03.068:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:04.457:

[crsd(5909)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:04.465:

[crsd(5909)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:05.102:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:06.492:

[crsd(5937)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:06.501:

[crsd(5937)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:07.132:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:08.517:

[crsd(5986)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:08.525:

[crsd(5986)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:09.162:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:10.544:

[crsd(6015)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:10.552:

[crsd(6015)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:11.193:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:12.581:

[crsd(6051)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:12.589:

[crsd(6051)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:13.223:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:14.614:

[crsd(6070)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:14.622:

[crsd(6070)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:15.253:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:16.643:

[crsd(6090)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:16.650:

[crsd(6090)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:17.284:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:17.284:

[ohasd(144192)]CRS-2771:Maximum restart attempts reached for resource ‘ora.crsd’; will not restart.

2021-05-26 13:24:17.315:

[ohasd(144192)]CRS-2769:Unable to failover resource ‘ora.crsd’.

 

4. Check the ASM log

The ASM log shows problems with the OCR quorum disks at 12:19:57 on 2021/05/26, and read/write errors at 13:23.

Wed May 26 12:19:57 2021

WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 3 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 4 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 3 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 4 in group 3.

Wed May 26 12:19:57 2021

NOTE: process _b000_+asm1 (160488) initiating offline of disk 0.1409468596 (OCR_0000) with mask 0x7e in group 3

NOTE: process _b000_+asm1 (160488) initiating offline of disk 2.1409468594 (OCR_0002) with mask 0x7e in group 3

NOTE: process _b000_+asm1 (160488) initiating offline of disk 3.1409468595 (OCR_0003) with mask 0x7e in group 3

NOTE: process _b000_+asm1 (160488) initiating offline of disk 4.1409468592 (OCR_0004) with mask 0x7e in group 3

NOTE: checking PST: grp = 3

GMON checking disk modes for group 3 at 15 for pid 46, osid 160488

ERROR: no read quorum in group: required 3, found 1 disks

NOTE: checking PST for grp 3 done.

NOTE: initiating PST update: grp = 3, dsk = 0/0x5402c8b4, mask = 0x6a, op = clear

NOTE: initiating PST update: grp = 3, dsk = 2/0x5402c8b2, mask = 0x6a, op = clear

NOTE: initiating PST update: grp = 3, dsk = 3/0x5402c8b3, mask = 0x6a, op = clear

NOTE: initiating PST update: grp = 3, dsk = 4/0x5402c8b0, mask = 0x6a, op = clear

GMON updating disk modes for group 3 at 16 for pid 46, osid 160488

ERROR: no read quorum in group: required 3, found 1 disks

Wed May 26 12:19:57 2021

NOTE: cache dismounting (not clean) group 3/0xA242386E (OCR)

NOTE: messaging CKPT to quiesce pins Unix process pid: 160495, image: oracle@cxhisdb01 (B001)

Wed May 26 12:19:57 2021

NOTE: halting all I/Os to diskgroup 3 (OCR)

Wed May 26 12:19:57 2021

NOTE: LGWR doing non-clean dismount of group 3 (OCR)

NOTE: LGWR sync ABA=15.85 last written ABA 15.85

WARNING: Offline for disk OCR_0000 in mode 0x7f failed.

WARNING: Offline for disk OCR_0002 in mode 0x7f failed.

WARNING: Offline for disk OCR_0003 in mode 0x7f failed.

WARNING: Offline for disk OCR_0004 in mode 0x7f failed.

Wed May 26 12:19:58 2021

kjbdomdet send to inst 2

detach from dom 3, sending detach message to inst 2

Wed May 26 12:19:58 2021

NOTE: No asm libraries found in the system

Wed May 26 12:19:58 2021

List of instances:

1 2

Dirty detach reconfiguration started (new ddet inc 1, cluster inc 4)

Global Resource Directory partially frozen for dirty detach

* dirty detach – domain 3 invalid = TRUE

2 GCS resources traversed, 0 cancelled

Dirty Detach Reconfiguration complete

Wed May 26 12:19:58 2021

WARNING: dirty detached from domain 3

NOTE: cache dismounted group 3/0xA242386E (OCR)

 

 

2021-05-26 13:23:46.059:

[crsd(145215)]CRS-1006:The OCR location +OCR is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:46.068:

[crsd(145215)]CRS-1006:The OCR location +OCR is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

 

 

Problem Handling Steps

Only node 1 lost access to the OCR disk group; node 2 and the other disk groups were fine. As a result only node 1's CRS resources failed and the database itself never went down. Recovery therefore consisted of mounting the OCR disk group, starting the CRS resource, and restarting the listener on node 1.

1. Mount the OCR disk group on node 1

[root@cxhisdb01 ~]# su - grid

[grid@cxhisdb01 ~]$ sqlplus / as sysasm

 

SQL*Plus: Release 11.2.0.4.0 Production on Thu May 27 12:12:17 2021

 

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

 

 

Connected to:

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production

With the Real Application Clusters and Automatic Storage Management options

 

SQL> alter diskgroup ocr mount;

 

Diskgroup altered.

 

SQL> exit

2. Start CRS

[grid@cxhisdb01 ~]$ crsctl start res ora.crsd -init

CRS-2672: Attempting to start 'ora.crsd' on 'cxhisdb01'

CRS-2676: Start of 'ora.crsd' on 'cxhisdb01' succeeded

3. Check cluster resources

[grid@cxhisdb01 ~]$ crsctl stat res -t

--------------------------------------------------------------------------------

NAME           TARGET  STATE        SERVER                   STATE_DETAILS

--------------------------------------------------------------------------------

Local Resources

--------------------------------------------------------------------------------

ora.DATA.dg

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.LISTENER.lsnr

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.OCR.dg

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.SSD.dg

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.asm

ONLINE  ONLINE       cxhisdb01                Started

ONLINE  ONLINE       cxhisdb02                Started

ora.gsd

OFFLINE OFFLINE      cxhisdb01

OFFLINE OFFLINE      cxhisdb02

ora.net1.network

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.ons

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.registry.acfs

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

--------------------------------------------------------------------------------

Cluster Resources

--------------------------------------------------------------------------------

ora.LISTENER_SCAN1.lsnr

1        ONLINE  ONLINE       cxhisdb02

ora.cvu

1        ONLINE  ONLINE       cxhisdb01

ora.cxhisdb01.vip

1        ONLINE  ONLINE       cxhisdb01

ora.cxhisdb02.vip

1        ONLINE  ONLINE       cxhisdb02

ora.hospital.db

1        ONLINE  ONLINE       cxhisdb01                Open

2        ONLINE  ONLINE       cxhisdb02                Open

ora.oc4j

1        ONLINE  ONLINE       cxhisdb01

ora.scan1.vip

1        ONLINE  ONLINE       cxhisdb02

4. Restart the listener

[grid@cxhisdb01 ~]$ srvctl stop listener -n cxhisdb01

[grid@cxhisdb01 ~]$ srvctl start listener -n cxhisdb01

[grid@cxhisdb01 ~]$ lsnrctl status

 

LSNRCTL for Linux: Version 11.2.0.4.0 - Production on 27-MAY-2021 12:20:44

 

Copyright (c) 1991, 2013, Oracle.  All rights reserved.

 

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER)))

STATUS of the LISTENER

------------------------

Alias                     LISTENER

Version                   TNSLSNR for Linux: Version 11.2.0.4.0 - Production

Start Date                27-MAY-2021 12:20:40

Uptime                    0 days 0 hr. 0 min. 4 sec

Trace Level               off

Security                  ON: Local OS Authentication

SNMP                      OFF

Listener Parameter File   /u01/app/11.2.0/grid/network/admin/listener.ora

Listener Log File         /u01/app/grid/diag/tnslsnr/cxhisdb01/listener/alert/log.xml

Listening Endpoints Summary…

(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))

(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.80)(PORT=1521)))

(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.10)(PORT=1521)))

Services Summary…

Service “hospital” has 1 instance(s).

Instance “hospital1”, status READY, has 1 handler(s) for this service…

Service “hospitalXDB” has 1 instance(s).

Instance “hospital1”, status READY, has 1 handler(s) for this service…

The command completed successfully

[grid@cxhisdb01 ~]$ cd /u01/app/grid/diag/tnslsnr/cxhisdb01/listener/trace/

[grid@cxhisdb01 trace]$ tail -f listener.log

27-MAY-2021 12:20:51 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVICE_NAME=hospital)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.171)(PORT=62045)) * establish * hospital * 0

27-MAY-2021 12:20:51 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=ZY-603300-02-YS)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.9.58)(PORT=3676)) * establish * hospital * 0

27-MAY-2021 12:20:51 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=ZY-603300-02-YS)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.9.58)(PORT=3677)) * establish * hospital * 0

27-MAY-2021 12:20:55 * (CONNECT_DATA=(SERVICE_NAME=HOSPITAL)(CID=(PROGRAM=oracle)(HOST=lis-server)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.11)(PORT=23431)) * establish * HOSPITAL * 0

27-MAY-2021 12:20:55 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=JZ-EK-001)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.3.99)(PORT=4442)) * establish * hospital * 0

27-MAY-2021 12:20:55 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=JZ-EK-001)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.3.99)(PORT=4444)) * establish * hospital * 0

27-MAY-2021 12:20:56 * (CONNECT_DATA=(SID=hospital1)(CID=(PROGRAM=配置数据库.exe)(HOST=YYJQZZJFW)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.244)(PORT=58397)) * establish * hospital1 * 0

27-MAY-2021 12:20:56 * (CONNECT_DATA=(SERVICE_NAME=HOSPITAL)(CID=(PROGRAM=oracle)(HOST=lis-server)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.11)(PORT=23432)) * establish * HOSPITAL * 0

27-MAY-2021 12:20:56 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=ZY-603300-02-YS)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.9.58)(PORT=3681)) * establish * hospital * 0

27-MAY-2021 12:20:57 * service_update * hospital1 * 0

27-MAY-2021 12:20:58 * (CONNECT_DATA=(SERVICE_NAME=HOSPITAL)(CID=(PROGRAM=oracle)(HOST=lis-server)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.11)(PORT=23438)) * establish * HOSPITAL * 0

27-MAY-2021 12:20:59 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=JZ-EK-001)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.3.99)(PORT=4460)) * establish * hospital * 0

27-MAY-2021 12:21:00 * service_update * hospital1 * 0

Thu May 27 12:21:01 2021

27-MAY-2021 12:21:01 * (CONNECT_DATA=(SID=hospital1)(CID=(PROGRAM=配置数据库.exe)(HOST=YYJQZZJFW)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.244)(PORT=58398)) * establish * hospital1 * 0

27-MAY-2021 12:21:03 * service_update * hospital1 * 0

27-MAY-2021 12:21:03 * (CONNECT_DATA=(SERVICE_NAME=HOSPITAL)(CID=(PROGRAM=oracle)(HOST=lis-server)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.11)(PORT=23444)) * establish * HOSPITAL * 0

 

Summary and Follow-up Recommendations

1. Problem summary

Node 1 of the HIS database lost access to the OCR quorum disks, which brought down CRSD and took multiple cluster resources on node 1 offline; the 192.2.1.80 IP address also disappeared from the listener compared with earlier checks, which is why the monitoring platform reported connection problems to HIS node 1. Because node 2 and the other disk groups were unaffected, the database itself never went down.

2. Actions taken

  • Manually mounted the OCR disk group on HIS database node 1
  • Started the CRS resource
  • Restarted the listener on node 1

3. Recommendations

This fault was caused by the cluster losing access to the OCR disk group. We recommend engaging the storage engineers to investigate how the storage behaved around the time of the incident, and performing regular storage inspections and status monitoring.

Oracle Voting Disk: A Troubleshooting Case

A newly installed Oracle 19c Grid Infrastructure cluster failed to start with the following errors:

[root@luda bin]# crsctl start crs
CRS-41053: checking Oracle Grid Infrastructure for file permission issues
PRVH-0112 : Path "/etc/oracle/maps" with permissions "rwxrwxr-x" does not have write permissions for others on node "rac1".
PRVH-0100 : Restricted deletion flag is not set for path "/etc/oracle/maps" on node "rac1".
PRVH-0100 : Restricted deletion flag is not set for path "/oracle/19c/grid/crsdata/rac1/shm" on node "rac1".
PRVG-11960 : Set user ID bit is not set for file "/oracle/19c/19.3.0/grid/bin/jssu" on node "rac1".
PRVH-0147 : Set group ID bit is not set for file "/oracle/19c/19.3.0/grid/bin/extproc" on node "rac1".
PRVG-11960 : Set user ID bit is not set for file "/oracle/19c/19.3.0/grid/bin/extjob" on node "rac1".
PRVG-11960 : Set user ID bit is not set for file "/oracle/19c/19.3.0/grid/bin/oracle" on node "rac1".
PRVH-0147 : Set group ID bit is not set for file "/oracle/19c/19.3.0/grid/bin/oracle" on node "rac1".
PRVG-11960 : Set user ID bit is not set for file "/oracle/19c/19.3.0/grid/bin/oradism" on node "rac1".
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.

 

The check showed file-permission problems, which were fixed by adding the required permission bits:

 

chmod a+s /oracle/19c/19.3.0/grid/bin/extjob
chmod a+s /oracle/19c/19.3.0/grid/bin/oracle
chmod a+s /oracle/19c/19.3.0/grid/bin/oradism
chmod +t /oracle/19c/grid/crsdata/rac1/shm
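The effect of these bits can be verified with `stat` after the change. A self-contained illustration on a scratch file (not on the real Grid binaries):

```shell
#!/bin/sh
# Demonstrate what chmod a+s does to an executable's mode bits.
f=$(mktemp)
chmod 755 "$f"        # a typical executable mode
chmod a+s "$f"        # add setuid + setgid, as in the fix above -> mode 6755
stat -c '%a %A' "$f"  # the owner and group execute slots now show 's'
rm -f "$f"
```

The `chmod +t` on the shm directory sets the restricted-deletion (sticky) flag the PRVH-0100 messages complain about.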

Oracle 19c RAC: PRVG-11960 and PRVH-0147

One of 19c's notable features is DML redirection: DML statements issued against an ADG standby are automatically forwarded to the primary, executed there, and then applied back to the standby in real time through redo shipping. This preserves ACID semantics while making the standby more useful.

In 18c the feature was controlled by the hidden parameter _enable_proxy_adg_redirect;

in 19c it is controlled by the documented parameter ADG_REDIRECT_DML:

SQL> show parameter ADG_REDIRECT_DML

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
adg_redirect_dml boolean FALSE

SQL> alter system set adg_redirect_dml=true scope=both;

System altered.

SQL> create user test identified by test;
create user test identified by test
*
ERROR at line 1:
ORA-16000: database or pluggable database open for read-only access

SQL> create table test(id int,sdate char(20));
create table test(id int,sdate char(20))
*
ERROR at line 1:
ORA-16000: database or pluggable database open for read-only access

SQL> conn luda/luda
Connected.
SQL> show parameter ADG_REDIRECT_DML

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
adg_redirect_dml boolean TRUE
SQL>
alter session set events '10046 trace name context forever, level 12';

insert into test values (1, to_char(current_timestamp at time zone dbtimezone,'dd-mon-rr hh:mi:ss'));

alter session set events '10046 trace name context off';

select distinct(m.sid),p.pid,p.tracefile from v$mystat m,v$session s,v$process p where m.sid=s.sid and s.paddr=p.addr;

SID        PID
---------- ----------
TRACEFILE
--------------------------------------------------------------------------------
223        44
/u01/app/oracle/diag/rdbms/std/orcl/trace/orcl_ora_1768553.trc

Format the trace with tkprof:

tkprof /u01/app/oracle/diag/rdbms/std/orcl/trace/orcl_ora_1768553.trc /home/oracle/orcl_ora_1768553.out

[oracle@oracle19c ~]$ cat orcl_ora_1768553.out

TKPROF: Release 19.0.0.0.0 - Development on Mon Apr 27 07:56:09 2020

Copyright (c) 1982, 2019, Oracle and/or its affiliates. All rights reserved.

Trace file: /u01/app/oracle/diag/rdbms/std/orcl/trace/orcl_ora_1768553.trc
Sort options: default

********************************************************************************
count = number of times OCI procedure was executed
cpu = cpu time in seconds executing
elapsed = elapsed time in seconds executing
disk = number of physical reads of buffers from disk
query = number of buffers gotten for consistent read
current = number of buffers gotten in current mode (usually for update)
rows = number of rows processed by the fetch or execute call
********************************************************************************

The following statement encountered a error during parse:

insert into test values (1, to_char(current_timestamp at time zone dbtimezone,'dd-mon-rr hh:mi:ss'))

Error encountered: ORA-01157
********************************************************************************

SQL ID: 8r8b2pyy9qjm2 Plan Hash: 0

alter session set events '10046 trace name context forever, level 12'

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 0 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 1 0.00 0.00 0 0 0 0

Misses in library cache during parse: 0
Misses in library cache during execute: 1
Optimizer mode: ALL_ROWS
Parsing user id: 106

Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 3 0.00 0.00
SQL*Net message from client 3 26.12 47.13
SQL*Net break/reset to client 3 0.00 0.00
single-task message 1 0.01 0.01
SQL*Net message from dblink 7 0.01 0.02
SQL*Net message to dblink 6 0.00 0.00
SQL*Net vector data to dblink 1 0.00 0.00
asynch descriptor resize 1 0.00 0.00
Disk file operations I/O 1 0.00 0.00
control file sequential read 8 0.00 0.00
********************************************************************************

SQL ID: 06nvwn223659v Plan Hash: 0

alter session set events '10046 trace name context off'

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2 0.00 0.00 0 0 0 0

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 106

********************************************************************************

OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 2 0.00 0.00 0 0 0 0
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 3 0.00 0.00 0 0 0 0

Misses in library cache during parse: 1
Misses in library cache during execute: 1

Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 3 0.00 0.00
SQL*Net message from client 3 26.12 47.13
SQL*Net break/reset to client 3 0.00 0.00
single-task message 1 0.01 0.01
SQL*Net message from dblink 7 0.01 0.02
SQL*Net message to dblink 6 0.00 0.00
SQL*Net vector data to dblink 1 0.00 0.00
asynch descriptor resize 1 0.00 0.00
Disk file operations I/O 1 0.00 0.00
control file sequential read 8 0.00 0.00

OVERALL TOTALS FOR ALL RECURSIVE STATEMENTS

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 0 0.00 0.00 0 0 0 0
Execute 0 0.00 0.00 0 0 0 0
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 0 0.00 0.00 0 0 0 0

Misses in library cache during parse: 0

2 user SQL statements in session.
0 internal SQL statements in session.
2 SQL statements in session.
********************************************************************************
Trace file: /u01/app/oracle/diag/rdbms/std/orcl/trace/orcl_ora_1768553.trc
Trace file compatibility: 12.2.0.0
Sort options: default

1 session in tracefile.
2 user SQL statements in trace file.
0 internal SQL statements in trace file.
2 SQL statements in trace file.
2 unique SQL statements in trace file.
88 lines in trace file.
47 elapsed seconds in trace file.

Oracle 19c ADG Automatic DML Redirection: ADG_REDIRECT_DML