Skip to content

Oracle - 12. page

概述

 

数据库对应的业务程序在2021/3/7日凌晨0:02分报无法连接数据库,在0:04分接到故障申告后,代维科技工程师通过远程方式接入数据库环境进行故障排查,在0:08分处理完成。通过对相关日志等信息的深入分析,找出问题点并提出相应的解决建议,根据相关问题日志及分析处理情况整理汇总此文档。

 

无法连接的原因分析

1. 宁帆数据库无法连接分析

 

登陆数据库进行查看,数据库无法连接原因为数据库主机磁盘空间使用率100%,归档日志无法切换,应用程序此时无法连接到数据库。紧急通过人工删除一部分数据库日志,并进行日志切换动作后,程序可以正常连接。

[root@nf-dboracle ~]# df -h

Filesystem                          Size  Used Avail Use% Mounted on

/dev/mapper/vg_nfdboracle-LogVol00  476G  452G   16M 100% /

tmpfs                               7.8G  160K  7.8G   1% /dev/shm

/dev/sda1                          1008M   63M  895M   7% /boot

/dev/sr0                            3.6G  3.6G     0 100% /mnt

[root@nf-dboracle ~]# su – oracle

[oracle@nf-dboracle ~]$ rman target /

 

Recovery Manager: Release 11.2.0.3.0 – Production on Sun Mar 7 00:07:32 2021

 

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

 

RMAN-00571: ===========================================================

RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============

RMAN-00571: ===========================================================

RMAN-00554: initialization of internal recovery manager package failed

RMAN-04005: error from target database:

ORA-09945: Unable to initialize the audit trail file

Linux-x86_64 Error: 28: No space left on device

 

 

2. 磁盘空间不足原因的分析

通过对数据库归档日志产生数量的分析,可以发现,从3月4号以来,每天的数据库归档日志产生量是原来的15至20倍左右,当前每天产生120GB的归档日志,按原有的备份策略,归档日志保留2天,因此最终引起了磁盘空间不足。

 

 

 

3. 产生归档日志量多的操作分析

数据库层面看,70%的数据变更都在KLB_MICROHOSP用户的 NFSYS_LOG_THIRD表上,在TOP SQL中也发现了对这个表的频率插入;结合表名看,有可能是新开启了写入日志的功能,具体需要应用软件开发商确认一下。

INSERT INTO NFSYS_LOG_THIRD ( ID, TYPE, METHOD, MSG, DESCRIPTION, NAME, OPENID, THIRD_METHOD ) VALUES ( NFSYS_LOG_THIRD_S.nextval, :type0, :method0, :msg, :description, :name, :openid, :third_method ) RETURNING ID INTO :IDENTITY_PARAMETER

 

 


总结与后续处理建议

通过对数据库相关日志以及相关数据的分析,可以得出如下信息:

1. 数据库无法连接原因的分析:

登陆数据库进行查看,数据库无法连接原因为数据库主机磁盘空间使用率100%,归档日志无法切换,应用程序此时无法连接到数据库。紧急通过人工删除一部分数据库日志,并进行日志切换动作后,程序可以正常连接。(0点4分接到电话,0点8分处理完成,用时4分钟。)

2. 磁盘空间不足分析:

通过对数据库归档日志产生数量的分析,可以发现,从3月4号以来,每天的数据库归档日志产生量是原来的15至20倍左右,当前每天产生120GB的归档日志,按原有的备份策略,归档日志保留2天,因此最终引起了磁盘空间不足。已经临时修改了备份及删除归档日志的策略来解决此问题;

3. 产生日志多的应用程序语句:

从数据库层面看,70%的数据变更都在KLB_MICROHOSP用户的 NFSYS_LOG_THIRD表上,在TOP SQL中也发现了对这个表的频率插入;结合表名看,有可能是新开启了写入日志的功能,具体需要应用软件开发商确认一下。

INSERT INTO NFSYS_LOG_THIRD ( ID, TYPE, METHOD, MSG, DESCRIPTION, NAME, OPENID, THIRD_METHOD ) VALUES ( NFSYS_LOG_THIRD_S.nextval, :type0, :method0, :msg, :description, :name, :openid, :third_method ) RETURNING ID INTO :IDENTITY_PARAMETER

 

 

4. 下一步排查及处理方向

因此,接下来需要及时排查的问题为:

  1. 应用程序开发商根据对应的SQL语句,排查程序在3/4日做了哪些升级或更新操作。
  2. 程序的调整是短期还是长期需要按此模式运行;这决定了后续数据库层面如何制定相应的备份及删除策略。
某客户报表数据库无法连接故障分析
某医疗客户最近在做大表归档以及优化的项目,加上oracle版本的变更,最近参考oracle support结合以前的版本经验总结如下的集中方式:
参考 NOTE:1390871.1
A) 通过 Export/import 方法

B) 通过 Insert with a subquery 方法

C) 通过 Partition Exchange 方法

D) 通过 DBMS_REDEFINITION 方法

E) ALTER TABLE 命令的MODIFY语法 (从 12.2 开始)

以上5种方法都会从一个已经存在的非分区表创建出一个分区表。尽管我们已经对这四种方法的维护时间预期做了倒序排列,但实际维护时间针对不同情况有所不同。

通过 Export/import 方法

这种方法的实现是先 export 一个非分区表,创建一个新的分区表,然后 import 数据到新创建的分区表中。

1) Export 您的非分区表:

$ exp usr/pswd tables=numbers file=exp.dmp

2) Drop 掉该非分区表:

SQL> drop table numbers;

3) 重新创建该表成为一个分区表:

SQL> create table numbers (qty number(3), name varchar2(15)) partition by range (qty)(partition p1 values less than (501),
partition p2 values less than (maxvalue));

4) 通过 import 的 ignore=y 方式来还原备份的数据:

$ imp usr/pswd file=exp.dmp ignore=y

ignore=y 语句会让 import 忽略掉表的创建,直接加载所有数据。

如果使用 Data Pump export/import(expdp/impdp)您可以采用 impdp 的 table_exists_action 选项,例如 table_exists_action = APPEND 或者 table_exists_action = REPLACE。

可以参考 Note 552424.1 Export/Import DataPump Parameter ACCESS_METHOD – How to Enforce a Method of Loading and Unloading Data?

通过 Insert with a subquery 方法

1) 创建一个分区表:

SQL> create table partbl (qty number(3), name varchar2(15)) partition by range (qty) (partition p1 values less than (501),partition p2 values less than (maxvalue));

2) 将原来非分区表中的数据通过子查询 insert 到新创建的分区表中:

SQL> insert into partbl (qty, name) select * from origtbl;

3) 如果您想让新建的分区表与原表名相同,那么 drop 掉原来的非分区表然后重命名新表:

SQL> drop table origtbl;
SQL> alter table partbl rename to origtbl;

您可以通过 direct path insert 和利用并行来改善 insert 的性能。如下的例子演示了如何实现并且如何从执行计划中来验证。

传统的 insert

SQL> insert into partbl (qty, name) select * from origtbl;
--------------------------------------------
| Id  | Operation                | Name    |
--------------------------------------------
|   0 | INSERT STATEMENT         |         |
|   1 |  LOAD TABLE CONVENTIONAL |         |
|   2 |   TABLE ACCESS FULL      | ORIGTBL |
--------------------------------------------

Direct load insert 方式

SQL> insert /*+APPEND*/ into partbl (qty, name) select * from origtbl;
--------------------------------------
| Id  | Operation          | Name    |
--------------------------------------
|   0 | INSERT STATEMENT   |         |
|   1 |  LOAD AS SELECT    |         |
|   2 |   TABLE ACCESS FULL| ORIGTBL |
--------------------------------------

Direct load insert 并且在查询部分开启并行

SQL> insert /*+APPEND PARALLEL*/ into partbl (qty, name) select * from origtbl;
------------------------------------------
| Id  | Operation             | Name     |
------------------------------------------
|   0 | INSERT STATEMENT      |          |
|   1 |  LOAD AS SELECT       |          |
|   2 |   PX COORDINATOR      |          |
|   3 |    PX SEND QC (RANDOM)| :TQ10000 |
|   4 |     PX BLOCK ITERATOR |          |
|*  5 |      TABLE ACCESS FULL| ORIGTBL  |
------------------------------------------

注意以上执行计划中 LOAD AS SELECT 在 PX COORDINATOR 的上面。

Direct load insert 并且在查询部分和 insert 部分都开启并行

SQL>alter session enable parallel dml;
SQL> insert /*+APPEND PARALLEL*/ into partbl (qty, name) select * from origtbl;
------------------------------------------
| Id  | Operation             | Name     |
------------------------------------------
|   0 | INSERT STATEMENT      |          |
|   1 |  PX COORDINATOR       |          |
|   2 |   PX SEND QC (RANDOM) | :TQ10000 |
|   3 |    LOAD AS SELECT     |          |
|   4 |     PX BLOCK ITERATOR |          |
|*  5 |      TABLE ACCESS FULL| ORIGTBL  |
------------------------------------------

注意在以上执行计划中 LOAD AS SELECT 在 PX COORDINATOR 的下面。

另外一种可选的方式是直接通过 select 来创建新的分区表:一次性创建新的分区表并且加载数据。
执行计划同时显示 direct path load 并且 dml 以及 select 部分全部并行。

SQL>alter session enable parallel dml;
SQL> create table partbl (qty, name) partition by range (qty) (partition p1 values less than (501),partition p2 values less than (maxvalue))
  2  as select /*+PARALLEL*/ * from origtbl;
-------------------------------------------
| Id  | Operation              | Name     |
-------------------------------------------
|   0 | CREATE TABLE STATEMENT |          |
|   1 |  PX COORDINATOR        |          |
|   2 |   PX SEND QC (RANDOM)  | :TQ10000 |
|   3 |    LOAD AS SELECT      |          |
|   4 |     PX BLOCK ITERATOR  |          |
|*  5 |      TABLE ACCESS FULL | ORIGTBL  |
-------------------------------------------

 通过 Partition Exchange 方法

ALTER TABLE EXCHANGE PARTITION 可以通过交换数据和索引 segment 来将一个分区(或子分区)转换成一个非分区表,也可以将一个非分区表转换成一个分区表的分区(或子分区)。 除了需要更新索引以外,ALTER TABLE … EXCHANGE PARTITION 命令是一个字典操作不需要数据移动。更多关于此方法的信息参见 Oracle 联机文档(比如 11.2)和 Note 198120.1。

此方法简要步骤如下:

1) 根据所需的分区来创建新的分区表
2) 保持需要交换的非分区表与分区表的分区有相同的结构,并且确保您需要交换的非分区表具有您想要交换的内容
3) 执行:Alter table exchange partition partition_name with table exchange table

注意在交换过程中,所有交换的数据必须满足分区表的分区定义,否则如下错误将抛出:ORA-14099: all rows in table do not qualify for specified partition.

这是因为默认情况下分区交换是有校验的。

例子(基于 SCOTT 示例 schema)
———

本例创建了与分区表 p_emp 的分区相同结构的交换表。

SQL> CREATE TABLE p_emp
2 (sal NUMBER(7,2))
3 PARTITION BY RANGE(sal)
4 (partition emp_p1 VALUES LESS THAN (2000),
5 partition emp_p2 VALUES LESS THAN (4000));

Table created.

SQL> SELECT * FROM emp;

EMPNO ENAME      JOB              MGR HIREDATE         SAL       COMM     DEPTNO
———- ———- ——— ———- ——— ———- ———- ———-
7369 SMITH      CLERK           7902 17-DEC-80        800                    20
7499 ALLEN      SALESMAN        7698 20-FEB-81       1600        300         30
7521 WARD       SALESMAN        7698 22-FEB-81       1250        500         30
7566 JONES      MANAGER         7839 02-APR-81       2975                    20
7654 MARTIN     SALESMAN        7698 28-SEP-81       1250       1400         30
7698 BLAKE      MANAGER         7839 01-MAY-81       2850                    30
7782 CLARK      MANAGER         7839 09-JUN-81       2450                    10
7788 SCOTT      ANALYST         7566 19-APR-87       3000                    20
7839 KING       PRESIDENT            17-NOV-81       5000                    10
7844 TURNER     SALESMAN        7698 08-SEP-81       1500          0         30
7876 ADAMS      CLERK           7788 23-MAY-87       1100                    20
7900 JAMES      CLERK           7698 03-DEC-81        950                    30
7902 FORD       ANALYST         7566 03-DEC-81       3000                    20
7934 MILLER     CLERK           7782 23-JAN-82       1300                    10

14 rows selected.

SQL> CREATE TABLE exchtab1 as SELECT sal FROM emp WHERE sal<2000;

Table created.

SQL> CREATE TABLE exchtab2 as SELECT sal FROM emp WHERE sal BETWEEN 2000 AND 3999;

Table created.

SQL> alter table p_emp exchange partition emp_p1 with table exchtab1;

Table altered.

SQL> alter table p_emp exchange partition emp_p2 with table exchtab2;

Table altered.

D. 通过在线重定义的方式(DBMS_REDEFINITION) 方法

详情参见:

Note 472449.1 How To Partition Existing Table Using DBMS_Redefinition
Note 1481558.1  DBMS_REDEFINITION: Case Study for a Large Non-Partition Table to a Partition Table with Online Transactions occuring
Note 177407.1 How to Re-Organize a Table Online

 

 

E. MODIFY clause added to the ALTER TABLE SQL Statement. (From 12.2 to 20c)

从 12.2开始,可以使用 Alter Table <table_name> MODIFY 把非分区表转换成分区表。

更多详情请参考 https://docs.oracle.com/en/database/oracle/oracle-database/12.2/vldbg/evolve-nopartition-table.html#GUID-5FDB7D59-DD05-40E4-8AB4-AF82EA0D0FE5

Oracle非分区表转换成分区表的若干方式

概述

 

某客户由存储厂商维护人员将LUN存储划分到EMR数据库主机上;在此期间由于不当操作,导致数据库集群状态异常,之后重启了数据库主机节点2,重启主机后集群软件GRID无法正常启动,节点1仍可以正常提供数据库服务。通过远程方式接入数据库环境进行故障排查,通过对相关日志等信息的深入分析,最终找到问题点并进行处理,数据库集群恢复正常。

 

集群及数据库日志分析

1. 15:36分主机重启前的错误信息

从此处日志可以看出,在15:36分时数据库实例报IO读写错误,同时数据库集群日志中也在显示报错,之后发生了主机重启动作。

 

2. 节点2重启主机后相关日志信息

节点2重启主机后,检查集群启动状态可以发现在启动到ASM实例时停止,检查ASM ALERT日志有如下信息:

集群ALERT日志:

2021-12-12 15:48:34.081:

[client(10295)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

[client(10405)]CRS-10001:12-Dec-19 15:48 ACFS-9391: Checking for existing ADVM/ACFS installation.

[client(10410)]CRS-10001:12-Dec-19 15:48 ACFS-9392: Validating ADVM/ACFS installation files for operating system.

[client(10412)]CRS-10001:12-Dec-19 15:48 ACFS-9393: Verifying ASM Administrator setup.

[client(10415)]CRS-10001:12-Dec-19 15:48 ACFS-9308: Loading installed ADVM/ACFS drivers.

[client(10418)]CRS-10001:12-Dec-19 15:48 ACFS-9154: Loading ‘oracleoks.ko’ driver.

[client(10446)]CRS-10001:12-Dec-19 15:48 ACFS-9154: Loading ‘oracleadvm.ko’ driver.

[client(10473)]CRS-10001:12-Dec-19 15:48 ACFS-9154: Loading ‘oracleacfs.ko’ driver.

[client(10560)]CRS-10001:12-Dec-19 15:48 ACFS-9327: Verifying ADVM/ACFS devices.

[client(10564)]CRS-10001:12-Dec-19 15:48 ACFS-9156: Detecting control device ‘/dev/asm/.asm_ctl_spec’.

[client(10568)]CRS-10001:12-Dec-19 15:48 ACFS-9156: Detecting control device ‘/dev/ofsctl’.

[client(10573)]CRS-10001:12-Dec-19 15:48 ACFS-9322: completed

2021-12-12 15:48:43.930:

[client(10374)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

2021-12-12 15:48:53.698:

[client(10598)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

2021-12-12 15:49:03.129:

[client(10905)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

2021-12-12 15:50:00.511:

[client(12458)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

2021-12-12 15:51:07.276:

?

节点2:ASM ALERT日志

Thu Dec 12 16:22:11 2021

MMNL started with pid=21, OS id=26248

lmon registered with NM – instance number 2 (internal mem no 1)

Thu Dec 12 16:24:08 2021

LMON received an instance eviction notification from instance 1

The instance eviction reason is 0x20000000

The instance eviction map is 2

Thu Dec 12 16:24:11 2021

PMON (ospid: 26206): terminating the instance due to error 481

Thu Dec 12 16:24:11 2021

ORA-1092 : opitsk aborting process

Thu Dec 12 16:24:13 2021

System state dump requested by (instance=2, osid=26206 (PMON)), summary=[abnormal instance termination].

System State dumped to trace file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_diag_26216_20211212162413.trc

Dumping diagnostic data in directory=[cdmp_20211212162411], requested by (instance=2, osid=26206 (PMON)), summary=[abnormal instance termination].

Instance terminated by PMON, pid = 26206

检查对应的LMON及DIAG日志:

[grid@cxemrdb02 trace]$ more ./+ASM2_diag_26216.trc

Trace file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_diag_26216.trc

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 – 64bit Production

With the Real Application Clusters and Automatic Storage Management options

ORACLE_HOME = /u01/app/11.2.0/grid

System name: ???Linux

Node name: ?????cxemrdb02

Release: ???????2.6.32-431.el6.x86_64

Version: ???????#1 SMP Sun Nov 10 22:19:54 EST 2013

Machine: ???????x86_64

Instance name: +ASM2

Redo thread mounted by this instance: 0 <none>

Oracle process number: 6

Unix process pid: 26216, image: oracle@cxemrdb02 (DIAG)

*** 2021-12-12 16:22:10.788

*** SESSION ID:(373.1) 2021-12-12 16:22:10.788

*** CLIENT ID:() 2021-12-12 16:22:10.788

*** SERVICE NAME:() 2021-12-12 16:22:10.788

*** MODULE NAME:() 2021-12-12 16:22:10.788

*** ACTION NAME:() 2021-12-12 16:22:10.788

Node id: 1

List of nodes: 0, 1,

*** 2021-12-12 16:22:10.790

Reconfiguration starts [incarn=0]

*** 2021-12-12 16:22:10.790

I’m the voting node

Group reconfiguration cleanup

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

 

3. 问题分析描述

 

通过对数据库集群相关日志的分析,可以发现在15:36分时的非正常操作,导致了数据库及集群报错并状态异常,此时进行了数据库主机的重启动作;重启后集群日志中显示ASM无法启动,根据ASM日志及TRACE文件信息,结合ORACLE官方文档中ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481 (Doc ID 1383737.1)文档中描述的信息,可以判断是集群私网通信出现了问题,通过对集群私网相关配置进行分析、排查,最终确认是集群通信的socket文件(/var/tmp/.oracle)内容由于集群异常关闭导致不一致,从而引起集群节点2重启后不能正常启动。通过清理socket文件后集群可以正常启动 恢复正常。

 

 

 

 

 

三、总结与后续处理建议

1. 问题总结

 

通过对数据库集群相关日志的分析,可以发现在15:36分时的非正常操作,导致了数据库及集群报错并状态异常,此时进行了数据库主机的重启动作;重启后集群日志中显示ASM无法启动,根据ASM日志及TRACE文件信息,结合ORACLE官方文档中ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481 (Doc ID 1383737.1)文档中描述的信息,可以判断是集群私网通信出现了问题;通过对集群私网相关配置进行分析、排查,最终确认是集群通信的socket文件(/var/tmp/.oracle)内容由于集群异常关闭导致不一致,从而引起集群节点2重启后与节点1通信异常,不能正常启动。通过清理socket文件后集群可以正常启动,数据库集群恢复正常。

 

 

 

4. 后续对于集群添加磁盘的建议

对于生产环境数据库集群,向ASM环境中添加磁盘是一个重要且有一点风险的动作,由于数据库所有数据都存放在ASM磁盘组中,因此如果向已有ASM磁盘组添加磁盘不当,可能导致ASM磁盘组读写异常或不能正常MOUNT,从而影响正常的数据库业务运行。因此结合ORACLE官方的标准步骤”如何在 RAC 集群或单机 ASM 环境中对已经存在的 Diskgroup 添加新磁盘(最佳实践) (文档 ID 1946668.1)”以及我们多年的数据库维护经验,

建议后续添加ASM磁盘时遵从如下步骤:

步骤 操作明细
1 对现有集群状态、ASM磁盘状态、使用率、磁盘大小检查
2 存储层面进行相应的LUN划分,建议划分的LUN大小与原ASM磁盘组中的磁盘大小一致
3 存储层新划分LUN到主机操作系统层面的映射,确认集群中每个主机可以正常识别读写LUN
4 参照已有的ASM磁盘配置相应的别名或多路径映射,集群kfed命令可以读取
5 使用新划分LUN新建临时测试磁盘组确认新划分LUN可以被ASM使用
6 删除新加的测试磁盘组
7 将新划分LUN加入所需的ASM磁盘组中
8 检查ASM磁盘组rebalance完成状态,检查集群及数据库的状态
Oracle Asm 增加磁盘 操作不当引起ASM无法启动故障处理记录

概述

医院数据库监控平台显示连接HIS数据库1节点异常,通过远程的方式接入数据库进行故障排除,快速解决了问题使得业务恢复正常使用。

问题及相关日志分析

1. 检查监听

通过2021/05/18的巡检报告中发现监听中存在1.80的IP地址,而此次检查发现监听中无1.80的IP地址。

 

2. 检查集群状态

检查集群状态,发现集群中多个资源状态显示OFFLINE。

[grid@cxhisdb02 ~]$ crsctl stat res -t

——————————————————————————–

NAME           TARGET  STATE        SERVER                   STATE_DETAILS

——————————————————————————–

Local Resources

——————————————————————————–

ora.DATA.dg

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.LISTENER.lsnr

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.OCR.dg

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.SSD.dg

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.asm

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02                Started

ora.gsd

OFFLINE OFFLINE      cxhisdb01

OFFLINE OFFLINE      cxhisdb02

ora.net1.network

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.ons

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.registry.acfs

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

——————————————————————————–

Cluster Resources

——————————————————————————–

ora.LISTENER_SCAN1.lsnr

1        ONLINE  ONLINE       cxhisdb02

ora.cvu

1        ONLINE  OFFLINE

ora.cxhisdb01.vip

1        ONLINE  OFFLINE

ora.cxhisdb02.vip

1        ONLINE  ONLINE       cxhisdb02

ora.hospital.db

1        ONLINE  OFFLINE

2        ONLINE  ONLINE       cxhisdb02                Open

ora.oc4j

1        ONLINE  OFFLINE

ora.scan1.vip

1        ONLINE  ONLINE       cxhisdb02

 

 

3. 检查集群日志

仅有CRSD服务异常,其他集群资源均正常,因此数据库未宕机。

2021-05-26 13:23:46.059:

[crsd(145215)]CRS-1006:The OCR location +OCR is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:46.068:

[crsd(145215)]CRS-1006:The OCR location +OCR is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:56.293:

[/u01/app/11.2.0/grid/bin/oraagent.bin(66885)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/oraagent_grid’ disconnected from server. Details at (:CRSAGF00117:) {0:21:18} in

/u01/app/11.2.0/grid/log/cxhisdb01/agent/crsd/oraagent_grid/oraagent_grid.log.

2021-05-26 13:23:56.294:

[/u01/app/11.2.0/grid/bin/oraagent.bin(31320)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/oraagent_oracle’ disconnected from server. Details at (:CRSAGF00117:) {0:19:5060

3} in /u01/app/11.2.0/grid/log/cxhisdb01/agent/crsd/oraagent_oracle/oraagent_oracle.log.

2021-05-26 13:23:56.461:

[/u01/app/11.2.0/grid/bin/orarootagent.bin(145347)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/orarootagent_root’ disconnected from server. Details at (:CRSAGF00117:) {0:

5:1568} in /u01/app/11.2.0/grid/log/cxhisdb01/agent/crsd/orarootagent_root/orarootagent_root.log.

2021-05-26 13:23:56.485:

[/u01/app/11.2.0/grid/bin/scriptagent.bin(145549)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/scriptagent_grid’ disconnected from server. Details at (:CRSAGF00117:) {0:9:

68} in /u01/app/11.2.0/grid/log/cxhisdb01/agent/crsd/scriptagent_grid/scriptagent_grid.log.

2021-05-26 13:23:56.651:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:23:58.540:

[crsd(5795)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:58.548:

[crsd(5795)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:58.964:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:00.374:

[crsd(5834)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:00.382:

[crsd(5834)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:01.010:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:02.447:

[crsd(5886)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:02.455:

[crsd(5886)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:03.068:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:04.457:

[crsd(5909)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:04.465:

[crsd(5909)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:05.102:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:06.492:

[crsd(5937)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:06.501:

[crsd(5937)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:07.132:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:08.517:

[crsd(5986)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:08.525:

[crsd(5986)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:09.162:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:10.544:

[crsd(6015)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:10.552:

[crsd(6015)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:11.193:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:12.581:

[crsd(6051)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:12.589:

[crsd(6051)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:13.223:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:14.614:

[crsd(6070)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:14.622:

[crsd(6070)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:15.253:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:16.643:

[crsd(6090)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:16.650:

[crsd(6090)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:17.284:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:17.284:

[ohasd(144192)]CRS-2771:Maximum restart attempts reached for resource ‘ora.crsd’; will not restart.

2021-05-26 13:24:17.315:

[ohasd(144192)]CRS-2769:Unable to failover resource ‘ora.crsd’.

 

4. 检查ASM日志

ASM日志中2021/05/26 12:19:57显示OCR仲裁盘有问题,13:23出现读写错误。

Wed May 26 12:19:57 2021

WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 3 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 4 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 3 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 4 in group 3.

Wed May 26 12:19:57 2021

NOTE: process _b000_+asm1 (160488) initiating offline of disk 0.1409468596 (OCR_0000) with mask 0x7e in group 3

NOTE: process _b000_+asm1 (160488) initiating offline of disk 2.1409468594 (OCR_0002) with mask 0x7e in group 3

NOTE: process _b000_+asm1 (160488) initiating offline of disk 3.1409468595 (OCR_0003) with mask 0x7e in group 3

NOTE: process _b000_+asm1 (160488) initiating offline of disk 4.1409468592 (OCR_0004) with mask 0x7e in group 3

NOTE: checking PST: grp = 3

GMON checking disk modes for group 3 at 15 for pid 46, osid 160488

ERROR: no read quorum in group: required 3, found 1 disks

NOTE: checking PST for grp 3 done.

NOTE: initiating PST update: grp = 3, dsk = 0/0x5402c8b4, mask = 0x6a, op = clear

NOTE: initiating PST update: grp = 3, dsk = 2/0x5402c8b2, mask = 0x6a, op = clear

NOTE: initiating PST update: grp = 3, dsk = 3/0x5402c8b3, mask = 0x6a, op = clear

NOTE: initiating PST update: grp = 3, dsk = 4/0x5402c8b0, mask = 0x6a, op = clear

GMON updating disk modes for group 3 at 16 for pid 46, osid 160488

ERROR: no read quorum in group: required 3, found 1 disks

Wed May 26 12:19:57 2021

NOTE: cache dismounting (not clean) group 3/0xA242386E (OCR)

NOTE: messaging CKPT to quiesce pins Unix process pid: 160495, image: oracle@cxhisdb01 (B001)

Wed May 26 12:19:57 2021

NOTE: halting all I/Os to diskgroup 3 (OCR)

Wed May 26 12:19:57 2021

NOTE: LGWR doing non-clean dismount of group 3 (OCR)

NOTE: LGWR sync ABA=15.85 last written ABA 15.85

WARNING: Offline for disk OCR_0000 in mode 0x7f failed.

WARNING: Offline for disk OCR_0002 in mode 0x7f failed.

WARNING: Offline for disk OCR_0003 in mode 0x7f failed.

WARNING: Offline for disk OCR_0004 in mode 0x7f failed.

Wed May 26 12:19:58 2021

kjbdomdet send to inst 2

detach from dom 3, sending detach message to inst 2

Wed May 26 12:19:58 2021

NOTE: No asm libraries found in the system

Wed May 26 12:19:58 2021

List of instances:

1 2

Dirty detach reconfiguration started (new ddet inc 1, cluster inc 4)

Global Resource Directory partially frozen for dirty detach

* dirty detach – domain 3 invalid = TRUE

2 GCS resources traversed, 0 cancelled

Dirty Detach Reconfiguration complete

Wed May 26 12:19:58 2021

WARNING: dirty detached from domain 3

NOTE: cache dismounted group 3/0xA242386E (OCR)

 

 

2021-05-26 13:23:46.059:

[crsd(145215)]CRS-1006:The OCR location +OCR is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:46.068:

[crsd(145215)]CRS-1006:The OCR location +OCR is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

 

 

问题处理步骤

由于数据库仅1节点OCR磁盘组访问异常,2节点以及其他盘访问正常,因而导致1节点CRS资源异常,且数据库未宕机,处理步骤需启动CRS资源,并且重启监听。

1. 1节点挂载OCR磁盘组

[root@cxhisdb01 ~]# su – grid

[grid@cxhisdb01 ~]$ sqlplus / as sysasm

 

SQL*Plus: Release 11.2.0.4.0 Production on Thu May 27 12:12:17 2021

 

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

 

 

Connected to:

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 – 64bit Production

With the Real Application Clusters and Automatic Storage Management options

 

SQL> alter diskgroup ocr mount;

 

Diskgroup altered.

 

SQL> exit

2. 启动CRS

[grid@cxhisdb01 ~]$ crsctl start res ora.crsd -init

CRS-2672: Attempting to start ‘ora.crsd’ on ‘cxhisdb01’

CRS-2676: Start of ‘ora.crsd’ on ‘cxhisdb01’ succeeded

3. 检查集群资源

[grid@cxhisdb01 ~]$ crsctl stat res -t

——————————————————————————–

NAME           TARGET  STATE        SERVER                   STATE_DETAILS

——————————————————————————–

Local Resources

——————————————————————————–

ora.DATA.dg

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.LISTENER.lsnr

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.OCR.dg

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.SSD.dg

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.asm

ONLINE  ONLINE       cxhisdb01                Started

ONLINE  ONLINE       cxhisdb02                Started

ora.gsd

OFFLINE OFFLINE      cxhisdb01

OFFLINE OFFLINE      cxhisdb02

ora.net1.network

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.ons

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.registry.acfs

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

——————————————————————————–

Cluster Resources

——————————————————————————–

ora.LISTENER_SCAN1.lsnr

1        ONLINE  ONLINE       cxhisdb02

ora.cvu

1        ONLINE  ONLINE       cxhisdb01

ora.cxhisdb01.vip

1        ONLINE  ONLINE       cxhisdb01

ora.cxhisdb02.vip

1        ONLINE  ONLINE       cxhisdb02

ora.hospital.db

1        ONLINE  ONLINE       cxhisdb01                Open

2        ONLINE  ONLINE       cxhisdb02                Open

ora.oc4j

1        ONLINE  ONLINE       cxhisdb01

ora.scan1.vip

1        ONLINE  ONLINE       cxhisdb02

4. 重启监听

[grid@cxhisdb01 ~]$ srvctl stop listener -n cxhisdb01

[grid@cxhisdb01 ~]$ srvctl start listener -n cxhisdb01

[grid@cxhisdb01 ~]$ lsnrctl status

 

LSNRCTL for Linux: Version 11.2.0.4.0 – Production on 27-MAY-2021 12:20:44

 

Copyright (c) 1991, 2013, Oracle.  All rights reserved.

 

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER)))

STATUS of the LISTENER

————————

Alias                     LISTENER

Version                   TNSLSNR for Linux: Version 11.2.0.4.0 – Production

Start Date                27-MAY-2021 12:20:40

Uptime                    0 days 0 hr. 0 min. 4 sec

Trace Level               off

Security                  ON: Local OS Authentication

SNMP                      OFF

Listener Parameter File   /u01/app/11.2.0/grid/network/admin/listener.ora

Listener Log File         /u01/app/grid/diag/tnslsnr/cxhisdb01/listener/alert/log.xml

Listening Endpoints Summary…

(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))

(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.80)(PORT=1521)))

(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.10)(PORT=1521)))

Services Summary…

Service “hospital” has 1 instance(s).

Instance “hospital1”, status READY, has 1 handler(s) for this service…

Service “hospitalXDB” has 1 instance(s).

Instance “hospital1”, status READY, has 1 handler(s) for this service…

The command completed successfully

[grid@cxhisdb01 ~]$ cd /u01/app/grid/diag/tnslsnr/cxhisdb01/listener/trace/

[grid@cxhisdb01 trace]$ tail -f listener.log

27-MAY-2021 12:20:51 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVICE_NAME=hospital)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.171)(PORT=62045)) * establish * hospital * 0

27-MAY-2021 12:20:51 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=ZY-603300-02-YS)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.9.58)(PORT=3676)) * establish * hospital * 0

27-MAY-2021 12:20:51 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=ZY-603300-02-YS)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.9.58)(PORT=3677)) * establish * hospital * 0

27-MAY-2021 12:20:55 * (CONNECT_DATA=(SERVICE_NAME=HOSPITAL)(CID=(PROGRAM=oracle)(HOST=lis-server)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.11)(PORT=23431)) * establish * HOSPITAL * 0

27-MAY-2021 12:20:55 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=JZ-EK-001)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.3.99)(PORT=4442)) * establish * hospital * 0

27-MAY-2021 12:20:55 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=JZ-EK-001)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.3.99)(PORT=4444)) * establish * hospital * 0

27-MAY-2021 12:20:56 * (CONNECT_DATA=(SID=hospital1)(CID=(PROGRAM=配置数据库.exe)(HOST=YYJQZZJFW)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.244)(PORT=58397)) * establish * hospital1 * 0

27-MAY-2021 12:20:56 * (CONNECT_DATA=(SERVICE_NAME=HOSPITAL)(CID=(PROGRAM=oracle)(HOST=lis-server)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.11)(PORT=23432)) * establish * HOSPITAL * 0

27-MAY-2021 12:20:56 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=ZY-603300-02-YS)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.9.58)(PORT=3681)) * establish * hospital * 0

27-MAY-2021 12:20:57 * service_update * hospital1 * 0

27-MAY-2021 12:20:58 * (CONNECT_DATA=(SERVICE_NAME=HOSPITAL)(CID=(PROGRAM=oracle)(HOST=lis-server)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.11)(PORT=23438)) * establish * HOSPITAL * 0

27-MAY-2021 12:20:59 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=JZ-EK-001)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.3.99)(PORT=4460)) * establish * hospital * 0

27-MAY-2021 12:21:00 * service_update * hospital1 * 0

Thu May 27 12:21:01 2021

27-MAY-2021 12:21:01 * (CONNECT_DATA=(SID=hospital1)(CID=(PROGRAM=配置数据库.exe)(HOST=YYJQZZJFW)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.244)(PORT=58398)) * establish * hospital1 * 0

27-MAY-2021 12:21:03 * service_update * hospital1 * 0

27-MAY-2021 12:21:03 * (CONNECT_DATA=(SERVICE_NAME=HOSPITAL)(CID=(PROGRAM=oracle)(HOST=lis-server)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.11)(PORT=23444)) * establish * HOSPITAL * 0

 

总结与后续处理建议

1. 问题总结

HIS数据库1节点访问OCR仲裁盘异常导致CRSD异常,进而引起1节点上多个集群资源offline,监听资源中相较于以前也少了192.2.1.80的IP,因此监控平台显示连接HIS数据库1节点异常,由于2节点和其他磁盘组无异常,因此数据库未发生宕机现象。

2. 处理操作

  • HIS数据库1节点手工mount OCR磁盘组
  • 启动CRS资源
  • 1节点重启监听资源

3. 后续建议

此次故障发生是由于集群资源访问OCR磁盘组异常导致,建议联系存储工程师,排查相关时间点存储运行情况,并定期做好存储巡检和状态监控工作。

Oracle voting disk 故障处理一例

概述

某医院与2021/10/22上午10点48分左右,出现应用程序无法响应,前台业务无法办理等情况。通过远程的方式接入数据库进行故障排除,快速解决了问题使得业务恢复正常使用。

问题及相关日志分析

1. 查看数据库日志

Fri Oct 22 10:27:29 2021

Errors in file /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/trace/emrdb1_ora_98357.trc  (incident=677598):

ORA-04031: ?T·¨·??? 4000 ×??úμ?12?í?ú′? (“shared pool”,”SELECT /*+ index(BL_BINGLIJS…”,”sga heap(4,0)”,”kglsim heap”)

Incident details in: /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/incident/incdir_677598/emrdb1_ora_98357_i677598.trc

Fri Oct 22 10:27:31 2021

Errors in file /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/trace/emrdb1_ora_46120.trc  (incident=675174):

ORA-04031: ?T·¨·??? 3896 ×??úμ?12?í?ú′? (“shared pool”,”select jilunr from bl_bingli…”,”sga heap(6,0)”,”kglsim object batch”)

Incident details in: /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/incident/incdir_675174/emrdb1_ora_46120_i675174.trc

Fri Oct 22 10:27:31 2021

Dumping diagnostic data in directory=[cdmp_20211022102731], requested by (instance=1, osid=98357), summary=[incident=677598].

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Fri Oct 22 10:27:33 2021

Sweep [inc][677598]: completed

Sweep [inc][675174]: completed

Sweep [inc2][677598]: completed

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Fri Oct 22 10:27:36 2021

Errors in file /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/trace/emrdb1_ora_85236.trc  (incident=675246):

ORA-04031: ?T·¨·??? 3896 ×??úμ?12?í?ú′? (“shared pool”,”update BL_BINGLIJLNR set Val…”,”sga heap(3,0)”,”kglsim object batch”)

Incident details in: /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/incident/incdir_675246/emrdb1_ora_85236_i675246.trc

Fri Oct 22 10:27:37 2021

Errors in file /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/trace/emrdb1_ora_81262.trc  (incident=675606):

ORA-04031: unable to allocate 4000 bytes of shared memory (“shared pool”,”DELETE BL_BINGANSY WHERE BIN…”,”sga heap(1,0)”,”kglsim heap”)

Incident details in: /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/incident/incdir_675606/emrdb1_ora_81262_i675606.trc

Fri Oct 22 10:27:40 2021

Errors in file /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/trace/emrdb1_ora_36572.trc  (incident=674934):

ORA-04031: ?T·¨·??? 3896 ×??úμ?12?í?ú′? (“shared pool”,”update BL_BINGCHENGJLNR set …”,”sga heap(2,0)”,”kglsim object batch”)

Incident details in: /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/incident/incdir_674934/emrdb1_ora_36572_i674934.trc

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

 

通过数据库alert日志发现,数据库在10:27开始发生了大量的ORA-04031错误。

2. 查看操作系统日志

Oct 22 00:06:07 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 00:36:08 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 01:06:08 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 01:21:51 cxemrdb01 auditd[7192]: Audit daemon rotating log files

Oct 22 01:36:17 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 02:08:17 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 02:49:54 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 03:09:42 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:1:7 —  1 2002.

Oct 22 03:22:03 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 03:22:03 cxemrdb01 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login t

o ensure your system is up-to-date.

Oct 22 03:35:22 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:0:12 —  1 2002.

Oct 22 03:46:37 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:0:7 —  1 2002.

Oct 22 03:52:03 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 03:56:21 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:1:6 —  1 2002.

Oct 22 03:57:05 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:1:9 —  1 2002.

Oct 22 03:57:48 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:0:14 —  1 2002.

Oct 22 04:00:52 cxemrdb01 auditd[7192]: Audit daemon rotating log files

Oct 22 04:22:04 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 04:54:28 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 05:25:57 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 05:59:49 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 06:35:21 cxemrdb01 auditd[7192]: Audit daemon rotating log files

Oct 22 06:46:06 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 07:17:35 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 07:50:07 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 08:20:08 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 09:02:08 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 09:14:12 cxemrdb01 auditd[7192]: Audit daemon rotating log files

Oct 22 09:35:29 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 10:14:02 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 10:54:50 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 11:26:03 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 11:52:33 cxemrdb01 auditd[7192]: Audit daemon rotating log files

Oct 22 11:56:03 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 12:26:04 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 12:56:20 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 13:27:57 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 13:59:17 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 14:24:03 cxemrdb01 auditd[7192]: Audit daemon rotating log files

Oct 22 14:32:22 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 15:04:22 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 15:38:23 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 16:12:00 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 16:42:24 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 16:58:54 cxemrdb01 auditd[7192]: Audit daemon rotating log files

Oct 22 17:14:01 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 17:44:01 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

操作系统中未发现内存不足等相关报错和告警信息。

3. 监听日志信息分析 

每小时的请求连接数

930 22-OCT-2021 00

797 22-OCT-2021 01

715 22-OCT-2021 02

719 22-OCT-2021 03

631 22-OCT-2021 04

671 22-OCT-2021 05

683 22-OCT-2021 06

1220 22-OCT-2021 07

1789 22-OCT-2021 08

1619 22-OCT-2021 09

2157 22-OCT-2021 10

8656 22-OCT-2021 11

654 22-OCT-2021 12

1085 22-OCT-2021 13

920 22-OCT-2021 14

855 22-OCT-2021 15

954 22-OCT-2021 16

417 22-OCT-2021 17

63 22-OCT-2021 18

10:00 – 10:59内每分钟连接数统计

97 22-OCT-2021 10:00

15 22-OCT-2021 10:01

29 22-OCT-2021 10:02

8 22-OCT-2021 10:03

12 22-OCT-2021 10:04

54 22-OCT-2021 10:05

13 22-OCT-2021 10:06

21 22-OCT-2021 10:07

13 22-OCT-2021 10:08

22 22-OCT-2021 10:09

50 22-OCT-2021 10:10

13 22-OCT-2021 10:11

6 22-OCT-2021 10:12

16 22-OCT-2021 10:13

21 22-OCT-2021 10:14

42 22-OCT-2021 10:15

18 22-OCT-2021 10:16

15 22-OCT-2021 10:17

19 22-OCT-2021 10:18

18 22-OCT-2021 10:19

45 22-OCT-2021 10:20

2 22-OCT-2021 10:21

12 22-OCT-2021 10:22

15 22-OCT-2021 10:23

17 22-OCT-2021 10:24

58 22-OCT-2021 10:25

18 22-OCT-2021 10:26

19 22-OCT-2021 10:27

26 22-OCT-2021 10:28

41 22-OCT-2021 10:29

109 22-OCT-2021 10:30

54 22-OCT-2021 10:31

62 22-OCT-2021 10:32

58 22-OCT-2021 10:33

45 22-OCT-2021 10:34

76 22-OCT-2021 10:35

37 22-OCT-2021 10:36

33 22-OCT-2021 10:37

45 22-OCT-2021 10:38

36 22-OCT-2021 10:39

79 22-OCT-2021 10:40

30 22-OCT-2021 10:41

41 22-OCT-2021 10:42

34 22-OCT-2021 10:43

55 22-OCT-2021 10:44

53 22-OCT-2021 10:45

28 22-OCT-2021 10:46

48 22-OCT-2021 10:47

65 22-OCT-2021 10:48

54 22-OCT-2021 10:49

72 22-OCT-2021 10:50

43 22-OCT-2021 10:51

30 22-OCT-2021 10:52

29 22-OCT-2021 10:53

20 22-OCT-2021 10:54

76 22-OCT-2021 10:55

28 22-OCT-2021 10:56

16 22-OCT-2021 10:57

27 22-OCT-2021 10:58

19 22-OCT-2021 10:59

 

查询得知,数据库未发现有短时间的大量连接。

4. 查看数据库运行信息

SQL> select inst_id,instance_name,to_char(startup_time,’yyyy-mm-dd hh24:mi:ss’) from gv$instance;

 

INST_ID INSTANCE_NAME    TO_CHAR(STARTUP_TIM

———- —————- ——————-

1 emrdb1           2021-10-22 11:11:06

2 emrdb2           2019-12-25 17:39:15

查询得知,emrdb实例子2019年12月25日启动一来,至今运行月1年10个月的时间,1节点实例今日处理故障时重启,在重启前时间与2节点一致。

5. 查看AWR报告

 

如上图,数据库中存在大量的无绑定变量的低效SQL,均占用1MB+的内存空间,列举几个SQL:

8mfvdk3v7xcnu

select * from bl_webwenshujl t left join v_zy_bingrenjbxx b on t.bingrenid = b.BINGRENID where b.DANGQIANBQ = ‘600700’ and b.CHUYUANRQ is null and (( t.mobanid = ‘CXZX003′ and t.zongfen > 44 and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX003’)) or ( t.mobanid = ‘CXZX008′ and t.zongfen > 11 and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX008’)))

 

gjv7strrc2mr6

select * from bl_webwenshujl t left join v_zy_bingrenjbxx b on t.bingrenid = b.BINGRENID where b.DANGQIANBQ = ‘600800’ and b.CHUYUANRQ is null and (( t.mobanid = ‘CXZX003′ and t.zongfen > 44 and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX003’)) or ( t.mobanid = ‘CXZX008′ and t.zongfen > 11 and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX008’)))

 

7gc2xtvupk227

select * from bl_webwenshujl t left join v_zy_bingrenjbxx b on t.bingrenid = b.BINGRENID where t.mobanid = ‘CXZX002’ and t.zongfen < 13 and b.DANGQIANBQ = ‘601800’ and b.CHUYUANRQ is null and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX002′)

 

3nh5uf64xy8sa

select * from bl_webwenshujl t left join v_zy_bingrenjbxx b on t.bingrenid = b.BINGRENID where t.mobanid = ‘CXZX002’ and t.zongfen < 13 and b.DANGQIANBQ = ‘600800’ and b.CHUYUANRQ is null and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX002′)

 

6. 维护人员处理动作

由于检查只有1节点实例出现ORA-04031错误,而2节点实例正常,尝试手工刷新shared pool,但刷新失败。

 

为了尽快恢复业务,保证业务正常运行,对1节点实例进行了重启

Fri Oct 22 11:04:04 2021

Shutting down instance (immediate)

Stopping background process SMCO

Shutting down instance: further logons disabled

Stopping background process QMNC

Fri Oct 22 11:04:06 2021

Stopping background process CJQ0

Stopping background process MMNL

Stopping background process MMON

License high water mark = 1049

Fri Oct 22 11:05:02 2021

All dispatchers and shared servers shutdown

ALTER DATABASE CLOSE NORMAL

Fri Oct 22 11:05:05 2021

SMON: disabling tx recovery

Stopping background process RCBG

SMON: disabling cache recovery

 

总结与后续处理建议

1. 问题总结

  • 数据库2节点实例运行正常,1节点实例出现了ORA-04031错误;
  • 数据库自2019/12/25运行至今已达1年10个月之久,内存中存在大量的内存碎片;
  • 数据库中存在较多的无绑定变量SQL,无绑定变量的SQL均需要硬解析,消耗shared pool内存资源;
  • 重启数据库后,内存资源释放,内存刷新,内存碎片问题消失,业务恢复正常。

7. 处理操作

此次处理过程中,通过手工刷新shared pool失败,进而通过重启1节点实例的方式释放内存资源、重整内存碎片后,ORA-04031问题得以解决。

典型的Ora-04031错误 – 记某人民医院核心系统故障处理纪要