Skip to content

概述

 

某客户于2021/06/07凌晨01点43分左右,出现应用程序无法连接,前台业务无法办理等情况。在此情况下代维科技工程师在接到故障支持电话后,及时进行响应,通过远程的方式接入数据库进行故障排除,快速解决了问题使得业务恢复正常使用。

问题及相关日志分析

1. 数据库告警日志分析分析

通过对数据库告警日志查看,节点1在01:22:14出现ORA-29970错误,LMD进程无法响应,在01:22:29被LMHB进程将实例终止,实例终止后自动启动。节点2在01:29:31被LMS0进程将实例终止,实例终止后自动启动。

1节点:

Mon Jun 07 01:22:14 2021

LMD0 (ospid: 10613) has not called a wait for sub 0 secs.

LMD1 (ospid: 10617) has not called a wait for sub 0 secs.

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_lmhb_10637.trc  (incident=688375) (PDBNAME=CDB$ROOT):

ORA-29770: global enqueue process LMD0 (OSID 10613) is hung for more than 70 seconds

Incident details in: /u01/app/oracle/diag/rdbms/cdb/cdb1/incident/incdir_688375/cdb1_lmhb_10637_i688375.trc

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_lmhb_10637.trc  (incident=688376) (PDBNAME=CDB$ROOT):

ORA-29770: global enqueue process LMD1 (OSID 10617) is hung for more than 70 seconds

Incident details in: /u01/app/oracle/diag/rdbms/cdb/cdb1/incident/incdir_688376/cdb1_lmhb_10637_i688376.trc

LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_CDB by memno 0

ERROR: Some process(s) is not making progress.

LMHB (ospid: 10637) is terminating the instance.

Please check LMHB trace file for more details.

Please also check the CPU load, I/O load and other system properties for anomalous behavior

ERROR: Some process(s) is not making progress.

LMHB (ospid: 10637): terminating the instance due to error 29770

Mon Jun 07 01:22:24 2021

System state dump requested by (instance=1, osid=10637 (LMHB)), summary=[abnormal instance termination].

System State dumped to trace file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_diag_10575_20210607012224.trc

Mon Jun 07 01:22:26 2021

License high water mark = 591

Mon Jun 07 01:22:29 2021

Instance terminated by LMHB, pid = 10637

Mon Jun 07 01:22:29 2021

USER (ospid: 21601): terminating the instance

Mon Jun 07 01:22:29 2021

Instance terminated by USER, pid = 21601

Mon Jun 07 01:22:32 2021

Starting ORACLE instance (normal) (OS id: 21660)

Mon Jun 07 01:22:32 2021

CLI notifier numLatches:37 maxDescs:3986

Mon Jun 07 01:22:32 2021

**********************************************************************

Mon Jun 07 01:22:32 2021

Dump of system resources acquired for SHARED GLOBAL AREA (SGA)

 

Mon Jun 07 01:22:32 2021

Per process system memlock (soft) limit = UNLIMITED

Mon Jun 07 01:22:32 2021

Expected per process system memlock (soft) limit to lock

SHARED GLOBAL AREA (SGA) into memory: 128G

Mon Jun 07 01:22:32 2021

Available system pagesizes:

4K, 2048K

Mon Jun 07 01:22:32 2021

Supported system pagesize(s):

Mon Jun 07 01:22:32 2021

PAGESIZE  AVAILABLE_PAGES  EXPECTED_PAGES  ALLOCATED_PAGES  ERROR(s)

Mon Jun 07 01:22:32 2021

2048K            66823           65538           65538        NONE

Mon Jun 07 01:22:32 2021

Reason for not supporting certain system pagesizes:

Mon Jun 07 01:22:32 2021

4K – Large pagesizes only

Mon Jun 07 01:22:32 2021

**********************************************************************

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

Initial number of CPU is 24

 

2节点:

Mon Jun 07 01:25:05 2021

Set master node info

Mon Jun 07 01:26:05 2021

Auto-tuning: Shutting down background process GTXb

Mon Jun 07 01:27:53 2021

IPC Send timeout detected. Sender: ospid 62152 [oracle@node1 (PING)]

Receiver: inst 1 binc 912546309 ospid 10589

Mon Jun 07 01:29:25 2021

LMD0 (ospid: 62168) received an instance eviction notification from instance 1 [2]

Mon Jun 07 01:29:26 2021

Received an instance abort message from instance 1

Mon Jun 07 01:29:26 2021

Received an instance abort message from instance 1

Mon Jun 07 01:29:26 2021

Please check instance 1 alert and LMON trace files for detail.

Mon Jun 07 01:29:26 2021

Please check instance 1 alert and LMON trace files for detail.

Mon Jun 07 01:29:26 2021

LMS0 (ospid: 62192): terminating the instance due to error 481

Mon Jun 07 01:29:26 2021

System state dump requested by (instance=2, osid=62192 (LMS0)), summary=[abnormal instance termination].

System State dumped to trace file /u01/app/oracle/diag/rdbms/cdb/cdb2/trace/cdb2_diag_62123_20210607012926.trc

Mon Jun 07 01:29:26 2021

ORA-1092 : opitsk aborting process

Mon Jun 07 01:29:27 2021

License high water mark = 1251

Mon Jun 07 01:29:31 2021

Instance terminated by LMS0, pid = 62192

Mon Jun 07 01:29:31 2021

USER (ospid: 85262): terminating the instance

Mon Jun 07 01:29:31 2021

Instance terminated by USER, pid = 85262

Mon Jun 07 01:29:33 2021

Starting ORACLE instance (normal) (OS id: 85397)

Mon Jun 07 01:29:33 2021

CLI notifier numLatches:37 maxDescs:3986

Mon Jun 07 01:29:33 2021

**********************************************************************

Mon Jun 07 01:29:33 2021

Dump of system resources acquired for SHARED GLOBAL AREA (SGA)

 

Mon Jun 07 01:29:33 2021

Per process system memlock (soft) limit = UNLIMITED

Mon Jun 07 01:29:33 2021

Expected per process system memlock (soft) limit to lock

SHARED GLOBAL AREA (SGA) into memory: 128G

Mon Jun 07 01:29:33 2021

Available system pagesizes:

4K, 2048K

Mon Jun 07 01:29:33 2021

Supported system pagesize(s):

Mon Jun 07 01:29:33 2021

PAGESIZE  AVAILABLE_PAGES  EXPECTED_PAGES  ALLOCATED_PAGES  ERROR(s)

Mon Jun 07 01:29:33 2021

2048K            67200           65538           65538        NONE

Mon Jun 07 01:29:33 2021

Reason for not supporting certain system pagesizes:

Mon Jun 07 01:29:33 2021

4K – Large pagesizes only

Mon Jun 07 01:29:33 2021

**********************************************************************

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

Initial number of CPU is 24

查看更多日志,发现在关闭之前,出现了较多的ORA-04031错误。

Mon Jun 07 01:18:14 2021

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_ora_14795.trc  (incident=691727) (PDBNAME=CDB$ROOT):

ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^1069″,”kglseshtTable”)

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_lmd0_10613.trc  (incident=707486) (PDBNAME=CDB$ROOT):

ORA-04031: unable to allocate 8504 bytes of shared memory (“shared pool”,”unknown object”,”sga heap(5,0)”,”ges big msg pool”)

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Mon Jun 07 01:18:16 2021

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_ora_14832.trc  (incident=691688) (PDBNAME=CDB$ROOT):

ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^320″,”kglseshtTable”)

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Mon Jun 07 01:18:16 2021

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_ora_14837.trc  (incident=691735) (PDBNAME=CDB$ROOT):

ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^1234″,”kglseshtTable”)

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Mon Jun 07 01:18:18 2021

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_ora_14912.trc  (incident=691871) (PDBNAME=CDB$ROOT):

ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^159″,”kglseshtTable”)

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_lmd0_10613.trc  (incident=707487) (PDBNAME=CDB$ROOT):

ORA-04031: unable to allocate 8504 bytes of shared memory (“shared pool”,”unknown object”,”sga heap(5,0)”,”ges big msg pool”)

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

DDE: Problem Key ‘ORA 4031’ was completely flood controlled (0x6)

Further messages for this problem key will be suppressed for up to 10 minutes

Mon Jun 07 01:18:33 2021

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_m000_15479.trc:

ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^2147″,”kglseshtTable”)

 

 

2. AWR报告信息分析

 

拉取了宕机前半小时的AWR报告,从AWR报告中得知,shared pool中ges enqueues和ges resource dynamic分别达到了18G和16G。

3. Oracle官方参考文献

 

根据Oracle官方文档记载:ORA-04031 Errors Occurring with High “ges resource dynamic” & “ges enqueues” Memory Usage In The Shared Pool (Doc ID 2063751.1),该问题是Oracle的BUG,该BUG出现于Oracle RAC 12.1.0.1至12.1.0.2,此问题在Oracle RAC 12.2中被修复,移动护理数据库版本为Oracle RAC 12.1.0.2。

 

 

4. 数据库硬解析比例高的SQL语句

SQL_TEXT

———————————————————————————————————————————————————————————————————————————————————-

unshared count

————–

SELECT A.PATIENT_ID, A.VISIT_ID,A.ORDER_NO,A.ORDER_SUB_NO,TO_CHAR(T.BED_NO) BED_NO,TO_CHAR(START_DATE_TIME,’YYYY-MM-DD HH24:MI:SS’) AS START_DATE_TIME, TO_   FROM  V_EMR_TIME,’YYYY-MM-DD HH24:MI:SS’) AS STOP_DATE_TIME

ORDERS_PERFORMAN  AND  A.PATIENT_ID=’690086500ID’,’7′)

8′              AND (A.STOP_DATE_TIME IS NULL OR A.STOP_DATE_TIME >= TO_DATE(‘2021-06-07’||’ 00:00:00′,’YYYY-MM-DD HH24:MI:SS’)) AND A.START_DATE_TIME <=

TO_DATE(‘2021-06-07 07:52:53’, ‘YYYY-MM-DD HH24:MI:SS’)

35162

 

update KYEEHIS.hl7_send_records set type = ‘1’ where rec_id = ‘1003839362’

31808

 

AND T.BARCODE<>’1# IS NOT NULL2021-06-07′, ‘YYYY-MM-DD’)

ORDER BY T.OPER_TIME  DESC日(Q12)’, ‘1/12小时’, ‘1/8小时’)

14151

 

SELECT RECORD_TIME,ALL_S     WHERE PATIENT_ID=’1110234022′ AND VISIT_ID=’3′ AND PG_CODE=’8′ –AND RECORD_TIME BETWEEN TO_DATE(‘2021-06-07 00:00:00′,’YYYY-MM-DD HH24:MI:SS’) AND TO_DATE(‘2021-06-07 23:59:59

‘,’YYYY-MM-DD HH24:MI:SS     ORDER BY RECORD_TIME DESC

7829

 

select distinct barcode,’1’ exec_fg,nurse_name,nurse oper_nurse,” oper_time,order_exec_id,order_id,order_no,order_text,patient_id,visit_id                         where patient_id = ‘6900_exec_log

767269’                  and order_exec_id = ‘288132019’

6180

 

SELECT T.PATIENT_ID,            TO_DATE(TO_CHAR(T.EVENT_DATE,’YYYY-MM-DD’)||’ ‘|| TO_CHAR(T.TIME_POINT, ‘HH24:MI:SS’),

‘YYY      WHAND T.PATIENT_ID = ’40TAMP_CODEP_DICT K

10479126’                       AND (T.STAMP_CODE = ‘10003083’ OR T.STAMP_CODE = ‘10003084’)

5894

 

select t.* from KYEEHIS.docs_ord and order_no = ‘49712239’47427′

and exec_time = TO_DATE(‘2021-06-07 09:00:00′,’YYYY-MM-DD HH24:MI:SS’)

5070

 

SELECT REC_ID, PATIENT_ID, VISIT_ID, MODULE_CODE, MODULE_NAME, CREATE_TIME, ENBALE_VALUE FROM DOCS_OPERATION_MODULE_REC WHERE PATIENT_ID=’4010110870′ AND VISIT_ID=’3′ AND MODULE_CODE=’19’ AND ENBALE_VALUE=’Y’

4849

 

SELECT PATIENT_ID,TO_CHAR(VISIT_ID) AS VISIT_ID,INP_NO,WARD_CODE,WARD_NAME,DEPT_CODE,DEPT_NAME,PATIENT_NAME,SEX,TO_CHAR(BED_NO) AS BED_NO, BED_LABEL,DATE_OF_BIRTH,DECODE(NURSE_LEVEL,’特级护理’,’0′,’一级护理’,’1′,’二级护理’,’2′,’三级护理’,’3′) NURSE_L

EVEL,STATUS,CHARGE_TYPE,ADMISSION_DATE,DEPTED_TIME,DIAGNOSIS,MEAL,ALLERGY,ADDRESS, LEAVE_TIME,DOCTOR_IN_CHARGE ,       DOCTOR_NAME,       NURSE_IN_CHARGE,  NURSE_NAME,H_CODE,TYPE FROM KYEEHIS.v_his_pats_discharge T WHERE T.PATIENT_ID=’6901544170

‘ AND T.VISIT_ID=5 UNION  SELECT PATIENT_ID,TO_CHAR(VISIT_ID) AS VISIT_ID,INP_NO,WARD_CODE,WARD_NAME,DEPT_CODE,DEPT_NAME,PATIENT_NAME,SEX,TO_CHAR(BED_NO) AS BED_NO, BED_LABEL,DATE_OF_BIRTH,DECODE(NURSE_LEVEL,’特级护理’,’0′,’一级护理’,’1′,’二级护理’,’

2′,’三级护理’,’3′) NURSE_LEVEL,STATUS,CHARGE_TYPE,ADMISSION_DATE,DEPTED_TIME,DIAGNOSIS,MEAL,ALLERGY,ADDRESS,SYSDATE AS LEAVE_TIME,DOCTOR_IN_CHARGE ,          DOCTOR_NAME,       NURSE_IN_CHARGE,       NURSE_NAME,H_CODE,TYPE FROM KYEEHIS.V_HIS_PATS_IN_HOS

4139

 

select d.item_name,d.item_value  and t.patient_id = ‘6900510081’_eval_report_detail_rec d

and t.visit_id = ‘3’          and d.enabled_value = ‘Y’sionRecords’

4061

 

 

10 rows selected.


三、总结与后续处理建议

1. 问题总结

移动护理数据库版本是ORACLE RAC 12.1.0.2.0,本次故障发生前出现内存不足相关ORA-04031以及后台进程LMD无影响等报错信息,并最终被LMHB监测进程终止了数据库实例的运行。检查数据库后台TRACE及AWR信息报告,可以发现问题发生前shared pool内在中主要有组件ges resource dynamic和ges enqueues占用了35GB内存;

参照ORACLE官方文档信息该版本存在BUG 21373473,会导致shared pool中的ges resource dynamic和ges enqueues等组件占用内存不断增多,会导致数据库LMD后台进程无响应,从而导致实例终止宕机,这与本次问题的理解及后台TRACE吻合,因此判断为此BUG导致的数据库宕机。

同时数据库中存在硬解析比例高的问题,在2.4章节中已经将数据库中TOP 10的硬解析语句进行了抓取提供,需要软件开发商配合对此类SQL进行优化及使用SQL绑定变量等改写。

2. 问题详细分析过程

1.故障时间状态:在2021年6月7日凌晨00:30 – 01:00,移动护理数据库shared pool内存池中的ges enqueues和ges resource dynamic组件分别达到了18GB和16.8GB;

移动护理数据库在正常运行状态时ges enqueues和ges resource dynamic分别大约为3G和1G。

 

2.故障相关报错信息:问题时段这两个内存组件占用的内存显著异常,导致Oracle内存池剩余内存不足,引起数据库中进程出现ORA-04031的内存错误及后台进程LMD进程无响应(ORA-29770: global enqueue process LMD1 (OSID 10617) is hung for more than 70 seconds)。

报错信息为:ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^1234″,”kglseshtTable”)

 

3.报错信息与BUG的对应匹配:以上信息与ORA-04031 Errors Occurring with High “ges resource dynamic” & “ges enqueues” Memory Usage In The Shared Pool (Doc ID 2063751.1)及ORA-4031 due to Huge “ges resource dynamic” and “ges enqueues” and Instance Terminates: ORA-29770: ‘SGA: allocation forcing component growth’ (Doc ID 1951758.1)文档中的BUG 21373473描述吻合:ORA-04031 errors occurring in a RAC instance with the trace file showing high memory usage for “ges resource dynamic” and “ges enqueues” memory in the shared pool.  This can cause LMD processes to become unresponsive leading to an instance termination.

 

4.数据库实例被LMHB进程终止的原因:从11gR2版本,新增加LMHB进程,通过LMHB进程来定期监控lmon/lms/lmd/lck0等进程,如果LMHB进程发现某一个重要后台进程一段时间内(默认70秒)未更新自己的状态信息,会进行分析,如必要则通过重启本地节点的方式来解决问题。本次即为lmd进程超过70秒无响应后,LMHB对数据库进程进行了KILL,从而引起数据库宕机。

 

5. 后续处理及建议

  1. BUG问题的解决

根据Oracle官方参考文献,解决该问题有两种方式:

  1. 解决方案1

修改数据库参数,_GES_DIRECT_FREE_RES_TYPE=”CTARAHDXBB”,该参数是静态参数,需要重启数据库生效,可采取2个节点轮流修改的方法,减少业务停机时间,推荐采用此方案。

  1. 解决方案2

安装数据库补丁,补丁号Patch:21260431,需要安排业务停机时间安装补丁。

 

  1. 数据库硬解析比例高的解决

在2.4章节中已经将数据库中TOP 10的硬解析语句进行了抓取提供,需要软件开发商配合对此类SQL进行优化及使用SQL绑定变量等改写,从而提升系统性能,提高系统的稳定性。

ORA-29970 故障处理一例

概述

 

数据库对应的业务程序在2021/3/7日凌晨0:02分报无法连接数据库,在0:04分接到故障申告后,代维科技工程师通过远程方式接入数据库环境进行故障排查,在0:08分处理完成。通过对相关日志等信息的深入分析,找出问题点并提出相应的解决建议,根据相关问题日志及分析处理情况整理汇总此文档。

 

无法连接的原因分析

1. 宁帆数据库无法连接分析

 

登陆数据库进行查看,数据库无法连接原因为数据库主机磁盘空间使用率100%,归档日志无法切换,应用程序此时无法连接到数据库。紧急通过人工删除一部分数据库日志,并进行日志切换动作后,程序可以正常连接。

[root@nf-dboracle ~]# df -h

Filesystem                          Size  Used Avail Use% Mounted on

/dev/mapper/vg_nfdboracle-LogVol00  476G  452G   16M 100% /

tmpfs                               7.8G  160K  7.8G   1% /dev/shm

/dev/sda1                          1008M   63M  895M   7% /boot

/dev/sr0                            3.6G  3.6G     0 100% /mnt

[root@nf-dboracle ~]# su – oracle

[oracle@nf-dboracle ~]$ rman target /

 

Recovery Manager: Release 11.2.0.3.0 – Production on Sun Mar 7 00:07:32 2021

 

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

 

RMAN-00571: ===========================================================

RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============

RMAN-00571: ===========================================================

RMAN-00554: initialization of internal recovery manager package failed

RMAN-04005: error from target database:

ORA-09945: Unable to initialize the audit trail file

Linux-x86_64 Error: 28: No space left on device

 

 

2. 磁盘空间不足原因的分析

通过对数据库归档日志产生数量的分析,可以发现,从3月4号以来,每天的数据库归档日志产生量是原来的15至20倍左右,当前每天产生120GB的归档日志,按原有的备份策略,归档日志保留2天,因此最终引起了磁盘空间不足。

 

 

 

3. 产生归档日志量多的操作分析

数据库层面看,70%的数据变更都在KLB_MICROHOSP用户的 NFSYS_LOG_THIRD表上,在TOP SQL中也发现了对这个表的频率插入;结合表名看,有可能是新开启了写入日志的功能,具体需要应用软件开发商确认一下。

INSERT INTO NFSYS_LOG_THIRD ( ID, TYPE, METHOD, MSG, DESCRIPTION, NAME, OPENID, THIRD_METHOD ) VALUES ( NFSYS_LOG_THIRD_S.nextval, :type0, :method0, :msg, :description, :name, :openid, :third_method ) RETURNING ID INTO :IDENTITY_PARAMETER

 

 


总结与后续处理建议

通过对数据库相关日志以及相关数据的分析,可以得出如下信息:

1. 数据库无法连接原因的分析:

登陆数据库进行查看,数据库无法连接原因为数据库主机磁盘空间使用率100%,归档日志无法切换,应用程序此时无法连接到数据库。紧急通过人工删除一部分数据库日志,并进行日志切换动作后,程序可以正常连接。(0点4分接到电话,0点8分处理完成,用时4分钟。)

2. 磁盘空间不足分析:

通过对数据库归档日志产生数量的分析,可以发现,从3月4号以来,每天的数据库归档日志产生量是原来的15至20倍左右,当前每天产生120GB的归档日志,按原有的备份策略,归档日志保留2天,因此最终引起了磁盘空间不足。已经临时修改了备份及删除归档日志的策略来解决此问题;

3. 产生日志多的应用程序语句:

从数据库层面看,70%的数据变更都在KLB_MICROHOSP用户的 NFSYS_LOG_THIRD表上,在TOP SQL中也发现了对这个表的频率插入;结合表名看,有可能是新开启了写入日志的功能,具体需要应用软件开发商确认一下。

INSERT INTO NFSYS_LOG_THIRD ( ID, TYPE, METHOD, MSG, DESCRIPTION, NAME, OPENID, THIRD_METHOD ) VALUES ( NFSYS_LOG_THIRD_S.nextval, :type0, :method0, :msg, :description, :name, :openid, :third_method ) RETURNING ID INTO :IDENTITY_PARAMETER

 

 

4. 下一步排查及处理方向

因此,接下来需要及时排查的问题为:

  1. 应用程序开发商根据对应的SQL语句,排查程序在3/4日做了哪些升级或更新操作。
  2. 程序的调整是短期还是长期需要按此模式运行;这决定了后续数据库层面如何制定相应的备份及删除策略。
某客户报表数据库无法连接故障分析
某医疗客户最近在做大表归档以及优化的项目,加上oracle版本的变更,最近参考oracle support结合以前的版本经验总结如下的集中方式:
参考 NOTE:1390871.1
A) 通过 Export/import 方法

B) 通过 Insert with a subquery 方法

C) 通过 Partition Exchange 方法

D) 通过 DBMS_REDEFINITION 方法

E) ALTER TABLE 命令的MODIFY语法 (从 12.2 开始)

以上5种方法都会从一个已经存在的非分区表创建出一个分区表。尽管我们已经对这四种方法的维护时间预期做了倒序排列,但实际维护时间针对不同情况有所不同。

通过 Export/import 方法

这种方法的实现是先 export 一个非分区表,创建一个新的分区表,然后 import 数据到新创建的分区表中。

1) Export 您的非分区表:

$ exp usr/pswd tables=numbers file=exp.dmp

2) Drop 掉该非分区表:

SQL> drop table numbers;

3) 重新创建该表成为一个分区表:

SQL> create table numbers (qty number(3), name varchar2(15)) partition by range (qty)(partition p1 values less than (501),
partition p2 values less than (maxvalue));

4) 通过 import 的 ignore=y 方式来还原备份的数据:

$ imp usr/pswd file=exp.dmp ignore=y

ignore=y 语句会让 import 忽略掉表的创建,直接加载所有数据。

如果使用 Data Pump export/import(expdp/impdp)您可以采用 impdp 的 table_exists_action 选项,例如 table_exists_action = APPEND 或者 table_exists_action = REPLACE。

可以参考 Note 552424.1 Export/Import DataPump Parameter ACCESS_METHOD – How to Enforce a Method of Loading and Unloading Data?

通过 Insert with a subquery 方法

1) 创建一个分区表:

SQL> create table partbl (qty number(3), name varchar2(15)) partition by range (qty) (partition p1 values less than (501),partition p2 values less than (maxvalue));

2) 将原来非分区表中的数据通过子查询 insert 到新创建的分区表中:

SQL> insert into partbl (qty, name) select * from origtbl;

3) 如果您想让新建的分区表与原表名相同,那么 drop 掉原来的非分区表然后重命名新表:

SQL> drop table origtbl;
SQL> alter table partbl rename to origtbl;

您可以通过 direct path insert 和利用并行来改善 insert 的性能。如下的例子演示了如何实现并且如何从执行计划中来验证。

传统的 insert

SQL> insert into partbl (qty, name) select * from origtbl;
--------------------------------------------
| Id  | Operation                | Name    |
--------------------------------------------
|   0 | INSERT STATEMENT         |         |
|   1 |  LOAD TABLE CONVENTIONAL |         |
|   2 |   TABLE ACCESS FULL      | ORIGTBL |
--------------------------------------------

Direct load insert 方式

SQL> insert /*+APPEND*/ into partbl (qty, name) select * from origtbl;
--------------------------------------
| Id  | Operation          | Name    |
--------------------------------------
|   0 | INSERT STATEMENT   |         |
|   1 |  LOAD AS SELECT    |         |
|   2 |   TABLE ACCESS FULL| ORIGTBL |
--------------------------------------

Direct load insert 并且在查询部分开启并行

SQL> insert /*+APPEND PARALLEL*/ into partbl (qty, name) select * from origtbl;
------------------------------------------
| Id  | Operation             | Name     |
------------------------------------------
|   0 | INSERT STATEMENT      |          |
|   1 |  LOAD AS SELECT       |          |
|   2 |   PX COORDINATOR      |          |
|   3 |    PX SEND QC (RANDOM)| :TQ10000 |
|   4 |     PX BLOCK ITERATOR |          |
|*  5 |      TABLE ACCESS FULL| ORIGTBL  |
------------------------------------------

注意以上执行计划中 LOAD AS SELECT 在 PX COORDINATOR 的上面。

Direct load insert 并且在查询部分和 insert 部分都开启并行

SQL>alter session enable parallel dml;
SQL> insert /*+APPEND PARALLEL*/ into partbl (qty, name) select * from origtbl;
------------------------------------------
| Id  | Operation             | Name     |
------------------------------------------
|   0 | INSERT STATEMENT      |          |
|   1 |  PX COORDINATOR       |          |
|   2 |   PX SEND QC (RANDOM) | :TQ10000 |
|   3 |    LOAD AS SELECT     |          |
|   4 |     PX BLOCK ITERATOR |          |
|*  5 |      TABLE ACCESS FULL| ORIGTBL  |
------------------------------------------

注意在以上执行计划中 LOAD AS SELECT 在 PX COORDINATOR 的下面。

另外一种可选的方式是直接通过 select 来创建新的分区表:一次性创建新的分区表并且加载数据。
执行计划同时显示 direct path load 并且 dml 以及 select 部分全部并行。

SQL>alter session enable parallel dml;
SQL> create table partbl (qty, name) partition by range (qty) (partition p1 values less than (501),partition p2 values less than (maxvalue))
  2  as select /*+PARALLEL*/ * from origtbl;
-------------------------------------------
| Id  | Operation              | Name     |
-------------------------------------------
|   0 | CREATE TABLE STATEMENT |          |
|   1 |  PX COORDINATOR        |          |
|   2 |   PX SEND QC (RANDOM)  | :TQ10000 |
|   3 |    LOAD AS SELECT      |          |
|   4 |     PX BLOCK ITERATOR  |          |
|*  5 |      TABLE ACCESS FULL | ORIGTBL  |
-------------------------------------------

 通过 Partition Exchange 方法

ALTER TABLE EXCHANGE PARTITION 可以通过交换数据和索引 segment 来将一个分区(或子分区)转换成一个非分区表,也可以将一个非分区表转换成一个分区表的分区(或子分区)。 除了需要更新索引以外,ALTER TABLE … EXCHANGE PARTITION 命令是一个字典操作不需要数据移动。更多关于此方法的信息参见 Oracle 联机文档(比如 11.2)和 Note 198120.1。

此方法简要步骤如下:

1) 根据所需的分区来创建新的分区表
2) 保持需要交换的非分区表与分区表的分区有相同的结构,并且确保您需要交换的非分区表具有您想要交换的内容
3) 执行:Alter table exchange partition partition_name with table exchange table

注意在交换过程中,所有交换的数据必须满足分区表的分区定义,否则如下错误将抛出:ORA-14099: all rows in table do not qualify for specified partition.

这是因为默认情况下分区交换是有校验的。

例子(基于 SCOTT 示例 schema)
———

本例创建了与分区表 p_emp 的分区相同结构的交换表。

SQL> CREATE TABLE p_emp
2 (sal NUMBER(7,2))
3 PARTITION BY RANGE(sal)
4 (partition emp_p1 VALUES LESS THAN (2000),
5 partition emp_p2 VALUES LESS THAN (4000));

Table created.

SQL> SELECT * FROM emp;

EMPNO ENAME      JOB              MGR HIREDATE         SAL       COMM     DEPTNO
———- ———- ——— ———- ——— ———- ———- ———-
7369 SMITH      CLERK           7902 17-DEC-80        800                    20
7499 ALLEN      SALESMAN        7698 20-FEB-81       1600        300         30
7521 WARD       SALESMAN        7698 22-FEB-81       1250        500         30
7566 JONES      MANAGER         7839 02-APR-81       2975                    20
7654 MARTIN     SALESMAN        7698 28-SEP-81       1250       1400         30
7698 BLAKE      MANAGER         7839 01-MAY-81       2850                    30
7782 CLARK      MANAGER         7839 09-JUN-81       2450                    10
7788 SCOTT      ANALYST         7566 19-APR-87       3000                    20
7839 KING       PRESIDENT            17-NOV-81       5000                    10
7844 TURNER     SALESMAN        7698 08-SEP-81       1500          0         30
7876 ADAMS      CLERK           7788 23-MAY-87       1100                    20
7900 JAMES      CLERK           7698 03-DEC-81        950                    30
7902 FORD       ANALYST         7566 03-DEC-81       3000                    20
7934 MILLER     CLERK           7782 23-JAN-82       1300                    10

14 rows selected.

SQL> CREATE TABLE exchtab1 as SELECT sal FROM emp WHERE sal<2000;

Table created.

SQL> CREATE TABLE exchtab2 as SELECT sal FROM emp WHERE sal BETWEEN 2000 AND 3999;

Table created.

SQL> alter table p_emp exchange partition emp_p1 with table exchtab1;

Table altered.

SQL> alter table p_emp exchange partition emp_p2 with table exchtab2;

Table altered.

D. 通过在线重定义的方式(DBMS_REDEFINITION) 方法

详情参见:

Note 472449.1 How To Partition Existing Table Using DBMS_Redefinition
Note 1481558.1  DBMS_REDEFINITION: Case Study for a Large Non-Partition Table to a Partition Table with Online Transactions occuring
Note 177407.1 How to Re-Organize a Table Online

 

 

E. MODIFY clause added to the ALTER TABLE SQL Statement. (From 12.2 to 20c)

从 12.2开始,可以使用 Alter Table <table_name> MODIFY 把非分区表转换成分区表。

更多详情请参考 https://docs.oracle.com/en/database/oracle/oracle-database/12.2/vldbg/evolve-nopartition-table.html#GUID-5FDB7D59-DD05-40E4-8AB4-AF82EA0D0FE5

Oracle非分区表转换成分区表的若干方式

概述

 

某客户由存储厂商维护人员将LUN存储划分到EMR数据库主机上;在此期间由于不当操作,导致数据库集群状态异常,之后重启了数据库主机节点2,重启主机后集群软件GRID无法正常启动,节点1仍可以正常提供数据库服务。通过远程方式接入数据库环境进行故障排查,通过对相关日志等信息的深入分析,最终找到问题点并进行处理,数据库集群恢复正常。

 

集群及数据库日志分析

1. 15:36分主机重启前的错误信息

从此处日志可以看出,在15:36分时数据库实例报IO读写错误,同时数据库集群日志中也在显示报错,之后发生了主机重启动作。

 

2. 节点2重启主机后相关日志信息

节点2重启主机后,检查集群启动状态可以发现在启动到ASM实例时停止,检查ASM ALERT日志有如下信息:

集群ALERT日志:

2021-12-12 15:48:34.081:

[client(10295)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

[client(10405)]CRS-10001:12-Dec-19 15:48 ACFS-9391: Checking for existing ADVM/ACFS installation.

[client(10410)]CRS-10001:12-Dec-19 15:48 ACFS-9392: Validating ADVM/ACFS installation files for operating system.

[client(10412)]CRS-10001:12-Dec-19 15:48 ACFS-9393: Verifying ASM Administrator setup.

[client(10415)]CRS-10001:12-Dec-19 15:48 ACFS-9308: Loading installed ADVM/ACFS drivers.

[client(10418)]CRS-10001:12-Dec-19 15:48 ACFS-9154: Loading ‘oracleoks.ko’ driver.

[client(10446)]CRS-10001:12-Dec-19 15:48 ACFS-9154: Loading ‘oracleadvm.ko’ driver.

[client(10473)]CRS-10001:12-Dec-19 15:48 ACFS-9154: Loading ‘oracleacfs.ko’ driver.

[client(10560)]CRS-10001:12-Dec-19 15:48 ACFS-9327: Verifying ADVM/ACFS devices.

[client(10564)]CRS-10001:12-Dec-19 15:48 ACFS-9156: Detecting control device ‘/dev/asm/.asm_ctl_spec’.

[client(10568)]CRS-10001:12-Dec-19 15:48 ACFS-9156: Detecting control device ‘/dev/ofsctl’.

[client(10573)]CRS-10001:12-Dec-19 15:48 ACFS-9322: completed

2021-12-12 15:48:43.930:

[client(10374)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

2021-12-12 15:48:53.698:

[client(10598)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

2021-12-12 15:49:03.129:

[client(10905)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

2021-12-12 15:50:00.511:

[client(12458)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

2021-12-12 15:51:07.276:

?

节点2:ASM ALERT日志

Thu Dec 12 16:22:11 2021

MMNL started with pid=21, OS id=26248

lmon registered with NM – instance number 2 (internal mem no 1)

Thu Dec 12 16:24:08 2021

LMON received an instance eviction notification from instance 1

The instance eviction reason is 0x20000000

The instance eviction map is 2

Thu Dec 12 16:24:11 2021

PMON (ospid: 26206): terminating the instance due to error 481

Thu Dec 12 16:24:11 2021

ORA-1092 : opitsk aborting process

Thu Dec 12 16:24:13 2021

System state dump requested by (instance=2, osid=26206 (PMON)), summary=[abnormal instance termination].

System State dumped to trace file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_diag_26216_20211212162413.trc

Dumping diagnostic data in directory=[cdmp_20211212162411], requested by (instance=2, osid=26206 (PMON)), summary=[abnormal instance termination].

Instance terminated by PMON, pid = 26206

检查对应的LMON及DIAG日志:

[grid@cxemrdb02 trace]$ more ./+ASM2_diag_26216.trc

Trace file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_diag_26216.trc

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 – 64bit Production

With the Real Application Clusters and Automatic Storage Management options

ORACLE_HOME = /u01/app/11.2.0/grid

System name: ???Linux

Node name: ?????cxemrdb02

Release: ???????2.6.32-431.el6.x86_64

Version: ???????#1 SMP Sun Nov 10 22:19:54 EST 2013

Machine: ???????x86_64

Instance name: +ASM2

Redo thread mounted by this instance: 0 <none>

Oracle process number: 6

Unix process pid: 26216, image: oracle@cxemrdb02 (DIAG)

*** 2021-12-12 16:22:10.788

*** SESSION ID:(373.1) 2021-12-12 16:22:10.788

*** CLIENT ID:() 2021-12-12 16:22:10.788

*** SERVICE NAME:() 2021-12-12 16:22:10.788

*** MODULE NAME:() 2021-12-12 16:22:10.788

*** ACTION NAME:() 2021-12-12 16:22:10.788

Node id: 1

List of nodes: 0, 1,

*** 2021-12-12 16:22:10.790

Reconfiguration starts [incarn=0]

*** 2021-12-12 16:22:10.790

I’m the voting node

Group reconfiguration cleanup

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

 

3. 问题分析描述

 

通过对数据库集群相关日志的分析,可以发现在15:36分时的非正常操作,导致了数据库及集群报错并状态异常,此时进行了数据库主机的重启动作;重启后集群日志中显示ASM无法启动,根据ASM日志及TRACE文件信息,结合ORACLE官方文档中ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481 (Doc ID 1383737.1)文档中描述的信息,可以判断是集群私网通信出现了问题,通过对集群私网相关配置进行分析、排查,最终确认是集群通信的socket文件(/var/tmp/.oracle)内容由于集群异常关闭导致不一致,从而引起集群节点2重启后不能正常启动。通过清理socket文件后集群可以正常启动 恢复正常。

 

 

 

 

 

三、总结与后续处理建议

1. 问题总结

 

通过对数据库集群相关日志的分析,可以发现在15:36分时的非正常操作,导致了数据库及集群报错并状态异常,此时进行了数据库主机的重启动作;重启后集群日志中显示ASM无法启动,根据ASM日志及TRACE文件信息,结合ORACLE官方文档中ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481 (Doc ID 1383737.1)文档中描述的信息,可以判断是集群私网通信出现了问题;通过对集群私网相关配置进行分析、排查,最终确认是集群通信的socket文件(/var/tmp/.oracle)内容由于集群异常关闭导致不一致,从而引起集群节点2重启后与节点1通信异常,不能正常启动。通过清理socket文件后集群可以正常启动,数据库集群恢复正常。

 

 

 

4. 后续对于集群添加磁盘的建议

对于生产环境数据库集群,向ASM环境中添加磁盘是一个重要且有一点风险的动作,由于数据库所有数据都存放在ASM磁盘组中,因此如果向已有ASM磁盘组添加磁盘不当,可能导致ASM磁盘组读写异常或不能正常MOUNT,从而影响正常的数据库业务运行。因此结合ORACLE官方的标准步骤”如何在 RAC 集群或单机 ASM 环境中对已经存在的 Diskgroup 添加新磁盘(最佳实践) (文档 ID 1946668.1)”以及我们多年的数据库维护经验,

建议后续添加ASM磁盘时遵从如下步骤:

步骤 操作明细
1 对现有集群状态、ASM磁盘状态、使用率、磁盘大小检查
2 存储层面进行相应的LUN划分,建议划分的LUN大小与原ASM磁盘组中的磁盘大小一致
3 存储层新划分LUN到主机操作系统层面的映射,确认集群中每个主机可以正常识别读写LUN
4 参照已有的ASM磁盘配置相应的别名或多路径映射,集群kfed命令可以读取
5 使用新划分LUN新建临时测试磁盘组确认新划分LUN可以被ASM使用
6 删除新加的测试磁盘组
7 将新划分LUN加入所需的ASM磁盘组中
8 检查ASM磁盘组rebalance完成状态,检查集群及数据库的状态
Oracle Asm 增加磁盘 操作不当引起ASM无法启动故障处理记录

概述

医院数据库监控平台显示连接HIS数据库1节点异常,通过远程的方式接入数据库进行故障排除,快速解决了问题使得业务恢复正常使用。

问题及相关日志分析

1. 检查监听

通过2021/05/18的巡检报告中发现监听中存在1.80的IP地址,而此次检查发现监听中无1.80的IP地址。

 

2. 检查集群状态

检查集群状态,发现集群中多个资源状态显示OFFLINE。

[grid@cxhisdb02 ~]$ crsctl stat res -t

——————————————————————————–

NAME           TARGET  STATE        SERVER                   STATE_DETAILS

——————————————————————————–

Local Resources

——————————————————————————–

ora.DATA.dg

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.LISTENER.lsnr

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.OCR.dg

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.SSD.dg

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.asm

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02                Started

ora.gsd

OFFLINE OFFLINE      cxhisdb01

OFFLINE OFFLINE      cxhisdb02

ora.net1.network

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.ons

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.registry.acfs

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

——————————————————————————–

Cluster Resources

——————————————————————————–

ora.LISTENER_SCAN1.lsnr

1        ONLINE  ONLINE       cxhisdb02

ora.cvu

1        ONLINE  OFFLINE

ora.cxhisdb01.vip

1        ONLINE  OFFLINE

ora.cxhisdb02.vip

1        ONLINE  ONLINE       cxhisdb02

ora.hospital.db

1        ONLINE  OFFLINE

2        ONLINE  ONLINE       cxhisdb02                Open

ora.oc4j

1        ONLINE  OFFLINE

ora.scan1.vip

1        ONLINE  ONLINE       cxhisdb02

 

 

3. 检查集群日志

仅有CRSD服务异常,其他集群资源均正常,因此数据库未宕机。

2021-05-26 13:23:46.059:

[crsd(145215)]CRS-1006:The OCR location +OCR is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:46.068:

[crsd(145215)]CRS-1006:The OCR location +OCR is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:56.293:

[/u01/app/11.2.0/grid/bin/oraagent.bin(66885)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/oraagent_grid’ disconnected from server. Details at (:CRSAGF00117:) {0:21:18} in

/u01/app/11.2.0/grid/log/cxhisdb01/agent/crsd/oraagent_grid/oraagent_grid.log.

2021-05-26 13:23:56.294:

[/u01/app/11.2.0/grid/bin/oraagent.bin(31320)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/oraagent_oracle’ disconnected from server. Details at (:CRSAGF00117:) {0:19:5060

3} in /u01/app/11.2.0/grid/log/cxhisdb01/agent/crsd/oraagent_oracle/oraagent_oracle.log.

2021-05-26 13:23:56.461:

[/u01/app/11.2.0/grid/bin/orarootagent.bin(145347)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/orarootagent_root’ disconnected from server. Details at (:CRSAGF00117:) {0:

5:1568} in /u01/app/11.2.0/grid/log/cxhisdb01/agent/crsd/orarootagent_root/orarootagent_root.log.

2021-05-26 13:23:56.485:

[/u01/app/11.2.0/grid/bin/scriptagent.bin(145549)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/scriptagent_grid’ disconnected from server. Details at (:CRSAGF00117:) {0:9:

68} in /u01/app/11.2.0/grid/log/cxhisdb01/agent/crsd/scriptagent_grid/scriptagent_grid.log.

2021-05-26 13:23:56.651:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:23:58.540:

[crsd(5795)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:58.548:

[crsd(5795)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:58.964:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:00.374:

[crsd(5834)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:00.382:

[crsd(5834)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:01.010:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:02.447:

[crsd(5886)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:02.455:

[crsd(5886)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:03.068:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:04.457:

[crsd(5909)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:04.465:

[crsd(5909)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:05.102:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:06.492:

[crsd(5937)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:06.501:

[crsd(5937)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:07.132:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:08.517:

[crsd(5986)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:08.525:

[crsd(5986)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:09.162:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:10.544:

[crsd(6015)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:10.552:

[crsd(6015)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:11.193:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:12.581:

[crsd(6051)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:12.589:

[crsd(6051)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:13.223:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:14.614:

[crsd(6070)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:14.622:

[crsd(6070)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:15.253:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:16.643:

[crsd(6090)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:16.650:

[crsd(6090)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:17.284:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:17.284:

[ohasd(144192)]CRS-2771:Maximum restart attempts reached for resource ‘ora.crsd’; will not restart.

2021-05-26 13:24:17.315:

[ohasd(144192)]CRS-2769:Unable to failover resource ‘ora.crsd’.

 

4. 检查ASM日志

ASM日志中2021/05/26 12:19:57显示OCR仲裁盘有问题,13:23出现读写错误。

Wed May 26 12:19:57 2021

WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 3 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 4 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 3 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 4 in group 3.

Wed May 26 12:19:57 2021

NOTE: process _b000_+asm1 (160488) initiating offline of disk 0.1409468596 (OCR_0000) with mask 0x7e in group 3

NOTE: process _b000_+asm1 (160488) initiating offline of disk 2.1409468594 (OCR_0002) with mask 0x7e in group 3

NOTE: process _b000_+asm1 (160488) initiating offline of disk 3.1409468595 (OCR_0003) with mask 0x7e in group 3

NOTE: process _b000_+asm1 (160488) initiating offline of disk 4.1409468592 (OCR_0004) with mask 0x7e in group 3

NOTE: checking PST: grp = 3

GMON checking disk modes for group 3 at 15 for pid 46, osid 160488

ERROR: no read quorum in group: required 3, found 1 disks

NOTE: checking PST for grp 3 done.

NOTE: initiating PST update: grp = 3, dsk = 0/0x5402c8b4, mask = 0x6a, op = clear

NOTE: initiating PST update: grp = 3, dsk = 2/0x5402c8b2, mask = 0x6a, op = clear

NOTE: initiating PST update: grp = 3, dsk = 3/0x5402c8b3, mask = 0x6a, op = clear

NOTE: initiating PST update: grp = 3, dsk = 4/0x5402c8b0, mask = 0x6a, op = clear

GMON updating disk modes for group 3 at 16 for pid 46, osid 160488

ERROR: no read quorum in group: required 3, found 1 disks

Wed May 26 12:19:57 2021

NOTE: cache dismounting (not clean) group 3/0xA242386E (OCR)

NOTE: messaging CKPT to quiesce pins Unix process pid: 160495, image: oracle@cxhisdb01 (B001)

Wed May 26 12:19:57 2021

NOTE: halting all I/Os to diskgroup 3 (OCR)

Wed May 26 12:19:57 2021

NOTE: LGWR doing non-clean dismount of group 3 (OCR)

NOTE: LGWR sync ABA=15.85 last written ABA 15.85

WARNING: Offline for disk OCR_0000 in mode 0x7f failed.

WARNING: Offline for disk OCR_0002 in mode 0x7f failed.

WARNING: Offline for disk OCR_0003 in mode 0x7f failed.

WARNING: Offline for disk OCR_0004 in mode 0x7f failed.

Wed May 26 12:19:58 2021

kjbdomdet send to inst 2

detach from dom 3, sending detach message to inst 2

Wed May 26 12:19:58 2021

NOTE: No asm libraries found in the system

Wed May 26 12:19:58 2021

List of instances:

1 2

Dirty detach reconfiguration started (new ddet inc 1, cluster inc 4)

Global Resource Directory partially frozen for dirty detach

* dirty detach – domain 3 invalid = TRUE

2 GCS resources traversed, 0 cancelled

Dirty Detach Reconfiguration complete

Wed May 26 12:19:58 2021

WARNING: dirty detached from domain 3

NOTE: cache dismounted group 3/0xA242386E (OCR)

 

 

2021-05-26 13:23:46.059:

[crsd(145215)]CRS-1006:The OCR location +OCR is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:46.068:

[crsd(145215)]CRS-1006:The OCR location +OCR is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

 

 

问题处理步骤

由于数据库仅1节点OCR磁盘组访问异常,2节点以及其他盘访问正常,因而导致1节点CRS资源异常,且数据库未宕机,处理步骤需启动CRS资源,并且重启监听。

1. 1节点挂载OCR磁盘组

[root@cxhisdb01 ~]# su – grid

[grid@cxhisdb01 ~]$ sqlplus / as sysasm

 

SQL*Plus: Release 11.2.0.4.0 Production on Thu May 27 12:12:17 2021

 

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

 

 

Connected to:

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 – 64bit Production

With the Real Application Clusters and Automatic Storage Management options

 

SQL> alter diskgroup ocr mount;

 

Diskgroup altered.

 

SQL> exit

2. 启动CRS

[grid@cxhisdb01 ~]$ crsctl start res ora.crsd -init

CRS-2672: Attempting to start ‘ora.crsd’ on ‘cxhisdb01’

CRS-2676: Start of ‘ora.crsd’ on ‘cxhisdb01’ succeeded

3. 检查集群资源

[grid@cxhisdb01 ~]$ crsctl stat res -t

——————————————————————————–

NAME           TARGET  STATE        SERVER                   STATE_DETAILS

——————————————————————————–

Local Resources

——————————————————————————–

ora.DATA.dg

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.LISTENER.lsnr

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.OCR.dg

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.SSD.dg

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.asm

ONLINE  ONLINE       cxhisdb01                Started

ONLINE  ONLINE       cxhisdb02                Started

ora.gsd

OFFLINE OFFLINE      cxhisdb01

OFFLINE OFFLINE      cxhisdb02

ora.net1.network

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.ons

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.registry.acfs

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

——————————————————————————–

Cluster Resources

——————————————————————————–

ora.LISTENER_SCAN1.lsnr

1        ONLINE  ONLINE       cxhisdb02

ora.cvu

1        ONLINE  ONLINE       cxhisdb01

ora.cxhisdb01.vip

1        ONLINE  ONLINE       cxhisdb01

ora.cxhisdb02.vip

1        ONLINE  ONLINE       cxhisdb02

ora.hospital.db

1        ONLINE  ONLINE       cxhisdb01                Open

2        ONLINE  ONLINE       cxhisdb02                Open

ora.oc4j

1        ONLINE  ONLINE       cxhisdb01

ora.scan1.vip

1        ONLINE  ONLINE       cxhisdb02

4. 重启监听

[grid@cxhisdb01 ~]$ srvctl stop listener -n cxhisdb01

[grid@cxhisdb01 ~]$ srvctl start listener -n cxhisdb01

[grid@cxhisdb01 ~]$ lsnrctl status

 

LSNRCTL for Linux: Version 11.2.0.4.0 – Production on 27-MAY-2021 12:20:44

 

Copyright (c) 1991, 2013, Oracle.  All rights reserved.

 

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER)))

STATUS of the LISTENER

————————

Alias                     LISTENER

Version                   TNSLSNR for Linux: Version 11.2.0.4.0 – Production

Start Date                27-MAY-2021 12:20:40

Uptime                    0 days 0 hr. 0 min. 4 sec

Trace Level               off

Security                  ON: Local OS Authentication

SNMP                      OFF

Listener Parameter File   /u01/app/11.2.0/grid/network/admin/listener.ora

Listener Log File         /u01/app/grid/diag/tnslsnr/cxhisdb01/listener/alert/log.xml

Listening Endpoints Summary…

(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))

(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.80)(PORT=1521)))

(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.10)(PORT=1521)))

Services Summary…

Service “hospital” has 1 instance(s).

Instance “hospital1”, status READY, has 1 handler(s) for this service…

Service “hospitalXDB” has 1 instance(s).

Instance “hospital1”, status READY, has 1 handler(s) for this service…

The command completed successfully

[grid@cxhisdb01 ~]$ cd /u01/app/grid/diag/tnslsnr/cxhisdb01/listener/trace/

[grid@cxhisdb01 trace]$ tail -f listener.log

27-MAY-2021 12:20:51 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVICE_NAME=hospital)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.171)(PORT=62045)) * establish * hospital * 0

27-MAY-2021 12:20:51 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=ZY-603300-02-YS)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.9.58)(PORT=3676)) * establish * hospital * 0

27-MAY-2021 12:20:51 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=ZY-603300-02-YS)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.9.58)(PORT=3677)) * establish * hospital * 0

27-MAY-2021 12:20:55 * (CONNECT_DATA=(SERVICE_NAME=HOSPITAL)(CID=(PROGRAM=oracle)(HOST=lis-server)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.11)(PORT=23431)) * establish * HOSPITAL * 0

27-MAY-2021 12:20:55 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=JZ-EK-001)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.3.99)(PORT=4442)) * establish * hospital * 0

27-MAY-2021 12:20:55 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=JZ-EK-001)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.3.99)(PORT=4444)) * establish * hospital * 0

27-MAY-2021 12:20:56 * (CONNECT_DATA=(SID=hospital1)(CID=(PROGRAM=配置数据库.exe)(HOST=YYJQZZJFW)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.244)(PORT=58397)) * establish * hospital1 * 0

27-MAY-2021 12:20:56 * (CONNECT_DATA=(SERVICE_NAME=HOSPITAL)(CID=(PROGRAM=oracle)(HOST=lis-server)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.11)(PORT=23432)) * establish * HOSPITAL * 0

27-MAY-2021 12:20:56 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=ZY-603300-02-YS)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.9.58)(PORT=3681)) * establish * hospital * 0

27-MAY-2021 12:20:57 * service_update * hospital1 * 0

27-MAY-2021 12:20:58 * (CONNECT_DATA=(SERVICE_NAME=HOSPITAL)(CID=(PROGRAM=oracle)(HOST=lis-server)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.11)(PORT=23438)) * establish * HOSPITAL * 0

27-MAY-2021 12:20:59 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=JZ-EK-001)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.3.99)(PORT=4460)) * establish * hospital * 0

27-MAY-2021 12:21:00 * service_update * hospital1 * 0

Thu May 27 12:21:01 2021

27-MAY-2021 12:21:01 * (CONNECT_DATA=(SID=hospital1)(CID=(PROGRAM=配置数据库.exe)(HOST=YYJQZZJFW)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.244)(PORT=58398)) * establish * hospital1 * 0

27-MAY-2021 12:21:03 * service_update * hospital1 * 0

27-MAY-2021 12:21:03 * (CONNECT_DATA=(SERVICE_NAME=HOSPITAL)(CID=(PROGRAM=oracle)(HOST=lis-server)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.11)(PORT=23444)) * establish * HOSPITAL * 0

 

总结与后续处理建议

1. 问题总结

HIS数据库1节点访问OCR仲裁盘异常导致CRSD异常,进而引起1节点上多个集群资源offline,监听资源中相较于以前也少了192.2.1.80的IP,因此监控平台显示连接HIS数据库1节点异常,由于2节点和其他磁盘组无异常,因此数据库未发生宕机现象。

2. 处理操作

  • HIS数据库1节点手工mount OCR磁盘组
  • 启动CRS资源
  • 1节点重启监听资源

3. 后续建议

此次故障发生是由于集群资源访问OCR磁盘组异常导致,建议联系存储工程师,排查相关时间点存储运行情况,并定期做好存储巡检和状态监控工作。

Oracle voting disk 故障处理一例