概述
某客户于2021/06/07凌晨01点43分左右,出现应用程序无法连接,前台业务无法办理等情况。在此情况下代维科技工程师在接到故障支持电话后,及时进行响应,通过远程的方式接入数据库进行故障排除,快速解决了问题使得业务恢复正常使用。
问题及相关日志分析
1. 数据库告警日志分析分析
通过对数据库告警日志查看,节点1在01:22:14出现ORA-29970错误,LMD进程无法响应,在01:22:29被LMHB进程将实例终止,实例终止后自动启动。节点2在01:29:31被LMS0进程将实例终止,实例终止后自动启动。
1节点:
Mon Jun 07 01:22:14 2021
LMD0 (ospid: 10613) has not called a wait for sub 0 secs.
LMD1 (ospid: 10617) has not called a wait for sub 0 secs.
Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_lmhb_10637.trc (incident=688375) (PDBNAME=CDB$ROOT):
ORA-29770: global enqueue process LMD0 (OSID 10613) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/cdb/cdb1/incident/incdir_688375/cdb1_lmhb_10637_i688375.trc
Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_lmhb_10637.trc (incident=688376) (PDBNAME=CDB$ROOT):
ORA-29770: global enqueue process LMD1 (OSID 10617) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/cdb/cdb1/incident/incdir_688376/cdb1_lmhb_10637_i688376.trc
LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_CDB by memno 0
ERROR: Some process(s) is not making progress.
LMHB (ospid: 10637) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
LMHB (ospid: 10637): terminating the instance due to error 29770
Mon Jun 07 01:22:24 2021
System state dump requested by (instance=1, osid=10637 (LMHB)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_diag_10575_20210607012224.trc
Mon Jun 07 01:22:26 2021
License high water mark = 591
Mon Jun 07 01:22:29 2021
Instance terminated by LMHB, pid = 10637
Mon Jun 07 01:22:29 2021
USER (ospid: 21601): terminating the instance
Mon Jun 07 01:22:29 2021
Instance terminated by USER, pid = 21601
Mon Jun 07 01:22:32 2021
Starting ORACLE instance (normal) (OS id: 21660)
Mon Jun 07 01:22:32 2021
CLI notifier numLatches:37 maxDescs:3986
Mon Jun 07 01:22:32 2021
**********************************************************************
Mon Jun 07 01:22:32 2021
Dump of system resources acquired for SHARED GLOBAL AREA (SGA)
Mon Jun 07 01:22:32 2021
Per process system memlock (soft) limit = UNLIMITED
Mon Jun 07 01:22:32 2021
Expected per process system memlock (soft) limit to lock
SHARED GLOBAL AREA (SGA) into memory: 128G
Mon Jun 07 01:22:32 2021
Available system pagesizes:
4K, 2048K
Mon Jun 07 01:22:32 2021
Supported system pagesize(s):
Mon Jun 07 01:22:32 2021
PAGESIZE AVAILABLE_PAGES EXPECTED_PAGES ALLOCATED_PAGES ERROR(s)
Mon Jun 07 01:22:32 2021
2048K 66823 65538 65538 NONE
Mon Jun 07 01:22:32 2021
Reason for not supporting certain system pagesizes:
Mon Jun 07 01:22:32 2021
4K – Large pagesizes only
Mon Jun 07 01:22:32 2021
**********************************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 24
2节点:
Mon Jun 07 01:25:05 2021
Set master node info
Mon Jun 07 01:26:05 2021
Auto-tuning: Shutting down background process GTXb
Mon Jun 07 01:27:53 2021
IPC Send timeout detected. Sender: ospid 62152 [oracle@node1 (PING)]
Receiver: inst 1 binc 912546309 ospid 10589
Mon Jun 07 01:29:25 2021
LMD0 (ospid: 62168) received an instance eviction notification from instance 1 [2]
Mon Jun 07 01:29:26 2021
Received an instance abort message from instance 1
Mon Jun 07 01:29:26 2021
Received an instance abort message from instance 1
Mon Jun 07 01:29:26 2021
Please check instance 1 alert and LMON trace files for detail.
Mon Jun 07 01:29:26 2021
Please check instance 1 alert and LMON trace files for detail.
Mon Jun 07 01:29:26 2021
LMS0 (ospid: 62192): terminating the instance due to error 481
Mon Jun 07 01:29:26 2021
System state dump requested by (instance=2, osid=62192 (LMS0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/cdb/cdb2/trace/cdb2_diag_62123_20210607012926.trc
Mon Jun 07 01:29:26 2021
ORA-1092 : opitsk aborting process
Mon Jun 07 01:29:27 2021
License high water mark = 1251
Mon Jun 07 01:29:31 2021
Instance terminated by LMS0, pid = 62192
Mon Jun 07 01:29:31 2021
USER (ospid: 85262): terminating the instance
Mon Jun 07 01:29:31 2021
Instance terminated by USER, pid = 85262
Mon Jun 07 01:29:33 2021
Starting ORACLE instance (normal) (OS id: 85397)
Mon Jun 07 01:29:33 2021
CLI notifier numLatches:37 maxDescs:3986
Mon Jun 07 01:29:33 2021
**********************************************************************
Mon Jun 07 01:29:33 2021
Dump of system resources acquired for SHARED GLOBAL AREA (SGA)
Mon Jun 07 01:29:33 2021
Per process system memlock (soft) limit = UNLIMITED
Mon Jun 07 01:29:33 2021
Expected per process system memlock (soft) limit to lock
SHARED GLOBAL AREA (SGA) into memory: 128G
Mon Jun 07 01:29:33 2021
Available system pagesizes:
4K, 2048K
Mon Jun 07 01:29:33 2021
Supported system pagesize(s):
Mon Jun 07 01:29:33 2021
PAGESIZE AVAILABLE_PAGES EXPECTED_PAGES ALLOCATED_PAGES ERROR(s)
Mon Jun 07 01:29:33 2021
2048K 67200 65538 65538 NONE
Mon Jun 07 01:29:33 2021
Reason for not supporting certain system pagesizes:
Mon Jun 07 01:29:33 2021
4K – Large pagesizes only
Mon Jun 07 01:29:33 2021
**********************************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 24
查看更多日志,发现在关闭之前,出现了较多的ORA-04031错误。
Mon Jun 07 01:18:14 2021
Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_ora_14795.trc (incident=691727) (PDBNAME=CDB$ROOT):
ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^1069″,”kglseshtTable”)
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_lmd0_10613.trc (incident=707486) (PDBNAME=CDB$ROOT):
ORA-04031: unable to allocate 8504 bytes of shared memory (“shared pool”,”unknown object”,”sga heap(5,0)”,”ges big msg pool”)
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Jun 07 01:18:16 2021
Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_ora_14832.trc (incident=691688) (PDBNAME=CDB$ROOT):
ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^320″,”kglseshtTable”)
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Jun 07 01:18:16 2021
Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_ora_14837.trc (incident=691735) (PDBNAME=CDB$ROOT):
ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^1234″,”kglseshtTable”)
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Jun 07 01:18:18 2021
Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_ora_14912.trc (incident=691871) (PDBNAME=CDB$ROOT):
ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^159″,”kglseshtTable”)
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_lmd0_10613.trc (incident=707487) (PDBNAME=CDB$ROOT):
ORA-04031: unable to allocate 8504 bytes of shared memory (“shared pool”,”unknown object”,”sga heap(5,0)”,”ges big msg pool”)
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
DDE: Problem Key ‘ORA 4031’ was completely flood controlled (0x6)
Further messages for this problem key will be suppressed for up to 10 minutes
Mon Jun 07 01:18:33 2021
Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_m000_15479.trc:
ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^2147″,”kglseshtTable”)
2. AWR报告信息分析
拉取了宕机前半小时的AWR报告,从AWR报告中得知,shared pool中ges enqueues和ges resource dynamic分别达到了18G和16G。
3. Oracle官方参考文献
根据Oracle官方文档记载:ORA-04031 Errors Occurring with High “ges resource dynamic” & “ges enqueues” Memory Usage In The Shared Pool (Doc ID 2063751.1),该问题是Oracle的BUG,该BUG出现于Oracle RAC 12.1.0.1至12.1.0.2,此问题在Oracle RAC 12.2中被修复,移动护理数据库版本为Oracle RAC 12.1.0.2。
4. 数据库硬解析比例高的SQL语句
SQL_TEXT
———————————————————————————————————————————————————————————————————————————————————-
unshared count
————–
SELECT A.PATIENT_ID, A.VISIT_ID,A.ORDER_NO,A.ORDER_SUB_NO,TO_CHAR(T.BED_NO) BED_NO,TO_CHAR(START_DATE_TIME,’YYYY-MM-DD HH24:MI:SS’) AS START_DATE_TIME, TO_ FROM V_EMR_TIME,’YYYY-MM-DD HH24:MI:SS’) AS STOP_DATE_TIME
ORDERS_PERFORMAN AND A.PATIENT_ID=’690086500ID’,’7′)
8′ AND (A.STOP_DATE_TIME IS NULL OR A.STOP_DATE_TIME >= TO_DATE(‘2021-06-07’||’ 00:00:00′,’YYYY-MM-DD HH24:MI:SS’)) AND A.START_DATE_TIME <=
TO_DATE(‘2021-06-07 07:52:53’, ‘YYYY-MM-DD HH24:MI:SS’)
35162
update KYEEHIS.hl7_send_records set type = ‘1’ where rec_id = ‘1003839362’
31808
AND T.BARCODE<>’1# IS NOT NULL2021-06-07′, ‘YYYY-MM-DD’)
ORDER BY T.OPER_TIME DESC日(Q12)’, ‘1/12小时’, ‘1/8小时’)
14151
SELECT RECORD_TIME,ALL_S WHERE PATIENT_ID=’1110234022′ AND VISIT_ID=’3′ AND PG_CODE=’8′ –AND RECORD_TIME BETWEEN TO_DATE(‘2021-06-07 00:00:00′,’YYYY-MM-DD HH24:MI:SS’) AND TO_DATE(‘2021-06-07 23:59:59
‘,’YYYY-MM-DD HH24:MI:SS ORDER BY RECORD_TIME DESC
7829
select distinct barcode,’1’ exec_fg,nurse_name,nurse oper_nurse,” oper_time,order_exec_id,order_id,order_no,order_text,patient_id,visit_id where patient_id = ‘6900_exec_log
767269’ and order_exec_id = ‘288132019’
6180
SELECT T.PATIENT_ID, TO_DATE(TO_CHAR(T.EVENT_DATE,’YYYY-MM-DD’)||’ ‘|| TO_CHAR(T.TIME_POINT, ‘HH24:MI:SS’),
‘YYY WHAND T.PATIENT_ID = ’40TAMP_CODEP_DICT K
10479126’ AND (T.STAMP_CODE = ‘10003083’ OR T.STAMP_CODE = ‘10003084’)
5894
select t.* from KYEEHIS.docs_ord and order_no = ‘49712239’47427′
and exec_time = TO_DATE(‘2021-06-07 09:00:00′,’YYYY-MM-DD HH24:MI:SS’)
5070
SELECT REC_ID, PATIENT_ID, VISIT_ID, MODULE_CODE, MODULE_NAME, CREATE_TIME, ENBALE_VALUE FROM DOCS_OPERATION_MODULE_REC WHERE PATIENT_ID=’4010110870′ AND VISIT_ID=’3′ AND MODULE_CODE=’19’ AND ENBALE_VALUE=’Y’
4849
SELECT PATIENT_ID,TO_CHAR(VISIT_ID) AS VISIT_ID,INP_NO,WARD_CODE,WARD_NAME,DEPT_CODE,DEPT_NAME,PATIENT_NAME,SEX,TO_CHAR(BED_NO) AS BED_NO, BED_LABEL,DATE_OF_BIRTH,DECODE(NURSE_LEVEL,’特级护理’,’0′,’一级护理’,’1′,’二级护理’,’2′,’三级护理’,’3′) NURSE_L
EVEL,STATUS,CHARGE_TYPE,ADMISSION_DATE,DEPTED_TIME,DIAGNOSIS,MEAL,ALLERGY,ADDRESS, LEAVE_TIME,DOCTOR_IN_CHARGE , DOCTOR_NAME, NURSE_IN_CHARGE, NURSE_NAME,H_CODE,TYPE FROM KYEEHIS.v_his_pats_discharge T WHERE T.PATIENT_ID=’6901544170
‘ AND T.VISIT_ID=5 UNION SELECT PATIENT_ID,TO_CHAR(VISIT_ID) AS VISIT_ID,INP_NO,WARD_CODE,WARD_NAME,DEPT_CODE,DEPT_NAME,PATIENT_NAME,SEX,TO_CHAR(BED_NO) AS BED_NO, BED_LABEL,DATE_OF_BIRTH,DECODE(NURSE_LEVEL,’特级护理’,’0′,’一级护理’,’1′,’二级护理’,’
2′,’三级护理’,’3′) NURSE_LEVEL,STATUS,CHARGE_TYPE,ADMISSION_DATE,DEPTED_TIME,DIAGNOSIS,MEAL,ALLERGY,ADDRESS,SYSDATE AS LEAVE_TIME,DOCTOR_IN_CHARGE , DOCTOR_NAME, NURSE_IN_CHARGE, NURSE_NAME,H_CODE,TYPE FROM KYEEHIS.V_HIS_PATS_IN_HOS
4139
select d.item_name,d.item_value and t.patient_id = ‘6900510081’_eval_report_detail_rec d
and t.visit_id = ‘3’ and d.enabled_value = ‘Y’sionRecords’
4061
10 rows selected.
三、总结与后续处理建议
1. 问题总结
移动护理数据库版本是ORACLE RAC 12.1.0.2.0,本次故障发生前出现内存不足相关ORA-04031以及后台进程LMD无影响等报错信息,并最终被LMHB监测进程终止了数据库实例的运行。检查数据库后台TRACE及AWR信息报告,可以发现问题发生前shared pool内在中主要有组件ges resource dynamic和ges enqueues占用了35GB内存;
参照ORACLE官方文档信息该版本存在BUG 21373473,会导致shared pool中的ges resource dynamic和ges enqueues等组件占用内存不断增多,会导致数据库LMD后台进程无响应,从而导致实例终止宕机,这与本次问题的理解及后台TRACE吻合,因此判断为此BUG导致的数据库宕机。
同时数据库中存在硬解析比例高的问题,在2.4章节中已经将数据库中TOP 10的硬解析语句进行了抓取提供,需要软件开发商配合对此类SQL进行优化及使用SQL绑定变量等改写。
2. 问题详细分析过程
1.故障时间状态:在2021年6月7日凌晨00:30 – 01:00,移动护理数据库shared pool内存池中的ges enqueues和ges resource dynamic组件分别达到了18GB和16.8GB;
移动护理数据库在正常运行状态时ges enqueues和ges resource dynamic分别大约为3G和1G。
2.故障相关报错信息:问题时段这两个内存组件占用的内存显著异常,导致Oracle内存池剩余内存不足,引起数据库中进程出现ORA-04031的内存错误及后台进程LMD进程无响应(ORA-29770: global enqueue process LMD1 (OSID 10617) is hung for more than 70 seconds)。
报错信息为:ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^1234″,”kglseshtTable”)
3.报错信息与BUG的对应匹配:以上信息与ORA-04031 Errors Occurring with High “ges resource dynamic” & “ges enqueues” Memory Usage In The Shared Pool (Doc ID 2063751.1)及ORA-4031 due to Huge “ges resource dynamic” and “ges enqueues” and Instance Terminates: ORA-29770: ‘SGA: allocation forcing component growth’ (Doc ID 1951758.1)文档中的BUG 21373473描述吻合:ORA-04031 errors occurring in a RAC instance with the trace file showing high memory usage for “ges resource dynamic” and “ges enqueues” memory in the shared pool. This can cause LMD processes to become unresponsive leading to an instance termination.
4.数据库实例被LMHB进程终止的原因:从11gR2版本,新增加LMHB进程,通过LMHB进程来定期监控lmon/lms/lmd/lck0等进程,如果LMHB进程发现某一个重要后台进程一段时间内(默认70秒)未更新自己的状态信息,会进行分析,如必要则通过重启本地节点的方式来解决问题。本次即为lmd进程超过70秒无响应后,LMHB对数据库进程进行了KILL,从而引起数据库宕机。
5. 后续处理及建议
- BUG问题的解决
根据Oracle官方参考文献,解决该问题有两种方式:
- 解决方案1
修改数据库参数,_GES_DIRECT_FREE_RES_TYPE=”CTARAHDXBB”,该参数是静态参数,需要重启数据库生效,可采取2个节点轮流修改的方法,减少业务停机时间,推荐采用此方案。
- 解决方案2
安装数据库补丁,补丁号Patch:21260431,需要安排业务停机时间安装补丁。
- 数据库硬解析比例高的解决
在2.4章节中已经将数据库中TOP 10的硬解析语句进行了抓取提供,需要软件开发商配合对此类SQL进行优化及使用SQL绑定变量等改写,从而提升系统性能,提高系统的稳定性。