概述
某医院与2021/10/22上午10点48分左右,出现应用程序无法响应,前台业务无法办理等情况。通过远程的方式接入数据库进行故障排除,快速解决了问题使得业务恢复正常使用。
问题及相关日志分析
1. 查看数据库日志
Fri Oct 22 10:27:29 2021
Errors in file /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/trace/emrdb1_ora_98357.trc (incident=677598): ORA-04031: ?T·¨·??? 4000 ×??úμ?12?í?ú′? (“shared pool”,”SELECT /*+ index(BL_BINGLIJS…”,”sga heap(4,0)”,”kglsim heap”) Incident details in: /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/incident/incdir_677598/emrdb1_ora_98357_i677598.trc Fri Oct 22 10:27:31 2021 Errors in file /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/trace/emrdb1_ora_46120.trc (incident=675174): ORA-04031: ?T·¨·??? 3896 ×??úμ?12?í?ú′? (“shared pool”,”select jilunr from bl_bingli…”,”sga heap(6,0)”,”kglsim object batch”) Incident details in: /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/incident/incdir_675174/emrdb1_ora_46120_i675174.trc Fri Oct 22 10:27:31 2021 Dumping diagnostic data in directory=[cdmp_20211022102731], requested by (instance=1, osid=98357), summary=[incident=677598]. Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Fri Oct 22 10:27:33 2021 Sweep [inc][677598]: completed Sweep [inc][675174]: completed Sweep [inc2][677598]: completed Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Fri Oct 22 10:27:36 2021 Errors in file /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/trace/emrdb1_ora_85236.trc (incident=675246): ORA-04031: ?T·¨·??? 3896 ×??úμ?12?í?ú′? (“shared pool”,”update BL_BINGLIJLNR set Val…”,”sga heap(3,0)”,”kglsim object batch”) Incident details in: /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/incident/incdir_675246/emrdb1_ora_85236_i675246.trc Fri Oct 22 10:27:37 2021 Errors in file /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/trace/emrdb1_ora_81262.trc (incident=675606): ORA-04031: unable to allocate 4000 bytes of shared memory (“shared pool”,”DELETE BL_BINGANSY WHERE BIN…”,”sga heap(1,0)”,”kglsim heap”) Incident details in: /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/incident/incdir_675606/emrdb1_ora_81262_i675606.trc Fri Oct 22 10:27:40 2021 Errors in file /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/trace/emrdb1_ora_36572.trc (incident=674934): ORA-04031: ?T·¨·??? 3896 ×??úμ?12?í?ú′? (“shared pool”,”update BL_BINGCHENGJLNR set …”,”sga heap(2,0)”,”kglsim object batch”) Incident details in: /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/incident/incdir_674934/emrdb1_ora_36572_i674934.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details.
|
通过数据库alert日志发现,数据库在10:27开始发生了大量的ORA-04031错误。
2. 查看操作系统日志
Oct 22 00:06:07 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
Oct 22 00:36:08 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 01:06:08 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 01:21:51 cxemrdb01 auditd[7192]: Audit daemon rotating log files Oct 22 01:36:17 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 02:08:17 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 02:49:54 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 03:09:42 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:1:7 — 1 2002. Oct 22 03:22:03 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 03:22:03 cxemrdb01 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login t o ensure your system is up-to-date. Oct 22 03:35:22 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:0:12 — 1 2002. Oct 22 03:46:37 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:0:7 — 1 2002. Oct 22 03:52:03 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 03:56:21 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:1:6 — 1 2002. Oct 22 03:57:05 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:1:9 — 1 2002. Oct 22 03:57:48 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:0:14 — 1 2002. Oct 22 04:00:52 cxemrdb01 auditd[7192]: Audit daemon rotating log files Oct 22 04:22:04 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 04:54:28 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 05:25:57 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 05:59:49 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 06:35:21 cxemrdb01 auditd[7192]: Audit daemon rotating log files Oct 22 06:46:06 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 07:17:35 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 07:50:07 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 08:20:08 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 09:02:08 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 09:14:12 cxemrdb01 auditd[7192]: Audit daemon rotating log files Oct 22 09:35:29 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 10:14:02 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 10:54:50 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 11:26:03 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 11:52:33 cxemrdb01 auditd[7192]: Audit daemon rotating log files Oct 22 11:56:03 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 12:26:04 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 12:56:20 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 13:27:57 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 13:59:17 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 14:24:03 cxemrdb01 auditd[7192]: Audit daemon rotating log files Oct 22 14:32:22 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 15:04:22 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 15:38:23 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 16:12:00 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 16:42:24 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 16:58:54 cxemrdb01 auditd[7192]: Audit daemon rotating log files Oct 22 17:14:01 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. Oct 22 17:44:01 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time. |
操作系统中未发现内存不足等相关报错和告警信息。
3. 监听日志信息分析
每小时的请求连接数
930 22-OCT-2021 00 797 22-OCT-2021 01 715 22-OCT-2021 02 719 22-OCT-2021 03 631 22-OCT-2021 04 671 22-OCT-2021 05 683 22-OCT-2021 06 1220 22-OCT-2021 07 1789 22-OCT-2021 08 1619 22-OCT-2021 09 2157 22-OCT-2021 10 8656 22-OCT-2021 11 654 22-OCT-2021 12 1085 22-OCT-2021 13 920 22-OCT-2021 14 855 22-OCT-2021 15 954 22-OCT-2021 16 417 22-OCT-2021 17 63 22-OCT-2021 18 10:00 – 10:59内每分钟连接数统计 97 22-OCT-2021 10:00 15 22-OCT-2021 10:01 29 22-OCT-2021 10:02 8 22-OCT-2021 10:03 12 22-OCT-2021 10:04 54 22-OCT-2021 10:05 13 22-OCT-2021 10:06 21 22-OCT-2021 10:07 13 22-OCT-2021 10:08 22 22-OCT-2021 10:09 50 22-OCT-2021 10:10 13 22-OCT-2021 10:11 6 22-OCT-2021 10:12 16 22-OCT-2021 10:13 21 22-OCT-2021 10:14 42 22-OCT-2021 10:15 18 22-OCT-2021 10:16 15 22-OCT-2021 10:17 19 22-OCT-2021 10:18 18 22-OCT-2021 10:19 45 22-OCT-2021 10:20 2 22-OCT-2021 10:21 12 22-OCT-2021 10:22 15 22-OCT-2021 10:23 17 22-OCT-2021 10:24 58 22-OCT-2021 10:25 18 22-OCT-2021 10:26 19 22-OCT-2021 10:27 26 22-OCT-2021 10:28 41 22-OCT-2021 10:29 109 22-OCT-2021 10:30 54 22-OCT-2021 10:31 62 22-OCT-2021 10:32 58 22-OCT-2021 10:33 45 22-OCT-2021 10:34 76 22-OCT-2021 10:35 37 22-OCT-2021 10:36 33 22-OCT-2021 10:37 45 22-OCT-2021 10:38 36 22-OCT-2021 10:39 79 22-OCT-2021 10:40 30 22-OCT-2021 10:41 41 22-OCT-2021 10:42 34 22-OCT-2021 10:43 55 22-OCT-2021 10:44 53 22-OCT-2021 10:45 28 22-OCT-2021 10:46 48 22-OCT-2021 10:47 65 22-OCT-2021 10:48 54 22-OCT-2021 10:49 72 22-OCT-2021 10:50 43 22-OCT-2021 10:51 30 22-OCT-2021 10:52 29 22-OCT-2021 10:53 20 22-OCT-2021 10:54 76 22-OCT-2021 10:55 28 22-OCT-2021 10:56 16 22-OCT-2021 10:57 27 22-OCT-2021 10:58 19 22-OCT-2021 10:59 |
查询得知,数据库未发现有短时间的大量连接。
4. 查看数据库运行信息
SQL> select inst_id,instance_name,to_char(startup_time,’yyyy-mm-dd hh24:mi:ss’) from gv$instance;
INST_ID INSTANCE_NAME TO_CHAR(STARTUP_TIM ———- —————- ——————- 1 emrdb1 2021-10-22 11:11:06 2 emrdb2 2019-12-25 17:39:15 |
查询得知,emrdb实例子2019年12月25日启动一来,至今运行月1年10个月的时间,1节点实例今日处理故障时重启,在重启前时间与2节点一致。
5. 查看AWR报告
如上图,数据库中存在大量的无绑定变量的低效SQL,均占用1MB+的内存空间,列举几个SQL:
8mfvdk3v7xcnu
select * from bl_webwenshujl t left join v_zy_bingrenjbxx b on t.bingrenid = b.BINGRENID where b.DANGQIANBQ = ‘600700’ and b.CHUYUANRQ is null and (( t.mobanid = ‘CXZX003′ and t.zongfen > 44 and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX003’)) or ( t.mobanid = ‘CXZX008′ and t.zongfen > 11 and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX008’)))
gjv7strrc2mr6 select * from bl_webwenshujl t left join v_zy_bingrenjbxx b on t.bingrenid = b.BINGRENID where b.DANGQIANBQ = ‘600800’ and b.CHUYUANRQ is null and (( t.mobanid = ‘CXZX003′ and t.zongfen > 44 and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX003’)) or ( t.mobanid = ‘CXZX008′ and t.zongfen > 11 and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX008’)))
7gc2xtvupk227 select * from bl_webwenshujl t left join v_zy_bingrenjbxx b on t.bingrenid = b.BINGRENID where t.mobanid = ‘CXZX002’ and t.zongfen < 13 and b.DANGQIANBQ = ‘601800’ and b.CHUYUANRQ is null and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX002′)
3nh5uf64xy8sa select * from bl_webwenshujl t left join v_zy_bingrenjbxx b on t.bingrenid = b.BINGRENID where t.mobanid = ‘CXZX002’ and t.zongfen < 13 and b.DANGQIANBQ = ‘600800’ and b.CHUYUANRQ is null and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX002′) |
6. 维护人员处理动作
由于检查只有1节点实例出现ORA-04031错误,而2节点实例正常,尝试手工刷新shared pool,但刷新失败。
为了尽快恢复业务,保证业务正常运行,对1节点实例进行了重启
Fri Oct 22 11:04:04 2021
Shutting down instance (immediate) Stopping background process SMCO Shutting down instance: further logons disabled Stopping background process QMNC Fri Oct 22 11:04:06 2021 Stopping background process CJQ0 Stopping background process MMNL Stopping background process MMON License high water mark = 1049 Fri Oct 22 11:05:02 2021 All dispatchers and shared servers shutdown ALTER DATABASE CLOSE NORMAL Fri Oct 22 11:05:05 2021 SMON: disabling tx recovery Stopping background process RCBG SMON: disabling cache recovery |
总结与后续处理建议
1. 问题总结
- 数据库2节点实例运行正常,1节点实例出现了ORA-04031错误;
- 数据库自2019/12/25运行至今已达1年10个月之久,内存中存在大量的内存碎片;
- 数据库中存在较多的无绑定变量SQL,无绑定变量的SQL均需要硬解析,消耗shared pool内存资源;
- 重启数据库后,内存资源释放,内存刷新,内存碎片问题消失,业务恢复正常。
7. 处理操作
此次处理过程中,通过手工刷新shared pool失败,进而通过重启1节点实例的方式释放内存资源、重整内存碎片后,ORA-04031问题得以解决。