Skip to content

典型的Ora-04031错误 – 记某人民医院核心系统故障处理纪要

概述

某医院与2021/10/22上午10点48分左右,出现应用程序无法响应,前台业务无法办理等情况。通过远程的方式接入数据库进行故障排除,快速解决了问题使得业务恢复正常使用。

问题及相关日志分析

1. 查看数据库日志

Fri Oct 22 10:27:29 2021

Errors in file /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/trace/emrdb1_ora_98357.trc  (incident=677598):

ORA-04031: ?T·¨·??? 4000 ×??úμ?12?í?ú′? (“shared pool”,”SELECT /*+ index(BL_BINGLIJS…”,”sga heap(4,0)”,”kglsim heap”)

Incident details in: /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/incident/incdir_677598/emrdb1_ora_98357_i677598.trc

Fri Oct 22 10:27:31 2021

Errors in file /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/trace/emrdb1_ora_46120.trc  (incident=675174):

ORA-04031: ?T·¨·??? 3896 ×??úμ?12?í?ú′? (“shared pool”,”select jilunr from bl_bingli…”,”sga heap(6,0)”,”kglsim object batch”)

Incident details in: /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/incident/incdir_675174/emrdb1_ora_46120_i675174.trc

Fri Oct 22 10:27:31 2021

Dumping diagnostic data in directory=[cdmp_20211022102731], requested by (instance=1, osid=98357), summary=[incident=677598].

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Fri Oct 22 10:27:33 2021

Sweep [inc][677598]: completed

Sweep [inc][675174]: completed

Sweep [inc2][677598]: completed

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Fri Oct 22 10:27:36 2021

Errors in file /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/trace/emrdb1_ora_85236.trc  (incident=675246):

ORA-04031: ?T·¨·??? 3896 ×??úμ?12?í?ú′? (“shared pool”,”update BL_BINGLIJLNR set Val…”,”sga heap(3,0)”,”kglsim object batch”)

Incident details in: /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/incident/incdir_675246/emrdb1_ora_85236_i675246.trc

Fri Oct 22 10:27:37 2021

Errors in file /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/trace/emrdb1_ora_81262.trc  (incident=675606):

ORA-04031: unable to allocate 4000 bytes of shared memory (“shared pool”,”DELETE BL_BINGANSY WHERE BIN…”,”sga heap(1,0)”,”kglsim heap”)

Incident details in: /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/incident/incdir_675606/emrdb1_ora_81262_i675606.trc

Fri Oct 22 10:27:40 2021

Errors in file /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/trace/emrdb1_ora_36572.trc  (incident=674934):

ORA-04031: ?T·¨·??? 3896 ×??úμ?12?í?ú′? (“shared pool”,”update BL_BINGCHENGJLNR set …”,”sga heap(2,0)”,”kglsim object batch”)

Incident details in: /u01/app/oracle/diag/rdbms/emrdb_new/emrdb1/incident/incdir_674934/emrdb1_ora_36572_i674934.trc

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

 

通过数据库alert日志发现,数据库在10:27开始发生了大量的ORA-04031错误。

2. 查看操作系统日志

Oct 22 00:06:07 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 00:36:08 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 01:06:08 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 01:21:51 cxemrdb01 auditd[7192]: Audit daemon rotating log files

Oct 22 01:36:17 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 02:08:17 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 02:49:54 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 03:09:42 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:1:7 —  1 2002.

Oct 22 03:22:03 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 03:22:03 cxemrdb01 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login t

o ensure your system is up-to-date.

Oct 22 03:35:22 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:0:12 —  1 2002.

Oct 22 03:46:37 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:0:7 —  1 2002.

Oct 22 03:52:03 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 03:56:21 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:1:6 —  1 2002.

Oct 22 03:57:05 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:1:9 —  1 2002.

Oct 22 03:57:48 cxemrdb01 kernel: qla2xxx [0000:0c:00.0]-801c:0: Abort command issued nexus=0:0:14 —  1 2002.

Oct 22 04:00:52 cxemrdb01 auditd[7192]: Audit daemon rotating log files

Oct 22 04:22:04 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 04:54:28 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 05:25:57 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 05:59:49 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 06:35:21 cxemrdb01 auditd[7192]: Audit daemon rotating log files

Oct 22 06:46:06 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 07:17:35 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 07:50:07 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 08:20:08 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 09:02:08 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 09:14:12 cxemrdb01 auditd[7192]: Audit daemon rotating log files

Oct 22 09:35:29 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 10:14:02 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 10:54:50 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 11:26:03 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 11:52:33 cxemrdb01 auditd[7192]: Audit daemon rotating log files

Oct 22 11:56:03 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 12:26:04 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 12:56:20 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 13:27:57 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 13:59:17 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 14:24:03 cxemrdb01 auditd[7192]: Audit daemon rotating log files

Oct 22 14:32:22 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 15:04:22 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 15:38:23 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 16:12:00 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 16:42:24 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 16:58:54 cxemrdb01 auditd[7192]: Audit daemon rotating log files

Oct 22 17:14:01 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

Oct 22 17:44:01 cxemrdb01 CLSD: The clock on host cxemrdb01 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.

操作系统中未发现内存不足等相关报错和告警信息。

3. 监听日志信息分析 

每小时的请求连接数

930 22-OCT-2021 00

797 22-OCT-2021 01

715 22-OCT-2021 02

719 22-OCT-2021 03

631 22-OCT-2021 04

671 22-OCT-2021 05

683 22-OCT-2021 06

1220 22-OCT-2021 07

1789 22-OCT-2021 08

1619 22-OCT-2021 09

2157 22-OCT-2021 10

8656 22-OCT-2021 11

654 22-OCT-2021 12

1085 22-OCT-2021 13

920 22-OCT-2021 14

855 22-OCT-2021 15

954 22-OCT-2021 16

417 22-OCT-2021 17

63 22-OCT-2021 18

10:00 – 10:59内每分钟连接数统计

97 22-OCT-2021 10:00

15 22-OCT-2021 10:01

29 22-OCT-2021 10:02

8 22-OCT-2021 10:03

12 22-OCT-2021 10:04

54 22-OCT-2021 10:05

13 22-OCT-2021 10:06

21 22-OCT-2021 10:07

13 22-OCT-2021 10:08

22 22-OCT-2021 10:09

50 22-OCT-2021 10:10

13 22-OCT-2021 10:11

6 22-OCT-2021 10:12

16 22-OCT-2021 10:13

21 22-OCT-2021 10:14

42 22-OCT-2021 10:15

18 22-OCT-2021 10:16

15 22-OCT-2021 10:17

19 22-OCT-2021 10:18

18 22-OCT-2021 10:19

45 22-OCT-2021 10:20

2 22-OCT-2021 10:21

12 22-OCT-2021 10:22

15 22-OCT-2021 10:23

17 22-OCT-2021 10:24

58 22-OCT-2021 10:25

18 22-OCT-2021 10:26

19 22-OCT-2021 10:27

26 22-OCT-2021 10:28

41 22-OCT-2021 10:29

109 22-OCT-2021 10:30

54 22-OCT-2021 10:31

62 22-OCT-2021 10:32

58 22-OCT-2021 10:33

45 22-OCT-2021 10:34

76 22-OCT-2021 10:35

37 22-OCT-2021 10:36

33 22-OCT-2021 10:37

45 22-OCT-2021 10:38

36 22-OCT-2021 10:39

79 22-OCT-2021 10:40

30 22-OCT-2021 10:41

41 22-OCT-2021 10:42

34 22-OCT-2021 10:43

55 22-OCT-2021 10:44

53 22-OCT-2021 10:45

28 22-OCT-2021 10:46

48 22-OCT-2021 10:47

65 22-OCT-2021 10:48

54 22-OCT-2021 10:49

72 22-OCT-2021 10:50

43 22-OCT-2021 10:51

30 22-OCT-2021 10:52

29 22-OCT-2021 10:53

20 22-OCT-2021 10:54

76 22-OCT-2021 10:55

28 22-OCT-2021 10:56

16 22-OCT-2021 10:57

27 22-OCT-2021 10:58

19 22-OCT-2021 10:59

 

查询得知,数据库未发现有短时间的大量连接。

4. 查看数据库运行信息

SQL> select inst_id,instance_name,to_char(startup_time,’yyyy-mm-dd hh24:mi:ss’) from gv$instance;

 

INST_ID INSTANCE_NAME    TO_CHAR(STARTUP_TIM

———- —————- ——————-

1 emrdb1           2021-10-22 11:11:06

2 emrdb2           2019-12-25 17:39:15

查询得知,emrdb实例子2019年12月25日启动一来,至今运行月1年10个月的时间,1节点实例今日处理故障时重启,在重启前时间与2节点一致。

5. 查看AWR报告

 

如上图,数据库中存在大量的无绑定变量的低效SQL,均占用1MB+的内存空间,列举几个SQL:

8mfvdk3v7xcnu

select * from bl_webwenshujl t left join v_zy_bingrenjbxx b on t.bingrenid = b.BINGRENID where b.DANGQIANBQ = ‘600700’ and b.CHUYUANRQ is null and (( t.mobanid = ‘CXZX003′ and t.zongfen > 44 and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX003’)) or ( t.mobanid = ‘CXZX008′ and t.zongfen > 11 and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX008’)))

 

gjv7strrc2mr6

select * from bl_webwenshujl t left join v_zy_bingrenjbxx b on t.bingrenid = b.BINGRENID where b.DANGQIANBQ = ‘600800’ and b.CHUYUANRQ is null and (( t.mobanid = ‘CXZX003′ and t.zongfen > 44 and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX003’)) or ( t.mobanid = ‘CXZX008′ and t.zongfen > 11 and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX008’)))

 

7gc2xtvupk227

select * from bl_webwenshujl t left join v_zy_bingrenjbxx b on t.bingrenid = b.BINGRENID where t.mobanid = ‘CXZX002’ and t.zongfen < 13 and b.DANGQIANBQ = ‘601800’ and b.CHUYUANRQ is null and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX002′)

 

3nh5uf64xy8sa

select * from bl_webwenshujl t left join v_zy_bingrenjbxx b on t.bingrenid = b.BINGRENID where t.mobanid = ‘CXZX002’ and t.zongfen < 13 and b.DANGQIANBQ = ‘600800’ and b.CHUYUANRQ is null and t.jilusj=(select max(jilusj) from bl_webwenshujl where bingrenid=t.bingrenid and mobanid=’CXZX002′)

 

6. 维护人员处理动作

由于检查只有1节点实例出现ORA-04031错误,而2节点实例正常,尝试手工刷新shared pool,但刷新失败。

 

为了尽快恢复业务,保证业务正常运行,对1节点实例进行了重启

Fri Oct 22 11:04:04 2021

Shutting down instance (immediate)

Stopping background process SMCO

Shutting down instance: further logons disabled

Stopping background process QMNC

Fri Oct 22 11:04:06 2021

Stopping background process CJQ0

Stopping background process MMNL

Stopping background process MMON

License high water mark = 1049

Fri Oct 22 11:05:02 2021

All dispatchers and shared servers shutdown

ALTER DATABASE CLOSE NORMAL

Fri Oct 22 11:05:05 2021

SMON: disabling tx recovery

Stopping background process RCBG

SMON: disabling cache recovery

 

总结与后续处理建议

1. 问题总结

  • 数据库2节点实例运行正常,1节点实例出现了ORA-04031错误;
  • 数据库自2019/12/25运行至今已达1年10个月之久,内存中存在大量的内存碎片;
  • 数据库中存在较多的无绑定变量SQL,无绑定变量的SQL均需要硬解析,消耗shared pool内存资源;
  • 重启数据库后,内存资源释放,内存刷新,内存碎片问题消失,业务恢复正常。

7. 处理操作

此次处理过程中,通过手工刷新shared pool失败,进而通过重启1节点实例的方式释放内存资源、重整内存碎片后,ORA-04031问题得以解决。