Skip to content

未分类 - 10. page

概述

医院数据库监控平台显示连接HIS数据库1节点异常,通过远程的方式接入数据库进行故障排除,快速解决了问题使得业务恢复正常使用。

问题及相关日志分析

1. 检查监听

通过2021/05/18的巡检报告中发现监听中存在1.80的IP地址,而此次检查发现监听中无1.80的IP地址。

 

2. 检查集群状态

检查集群状态,发现集群中多个资源状态显示OFFLINE。

[grid@cxhisdb02 ~]$ crsctl stat res -t

——————————————————————————–

NAME           TARGET  STATE        SERVER                   STATE_DETAILS

——————————————————————————–

Local Resources

——————————————————————————–

ora.DATA.dg

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.LISTENER.lsnr

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.OCR.dg

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.SSD.dg

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.asm

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02                Started

ora.gsd

OFFLINE OFFLINE      cxhisdb01

OFFLINE OFFLINE      cxhisdb02

ora.net1.network

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.ons

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.registry.acfs

ONLINE  OFFLINE      cxhisdb01

ONLINE  ONLINE       cxhisdb02

——————————————————————————–

Cluster Resources

——————————————————————————–

ora.LISTENER_SCAN1.lsnr

1        ONLINE  ONLINE       cxhisdb02

ora.cvu

1        ONLINE  OFFLINE

ora.cxhisdb01.vip

1        ONLINE  OFFLINE

ora.cxhisdb02.vip

1        ONLINE  ONLINE       cxhisdb02

ora.hospital.db

1        ONLINE  OFFLINE

2        ONLINE  ONLINE       cxhisdb02                Open

ora.oc4j

1        ONLINE  OFFLINE

ora.scan1.vip

1        ONLINE  ONLINE       cxhisdb02

 

 

3. 检查集群日志

仅有CRSD服务异常,其他集群资源均正常,因此数据库未宕机。

2021-05-26 13:23:46.059:

[crsd(145215)]CRS-1006:The OCR location +OCR is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:46.068:

[crsd(145215)]CRS-1006:The OCR location +OCR is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:56.293:

[/u01/app/11.2.0/grid/bin/oraagent.bin(66885)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/oraagent_grid’ disconnected from server. Details at (:CRSAGF00117:) {0:21:18} in

/u01/app/11.2.0/grid/log/cxhisdb01/agent/crsd/oraagent_grid/oraagent_grid.log.

2021-05-26 13:23:56.294:

[/u01/app/11.2.0/grid/bin/oraagent.bin(31320)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/oraagent_oracle’ disconnected from server. Details at (:CRSAGF00117:) {0:19:5060

3} in /u01/app/11.2.0/grid/log/cxhisdb01/agent/crsd/oraagent_oracle/oraagent_oracle.log.

2021-05-26 13:23:56.461:

[/u01/app/11.2.0/grid/bin/orarootagent.bin(145347)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/orarootagent_root’ disconnected from server. Details at (:CRSAGF00117:) {0:

5:1568} in /u01/app/11.2.0/grid/log/cxhisdb01/agent/crsd/orarootagent_root/orarootagent_root.log.

2021-05-26 13:23:56.485:

[/u01/app/11.2.0/grid/bin/scriptagent.bin(145549)]CRS-5822:Agent ‘/u01/app/11.2.0/grid/bin/scriptagent_grid’ disconnected from server. Details at (:CRSAGF00117:) {0:9:

68} in /u01/app/11.2.0/grid/log/cxhisdb01/agent/crsd/scriptagent_grid/scriptagent_grid.log.

2021-05-26 13:23:56.651:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:23:58.540:

[crsd(5795)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:58.548:

[crsd(5795)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:58.964:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:00.374:

[crsd(5834)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:00.382:

[crsd(5834)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:01.010:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:02.447:

[crsd(5886)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:02.455:

[crsd(5886)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:03.068:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:04.457:

[crsd(5909)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:04.465:

[crsd(5909)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:05.102:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:06.492:

[crsd(5937)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:06.501:

[crsd(5937)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:07.132:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:08.517:

[crsd(5986)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:08.525:

[crsd(5986)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:09.162:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:10.544:

[crsd(6015)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:10.552:

[crsd(6015)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:11.193:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:12.581:

[crsd(6051)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:12.589:

[crsd(6051)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:13.223:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:14.614:

[crsd(6070)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:14.622:

[crsd(6070)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:15.253:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:16.643:

[crsd(6090)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:16.650:

[crsd(6090)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage

]. Details at (:CRSD00111:) in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:24:17.284:

[ohasd(144192)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘cxhisdb01’.

2021-05-26 13:24:17.284:

[ohasd(144192)]CRS-2771:Maximum restart attempts reached for resource ‘ora.crsd’; will not restart.

2021-05-26 13:24:17.315:

[ohasd(144192)]CRS-2769:Unable to failover resource ‘ora.crsd’.

 

4. 检查ASM日志

ASM日志中2021/05/26 12:19:57显示OCR仲裁盘有问题,13:23出现读写错误。

Wed May 26 12:19:57 2021

WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 3 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 4 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 2 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 3 in group 3.

WARNING: Waited 15 secs for write IO to PST disk 4 in group 3.

Wed May 26 12:19:57 2021

NOTE: process _b000_+asm1 (160488) initiating offline of disk 0.1409468596 (OCR_0000) with mask 0x7e in group 3

NOTE: process _b000_+asm1 (160488) initiating offline of disk 2.1409468594 (OCR_0002) with mask 0x7e in group 3

NOTE: process _b000_+asm1 (160488) initiating offline of disk 3.1409468595 (OCR_0003) with mask 0x7e in group 3

NOTE: process _b000_+asm1 (160488) initiating offline of disk 4.1409468592 (OCR_0004) with mask 0x7e in group 3

NOTE: checking PST: grp = 3

GMON checking disk modes for group 3 at 15 for pid 46, osid 160488

ERROR: no read quorum in group: required 3, found 1 disks

NOTE: checking PST for grp 3 done.

NOTE: initiating PST update: grp = 3, dsk = 0/0x5402c8b4, mask = 0x6a, op = clear

NOTE: initiating PST update: grp = 3, dsk = 2/0x5402c8b2, mask = 0x6a, op = clear

NOTE: initiating PST update: grp = 3, dsk = 3/0x5402c8b3, mask = 0x6a, op = clear

NOTE: initiating PST update: grp = 3, dsk = 4/0x5402c8b0, mask = 0x6a, op = clear

GMON updating disk modes for group 3 at 16 for pid 46, osid 160488

ERROR: no read quorum in group: required 3, found 1 disks

Wed May 26 12:19:57 2021

NOTE: cache dismounting (not clean) group 3/0xA242386E (OCR)

NOTE: messaging CKPT to quiesce pins Unix process pid: 160495, image: oracle@cxhisdb01 (B001)

Wed May 26 12:19:57 2021

NOTE: halting all I/Os to diskgroup 3 (OCR)

Wed May 26 12:19:57 2021

NOTE: LGWR doing non-clean dismount of group 3 (OCR)

NOTE: LGWR sync ABA=15.85 last written ABA 15.85

WARNING: Offline for disk OCR_0000 in mode 0x7f failed.

WARNING: Offline for disk OCR_0002 in mode 0x7f failed.

WARNING: Offline for disk OCR_0003 in mode 0x7f failed.

WARNING: Offline for disk OCR_0004 in mode 0x7f failed.

Wed May 26 12:19:58 2021

kjbdomdet send to inst 2

detach from dom 3, sending detach message to inst 2

Wed May 26 12:19:58 2021

NOTE: No asm libraries found in the system

Wed May 26 12:19:58 2021

List of instances:

1 2

Dirty detach reconfiguration started (new ddet inc 1, cluster inc 4)

Global Resource Directory partially frozen for dirty detach

* dirty detach – domain 3 invalid = TRUE

2 GCS resources traversed, 0 cancelled

Dirty Detach Reconfiguration complete

Wed May 26 12:19:58 2021

WARNING: dirty detached from domain 3

NOTE: cache dismounted group 3/0xA242386E (OCR)

 

 

2021-05-26 13:23:46.059:

[crsd(145215)]CRS-1006:The OCR location +OCR is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

2021-05-26 13:23:46.068:

[crsd(145215)]CRS-1006:The OCR location +OCR is inaccessible. Details in /u01/app/11.2.0/grid/log/cxhisdb01/crsd/crsd.log.

 

 

问题处理步骤

由于数据库仅1节点OCR磁盘组访问异常,2节点以及其他盘访问正常,因而导致1节点CRS资源异常,且数据库未宕机,处理步骤需启动CRS资源,并且重启监听。

1. 1节点挂载OCR磁盘组

[root@cxhisdb01 ~]# su – grid

[grid@cxhisdb01 ~]$ sqlplus / as sysasm

 

SQL*Plus: Release 11.2.0.4.0 Production on Thu May 27 12:12:17 2021

 

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

 

 

Connected to:

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 – 64bit Production

With the Real Application Clusters and Automatic Storage Management options

 

SQL> alter diskgroup ocr mount;

 

Diskgroup altered.

 

SQL> exit

2. 启动CRS

[grid@cxhisdb01 ~]$ crsctl start res ora.crsd -init

CRS-2672: Attempting to start ‘ora.crsd’ on ‘cxhisdb01’

CRS-2676: Start of ‘ora.crsd’ on ‘cxhisdb01’ succeeded

3. 检查集群资源

[grid@cxhisdb01 ~]$ crsctl stat res -t

——————————————————————————–

NAME           TARGET  STATE        SERVER                   STATE_DETAILS

——————————————————————————–

Local Resources

——————————————————————————–

ora.DATA.dg

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.LISTENER.lsnr

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.OCR.dg

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.SSD.dg

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.asm

ONLINE  ONLINE       cxhisdb01                Started

ONLINE  ONLINE       cxhisdb02                Started

ora.gsd

OFFLINE OFFLINE      cxhisdb01

OFFLINE OFFLINE      cxhisdb02

ora.net1.network

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.ons

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

ora.registry.acfs

ONLINE  ONLINE       cxhisdb01

ONLINE  ONLINE       cxhisdb02

——————————————————————————–

Cluster Resources

——————————————————————————–

ora.LISTENER_SCAN1.lsnr

1        ONLINE  ONLINE       cxhisdb02

ora.cvu

1        ONLINE  ONLINE       cxhisdb01

ora.cxhisdb01.vip

1        ONLINE  ONLINE       cxhisdb01

ora.cxhisdb02.vip

1        ONLINE  ONLINE       cxhisdb02

ora.hospital.db

1        ONLINE  ONLINE       cxhisdb01                Open

2        ONLINE  ONLINE       cxhisdb02                Open

ora.oc4j

1        ONLINE  ONLINE       cxhisdb01

ora.scan1.vip

1        ONLINE  ONLINE       cxhisdb02

4. 重启监听

[grid@cxhisdb01 ~]$ srvctl stop listener -n cxhisdb01

[grid@cxhisdb01 ~]$ srvctl start listener -n cxhisdb01

[grid@cxhisdb01 ~]$ lsnrctl status

 

LSNRCTL for Linux: Version 11.2.0.4.0 – Production on 27-MAY-2021 12:20:44

 

Copyright (c) 1991, 2013, Oracle.  All rights reserved.

 

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER)))

STATUS of the LISTENER

————————

Alias                     LISTENER

Version                   TNSLSNR for Linux: Version 11.2.0.4.0 – Production

Start Date                27-MAY-2021 12:20:40

Uptime                    0 days 0 hr. 0 min. 4 sec

Trace Level               off

Security                  ON: Local OS Authentication

SNMP                      OFF

Listener Parameter File   /u01/app/11.2.0/grid/network/admin/listener.ora

Listener Log File         /u01/app/grid/diag/tnslsnr/cxhisdb01/listener/alert/log.xml

Listening Endpoints Summary…

(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))

(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.80)(PORT=1521)))

(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.10)(PORT=1521)))

Services Summary…

Service “hospital” has 1 instance(s).

Instance “hospital1”, status READY, has 1 handler(s) for this service…

Service “hospitalXDB” has 1 instance(s).

Instance “hospital1”, status READY, has 1 handler(s) for this service…

The command completed successfully

[grid@cxhisdb01 ~]$ cd /u01/app/grid/diag/tnslsnr/cxhisdb01/listener/trace/

[grid@cxhisdb01 trace]$ tail -f listener.log

27-MAY-2021 12:20:51 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVICE_NAME=hospital)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.171)(PORT=62045)) * establish * hospital * 0

27-MAY-2021 12:20:51 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=ZY-603300-02-YS)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.9.58)(PORT=3676)) * establish * hospital * 0

27-MAY-2021 12:20:51 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=ZY-603300-02-YS)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.9.58)(PORT=3677)) * establish * hospital * 0

27-MAY-2021 12:20:55 * (CONNECT_DATA=(SERVICE_NAME=HOSPITAL)(CID=(PROGRAM=oracle)(HOST=lis-server)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.11)(PORT=23431)) * establish * HOSPITAL * 0

27-MAY-2021 12:20:55 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=JZ-EK-001)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.3.99)(PORT=4442)) * establish * hospital * 0

27-MAY-2021 12:20:55 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=JZ-EK-001)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.3.99)(PORT=4444)) * establish * hospital * 0

27-MAY-2021 12:20:56 * (CONNECT_DATA=(SID=hospital1)(CID=(PROGRAM=配置数据库.exe)(HOST=YYJQZZJFW)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.244)(PORT=58397)) * establish * hospital1 * 0

27-MAY-2021 12:20:56 * (CONNECT_DATA=(SERVICE_NAME=HOSPITAL)(CID=(PROGRAM=oracle)(HOST=lis-server)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.11)(PORT=23432)) * establish * HOSPITAL * 0

27-MAY-2021 12:20:56 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=ZY-603300-02-YS)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.9.58)(PORT=3681)) * establish * hospital * 0

27-MAY-2021 12:20:57 * service_update * hospital1 * 0

27-MAY-2021 12:20:58 * (CONNECT_DATA=(SERVICE_NAME=HOSPITAL)(CID=(PROGRAM=oracle)(HOST=lis-server)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.11)(PORT=23438)) * establish * HOSPITAL * 0

27-MAY-2021 12:20:59 * (CONNECT_DATA=(SERVICE_NAME=hospital)(CID=(PROGRAM=e:\zjhis\电子病历PB9\emrproject.exe)(HOST=JZ-EK-001)(USER=his))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.3.99)(PORT=4460)) * establish * hospital * 0

27-MAY-2021 12:21:00 * service_update * hospital1 * 0

Thu May 27 12:21:01 2021

27-MAY-2021 12:21:01 * (CONNECT_DATA=(SID=hospital1)(CID=(PROGRAM=配置数据库.exe)(HOST=YYJQZZJFW)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.244)(PORT=58398)) * establish * hospital1 * 0

27-MAY-2021 12:21:03 * service_update * hospital1 * 0

27-MAY-2021 12:21:03 * (CONNECT_DATA=(SERVICE_NAME=HOSPITAL)(CID=(PROGRAM=oracle)(HOST=lis-server)(USER=Administrator))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.2.1.11)(PORT=23444)) * establish * HOSPITAL * 0

 

总结与后续处理建议

1. 问题总结

HIS数据库1节点访问OCR仲裁盘异常导致CRSD异常,进而引起1节点上多个集群资源offline,监听资源中相较于以前也少了192.2.1.80的IP,因此监控平台显示连接HIS数据库1节点异常,由于2节点和其他磁盘组无异常,因此数据库未发生宕机现象。

2. 处理操作

  • HIS数据库1节点手工mount OCR磁盘组
  • 启动CRS资源
  • 1节点重启监听资源

3. 后续建议

此次故障发生是由于集群资源访问OCR磁盘组异常导致,建议联系存储工程师,排查相关时间点存储运行情况,并定期做好存储巡检和状态监控工作。

Oracle voting disk 故障处理一例

概述

某医院客户于2021年11月5日上午9时左右出现HIS数据库连接异常的情况,在此情况下通过远程的方式接入数据库进行故障排除,快速解决了问题使得业务恢复正常使用。

问题及相关日志分析

1. 问题现状

前台业务反映HIS数据库时而连得上时而连不上,连不上时候报错ORA-12516,报错内容截图如下:

 

由于业务通过节点1的vip(192.2.1.10)连接数据库,测试节点1的vip连接,结果如下:

[oracle@cxhisdb01 ~]$ sqlplus a/b@192.2.1.10:1521/hospital

 

SQL*Plus: Release 11.2.0.4.0 Production on Fri Nov 5 09:45:15 2021

 

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

 

ERROR:

ORA-12516: TNS:listener could not find available handler with matching protocol

stack

 

2. 检查数据库alert日志

ORA-00020: maximum number of processes (4500) exceeded

ORA-20 errors will not be written to the alert log for

the next minute. Please look at trace files to see all

the ORA-20 errors.

Fri Nov 05 09:35:48 2021

Thread 1 advanced to log sequence 191580 (LGWR switch)

Current log# 1 seq# 191580 mem# 0: +SSD/hospital/onlinelog/redo01.log

Fri Nov 05 09:35:48 2021

LNS: Standby redo logfile selected for thread 1 sequence 191580 for destination LOG_ARCHIVE_DEST_4

Fri Nov 05 09:35:48 2021

LNS: Standby redo logfile selected for thread 1 sequence 191580 for destination LOG_ARCHIVE_DEST_3

Fri Nov 05 09:35:48 2021

LNS: Standby redo logfile selected for thread 1 sequence 191580 for destination LOG_ARCHIVE_DEST_2

Fri Nov 05 09:35:48 2021

Archived Log entry 904316 added for thread 1 sequence 191579 ID 0x431aef8a dest 1:

Fri Nov 05 09:37:30 2021

ORA-00020: maximum number of processes (4500) exceeded

ORA-20 errors will not be written to the alert log for

the next minute. Please look at trace files to see all

the ORA-20 errors.

alert中出现大量的ORA-00020错误告警,数据库连接数达到4500上限。

3. 检查数据库参数和连接

SQL> show parameter process;

 

NAME                                 TYPE        VALUE

———————————— ———– ——————————

aq_tm_processes                      integer     1

cell_offload_processing              boolean     TRUE

db_writer_processes                  integer     10

gcs_server_processes                 integer     4

global_txn_processes                 integer     1

job_queue_processes                  integer     1000

log_archive_max_processes            integer     4

processes                            integer     4500

processor_group_name                 string

 

SQL> select inst_id,count(*) from gv$session group by inst_id;

 

INST_ID   COUNT(*)

———- ———-

1       4496

2        715

查询得知,1节点连接数达到4496,已经达到连接数4500的上限,因此前端业务出现连接不上的情况。

4. 定位异常连接

SQL> select program,machine,count(*)

2  from gv$session group by program,machine,sql_id order by 3;

 

 

PROGRAM                                          MACHINE       COUNT(*)

——————————– ————————–  ———-

w3wp.exe                         WORKGROUP\SXZZ_SERVER_CX            10

Medinfo.Quartz.DbJob.exe         WORKGROUP\QUANGUO-NB                10

yszj.exe                         WORKGROUP\MZ-3F-EBHYS7              10

yszj.exe                         WORKGROUP\MZ-2F-NK2YS4              10

yszj.exe                         WORKGROUP\MZ1F-EKYS7                10

emrproject.exe                   WORKGROUP\ZY1-2X-YS2                10

 

PROGRAM                          MACHINE                       COUNT(*)

——————————– ————————–  ———-

yszj.exe                         WORKGROUP\MZ-2F-WKYS9               11

yszj.exe                         WORKGROUP\MZ-2F-FKYS13              11

emrjz.exe                        WORKGROUP\JZ-QJS-HS1                11

yzxt.exe                         WORKGROUP\ZY-604200-01-HS           11

yszj.exe                         WORKGROUP\MZ-2F-GKYS3               12

yszj.exe                         WORKGROUP\MZ-3F-KQYS6               12

yzxt.exe                         WORKGROUP\ZY-609400-02-HS           12

HisServer.exe                    WORKGROUP\YYGHSERV                  12

yszj.exe                         WORKGROUP\GRK-GBMZ-01               12

oracle@lis-server (TNS V1-V3)    lis-server                          12

yszj.exe                         WORKGROUP\MZ-2F-WKYS10              13

 

PROGRAM                          MACHINE                       COUNT(*)

——————————– ————————–  ———-

yszj.exe                         WORKGROUP\MZ-1F-EKYS6               13

yszj.exe                         WORKGROUP\MZ-NKJ-01                 13

yzxt.exe                         WORKGROUP\ZY-ZHNK1-HS-04            13

yszj.exe                         WORKGROUP\MZ-3F-YKYS2               13

yszj.exe                         WORKGROUP\M3-YK-YS1                 14

yszj.exe                         WORKGROUP\MZ-2F-FKYS-01             14

yszj.exe                         WORKGROUP\MZ-3F-YKYS4               14

yszj.exe                         WORKGROUP\MZ-2F-FKYS10              14

yszj.exe                         WORKGROUP\MZ-3F-TJWKYS1             15

yszj.exe                         WORKGROUP\MZ-3F-WKYS5               15

MAF.exe                          WORKGROUP\QUYUSC1                   16

 

PROGRAM                          MACHINE                       COUNT(*)

——————————– ————————–  ———-

yszj.exe                         WORKGROUP\MZ-3F-WKYS3               16

yzxt.exe                         WORKGROUP\HIS-CXYY                  16

yszj.exe                         WORKGROUP\MZ-2F-NK2YS05             16

yszj.exe                         WORKGROUP\MZ-2F-WKYS5               17

yszj.exe                         WORKGROUP\MZ-3F-WKYS4               17

yszj.exe                         WORKGROUP\MZ-3F-KQKYS2              17

w3wp.exe                         WORKGROUP\WX_DDJY                   17

MediII.Adapter.WinForm.Scanner_V WORKGROUP\YYQZ                      18

yszj.exe                         WORKGROUP\MZ-1F-EKYS4               18

yszj.exe                         WORKGROUP\MZ-2F-NK2YS6              18

emrproject.exe                   WORKGROUP\3020-CXYY                 19

 

PROGRAM                          MACHINE                       COUNT(*)

——————————– ————————–  ———-

yszj.exe                         WORKGROUP\MZ-2F-GK24                19

yszj.exe                         WORKGROUP\ZY-609400-02-HS           19

yszj.exe                         WORKGROUP\MZ-GK-25                  20

emrproject.exe                   WORKGROUP\HIS-PC                    20

yszj.exe                         WORKGROUP\ZY-609400-02-YS           21

yszj.exe                         WORKGROUP\MZ-GUKE-YS04              21

yszj.exe                         WORKGROUP\MZ-2F-FKYS9               22

yszj.exe                         MSHOME\JZ-WKYS2                     22

yszj.exe                         WORKGROUP\MZ-2F-NK2YS-01            22

w3wp.exe                         WORKGROUP\YINYIQIANZHI1             23

yszj.exe                         WORKGROUP\MZ-2F-WKYS4               23

 

PROGRAM                          MACHINE                       COUNT(*)

——————————– ————————–  ———-

yszj.exe                         WORKGROUP\MZ-2F-GKYS2               23

yszj.exe                         WORKGROUP\MZ-2F-FKYS6               24

yszj.exe                         WORKGROUP\MZ-2F-WKYS6               25

yszj.exe                         WORKGROUP\MZ-XEK7                   25

yszj.exe                         WORKGROUP\MZ-3F-EBYHYS2             25

yszj.exe                         WORKGROUP\MZ-1F-EKYS8               26

yszj.exe                         WORKGROUP\MZ-3F-ZYYS-8              26

yszj.exe                         WORKGROUP\MZ-2F-FKYS3               27

yszj.exe                         WORKGROUP\MZ-2F-WKYS2               27

yszj.exe                         WORKGROUP\MZ-2F-WKYS3               28

yszj.exe                         WORKGROUP\MZ-2F-NK2YS2              28

 

PROGRAM                          MACHINE                       COUNT(*)

——————————– ————————–  ———-

yszj.exe                         WORKGROUP\MZ-2F-NK2-YS09            29

oracle@cxemrdb01 (TNS V1-V3)     cxemrdb01                           30

yszj.exe                         WORKGROUP\MZ-2F-NK2-9               30

yszj.exe                         WORKGROUP\MZ-3F-ZYYS2               30

yszj.exe                         WORKGROUP\MZ-3F-EBHYS6              30

yszj.exe                         WORKGROUP\MZ-2F-GKYS21              31

yszj.exe                         WORKGROUP\MZ-WAIKE-YS12             33

yzxt.exe                         WORKGROUP\ZY-7D-HS3                 33

yszj.exe                         WORKGROUP\MZ-2F-FKYS5               34

yszj.exe                         WORKGROUP\MZ-2F-NK1YS3              34

yszj.exe                         WORKGROUP\DESKTOP-VI7O6SC           36

 

PROGRAM                          MACHINE                       COUNT(*)

——————————– ————————–  ———-

yszj.exe                         WORKGROUP\MZ-2F-WKYS1               38

yszj.exe                         WORKGROUP\MZ-NK2-07                 39

yszj.exe                         WORKGROUP\MZ-2F-FKYS11              42

yszj.exe                         WORKGROUP\MZ-2F-NK1YS5              43

yszj.exe                         WORKGROUP\JZ-EK-001                 45

yszj.exe                         WORKGROUP\LHMZ_YS2                  46

yszj.exe                         WORKGROUP\MZ-2F-NK1YS2              50

w3wp.exe                         WORKGROUP\HIS-YIYAO                 50

yszj.exe                         WORKGROUP\MZ-2F-WKYS11              56

yszj.exe                         WORKGROUP\LHMZ_YS1                  56

yszj.exe                         WORKGROUP\MZ-2F-NK1YS14             58

 

PROGRAM                          MACHINE                       COUNT(*)

——————————– ————————–  ———-

yszj.exe                         WORKGROUP\MZ-2F-NK1YS13             62

yszj.exe                         WORKGROUP\JZ-NK-2                   64

yszj.exe                         WORKGROUP\MZ-2F-WKYS007             66

yszj.exe                         WORKGROUP\MZ-2F-FKYS4               66

yszj.exe                         WORKGROUP\MZ-2F-NKYS7               69

yszj.exe                         WORKGROUP\MZ-2F-NK1YS4             103

yszj.exe                         WORKGROUP\MZ-PAOZHEN-01            108

oracle@lis-server (TNS V1-V3)    lis-server                         136

oracle@cxemrdb01 (TNS V1-V3)     cxemrdb01                          395

查询得知,yszj.exe应用程序对HIS数据库建立了较高数量的连接,其中主机名为“MZ-2F”开头的客户端尤为明显。

 

 

 

 

 

 

 

 

 

 

 

 

三、问题处理步骤

1. kill session

查询yszj.exe应用程序的进程ID

SQL> select b.spid

2  from v$session a,v$process b

3  where a.paddr=b.addr

4  and a.program=’yszj.exe’;

5  and a.machine like ‘MZ-2F%’;

 

SPID

————————

70593

72017

76189

77305

78409

78481

78636

78676

79541

80183

80816

杀会话

[oracle@cxhisdb01 ~]$ kill -9 70593

[oracle@cxhisdb01 ~]$ kill -9 72017

[oracle@cxhisdb01 ~]$ kill -9 76189

[oracle@cxhisdb01 ~]$ kill -9 77305

[oracle@cxhisdb01 ~]$ kill -9 78409

[oracle@cxhisdb01 ~]$ kill -9 78481

[oracle@cxhisdb01 ~]$ kill -9 78636

[oracle@cxhisdb01 ~]$ kill -9 78676

[oracle@cxhisdb01 ~]$ kill -9 79541

[oracle@cxhisdb01 ~]$ kill -9 80183

[oracle@cxhisdb01 ~]$ kill -9 80816

执行kill session后,效果不明显,因为会话查杀后,又有新的会话连接进来。

2. 客户沟通

通过与客户沟通了解到,YZSJ.EXE程序新上了检查报告互认功能,会导致程序连接不释放问题,导致程序到数据库连接数不断增加,从而数据库连接数达到上限。

3. 释放进程

客户安排前台业务程序主机重启,释放连接进程,业务端主机重启后,检查连接情况如下:

SQL> select inst_id,count(*) from gv$session group by inst_id;

 

INST_ID   COUNT(*)

———- ———-

1       3427

2        681

 

SQL> select program,machine,count(*)

2  from gv$session  where program=’yszj.exe’ group by program,machine order by 3;

 

PROGRAM         MACHINE                          COUNT(*)

————— —————————— ———-

yszj.exe        WORKGROUP\XZ-YGK-07                     1

yszj.exe        WORKGROUP\YJ-NKJ-MZFT                   1

yszj.exe        WORKGROUP\MZ-3F-WKYS4                   2

yszj.exe        WORKGROUP\RLS                           2

yszj.exe        WORKGROUP\MZ-3F-KQKYS2                  2

yszj.exe        WORKGROUP\MZ-2F-GKYS3                   2

yszj.exe        WORKGROUP\MZ-FWT-01                     2

yszj.exe        WORKGROUP\FDHJ54365                     2

yszj.exe        WORKGROUP\MZ-2F-WKYS2                   2

yszj.exe        WORKGROUP\MZ-3F-KQK7                    2

yszj.exe        WORKGROUP\MZ-2F-WKYS9                   2

 

PROGRAM         MACHINE                          COUNT(*)

————— —————————— ———-

yszj.exe        WORKGROUP\MZ-2F-NK1-YS6                 2

yszj.exe        WORKGROUP\MZ-2F-FKYS9                   2

yszj.exe        WORKGROUP\MZ-2F-GKYS2                   2

yszj.exe        WORKGROUP\MZ-2F-WKYS6                   2

yszj.exe        WORKGROUP\CXYY-YZ2                      2

yszj.exe        WORKGROUP\MZ-2F-FKYS13                  2

yszj.exe        WORKGROUP\MZ-KQK-YS-1                   2

yszj.exe        WORKGROUP\MZ-2F-NK1YS13                 2

yszj.exe        WORKGROUP\ZY-XYJHZX-01-YS               2

yszj.exe        WORKGROUP\MZ-2F-WKYS8                   2

yszj.exe        WORKGROUP\MZ-2F-FKYS4                   2

 

PROGRAM         MACHINE                          COUNT(*)

————— —————————— ———-

yszj.exe        WORKGROUP\MZ-2F-NK1YS4                  2

yszj.exe        WORKGROUP\MZ-2F-WKYS3                   2

yszj.exe        WORKGROUP\MZ-2F-NKYS7                   2

yszj.exe        WORKGROUP\MZ-3F-WKYS5                   2

yszj.exe        WORKGROUP\MZ-2F-WKYS007                 2

yszj.exe        WORKGROUP\GAMZ-01                       2

yszj.exe        WORKGROUP\MZ-WJ3                        2

yszj.exe        WORKGROUP\MZ-3F-PFYS3                   2

yszj.exe        WORKGROUP\ZY-609400-03-HS               2

yszj.exe        WORKGROUP\MZ-2F-WKYS1                   2

yszj.exe        WORKGROUP\ZY1-4X-YS5                    2

 

PROGRAM         MACHINE                          COUNT(*)

————— —————————— ———-

yszj.exe        WORKGROUP\MZ-2F-FKYS7                   2

yszj.exe        WORKGROUP\MZ-3F-TJWKYS1                 2

yszj.exe        WORKGROUP\ZY-601700-05-YS               2

yszj.exe        WORKGROUP\MZ-PICC-02                    2

yszj.exe        WORKGROUP\JZ-EK-001                     2

yszj.exe        WORKGROUP\FSK-SSZX-02                   2

yszj.exe        WORKGROUP\ZY1-5D-YS4                    2

yszj.exe        WORKGROUP\MZ-3F-KFYJT                   2

yszj.exe        WORKGROUP\ZY-XSR-YS5                    2

yszj.exe        WORKGROUP\MZ-ZY-YS9                     2

yszj.exe        WORKGROUP\MZ-3F-EBYHYS2                 2

 

PROGRAM         MACHINE                          COUNT(*)

————— —————————— ———-

yszj.exe        WORKGROUP\MZ-2F-FKYS15                  2

yszj.exe        WORKGROUP\MZ-2F-FK12                    2

yszj.exe        WORKGROUP\MZ-2F-NK1YS5                  2

yszj.exe        WORKGROUP\MZ-WK-SKZK                    2

yszj.exe        WORKGROUP\MZ-2F-NK1YS-17                2

yszj.exe        WORKGROUP\MZ-2F-FKYS10                  2

yszj.exe        WORKGROUP\MZ-2F-GK24                    2

yszj.exe        WORKGROUP\MZ-2F-NK2YS10                 2

yszj.exe        WORKGROUP\MZ-KOUQIANG-Y11               2

yszj.exe        WORKGROUP\JZ-EK-YS-06                   2

yszj.exe        WORKGROUP\MZ-NKJ-01                     2

 

PROGRAM         MACHINE                          COUNT(*)

————— —————————— ———-

yszj.exe        WORKGROUP\MZ-2F-NK2-9                   2

yszj.exe        WORKGROUP\MZ-3F-ZYYS2                   2

yszj.exe        WORKGROUP\MZ-3F-EBHYS6                  2

yszj.exe        MSHOME\JZ-QJ-04                         2

yszj.exe        WORKGROUP\MZ-3F-YKYS2                   2

yszj.exe        WORKGROUP\MZ-3F-ZY-YS3-                 2

yszj.exe        WORKGROUP\MZ-2F-NK1YS2                  2

yszj.exe        WORKGROUP\MZ-KOUQIANG-Y13               2

yszj.exe        WORKGROUP\MZ-3F-ZYYS10                  2

yszj.exe        WORKGROUP\JZ-EBH-01                     2

yszj.exe        WORKGROUP\JZ-GK-YS1                     2

 

PROGRAM         MACHINE                          COUNT(*)

————— —————————— ———-

yszj.exe        WORKGROUP\MZ-NK1-YS-017                 2

yszj.exe        WORKGROUP\MZ1F-EKYS7                    2

yszj.exe        WORKGROUP\JZ-NK-2                       2

yszj.exe        WORKGROUP\MZ-2F-FKYS11                  2

yszj.exe        WORKGROUP\MZ-2F-FKYS6                   2

yszj.exe        WORKGROUP\MZ-2F-FKYS-01                 2

yszj.exe        WORKGROUP\MZ-2F-NK1YS3                  2

yszj.exe        WORKGROUP\ZY2-11F-YS2                   2

yszj.exe        WORKGROUP\MZ-WAIKE-YS12                 2

yszj.exe        WORKGROUP\MZ-2F-NK2-YS09                2

yszj.exe        WORKGROUP\MZ-FWT-03                     2

 

PROGRAM         MACHINE                          COUNT(*)

————— —————————— ———-

yszj.exe        WORKGROUP\MZ-2F-NK2YS6                  2

yszj.exe        WORKGROUP\MZ-1F-EKYS6                   2

yszj.exe        WORKGROUP\JZ_XIONGTONG                  2

yszj.exe        WORKGROUP\MZ-3F-KQYS6                   2

yszj.exe        WORKGROUP\MZ-3F-YK13                    2

yszj.exe        WORKGROUP\ZY-TXBQ-YYK                   2

yszj.exe        WORKGROUP\LXZL-205                      2

yszj.exe        WORKGROUP\MZ-NK-11-YS                   2

yszj.exe        WORKGROUP\MZ-3F-PFK-2                   2

yszj.exe        WORKGROUP\MZ-NK-YS8                     2

yszj.exe        WORKGROUP\MZ-2F-NK2YS2                  2

 

PROGRAM         MACHINE                          COUNT(*)

————— —————————— ———-

yszj.exe        WORKGROUP\MZ-3F-KQKYS4                  2

yszj.exe        WORKGROUP\MZ-2F-NK2YS-01                2

yszj.exe        WORKGROUP\MZ-2F-NK1YS14                 2

yszj.exe        WORKGROUP\MZ-3F-ZYYS1                   2

yszj.exe        WORKGROUP\FRMZ-YS-03                    2

yszj.exe        WORKGROUP\MZ-NKZJ-01                    2

yszj.exe        WORKGROUP\MZ-2F-NK2YS3                  2

yszj.exe        WORKGROUP\MZ-2F-NK2YS05                 2

yszj.exe        WORKGROUP\MZ-XEK7                       2

yszj.exe        WORKGROUP\MZ-ZJNK-YS8-A                 2

yszj.exe        WORKGROUP\MZ-ZHFW-05                    2

 

PROGRAM         MACHINE                          COUNT(*)

————— —————————— ———-

yszj.exe        WORKGROUP\MZ-2F-FKYS3                   2

yszj.exe        WORKGROUP\ZY2-6F-ZLNKYS01               2

yszj.exe        WORKGROUP\MZ-FWT-04                     2

yszj.exe        WORKGROUP\FRMZ_YS2                      2

yszj.exe        WORKGROUP\MZ-3F-WKYS3                   2

yszj.exe        WORKGROUP\JZ-QJS-YS3                    2

yszj.exe        WORKGROUP\ZY-703700-05-YS               2

yszj.exe        WORKGROUP\TSZX-02                       3

yszj.exe        WORKGROUP\MZ-3F-KQXFS1                  3

yszj.exe        WORKGROUP\DESKTOP-VI7O6SC               3

yszj.exe        WORKGROUP\ZY-600400-06-YS               3

 

PROGRAM         MACHINE                          COUNT(*)

————— —————————— ———-

yszj.exe        WORKGROUP\FSK-DPS-08                    4

yszj.exe        WORKGROUP\MZ-3F-ZYYS5                   4

yszj.exe        MSHOME\JZ-WKYS01                        4

yszj.exe        WORKGROUP\SF-YWS-YS1                    5

yszj.exe        WORKGROUP\MZ-3F-ZYYS4                   5

yszj.exe        WORKGROUP\ZY1-1X-YS                     5

yszj.exe        WorkGroup\USER-2021SXCVV                5

yszj.exe        WORKGROUP\MZ2-NJZX-001                  5

yszj.exe        WORKGROUP\ZY-GK3-HS4                    6

yszj.exe        WORKGROUP\MZ-GB-CDMZ                    6

yszj.exe        WORKGROUP\MZ-2F-NK2YS11                 8

 

PROGRAM         MACHINE                          COUNT(*)

————— —————————— ———-

yszj.exe        WORKGROUP\MZ-3L-ZYNK-007                9

yszj.exe        WORKGROUP\MZ-3F-ZJK1                   11

yszj.exe        WORKGROUP\MZ-3F-KQKYS5                 14

yszj.exe        WORKGROUP\HIS-PC                       15

yszj.exe        WORKGROUP\MZ-2F-WKYS10                 16

yszj.exe        WORKGROUP\GRK-GBMZ-01                  16

yszj.exe        WORKGROUP\MZ-PIFU-YS3                  18

yszj.exe        WORKGROUP\ZY-609400-02-YS              21

yszj.exe        WORKGROUP\MZ-3F-KQKYS3                 22

yszj.exe        WORKGROUP\MZ-PAOZHEN-01                23

yszj.exe        WORKGROUP\MZ-GUKE-YS04                 23

 

PROGRAM         MACHINE                          COUNT(*)

————— —————————— ———-

yszj.exe        WORKGROUP\MZ-FR-EK-1                   24

yszj.exe        WORKGROUP\ZY-609400-02-HS              25

yszj.exe        WORKGROUP\MZ-3F-PFKYS24                25

yszj.exe        WORKGROUP\MZ-2F-WKYS5                  34

yszj.exe        WORKGROUP\MZ-3F-YKYS4                  34

yszj.exe        WORKGROUP\MZ-2F-WKYS4                  35

yszj.exe        WORKGROUP\MZ-1F-EKYS8                  35

yszj.exe        MSHOME\JZ-WKYS2                        35

yszj.exe        WORKGROUP\MZ-3F-ZYYS-8                 37

yszj.exe        WORKGROUP\MZ-1F-EKYS4                  45

yszj.exe        WORKGROUP\LHMZ_YS2                     64

 

PROGRAM         MACHINE                          COUNT(*)

————— —————————— ———-

yszj.exe        WORKGROUP\LHMZ_YS1                     78

yszj.exe        WORKGROUP\MZ-2F-WKYS11                 87

 

145 rows selected.

通过查询得知,前端业务主机重启后,释放了一部分进程,节点1连接数已经下降至3400+,前端业务连接正常,未发现ORA-12516错误。

 

 

总结与后续处理建议

1. 问题总结

YZSJ.EXE应用程序通过192.2.1.10(节点1的VIP)连接HIS数据库,由于YZSJ.EXE程序新上了检查报告互认功能,有程序连接不释放问题,导致程序到数据库连接数不断增加,数据库连接数(4500)达到上限,从而导致前端业务连接数据库失败,出现ORA-12516:TNS:监听程序找不到符合协议堆栈要求的可用处理程序。

2. 处理操作

  • 查杀会话,释放进程资源 ==>效果不明显,会话杀掉后程序会自动重连
  • 重启业务前端主机,释放资源 ==>暂时缓解,数据库连接数已经下降至3400+
  • 修复exe应用程序  ==>彻底解决,预计中午业务研发部门对yszj.exe做修复

3. 后续建议

此次故障发生的根本原因是由于YZSJ.EXE程序更新了检查报告互认功能导致的程序连接不释放导致数据库连接数达到上限,从而导致前端业务无法连接数据库,针对此次发生的故障,建议点如下:

  • 业务测试,业务更新上线前,在开发库或测试环境做充分的业务功能性测试和性能测试;
  • 调整数据库连接数参数,当前数据库的process参数为4500,可根据实际需要适当调整增大数据库的process参数,调整此参数需要重启数据库生效,因此需要安排停机时间。

 

Oracle 连接异常处理故障一例

对于在云数据库 (OCI) 创建时 (虚拟机还没有生成) 的问题,只提供第一步的信息就可以。

 

对于其他的问题,需要提供步骤一到步骤三的所有信息。

1. 从 OCI 的控制台搜集数据库(OCI) 的相关信息:

a. Cloud Account Name
b. Region:
Region will be available on the top corner of every page in OCI console
c. Availability Domain:
Availability Domain can be found next to the database system name in OCI console
d. Tenancy ID:
In OCI console, Navigate to Administration –> Tenancy Details, Look OCID under the Tenancy Information
e. Compartment OCID:
In OCI console, Navigate to Identity –> Compartments. Look OCID under the compartment name
f. Database system OCID:
In OCI console, Navigate to Menu –> Bare Metal, VM, and Exadata, Look for Database System OCID near the database system name
g. Database OCID:
In OCI console, navigate to Database System page and click on the problematic database system. In the Database System page, look for the Database OCID under the “Databases”

 

2. 如果数据库 (OCI) 对应的虚拟机可以访问,那么用 root 用户登陆虚拟机,搜集基础架构、代理、数据库和任务详细信息:

基础架构详细信息

 

# sudo -s

# curl -s http://169.254.169.254/opc/v1/instance/ | egrep -v “user_data|ssh_authorized_keys|timeCreated”

代理详细信息

# sudo -s

# rpm -qa | grep dcs
# initctl status initdcsagent
# initctl status initdcsadmin

集群、数据库详细信息

# sudo -s

# Replace the grid home and issue below command
# /u01/app/12.2.0.1/grid/bin/crsctl check crs

# sudo su – oracle

# sqlplus / as sysdba

— Run below SQLs in SQLPLUS prompt

select status from v$instance;
select name, open_mode from v$database;
select banner from v$version where banner like ‘Oracle Database%’;

— if Multitenant

show pdbs;

任务详细信息

# sudo -s

# /opt/oracle/dcs/bin/dbcli describe-component > /tmp/dcs_job_details.log
# /opt/oracle/dcs/bin/dbcli list-databases >> /tmp/dcs_job_details.log
# /opt/oracle/dcs/bin/dbcli list-jobs >> /tmp/dcs_job_details.log
# /opt/oracle/dcs/bin/dbcli describe-job -i <faild_job_ID> >> /tmp/dcs_job_details.log

上传 /tmp/dcs_job_details.log

3. 使用 opc 用户登陆到数据库 (OCI) 对应的虚拟机,并执行如下命令:

# sudo /opt/oracle/dcs/bin/diagcollector.py

Sample Output:
=============[oci@ludatou ~]$ sudo /opt/oracle/dcs/bin/diagcollector.py
Log files collected to :/tmp/dcsdiag/diagLogs-1526004897.zip

Logs are being collected to:
/ludatou/ocidiag/diagLogs-4758698722.zip

该命令会生成 /tmp/dcsdiag/diagLogs-xxxxx.zip 的文件。

用如下命令上传这个文件到 SR:
Note 1547088.2 – How to upload large files to Oracle Support

# curl -T <path_and_filename>” -u “<userID>” https://transport.oracle.com/upload/issue/<sr-number>/

For example:
curl -T /ludatou/ocidiag/diagLogs-4758698722.zip -u ********@oracle.com https://transport.oracle.com/upload/issue/

Oracle cloud 上的云资源诊断信息收集

在升级之前习惯备份整个Oracle程序目录,包括数据文件,这里介绍一种备份oracle主要程序文件(Oracle_home)的方式,可以使用多种方式备份 Oracle home 。你可以使用任何工具来压缩Oracle Home,比如zip,tar,cpio。

备份之前建议关闭源库上的任何数据库,监听进程,从而可以对Oracle Home软件进行冷备份,当然也可以不停机。 如果是在安装补丁或者补丁集,Readme中的步骤会要求关闭,这种情况下,建议关闭数据库和监听再执行ORACLE_HOME冷备份。在Oracle进程活跃状态下执行备份仍然是有效的,因为任何加载static binaries 或者libraries的进程都不应当持有write lock。备份必须由Oracle安装用户或者root用户执行。目的是保证文件的属主和权限正确。

如下是使用tar命令的例子。

1. 关闭数据库,监听或者任何其它关联到你在备份的ORACLE_HOME的进程

2. cd 到ORACLE_HOME所在的目录。例如:

cd /u01/app/oracle/product/11.2

3. 备份 ORACLE_HOME 。

tar -pcvf /u01/app/oracle/backup/oracle_home_bkup.tar db1

在上述命令中, ORACLE_HOME 是 /u01/app/oracle/product/11.2/db1 而备份目录是 /u01/app/oracle/backup/

如下是一个还原ORACLE_HOME的例子:

1. 关闭数据库,监听或者任何其它关联到你在还原的ORACLE_HOME的进程

2. 进入 ORACLE_HOME 所在的目录。例如:

cd /u01/app/oracle/product/11.2

3. 重命名或者移动 ORACLE_HOME 例如:

mv db1 db1_bkup

4. 还原ORACLE_HOME 例如:

tar -pxvf /u01/app/oracle/backup/oracle_home_bkup.tar

在备份前检查是否有足够的空间备份,Oracle_Home 所注册的Central Inventory建议一起与ORACLE_HOME同时备份,从而保证一致性。

 

在升级或者打补丁之前备份ORACLE_HOME目录

2018年2月15日Oracle官方在MOS上新发布预警文档《Recommended patches and actions for Oracle databases versions 12.1.0.1, 11.2.0.3 and earlier – before June 2019 (Doc ID 2361478.1)》中强烈建议所有DBA在2019年6月之前将Oracle数据库版本11.1.0.7、11.2.0.3和12.1.0.1打补丁到下面提到的patchset/PSU级别,以解决未来dblinks互操作性方面的潜在问题。

Oracle只是定义为潜在问题,实际关键是oracle的高版本(11.2.4/12.1.0.2/12.2及以上)数据库在2019年06月23日,默认的SCN兼容性会进行自动变化。

2.1 SCN基础概念

系统更改号(SCN)是Oracle数据库使用的一个逻辑的内部时间戳。SCN对数据库中发生的事件进行排序,这是满足事务的ACID属性所必需的。

 

数据库使用SCN来查询和跟踪更改。例如,如果事务更新了一行,那么数据库将记录更新发生时的SCN。此事务中的其他修改通常具有相同的SCN。当事务提交时,数据库将为此提交记录SCN。每个事务在提交后增加SCN。

 

 

2.2 SCN limit 

SCN以单调递增的顺序出现,Oracle数据库可以使用的SCN数量上限非常大。也称为Reasonable SCN Limit,简称RSL。这个值是一个限制,避免数据库的SCN无限制地增大,甚至达到了SCN的最大值。这个值大约是这样一个公式计算出来的:(当前时间-1988年1月1日)*24*3600*16,384 (16K/sec)。兼容性版本1

由于存在上限,任何给定的Oracle数据库都不能耗尽可用的SCN,这一点很重要。Oracle数据库使用基于时间的配给系统来确保不会发生这种情况。

这样做可以确保Oracle数据库随着时间的推移对SCN进行定量配给,从而允许任何使用版本12.2.0.1或更低版本的Oracle数据库进行超过500年的数据处理。从12.2.0.1开始,数据库的兼容性设置为12.2,即使SCN以96K/sec(兼容性版本3)的速度消耗,Oracle数据库也将允许近300万年的数据处理。

 

2.3 SCN headroom

数据库使用的当前SCN与“不超过”上限之间的差异称为SCN headroom 。

 

然而,Oracle已经确定,一些软件错误可能会导致数据库试图超过当前最大SCN值(或者超出所保证的范围)。

 

通常,如果数据库确实试图超过当前的最大SCN值,那么引起此事件的事务将被数据库取消,应用程序将看到一个错误。下一秒,这个限制就会增加,所以通常应用程序在处理过程中会继续出现轻微的打嗝。然而,在一些非常罕见的情况下,数据库确实需要关闭以保持其完整性。在任何情况下,数据都不会丢失或损坏。

 

2.4 SCN兼容性

SCN 兼容级别是限制SCN增速和SCN RSL。

 

2.5 Auto-rollover 

AUTO-ROLLVOER是一种类例JOB的定时任务,定时修改SCN 兼容性级别。

禁用了AUTO-ROLLOVER 到2019-06-23后SCN兼容级别就不会自动调整,还保持原来的限制。 禁用了Auto-RollOver,可以手动调整scn兼容性(前提是应用了补丁)。

 

 


 三 关于这次SCN兼容性自动触发的原理 

在高版本的数据库中引入了SCN 兼容性参数( Compatibility),而且在这个特性中设置了时间限制。

在Oracle数据库软件内核中, 引入了一个: Auto-RollOver 的机制 。

也就是说Oracle 为不同 SCN 兼容性设定了触发时间,随着时间推移自动迭代,用户会在不知情的情况下自动应用了新的SCN 兼容性。

当在2019年6月23日后,整体数据库环境中,有高版本(定义参看上)创建后, 由于SCN允许最高SCN新增速率会有可能有较高的SCN,而未打补丁的数据库处于较低的SCN级别(可能会出现通过dblink连接这两个数据库,由于DBLINK的特性(两端数据库必须同步SCN),如果SCN增加修改数据库的同步超出它的允许SCN率或当前最大的SCN限制,无法建立连接,那么dblink被拒绝,导致业务中断。

 

 


四 查看SCN兼容性级别和调整时间

数据库版本在11.2.0.3.9之前没有打dblink的补丁则没有DBMS_SCN包。

 

在11g版本可以通过sqlplus连接至数据库执行以下语句

set serveroutput on

declare

rsl number;

headroom_in_scn number;

headroom_in_sec number;

cur_scn_compat number;

next_compat number;

max_scn_compat number;

auto_rollover_ts date;

is_enabled boolean;

begin

dbms_scn.getcurrentscnparams(rsl,headroom_in_scn,headroom_in_sec,cur_scn_compat,max_scn_compat);

dbms_scn.getscnautorolloverparams(auto_rollover_ts,next_compat,is_enabled);

dbms_output.put_line(‘RSL=’||rsl);

dbms_output.put_line(‘headroom_in_scn=’||headroom_in_scn);

dbms_output.put_line(‘headroom_in_sec=’||headroom_in_sec);

dbms_output.put_line(‘CUR_SCN_COMPAT=’||cur_scn_compat);

dbms_output.put_line(‘NEXT_COMPAT=’||next_compat);

dbms_output.put_line(‘MAX_SCN_COMPAT=’||max_scn_compat);

dbms_output.put_line(‘auto_rollover_ts=’||to_char(auto_rollover_ts,’YYYY-MM-DD’));

if(is_enabled) then

dbms_output.put_line(‘Auto_rollover is enabled!’);

else

dbms_output.put_line(‘Auto_rollover is disabled!’);

end if;

end;

/

样例输出

 

当前auto rollover是启用且在2019年6月23日自动调整SCN兼容级别为3

五 如何判断是否需要应用补丁调整策略

5.1符合以下条件则无需关注SCN兼容性自动变化及相关补丁信息

所有有DBLINK传输数据数据库的集合版本均为低版本(11.1.0.7/11.2.0.3/12.1.0.1之前版本)或均为高版本(11.2.4/12.1.0.2/12.2及以上),且未来不会有版本和dblink交叉传输数据需求的变化

 

5.2对于11.1.0.7、11.2.0.3和12.1.0.1版本的数据库

 

这些补丁增加了数据库当前最大SCN限制,从16k/s到96k/s,且默认为2019年6月24日启动生效,允许更高的SCN速率,使数据库能够支持比早期版本高许多倍的事务速率。

 

在2019年6月23日之前安装补丁后,可以通关闭autorollover的特性,来规避scn兼容性自动变更到3。命令如下:

sqlplus连接至数据库

Exec dbms_scn. DisableAutoRollover ;

在2019年6月23日之后安装的高版本(定义参看上)数据库环境设置scn 兼容性至1.命令如下:

sqlplus连接至数据库

ALTER DATABASE SET SCN COMPATIBILITY 1;

 

5.3对于上表中Oracle官方没有提及的数据库版本如10.2或更老的数据库 

充分评估业务将来可能出现的dblink传输数据的场景,考虑升级数据库或不使用dblink通过JDBC等其他方式进行数据传输。

对于10g数据库在安装完10.2.0.5.171017的PSU以及补丁14121009。

 

 

5.4安装相应补丁后将SCN的异常情况记录到alert日志中 

我们建议通过控制SCN Headroom的告警阀值使得数据库可以尽可能的将DBLINK传播SCN的行为在alert日志中提现出来,目前我们建议将_external_scn_logging_threshold_seconds设置为600秒左右,使得通过DBLINK传递导致SCN Headroom过低产生alert告警。

alter system set “_external_scn_logging_threshold_seconds”=600 scope=spfile;

 

参考《Recommended patches and actions for Oracle databases versions 12.1.0.1, 11.2.0.3 and earlier – before June 2019 (Doc ID 2361478.1)》

关于Oracle高版本SCN兼容性 变化的处理建议