Skip to content

Oracle 排障配置与调整

之前发表过smon一些相关的文章,主要是讨论回滚方面的事宜,最近在项目上碰到pmon的相关问题,因此也学习了一些相关的知识,10g之前的版本跟踪方式主要局限于操作系统版本的命令跟踪,这里不做讨论

 

11g 跟踪命令

 

从 11.1.0.7 到 18.0, 可以使用下面的命令来启动 tracing:

alter system set events=’immediate trace name listener_registration level 3′;

当收集结束后,使用下面的命令来停止跟踪:

alter system set events=’immediate trace name listener_registration level 0′;

19c开始使用下面的命令

开启 Trace:
alter system set events ‘trace[LREG] disk highest’;
alter system set events = ‘immediate trace name LREG_STATE level 3’;

要关闭 Trace:
alter system set events ‘trace[LREG] disk disable’;
Trace的默认路径是:
$ORACLE_BASE/diag/rdbms/trace/
ls -l | grep -i lreg

这会将 PMON 的信息写入到名字包含 pmon 的 trace 文件中,存放在后台进程 trace 目录。Trace 文件会显示类似下面的信息:

Start Registration Information
——————————
Last update: 1188938571 (99 seconds ago)
Flag: 0x4, 0x0
State: succ=1, wait=0, fail=0
Listeners:
0 – (ADDRESS=(PROTOCOL=TCP)(HOST=)(PORT=1521)): <— 监听地址
state=1, err=0
nse[0]=0, nse[1]=0, nte[0]=0, nte[1]=0, nte[2]=0
ncre=0
Instance: <— 实例名
flg=0, upd=0
info=(HOST=) <– 主机名
node load=57, max=40960 <– 节点负载
inst load=1, max=170 <– 实例负载
Services:
0 – : <– 服务名
flg=4, upd=6
goodness=0, delta=1 <– Goodness 和 Delta 值
1 – _XPT:
flg=4, upd=0
goodness=0, delta=0
2 – XDB:
flg=5, upd=6
goodness=0, delta=1
Handlers:
0 – Dedicated
flg=80002002, upd=2
services=,_XPT
hdlr load=22, max=149
Dispatchers:
0 – D000:
addr=(ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=33099))
inf=DISPATCHER , pid: 10850>
flg=1004, upd=0
services=XDB
hdlr load=0, max=1022
CMON Handlers:
Listen Endpoints:
—————————-
End Registration Information
—————————-

注意:Oracle Net Server trace,TNS 监听 trace 和 event 10257 在 11g 仍然可用,之前的文章介绍过跟踪监听的方法。详见下面 12c 之前版本的介绍。

 

12c LREG 故障排除

 

从 12c 开始,引入了新的后台进程 ora_lreg_sid-name。
在之前的版本中,PMON 负责处理实例注册。12c 中,LREG(Listener REGistration)接管实例注册逻辑。

LREG:

将实例信息注册到监听。
是每个数据库实例的关键后台进程(如果被杀死,oracle 将宕机)。
接管旧版本中 PMON 的一些职责,并且在 listener.log 中更新 service_update,service_register,service_died 信息。

跟踪 LREG 的方法与跟踪 PMON 的方法相同:

开启 Oracle Net 服务器端 sqlnet trace 会从实例启动时开始跟踪 LREG。
旧的 PMON trace 现在跟踪 LREG:alter system set events = ‘10257 trace name context forever, level 5’;
监听注册信息也可以通过这种方式被转储到 ora_lreg trace 文件中:alter system set events = ‘immediate trace name listener_registration level 3’;
可以动态跟踪 LREG。

 

12c 之前版本,使用下面的方法跟踪 PMON 注册问题 A) Oracle Net server 和 listener traces 或者 B) PMON tracing

A) 搜集匹配的 Oracle Net Server trace 和 Listener Trace 文件

服务器端 TRACE:

1. 在文件 SQLNET.ORA 中添加下面的参数来开启 Oracle Net Server tracing:

DIAG_ADR_ENABLED=off # Disable ADR if database version 11g TRACE_LEVEL_SERVER = 16 # Enable level 16 trace
TRACE_DIRECTORY_SERVER = # Control trace file location

2. 使用特权用户通过 SQL*Plus 连接数据库:

SQL> connect / as sysdba
Connected.
SQL> select spid from V$process, V$session where audsid=userenv(‘SESSIONID’) and paddr=addr;

SPID
————
3940
生成的 trace 文件的名字,将包含上面的返回值。

3. 执行注册命令:

SQL > alter system register
SQL > exit
4. 关闭服务器端 trace:

如果需要禁用 trace,那么可以删除 SQLNET.ORA 中刚加入的参数。到 TRACE_DIRECTORY_SERVER 设置的路径下,找到名字包含 SPID 值的 trace 文件。文件中会包含 alter system register 命令:

Listener tracing:
1. 在 listener.ora 中添加下面的参数,然后 reload listener:

DIAG_ADR_ENABLED_ =off # 如果数据库版本是 11g,需要关闭 ADR。
TRACE_LEVEL_ = 16 # 启用 level 16 trace
TRACE_TIMESTAMP_ = ON # 设置 trace 文件中的时间戳
TRACE_DIRECTORY_ = # 设置 trace 文件路径

2. 执行‘alter system register’强制注册:

SQL> alter system register;
System altered.

listener trace 文件中会看到类似下面的信息:
(信息会由于版本不同或者单节点、RAC 等因素有细微差别)

nsglgrDoRegister: inst loads: ld1:17 mld1:10240 ld2:1 mld2:248
nsglgrDoRegister: instance flags – req:0 cur:16
nsglgrDoRegister: Creating new service: “XDB.*****.com”.
nsglgrDoRegister: service:..oracle.com flag:3 goodness:0 delta:1
nsglgrDoRegister: Creating new service: “..oracle.com”.
nsglgrDoRegister: service:..oracle.com flag:2 goodness:0 delta:1

B) 12c 之前的版本启用 PMON trace 的方法:
1. 找到 PMON 的进程 ID:

SQL> select SPID,PROGRAM from v$process;

SPID PROGRAM
———————— ————————————————
PSEUDO
10096 oracle@ (PMON)
10098 oracle@ (PSP0)
10100 oracle@ (VKTM)
10104 oracle@ (GEN0)
10106 oracle@ (DIAG)
10108 oracle@ (DBRM)
10110 oracle@ (DIA0)
10112 oracle@ (MMAN)
10114 oracle@ (DBW0)
10116 oracle@ (LGWR)

SPID PROGRAM
———————— ————————————————
10118 oracle@ (CKPT)
10120 oracle@ (SMON)
10122 oracle@ (RECO)
10124 oracle@ (MMON)
10126 oracle@ (MMNL)
10128 oracle@ (D000)
10130 oracle@ (S000)
10175 oracle@ (Q000)
10280 oracle@ (SMCO)
22191 oracle@ (TNS V1-V3)
10159 oracle@ (QMNC)

SPID PROGRAM
———————— ————————————————
10177 oracle@ (Q001)
10173 oracle@ (CJQ0)
22186 oracle@ (W000)

25 rows selected.

2. 对 PMON 做 oradebug:

SQL> oradebug setospid 10096
Oracle pid: 2, Unix process id: 10096, image: oracle@(PMON)

3. 对进程设置 event:

SQL> oradebug Event 10257 trace name context forever, level 16
Statement processed.
4. trace 文件的位置可以通过下面的命令查看:

SQL> oradebug tracefile_name
Trace file /app/oracle/diag/rdbms///trace/_pmon_10096.trc
5. 执行注册命令,或者等待 PMON 注册(默认轮询时间是60秒):

SQL> alter system register;
System altered.

6. 关闭 event:

SQL> oradebug Event 10257 trace name context OFF;
Statement processed.

到 trace 所在的目录下并上传 trace。
注册成功会显示类似下面的信息:

Trace file /app/oracle/diag/rdbms///trace/_pmon_10096.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 – 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /app/oracle/product/11.2.0/dbhome_1
System name: Linux
Node name:
Release: 2.6.18-238.19.1.0.1.el5
Version: #1 SMP Fri Jul 15 04:42:13 EDT 2011
Machine: x86_64
Instance name:
Redo thread mounted by this instance: 1
Oracle process number: 2
Unix process pid: 10096, image: oracle@ (PMON)

 

Received ORADEBUG command (#1) ‘Event 10257 trace name context forever, level 16’ from process ‘Unix process pid: 22065, image: ‘

Finished processing ORADEBUG command (#1) ‘Event 10257 trace name context forever, level 16’

err=-300 lbflgs=0x0 tbtime=0 tntime=0 etime=300 srvs=1 nreqs=0 sreqs=0 asrvs=1
error=-300 etime=300 control=0 integral=0 lasterr=-300 lastetm=300
kmmlrl: status: succ=1, wait=0, fail=0
kmmlrl: update for process drop delta: 3166 3166 25 28 149
kmmgdnu:
goodness=0, delta=1,
flags=0x5:unblocked/not overloaded, update=0x6:G/D/-
kmmgdnu:
goodness=0, delta=1,
flags=0x4:unblocked/not overloaded, update=0x6:G/D/-
kmmlrl: 25 processes
kmmlrl: instance load 1

 

Bug 5755010 Listener registration never completes
详细信息: 尽管并没有注册失败的错误,但是监听注册会失败
修复版本:10.2.0.4 and 11.1.0.6

Bug 8232287 PMON stops registering its services (ORA-12516 errors)
详细信息:pmon 停止注册到监听,会引发 ORA-12516 错误。
修复版本:10.2.0.5 and 11.2.0.1

Bug 7133740One Instance of Bug ( 2 – NODE RAC DATABASE) I Crashed Due To Ora-600
请参考:

Document 759083.1 Connections get TNS-12520 error on RAC & PMON is stuck on ‘ges cancel’ wait event

Document 779318.1 Repeating ‘* Service_died * 0′ Messages In The 9i R2 Listener.Log File

Document 419824.1RAC Instance Status Shows Ready Zero Handlers For The Service

Document 1130713.1 Pmon Spins While Cleaning Dead Process

诊断方法:

1.检查 Oracle net 名字解析方式是否正确。例如:SQLNET.ORA 文件中的 NAMES.DIRECTORY_PATH 。Oracle net 会尝试正确的名字解析方式。

2.确保使用了正确的网络管理文件:SQLNET.ORA 和 TNSNAMES.ORA。Note:464410.1 Search Order for TNS files – listener.ora, sqlnet.ora, tnsnames.ora ..etc.

3.检查在数据库启动之前是否设置了 TNS_ADMIN ,如果是 RAC 环境,是否使用 srvctl 设置了 TNS_ADMIN,TNS_ADMIN 可以 影响搜索顺序。数据库只在启动的时候读取环境变量。如果在启动数据库之后设置了 TNS_ADMIN,然后做了修改,修改后的值是不会被读取到的。

4.检查 LOCAL_LISTENER 或者 REMOTE_LISTENER 使用的网络服务名是否可以 tnsping 通。如果不通,重建条目或者参考第6步。

5.确保使用的主机名与 nslookup 返回的结果相同,并且返回的地址是预期的。

C:\>nslookup

Server:
Address: ….

Name:
Address:….

6.修改 LOCAL_LISTENER 或者 REMOTE_LISTENER 的值,确保不使用名字解释方式。不使用网络管理文件,确认是否问题出在网络管理文件或是没找到它们。
例如 LOCAL_LISTENER

sqlplus / as sysdba
SQL>alter system set LOCAL_LISTENER='(ADDRESS=(PROTOCOL=TCP)(HOST=)(PORT=1521));
例如两节点 RAC 中的 REMOTE_LISTENER

SQL>alter system set REMOTE_LISTENER=’ (ADDRESS_LIST =(ADDRESS = (PROTOCOL = TCP)(HOST = )(PORT = 1521))(ADDRESS = (PROTOCOL = TCP)(HOST = )(PORT = 1521)))’;

7.检查 HOST 到 ip 地址的转换是否允许注册。

8.使用 IPC 替代 TCP 完成 LOCAL_LISTENER 的注册,验证问题是否与 TCP 或主机名有关。

sqlplus / as sysdba
SQL>alter system set LOCAL_LISTENER='(ADDRESS=(PROTOCOL=IPC)(KEY=KEY1))’;
Key 值必须与 LISTENER.ORA 文件中的 IPC 地址一致。

在oracle 数据库运行时使用event跟踪 pmon进程动态注册

bug: MEMORY_TARGET not supported on this system
SQL> startup open;
ORA-00845: MEMORY_TARGET not supported on this system

解决方案:原因是/dev/shm 必须大于 MEMORY_TARGET。

$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 1.8G 0 1.8G 0% /dev
tmpfs 1.9G 635M 1.2G 35% /dev/shm
tmpfs 1.9G 9.2M 1.8G 1% /run
tmpfs 1.9G 0 1.9G 0% /sys/fs/cgroup
/dev/mapper/ol-root 10G 3.9G 6.2G 39% /
/dev/sda1 397M 199M 198M 51% /boot
/dev/mapper/ol-home 3.0G 58M 3.0G 2% /home
/dev/mapper/ol-u01 30G 18G 13G 58% /u01
/dev/mapper/ol-tmp 3.0G 59M 3.0G 2% /tmp
vmhgfs-fuse 159G 133G 27G 84% /mnt/hgfs
tmpfs 370M 0 370M 0% /run/user/54322
tmpfs 370M 16K 370M 1% /run/user/42
tmpfs 370M 0 370M 0% /run/user/54321
tmpfs 370M 0 370M 0% /run/user/0

通过如下命令解决即可

mount -o size=2560m /dev/shm

 

Dataguard bug : MEMORY_TARGET not supported on this system ORA-00845

概述

 

某客户于2021/06/07凌晨01点43分左右,出现应用程序无法连接,前台业务无法办理等情况。在此情况下代维科技工程师在接到故障支持电话后,及时进行响应,通过远程的方式接入数据库进行故障排除,快速解决了问题使得业务恢复正常使用。

问题及相关日志分析

1. 数据库告警日志分析分析

通过对数据库告警日志查看,节点1在01:22:14出现ORA-29970错误,LMD进程无法响应,在01:22:29被LMHB进程将实例终止,实例终止后自动启动。节点2在01:29:31被LMS0进程将实例终止,实例终止后自动启动。

1节点:

Mon Jun 07 01:22:14 2021

LMD0 (ospid: 10613) has not called a wait for sub 0 secs.

LMD1 (ospid: 10617) has not called a wait for sub 0 secs.

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_lmhb_10637.trc  (incident=688375) (PDBNAME=CDB$ROOT):

ORA-29770: global enqueue process LMD0 (OSID 10613) is hung for more than 70 seconds

Incident details in: /u01/app/oracle/diag/rdbms/cdb/cdb1/incident/incdir_688375/cdb1_lmhb_10637_i688375.trc

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_lmhb_10637.trc  (incident=688376) (PDBNAME=CDB$ROOT):

ORA-29770: global enqueue process LMD1 (OSID 10617) is hung for more than 70 seconds

Incident details in: /u01/app/oracle/diag/rdbms/cdb/cdb1/incident/incdir_688376/cdb1_lmhb_10637_i688376.trc

LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_CDB by memno 0

ERROR: Some process(s) is not making progress.

LMHB (ospid: 10637) is terminating the instance.

Please check LMHB trace file for more details.

Please also check the CPU load, I/O load and other system properties for anomalous behavior

ERROR: Some process(s) is not making progress.

LMHB (ospid: 10637): terminating the instance due to error 29770

Mon Jun 07 01:22:24 2021

System state dump requested by (instance=1, osid=10637 (LMHB)), summary=[abnormal instance termination].

System State dumped to trace file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_diag_10575_20210607012224.trc

Mon Jun 07 01:22:26 2021

License high water mark = 591

Mon Jun 07 01:22:29 2021

Instance terminated by LMHB, pid = 10637

Mon Jun 07 01:22:29 2021

USER (ospid: 21601): terminating the instance

Mon Jun 07 01:22:29 2021

Instance terminated by USER, pid = 21601

Mon Jun 07 01:22:32 2021

Starting ORACLE instance (normal) (OS id: 21660)

Mon Jun 07 01:22:32 2021

CLI notifier numLatches:37 maxDescs:3986

Mon Jun 07 01:22:32 2021

**********************************************************************

Mon Jun 07 01:22:32 2021

Dump of system resources acquired for SHARED GLOBAL AREA (SGA)

 

Mon Jun 07 01:22:32 2021

Per process system memlock (soft) limit = UNLIMITED

Mon Jun 07 01:22:32 2021

Expected per process system memlock (soft) limit to lock

SHARED GLOBAL AREA (SGA) into memory: 128G

Mon Jun 07 01:22:32 2021

Available system pagesizes:

4K, 2048K

Mon Jun 07 01:22:32 2021

Supported system pagesize(s):

Mon Jun 07 01:22:32 2021

PAGESIZE  AVAILABLE_PAGES  EXPECTED_PAGES  ALLOCATED_PAGES  ERROR(s)

Mon Jun 07 01:22:32 2021

2048K            66823           65538           65538        NONE

Mon Jun 07 01:22:32 2021

Reason for not supporting certain system pagesizes:

Mon Jun 07 01:22:32 2021

4K – Large pagesizes only

Mon Jun 07 01:22:32 2021

**********************************************************************

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

Initial number of CPU is 24

 

2节点:

Mon Jun 07 01:25:05 2021

Set master node info

Mon Jun 07 01:26:05 2021

Auto-tuning: Shutting down background process GTXb

Mon Jun 07 01:27:53 2021

IPC Send timeout detected. Sender: ospid 62152 [oracle@node1 (PING)]

Receiver: inst 1 binc 912546309 ospid 10589

Mon Jun 07 01:29:25 2021

LMD0 (ospid: 62168) received an instance eviction notification from instance 1 [2]

Mon Jun 07 01:29:26 2021

Received an instance abort message from instance 1

Mon Jun 07 01:29:26 2021

Received an instance abort message from instance 1

Mon Jun 07 01:29:26 2021

Please check instance 1 alert and LMON trace files for detail.

Mon Jun 07 01:29:26 2021

Please check instance 1 alert and LMON trace files for detail.

Mon Jun 07 01:29:26 2021

LMS0 (ospid: 62192): terminating the instance due to error 481

Mon Jun 07 01:29:26 2021

System state dump requested by (instance=2, osid=62192 (LMS0)), summary=[abnormal instance termination].

System State dumped to trace file /u01/app/oracle/diag/rdbms/cdb/cdb2/trace/cdb2_diag_62123_20210607012926.trc

Mon Jun 07 01:29:26 2021

ORA-1092 : opitsk aborting process

Mon Jun 07 01:29:27 2021

License high water mark = 1251

Mon Jun 07 01:29:31 2021

Instance terminated by LMS0, pid = 62192

Mon Jun 07 01:29:31 2021

USER (ospid: 85262): terminating the instance

Mon Jun 07 01:29:31 2021

Instance terminated by USER, pid = 85262

Mon Jun 07 01:29:33 2021

Starting ORACLE instance (normal) (OS id: 85397)

Mon Jun 07 01:29:33 2021

CLI notifier numLatches:37 maxDescs:3986

Mon Jun 07 01:29:33 2021

**********************************************************************

Mon Jun 07 01:29:33 2021

Dump of system resources acquired for SHARED GLOBAL AREA (SGA)

 

Mon Jun 07 01:29:33 2021

Per process system memlock (soft) limit = UNLIMITED

Mon Jun 07 01:29:33 2021

Expected per process system memlock (soft) limit to lock

SHARED GLOBAL AREA (SGA) into memory: 128G

Mon Jun 07 01:29:33 2021

Available system pagesizes:

4K, 2048K

Mon Jun 07 01:29:33 2021

Supported system pagesize(s):

Mon Jun 07 01:29:33 2021

PAGESIZE  AVAILABLE_PAGES  EXPECTED_PAGES  ALLOCATED_PAGES  ERROR(s)

Mon Jun 07 01:29:33 2021

2048K            67200           65538           65538        NONE

Mon Jun 07 01:29:33 2021

Reason for not supporting certain system pagesizes:

Mon Jun 07 01:29:33 2021

4K – Large pagesizes only

Mon Jun 07 01:29:33 2021

**********************************************************************

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

Initial number of CPU is 24

查看更多日志,发现在关闭之前,出现了较多的ORA-04031错误。

Mon Jun 07 01:18:14 2021

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_ora_14795.trc  (incident=691727) (PDBNAME=CDB$ROOT):

ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^1069″,”kglseshtTable”)

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_lmd0_10613.trc  (incident=707486) (PDBNAME=CDB$ROOT):

ORA-04031: unable to allocate 8504 bytes of shared memory (“shared pool”,”unknown object”,”sga heap(5,0)”,”ges big msg pool”)

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Mon Jun 07 01:18:16 2021

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_ora_14832.trc  (incident=691688) (PDBNAME=CDB$ROOT):

ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^320″,”kglseshtTable”)

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Mon Jun 07 01:18:16 2021

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_ora_14837.trc  (incident=691735) (PDBNAME=CDB$ROOT):

ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^1234″,”kglseshtTable”)

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Mon Jun 07 01:18:18 2021

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_ora_14912.trc  (incident=691871) (PDBNAME=CDB$ROOT):

ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^159″,”kglseshtTable”)

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_lmd0_10613.trc  (incident=707487) (PDBNAME=CDB$ROOT):

ORA-04031: unable to allocate 8504 bytes of shared memory (“shared pool”,”unknown object”,”sga heap(5,0)”,”ges big msg pool”)

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

DDE: Problem Key ‘ORA 4031’ was completely flood controlled (0x6)

Further messages for this problem key will be suppressed for up to 10 minutes

Mon Jun 07 01:18:33 2021

Errors in file /u01/app/oracle/diag/rdbms/cdb/cdb1/trace/cdb1_m000_15479.trc:

ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^2147″,”kglseshtTable”)

 

 

2. AWR报告信息分析

 

拉取了宕机前半小时的AWR报告,从AWR报告中得知,shared pool中ges enqueues和ges resource dynamic分别达到了18G和16G。

3. Oracle官方参考文献

 

根据Oracle官方文档记载:ORA-04031 Errors Occurring with High “ges resource dynamic” & “ges enqueues” Memory Usage In The Shared Pool (Doc ID 2063751.1),该问题是Oracle的BUG,该BUG出现于Oracle RAC 12.1.0.1至12.1.0.2,此问题在Oracle RAC 12.2中被修复,移动护理数据库版本为Oracle RAC 12.1.0.2。

 

 

4. 数据库硬解析比例高的SQL语句

SQL_TEXT

———————————————————————————————————————————————————————————————————————————————————-

unshared count

————–

SELECT A.PATIENT_ID, A.VISIT_ID,A.ORDER_NO,A.ORDER_SUB_NO,TO_CHAR(T.BED_NO) BED_NO,TO_CHAR(START_DATE_TIME,’YYYY-MM-DD HH24:MI:SS’) AS START_DATE_TIME, TO_   FROM  V_EMR_TIME,’YYYY-MM-DD HH24:MI:SS’) AS STOP_DATE_TIME

ORDERS_PERFORMAN  AND  A.PATIENT_ID=’690086500ID’,’7′)

8′              AND (A.STOP_DATE_TIME IS NULL OR A.STOP_DATE_TIME >= TO_DATE(‘2021-06-07’||’ 00:00:00′,’YYYY-MM-DD HH24:MI:SS’)) AND A.START_DATE_TIME <=

TO_DATE(‘2021-06-07 07:52:53’, ‘YYYY-MM-DD HH24:MI:SS’)

35162

 

update KYEEHIS.hl7_send_records set type = ‘1’ where rec_id = ‘1003839362’

31808

 

AND T.BARCODE<>’1# IS NOT NULL2021-06-07′, ‘YYYY-MM-DD’)

ORDER BY T.OPER_TIME  DESC日(Q12)’, ‘1/12小时’, ‘1/8小时’)

14151

 

SELECT RECORD_TIME,ALL_S     WHERE PATIENT_ID=’1110234022′ AND VISIT_ID=’3′ AND PG_CODE=’8′ –AND RECORD_TIME BETWEEN TO_DATE(‘2021-06-07 00:00:00′,’YYYY-MM-DD HH24:MI:SS’) AND TO_DATE(‘2021-06-07 23:59:59

‘,’YYYY-MM-DD HH24:MI:SS     ORDER BY RECORD_TIME DESC

7829

 

select distinct barcode,’1’ exec_fg,nurse_name,nurse oper_nurse,” oper_time,order_exec_id,order_id,order_no,order_text,patient_id,visit_id                         where patient_id = ‘6900_exec_log

767269’                  and order_exec_id = ‘288132019’

6180

 

SELECT T.PATIENT_ID,            TO_DATE(TO_CHAR(T.EVENT_DATE,’YYYY-MM-DD’)||’ ‘|| TO_CHAR(T.TIME_POINT, ‘HH24:MI:SS’),

‘YYY      WHAND T.PATIENT_ID = ’40TAMP_CODEP_DICT K

10479126’                       AND (T.STAMP_CODE = ‘10003083’ OR T.STAMP_CODE = ‘10003084’)

5894

 

select t.* from KYEEHIS.docs_ord and order_no = ‘49712239’47427′

and exec_time = TO_DATE(‘2021-06-07 09:00:00′,’YYYY-MM-DD HH24:MI:SS’)

5070

 

SELECT REC_ID, PATIENT_ID, VISIT_ID, MODULE_CODE, MODULE_NAME, CREATE_TIME, ENBALE_VALUE FROM DOCS_OPERATION_MODULE_REC WHERE PATIENT_ID=’4010110870′ AND VISIT_ID=’3′ AND MODULE_CODE=’19’ AND ENBALE_VALUE=’Y’

4849

 

SELECT PATIENT_ID,TO_CHAR(VISIT_ID) AS VISIT_ID,INP_NO,WARD_CODE,WARD_NAME,DEPT_CODE,DEPT_NAME,PATIENT_NAME,SEX,TO_CHAR(BED_NO) AS BED_NO, BED_LABEL,DATE_OF_BIRTH,DECODE(NURSE_LEVEL,’特级护理’,’0′,’一级护理’,’1′,’二级护理’,’2′,’三级护理’,’3′) NURSE_L

EVEL,STATUS,CHARGE_TYPE,ADMISSION_DATE,DEPTED_TIME,DIAGNOSIS,MEAL,ALLERGY,ADDRESS, LEAVE_TIME,DOCTOR_IN_CHARGE ,       DOCTOR_NAME,       NURSE_IN_CHARGE,  NURSE_NAME,H_CODE,TYPE FROM KYEEHIS.v_his_pats_discharge T WHERE T.PATIENT_ID=’6901544170

‘ AND T.VISIT_ID=5 UNION  SELECT PATIENT_ID,TO_CHAR(VISIT_ID) AS VISIT_ID,INP_NO,WARD_CODE,WARD_NAME,DEPT_CODE,DEPT_NAME,PATIENT_NAME,SEX,TO_CHAR(BED_NO) AS BED_NO, BED_LABEL,DATE_OF_BIRTH,DECODE(NURSE_LEVEL,’特级护理’,’0′,’一级护理’,’1′,’二级护理’,’

2′,’三级护理’,’3′) NURSE_LEVEL,STATUS,CHARGE_TYPE,ADMISSION_DATE,DEPTED_TIME,DIAGNOSIS,MEAL,ALLERGY,ADDRESS,SYSDATE AS LEAVE_TIME,DOCTOR_IN_CHARGE ,          DOCTOR_NAME,       NURSE_IN_CHARGE,       NURSE_NAME,H_CODE,TYPE FROM KYEEHIS.V_HIS_PATS_IN_HOS

4139

 

select d.item_name,d.item_value  and t.patient_id = ‘6900510081’_eval_report_detail_rec d

and t.visit_id = ‘3’          and d.enabled_value = ‘Y’sionRecords’

4061

 

 

10 rows selected.


三、总结与后续处理建议

1. 问题总结

移动护理数据库版本是ORACLE RAC 12.1.0.2.0,本次故障发生前出现内存不足相关ORA-04031以及后台进程LMD无影响等报错信息,并最终被LMHB监测进程终止了数据库实例的运行。检查数据库后台TRACE及AWR信息报告,可以发现问题发生前shared pool内在中主要有组件ges resource dynamic和ges enqueues占用了35GB内存;

参照ORACLE官方文档信息该版本存在BUG 21373473,会导致shared pool中的ges resource dynamic和ges enqueues等组件占用内存不断增多,会导致数据库LMD后台进程无响应,从而导致实例终止宕机,这与本次问题的理解及后台TRACE吻合,因此判断为此BUG导致的数据库宕机。

同时数据库中存在硬解析比例高的问题,在2.4章节中已经将数据库中TOP 10的硬解析语句进行了抓取提供,需要软件开发商配合对此类SQL进行优化及使用SQL绑定变量等改写。

2. 问题详细分析过程

1.故障时间状态:在2021年6月7日凌晨00:30 – 01:00,移动护理数据库shared pool内存池中的ges enqueues和ges resource dynamic组件分别达到了18GB和16.8GB;

移动护理数据库在正常运行状态时ges enqueues和ges resource dynamic分别大约为3G和1G。

 

2.故障相关报错信息:问题时段这两个内存组件占用的内存显著异常,导致Oracle内存池剩余内存不足,引起数据库中进程出现ORA-04031的内存错误及后台进程LMD进程无响应(ORA-29770: global enqueue process LMD1 (OSID 10617) is hung for more than 70 seconds)。

报错信息为:ORA-04031: unable to allocate 12312 bytes of shared memory (“shared pool”,”unknown object”,”KKSSP^1234″,”kglseshtTable”)

 

3.报错信息与BUG的对应匹配:以上信息与ORA-04031 Errors Occurring with High “ges resource dynamic” & “ges enqueues” Memory Usage In The Shared Pool (Doc ID 2063751.1)及ORA-4031 due to Huge “ges resource dynamic” and “ges enqueues” and Instance Terminates: ORA-29770: ‘SGA: allocation forcing component growth’ (Doc ID 1951758.1)文档中的BUG 21373473描述吻合:ORA-04031 errors occurring in a RAC instance with the trace file showing high memory usage for “ges resource dynamic” and “ges enqueues” memory in the shared pool.  This can cause LMD processes to become unresponsive leading to an instance termination.

 

4.数据库实例被LMHB进程终止的原因:从11gR2版本,新增加LMHB进程,通过LMHB进程来定期监控lmon/lms/lmd/lck0等进程,如果LMHB进程发现某一个重要后台进程一段时间内(默认70秒)未更新自己的状态信息,会进行分析,如必要则通过重启本地节点的方式来解决问题。本次即为lmd进程超过70秒无响应后,LMHB对数据库进程进行了KILL,从而引起数据库宕机。

 

5. 后续处理及建议

  1. BUG问题的解决

根据Oracle官方参考文献,解决该问题有两种方式:

  1. 解决方案1

修改数据库参数,_GES_DIRECT_FREE_RES_TYPE=”CTARAHDXBB”,该参数是静态参数,需要重启数据库生效,可采取2个节点轮流修改的方法,减少业务停机时间,推荐采用此方案。

  1. 解决方案2

安装数据库补丁,补丁号Patch:21260431,需要安排业务停机时间安装补丁。

 

  1. 数据库硬解析比例高的解决

在2.4章节中已经将数据库中TOP 10的硬解析语句进行了抓取提供,需要软件开发商配合对此类SQL进行优化及使用SQL绑定变量等改写,从而提升系统性能,提高系统的稳定性。

ORA-29970 故障处理一例

概述

 

数据库对应的业务程序在2021/3/7日凌晨0:02分报无法连接数据库,在0:04分接到故障申告后,代维科技工程师通过远程方式接入数据库环境进行故障排查,在0:08分处理完成。通过对相关日志等信息的深入分析,找出问题点并提出相应的解决建议,根据相关问题日志及分析处理情况整理汇总此文档。

 

无法连接的原因分析

1. 宁帆数据库无法连接分析

 

登陆数据库进行查看,数据库无法连接原因为数据库主机磁盘空间使用率100%,归档日志无法切换,应用程序此时无法连接到数据库。紧急通过人工删除一部分数据库日志,并进行日志切换动作后,程序可以正常连接。

[root@nf-dboracle ~]# df -h

Filesystem                          Size  Used Avail Use% Mounted on

/dev/mapper/vg_nfdboracle-LogVol00  476G  452G   16M 100% /

tmpfs                               7.8G  160K  7.8G   1% /dev/shm

/dev/sda1                          1008M   63M  895M   7% /boot

/dev/sr0                            3.6G  3.6G     0 100% /mnt

[root@nf-dboracle ~]# su – oracle

[oracle@nf-dboracle ~]$ rman target /

 

Recovery Manager: Release 11.2.0.3.0 – Production on Sun Mar 7 00:07:32 2021

 

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

 

RMAN-00571: ===========================================================

RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============

RMAN-00571: ===========================================================

RMAN-00554: initialization of internal recovery manager package failed

RMAN-04005: error from target database:

ORA-09945: Unable to initialize the audit trail file

Linux-x86_64 Error: 28: No space left on device

 

 

2. 磁盘空间不足原因的分析

通过对数据库归档日志产生数量的分析,可以发现,从3月4号以来,每天的数据库归档日志产生量是原来的15至20倍左右,当前每天产生120GB的归档日志,按原有的备份策略,归档日志保留2天,因此最终引起了磁盘空间不足。

 

 

 

3. 产生归档日志量多的操作分析

数据库层面看,70%的数据变更都在KLB_MICROHOSP用户的 NFSYS_LOG_THIRD表上,在TOP SQL中也发现了对这个表的频率插入;结合表名看,有可能是新开启了写入日志的功能,具体需要应用软件开发商确认一下。

INSERT INTO NFSYS_LOG_THIRD ( ID, TYPE, METHOD, MSG, DESCRIPTION, NAME, OPENID, THIRD_METHOD ) VALUES ( NFSYS_LOG_THIRD_S.nextval, :type0, :method0, :msg, :description, :name, :openid, :third_method ) RETURNING ID INTO :IDENTITY_PARAMETER

 

 


总结与后续处理建议

通过对数据库相关日志以及相关数据的分析,可以得出如下信息:

1. 数据库无法连接原因的分析:

登陆数据库进行查看,数据库无法连接原因为数据库主机磁盘空间使用率100%,归档日志无法切换,应用程序此时无法连接到数据库。紧急通过人工删除一部分数据库日志,并进行日志切换动作后,程序可以正常连接。(0点4分接到电话,0点8分处理完成,用时4分钟。)

2. 磁盘空间不足分析:

通过对数据库归档日志产生数量的分析,可以发现,从3月4号以来,每天的数据库归档日志产生量是原来的15至20倍左右,当前每天产生120GB的归档日志,按原有的备份策略,归档日志保留2天,因此最终引起了磁盘空间不足。已经临时修改了备份及删除归档日志的策略来解决此问题;

3. 产生日志多的应用程序语句:

从数据库层面看,70%的数据变更都在KLB_MICROHOSP用户的 NFSYS_LOG_THIRD表上,在TOP SQL中也发现了对这个表的频率插入;结合表名看,有可能是新开启了写入日志的功能,具体需要应用软件开发商确认一下。

INSERT INTO NFSYS_LOG_THIRD ( ID, TYPE, METHOD, MSG, DESCRIPTION, NAME, OPENID, THIRD_METHOD ) VALUES ( NFSYS_LOG_THIRD_S.nextval, :type0, :method0, :msg, :description, :name, :openid, :third_method ) RETURNING ID INTO :IDENTITY_PARAMETER

 

 

4. 下一步排查及处理方向

因此,接下来需要及时排查的问题为:

  1. 应用程序开发商根据对应的SQL语句,排查程序在3/4日做了哪些升级或更新操作。
  2. 程序的调整是短期还是长期需要按此模式运行;这决定了后续数据库层面如何制定相应的备份及删除策略。
某客户报表数据库无法连接故障分析

概述

 

某客户由存储厂商维护人员将LUN存储划分到EMR数据库主机上;在此期间由于不当操作,导致数据库集群状态异常,之后重启了数据库主机节点2,重启主机后集群软件GRID无法正常启动,节点1仍可以正常提供数据库服务。通过远程方式接入数据库环境进行故障排查,通过对相关日志等信息的深入分析,最终找到问题点并进行处理,数据库集群恢复正常。

 

集群及数据库日志分析

1. 15:36分主机重启前的错误信息

从此处日志可以看出,在15:36分时数据库实例报IO读写错误,同时数据库集群日志中也在显示报错,之后发生了主机重启动作。

 

2. 节点2重启主机后相关日志信息

节点2重启主机后,检查集群启动状态可以发现在启动到ASM实例时停止,检查ASM ALERT日志有如下信息:

集群ALERT日志:

2021-12-12 15:48:34.081:

[client(10295)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

[client(10405)]CRS-10001:12-Dec-19 15:48 ACFS-9391: Checking for existing ADVM/ACFS installation.

[client(10410)]CRS-10001:12-Dec-19 15:48 ACFS-9392: Validating ADVM/ACFS installation files for operating system.

[client(10412)]CRS-10001:12-Dec-19 15:48 ACFS-9393: Verifying ASM Administrator setup.

[client(10415)]CRS-10001:12-Dec-19 15:48 ACFS-9308: Loading installed ADVM/ACFS drivers.

[client(10418)]CRS-10001:12-Dec-19 15:48 ACFS-9154: Loading ‘oracleoks.ko’ driver.

[client(10446)]CRS-10001:12-Dec-19 15:48 ACFS-9154: Loading ‘oracleadvm.ko’ driver.

[client(10473)]CRS-10001:12-Dec-19 15:48 ACFS-9154: Loading ‘oracleacfs.ko’ driver.

[client(10560)]CRS-10001:12-Dec-19 15:48 ACFS-9327: Verifying ADVM/ACFS devices.

[client(10564)]CRS-10001:12-Dec-19 15:48 ACFS-9156: Detecting control device ‘/dev/asm/.asm_ctl_spec’.

[client(10568)]CRS-10001:12-Dec-19 15:48 ACFS-9156: Detecting control device ‘/dev/ofsctl’.

[client(10573)]CRS-10001:12-Dec-19 15:48 ACFS-9322: completed

2021-12-12 15:48:43.930:

[client(10374)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

2021-12-12 15:48:53.698:

[client(10598)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

2021-12-12 15:49:03.129:

[client(10905)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

2021-12-12 15:50:00.511:

[client(12458)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /u01/app/11.2.0/grid/log/cxemrdb02/client/crsctl_oracle.log.

2021-12-12 15:51:07.276:

?

节点2:ASM ALERT日志

Thu Dec 12 16:22:11 2021

MMNL started with pid=21, OS id=26248

lmon registered with NM – instance number 2 (internal mem no 1)

Thu Dec 12 16:24:08 2021

LMON received an instance eviction notification from instance 1

The instance eviction reason is 0x20000000

The instance eviction map is 2

Thu Dec 12 16:24:11 2021

PMON (ospid: 26206): terminating the instance due to error 481

Thu Dec 12 16:24:11 2021

ORA-1092 : opitsk aborting process

Thu Dec 12 16:24:13 2021

System state dump requested by (instance=2, osid=26206 (PMON)), summary=[abnormal instance termination].

System State dumped to trace file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_diag_26216_20211212162413.trc

Dumping diagnostic data in directory=[cdmp_20211212162411], requested by (instance=2, osid=26206 (PMON)), summary=[abnormal instance termination].

Instance terminated by PMON, pid = 26206

检查对应的LMON及DIAG日志:

[grid@cxemrdb02 trace]$ more ./+ASM2_diag_26216.trc

Trace file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_diag_26216.trc

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 – 64bit Production

With the Real Application Clusters and Automatic Storage Management options

ORACLE_HOME = /u01/app/11.2.0/grid

System name: ???Linux

Node name: ?????cxemrdb02

Release: ???????2.6.32-431.el6.x86_64

Version: ???????#1 SMP Sun Nov 10 22:19:54 EST 2013

Machine: ???????x86_64

Instance name: +ASM2

Redo thread mounted by this instance: 0 <none>

Oracle process number: 6

Unix process pid: 26216, image: oracle@cxemrdb02 (DIAG)

*** 2021-12-12 16:22:10.788

*** SESSION ID:(373.1) 2021-12-12 16:22:10.788

*** CLIENT ID:() 2021-12-12 16:22:10.788

*** SERVICE NAME:() 2021-12-12 16:22:10.788

*** MODULE NAME:() 2021-12-12 16:22:10.788

*** ACTION NAME:() 2021-12-12 16:22:10.788

Node id: 1

List of nodes: 0, 1,

*** 2021-12-12 16:22:10.790

Reconfiguration starts [incarn=0]

*** 2021-12-12 16:22:10.790

I’m the voting node

Group reconfiguration cleanup

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

kjzdattdlm: Can not attach to DLM (LMON up=[TRUE], DB mounted=[FALSE]).

 

3. 问题分析描述

 

通过对数据库集群相关日志的分析,可以发现在15:36分时的非正常操作,导致了数据库及集群报错并状态异常,此时进行了数据库主机的重启动作;重启后集群日志中显示ASM无法启动,根据ASM日志及TRACE文件信息,结合ORACLE官方文档中ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481 (Doc ID 1383737.1)文档中描述的信息,可以判断是集群私网通信出现了问题,通过对集群私网相关配置进行分析、排查,最终确认是集群通信的socket文件(/var/tmp/.oracle)内容由于集群异常关闭导致不一致,从而引起集群节点2重启后不能正常启动。通过清理socket文件后集群可以正常启动 恢复正常。

 

 

 

 

 

三、总结与后续处理建议

1. 问题总结

 

通过对数据库集群相关日志的分析,可以发现在15:36分时的非正常操作,导致了数据库及集群报错并状态异常,此时进行了数据库主机的重启动作;重启后集群日志中显示ASM无法启动,根据ASM日志及TRACE文件信息,结合ORACLE官方文档中ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481 (Doc ID 1383737.1)文档中描述的信息,可以判断是集群私网通信出现了问题;通过对集群私网相关配置进行分析、排查,最终确认是集群通信的socket文件(/var/tmp/.oracle)内容由于集群异常关闭导致不一致,从而引起集群节点2重启后与节点1通信异常,不能正常启动。通过清理socket文件后集群可以正常启动,数据库集群恢复正常。

 

 

 

4. 后续对于集群添加磁盘的建议

对于生产环境数据库集群,向ASM环境中添加磁盘是一个重要且有一点风险的动作,由于数据库所有数据都存放在ASM磁盘组中,因此如果向已有ASM磁盘组添加磁盘不当,可能导致ASM磁盘组读写异常或不能正常MOUNT,从而影响正常的数据库业务运行。因此结合ORACLE官方的标准步骤”如何在 RAC 集群或单机 ASM 环境中对已经存在的 Diskgroup 添加新磁盘(最佳实践) (文档 ID 1946668.1)”以及我们多年的数据库维护经验,

建议后续添加ASM磁盘时遵从如下步骤:

步骤 操作明细
1 对现有集群状态、ASM磁盘状态、使用率、磁盘大小检查
2 存储层面进行相应的LUN划分,建议划分的LUN大小与原ASM磁盘组中的磁盘大小一致
3 存储层新划分LUN到主机操作系统层面的映射,确认集群中每个主机可以正常识别读写LUN
4 参照已有的ASM磁盘配置相应的别名或多路径映射,集群kfed命令可以读取
5 使用新划分LUN新建临时测试磁盘组确认新划分LUN可以被ASM使用
6 删除新加的测试磁盘组
7 将新划分LUN加入所需的ASM磁盘组中
8 检查ASM磁盘组rebalance完成状态,检查集群及数据库的状态
Oracle Asm 增加磁盘 操作不当引起ASM无法启动故障处理记录