Skip to content

Instance tuning

概述

 

某客户电子发票ISMS业务程序异从2022/2/20日早上8:52分至9:07分左右使用异常,之后自动恢复正常。通过远程方式接入数据库环境进行故障排查,通过对相关日志等信息的深入分析,找出问题点并提出解决方案

 

告警中的表空间使用率高问题分析

1. HIS数据库节点2在问题时间段的日志信息

节点2:系统自动扩大数据文件记录

2022-02-20T08:51:50.076038+08:00

HISDB(3):Using deprecated SQLNET.ALLOWED_LOGON_VERSION parameter.

2022-02-20T08:52:12.250967+08:00

HISDB(3):Resize operation completed for file# 50, old size 1048576K, new size 1069056K

2022-02-20T08:52:12.623885+08:00

HISDB(3):Resize operation completed for file# 50, old size 1069056K, new size 1110016K

2022-02-20T08:56:42.517712+08:00

HISDB(3):Using deprecated SQLNET.ALLOWED_LOGON_VERSION parameter.

2022-02-20T08:56:42.736925+08:00

HISDB(3):Resize operation completed for file# 49, old size 1048576K, new size 1069056K

2022-02-20T08:56:42.743704+08:00

HISDB(3):Resize operation completed for file# 50, old size 1110016K, new size 1130496K

 

2. HIS数据库节点1在问题时间段的日志信息

 

节点1:9:17分人工扩大数据文件日志记录

2022-02-20T09:17:43.482573+08:00

HISDB(3): alter database datafile ‘+DATADG/CDB/HISDB/DATAFILE/isms_data01’ resize 2G

2022-02-20T09:17:43.726081+08:00

HISDB(3):Using deprecated SQLNET.ALLOWED_LOGON_VERSION parameter.

HISDB(3):Using deprecated SQLNET.ALLOWED_LOGON_VERSION parameter.

2022-02-20T09:17:44.483661+08:00

HISDB(3):Using deprecated SQLNET.ALLOWED_LOGON_VERSION parameter.

HISDB(3):Using deprecated SQLNET.ALLOWED_LOGON_VERSION parameter.

2022-02-20T09:17:44.556208+08:00

HISDB(3):Using deprecated SQLNET.ALLOWED_LOGON_VERSION parameter.

HISDB(3):Using deprecated SQLNET.ALLOWED_LOGON_VERSION parameter.

2022-02-20T09:17:45.556743+08:00

HISDB(3):Using deprecated SQLNET.ALLOWED_LOGON_VERSION parameter.

HISDB(3):Using deprecated SQLNET.ALLOWED_LOGON_VERSION parameter.

2022-02-20T09:17:45.638312+08:00

HISDB(3):Resize operation completed for file# 49, old size 1069056K, new size 2097152K

HISDB(3):Completed:  alter database datafile ‘+DATADG/CDB/HISDB/DATAFILE/isms_data01’ resize 2G

2022-02-20T09:17:45.663855+08:00

 

HISDB(3):Using deprecated SQLNET.ALLOWED_LOGON_VERSION parameter.

2022-02-20T09:17:58.864830+08:00

HISDB(3):alter database datafile ‘+DATADG/CDB/HISDB/DATAFILE/isms_idx01’ resize 2G

2022-02-20T09:17:59.466571+08:00

HISDB(3):Using deprecated SQLNET.ALLOWED_LOGON_VERSION parameter.

HISDB(3):Using deprecated SQLNET.ALLOWED_LOGON_VERSION parameter.

2022-02-20T09:18:00.541385+08:00

 

3. 表空间相关属性信息

查看数据文件属性,支持自动扩展

SQL> select file_id,file_name,bytes/1024/1024 mb,AUTOEXTENSIBLE,status,maxbytes/1024/1024  from dba_data_files where tablespace_name=’ISMS_DATA’;

 

FILE_ID FILE_NAME                                                  MB AUT STATUS    MAXBYTES/1024/1024

———- ————————————————– ———- — ——— ——————

49 +DATADG/CDB/HISDB/DATAFILE/isms_data01                   2048 YES AVAILABLE         32767.9844

 

SQL> select file_id,file_name,bytes/1024/1024 mb,AUTOEXTENSIBLE,status,maxbytes/1024/1024  from dba_data_files where tablespace_name=’ISMS_IDX’;

 

FILE_ID FILE_NAME                                                  MB AUT STATUS    MAXBYTES/1024/1024

———- ————————————————– ———- — ——— ——————

50 +DATADG/CDB/HISDB/DATAFILE/isms_idx01                    2048 YES AVAILABLE         32767.9844

表空间使用率-2022/2/20 13:45分记录

TABLESPACE_NAME                            Total(M) usage_of_tablespace% left_space(M) Extent_man

—————————————- ———- ——————– ————- ———-

ISMS_IDX                                       2048                   51      1009.125 LOCAL

ISMS_DATA                                      2048                   49          1054 LOCAL

4. 表空间问题分析描述

微信告警的表空间问题的分析:

1.监控报警截图8:52分报ISMS_DATA及ISMS_IDX表空间使用率高,截图中数据文件大小为1GB。

2.检查表空间及数据文件的属性信息,数据文件支持自动扩大(最大32GB)。

3.查看数据库日志-节点2中,在08:52:12及08:56:42均有自动扩大的记录

4.查看数据库日志-节点1,在9:17分人工将数据文件扩大到2GB。

 

表空间告警问题分析总结:

由于ISMS_DATA及ISMS_IDX表空间的属性支持自动扩大(最大32GB),告警时大小为1GB。在数据库日志中08:52:12及08:56:42均有数据库系统自动扩大空间的记录,因此不存在表空间不足导致应用程序无法使用的问题。

 

 

数据库性能分析

对数据库在问题时段段的性能进行分析,结合监控软件中的数据,可以发现当时存在较多的TX锁等待事件,对应的SQL是对应表名:ISMS.ISMS_TICKET_STOCK (TABLE)

对应语句是:

SELECT TICKET_DEPT_CODE, TICKET_DICTIONARY_CODE, TICKET_VERSION, TICKET_START_NUMBER,

TICKET_END_NUMBER, STOCK_NUMBER, UNIT_CODE, STOCK_CREATE_DATE, TICKET_STOCK_CODE

FROM ISMS_TICKET_STOCK WHERE TICKET_STOCK_CODE=:1 FOR UPDATE

SQL执行计划没有问题,查询条件的TICKET_STOCK_CODE字段是惟一属性。

1. 数据库ASH报告中的对应的行锁等待事件

2. 数据库ASH报告中的对应的锁等待SQL语句

对应的SQL语句:select * FROM ISMS_TICKET_STOCK

WHERE   TICKET_STOCK_CODE=:1    FOR UPDATE

 

 

 

3. SQL语句执行计划等分析

SQL信息:

SQL> set linesize 180 pagesize 1800

SQL> select sql_id,PLAN_HASH_VALUE,CHILD_number,EXECUTIONS,BUFFER_GETS/EXECUTIONS,ELAPSED_TIME/EXECUTIONS/1000 from v$sql where sql_id=’7yjz8nhdaad7k’;

 

SQL_ID        PLAN_HASH_VALUE CHILD_NUMBER EXECUTIONS BUFFER_GETS/EXECUTIONS ELAPSED_TIME/EXECUTIONS/1000

————- ————— ———— ———- ———————- —————————-

7yjz8nhdaad7k      2932473141            0       9869             10.4900193                   6640.86119

 

SQL> col plan_table_output for a150

SQL> set long 9000

SQL> select * from table(dbms_xplan.display_cursor(‘7yjz8nhdaad7k’,0,’advanced’));

 

PLAN_TABLE_OUTPUT

——————————————————————————————————————————————————

SQL_ID  7yjz8nhdaad7k, child number 0

————————————-

SELECT       TICKET_DEPT_CODE, TICKET_DICTIONARY_CODE, TICKET_VERSION,

TICKET_START_NUMBER,   TICKET_END_NUMBER, STOCK_NUMBER,   UNIT_CODE,

STOCK_CREATE_DATE, TICKET_STOCK_CODE      FROM ISMS_TICKET_STOCK

WHERE   TICKET_STOCK_CODE=:1    FOR UPDATE

 

Plan hash value: 2932473141

 

—————————————————————————————————–

| Id  | Operation                    | Name                 | Rows  | Bytes | Cost (%CPU)| Time     |

—————————————————————————————————–

|   0 | SELECT STATEMENT             |                      |       |       |     1 (100)|          |

|   1 |  FOR UPDATE                  |                      |       |       |            |          |

|   2 |   TABLE ACCESS BY INDEX ROWID| ISMS_TICKET_STOCK    |     1 |    57 |     1   (0)| 00:00:01 |

|*  3 |    INDEX UNIQUE SCAN         | ISMS_TICKET_STOCK_PK |     1 |       |     0   (0)|          |

 

 

 

4. SQL历史执行信息

 

 

JBOSS中间件问题排查

1. JBOSS中间件报错日志

应用程序维护人员提供的JBOSS中间件日志报错中显示的问题为不能从JBOSS中间件连接池获取链接来连上数据库:

主要报错是:

IJ000453: Unable to get managed connection for java:jboss/datasources/ISMS:

IJ000655: No managed connections available within configured blocking timeout (30000 [ms]) 这些错误说明连接池达到最大值,没有连接可用。

相关日志:

2022-02-20 08:52:30,275 ERROR [com.th.supcom.framework.dubbo.extensions.rpc.filter.ExceptionFilter] (DubboServerHandler-172.20.2.12:20880-thread-143) Got unchecked and undeclared exception which called by 172.20.2.139. service: com.th.supcom.isms.storage.stock.service.IsmsTicketStockOutService, method: eInvoiceSaveUseOfOutDocu, exception: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:jboss/datasources/ISMS: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:jboss/datasources/ISMS

at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:80) [spring-jdbc-3.2.13.RELEASE.jar:3.2.13.RELEASE]

at org.mybatis.spring.SqlSessionUtils.getSqlSession(SqlSessionUtils.java:116) [mybatis-spring-1.0.1.jar:1.0.1]

at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:333) [mybatis-spring-1.0.1.jar:1.0.1]

at com.sun.proxy.$Proxy32.selectList(Unknown Source)

at org.mybatis.spring.SqlSessionTemplate.selectList(SqlSessionTemplate.java:189) [mybatis-spring-1.0.1.jar:1.0.1]

at org.apache.ibatis.binding.MapperMethod.executeForList(MapperMethod.java:100) [mybatis-3.0.5.jar:3.0.5]

at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:70) [mybatis-3.0.5.jar:3.0.5]

at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:38) [mybatis-3.0.5.jar:3.0.5]

 

……………………

Caused by: java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:jboss/datasources/ISMS

at org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:147) [ironjacamar-jdbc-1.0.17.Final-redhat-1.jar:1.0.17.Final-redhat-1]

at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111) [spring-jdbc-3.2.13.RELEASE.jar:3.2.13.RELEASE]

at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:77) [spring-jdbc-3.2.13.RELEASE.jar:3.2.13.RELEASE]

… 55 more

Caused by: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:jboss/datasources/ISMS

at org.jboss.jca.core.connectionmanager.AbstractConnectionManager.getManagedConnection(AbstractConnectionManager.java:390)

at org.jboss.jca.core.connectionmanager.tx.TxConnectionManagerImpl.getManagedConnection(TxConnectionManagerImpl.java:368)

at org.jboss.jca.core.connectionmanager.AbstractConnectionManager.allocateConnection(AbstractConnectionManager.java:464)

at org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:139) [ironjacamar-jdbc-1.0.17.Final-redhat-1.jar:1.0.17.Final-redhat-1]

… 57 more

Caused by: javax.resource.ResourceException: IJ000655: No managed connections available within configured blocking timeout (30000 [ms])

 

 

6. 检查JBOSS中间件连接池的配置信息

发现最大连接池参数设置为50,即允许最多50个。

 

 

 

 

 

 

 

总结与后续处理建议

通过对数据库相关日志以及性能分析,对JBOSS中间件在问题时间点的报错的分析,可以得出如下信息:

1. 微信告警的表空间问题的分析:

监控报警截图8:52分报ISMS_DATA及ISMS_IDX表空间使用率高,截图中数据文件大小为1GB

检查表空间及数据文件的属性信息,数据文件支持自动扩大(最大32GB)。

查看数据库日志-节点2中,在08:52:12及08:56:42均有自动扩大的记录

查看数据库日志-节点1,在9:17分人工将数据文件扩大到2GB。

表空间告警问题分析总结:

由于ISMS_DATA及ISMS_IDX表空间的属性支持自动扩大(最大32GB),告警时大小为1GB。在数据库日志中08:52:12及08:56:42均有自动扩大的记录,因此表空间使用率高不会导致应用程序出现无法使用问题。而当时空间使用量增加较快,说明当时应用程序对于ISMS_DATA有较多的数据写入/修改等动作。

2. 数据库性能分析锁等待的SQL语句

监控软件中显示的问题在8:52至9:07分存在较多TX锁

对应表名:ISMS用户的ISMS_TICKET_STOCK (TABLE)

对应语句是:SELECT …… FROM ISMS_TICKET_STOCK WHERE TICKET_STOCK_CODE=:1 FOR UPDATE

此SQL执行计划及执行速度没有问题,查询条件的TICKET_STOCK_CODE字段是惟一属性。SQL执行速度较慢平均需要33秒,主要是由于应用程序未及时提交事务引起的锁不释放问题。这点从3.4章节的SQLHC报告截图中可以清楚的反映出来。同时这条SQL语句的执行频率在今天上午的问题时间段也较高,达到几千次;而从数据库的历史性能数据中未显示之前有如此高的执行频率,已经将此应用程序执行SQL语句频率高的问题点交由应用软件厂商排查。

3. JBOSS中间件的问题

应用程序维护人员提供的JBOSS中间件日志报错中显示的问题为不能从JBOSS中间件连接池获取链接来连上数据库:主要报错是:

IJ000453: Unable to get managed connection for java:jboss/datasources/ISMS:

IJ000655: No managed connections available within configured blocking timeout (30000 [ms]) 这些错误说明连接池达到最大值,没有连接可用。

 

检查JBOSS中间件连接池相关配置,发现max-pool-size最大连接池参数设置为50,即允许最多50个。

5. 问题发生的原因

应用程序执行问题SQL的次数异常增加(基于SQL执行频率)或程序异常(基于问题SQL不释放锁);

比平时多的并发连接导致了JBOSS连接池用满(JBOSS连接最大50),新的应用程序进程无法从连接池分配链接;

程序报错无法连接到数据库。

 

6. 下一步排查及处理方向

因此,接下来需要及时排查的问题为:

  1. 20日上午问题点时应用程序执行问题SQL的次数异常增加的原因排查。
  2. JBOSS连接池配置max-pool-size当前为50,建议增加到200.并根据后续程序运行情况及时监测、调整。

 

7. 后续运维建议

  1. 及时关注数据库运行状态及告警,这一点会将第三方运维人员也加入告警群组来实现更早、更及时的处理数据库告警信息。
  2. 对应用程序、中间件的各项配置,要再进行排查并结合单位实际业务需求进行调整,特别是连接数等配置,确保够用。

3.定期进行数据库、中间件、应用程序等的巡检维护、性能优化等动作,确保各项配置能够满足不断发展的业务需求。

业务暴增导致Oracle出现性能故障分析一例

跟踪oracle 进程链接建立时候发生的情况

 

本方法主要使用触发器的方式对新建立的进程自动进行10046的trace并进行分析的方式,期间使用到truss,需要掌握truss的使用方法,具体如下:

 

创建用户

create user trace_user identified by trace_user account unlock;
grant create session,dba to trace_user;

创建登陆触发器

create or replace trigger sys.set_trace
after logon on database
when (user = 'trace_user')
declare
lcommand varchar(200);
begin
execute immediate 'alter session set statistics_level=all';
execute immediate 'alter session set max_dump_file_size=unlimited';
execute immediate 'alter session set events ''10046 trace name context forever, level 12''';
end set_trace;
/

 

尝试登陆

sqlplus "trace_user/trace_user" << eof
exit;
eof

这将在$ ORACLE_BASE / admin / <db_name> / udump目录下生成(10046)跟踪文件

然后通过查看“e =”部分查看数据库调用(PARSE,BIND,EXEC,FETCH)是否有大量时间,这意味着数据库调用所消耗的持续时间

通过查看“ela =”部分检查等待事件是否有大量时间,这意味着等待事件消耗的持续时间(可能通过进行系统调用)

PARSING IN CURSOR #5 len=131 dep=1 uid=0 oct=3 lid=0 tim=15687899202263 hv=1389591971 ad='ca9a7948'
select privilege#,nvl(col#,0),max(mod(nvl(option$,0),2))from objauth$ where obj#=:1 and grantee#=:2 group by privilege#,nvl(col#,0)
END OF STMT
EXEC #5:c=0,e=145,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=15687899202260
WAIT #5: nam='db file sequential read' ela= 30005 file#=1 block#=24208 blocks=1 obj#=-1 tim=15687899232346
FETCH #5:c=0,e=30075,p=1,cr=2,cu=0,mis=0,r=0,dep=1,og=4,tim=15687899232396

没有数据库调用占用大量时间

less oracle_ora_1118270.trc | grep -i ",e=" | cut -d , -f2 > karlarao.txt ; sed -n 's/e=/ /p' karlarao.txt | sort -nr | less

 358746
 183162
 61293
 44661
 32580
 30075
 28695
 26950
 25837
 24244
 ...

没有等待事件耗费大量时间

less oracle_ora_1118270.trc | grep -i "ela=" | cut -d " " -f8 | sort -nr | less

30005
28624
13253
11592
9650

没有 “os thread startup”事件!

less oracle_ora_1118270.trc | grep -i "ela=" | cut -d "=" -f2 | uniq

'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'db file sequential read' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'db file sequential read' ela
'SQL*Net message to client' ela
'db file sequential read' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'db file sequential read' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela

下来对系统调用进行检测,这里使用truss工具

这个工具可以让你做一些应用程序跟踪,这里是跟踪SQL * Plus,并显示应用程序对外部库和内核的调用

truss -c sqlplus "/ as sysdba" << EOF
exit;
EOF

truss -d sqlplus "/ as sysdba" << EOF
exit;
EOF

 

“-c”开关计算跟踪的系统调用,故障和信号

“-d”开关显示每行的时间戳(以秒为单位)。跟踪输出的第一行将显示测量各个时间戳的基准时间

这里下面可以发现是DNS服务导致sqlplus“/ as sysdba”操作减慢,并且有“os线程启动”等待,下面的文本是“truss”输出的部分,当在/etc/resolv.conf和/ etc / hosts上读取时,“(sleep……)”事件表示减速。

ENOTTY和ECONNREFUSED是看到的错误,这里此文件/usr/include/sys/errno.h上找到有关错误消息的更多详细信息。以下是它们的解释

 

#define	ENOTTY	25	/* Inappropriate I/O control operation	*/
#define ECONNREFUSED    79      /* Connection refused */

 

根据“truss -d”的记录,可以看到时间信息从2.3379秒(读取/etc/resolv.conf) – > 7.3477秒 – > 17.3489秒 – > 37.3555秒(读数) / etc / hosts) – > 70.3863秒(已连接)

0.3172:        gethostname(0x0FFFFFFFFFFF9920, 256) = 0
0.3175:        _getpid()                        = 1102056
0.3177:        open("/etc/resolv.conf", O_RDONLY) = 8
0.3180:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
0.3182:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
kread(8, " d o m a i n     this is the domain name, i removed it".., 4096)    = 69
kread(8, " d o m a i n     this is the domain name, i removed it".., 4096)    = 0
0.3190:        close(8)                         = 0
0.3194:        socket(2, 2, 0)                  = 8
0.3197:        getsockopt(8, 65535, 4104, 0x0FFFFFFFFFFF70E4, 0x0FFFFFFFFFFF70E0) = 0
0.3199:        connext(8, 0x09001000A0018600, 16) = 0
0.3202:        send(8, 0x0FFFFFFFFFFF7FF0, 41, 0) = 41
0.3205:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) = 1
0.3208:        nrecvfrom(8, 0x0FFFFFFFFFFF9320, 1024, 0, 0x0FFFFFFFFFFF7950, 0x0FFFFFFFFFFF7178) Err#79 ECONNREFUSED
0.3210:        close(8)                         = 0
0.3212:        socket(2, 2, 0)                  = 8
0.3215:        sendto(8, 0x0FFFFFFFFFFF7FF0, 41, 0, 0x09001000A0018610, 16) = 41
0.3217:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) = 1
0.3220:        nrecvfrom(8, 0x0FFFFFFFFFFF9320, 1024, 0, 0x0FFFFFFFFFFF7950, 0x0FFFFFFFFFFF7178) = 108
0.3222:        close(8)                         = 0
0.3224:        socket(2, 2, 0)                  = 8
0.3227:        getsockopt(8, 65535, 4104, 0x0FFFFFFFFFFF70E4, 0x0FFFFFFFFFFF70E0) = 0
0.3229:        connext(8, 0x09001000A0018600, 16) = 0
0.3231:        send(8, 0x0FFFFFFFFFFF7FF0, 28, 0) = 28
0.3233:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) = 1
0.3236:        nrecvfrom(8, 0x0FFFFFFFFFFF9320, 1024, 0, 0x0FFFFFFFFFFF7950, 0x0FFFFFFFFFFF7178) Err#79 ECONNREFUSED
0.3238:        close(8)                         = 0
0.3240:        socket(2, 2, 0)                  = 8
0.3243:        sendto(8, 0x0FFFFFFFFFFF7FF0, 28, 0, 0x09001000A0018610, 16) = 28
0.3245:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) = 1
0.3248:        nrecvfrom(8, 0x0FFFFFFFFFFF9320, 1024, 0, 0x0FFFFFFFFFFF7950, 0x0FFFFFFFFFFF7178) = 28
0.3250:        close(8)                         = 0
0.3252:        socket(2, 2, 0)                  = 8
0.3373:        sendto(8, 0x0FFFFFFFFFFF7FF0, 28, 0, 0x09001000A0018600, 16) = 28
2.3379:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) (sleeping...)
2.3379:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) = 0
5.3468:        close(8)                         = 0
5.3470:        socket(2, 2, 0)                  = 8
5.3472:        sendto(8, 0x0FFFFFFFFFFF7FF0, 28, 0, 0x09001000A0018600, 16) = 28
7.3477:        _poll(0x0FFFFFFFFFFF7190, 1, 10000) (sleeping...)
7.3477:        _poll(0x0FFFFFFFFFFF7190, 1, 10000) = 0
15.3479:        close(8)                        = 0
15.3482:        socket(2, 2, 0)                 = 8
15.3484:        sendto(8, 0x0FFFFFFFFFFF7FF0, 28, 0, 0x09001000A0018600, 16) = 28
17.3489:        _poll(0x0FFFFFFFFFFF7190, 1, 20000) (sleeping...)
17.3489:        _poll(0x0FFFFFFFFFFF7190, 1, 20000) = 0
35.3491:        close(8)                        = 0
35.3495:        getdomainname(0x0FFFFFFFFFFF9480, 256) = 0
35.3497:        open("/etc/hosts", O_RDONLY)    = 8
35.3500:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
35.3502:        kfcntl(8, F_SETFD, 0x0000000000000001) = 0
35.3505:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
kread(8, " #   @ ( # ) 4 7\t 1 . 1".., 4096)    = 3453
kread(8, " #   @ ( # ) 4 7\t 1 . 1".., 4096)    = 0
35.3514:        close(8)                        = 0
kwrite(9, "01 �0603 s".., 432)    = 432
37.3555:        kread(10, "16060280".., 2064) (sleeping...)
kread(10, "06 E06\b".., 2064)   = 1605
70.3762:        open("/oracle/app/product/10.2/db_1/rdbms/mesg/oraus.msb", O_RDONLY) = 8
70.3764:        kfcntl(8, F_SETFD, 0x0000000000000001) = 0
70.3767:        lseek(8, 0, 0)                  = 0
kread(8, "1513 "011303\t\t".., 256)     = 256
70.3772:        lseek(8, 512, 0)                = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
70.3776:        lseek(8, 1024, 0)               = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
70.3781:        lseek(8, 50688, 0)              = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
70.3786:        lseek(8, 512, 0)                = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
70.3790:        lseek(8, 1024, 0)               = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
70.3795:        lseek(8, 50688, 0)              = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
70.3799:        lseek(8, 512, 0)                = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
70.3804:        lseek(8, 1024, 0)               = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
70.3808:        lseek(8, 50688, 0)              = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
70.3813:        lseek(8, 512, 0)                = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
70.3817:        lseek(8, 1024, 0)               = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
70.3822:        lseek(8, 51712, 0)              = 51712
kread(8, "\n0589 D058A".., 512)     = 512
70.3827:        close(8)                        = 0
kwrite(9, " <0611 k".., 60)     = 60
kread(10, " �06\b".., 2064)   = 179
kwrite(9, " U0603 h".., 85)     = 85
kread(10, "1606\b".., 2064)   = 22
kwrite(9, " U0603 h".., 85)     = 85
kread(10, "1606\b".., 2064)   = 22
70.3854:        lseek(4, 512, 0)                = 512
kread(4, "17 �".., 512)     = 512
70.3858:        lseek(4, 1024, 0)               = 1024
kread(4, "16 * R h819E".., 512)     = 512
70.3863:        lseek(4, 4096, 0)               = 4096
kread(4, "\f82 P86".., 512)     = 512

kwrite(1, "\n", 1)                              = 1
Connected to:
kwrite(1, " C o n n e c t e d   t o".., 14)     = 14
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
kwrite(1, " O r a c l e   D a t a b".., 77)     = 77
With the Partitioning, OLAP, Data Mining and Real Application Testing options
kwrite(1, " W i t h   t h e   P a r".., 78)     = 78

kwrite(1, "\n", 1)                              = 1
70.3882:        kfcntl(1, F_GETFL, 0x0000000000000008) = 2
70.3886:        __libc_sbrk(0x0000000000030020) = 0x000000001023C880
70.3892:        access("login.sql", 0)          Err#2  ENOENT
70.3895:        access("/oracle/app/product/10.2/db_1/sqlplus/admin/glogin.sql", 0) = 0
70.3898:        statfs("/oracle/app/product/10.2/db_1/sqlplus/admin/glogin.sql", 0x0FFFFFFFFFFFDA10) = 0

 

事实证明确实是dns出现了问题,在调整后迅速恢复了响应时间

3.0713:        gethostname(0x0FFFFFFFFFFF9900, 256) = 0
3.0716:        _getpid()                        = 2539762
3.0719:        open("/etc/resolv.conf", O_RDONLY) = 8
3.0721:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
3.0724:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
kread(8, " d o m a i n     this is the domain name, i removed it".., 4096)    = 70
kread(8, " d o m a i n     this is the domain name, i removed it".., 4096)    = 0
3.0731:        close(8)                         = 0
3.0735:        socket(2, 2, 0)                  = 8
3.0740:        getsockopt(8, 65535, 4104, 0x0FFFFFFFFFFF70C4, 0x0FFFFFFFFFFF70C0) = 0
3.0742:        connext(8, 0x09001000A0018600, 16) = 0
3.0746:        send(8, 0x0FFFFFFFFFFF7FD0, 41, 0) = 41
3.0750:        _poll(0x0FFFFFFFFFFF7170, 1, 5000) = 1
3.0753:        nrecvfrom(8, 0x0FFFFFFFFFFF9300, 1024, 0, 0x0FFFFFFFFFFF7930, 0x0FFFFFFFFFFF7158) = 108
3.0756:        send(8, 0x0FFFFFFFFFFF7FD0, 28, 0) = 28
3.0758:        _poll(0x0FFFFFFFFFFF7170, 1, 5000) = 1
3.0761:        nrecvfrom(8, 0x0FFFFFFFFFFF9300, 1024, 0, 0x0FFFFFFFFFFF7930, 0x0FFFFFFFFFFF7158) = 28
3.0764:        close(8)                         = 0
3.0767:        getdomainname(0x0FFFFFFFFFFF9460, 256) = 0
3.0769:        open("/etc/hosts", O_RDONLY)     = 8
3.0771:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
3.0774:        kfcntl(8, F_SETFD, 0x0000000000000001) = 0
3.0777:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
kread(8, " #   @ ( # ) 4 7\t 1 . 1".., 4096)    = 3453
kread(8, " #   @ ( # ) 4 7\t 1 . 1".., 4096)    = 0
3.0786:        close(8)                         = 0
kwrite(9, "01 �0603 s".., 432)    = 432
kread(10, "06 H06\b".., 2064)   = 1608
3.0973:        open("/oracle/app/product/10.2/db_1/rdbms/mesg/oraus.msb", O_RDONLY) = 8
3.0975:        kfcntl(8, F_SETFD, 0x0000000000000001) = 0
3.0978:        lseek(8, 0, 0)                   = 0
kread(8, "1513 "011303\t\t".., 256)     = 256
3.0983:        lseek(8, 512, 0)                 = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
3.0988:        lseek(8, 1024, 0)                = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
3.0993:        lseek(8, 50688, 0)               = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
3.0997:        lseek(8, 512, 0)                 = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
3.1002:        lseek(8, 1024, 0)                = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
3.1006:        lseek(8, 50688, 0)               = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
3.1011:        lseek(8, 512, 0)                 = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
3.1015:        lseek(8, 1024, 0)                = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
3.1021:        lseek(8, 50688, 0)               = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
3.1025:        lseek(8, 512, 0)                 = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
3.1029:        lseek(8, 1024, 0)                = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
3.1034:        lseek(8, 51712, 0)               = 51712
kread(8, "\n0589 D058A".., 512)     = 512
3.1038:        close(8)                         = 0
kwrite(9, " <0611 k".., 60)     = 60
kread(10, " �06\b".., 2064)   = 179
kwrite(9, " U0603 h".., 85)     = 85
kread(10, "1606\b".., 2064)   = 22
kwrite(9, " U0603 h".., 85)     = 85
kread(10, "1606\b".., 2064)   = 22
3.1064:        lseek(4, 512, 0)                 = 512
kread(4, "17 �".., 512)     = 512
3.1069:        lseek(4, 1024, 0)                = 1024
kread(4, "16 * R h819E".., 512)     = 512
3.1073:        lseek(4, 4096, 0)                = 4096
kread(4, "\f82 P86".., 512)     = 512

kwrite(1, "\n", 1)                              = 1
Connected to:
kwrite(1, " C o n n e c t e d   t o".., 14)     = 14
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
kwrite(1, " O r a c l e   D a t a b".., 77)     = 77
With the Partitioning, OLAP, Data Mining and Real Application Testing options
kwrite(1, " W i t h   t h e   P a r".., 78)     = 78

kwrite(1, "\n", 1)                              = 1
3.1095:        kfcntl(1, F_GETFL, 0x0000000000000008) = 2
3.1099:        __libc_sbrk(0x0000000000030020)  = 0x000000001023C880
3.1105:        access("login.sql", 0)           Err#2  ENOENT
3.1109:        access("/oracle/app/product/10.2/db_1/sqlplus/admin/glogin.sql", 0) = 0
3.1112:        statfs("/oracle/app/product/10.2/db_1/sqlplus/admin/glogin.sql", 0x0FFFFFFFFFFFD9F0) = 0
跟踪oracle 进程链接建立时候发生的情况

Exadata x6 遭遇大量的 TCP socket 等待事件

客户从IBM POWER迁移到EXADATA X6后遭遇大量的TCP SOCKET等待,该等待开销了大约15%的cpu资源, 系统运行未受影响,待过一段时间后,系统cpu降低,该等待减少.

经过分析客户大量使用了utl相关东西,网络延时不稳定. 该等待事件与HTTP和网络的延时有关.

和power上历史允许的性能数据对比发现历史也存在TCP SOCKET在top 5里.

如下图:

历史小机运行信息:

IMG_3634

x6运行信息:

IMG_3632

 

X6网络情况:

IMG_3633

官方给予的解释:数据库层面不用理会,关注http和网络相关情况.

“TCP Socket (KGAS)” Reference Note

This is a reference note for the wait event “TCP Socket (KGAS)” which includes the following subsections:

See Note:61998.1 for an introduction to Wait Events.

Definition:

  • Versions:10.2 – Documentation: None
  • KGAS is a component in the server which handles TCP/IP sockets which is typically used in dedicated connections in 10.2+ by some PLSQL built in packages such as UTL_HTTP and UTL_TCP.
    The KGAS interface is not involved in client / server communication but is a layer which may be used when a session on the server makes some outbound TCP/IP call using a PLSQL package such as UTL_HTTP. Packages such as UTL_FILE have also reported this wait when making an SMTP call.
  • Note that this wait event is new in 10gR2. Earlier versions of Oracle doing the same operations would still wait inside KGAS for socket operations but such waits were not instrumented and so did not show up as waits.

Individual Waits:

 Parameters:	
    P1 = operation
    P2 = Not used
    P3 = Not used

	operation   P1 gives an indication of what 
		    operation is being performed. 
		    Possible values include:
			 1 Event Post
			 2 Call
			 3 Disconnect
			 4 Receive data
			 5 Send data
			 6 Wait for an event (eg wait for data to arrive)
			 7 Sleep
			 8 Special wait (for single object)
			 9 Special wait (for multiple objects)
			10 Select operation
                

 Wait Time:	
    The wait blocks until the current operation completes
    (or times out / errors as appropriate).

Systemwide Waits:

If the TIME spent waiting for this event is significant then it is best to determine which sessions are showing the wait and drill into what those sessions are doing as the wait is usually related to whatever application code is executing .
eg: What part of the application may be using UTL_HTTP or similar and is experiencing waits. This statement can be used to see which sessions may be worth tracing:

  SELECT sid, total_waits, time_waited
    FROM v$session_event
   WHERE event='TCP Socket (KGAS)'
     and total_waits>0
   ORDER BY 3,2
  ;

Reducing Waits / Wait times:

The waits incurred depend on what sockets are being opened to which remote end points and for what reason. To help find the origin of the socket operations try:

  • Check the current SQL / module / action of V$SESSION for sessions that are waiting on the event at the time that they are waiting to try and identify any common area of application code waiting on the event.
  • Get an ERRORSTACK level 3 dump of some sessions waiting on the event. This should help show the exact PLSQL and C call stacks invoking the socket operation if the dump is taken when the session is waiting. Customers may need assistance from Oracle Support in order to get and interpret such a dump but it can help pinpoint the relevant application code.
  • Trace sessions incurring the waits including wait tracing to try and place the waits in the context of the code executing around the waits. eg: Use event 10046 level 8 or DBMS_MONITOR.SESSION_TRACE_ENABLE.
  • Use DBA_DEPENDENCIES to find any application packages which may ultimately be using UTL_HTTP or UTL_TCP underneath for some operation.

Note that there are no real tunables within Oracle for these waits as they involve the session making a call to some remote TCP/IP socket and typically waiting on data from that source. Once you know what is being called, and why, then you can determine if the response times from that remote source are sensible or not and if not why.Example:
Execute the following SQL from a session on a dedicated connection and then check the resulting trace file to see “TCP Socket (KGAS)” waits:

alter session set events '10046 trace name context forever, level 8';
select utl_http.request('http://www.oracle.com/') from dual;

Related:

Tracing User sessions Note:404833.1

关于11g密码大小写验证以及密码延迟验证特性引发的血案所思

在Oracle的11g之前的版本中密码是不区分大小写的(使用双引号强制除外)。在Oracle的11g版本中对此有所增强。从此密码有了大小写的区分,这个大小写敏感特性是通过SEC_CASE_SENSITIVE_LOGON参数来控制的。该参数默认设置为true。

前阵子客户数据从10g迁往11g后,这个参数未修改,新上线第二天全国业务人员输入密码时候参照以前习惯密码全部小写,导致部分业务无法登陆,由于反复尝试登陆导致触发了著名的登陆延迟验证的特性延伸出了library cache lock(Delay after three failed login attempts),导致全线应用响应缓慢假死。当然解决过程当时根据现象,进程达到限制值,做了等待事件模型的分析,发现大量的session都没有成功登陆,很多信息都为null逐判断因为密码延迟验证的原因导致,而密码错误的事情再经过现场验证后发现了大小写敏感的问题。后面就是很熟套的剧情,领导抽了根烟,痛骂了原厂一顿,把此2个特性都关掉后,没有再出现过进程累积的现象。由此问题引发了对密码大小写验证以及密码延迟验证特性的思考.

一:关于sec_case_sensitive_logon的设置

关于sec_case_sensitive_logon参数所关联的密码大小写敏感,我建议在升级数据库系统时候关闭,在新建数据库系统时候关闭。可以配合应用密码策略使用,默认时候可以商讨关闭。

二:密码延迟验证的特性使用与否
关于Delay after three failed login attempts的特性参考yangtingkun的http://blog.itpub.net/4227/viewspace-672925

这个密码延迟特性特性的解决办法有几种,这里我大概描述下:

    1.设置登陆密码验证失败次数超过几次后锁定用户,该办法可以防止用户反复登陆验证,但是也会增加一定的维护工作
    2.通过诊断事件28401关闭密码延迟验证

同时我并不建议用第二种办法,密码延迟验证的特性可以有效遏制恶意破解密码的行为,因此从数据库安全的角度我建议设置密码失败超数锁用户。

关于event 28401

ALTER SYSTEM SET EVENT = '28401 TRACE NAME CONTEXT FOREVER, LEVEL 1' SCOPE = SPFILE

[oracle@ludatou ~]$ oerr ora 28401
28401, 00000, "Event to disable delay after three failed login attempts"
// *Document: NO
// *Cause: N/A
// *Action: Set this event in your environment to disable the login delay
//          which will otherwise take place after three failed login attempts.
// *Note: THIS IS NOT A USER ERROR NUMBER/MESSAGE. THIS DOES NOT NEED TO BE
//        TRANSLATED OR DOCUMENTED.

Oracle Rac hang 案例分析一则(WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK)

LOG FILE
————–
Filename=zsinsraca1_151006_0000.nmon.xlsx
See the following:

+++ CPU_ALL ========> CPU usage is NOT high

CPU Total zsinsraca1 User% Sys% Wait% Idle% CPU% Logical CPUs (SMTmode=4)
10/6/15 10:30 AM 5.2 1.7 0.2 92.9 6.9 48
10/6/15 10:31 AM 4.3 1.6 0.1 94 5.9 48
10/6/15 10:32 AM 7.6 1.5 0.2 90.6 9.1 48
10/6/15 10:33 AM 8.6 1.7 0 89.7 10.3 48
10/6/15 10:34 AM 8.8 1.8 0 89.3 10.6 48
10/6/15 10:35 AM 10.7 2 0.1 87.2 12.7 48
10/6/15 10:36 AM 9.3 2.1 0.1 88.6 11.4 48
10/6/15 10:37 AM 9.4 2.1 0.1 88.4 11.5 48
10/6/15 10:38 AM 10.7 2.5 0.2 86.6 13.2 48
10/6/15 10:39 AM 10.2 2.4 0.1 87.3 12.6 48
10/6/15 10:40 AM 9.6 1.8 0 88.5 11.4 48
10/6/15 10:41 AM 9.6 2.1 0 88.2 11.7 48
10/6/15 10:42 AM 9 2.2 0.1 88.7 11.2 48
10/6/15 10:43 AM 9.2 1.7 0 89 10.9 48
10/6/15 10:44 AM 11 1.8 0 87.2 12.8 48
10/6/15 10:45 AM 10.7 1.9 0.1 87.3 12.6 48
10/6/15 10:46 AM 12.9 2.1 0.1 84.9 15 48
10/6/15 10:47 AM 9.4 1.5 0 89 10.9 48
10/6/15 10:48 AM 11.3 1.6 0.1 87.1 12.9 48
10/6/15 10:49 AM 11.8 2.4 0.1 85.7 14.2 48
10/6/15 10:50 AM 12.2 1.8 0 85.9 14 48
10/6/15 10:51 AM 13.2 1.9 0.1 84.8 15.1 48
10/6/15 10:52 AM 12.4 1.3 0.1 86.3 13.7 48
10/6/15 10:53 AM 12.7 1.8 0 85.5 14.5 48
10/6/15 10:54 AM 11.9 1.9 0.1 86.2 13.8 48
10/6/15 10:55 AM 10.2 2 0.1 87.6 12.2 48
10/6/15 10:56 AM 11.5 2.4 0.2 85.9 13.9 48
10/6/15 10:57 AM 11.4 1.8 0.1 86.6 13.2 48
10/6/15 10:58 AM 11.9 1.7 0 86.3 13.6 48
10/6/15 10:59 AM 10.3 1.6 0 88.1 11.9 48
10/6/15 11:00 AM 12.3 1.5 0.2 86 13.8 48
10/6/15 11:01 AM 12.1 1.7 0.1 86.2 13.8 48
10/6/15 11:02 AM 10.6 1.6 0.1 87.8 12.2 48
10/6/15 11:03 AM 13.8 2.4 0.1 83.7 16.2 48
10/6/15 11:04 AM 12.5 2.6 0.1 84.9 15.1 48
10/6/15 11:05 AM 11.4 1.8 0.1 86.7 13.2 48
10/6/15 11:06 AM 10.1 2.3 0.1 87.5 12.4 48
10/6/15 11:07 AM 9.7 2.1 0.1 88.2 11.8 48
10/6/15 11:08 AM 10.5 2 0 87.5 12.5 48
10/6/15 11:09 AM 11.3 1.1 0.1 87.5 12.4 48
10/6/15 11:10 AM 11.6 1.8 0.1 86.6 13.4 48
10/6/15 11:11 AM 9.5 1.4 0.1 89 10.9 48
10/6/15 11:12 AM 8.4 1.4 0 90.2 9.8 48
10/6/15 11:13 AM 9 1.6 0 89.3 10.6 48
10/6/15 11:14 AM 10.1 1.9 0 87.9 12 48
10/6/15 11:15 AM 11.7 2.2 0.1 86.1 13.9 48
10/6/15 11:16 AM 10.5 1.9 0.1 87.6 12.4 48
10/6/15 11:17 AM 11 1.2 0.1 87.8 12.2 48
10/6/15 11:18 AM 12.6 2.2 0.1 85.2 14.8 48
10/6/15 11:19 AM 9.9 1.6 0.1 88.4 11.5 48
10/6/15 11:20 AM 11.3 2.1 0.1 86.5 13.4 48
10/6/15 11:21 AM 11.4 1.3 0.1 87.3 12.7 48
10/6/15 11:22 AM 11.6 1.5 0.1 86.8 13.1 48
10/6/15 11:23 AM 12.7 3.2 0.1 84 15.9 48
10/6/15 11:24 AM 11.8 3.1 0.1 85.1 14.9 48
10/6/15 11:25 AM 9.1 1.5 0 89.3 10.6 48
10/6/15 11:26 AM 11.1 1.9 0.1 86.9 13 48
10/6/15 11:27 AM 13.3 3.9 0.4 82.4 17.2 48
10/6/15 11:28 AM 10.3 2 0.8 86.9 12.3 48
10/6/15 11:29 AM 10.2 1.2 0 88.6 11.4 48
10/6/15 11:30 AM 11.3 1.7 0 87 13 48
10/6/15 11:31 AM 13.9 2.8 0.1 83.2 16.7 48
10/6/15 11:32 AM 13.2 3.1 0.1 83.6 16.3 48
10/6/15 11:33 AM 10.9 2.2 0.1 86.8 13.1 48
10/6/15 11:34 AM 10.6 2.6 0.1 86.8 13.2 48
10/6/15 11:35 AM 12 2.6 0 85.4 14.6 48
10/6/15 11:36 AM 14.5 5.3 0.1 80.1 19.8 48
10/6/15 11:37 AM 11.4 1.6 0.1 86.9 13 48
10/6/15 11:38 AM 10.8 1.7 0 87.5 12.5 48
10/6/15 11:39 AM 9.7 1.6 0 88.6 11.3 48
10/6/15 11:40 AM 13.2 1.9 0 84.9 15.1 48
10/6/15 11:41 AM 10.6 1.3 0 88.2 11.9 48
10/6/15 11:42 AM 12.1 4 0.2 83.7 16.1 48
10/6/15 11:43 AM 11.9 4.4 0 83.6 16.3 48
10/6/15 11:44 AM 10.1 1.6 0 88.3 11.7 48
10/6/15 11:45 AM 12.4 2.4 0.4 84.8 14.8 48
10/6/15 11:46 AM 15.3 6.5 1.3 77 21.8 48
10/6/15 11:47 AM 13.4 4.4 1 81.2 17.8 48
10/6/15 11:48 AM 16.5 6.8 0.1 76.6 23.3 48
10/6/15 11:49 AM 15.9 5.1 0.1 78.9 21 48
10/6/15 11:50 AM 8.5 4.8 0.2 86.5 13.3 48

+++ MEM =========> free memory 在11:50时仍有3GB free

Memory zsinsraca1 Real Free % Virtual free % Real free(MB) Virtual free(MB) Real total(MB) Virtual total(MB)
10/6/15 10:30 AM 7.6 88 7476.4 28845 98304 32768
10/6/15 10:31 AM 7.6 88 7472.8 28845 98304 32768
10/6/15 10:32 AM 7.6 88 7454.3 28845 98304 32768
10/6/15 10:33 AM 7.5 88 7412.8 28845 98304 32768
10/6/15 10:34 AM 7.5 88 7374.2 28845.5 98304 32768
10/6/15 10:35 AM 7.4 88 7303.3 28845.5 98304 32768
10/6/15 10:36 AM 7.3 88 7147.1 28845.5 98304 32768
10/6/15 10:37 AM 7.3 88 7210.1 28845.5 98304 32768
10/6/15 10:38 AM 7.3 88 7170 28845.5 98304 32768
10/6/15 10:39 AM 7.3 88 7128 28845.5 98304 32768
10/6/15 10:40 AM 7.2 88 7126.4 28845.5 98304 32768
10/6/15 10:41 AM 7.2 88 7104.1 28845.5 98304 32768
10/6/15 10:42 AM 7.2 88 7052 28845.5 98304 32768
10/6/15 10:43 AM 7 88 6888.5 28845.6 98304 32768
10/6/15 10:44 AM 7.1 88 7028.3 28845.6 98304 32768
10/6/15 10:45 AM 7 88 6914.8 28845 98304 32768
10/6/15 10:46 AM 7 88 6917.3 28844.5 98304 32768
10/6/15 10:47 AM 7 88 6886.8 28844.1 98304 32768
10/6/15 10:48 AM 6.9 88 6822.9 28843.3 98304 32768
10/6/15 10:49 AM 6.8 88 6668 28842 98304 32768
10/6/15 10:50 AM 6.7 88 6602.4 28841.1 98304 32768
10/6/15 10:51 AM 6.6 88 6463.4 28840 98304 32768
10/6/15 10:52 AM 6.3 88 6235.7 28838.4 98304 32768
10/6/15 10:53 AM 6.5 88 6397.8 28837.6 98304 32768
10/6/15 10:54 AM 6.5 88 6384.6 28837.4 98304 32768
10/6/15 10:55 AM 6.5 88 6340.9 28836.2 98304 32768
10/6/15 10:56 AM 6.2 88 6123.9 28836.2 98304 32768
10/6/15 10:57 AM 6.3 88 6228.7 28835.8 98304 32768
10/6/15 10:58 AM 6.2 88 6143.9 28834.9 98304 32768
10/6/15 10:59 AM 6.2 88 6111.3 28834.2 98304 32768
10/6/15 11:00 AM 6.1 88 6017.9 28833.1 98304 32768
10/6/15 11:01 AM 5.9 88 5837.6 28832.9 98304 32768
10/6/15 11:02 AM 6 88 5888.8 28832 98304 32768
10/6/15 11:03 AM 5.9 88 5848.5 28831.3 98304 32768
10/6/15 11:04 AM 5.7 88 5606.2 28827.8 98304 32768
10/6/15 11:05 AM 5.7 88 5555.2 28827.1 98304 32768
10/6/15 11:06 AM 5.6 88 5503.2 28826.3 98304 32768
10/6/15 11:07 AM 5.4 88 5271.7 28825.5 98304 32768
10/6/15 11:08 AM 5.5 88 5388.1 28824.7 98304 32768
10/6/15 11:09 AM 5.3 88 5239.6 28824.7 98304 32768
10/6/15 11:10 AM 5.5 88 5391.1 28824.1 98304 32768
10/6/15 11:11 AM 5.4 88 5312.4 28823.9 98304 32768
10/6/15 11:12 AM 5.4 88 5276.5 28823.4 98304 32768
10/6/15 11:13 AM 5.4 88 5325.3 28823.3 98304 32768
10/6/15 11:14 AM 5.3 88 5231.7 28822.8 98304 32768
10/6/15 11:15 AM 5.2 88 5157.4 28821.8 98304 32768
10/6/15 11:16 AM 5.1 88 5054 28820.6 98304 32768
10/6/15 11:17 AM 5.1 88 5017.2 28820.5 98304 32768
10/6/15 11:18 AM 5 88 4928.5 28819.5 98304 32768
10/6/15 11:19 AM 5.1 88 5049.6 28821.2 98304 32768
10/6/15 11:20 AM 5.1 88 5015.3 28820.4 98304 32768
10/6/15 11:21 AM 5 88 4935.5 28819.6 98304 32768
10/6/15 11:22 AM 4.9 87.9 4815.3 28818.4 98304 32768
10/6/15 11:23 AM 4.7 87.9 4594.5 28816 98304 32768
10/6/15 11:24 AM 4.6 87.9 4528.6 28815.4 98304 32768
10/6/15 11:25 AM 4.6 87.9 4550.3 28814.9 98304 32768
10/6/15 11:26 AM 4.6 87.9 4486.5 28814.3 98304 32768
10/6/15 11:27 AM 4.4 87.9 4361.3 28813.4 98304 32768
10/6/15 11:28 AM 4.4 87.9 4287 28812.4 98304 32768
10/6/15 11:29 AM 4.3 87.9 4251.7 28811.8 98304 32768
10/6/15 11:30 AM 4.3 87.9 4187.3 28811 98304 32768
10/6/15 11:31 AM 4.2 87.9 4097.9 28810.4 98304 32768
10/6/15 11:32 AM 4.2 87.9 4101.4 28809.9 98304 32768
10/6/15 11:33 AM 4.1 87.9 4062.2 28809.6 98304 32768
10/6/15 11:34 AM 4 87.9 3960.2 28808.4 98304 32768
10/6/15 11:35 AM 3.9 87.9 3858.5 28806.8 98304 32768
10/6/15 11:36 AM 3.9 87.9 3790.8 28805.9 98304 32768
10/6/15 11:37 AM 3.8 87.9 3774.6 28805.7 98304 32768
10/6/15 11:38 AM 3.6 87.9 3586.9 28805.1 98304 32768
10/6/15 11:39 AM 3.9 87.9 3838.9 28804.7 98304 32768
10/6/15 11:40 AM 3.8 87.9 3763 28804.7 98304 32768
10/6/15 11:41 AM 3.7 87.9 3661.5 28804 98304 32768
10/6/15 11:42 AM 3.8 88.1 3763 28864.2 98304 32768
10/6/15 11:43 AM 3.8 88.1 3753.1 28864.3 98304 32768
10/6/15 11:44 AM 3.7 88.1 3676.4 28864.3 98304 32768
10/6/15 11:45 AM 3.5 88.1 3405.6 28864.3 98304 32768
10/6/15 11:46 AM 3.5 88.1 3484.2 28864.4 98304 32768
10/6/15 11:47 AM 3.5 88.1 3477 28864.5 98304 32768
10/6/15 11:48 AM 3.6 88.1 3570.6 28864.5 98304 32768
10/6/15 11:49 AM 4.7 88.1 4646.2 28868.9 98304 32768
10/6/15 11:50 AM 34.5 98 33959.3 32123.8 98304 32768

LOG FILE
————–
Filename=alert_zsiims1.log
See the following:

Tue Oct 6 02:34:52 2015
Thread 1 advanced to log sequence 37413 (LGWR switch)
Current log# 1 seq# 37413 mem# 0: +V5DATA/zsiims/onlinelog/group_1.9844.799569609
Tue Oct 6 10:50:38 2015
ALTER SYSTEM ARCHIVE LOG
Tue Oct 6 10:50:39 2015
Thread 1 advanced to log sequence 37414 (LGWR switch)
Current log# 3 seq# 37414 mem# 0: +V5DATA/zsiims/onlinelog/group_3.9851.799569423
Tue Oct 6 11:21:49 2015
Errors in file /oracle/admin/zsiims/bdump/zsiims1_j001_10027156.trc:
ORA-12012: error on auto execute of job 54375
ORA-27468: “EXFSYS.RLM$SCHDNEGACTION” is locked by another process
Tue Oct 6 11:23:07 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=142
System State dumped to trace file /oracle/admin/zsiims/udump/zsiims1_ora_20971728.trc
Tue Oct 6 11:25:43 2015
Errors in file /oracle/admin/zsiims/bdump/zsiims1_j002_6815818.trc:
ORA-12012: error on auto execute of job 54374
ORA-27468: “EXFSYS.RLM$EVTCLEANUP” is locked by another process
Tue Oct 6 11:26:57 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=28
System State dumped to trace file /oracle/admin/zsiims/bdump/zsiims1_smon_15335466.trc
Tue Oct 6 11:35:26 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=193
System State dumped to trace file /oracle/admin/zsiims/udump/zsiims1_ora_32309462.trc
Tue Oct 6 11:41:27 2015
Shutting down instance: further logons disabled

LOG FILE
————–
Filename=alert_zsiims2.log
See the following:
Tue Oct 6 10:50:40 2015
Thread 2 advanced to log sequence 25153 (LGWR switch)
Current log# 5 seq# 25153 mem# 0: +V5DATA/zsiims/onlinelog/group_5.9853.799569371
Tue Oct 6 11:45:06 2015
IPC Send timeout detected.Sender: ospid 7798804
Receiver: inst 1 binc 4 ospid 6881304
Tue Oct 6 11:45:08 2015
Trace dumping is performing id=[cdmp_20151006114508]
Tue Oct 6 11:48:58 2015
Shutting down instance (abort)
License high water mark = 450
Instance terminated by USER, pid = 20578736
Tue Oct 6 11:49:50 2015
Starting ORACLE instance (normal)
sskgpgetexecname failed to get name
LICENSE_MAX_SESSION = 0

Filename=zsiims1_ora_20971728A.trc ===> 发生问题后产生的第一个systemstate dump, 在Oct 6 11:23:07AM产生的,此时数据库的状况应该已经很糟糕了
See the following:

50几个session在等待latch: row cache objects

33: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=45472
37: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=18549
68: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=13886
71: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=46595
81: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=48525
106: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=11855
118: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=11183
122: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=8913
132: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=12194
140: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=12643
149: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=12173
154: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=15022
156: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=12143
160: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=20615
173: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=11470
174: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=11668
177: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=11249
183: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=10725
184: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=11485
200: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=9970
204: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=11019
207: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=8356
208: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=8811
215: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=8914
219: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=10169
230: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=9156
241: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=7911
244: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=8785
249: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=17542
255: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=9209
264: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=9247
276: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=8368
288: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=8211
310: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=7413
317: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=17869
331: waiting for ‘latch: row cache objects'[Rcache object=700000418a9adf8,] seq=5309
352: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=2625
353: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=10183
361: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=5948
364: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=5303
366: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=5300
378: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=760
381: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=4788
394: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=4758
401: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=109
424: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=2924
429: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=3120
432: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=2426
454: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=1861
457: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=1579
463: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=1603
468: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=1016
470: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=1076
486: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=501
576: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=16368
606: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=30065

还有很多session在等待row cache lock

28: waiting for ‘row cache lock’ [Rcache object=70000064a573ef8,] seq=40578
40: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=59768
93: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=2938
98: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=2777
193: waiting for ‘row cache lock’ [Rcache object=700000621252470,] seq=966
199: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=107
214: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=3501
220: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=86
232: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=1148
239: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=12734
252: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=2908
259: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=1134
260: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=65
266: waiting for ‘row cache lock’ [Rcache object=700000550af2690,] seq=1791
268: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=1288
274: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=821
278: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=1357
287: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=637
289: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=22
293: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=3363
302: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=27
319: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=98
320: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=366
323: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=2618
327: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=167
328: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=1137
330: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=31
332: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=53
333: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=28
334: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=368
337: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=202
341: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=56
342: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=18
343: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=28
345: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=359
349: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=562
350: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=24
351: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=407
358: waiting for ‘row cache lock’ [Rcache object=70000078b3d6290,] seq=28
359: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=32
363: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=32
365: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=27
371: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=29
373: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=28
375: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=89
379: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=605
383: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=523
385: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=595
387: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=243
390: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=28
391: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=355
393: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=20
395: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=28
397: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=1080
404: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=29
409: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=22
410: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=60
411: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=1101
413: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=95
414: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=948
416: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=315
417: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=516
419: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=102
420: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=607
421: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=831
426: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=42
433: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=58
437: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=421
442: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=44
445: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=185
447: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=35
450: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=55
456: waiting for ‘row cache lock’ [Rcache object=700000789bc8cc8,] seq=349
459: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=127
465: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=234
474: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=52
475: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=25
478: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=41
490: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=49
610: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=15070

分析发生问题时候的systemstate dump trace zsiims1_ora_20971728A.trc:

50几个session在等待latch: row cache objects, 大部分是和700000768fd8ea8有关的

33: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=45472
37: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=18549
68: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=13886
71: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=46595
81: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=48525
106: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=11855
118: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=11183
122: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=8913
132: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=12194
140: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=12643
149: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=12173
154: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=15022
156: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=12143
160: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=20615
173: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=11470
174: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=11668
177: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=11249
183: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=10725
184: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=11485
200: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=9970
204: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=11019
207: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=8356
208: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=8811
215: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=8914
219: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=10169
230: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=9156
241: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=7911
244: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=8785
249: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=17542
255: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=9209
264: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=9247
276: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=8368
288: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=8211
310: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=7413
317: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=17869
331: waiting for ‘latch: row cache objects'[Rcache object=700000418a9adf8,] seq=5309
352: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=2625
353: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=10183
361: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=5948
364: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=5303
366: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=5300
378: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=760
381: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=4788
394: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=4758
401: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=109
424: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=2924
429: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=3120
432: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=2426
454: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=1861
457: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=1579
463: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=1603
468: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=1016
470: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=1076
486: waiting for ‘latch: row cache objects'[Latch 70000075f1cd020] seq=501
576: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=16368
606: waiting for ‘latch: row cache objects'[Latch 700000768fd8ea8] seq=30065

以其中一个比较典型的process为例:

PROCESS 576:
—————————————-
SO: 70000078326f638, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=576, calls cur/top: 70000039983fd08/70000039983f788, flag: (0) –
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 109 0 4
last post received-location: kslpsr
last process to post me: 700000786279d80 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: 700000786279d80 1 6
(latch info) wait_event=0 bits=0
Location from where call was made: kghfrunp: clatch: wait:
waiting for 700000768fd8ea8 Child row cache objects level=4 child#=9
Location from where latch is held: kghfrunp: clatch: wait: ===============> 可以看到这个latch是因为kghfrunp相关的功能发起的,而这个功能是需要从shared pool中释放空间有关
Context saved from call: 0 ===========================>kghfrunp: (KGH: Ask client to free unpinned space)
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
207 (5, 1444101815, 2)
174 (5, 1444101815, 2)
322 (5, 1444101815, 2)

还有很多session在等待row cache lock, 最多的对象70000078e9f0d28

28: waiting for ‘row cache lock’ [Rcache object=70000064a573ef8,] seq=40578
40: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=59768
93: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=2938
98: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=2777
193: waiting for ‘row cache lock’ [Rcache object=700000621252470,] seq=966
199: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=107
214: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=3501
220: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=86
232: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=1148
239: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=12734
252: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=2908
259: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=1134
260: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=65
266: waiting for ‘row cache lock’ [Rcache object=700000550af2690,] seq=1791
268: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=1288
274: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=821
278: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=1357
287: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=637
289: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=22
293: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=3363
302: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=27
319: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=98
320: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=366
323: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=2618
327: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=167
328: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=1137
330: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=31
332: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=53
333: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=28
334: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=368
337: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=202
341: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=56
342: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=18
343: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=28
345: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=359
349: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=562
350: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=24
351: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=407
358: waiting for ‘row cache lock’ [Rcache object=70000078b3d6290,] seq=28
359: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=32
363: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=32
365: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=27
371: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=29
373: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=28
375: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=89
379: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=605
383: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=523
385: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=595
387: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=243
390: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=28
391: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=355
393: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=20
395: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=28
397: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=1080
404: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=29
409: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=22
410: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=60
411: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=1101
413: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=95
414: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=948
416: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=315
417: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=516
419: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=102
420: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=607
421: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=831
426: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=42
433: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=58
437: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=421
442: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=44
445: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=185
447: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=35
450: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=55
456: waiting for ‘row cache lock’ [Rcache object=700000789bc8cc8,] seq=349
459: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=127
465: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=234
474: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=52
475: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=25
478: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=41
490: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=49
610: waiting for ‘row cache lock’ [Rcache object=70000078e9f0d28,] seq=15070

以其中的process 610为例:

PROCESS 610:
—————————————-
SO: 700000784271270, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=610, calls cur/top: 700000387cb3b58/700000387cb35d8, flag: (0) –
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 109 0 4
last post received-location: kslpsr
last process to post me: 700000786279d80 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: 700000786279d80 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 7000007852bdfc8
O/S info: user: oracle, term: UNKNOWN, ospid: 39321988
OSD pid info: Unix process pid: 39321988, image: oracle@zsinsraca1
Short stack dump:
ksdxfstk+002c< -ksdxcb+04e4<-sspuser+0074<-000047DC<-sskgpwwait+0034<-skgpwwait+00bc<-ksliwat+06c0<-kslwaitns_timed+0024<-kskthbwt+022c<-kslwait+00f4<-kqrigt+05e0<-kqrpre1+0960<-kqrpre+001c<-ktusmous_online_undoseg+05f4<-ktusmaus_add_us+0300<-ktubnd+04ac<-ktuchg+055c<-ktbchg2+01f0<-kdtchg+0524<-kdtwrp+2540<-kdtInsRow+0a20<-insrow+043c<-insdrv+0428<-inscovexe+02b8<-insExecStmtExecIniEngine+005c<-insexe+02f8<-opiexe+2738<-opipls+185c<-opiodr+0ae0<-rpidrus+01bc<-skgmstack+00c8<-rpidru+0088<-rpiswu2+034c<-rpidrv+095c<-psddr0+02bc<-psdnal+01d0<-pevm_EXECC+01f8<-pfrinstr_EXECC+0070<-pfrrun_no_tool+005c<-pfrrun+1014<-plsql_run+06b4<-pricar+042c<-pricbr+0258<-prient2+03a8<-prient+0b0c<-kkxrpc+0250<-kporpc+01d4<-opiodr+0ae0<-ttcpip+1020<-opitsk+1124<-opiino+0990<-opiodr+0ae0<-opidrv+0484<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0070 ---------------------------------------- SO: 70000078f2c84a0, type: 4, owner: 700000784271270, flag: INIT/-/-/0x00 (session) sid: 236 trans: 7000007619fcf08, creator: 700000784271270, flag: (100041) USR/- BSY/-/-/-/-/- DID: 0001-025F-00004ADB, short-term DID: 0001-0262-00000005 txn branch: 700000760262cd0 oct: 2, prv: 0, sql: 70000043a39dda8, psql: 7000006feac35a8, user: 88/ZSCLMINTF service name: zsiims O/S info: user: oracle, term: , ospid: 10027038, machine: zsinsraca2 program: oracle@zsinsraca2 (TNS V1-V3) application name: oracle@zsinsraca2 (TNS V1-V3), hash value=450405937 waiting for 'row cache lock' blocking sess=0x0 seq=15070 wait_time=0 seconds since wait started=1605 cache id=3, mode=0, request=3 ============> cache id为3, 即dc_rollback_segments上的争用
Dumping Session Wait History
for ‘row cache lock’ count=1 wait_time=1388976
cache id=3, mode=0, request=3

此时我们并不能简单的认为,这个问题和rollback/undo有关
因为此时LCK0进程也在等待和kghfrunp(释放shared pool中内存)有关的latch,而在RAC环境里, row cache lock/library cache lock这样的instance lock的获取和释放必须经过LCK0进程,如果LCK0异常,那么就会发生row cache lock的问题.

PROCESS 33:
—————————————-
SO: 70000078627a560, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=33, calls cur/top: 7000007853a1520/7000007853a1520, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 33
last post received-location: ksrpublish
last process to post me: 700000782261488 12 0
last post sent: 0 0 21
last post sent-location: ksbria
last process posted by me: 70000078627a560 1 6
(latch info) wait_event=0 bits=0
Location from where call was made: kqrbip:
waiting for 70000075f1cd020 Child row cache objects level=4 child#=16
Location from where latch is held: kghfrunp: clatch: nowait:
Context saved from call: 0
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
33 (3, 3, 3)
431 (3, 1444101789, 3)
486 (3, 1444101789, 3)
492 (3, 1444101789, 3)
468 (3, 1444101789, 3)
321 (3, 1444101789, 3)
470 (3, 1444101789, 3)
66 (3, 1444101789, 3)
276 (3, 1444101789, 3)
388 (3, 1444101789, 3)
207 (3, 1444101789, 3)
205 (3, 1444101789, 3)
443 (3, 1444101789, 3)
waiter count=13
gotten 1588533458 times wait, failed first 46002888 sleeps 3053343
gotten 2849791 times nowait, failed: 4267742
possible holder pid = 75 ospid=32440736
on wait list for 70000075f1cd020
Process Group: DEFAULT, pseudo proc: 7000007852bdfc8
O/S info: user: oracle, term: UNKNOWN, ospid: 6881304
OSD pid info: Unix process pid: 6881304, image: oracle@zsinsraca1 (LCK0)
Short stack dump:
ksdxfstk+002c< -ksdxcb+04e4<-sspuser+0074<-000047DC<-sskgpwwait+0034<-skgpwwait+00bc<-kslges+054c<-kslgetl+033c<-kqrbip+03c8<-kqrbfr+01ec<-kqrbtm+00c8<-ksbcti+03c8<-ksbabs+03fc<-kclabs+00d8<-ksbrdp+04b4<-opirip+03fc<-opidrv+0458<-sou2o+0090<-opimai_real+0150<-main+0098<-__start+0070 ---------------------------------------- SO: 70000078f366290, type: 4, owner: 70000078627a560, flag: INIT/-/-/0x00 (session) sid: 1062 trans: 0, creator: 70000078627a560, flag: (51) USR/- BSY/-/-/-/-/- DID: 0000-0021-00000003, short-term DID: 0000-0000-00000000 txn branch: 0 oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS service name: SYS$BACKGROUND waiting for 'latch: row cache objects' blocking sess=0x0 seq=45472 wait_time=0 seconds since wait started=0 address=70000075f1cd020, number=c8, tries=1 Dumping Session Wait History for 'latch: row cache objects' count=1 wait_time=906249 address=70000075f1cd020, number=c8, tries=0 for 'latch: row cache objects' count=1 wait_time=27503 address=70000075f1cd020, number=c8, tries=2 for 'latch: row cache objects' count=1 wait_time=292982 address=70000075f1cd020, number=c8, tries=1 综合考虑,从现在得到的信息来看,很可能发生问题时shared pool的使用出现了问题,但是因为我们没有更具体的dba_hist_active_sess_history数据,并且并没有在发生问题的第一时间收集systemstate dump,我们可能无法更进一步的确认它了. LOG FILE -------------- Filename=awrrpt_1_25919_25920.html ===> 发生问题前1个小时的AWR report
See the following:

Snap Id Snap Time Sessions Cursors/Session
Begin Snap: 25919 06-Oct-15 09:00:26 224 6.6
End Snap: 25920 06-Oct-15 10:00:36 233 6.3
Elapsed: 60.17 (mins)
DB Time: 18.51 (mins)

Cache Sizes

Begin End
Buffer Cache: 21,776M 21,776M Std Block Size: 8K
Shared Pool Size: 8,752M 8,752M Log Buffer: 30,672K

SGA breakdown difference
ordered by Pool, Name
N/A value for Begin MB or End MB indicates the size of that Pool/Name was insignificant, or zero in that snapshot
Pool Name Begin MB End MB % Diff
shared CCursor 302.83 323.98 6.98
shared Cursor Stats 222.80 222.80 0.00
shared KGH: NO ACCESS 6,056.99 6,056.99 0.00 ===========> 将近6个GB是被KGH: NO ACCESS所使用
shared PCursor 252.15 274.83 9.00
shared free memory 5,239.86 4,815.81 -8.09
shared gcs resources 418.68 418.68 0.00
shared gcs shadows 384.49 384.49 0.00
shared kglsim heap 245.99 245.99 0.00
shared kglsim object batch 416.87 416.87 0.00
shared library cache 251.35 258.46 2.83
shared sql area 382.75 764.75 99.80

1. 发生问题前的awr report显示shared pool中大部分的内存都被KGH: NO ACCESS占用
2. 而发生问题后的systemstate dump显示非常多的进程都在等待跟kghfrunp功能有关的latch (LCK0也在等这样的latch, 导致了系统出现严重的row cache enqueue/row cache lock问题)
而这个kghfrunp功能是要从shared pool中释放空间,这说明shared pool当时不够用

8666117.8, patch 8666117 可以优化lck0在shared pool有压力下的工作,可以减缓row cache lock的问题
note 801787.1里 ,有以下的三个方式可以减少 KGH: NO ACCESS的大小:

a). 设置最小的buffer cache及shared pool的值
b). 减少shared pool resize的频率:设置 _memory_broker_stat_interval
c). 打patch 7189722