Skip to content

Oracle - 63. page

诊断事件:hang 诊断

2011 0428 /*luda*/
一. trc文件以及日志文件

在$ORACLE_BASE/admin/ORACLE_SID/下面有不同类别的跟踪文件
alert.log文件
system log文件 /var/adm/messages

二. hang situations
.what’s hang?
1.查看CPU的使用情况,作为参考指标,一般hang的cpu的使用率是不高的(至少oracle进程的cpu使用率很低)
2.查看进程是否在等待一些不存在的进程。
hang的诊断办法:

1.system/进程 诊断事件
2.从v$session_wait,v$lock,v$latch,v$latchholder
3.hanganalyze event

(1)
V$SESSION_WAITdisplays the resources or events for which active sessions are waiting.

在诊断数据库hang的时候,v$session_wait可以提供数据库级别有用的信息
从v$session_wait字段可以找到有用的信息

sid                  session id(v$session)
seq#                 sequeue number of the wait for this session
event                the event that the session is waiting for just finished for waiting for
wait_time            the time wait for the session
second_in_wait       the approximate time in seconds at the start of the wait state :
                                                                                  
                                                                                   0 – WAITING (the session is currently waiting)

                                                                                   -2 – WAITED UNKNOWN TIME (duration of last wait is unknown)

                                                                                   -1 – WAITED SHORT TIME (last wait <1/100th of a second)

                                                                                   >0 – WAITED KNOWN TIME (WAIT_TIME = duration of last wait)
示例:

SQL> set linesize 2000
SQL> select sid,seq#,event,wait_time,seconds_in_wait from v$session_wait;

       SID       SEQ# EVENT                                                             WAIT_TIME SECONDS_IN_WAIT
———- ———- —————————————————————- ———- —————
       138          1 jobq slave wait                                                           0              25
       149         68 Streams AQ: waiting for time management or cleanup tasks                  0          575891
       150         35 Streams AQ: qmn slave idle wait                                           0           65739
       151          6 Streams AQ: qmn coordinator idle wait                                     0         1863190
       155      43903 rdbms ipc message                                                         0              57
       156      51563 rdbms ipc message                                                         0             208
       159         31 SQL*Net message to client                                                -1               0
       160         10 rdbms ipc message                                                         0          634929
       161       5670 rdbms ipc message                                                         0            1552
       162      59394 rdbms ipc message                                                         0              26
       163         11 rdbms ipc message                                                         0          230558

       SID       SEQ# EVENT                                                             WAIT_TIME SECONDS_IN_WAIT
———- ———- —————————————————————- ———- —————
       164       5051 smon timer                                                                0            3419
       165      62808 rdbms ipc message                                                         0               0
       166      14978 rdbms ipc message                                                         0              21
       167      23334 rdbms ipc message                                                         0              21
       168         20 rdbms ipc message                                                         0           59201
       169      58551 rdbms ipc message                                                         0              24
       170          8 pmon timer                                                                0         1863202

(2)hanganaylyze event
hang诊断事件一般用在数据库hang住或者死锁的情况下
SQL> alter session set events ‘immediate trace name hanganalyze level 4’;

Session altered.

SQL> @gettrc

TRACE_FILE_NAME
—————————————————
/oracle/admin/znjtepp/udump/znjtepp_ora_18886.trc

马上就可以找到跟踪文件在udump目录下的 18886的文件
  –用hanganalyze 监听oracle – 60错误

    在init文件中加入
    event=“60 trace name hanganalyze level 5”
  –oracle debug的使用
  这里handanalyze有6个级别:
  10   全部的进程信息导出,伴随大量的数据,这个一般不被采用
  5    导出全部和等待事件有关的进程
  4    导出等待事件的分支节点
  3    导出被认为是hang的进程
  2    低限度的导出信息
  1    只导出少部分数据
 
loop的诊断办法

系统的诊断dump是非常有帮助于诊断oracle的loop
操作示范:
(1)系统状态转储
SQL> alter session set events ‘immediate trace name systemstate level 10’;

Session altered.
SQL> @gettrc

TRACE_FILE_NAME
——————————————————————————–
/oracle/admin/znjtepp/udump/znjtepp_ora_28508.trc

(2)进程状态转储
SQL> alter session set events ‘immediate trace name processstate level 10’;

Session altered.

SQL> @gettrc

TRACE_FILE_NAME
——————————————————————————–
/oracle/admin/znjtepp/udump/znjtepp_ora_28508.trc

或者

oradebug  setospid <pid>
oradebug dump systemstate 10
下来就是分析trc文件了,当然为了工作便捷,ass.awk可以快速列出当前的等待事件

诊断事件:10513 禁止smon回滚

禁用smon进行tx recovery(所谓tx recovery就是open后数据文件包含提交和未提交数据,数据不一致),不会造成数据库不一致,虽然禁用了smon自动恢复,但是当查询的时候还是会进行回滚从undo中读取 回滚数据(
等同于用到哪个对象回滚哪个对象,这种方式会带来压力,且若undo损坏就十分麻烦了,那么这将是另一个恢复问题select segment_name,status,tablespace_name from dba_rollback_segs看那个段损坏
使用隐藏参数_offline_rollback_segments 标记,然后drop rollback segment ‘xxx’ ,此时才会造成真正的数据不一致)

实验证明:
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area  943718400 bytes
Fixed Size                  2025400 bytes
Variable Size             247466056 bytes
Database Buffers          692060160 bytes
Redo Buffers                2166784 bytes
Database mounted.
Database opened.
SQL> show parameter event

NAME                                 TYPE        VALUE
———————————— ———– ——————————
event                                string
SQL> conn luda/luda
Connected.
SQL> drop table t1;
create table t1 (id number,id2 number);

declare
begin
for i in 1..100000 loop
insert into t1 values(i,i+1);
end loop;
end;
/
Table dropped.

SQL>
Table created.

SQL> SQL>   2    3    4    5    6    7 

PL/SQL procedure successfully completed.

SQL> conn / as sysdba
Connected.
SQL> shutdown abort
ORACLE instance shut down.
SQL> startup mount
ORACLE instance started.

Total System Global Area  943718400 bytes
Fixed Size                  2025400 bytes
Variable Size             251660360 bytes
Database Buffers          687865856 bytes
Redo Buffers                2166784 bytes
Database mounted.
SQL> alter system set events ‘10513 trace name context forever,level 2′;

System altered.

SQL> alter database open;

Database altered.

SQL> select usn,UNDOBLOCKSDONE,UNDOBLOCKSDONE from v$fast_start_transactions;

no rows selected

SQL> show parameter event

NAME                                 TYPE        VALUE
———————————— ———– ——————————
event                                string
SQL> alter session set events’10046 trace name context forever ,level 12’;

Session altered.

SQL> select * from t1;
select * from t1
              *
ERROR at line 1:
ORA-00942: table or view does not exist
SQL> select * from luda.t1;
SQL> alter system set events ‘10046 trace name context off’;

System altered.

在系统的10046的跟踪文件头部记录着:
“znjtepp_ora_26610.trc” 23617 lines, 2172756 characters
/oracle/admin/znjtepp/udump/znjtepp_ora_26610.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 – 64bit Production
With the Partitioning, OLAP and Data Mining options
ORACLE_HOME = /oracle/product/10.2/db_1
System name:    AIX
Node name:      aix53
Release:        3
Version:        5
Machine:        0009AFDA4C00
Instance name: znjtepp
Redo thread mounted by this instance: 1
Oracle process number: 15
Unix process pid: 26610, image: oracle@aix53 (TNS V1-V3)

*** 2011-05-25 11:46:13.172
*** SERVICE NAME:() 2011-05-25 11:46:13.163
*** SESSION ID:(159.3) 2011-05-25 11:46:13.163
Thread 1 checkpoint: logseq 39, block 8118, scn 3255055
  cache-low rba: logseq 39, block 8119
    on-disk rba: logseq 39, block 59764, scn 3255568
  start recovery at logseq 39, block 8119, scn 0
—– Redo read statistics for thread 1 —–
Read rate (ASYNC): 25823Kb in 0.28s => 90.06 Mb/sec
Total physical reads: 26624Kb
Longest record: 20Kb, moves: 0/101746 (0%)
Change moves: 0/2 (0%), moved: 0Mb
Longest LWN: 683Kb, moves: 25/94 (26%), moved: 16Mb
Last redo scn: 0x0000.0031ad12 (3255570)
———————————————-
—– Recovery Hash Table Statistics ———
Hash table buckets = 32768
Longest hash chain = 2
Average hash chain = 1251/1246 = 1.0
Max compares per lookup = 2
Avg compares per lookup = 202468/203698 = 1.0
———————————————-
*** 2011-05-25 11:46:13.452
KCRA: start recovery claims for 1251 data blocks
*** 2011-05-25 11:46:13.830
KCRA: blocks processed = 1251/1251, claimed = 1251, eliminated = 0
*** 2011-05-25 11:46:13.831
Recovery of Online Redo Log: Thread 1 Group 2 Seq 39 Reading mem 0
—– Recovery Hash Table Statistics ———
Hash table buckets = 32768
Longest hash chain = 2
…..略

证明在扫描T1表时候还是会使用undo的段进行回滚。

关于10513的设置以及oracle的解释

查询smon进程号:
luda_dba:/home/oracle$ sqlplus ‘/as sysdba’

SQL*Plus: Release 10.2.0.1.0 – Production on Wed May 25 10:28:34 2011

Copyright (c) 1982, 2005, Oracle.  All rights reserved.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 – 64bit Production
With the Partitioning, OLAP and Data Mining options

SQL> select pid,program from v$process where program like ‘%SMON%’;

       PID PROGRAM
———- ————————————————
         8 oracle@aix53 (SMON)

SQL> oradebug setorapid 8
Unix process pid: 22736, image: oracle@aix53 (SMON)
SQL> oradebug event 10513 trace name context forever,level 2
Statement processed.
SQL>

Oracle官方对于event 10513解释,该文件存放在
$ORACLE_HOJE/rdbms/mesg/oraus.msg文件中。

引用
10513, 00000, “turn off wrap source compression”
// *Cause:
// *Action: Set this event if you do not want source of wrapped PL/SQL
//          objects to be concatenated and stored multiple lines to a row.
// *Comment: This event should be set if you want the source of wrapped
//           PL/SQL objects to be stored one line per source row. The
//           new storage method is multiple lines per row, which
//           improves load and compile performance.

0519

今天在群里,一朋友把flashback table和回收站的功能混淆了,

他一直坚持他正确的,认为flashback table用的是回收站的资源,而不是使用undo,

这里一下就暴露了对闪回和回收站的认识不足,回收站只针对drop的对象,而flashback table不止针对ddl,也针对dml

但是flashback table对ddl的闪回限制比较多,而且一般的ddl将首先考虑flashback database。

最后还是用实验说服了他…

最近瞎忙,复习了下concepts。

马上就要再次步入OCM的路程

我也会继续更新对block的解析文章,包括bbed的使用方法,块以及datafile的结构分析

还会增加OCM的考试内容的实地操作介绍。

坚持

在最艰难的时候,告诉自己,再坚持一下,再坚持一下,也许就成功了