ORA-63999 data file suffered media failure causes an instance crash

KCF: read, write or open error, block=0xb79ab online=1
        file=85 '/dev/vgpmesdb12/rLV_FEM_PRD_I02'
        error=27063 txt: 'HPUX-ia64 Error: 11: Resource temporarily unavailable
Additional information: -1
Additional information: 8192'
Encountered write error

*** 2015-06-17 20:06:22.714
DDE rules only execution for: ORA 1110
----- START Event Driven Actions Dump ----
---- END Event Driven Actions Dump ----
----- START DDE Actions Dump -----
Executing SYNC actions
----- START DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----
Successfully dispatched
----- END DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (SUCCESS, 0 csec) -----
Executing ASYNC actions
----- END DDE Actions Dump (total 0 csec) -----
error 63999 detected in background process
ORA-63999: data file suffered media failure
ORA-01114: IO error writing block to file 85 (block # 752043)
ORA-01110: data file 85: '/dev/vgpmesdb12/rLV_FEM_PRD_I02'
ORA-27063: number of bytes read/written is incorrect
HPUX-ia64 Error: 11: Resource temporarily unavailable
Additional information: -1
Additional information: 8192
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+544<-kjzdssdmp()+400<-kjzduptcctx()+432<-kjzdicrshnfy()+128<-$cold_ksuitm()+5872<-$cold_ksbrdp()+2704<-opirip()+1296<-opidrv()+1152<-sou2o()+256<-opimai_real()+352<-ssthrdmain()+576<-main()+336<-main_opd_entry()+80
----- End of Abridged Call Stack Trace -----

*** 2015-06-17 20:06:23.172
DBW1 (ospid: 5833): terminating the instance due to error 63999
ksuitm: waiting up to [5] seconds before killing DIAG(5807)

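I/O errors on part of the device holding the database file brought the instance down. Because of the error ORA-27063: number of bytes read/written is incorrect, the first suspicion was a corrupt block, but validation found no corrupt blocks. Taking the HPUX-ia64 Error: 11: Resource temporarily unavailable message into account, the preliminary suspicion is that some internal operation on the HP system made the device /dev/vgpmesdb12/rLV_FEM_PRD_I02 temporarily inaccessible.
A similar bug was then found on the official support site: bug 16884689 : DATABASE CRASH DUE TO ORA-27063 HPUX-IA64 ERROR: 11. Judging from the overall diagnosis, the root cause is still an I/O problem, and the cluster only noticed the database resource problem after the I/O problem had occurred, so it remains to be determined whether the HP system or one of the modules of the I/O subsystem is at fault. The scenario here differs from the one described in that bug.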
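
When reading a trace like the one above, it helps to confirm that the KCF line and the ORA-01114 line refer to the same block: the KCF line prints the block number in hex, ORA-01114 prints it in decimal. The following is a minimal Python sketch that pulls those fields out of the trace excerpt. The function name parse_kcf and the dictionary keys are made up for illustration, and the reading of the two "Additional information" values (result of the write call, then the requested byte count) is an assumption, not something the trace states.

```python
import re

# Excerpt copied from the trace shown above.
TRACE = """\
KCF: read, write or open error, block=0xb79ab online=1
        file=85 '/dev/vgpmesdb12/rLV_FEM_PRD_I02'
        error=27063 txt: 'HPUX-ia64 Error: 11: Resource temporarily unavailable
Additional information: -1
Additional information: 8192'
ORA-01114: IO error writing block to file 85 (block # 752043)
"""

def parse_kcf(trace: str) -> dict:
    """Extract the fields of interest from a KCF write-error trace excerpt."""
    # Block number as printed by KCF (hexadecimal).
    kcf_block = int(re.search(r"block=(0x[0-9a-fA-F]+)", trace).group(1), 16)
    # Datafile number and device path.
    file_no, path = re.search(r"file=(\d+) '([^']+)'", trace).groups()
    # OS error number reported next to the ORA-27063 text.
    os_errno = int(re.search(r"Error: (\d+):", trace).group(1))
    # Block number as printed by ORA-01114 (decimal).
    ora_block = int(re.search(r"\(block # (\d+)\)", trace).group(1))
    # Assumption: first value is the write result, second the requested size.
    extra = [int(v) for v in re.findall(r"Additional information: (-?\d+)", trace)]
    return {
        "kcf_block": kcf_block,
        "ora_block": ora_block,
        "file_no": int(file_no),
        "path": path,
        "os_errno": os_errno,
        "additional_info": extra,
    }

info = parse_kcf(TRACE)
# Both lines should point at the same block of file 85.
print(info["kcf_block"] == info["ora_block"], info)
```

Here 0xb79ab converts to 752043, so both messages describe the same failed write to file 85.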

Whether the instance crashes and restarts on a datafile I/O write error is controlled by the hidden parameter _datafile_write_errors_crash_instance.

References:
1.


Description

This fix introduces a notable change in behaviour in that
from 11.2.0.2 onwards an I/O write error to a datafile will
now crash the instance.

Before this fix I/O errors to datafiles not in the system tablespace
offline the respective datafiles when the database is in archivelog mode.
This behavior is not always desirable. Some customers would prefer
that the instance crash due to a datafile write error.

This fix introduces a new hidden parameter to control if the instance
should crash on a write error or not:
 _datafile_write_errors_crash_instance



With this fix:
 If _datafile_write_errors_crash_instance = TRUE (default) then
  any write to a datafile which fails due to an IO error causes
  an instance crash.

 If _datafile_write_errors_crash_instance = FALSE then the behaviour
  reverts to the previous behaviour (before this fix) such that
  a write error to a datafile offlines the file (provided the DB is
  in archivelog mode and the file is not in SYSTEM tablespace in
  which case the instance is aborted)
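
The rules quoted above can be written down as a small decision function. This is only a sketch of the note's wording, not Oracle code: the function name write_error_outcome and the returned strings are made up for illustration, and the note does not say what happens when the parameter is FALSE and the database is not in archivelog mode, so that case is reported as unspecified.

```python
def write_error_outcome(crash_param: bool, archivelog: bool, system_tablespace: bool) -> str:
    """Model the outcome of a datafile write error as described in the note.

    crash_param       value of _datafile_write_errors_crash_instance
    archivelog        True if the database runs in archivelog mode
    system_tablespace True if the failing file belongs to the SYSTEM tablespace
    """
    if crash_param:
        # TRUE (default): any failed datafile write crashes the instance.
        return "instance crash"
    if system_tablespace:
        # FALSE, but the file is in SYSTEM: the instance is aborted.
        return "instance abort"
    if archivelog:
        # FALSE, archivelog mode, non-SYSTEM file: the file is offlined.
        return "datafile offlined"
    # Not covered by the quoted text.
    return "not stated in the note"

# The crash in this post: default parameter value, user datafile.
print(write_error_outcome(True, True, False))
```

With the default value the function returns "instance crash", which matches DBW1 terminating the instance with error 63999 in the trace above.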

2.

This is due to a problem with the I/O subsystem.
Issues of this nature are common when there is a problem in the I/O subsystem.
This can include, but is not limited to:

2.1 A bad sector on disk
2.2 An I/O card that is starting to fail
2.3 A bad array cable
2.4 An interruption in network connectivity, in the case of NFS mounts
2.5 An OS-level bug
etc.
Review the OS messages file, as it will almost certainly show the errors (for example: Error for Command: write(10)).
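
Searching the OS messages file for such entries can be scripted. The sketch below is a plain substring scan in Python; the function name find_io_errors is hypothetical, the only pattern taken from the note is "Error for Command: write", and the sample lines are made up for illustration and are not real system output. Message formats differ between platforms, so the pattern list would need adjusting for the system in question.

```python
from typing import Iterable, List, Tuple

# "Error for Command: write" comes from the example in the note above;
# extend the tuple with patterns that match your platform's log format.
DEFAULT_PATTERNS = ("Error for Command: write",)

def find_io_errors(
    lines: Iterable[str],
    patterns: Tuple[str, ...] = DEFAULT_PATTERNS,
) -> List[Tuple[int, str]]:
    """Return (line number, text) for every line containing one of the patterns."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        lowered = line.lower()
        # Case-insensitive substring match against each pattern.
        if any(p.lower() in lowered for p in patterns):
            hits.append((lineno, line.rstrip("\n")))
    return hits

# Made-up sample lines for illustration only; not real system output.
sample = [
    "hypothetical-host: disk driver started\n",
    "hypothetical-host: Error for Command: write(10)\n",
]
print(find_io_errors(sample))
```

An open file object can be passed in place of the sample list; the location of the messages file is site-specific and is not given in this post.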