Skip to content

_datafile_write_errors_crash_instance

_datafile_write_errors_crash_instance设置建议

该参数在11.2.0.2以前默认是false,在11.2.0.2后默认为true,作用为在出现io错误的时候数据库选择是offline出现io错误相关的datafile还是直接将instance crash.当为true时候,数据库在发生io错误时候会直接瘫痪.报错ORA-63999等,前阵子我碰到这样的错误,一般碰到此类错误都是从IO传输层,存储和系统网络之间找问题,该参数设置为FALSE或者TRUE只是从业务影响层面广度的考虑.所以一旦碰到IO错误,考调整此参数只是治标不治本,根源还需要从IO传输层的各层面找问题.可以考虑设置为false,减少因为io错误而导致影响的范围增加.

在考虑此参数时候多数已经是碰到IO错误了,所以此时候应该考虑下,数据库坏块的产生控制影响,对以下几个参数给予考虑:

DB_ULTRA_SAFE
DB_BLOCK_CHECKING
DB_LOST_WRITE_PROTECT
DB_BLOCK_CHECKSUM

建议设置参数db_ultra_safe为DATA_ONLY,会稍微增加数据库主机的消耗,以加强对数据的校验以便能及时发现问题,设置该参数会去自动修改对应的另外3个参数:

DB_BLOCK_CHECKING will be set to MEDIUM.(当前为FALSE)
DB_LOST_WRITE_PROTECT will be set to TYPICAL. (当前为TYPICAL)
DB_BLOCK_CHECKSUM will be set to FULL. (当前为NONE)

以下是国外一个工程师对该参数的建议:

Param ‘_datafile_write_errors_crash_instance’ , TRUE or FALSE?

Since 11.2.0.2 there’s a new parameter, “_datafile_write_errors_crash_instance” to prevent the intance to crash when a write error on a datafile occurs . But.. should I use this or not. The official text of this parameter:

This fix introduces a notable change in behaviour in that
from 11.2.0.2 onwards an I/O write error to a datafile will
now crash the instance.

Before this fix I/O errors to datafiles not in the system tablespace
offline the respective datafiles when the database is in archivelog mode.
This behavior is not always desirable. Some customers would prefer
that the instance crash due to a datafile write error.

This fix introduces a new hidden parameter to control if the instance
should crash on a write error or not:
_datafile_write_errors_crash_instance

With this fix:
If _datafile_write_errors_crash_instance = TRUE (default) then
any write to a datafile which fails due to an IO error causes
an instance crash.

If _datafile_write_errors_crash_instance = FALSE then the behaviour
reverts to the previous behaviour (before this fix) such that
a write error to a datafile offlines the file (provided the DB is
in archivelog mode and the file is not in SYSTEM tablespace in
which case the instance is aborted)
When you ask Oracle for advice, you get the following answer:

20+ years ago a feature was added to Oracle to offline a datafile when there was an error writing a dirty buffer to it and it was not part of the system tablespace. At that time it made sense to do this since neither RAC or even
OPS was implemented and storage arrays did not exist. Then the most likelycause of an I/O error was a problem with the direct attached disk drive holding the datafile. By offlining the datafile the database might be able to continue running. Customers assumed that a disk failure would require restoring a backup and doing a media recovery so taking the file offline might improve availability. High availability was not expected.
Today almost all customers use highly available storage arrays accessible from multiple hosts. Now most I/O errors are either transient or are local to the host that encounters them. Real disk failures are hidden by the storage array redundancy. Customers expect a disk failure to have no effect on the operation of the database.

Unfortunately the code to offline a datafile on an I/O error is still there. The effect is that an error on one node in a cluster offlines the datafile and usually takes down the entire application on all nodes or even crashes all instances if the problem is with an undo tablespace. For example dismounting a file system on one node in a cluster makes that node get I/O errors for the files on that file system. This makes a mistake on one node take down the entire cluster.

Offlining a datafile on a write error is even a problem with single instance. Most I/O errors today will go away if the database is restarted on another machine or if the current machine is rebooted. However if the I/O error took a datafile offline, then the administrator must do a media recovery to make the application function again. This is an unusual procedure that takes awhile.

If the database instances do not crash it takes longer for the administrator to find out that the application is not working even though the database appears to be up and running. This is a problem with both RAC and single
instance.

Question: One concern is that a failed datafile write to a non-critical tablespace will bring down the database when it occurs in the only open instance.

It is true that there may be some situations where taking the file offline would be better. On the other hand there are cases where crashing in single instance is better because rebooting the server or restarting the instance will bring it up sooner with no need for manual intervention. Since we have to choose without knowing much about the system we have to base our choice on the odds of the failure being one case or the other. Twenty years ago a datafile was on one disk, almost all I/O errors were disk failures and a disk failure always meant doing media recovery. In that situation taking the datafile offline was clearly the right thing to do, even if the tablespace was critical to the application – it was going to need media recovery in any case.

Today systems are much different.

– Storage arrays and mirroring mean that disk failures almost never require media recovery. I/O write errors usually stop happening when the system is reinitialized.
– Many customers have mechanisms like CRS to automatically restart the database, possibly on a different node.
Now it is much more likely that restarting the instance will resolve the problem without doing any media recovery, and it will happen automatically. The chance that the application can continue running with the offline datafile has always been slight, but when media recovery was going to be required anyway there was no harm in trying to offline the file. Now there is a lot of harm in offlining the file since it prevents automatic recovery and requires an administrator to perform tasks he is unfamiliar with. Today crashing the instance has a better chance of getting the application running sooner.
So Oracle advises to leave the parameter default (=TRUE) and use the new feature. The system is then more capable to recover without interfering needed of a DBA. But when someone has different experiences, feel free to comment on this post….