Skip to content

Oracle 排障配置与调整 - 5. page

建议关注11g数据库password_life_time

Oracle 11g 之前默认的用户时是没有密码过期的限制的,在Oracle 11g 中default的profile启用了密码过期时间是180天,也就是password_life_time值为180,虽然是个小细节,但是很多客户在迁移到11g后有规律性的都在半年后出现了密码错误无法登陆的问题.

如下:

select * from dba_profiles where profile='DEFAULT' and resource_name='PASSWORD_LIFE_TIME';

PROFILE       RESOURCE_NAME       RESOURCE LIMIT
------------ -------------------- -------- -------
DEFAULT       PASSWORD_LIFE_TIME  PASSWORD 180

当过期时候系统会报错ORA-28002.当遭遇这个问题时候可以通过以下方式解决:

    1.新建profile,对用户指定新的profile.
    2.通过对用户重设密码(密码可以和原来一样).

    命令为:

    alter user username identified by password.
    

    3.针对默认的profile的password_life_time设置为unlimited

    命令为:

    alter profile default limit PASSWORD_LIFE_TIM 180.

以下是profile里的关于password的设置类目解释:

    FAILED_LOGIN_ATTEMPTS
    设定登录到Oracle 数据库时可以失败的次数。一旦某用户尝试登录数据库的达到该值时,该用户的帐户就被锁定,只能由DBA能解锁。
    PASSWORD_LIFE_TIME
    设定口令的有效时间(天数),一旦超过这一时间,必须重新设口令。缺省为180天(11g,10gUNLIMITED).
    PASSWORD_REUSE_TIME
    许多系统不许用户重新启用过去用过的口令。该资源项设定了一个失效口令要经过多少天,用户才可以重新使用该口令。缺省为UNLIMITED.
    PASSWORD_REUSE_MAX
    重新启用一个先前用过的口令前必须对该口令进行重新设置的次数(重复用的次数)。
    PASSWORD_LOCK_TIME
    设定帐户被锁定的天数(当登录失败达到FAILED_LOGIN_ATTEMPTS时)。
    PASSWORD_GRACE_TIME
    设定在口令失效前,给予的重新设该口令的宽限天。当口令失效之后回,在登录时会出现警告信息显示该天数。如果没有在宽限天内修改口令,口令将失效。
    PASSWORD_VERITY_FUNCTION
    该资源项允许调用一个PL/SQL 来验证口令。Oracle公司已提供该应用 的脚本,但是只要愿意的话,用户可以制定自己的验证脚本。该参数的设定就是PL/SQL函数的名称。缺省为NULL.

关于11g数据库基础审计的初始注意事项

目前11g已经有些年份了,大部分的系统已经迁移或者升级到11g的版本,而后续上线的大部分系统都是以11g为主.在11g以上的版本中数据库的基础审计和以往版本不同,11g以后的版本默认是开启的.基础审计这个功能大部分时候并不是很受关注,一来是比较简单就能设置完成,二来国内对信息数据安全的不够注重的环境也让很多dba在数据库安全层面未有太多落实.所以在基础设计这块在数据库构建之处也容易忽视.

审计在默认开启的情况下,主要影响是2方面:

    1.审计记录在长久记录后,会占用较多的空间,而审计记录表默认是存放在system表空间.
    2.在审计记录膨胀后,容易对一些类型的应用产生性能上的影响.

在初始构建11g实例时候,对审计处理并不是一味的关闭处理,主要是考虑开启的必要性需求,以及如果开启后如何去管理审计记录.以前我遇见过这样的需求,主要是针对审计开启后的审计记录管理问题.关闭倒是容易,一条命令就可以结束.反倒是开启后对审计记录的管理.可以参考以下2个文章对审计记录表存放位置挪移到别的表空间,同时制定定期的删除策略.

_datafile_write_errors_crash_instance设置建议

该参数在11.2.0.2以前默认是false,在11.2.0.2后默认为true,作用为在出现io错误的时候数据库选择是offline出现io错误相关的datafile还是直接将instance crash.当为true时候,数据库在发生io错误时候会直接瘫痪.报错ORA-63999等,前阵子我碰到这样的错误,一般碰到此类错误都是从IO传输层,存储和系统网络之间找问题,该参数设置为FALSE或者TRUE只是从业务影响层面广度的考虑.所以一旦碰到IO错误,考调整此参数只是治标不治本,根源还需要从IO传输层的各层面找问题.可以考虑设置为false,减少因为io错误而导致影响的范围增加.

在考虑此参数时候多数已经是碰到IO错误了,所以此时候应该考虑下,数据库坏块的产生控制影响,对以下几个参数给予考虑:

DB_ULTRA_SAFE
DB_BLOCK_CHECKING
DB_LOST_WRITE_PROTECT
DB_BLOCK_CHECKSUM

建议设置参数db_ultra_safe为DATA_ONLY,会稍微增加数据库主机的消耗,以加强对数据的校验以便能及时发现问题,设置该参数会去自动修改对应的另外3个参数:

DB_BLOCK_CHECKING will be set to MEDIUM.(当前为FALSE)
DB_LOST_WRITE_PROTECT will be set to TYPICAL. (当前为TYPICAL)
DB_BLOCK_CHECKSUM will be set to FULL. (当前为NONE)

以下是国外一个工程师对该参数的建议:

Param ‘_datafile_write_errors_crash_instance’ , TRUE or FALSE?

Since 11.2.0.2 there’s a new parameter, “_datafile_write_errors_crash_instance” to prevent the intance to crash when a write error on a datafile occurs . But.. should I use this or not. The official text of this parameter:

This fix introduces a notable change in behaviour in that
from 11.2.0.2 onwards an I/O write error to a datafile will
now crash the instance.

Before this fix I/O errors to datafiles not in the system tablespace
offline the respective datafiles when the database is in archivelog mode.
This behavior is not always desirable. Some customers would prefer
that the instance crash due to a datafile write error.

This fix introduces a new hidden parameter to control if the instance
should crash on a write error or not:
_datafile_write_errors_crash_instance

With this fix:
If _datafile_write_errors_crash_instance = TRUE (default) then
any write to a datafile which fails due to an IO error causes
an instance crash.

If _datafile_write_errors_crash_instance = FALSE then the behaviour
reverts to the previous behaviour (before this fix) such that
a write error to a datafile offlines the file (provided the DB is
in archivelog mode and the file is not in SYSTEM tablespace in
which case the instance is aborted)
When you ask Oracle for advice, you get the following answer:

20+ years ago a feature was added to Oracle to offline a datafile when there was an error writing a dirty buffer to it and it was not part of the system tablespace. At that time it made sense to do this since neither RAC or even
OPS was implemented and storage arrays did not exist. Then the most likelycause of an I/O error was a problem with the direct attached disk drive holding the datafile. By offlining the datafile the database might be able to continue running. Customers assumed that a disk failure would require restoring a backup and doing a media recovery so taking the file offline might improve availability. High availability was not expected.
Today almost all customers use highly available storage arrays accessible from multiple hosts. Now most I/O errors are either transient or are local to the host that encounters them. Real disk failures are hidden by the storage array redundancy. Customers expect a disk failure to have no effect on the operation of the database.

Unfortunately the code to offline a datafile on an I/O error is still there. The effect is that an error on one node in a cluster offlines the datafile and usually takes down the entire application on all nodes or even crashes all instances if the problem is with an undo tablespace. For example dismounting a file system on one node in a cluster makes that node get I/O errors for the files on that file system. This makes a mistake on one node take down the entire cluster.

Offlining a datafile on a write error is even a problem with single instance. Most I/O errors today will go away if the database is restarted on another machine or if the current machine is rebooted. However if the I/O error took a datafile offline, then the administrator must do a media recovery to make the application function again. This is an unusual procedure that takes awhile.

If the database instances do not crash it takes longer for the administrator to find out that the application is not working even though the database appears to be up and running. This is a problem with both RAC and single
instance.

Question: One concern is that a failed datafile write to a non-critical tablespace will bring down the database when it occurs in the only open instance.

It is true that there may be some situations where taking the file offline would be better. On the other hand there are cases where crashing in single instance is better because rebooting the server or restarting the instance will bring it up sooner with no need for manual intervention. Since we have to choose without knowing much about the system we have to base our choice on the odds of the failure being one case or the other. Twenty years ago a datafile was on one disk, almost all I/O errors were disk failures and a disk failure always meant doing media recovery. In that situation taking the datafile offline was clearly the right thing to do, even if the tablespace was critical to the application – it was going to need media recovery in any case.

Today systems are much different.

– Storage arrays and mirroring mean that disk failures almost never require media recovery. I/O write errors usually stop happening when the system is reinitialized.
– Many customers have mechanisms like CRS to automatically restart the database, possibly on a different node.
Now it is much more likely that restarting the instance will resolve the problem without doing any media recovery, and it will happen automatically. The chance that the application can continue running with the offline datafile has always been slight, but when media recovery was going to be required anyway there was no harm in trying to offline the file. Now there is a lot of harm in offlining the file since it prevents automatic recovery and requires an administrator to perform tasks he is unfamiliar with. Today crashing the instance has a better chance of getting the application running sooner.
So Oracle advises to leave the parameter default (=TRUE) and use the new feature. The system is then more capable to recover without interfering needed of a DBA. But when someone has different experiences, feel free to comment on this post….

SPM固定执行计划以及踩bug事一件

原有2个sql语句有多个表连接,执行计划一直在走错误的执行计划.表级统计信息以及索引规划都已经是最新(这里统计信息有狗血不做描述),只是SQL里还有六个绑定变量以及字段的柱状图影响了执行计划,在这个优化里没有删除柱状图和对绑定变量的影响进行处理(星形连接不建议使用绑定变量),现场环境微妙最终选择通过sql profile以及spm对这2个sql的执行计划进行固定处理.先用sqlprofile固定后让sql重新解析后发现未能生效,逐用spm的方式固定.

这里以其中一个sql_id为bwwnw7r1gzhdf的语句为例,这是收集到对应1个小时内的sqlrpt,其中plan_hash_value为711942702执行计划为正确的执行计划,从报告中可以看到这个sql选择了错误的执行计划,并且从中也可以看到sql有多个执行计划.当中执行计划正确与否的判断方式就不做描述.
 

SQL ID: bwwnw7r1gzhdf

# Plan Hash Value Total Elapsed Time(ms) Executions 1st Capture Snap ID Last Capture Snap ID
1 3052678239 13,512,877 10 25060 25060
2 3392573872 0 0 25060 25060
3 4134955434 0 0 25060 25060
4 1564064893 0 0 25060 25060
5 2504448979 0 0 25060 25060
6 147966509 0 0 25060 25060
7 711942702 0 0 25060 25060

 
通过coe_xfr_sql_profile.sql脚本对bwwnw7r1gzhdf的sql进行固定711942702,生成sql profile的名字为coe_bwwnw7r1gzhdf_711942702.
(该部分可以参考
1.Using Sqltxplain to create a ‘SQL Profile’ to consistently reproduce a good plan (文档 ID 1487302.1)
2.Automatic SQL Tuning and SQL Profiles (文档 ID 271196.1)
3.Correcting Optimizer Cost Estimates to Encourage Good Execution Plans Using the COE XFR SQL Profile Script (文档 ID 1955195.1))

让sql从新解析后从v$sql视图中的sql profile字段没有看到生效的迹象,原因是在脚本coe_xfr_sql_profile.sql中对创建的sqlprofile默认的生效是false的,所以创建出来的profile不会失效,监控中的执行计划未变(现场我对此处的profile drop).

 

SQL>  select name,created,status from dba_sql_profiles;

NAME                           CREATED                        STATUS
------------------------------ ------------------------------ --------
coe_bwwnw7r1gzhdf_711942702    26-JUN-15 02.09.30.000000 PM   ENABLED
coe_g87an0j5djjpm_334801256    26-JUN-15 11.30.25.000000 AM   ENABLED

SQL>  select SQL_ID, SQL_PROFILE,PLAN_HASH_VALUE from V$SQL where SQL_ID='bwwnw7r1gzhdf' and sql_profile is not null;

no rows

SQL>  select sql_profile,EXECUTIONS,PLAN_HASH_VALUE,parse_calls,ELAPSED_TIME/1000000,
ELAPSED_TIME/1000000/EXECUTIONS,LAST_LOAD_TIME,ROWS_PROCESSED
from v$sql where EXECUTIONS>0 and sql_id='bwwnw7r1gzhdf' order by LAST_LOAD_TIME desc;
...

逐对profile进行disable并drop

=====disable profile==============
BEGIN
DBMS_SQLTUNE.ALTER_SQL_PROFILE(
name => 'coe_bwwnw7r1gzhdf_711942702',
attribute_name => 'STATUS',
value => 'DISABLED');
END;
/

BEGIN
DBMS_SQLTUNE.ALTER_SQL_PROFILE(
name => 'coe_g87an0j5djjpm_334801256',
attribute_name => 'STATUS',
value => 'ENABLED');
END;
/

=====drop profile=================
begin
DBMS_SQLTUNE.DROP_SQL_PROFILE(name => 'coe_bwwnw7r1gzhdf_711942702');
end;
/

begin
DBMS_SQLTUNE.DROP_SQL_PROFILE(name => 'coe_g87an0j5djjpm_334801256');
end;
/

由于已经存在了正确的执行计划,所以通过DBMS_SPM直接创建baseline,并通过DBMS_SPM包对该sql的baseline的enable,accept,fixed三个属性指定为yes.

该部分可以参考:
Plan Stability Features (Including SQL Plan Management (SPM)) (文档 ID 1359841.1)

为sql创建baseline

variable cnt number;
execute :cnt :=DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(SQL_ID => 'bwwnw7r1gzhdf', PLAN_HASH_VALUE => 711942702) ;

验证该baseline已经生成

SQL> set linesize 200
SQL> Select Sql_Handle, Plan_Name, Origin, Enabled, Accepted,Fixed,Optimizer_Cost,Sql_Text
From Dba_Sql_Plan_Baselines
Where Sql_Text Like '%FROM P1EDBADM.MES_PROCESSOPERATIONSPEC%' Order By Last_Modified;


SQL_HANDLE                     PLAN_NAME                      ORIGIN         ENA ACC FIX OPTIMIZER_COST SQL_TEXT
------------------------------ ------------------------------ -------------- --- --- --- -------------- --------------------------------------------------------------------------------
SQL_995463d3d1edd710           SQL_PLAN_9kp33ug8yvpsh4af503b5 MANUAL-LOAD    YES YES NO              69 SELECT D.LOTNAME LOT, D.PRODUCTNAME GLASS, TO_CHAR(D.CREATETIME, 'YYYY-MM-DD HH2

为sqlbaseline的fixed属性改为yes

variable cnt number;
execute :cnt :=DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(SQL_ID => 'bwwnw7r1gzhdf', PLAN_HASH_VALUE => 711942702,fixed => 'yes') ;
验证修改完成
SQL> set linesize 200
SQL> Select Sql_Handle, Plan_Name, Origin, Enabled, Accepted,Fixed,Optimizer_Cost,Sql_Text
  2  From Dba_Sql_Plan_Baselines
  3  Where Sql_Text Like '%FROM P1EDBADM.MES_PROCESSOPERATIONSPEC%' Order By Last_Modified;

SQL_HANDLE                     PLAN_NAME                      ORIGIN         ENA ACC FIX OPTIMIZER_COST SQL_TEXT
------------------------------ ------------------------------ -------------- --- --- --- -------------- --------------------------------------------------------------------------------
SQL_995463d3d1edd710           SQL_PLAN_9kp33ug8yvpsh4af503b5 MANUAL-LOAD    YES YES YES            574 SELECT D.LOTNAME LOT, D.PRODUCTNAME GLASS, TO_CHAR(D.CREATETIME, 'YYYY-MM-DD HH2

最终验证生效

SQL> Select Sql_Handle, Plan_Name, Origin, Enabled, Accepted,Fixed,Optimizer_Cost,Sql_Text
  2  From Dba_Sql_Plan_Baselines
  3  Where Sql_Text Like '%FROM P1EDBADM.MES_PROCESSOPERATIONSPEC%' Order By Last_Modified;

SQL_HANDLE                     PLAN_NAME                      ORIGIN         ENA ACC FIX OPTIMIZER_COST SQL_TEXT
------------------------------ ------------------------------ -------------- --- --- --- -------------- --------------------------------------------------------------------------------
SQL_995463d3d1edd710           SQL_PLAN_9kp33ug8yvpsh4af503b5 MANUAL-LOAD    YES YES YES            574 SELECT D.LOTNAME LOT, D.PRODUCTNAME GLASS, TO_CHAR(D.CREATETIME, 'YYYY-MM-DD HH2
SQL_2e1c8025edb165b3           SQL_PLAN_2w7404rqv2tdm56eb6fa8 MANUAL-LOAD    YES YES YES            311 SELECT 1 " ", D.LOTNAME LOT, D.PRODUCTNAME GLASS, TO_CHAR(MAX(H.EVENTTIME), 'YYY

SPM主要和2个参数有关,一个是baseline生效(optimizer_user_sql_plan_baselines,前提是accept属性要为yes,否则会产生干扰),一个是捕获sql语句生成baseline(optimizer_cature_sql_plan_baselines).在数据库中我一般不开启捕获,但是开启baseline生效.
开启的语法:

alter system set optimizer_user_sql_plan_baselines=true scope=both;
alter system set optimizer_cature_sql_plan_baselines=true scope=both;

关闭的语法:

alter system set optimizer_user_sql_plan_baselines=false scope=both;
alter system set optimizer_cature_sql_plan_baselines=false scope=both;

开启捕获的情况在一些11g版本中会触发该bug
Bug 9910484 – SQL Plan Management Capture uses excessive space in SYSAUX (文档 ID 9910484.8)
此bug会造成sysaux的表空间暴增,主要为sqllob$data,我遇见的是在一天内从2g增长到4g.关闭了捕获后,该现象消失.
删除掉不必要的baseline后可以通过shrink的方式回收sysaux的空间,具体可以参考
Reducing the Space Usage of the SQL Management Base in the SYSAUX Tablespace (文档 ID 1499542.1)