Oracle 排障配置与调整 – 第5页

建议关注11g数据库password_life_time

Oracle 11g 之前默认的用户时是没有密码过期的限制的，在Oracle 11g 中default的profile启用了密码过期时间是180天,也就是password_life_time值为180,虽然是个小细节,但是很多客户在迁移到11g后有规律性的都在半年后出现了密码错误无法登陆的问题.

如下：

select * from dba_profiles where profile='DEFAULT' and resource_name='PASSWORD_LIFE_TIME';

PROFILE       RESOURCE_NAME       RESOURCE LIMIT
------------ -------------------- -------- -------
DEFAULT       PASSWORD_LIFE_TIME  PASSWORD 180

当过期时候系统会报错ORA-28002.当遭遇这个问题时候可以通过以下方式解决:

命令为:

alter user username identified by password.

3.针对默认的profile的password_life_time设置为unlimited

命令为:

alter profile default limit PASSWORD_LIFE_TIM 180.

以下是profile里的关于password的设置类目解释:

Access control Oracle Oracle security Oracle 排障配置与调整数据技术 | Guang Cai Li | 2015-07-29 | password_life_time

关于11g数据库基础审计的初始注意事项

目前11g已经有些年份了,大部分的系统已经迁移或者升级到11g的版本,而后续上线的大部分系统都是以11g为主.在11g以上的版本中数据库的基础审计和以往版本不同,11g以后的版本默认是开启的.基础审计这个功能大部分时候并不是很受关注,一来是比较简单就能设置完成,二来国内对信息数据安全的不够注重的环境也让很多dba在数据库安全层面未有太多落实.所以在基础设计这块在数据库构建之处也容易忽视.

审计在默认开启的情况下,主要影响是2方面:

在初始构建11g实例时候,对审计处理并不是一味的关闭处理,主要是考虑开启的必要性需求,以及如果开启后如何去管理审计记录.以前我遇见过这样的需求,主要是针对审计开启后的审计记录管理问题.关闭倒是容易,一条命令就可以结束.反倒是开启后对审计记录的管理.可以参考以下2个文章对审计记录表存放位置挪移到别的表空间,同时制定定期的删除策略.

1.使用DBMS_AUDIT_MGMT定期PURGE部分AUD$以及FGA_LOG$审计记录

2.关于AUD$的Shrink以及migrate问题

Audit and FGA Database Oracle Oracle security Oracle 排障配置与调整数据技术 | Guang Cai Li | 2015-07-11

_datafile_write_errors_crash_instance设置建议

该参数在11.2.0.2以前默认是false,在11.2.0.2后默认为true,作用为在出现io错误的时候数据库选择是offline出现io错误相关的datafile还是直接将instance crash.当为true时候,数据库在发生io错误时候会直接瘫痪.报错ORA-63999等,前阵子我碰到这样的错误,一般碰到此类错误都是从IO传输层,存储和系统网络之间找问题,该参数设置为FALSE或者TRUE只是从业务影响层面广度的考虑.所以一旦碰到IO错误,考调整此参数只是治标不治本,根源还需要从IO传输层的各层面找问题.可以考虑设置为false,减少因为io错误而导致影响的范围增加.

在考虑此参数时候多数已经是碰到IO错误了,所以此时候应该考虑下,数据库坏块的产生控制影响,对以下几个参数给予考虑:

DB_ULTRA_SAFE
DB_BLOCK_CHECKING
DB_LOST_WRITE_PROTECT
DB_BLOCK_CHECKSUM

建议设置参数db_ultra_safe为DATA_ONLY，会稍微增加数据库主机的消耗，以加强对数据的校验以便能及时发现问题，设置该参数会去自动修改对应的另外3个参数：

DB_BLOCK_CHECKING will be set to MEDIUM.（当前为FALSE）
DB_LOST_WRITE_PROTECT will be set to TYPICAL. （当前为TYPICAL）
DB_BLOCK_CHECKSUM will be set to FULL. （当前为NONE）

以下是国外一个工程师对该参数的建议:

Param ‘_datafile_write_errors_crash_instance’ , TRUE or FALSE?

Since 11.2.0.2 there’s a new parameter, “_datafile_write_errors_crash_instance” to prevent the intance to crash when a write error on a datafile occurs . But.. should I use this or not. The official text of this parameter:

This fix introduces a notable change in behaviour in that
from 11.2.0.2 onwards an I/O write error to a datafile will
now crash the instance.

Before this fix I/O errors to datafiles not in the system tablespace
offline the respective datafiles when the database is in archivelog mode.
This behavior is not always desirable. Some customers would prefer
that the instance crash due to a datafile write error.

This fix introduces a new hidden parameter to control if the instance
should crash on a write error or not:
_datafile_write_errors_crash_instance

With this fix:
If _datafile_write_errors_crash_instance = TRUE (default) then
any write to a datafile which fails due to an IO error causes
an instance crash.

If _datafile_write_errors_crash_instance = FALSE then the behaviour
reverts to the previous behaviour (before this fix) such that
a write error to a datafile offlines the file (provided the DB is
in archivelog mode and the file is not in SYSTEM tablespace in
which case the instance is aborted)
When you ask Oracle for advice, you get the following answer:

20+ years ago a feature was added to Oracle to offline a datafile when there was an error writing a dirty buffer to it and it was not part of the system tablespace. At that time it made sense to do this since neither RAC or even
OPS was implemented and storage arrays did not exist. Then the most likelycause of an I/O error was a problem with the direct attached disk drive holding the datafile. By offlining the datafile the database might be able to continue running. Customers assumed that a disk failure would require restoring a backup and doing a media recovery so taking the file offline might improve availability. High availability was not expected.
Today almost all customers use highly available storage arrays accessible from multiple hosts. Now most I/O errors are either transient or are local to the host that encounters them. Real disk failures are hidden by the storage array redundancy. Customers expect a disk failure to have no effect on the operation of the database.

Unfortunately the code to offline a datafile on an I/O error is still there. The effect is that an error on one node in a cluster offlines the datafile and usually takes down the entire application on all nodes or even crashes all instances if the problem is with an undo tablespace. For example dismounting a file system on one node in a cluster makes that node get I/O errors for the files on that file system. This makes a mistake on one node take down the entire cluster.

Offlining a datafile on a write error is even a problem with single instance. Most I/O errors today will go away if the database is restarted on another machine or if the current machine is rebooted. However if the I/O error took a datafile offline, then the administrator must do a media recovery to make the application function again. This is an unusual procedure that takes awhile.

If the database instances do not crash it takes longer for the administrator to find out that the application is not working even though the database appears to be up and running. This is a problem with both RAC and single
instance.

Question: One concern is that a failed datafile write to a non-critical tablespace will bring down the database when it occurs in the only open instance.

It is true that there may be some situations where taking the file offline would be better. On the other hand there are cases where crashing in single instance is better because rebooting the server or restarting the instance will bring it up sooner with no need for manual intervention. Since we have to choose without knowing much about the system we have to base our choice on the odds of the failure being one case or the other. Twenty years ago a datafile was on one disk, almost all I/O errors were disk failures and a disk failure always meant doing media recovery. In that situation taking the datafile offline was clearly the right thing to do, even if the tablespace was critical to the application – it was going to need media recovery in any case.

Today systems are much different.

– Storage arrays and mirroring mean that disk failures almost never require media recovery. I/O write errors usually stop happening when the system is reinitialized.
– Many customers have mechanisms like CRS to automatically restart the database, possibly on a different node.
Now it is much more likely that restarting the instance will resolve the problem without doing any media recovery, and it will happen automatically. The chance that the application can continue running with the offline datafile has always been slight, but when media recovery was going to be required anyway there was no harm in trying to offline the file. Now there is a lot of harm in offlining the file since it prevents automatic recovery and requires an administrator to perform tasks he is unfamiliar with. Today crashing the instance has a better chance of getting the application running sooner.
So Oracle advises to leave the parameter default (=TRUE) and use the new feature. The system is then more capable to recover without interfering needed of a DBA. But when someone has different experiences, feel free to comment on this post….

Database Oracle Oracle 排障配置与调整数据技术 | Guang Cai Li | 2015-07-10 | _datafile_write_errors_crash_instance

SPM固定执行计划以及踩bug事一件

原有2个sql语句有多个表连接,执行计划一直在走错误的执行计划.表级统计信息以及索引规划都已经是最新(这里统计信息有狗血不做描述),只是SQL里还有六个绑定变量以及字段的柱状图影响了执行计划,在这个优化里没有删除柱状图和对绑定变量的影响进行处理(星形连接不建议使用绑定变量),现场环境微妙最终选择通过sql profile以及spm对这2个sql的执行计划进行固定处理.先用sqlprofile固定后让sql重新解析后发现未能生效,逐用spm的方式固定.

这里以其中一个sql_id为bwwnw7r1gzhdf的语句为例,这是收集到对应1个小时内的sqlrpt,其中plan_hash_value为711942702执行计划为正确的执行计划,从报告中可以看到这个sql选择了错误的执行计划,并且从中也可以看到sql有多个执行计划.当中执行计划正确与否的判断方式就不做描述.

SQL ID: bwwnw7r1gzhdf

1st Capture and Last Capture Snap IDs refer to Snapshot IDs witin the snapshot range
SELECT D.LOTNAME LOT, D.PRODUCTNAME GLASS, TO_CHAR(D.CREATETIME, ‘YYYY…

#	Plan Hash Value	Total Elapsed Time(ms)	Executions	1st Capture Snap ID	Last Capture Snap ID
1	3052678239	13,512,877	10	25060	25060
2	3392573872	0	0	25060	25060
3	4134955434	0	0	25060	25060
4	1564064893	0	0	25060	25060
5	2504448979	0	0	25060	25060
6	147966509	0	0	25060	25060
7	711942702	0	0	25060	25060

通过coe_xfr_sql_profile.sql脚本对bwwnw7r1gzhdf的sql进行固定711942702,生成sql profile的名字为coe_bwwnw7r1gzhdf_711942702.
(该部分可以参考
1.Using Sqltxplain to create a ‘SQL Profile’ to consistently reproduce a good plan (文档 ID 1487302.1)
2.Automatic SQL Tuning and SQL Profiles (文档 ID 271196.1)
3.Correcting Optimizer Cost Estimates to Encourage Good Execution Plans Using the COE XFR SQL Profile Script (文档 ID 1955195.1))

让sql从新解析后从v$sql视图中的sql profile字段没有看到生效的迹象，原因是在脚本coe_xfr_sql_profile.sql中对创建的sqlprofile默认的生效是false的，所以创建出来的profile不会失效,监控中的执行计划未变（现场我对此处的profile drop）.

SQL>  select name,created,status from dba_sql_profiles;

NAME                           CREATED                        STATUS
------------------------------ ------------------------------ --------
coe_bwwnw7r1gzhdf_711942702    26-JUN-15 02.09.30.000000 PM   ENABLED
coe_g87an0j5djjpm_334801256    26-JUN-15 11.30.25.000000 AM   ENABLED

SQL>  select SQL_ID, SQL_PROFILE,PLAN_HASH_VALUE from V$SQL where SQL_ID='bwwnw7r1gzhdf' and sql_profile is not null;

no rows

SQL>  select sql_profile,EXECUTIONS,PLAN_HASH_VALUE,parse_calls,ELAPSED_TIME/1000000,
ELAPSED_TIME/1000000/EXECUTIONS,LAST_LOAD_TIME,ROWS_PROCESSED
from v$sql where EXECUTIONS>0 and sql_id='bwwnw7r1gzhdf' order by LAST_LOAD_TIME desc;
...

逐对profile进行disable并drop

=====disable profile==============
BEGIN
DBMS_SQLTUNE.ALTER_SQL_PROFILE(
name =&gt; 'coe_bwwnw7r1gzhdf_711942702',
attribute_name =&gt; 'STATUS',
value =&gt; 'DISABLED');
END;
/

BEGIN
DBMS_SQLTUNE.ALTER_SQL_PROFILE(
name =&gt; 'coe_g87an0j5djjpm_334801256',
attribute_name =&gt; 'STATUS',
value =&gt; 'ENABLED');
END;
/

=====drop profile=================
begin
DBMS_SQLTUNE.DROP_SQL_PROFILE(name => 'coe_bwwnw7r1gzhdf_711942702');
end;
/

begin
DBMS_SQLTUNE.DROP_SQL_PROFILE(name => 'coe_g87an0j5djjpm_334801256');
end;
/

由于已经存在了正确的执行计划,所以通过DBMS_SPM直接创建baseline,并通过DBMS_SPM包对该sql的baseline的enable,accept,fixed三个属性指定为yes.

该部分可以参考:
Plan Stability Features (Including SQL Plan Management (SPM)) (文档 ID 1359841.1)

为sql创建baseline

variable cnt number;
execute :cnt :=DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(SQL_ID => 'bwwnw7r1gzhdf', PLAN_HASH_VALUE => 711942702) ;

验证该baseline已经生成

SQL> set linesize 200
SQL> Select Sql_Handle, Plan_Name, Origin, Enabled, Accepted,Fixed,Optimizer_Cost,Sql_Text
From Dba_Sql_Plan_Baselines
Where Sql_Text Like '%FROM P1EDBADM.MES_PROCESSOPERATIONSPEC%' Order By Last_Modified;


SQL_HANDLE                     PLAN_NAME                      ORIGIN         ENA ACC FIX OPTIMIZER_COST SQL_TEXT
------------------------------ ------------------------------ -------------- --- --- --- -------------- --------------------------------------------------------------------------------
SQL_995463d3d1edd710           SQL_PLAN_9kp33ug8yvpsh4af503b5 MANUAL-LOAD    YES YES NO              69 SELECT D.LOTNAME LOT, D.PRODUCTNAME GLASS, TO_CHAR(D.CREATETIME, 'YYYY-MM-DD HH2

为sqlbaseline的fixed属性改为yes

variable cnt number;
execute :cnt :=DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(SQL_ID => 'bwwnw7r1gzhdf', PLAN_HASH_VALUE => 711942702,fixed => 'yes') ;
验证修改完成
SQL> set linesize 200
SQL> Select Sql_Handle, Plan_Name, Origin, Enabled, Accepted,Fixed,Optimizer_Cost,Sql_Text
  2  From Dba_Sql_Plan_Baselines
  3  Where Sql_Text Like '%FROM P1EDBADM.MES_PROCESSOPERATIONSPEC%' Order By Last_Modified;

SQL_HANDLE                     PLAN_NAME                      ORIGIN         ENA ACC FIX OPTIMIZER_COST SQL_TEXT
------------------------------ ------------------------------ -------------- --- --- --- -------------- --------------------------------------------------------------------------------
SQL_995463d3d1edd710           SQL_PLAN_9kp33ug8yvpsh4af503b5 MANUAL-LOAD    YES YES YES            574 SELECT D.LOTNAME LOT, D.PRODUCTNAME GLASS, TO_CHAR(D.CREATETIME, 'YYYY-MM-DD HH2

最终验证生效

SQL> Select Sql_Handle, Plan_Name, Origin, Enabled, Accepted,Fixed,Optimizer_Cost,Sql_Text
  2  From Dba_Sql_Plan_Baselines
  3  Where Sql_Text Like '%FROM P1EDBADM.MES_PROCESSOPERATIONSPEC%' Order By Last_Modified;

SQL_HANDLE                     PLAN_NAME                      ORIGIN         ENA ACC FIX OPTIMIZER_COST SQL_TEXT
------------------------------ ------------------------------ -------------- --- --- --- -------------- --------------------------------------------------------------------------------
SQL_995463d3d1edd710           SQL_PLAN_9kp33ug8yvpsh4af503b5 MANUAL-LOAD    YES YES YES            574 SELECT D.LOTNAME LOT, D.PRODUCTNAME GLASS, TO_CHAR(D.CREATETIME, 'YYYY-MM-DD HH2
SQL_2e1c8025edb165b3           SQL_PLAN_2w7404rqv2tdm56eb6fa8 MANUAL-LOAD    YES YES YES            311 SELECT 1 " ", D.LOTNAME LOT, D.PRODUCTNAME GLASS, TO_CHAR(MAX(H.EVENTTIME), 'YYY

SPM主要和2个参数有关,一个是baseline生效(optimizer_user_sql_plan_baselines,前提是accept属性要为yes,否则会产生干扰),一个是捕获sql语句生成baseline(optimizer_cature_sql_plan_baselines).在数据库中我一般不开启捕获,但是开启baseline生效.
开启的语法:

alter system set optimizer_user_sql_plan_baselines=true scope=both;
alter system set optimizer_cature_sql_plan_baselines=true scope=both;

关闭的语法:

alter system set optimizer_user_sql_plan_baselines=false scope=both;
alter system set optimizer_cature_sql_plan_baselines=false scope=both;

开启捕获的情况在一些11g版本中会触发该bug
Bug 9910484 – SQL Plan Management Capture uses excessive space in SYSAUX (文档 ID 9910484.8)
此bug会造成sysaux的表空间暴增,主要为sqllob$data,我遇见的是在一天内从2g增长到4g.关闭了捕获后,该现象消失.
删除掉不必要的baseline后可以通过shrink的方式回收sysaux的空间,具体可以参考
Reducing the Space Usage of the SQL Management Base in the SYSAUX Tablespace (文档 ID 1499542.1)

Data access tuning Index Maintenance Oracle partition tab performance tuning 数据技术 | Guang Cai Li | 2015-07-10

4月11号浙江数据库用户组的分享ppt：散发思维之数据迁移

附件页面：

散发思维之数据迁移

Oracle Oracle 排障配置与调整数据技术 | Guang Cai Li | 2015-07-08

Oracle 排障配置与调整 - 5. page

建议关注11g数据库password_life_time

关于11g数据库基础审计的初始注意事项

_datafile_write_errors_crash_instance设置建议

SPM固定执行计划以及踩bug事一件

SQL ID: bwwnw7r1gzhdf

4月11号浙江数据库用户组的分享ppt：散发思维之数据迁移

Oracle 恢复工具 Mdata 5.0.1 版本发布

近期文章

分类目录

扫码关注微信公众号:Oracle运维那些事获取定期发布的数据库运维的有趣事情!

近期活动

SQL ID: bwwnw7r1gzhdf

Oracle 恢复工具 Mdata 5.0.1 版本发布

近期文章

分类目录

扫码关注微信公众号:Oracle运维那些事 获取定期发布的数据库运维的有趣事情!

近期活动

扫码关注微信公众号:Oracle运维那些事获取定期发布的数据库运维的有趣事情!