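Some recent odds and ends

1. Blog hosting migration

A problem with the nginx cache settings on my friend's VPS left me unable to log in to the blog's admin backend. My friend never found the time to adjust the corresponding nginx memory parameters and bring it back, so I had to move the blog to its current hosting space.

I restored it from last December's backup and, after some twists and turns, managed to bring over most of the posts written after December. It goes to show how much regular blog backups matter.

From now on I will keep up with updates and fill in the holes I left earlier. New posts will mostly cover RAC, DG, and database security, with database security taking the largest share.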

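2. Applying to the ACE program

On Zhou Liang's (周亮) recommendation I entered the ACE program review. I feel unsure of myself and will most likely not pass. Still, egged on by a certain someone, I submitted the form and materials. The odds are slim; I will treat it as a small exercise in courage.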

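3. A long-forgotten email

A few days ago, going through the luda@ludatou mailbox, I found that China Machine Press (机械工业出版社) had emailed me back in May 2013, inviting me to write the book 《大规模数据库的设计与运维》 (Design and Operation of Large-Scale Databases). Allow me to gloat a little. I saw that email too late, but at the end of 2013 the same editor tracked me down anyway and commissioned a different book, on security. The pressure is on. I need to settle down, build up properly, and learn from my seniors.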

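4. The company hired a DBA for me in Hangzhou this year

Heh. This DBA is not yet a seasoned expert, but can already take most of the work off my hands, especially the database health check, a chore I had repeated day after day for years. Someone has finally taken the baton from me. Now I travel with Lao Yu (老于) between emergency-response and tuning sites across East China outside Shanghai. I wonder whether this counts as the same life lived in a different way. It has not really changed, and yet it has, a little.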

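5. My younger brother's OCM

My younger brother has passed the OCP and plans to take the OCM by the end of the year. I keep pushing him to study. That kid has given the family plenty to worry about.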

 

It is morning already. Time to wash up, brush my teeth, and get ready to paste up expense receipts.

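OUI-0094: error when installing 11g and 10g on the same platform

After installing an 11g database, installing Oracle 10g under a different OS user on the same host fails with error OUI-0094. The cause is that the existing oraInst.loc file already records the 11g installation's inventory information. With the cause known, the fix is straightforward:

Note: perform all of the following steps as the OS user that installs Oracle 10g.

1. Create a new oraInst.loc file under $ORACLE_HOME with the following contents:

inventory_loc=$ORACLE_HOME/oraInventory  -- must be an absolute path, so write out the real value of $ORACLE_HOME
inst_group=oinstall

2. With $ORACLE_HOME/oraInst.loc in place, start OUI as follows: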

./runInstaller -invPtrLoc $ORACLE_HOME/oraInst.loc

The -invPtrLoc flag is used to locate the oraInst.loc file.

3. After these steps, Oracle 10g installs without further errors.
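The absolute-path requirement in step 1 is easy to get wrong, because oraInst.loc is a plain text file and nothing expands $ORACLE_HOME inside it. As a minimal sketch, the Python helper below renders the two lines of the file from an already-expanded path and rejects anything that is not absolute. The function name `render_orainst_loc` and the example home directory are assumptions made for illustration, not part of the original procedure.

```python
import posixpath


def render_orainst_loc(oracle_home: str, inst_group: str = "oinstall") -> str:
    """Return the text of an oraInst.loc that keeps the inventory under oracle_home."""
    # A literal "$ORACLE_HOME" is not absolute, so it is rejected here too.
    if not posixpath.isabs(oracle_home):
        raise ValueError(f"expected an absolute path, got {oracle_home!r}")
    inventory = posixpath.join(oracle_home, "oraInventory")
    return f"inventory_loc={inventory}\ninst_group={inst_group}\n"


# Hypothetical 10g home, used only for illustration.
print(render_orainst_loc("/u01/app/oracle/product/10.2.0/db_1"), end="")
```

The generated text can then be saved as $ORACLE_HOME/oraInst.loc before running the installer with -invPtrLoc.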

statement suspended, wait error to be cleared

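During an impdp run I hit the wait event statement suspended, wait error to be cleared.

Analysis:

On inspection, the cause was insufficient space in the tablespace targeted by the import. Once the tablespace is extended, the wait disappears on its own and the import is not interrupted.

The relevant logs follow: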

21:18:00 (2.0 min) 1,204 statement suspended, wait erro 939 8.21
CPU + Wait for CPU 264 2.31
db file sequential read 1 0.01
21:20:00 (2.0 min) 1,201 statement suspended, wait erro 951 8.32
CPU + Wait for CPU 249 2.18
control file sequential read 1 0.01
21:22:00 (2.0 min) 1,197 statement suspended, wait erro 942 8.24
CPU + Wait for CPU 255 2.23
21:24:00 (2.0 min) 1,201 statement suspended, wait erro 960 8.40
CPU + Wait for CPU 241 2.11
21:26:00 (2.0 min) 1,215 statement suspended, wait erro 963 8.42
CPU + Wait for CPU 250 2.19
log file parallel write 2 0.02
21:28:00 (2.0 min) 1,201 statement suspended, wait erro 952 8.33
CPU + Wait for CPU 249 2.18
21:30:00 (2.0 min) 1,202 statement suspended, wait erro 956 8.36
CPU + Wait for CPU 246 2.15
21:32:00 (2.0 min) 1,202 statement suspended, wait erro 950 8.31
CPU + Wait for CPU 252 2.20
21:34:00 (2.0 min) 1,199 statement suspended, wait erro 944 8.26
CPU + Wait for CPU 254 2.22
db file sequential read 1 0.01
21:36:00 (1 secs) 10 statement suspended, wait erro 8 0.07
CPU + Wait for CPU 2 0.02
-------------------------------------------------------------

End of Report
Report written to awrrpt_1_4_5.txt
SQL> SQL>
SQL>
SQL> select event from v$session_wait where wait_class#<>6;

EVENT
----------------------------------------------------------------
SQL*Net message to client
statement suspended, wait error to be cleared
db file scattered read
statement suspended, wait error to be cleared
statement suspended, wait error to be cleared
statement suspended, wait error to be cleared
statement suspended, wait error to be cleared
statement suspended, wait error to be cleared
db file sequential read
statement suspended, wait error to be cleared
statement suspended, wait error to be cleared

Completed: alter tablespace lisdata add datafile '+DATA01' size 4116m
Mon Dec 30 21:40:17 2013
statement in resumable session 'SYSTEM.SYS_IMPORT_SCHEMA_02.1' was resumed
Mon Dec 30 21:40:17 2013
statement in resumable session 'SYSTEM.SYS_IMPORT_SCHEMA_02.1' was resumed
Mon Dec 30 21:40:18 2013
statement in resumable session 'SYSTEM.SYS_IMPORT_SCHEMA_02.1' was resumed
Mon Dec 30 21:40:18 2013
statement in resumable session 'SYSTEM.SYS_IMPORT_SCHEMA_02.1' was resumed
Mon Dec 30 21:40:18 2013
statement in resumable session 'SYSTEM.SYS_IMPORT_SCHEMA_02.1' was resumed
Mon Dec 30 21:40:18 2013
statement in resumable session 'SYSTEM.SYS_IMPORT_SCHEMA_02.1' was resumed
Mon Dec 30 21:40:18 2013
statement in resumable session 'SYSTEM.SYS_IMPORT_SCHEMA_02.1' was resumed
Mon Dec 30 21:40:18 2013
statement in resumable session 'SYSTEM.SYS_IMPORT_SCHEMA_02.1' was resumed
Mon Dec 30 21:40:50 2013
alter tablespace lisdata add datafile '+DATA01' size 4116m
Mon Dec 30 21:41:02 2013
Completed: alter tablespace lisdata add datafile '+DATA01' size 4116m
Mon Dec 30 21:41:44 2013
alter tablespace lisdata add datafile '+DATA01' size 4116m
Mon Dec 30 21:41:59 2013
Completed: alter tablespace lisdata add datafile '+DATA01' size 4116m
Mon Dec 30 21:44:25 2013
alter tablespace lisdata add datafile '+DATA01' size 4116m
Mon Dec 30 21:44:31 2013
statement in resumable session 'SYSTEM.SYS_IMPORT_SCHEMA_02.1' was suspended due to
ORA-01652: unable to extend temp segment by 1024 in tablespace LISDATA
Mon Dec 30 21:44:31 2013
statement in resumable session 'SYSTEM.SYS_IMPORT_SCHEMA_02.1' was suspended due to
ORA-01652: unable to extend temp segment by 1024 in tablespace LISDATA
Mon Dec 30 21:44:31 2013
statement in resumable session 'SYSTEM.SYS_IMPORT_SCHEMA_02.1' was suspended due to
ORA-01652: unable to extend temp segment by 1024 in tablespace LISDATA
Completed: alter tablespace lisdata add datafile '+DATA01' size 4116m
statement in resumable session 'SYSTEM.SYS_IMPORT_SCHEMA_02.1' was resumed
statement in resumable session 'SYSTEM.SYS_IMPORT_SCHEMA_02.1' was resumed
statement in resumable session 'SYSTEM.SYS_IMPORT_SCHEMA_02.1' was resumed
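The alert log excerpt above shows the pattern: the Data Pump job runs as a resumable session, each worker is suspended on ORA-01652, and it resumes once a datafile has been added. As a rough sketch, the Python below counts suspended and resumed messages and the ORA- errors that caused the suspensions. The function name `summarize_resumable` is hypothetical, and the regular expressions assume the exact message wording seen in this log.

```python
import re
from collections import Counter

SUSPENDED = re.compile(r"statement in resumable session '([^']+)' was suspended due to")
RESUMED = re.compile(r"statement in resumable session '([^']+)' was resumed")
ORA_ERROR = re.compile(r"^(ORA-\d{5}):")


def summarize_resumable(lines):
    """Count suspend/resume messages per session and ORA- errors per code."""
    suspended, resumed, errors = Counter(), Counter(), Counter()
    for raw in lines:
        line = raw.strip()
        match = SUSPENDED.search(line)
        if match:
            suspended[match.group(1)] += 1
            continue
        match = RESUMED.search(line)
        if match:
            resumed[match.group(1)] += 1
            continue
        match = ORA_ERROR.match(line)
        if match:
            errors[match.group(1)] += 1
    return suspended, resumed, errors


# Lines copied from the alert log above.
SAMPLE = """\
Mon Dec 30 21:44:31 2013
statement in resumable session 'SYSTEM.SYS_IMPORT_SCHEMA_02.1' was suspended due to
ORA-01652: unable to extend temp segment by 1024 in tablespace LISDATA
Completed: alter tablespace lisdata add datafile '+DATA01' size 4116m
statement in resumable session 'SYSTEM.SYS_IMPORT_SCHEMA_02.1' was resumed
"""

suspended, resumed, errors = summarize_resumable(SAMPLE.splitlines())
print(dict(suspended), dict(resumed), dict(errors))
```

A session with more suspensions than resumes is still waiting for space.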

Steps for changing the RAC private network configuration (private network change), and version-specific caveats

Network information (interface, subnet and role of each interface) for Oracle Clusterware is managed by 'oifcfg', but the actual IP address of each interface is not, and 'oifcfg' cannot update IP address information. 'oifcfg getif' can be used to find out which interfaces are currently configured in OCR:

% $CRS_HOME/bin/oifcfg getif
eth0 10.2.156.0 global public
eth1 192.168.0.0 global cluster_interconnect

On Unix/Linux systems, the interface names are generally assigned by the OS, and standard names vary by platform. For Windows systems, see the additional notes below. The example above shows that interface eth0 is currently used for public with subnet 10.2.156.0, and eth1 for cluster_interconnect/private with subnet 192.168.0.0.
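For scripted checks, the getif output can be split into its four columns. The Python below is only an illustrative sketch: `Interface` and `parse_getif` are hypothetical names, and the parser assumes the name, subnet, scope, role layout shown in the sample output. It splits from the right so that interface names containing spaces, as on Windows, stay intact.

```python
from typing import List, NamedTuple


class Interface(NamedTuple):
    name: str
    subnet: str
    scope: str
    role: str


def parse_getif(output: str) -> List[Interface]:
    """Parse 'oifcfg getif' output into Interface records."""
    rows = []
    for line in output.splitlines():
        # Split from the right: the last three fields never contain spaces.
        parts = line.strip().rsplit(None, 3)
        if len(parts) == 4:
            rows.append(Interface(*parts))
    return rows


SAMPLE = """\
eth0 10.2.156.0 global public
eth1 192.168.0.0 global cluster_interconnect
"""

private = [i for i in parse_getif(SAMPLE) if i.role == "cluster_interconnect"]
print(private)
```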

The ‘public’ network is for database client communication (the VIP also uses the same network, though it is stored in OCR as a separate entry), whereas the ‘cluster_interconnect’ network is for RDBMS/ASM cache fusion. Starting with 11gR2, cluster_interconnect is also used for clusterware heartbeats. This is a significant change compared to prior releases, as pre-11gR2 uses the private nodename that was specified at installation time for clusterware heartbeats.

If the subnet or interface name for ‘cluster_interconnect’ interface is incorrect, it needs to be changed as crs/grid user.

Case I. Changing private hostname

In pre-11.2 Oracle Clusterware, the private hostname is recorded in OCR and cannot be updated. Generally the private hostname does not need to change; its associated IP can be changed. The only way to change the private hostname is to delete and re-add nodes, or to reinstall Oracle Clusterware.

In 11.2 Grid Infrastructure, the private hostname is no longer recorded in OCR and there is no dependency on the private hostname. It can be changed freely in /etc/hosts.

Case II. Changing private IP only without changing network interface, subnet and netmask

For example, the private IP is changed from 192.168.1.10 to 192.168.1.21, and the network interface name and subnet remain the same.

Shut down the Oracle Clusterware stack on the node where the change is required, make the IP modification at the OS layer (e.g. /etc/hosts, OS network config) for the private network, then restart the Oracle Clusterware stack to complete the task.

Case III. Changing private network MTU only

For example, private network MTU is changed from 1500 to 9000 (enable jumbo frame), network interface name and subnet remain the same.

1. Shutdown Oracle Clusterware stack on all nodes
2. Make the required network change of MTU size at OS network layer, ensure private network is available with the desired MTU size, ping with the desired MTU size works on all cluster nodes
3. Restart Oracle Clusterware stack on all nodes

Case IV. Changing private network interface name, subnet or netmask

Note: When the netmask changes but the subnet ID does not, for example:
the netmask is changed from 255.255.0.0 to 255.255.255.0 with a private IP like 192.168.0.x, the subnet ID remains 192.168.0.0 and the network interface name is not changed.
In that case, follow the same procedure as outlined in Case II.
When the netmask is changed, the associated subnet ID often changes as well. Oracle stores only the network interface name and subnet ID in OCR, not the netmask. The oifcfg command can be used for such a change; oifcfg commands need to be run on only one of the cluster nodes, not all.
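Whether a netmask change also changes the subnet ID can be worked out before touching the cluster. The sketch below uses Python's standard `ipaddress` module; the helper names `subnet_id` and `subnet_changed` and the concrete host addresses are assumptions for illustration. It checks the subnet only; a change of interface name still needs oifcfg regardless of the result.

```python
import ipaddress


def subnet_id(ip: str, netmask: str) -> str:
    """Return the subnet ID (network address) for an IP and a dotted netmask."""
    return str(ipaddress.ip_interface(f"{ip}/{netmask}").network.network_address)


def subnet_changed(ip_old: str, mask_old: str, ip_new: str, mask_new: str) -> bool:
    """True when the subnet ID stored in OCR would no longer match."""
    return subnet_id(ip_old, mask_old) != subnet_id(ip_new, mask_new)


# 192.168.0.10 stands in for an address of the form 192.168.0.x.
print(subnet_id("192.168.0.10", "255.255.0.0"))    # 192.168.0.0
print(subnet_id("192.168.0.10", "255.255.255.0"))  # 192.168.0.0
print(subnet_changed("192.168.0.10", "255.255.0.0", "192.168.0.10", "255.255.255.0"))  # False

# A hypothetical 192.168.1.x host: the same netmask change now moves the subnet ID.
print(subnet_changed("192.168.1.10", "255.255.0.0", "192.168.1.10", "255.255.255.0"))  # True

# A subnet ID need not end in .0; the 255.255.255.192 mask is an assumption.
print(subnet_id("10.2.3.86", "255.255.255.192"))   # 10.2.3.64
```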

A. For pre-11gR2 Oracle Clusterware

1. Use oifcfg to add the new private network information, delete the old private network information:

% $ORA_CRS_HOME/bin/oifcfg setif -global <if_name>/<subnet>:cluster_interconnect
% $ORA_CRS_HOME/bin/oifcfg delif -global <if_name>[/<subnet>]

For example:
% $ORA_CRS_HOME/bin/oifcfg setif -global eth3/192.168.2.0:cluster_interconnect
% $ORA_CRS_HOME/bin/oifcfg delif -global eth1/192.168.1.0

To verify the change
% $ORA_CRS_HOME/bin/oifcfg getif
eth0 10.2.166.0 global public
eth3 192.168.2.0 global cluster_interconnect

2. Shutdown Oracle Clusterware stack

As root user: # crsctl stop crs

3. Make required network change at OS level, /etc/hosts file should be modified on all nodes to reflect the change.
Ensure the new network is available on all cluster nodes:

% ping <private IP>
% ifconfig -a on Unix/Linux
or
% ipconfig /all on windows

4. restart the Oracle Clusterware stack

As root user: # crsctl start crs

Note: If running OCFS2 on Linux, one may also need to change the private IP address that OCFS2 is using to communicate with other nodes. For more information, please refer to Note 604958.1

B. For 11gR2 and higher

As of 11.2 Grid Infrastructure, the private network configuration is not only stored in OCR but also in the gpnp profile. If the private network is not available or its definition is incorrect, the CRSD process will not start and any subsequent changes to the OCR will be impossible. Therefore care needs to be taken when making modifications to the configuration of the private network. It is important to perform the changes in the correct order. Please also note that manual modification of gpnp profile is not supported.

Please take a backup of profile.xml on all cluster nodes before proceeding, as grid user:

$ cd $GRID_HOME/gpnp/<hostname>/profiles/peer/
$ cp -p profile.xml profile.xml.bk

1. Ensure Oracle Clusterware is running on ALL cluster nodes in the cluster

2. As grid user:

Get the existing information. For example:
$ oifcfg getif
eth1 100.17.10.0 global public
eth0 192.168.0.0 global cluster_interconnect

Add the new cluster_interconnect information:

$ oifcfg setif -global <if_name>/<subnet>:cluster_interconnect

For example:
a. add a new interface bond0 with the same subnet
$ oifcfg setif -global bond0/192.168.0.0:cluster_interconnect

b. add a new subnet with the same interface name but different subnet or new interface name
$ oifcfg setif -global eth0/192.65.0.0:cluster_interconnect
or
$ oifcfg setif -global eth3/192.168.1.96:cluster_interconnect

1. This can be done with the -global option even if the interface is not available yet. It must not be done with the -node option while the interface is unavailable, because that will lead to node eviction.

2. If the interface is available on the server, subnet address can be identified by command:

$ oifcfg iflist

It lists the network interface and its subnet address. This command can be run even if Oracle Clusterware is not running. Please note, subnet address might not be in the format of x.y.z.0, it can be x.y.z.24, x.y.z.64 or x.y.z.128 etc. For example,
$ oifcfg iflist
lan1 18.1.2.0
lan2 10.2.3.64 << this is the private network subnet address associated with private network IP: 10.2.3.86

3. If the goal is to add a 2nd private network rather than replace the existing one, ensure the MTU size of both interfaces is the same; otherwise instance startup will report an error:

ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if MTU failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpcini2
ORA-27303: additional information: requested interface lan1:801 has a different MTU (1500) than lan3:801 (9000), which is not supported. Check output from ifconfig command

Verify the change:

$ oifcfg getif

3. Shutdown Oracle Clusterware on all nodes and disable the Oracle Clusterware as root user:

# crsctl stop crs
# crsctl disable crs

4. Make the network configuration change at OS level as required, ensure the new interface is available on all nodes after the change.

$ ifconfig -a
$ ping

5. Enable Oracle Clusterware and restart Oracle Clusterware on all nodes as root user:

# crsctl enable crs
# crsctl start crs

6. Remove the old interface if required:

$ oifcfg delif -global <if_name>[/<subnet>]
eg:
$ oifcfg delif -global eth0/192.168.0.0
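Because the order of operations matters for 11gR2 and higher, it can help to write the sequence down before starting. The Python sketch below only builds a checklist of (user, scope, command) entries in the order described above; it executes nothing. The function name `interconnect_change_plan` and the tuple layout are assumptions for illustration, and the commands themselves are the ones quoted in the steps.

```python
def interconnect_change_plan(old_if, old_subnet, new_if, new_subnet):
    """Return the 11gR2+ steps as (user, scope, command) tuples, in order."""
    old_spec = f"{old_if}/{old_subnet}"
    new_spec = f"{new_if}/{new_subnet}"
    if old_spec == new_spec:
        raise ValueError("old and new interface definitions are identical")
    return [
        ("grid", "all nodes", "cp -p profile.xml profile.xml.bk"),
        ("grid", "one node", "oifcfg getif"),
        # The new definition is added while Clusterware is still up everywhere.
        ("grid", "one node", f"oifcfg setif -global {new_spec}:cluster_interconnect"),
        ("grid", "one node", "oifcfg getif"),
        ("root", "all nodes", "crsctl stop crs"),
        ("root", "all nodes", "crsctl disable crs"),
        ("root", "all nodes", "(make the OS-level network change, then verify it)"),
        ("root", "all nodes", "crsctl enable crs"),
        ("root", "all nodes", "crsctl start crs"),
        # The old definition is removed last, and only if required.
        ("grid", "one node", f"oifcfg delif -global {old_spec}"),
    ]


for user, scope, command in interconnect_change_plan("eth0", "192.168.0.0", "bond0", "192.168.0.0"):
    print(f"{user:5} {scope:10} {command}")
```

Keeping the delif entry last mirrors the rule that a new private interface must exist before the old one is removed.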

Something to note for 11gR2

1. If underlying network configuration has been changed, but oifcfg has not been run to make the same change, then upon Oracle Clusterware restart, the CRSD will not be able to start.

The crsd.log will show:

2010-01-30 09:22:47.234: [ default][2926461424] CRS Daemon Starting
..
2010-01-30 09:22:47.273: [ GPnP][2926461424]clsgpnp_Init: [at clsgpnp0.c:837] GPnP client pid=7153, tl=3, f=0
2010-01-30 09:22:47.282: [ OCRAPI][2926461424]clsu_get_private_ip_addresses: no ip addresses found.
2010-01-30 09:22:47.282: [GIPCXCPT][2926461424] gipcShutdownF: skipping shutdown, count 2, from [ clsinet.c : 1732], ret gipcretSuccess (0)
2010-01-30 09:22:47.283: [GIPCXCPT][2926461424] gipcShutdownF: skipping shutdown, count 1, from [ clsgpnp0.c : 1021], ret gipcretSuccess (0)
[ OCRAPI][2926461424]a_init_clsss: failed to call clsu_get_private_ip_addr (7)
2010-01-30 09:22:47.285: [ OCRAPI][2926461424]a_init:13!: Clusterware init unsuccessful : [44]
2010-01-30 09:22:47.285: [ CRSOCR][2926461424] OCR context init failure. Error: PROC-44: Error in network address and interface operations Network address and interface operations error [7]
2010-01-30 09:22:47.285: [ CRSD][2926461424][PANIC] CRSD exiting: Could not init OCR, code: 44
2010-01-30 09:22:47.285: [ CRSD][2926461424] Done.
Above errors indicate a mismatch between OS setting (oifcfg iflist) and gpnp profile setting profile.xml.

Workaround: restore the OS network configuration back to the original status, start Oracle Clusterware. Then follow above steps to make the changes again.

If the underlying network has not been changed, but oifcfg setif has been run with a wrong subnet address or interface name, same issue will happen.

2. If any one node is down in the cluster, oifcfg command will fail with error:

$ oifcfg setif -global bond0/192.168.0.0:cluster_interconnect
PRIF-26: Error in update the profiles in the cluster
Workaround: start Oracle Clusterware on the node where it is not running. Ensure Oracle Clusterware is up on all cluster nodes. If the node is down for any OS reason, please remove the node from the cluster before performing private network change.

3. If a user other than Grid Infrastructure owner issues above command, it will fail with same error:

$ oifcfg setif -global bond0/192.168.0.0:cluster_interconnect
PRIF-26: Error in update the profiles in the cluster
Workaround: ensure to login as Grid Infrastructure owner to perform such command.

4. From 11.2.0.2 onwards, an attempt to delete the last private interface (cluster_interconnect) without adding a new one first fails with the following error:

PRIF-31: Failed to delete the specified network interface because it is the last private interface
Workaround: Add new private interface first before deleting the old private interface.

5. If Oracle Clusterware is down on the node, the following error is expected:

$ oifcfg getif
PRIF-10: failed to initialize the cluster registry
Workaround: Start the Oracle Clusterware on the node
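The error codes in the notes above lend themselves to a small lookup when scanning oifcfg output or crsd.log. The Python below is a sketch of such a lookup; the `KNOWN_ERRORS` table and the `advise` function are hypothetical names, and the workaround texts are condensed from the notes rather than taken from any Oracle tool.

```python
import re

# Workarounds condensed from the notes above.
KNOWN_ERRORS = {
    "PROC-44": ["Restore the original OS network configuration, start Clusterware, redo the change"],
    "PRIF-26": [
        "Start Oracle Clusterware on every cluster node",
        "Run the command as the Grid Infrastructure owner",
    ],
    "PRIF-31": ["Add the new private interface before deleting the old one"],
    "PRIF-10": ["Start Oracle Clusterware on the node"],
}


def advise(output: str) -> dict:
    """Map each known error code found in the text to its workarounds."""
    found = {}
    for code in re.findall(r"\b(?:PRIF|PROC)-\d+\b", output):
        if code in KNOWN_ERRORS and code not in found:
            found[code] = KNOWN_ERRORS[code]
    return found


print(advise("PRIF-26: Error in update the profiles in the cluster"))
```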

Notes for Windows Systems

The syntax for changing the interfaces on Windows/RAC clusters is the same as on Unix/Linux, but the interface names will be slightly different. On Windows systems, the default interface names generally look like:

Local Area Connection
Local Area Connection 1
Local Area Connection 2

If using an interface name that has space in it, the name must be enclosed in quotes. Also, be aware that it is case sensitive. For example, on Windows, to set cluster_interconnect:

C:\oracle\product\10.2.0\crs\bin\oifcfg setif -global "Local Area Connection 1"/192.168.1.0:cluster_interconnect
However, it is best practice on Windows to rename the interfaces to be more meaningful, such as renaming them to ‘ocwpublic’ and ‘ocwprivate’. If interface names are renamed after Oracle Clusterware is installed, then you will need to run ‘oifcfg’ to add the new interface and delete the old one, as described above.

You can view the available interface names on each node by running the command:

oifcfg iflist -p -n
This command must be run on each node to verify the interface names are defined the same.
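The quoting rule for interface names with spaces can be captured in a small helper when generating oifcfg arguments. This is a minimal Python sketch; `interface_spec` is a hypothetical name, and it only builds the argument string, leaving the name's case untouched because the name is case sensitive.

```python
def interface_spec(name: str, subnet: str, role: str = "") -> str:
    """Build the <if_name>/<subnet>[:role] argument, quoting names with spaces."""
    quoted = f'"{name}"' if any(ch.isspace() for ch in name) else name
    spec = f"{quoted}/{subnet}"
    return f"{spec}:{role}" if role else spec


print(interface_spec("Local Area Connection 1", "192.168.1.0", "cluster_interconnect"))
print(interface_spec("ocwprivate", "192.168.1.0", "cluster_interconnect"))
```

Renaming the interfaces to names without spaces, as recommended above, avoids the quoting altogether.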

Ramifications of Changing Interface Names Using oifcfg

For the Private interface, the database will use the interface stored in the OCR and defined as a 'cluster_interconnect' for cache fusion traffic. The cluster_interconnect information is available at startup in the alert log, after the parameter listing - for example:

For pre 11.2.0.2:
Cluster communication is configured to use the following interface(s) for this instance
192.168.1.1

For 11.2.0.2+: (HAIP address will show in alert log instead of private IP)
Cluster communication is configured to use the following interface(s) for this instance
169.254.86.97
If this is incorrect, the instance must be restarted once the OCR entry is corrected. This applies to ASM instances and database instances alike. On Windows systems, after shutting down the instance, it is also necessary to stop and restart the OracleService (or OracleASMService) before the OCR will be re-read.
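When reading the alert log, a HAIP address can be told apart from a plain private IP because HAIP addresses come from the link-local range 169.254.0.0/16. The sketch below uses Python's standard `ipaddress` module; the function name `interconnect_kind` and its return strings are assumptions for illustration.

```python
import ipaddress


def interconnect_kind(address: str) -> str:
    """Classify an interconnect address taken from the alert log."""
    ip = ipaddress.ip_address(address)
    # HAIP addresses are allocated from the link-local range 169.254.0.0/16.
    return "HAIP" if ip.is_link_local else "private IP"


print(interconnect_kind("192.168.1.1"))    # private IP
print(interconnect_kind("169.254.86.97"))  # HAIP
```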

Oifcfg Usage

To see the full options of oifcfg, simply type:

$ $ORA_CRS_HOME/bin/oifcfg

 

Troubleshooting 'latch: cache buffers chains' Wait Contention

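On several recent projects I ran into 'latch: cache buffers chains' (LCBC) contention, and every time it came down to abnormal CPU behavior. I am sharing the official diagnostic note on the subject first; later posts will describe some extreme-case examples.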

If you have high contention, you need to look at the statements that perform the most buffer gets and then look at their access paths to determine whether these are performing as efficiently as you would like.

Typical solutions are:

  • Look for SQL that accesses the blocks in question and determine if the repeated reads are necessary. This may be within a single session or across multiple sessions.
  • Check for suboptimal SQL (this is the most common cause of the events) – look at the execution plan for the SQL being run and try to reduce the gets per executions which will minimize the number of blocks being accessed and therefore reduce the chances of multiple sessions contending for the same block.

Further information can be found in:

Note 390374.1: Oracle Performance Diagnostic Guide (OPDG)
Note 163424.1: How To Identify a Hot Block Within The Database Buffer Cache
Note 62172.1: Understanding and Tuning Buffer Cache and DBWR

 

Worked example:

Problem: Database is slow and ‘latch: cache buffers chains’ is high in the waits in AWR.

Start with Top 5 Waits:

Top 5 Timed Events                                      Avg    %Total
~~~~~~~~~~~~~~~~~~                                      wait   Call
Event                          Waits        Time (s)    (ms)   Time   Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
latch: cache buffers chains          74,642      35,421    475    6.1 Concurrenc
CPU time                                         11,422           2.0
log file sync                        34,890       1,748     50    0.3 Commit
latch free                            2,279         774    340    0.1 Other
db file parallel write               18,818         768     41    0.1 System I/O
-------------------------------------------------------------
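The Avg wait (ms) column is simply total wait time divided by the number of waits. As a quick check, this Python sketch recomputes the column from the Waits and Time (s) figures in the table; the helper name `avg_wait_ms` is an assumption.

```python
def avg_wait_ms(waits: int, time_s: float) -> float:
    """Average wait per event in milliseconds."""
    return time_s * 1000 / waits


# (event, waits, time in seconds) copied from the Top 5 table.
EVENTS = [
    ("latch: cache buffers chains", 74_642, 35_421),
    ("log file sync", 34_890, 1_748),
    ("latch free", 2_279, 774),
    ("db file parallel write", 18_818, 768),
]

for event, waits, time_s in EVENTS:
    print(f"{event:30} {round(avg_wait_ms(waits, time_s)):>5} ms")
```

The first row works out to roughly 475 ms per wait, which matches the report.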

High cache buffers chains latch indicates that there is likely to be something reading a lot of buffers. Typically the SQL with the most gets is likely to be that which is contending:

SQL ordered by Gets         DB/Inst:  Snaps: 1-2
-> Resources reported for PL/SQL code includes the resources used by all SQL
statements called by the code.
-> Total Buffer Gets:   265,126,882
-> Captured SQL account for   99.8% of Total
                            Gets                CPU      Elapsed
Buffer Gets    Executions   per Exec     %Total Time (s) Time (s)  SQL Id
-------------- ------------ ------------ ------ -------- --------- -------------
   256,763,367       19,052     13,477.0   96.8 ######## ######### a9nchgksux6x2
Module: JDBC Thin Client
SELECT * FROM SALES ….
     1,974,516      987,056          2.0    0.7    80.31    110.94 ct6xwvwg3w0bv
SELECT COUNT(*) FROM ORDERS ….

The query with SQL_ID a9nchgksux6x2 is reading more than 100x as many buffers as the 2nd most ‘hungry’ statement (roughly 130x by the figures above), and CPU and Elapsed are off the ‘scale’ of the report. This is a prime candidate for the cause of the CBC latch issues.

You can also link this information to the Top Segments by Logical Reads:

Segments by Logical Reads
-> Total Logical Reads:     265,126,882
-> Captured Segments account for   98.5% of Total
           Tablespace                      Subobject  Obj.       Logical
Owner         Name    Object Name            Name     Type         Reads  %Total
---------- ---------- -------------------- ---------- ----- ------------ -------
DMSUSER    USERS      SALES                           TABLE  212,206,208   80.04
DMSUSER    USERS      SALES_PK                        INDEX   44,369,264   16.74
DMSUSER    USERS      SYS_C0012345                    INDEX    1,982,592     .75
DMSUSER    USERS      ORDERS_PK                       INDEX      842,304     .32
DMSUSER    USERS      INVOICES                        TABLE      147,488     .06
          -------------------------------------------------------------

The top object read is SALES and the top SQL is a select from SALES, which suggests that this select is the likely problem.

This SQL should be investigated to see if the Gets per Exec or the Executions figure per hour has changed in any way (comparison to previous reports would show this) and if so the reasons for that change investigated and resolved.

In this case the statement is reading > 10,000 buffers per execution and executing > 15,000 times, so both of these may need to be adjusted to get better performance.
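The per-execution and share-of-total figures can be recomputed from the raw numbers in the SQL ordered by Gets section, which is handy when comparing against an earlier report. This Python sketch is illustrative only; `gets_per_exec` and `pct_of_total` are hypothetical helper names.

```python
def gets_per_exec(buffer_gets: int, executions: int) -> float:
    """Average buffer gets for one execution of a statement."""
    return buffer_gets / executions


def pct_of_total(buffer_gets: int, total_gets: int) -> float:
    """Share of all buffer gets attributed to a statement, in percent."""
    return 100 * buffer_gets / total_gets


TOTAL_GETS = 265_126_882
TOP = (256_763_367, 19_052)      # a9nchgksux6x2
SECOND = (1_974_516, 987_056)    # ct6xwvwg3w0bv

print(round(gets_per_exec(*TOP), 1))              # 13477.0
print(round(pct_of_total(TOP[0], TOTAL_GETS), 1)) # 96.8
print(round(gets_per_exec(*SECOND), 1))           # 2.0
print(round(TOP[0] / SECOND[0]))                  # 130
```

Tracking the first two numbers across snapshots shows whether the plan got worse (gets per execution rose) or the workload grew (executions rose).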

Note: This is a simple example where there is a high likelihood that the ‘biggest’ query is the culprit, but it is not always the ‘Top’ SQL that causes the problem. For example, contention may occur on a statement with a smaller total if it is executed only a small number of times, so that it does not appear as the top SQL. It may still make millions of buffer gets, but it will appear lower in the list because other statements execute many more times without contending.

So, if the first SQL is not the culprit then look at the others.