
Grid and Cluster

Using kfed to repair a corrupted ASM disk header, and a test of the recovery mechanism (disk header backup in AU no. 1)

In ASM versions 11.1.0.7 and later, the ASM disk header block is backed up in the second-to-last ASM metadata block of allocation unit 1.
Kfed parameters

aun – Allocation Unit (AU) number to read from. Default is AU0, or the very beginning of the ASM disk.
aus – AU size. Default is 1048576 (1MB). Specify the aus when reading from a disk group with non-default AU size.
blkn – block number to read. Default is block 0, or the very first block of the AU.
dev – ASM disk or device name. Note that the keyword dev can be omitted, but the ASM disk name is mandatory.
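For example, to dump the block-type field of block 4 in AU 1 on a diskgroup with a non-default 4 MB AU size (a hypothetical disk path, for illustration only):

# kfed read /dev/asm_disk1 aun=1 blkn=4 aus=4194304 | grep type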
Understanding ASM disk layout

Read ASM disk header block from AU[0]

[root@grac41 Desktop]# kfed read  /dev/asm_test_1G_disk1 | egrep 'name|size|type'
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD   <-- ASM disk header
kfdhdb.dskname:               TEST_0000 ; 0x028: length=9          <-- ASM disk name
kfdhdb.grpname:                    TEST ; 0x048: length=4          <-- ASM DG name
kfdhdb.fgname:                TEST_0000 ; 0x068: length=9          <-- ASM Failgroup
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200            <-- Disk sector size
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000            <-- ASM block size
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000        <-- AU size : 1 Mbyte
kfdhdb.dsksize:                    1023 ; 0x0c4: 0x000003ff        <-- ASM disk size : 1 GByte

Check ASM block types for the first 2 AUs
AU[0] :

[root@grac41 Desktop]# kfed find /dev/asm_test_1G_disk1
Block 0 has type 1
Block 1 has type 2
Block 2 has type 3
Block 3 has type 3
Block 4 has type 3
Block 5 has type 3
Block 6 has type 3
Block 7 has type 3
Block 8 has type 3
Block 9 has type 3
Block 10 has type 3
..
Block 252 has type 3
Block 253 has type 3
Block 254 has type 3
Block 255 has type 3

AU[1] :

[root@grac41 Desktop]#  kfed find /dev/asm_test_1G_disk1 aun=1
Block 256 has type 17
Block 257 has type 17
Block 258 has type 13
Block 259 has type 18
Block 260 has type 13
..
Block 508 has type 13
Block 509 has type 13
Block 510 has type 1
Block 511 has type 19

Summary :

--> Disk header size is 512 bytes
--> AU size = 1 MByte, ASM block size = 4096 bytes
--> This translates to 1048576 / 4096 = 256 blocks per AU (blocks 0 - 255)
--> Block 0 and block 510 each store an ASM disk header (type 1); block 510 (= AU1, block 254) is the backup copy
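
The same arithmetic as a small shell sketch (values taken from the kfed output above):

#!/bin/bash
# Locate the ASM disk header backup from the AU size and block size
ausize=1048576                        # kfdhdb.ausize
blksize=4096                          # kfdhdb.blksize
blocks_per_au=$((ausize / blksize))   # 1048576 / 4096 = 256 blocks per AU
backup_blkn=$((blocks_per_au - 2))    # second-to-last block of AU1 = 254 (absolute block 510)
echo "backup header: kfed read <disk> aun=1 blkn=$backup_blkn"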

Run the kfed command below if you are interested in a certain ASM block type (use the output from kfed find above to locate a block of the desired type):
[root@grac41 Desktop]# kfed read /dev/asm_test_1G_disk1 aun=1 blkn=255 | egrep 'type'
kfbh.type: 19 ; 0x002: KFBTYP_HBEAT

Some ASM block types

(Collected by reading various blocks with kfed, e.g. kfed read /dev/asm_test_1G_disk1 aun=0 blkn=0 | egrep 'type'; each line below comes from a different block:)
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
kfbh.type:                            5 ; 0x002: KFBTYP_LISTHEAD
kfbh.type:                           13 ; 0x002: KFBTYP_PST_NONE
kfbh.type:                           18 ; 0x002: KFBTYP_PST_DTA
kfbh.type:                           19 ; 0x002: KFBTYP_HBEAT

Repair ASM disk header block in AU[0] with kfed repair

Verify ASM DISK Header block located in AU[0] and AU[1]
AU[0] :

[root@grac41 Desktop]# kfed read  /dev/asm_test_1G_disk1 aun=0 blkn=0 | egrep 'name|size|type'
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:               TEST_0000 ; 0x028: length=9
kfdhdb.grpname:                    TEST ; 0x048: length=4
kfdhdb.fgname:                TEST_0000 ; 0x068: length=9
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.dsksize:                    1023 ; 0x0c4: 0x000003ff

AU[1] :

[root@grac41 Desktop]# kfed read  /dev/asm_test_1G_disk1 aun=1 blkn=254  | egrep 'name|size|type'
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:               TEST_0000 ; 0x028: length=9
kfdhdb.grpname:                    TEST ; 0x048: length=4
kfdhdb.fgname:                TEST_0000 ; 0x068: length=9
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.dsksize:                    1023 ; 0x0c4: 0x000003ff

Erase Disk header block in first AU ( aun=0 blkn=0 )

# dd if=/dev/zero of=/dev/asm_test_1G_disk1  bs=4096 count=1
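
In a real test it is wise to save the block before zeroing it, so it can be restored even without kfed; a minimal dd sketch (the backup file name is arbitrary):

# dd if=/dev/asm_test_1G_disk1 of=/tmp/block0.bak bs=4096 count=1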

Verify ASM disk header

# kfed read /dev/asm_test_1G_disk1 aun=0 blkn=0
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]
--> Corrupted ASM disk header detected in AU [0]

Repair disk header in AU[0] with kfed

[grid@grac41 ASM]$ kfed repair  /dev/asm_test_1G_disk1
[grid@grac41 ASM]$ kfed read /dev/asm_test_1G_disk1 aun=0 blkn=0
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:               TEST_0000 ; 0x028: length=9
kfdhdb.grpname:                    TEST ; 0x048: length=4
kfdhdb.fgname:                TEST_0000 ; 0x068: length=9
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.dsksize:                    1023 ; 0x0c4: 0x000003ff
--> kfed repair worked - Disk header restored

Can kfed repair the disk header backup block stored in the second AU?

Delete the disk header backup block in AU[1]
First use dd to check that we are reading the correct block

[grid@grac41 ASM]$ dd if=/dev/asm_test_1G_disk1 of=block1 bs=4096 count=1 skip=510 ; strings block1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000464628 s, 8.8 MB/s
ORCLDISK
TEST_0000
TEST
TEST_0000
--> looks like an ASM disk header - go ahead and erase that block

[grid@grac41 ASM]$  dd if=/dev/zero of=/dev/asm_test_1G_disk1  bs=4096 count=1  seek=510
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.00644028 s, 636 kB/s

Verify ASM disk header block in AU[1]

[grid@grac41 ASM]$ kfed read /dev/asm_test_1G_disk1 aun=1 blkn=254
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]
--> Corrupted ASM disk header detected

[grid@grac41 ASM]$ kfed repair  /dev/asm_test_1G_disk1
KFED-00320: Invalid block num1 = [0], num2 = [1], error = [endian_kfbh]
--> kfed repair doesn't work here: it only restores the primary header in AU[0] from the backup copy, it cannot rebuild the backup block itself

Repair block with dd

[grid@grac41 ASM]$ dd if=/dev/asm_test_1G_disk1 of=/dev/asm_test_1G_disk1 bs=4096 count=1 seek=510
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.0306682 s, 134 kB/s
[grid@grac41 ASM]$ kfed read /dev/asm_test_1G_disk1 aun=0 blkn=0
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:               TEST_0000 ; 0x028: length=9
kfdhdb.grpname:                    TEST ; 0x048: length=4
kfdhdb.fgname:                TEST_0000 ; 0x068: length=9
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.dsksize:                    1023 ; 0x0c4: 0x000003ff

# kfed read /dev/asm_test_1G_disk1 aun=1 blkn=254
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:               TEST_0000 ; 0x028: length=9
kfdhdb.grpname:                    TEST ; 0x048: length=4
kfdhdb.fgname:                TEST_0000 ; 0x068: length=9
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.dsksize:                    1023 ; 0x0c4: 0x000003ff

Summary:

kfed repair restores only the primary ASM disk header in AU[0]. To fix the backup header block in AU[1] (block 510 here) you need to use dd, copying the intact primary header over the damaged backup.

Reference:

http://laurent-leturgez.com/2012/11/12/how-asm-disk-header-block-repair-works/
http://asmsupportguy.blogspot.fr/2010/04/kfed-asm-metadata-editor.html
http://asmsupportguy.blogspot.co.uk/2011/08/asm-disk-header.html

From complex to simple: quickly parsing the ASM disk header structure

When an ASM instance crashes and, in certain scenarios, cannot be restarted, data has to be read directly from the disks to recover it or for other purposes. In that case the ASM disk number is essential information, especially in environments with sloppy disk naming, so kfed is used to read each disk header and collect this information.

In the disk header:

kfdhdb.dsknum is the disk number
kfdhdb.grptyp is the diskgroup redundancy type
kfdhdb.dskname is the disk name
kfdhdb.grpname is the name of the ASM diskgroup the disk belongs to
kfdhdb.blksize is the metadata block size of the disk
kfdhdb.dsksize is the total size of the disk

Therefore, to identify which ASM diskgroup a disk belongs to and its ASM disk number, simply filter the kfed read output on dsknum and grpname, for example:

 kfed read /dev/oracleasm/disks/VOL01   | egrep 'dsknum|grpname'

Writing this as a batch script over all disks is more efficient; see the sketch below.
Once disk 0 has been found this way, reading file 1 (the ASM file directory) in AU 2 leads to the files you want to recover.
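
A minimal sketch of such a batch script (assumes ASMLIB disks under /dev/oracleasm/disks and kfed in the PATH):

#!/bin/bash
# Print disk number, disk name and diskgroup name for every ASMLIB disk
for disk in /dev/oracleasm/disks/*
do
  echo "== $disk"
  kfed read $disk | egrep 'dsknum|dskname|grpname'
done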

Example:

[oracle@oradb bin]$ kfed read /dev/oracleasm/disks/VOL01
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj:              2147483648 ; 0x008: TYPE=0x8 NUMB=0x0
kfbh.check:                  3091711072 ; 0x00c: 0xb847c460
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdhdb.driver.provstr:    ORCLDISKVOL01 ; 0x000: length=13
kfdhdb.driver.reserved[0]:    810307414 ; 0x008: 0x304c4f56
kfdhdb.driver.reserved[1]:           49 ; 0x00c: 0x00000031
kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000
kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000
kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000
kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000
kfdhdb.compat:                168820736 ; 0x020: 0x0a100000
kfdhdb.dsknum:                        0 ; 0x024: 0x0000                  // disk number
kfdhdb.grptyp:                        3 ; 0x026: KFDGTP_HIGH             // diskgroup redundancy type
kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
kfdhdb.dskname:                   VOL01 ; 0x028: length=5                // disk name
kfdhdb.grpname:                    DATA ; 0x048: length=4                // diskgroup the disk belongs to
kfdhdb.fgname:                    VOL01 ; 0x068: length=5
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.crestmp.hi:             32942006 ; 0x0a8: HOUR=0x16 DAYS=0x1d MNTH=0x9 YEAR=0x7da
kfdhdb.crestmp.lo:            449689600 ; 0x0ac: USEC=0x0 MSEC=0x36e SECS=0x2c MINS=0x6
kfdhdb.mntstmp.hi:             32942646 ; 0x0b0: HOUR=0x16 DAYS=0x11 MNTH=0xa YEAR=0x7da
kfdhdb.mntstmp.lo:           1573951488 ; 0x0b4: USEC=0x0 MSEC=0x26 SECS=0x1d MINS=0x17
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000                    // size of each metadata block
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80
kfdhdb.dsksize:                   10236 ; 0x0c4: 0x000027fc             // disk size (in AUs)
kfdhdb.pmcnt:                         2 ; 0x0c8: 0x00000002
kfdhdb.fstlocn:                       1 ; 0x0cc: 0x00000001
kfdhdb.altlocn:                       2 ; 0x0d0: 0x00000002
kfdhdb.f1b1locn:                      2 ; 0x0d4: 0x00000002
kfdhdb.redomirrors[0]:                0 ; 0x0d8: 0x0000
kfdhdb.redomirrors[1]:                0 ; 0x0da: 0x0000
kfdhdb.redomirrors[2]:                0 ; 0x0dc: 0x0000
kfdhdb.redomirrors[3]:                0 ; 0x0de: 0x0000
kfdhdb.dbcompat:              168820736 ; 0x0e0: 0x0a100000
kfdhdb.grpstmp.hi:             32942006 ; 0x0e4: HOUR=0x16 DAYS=0x1d MNTH=0x9 YEAR=0x7da
kfdhdb.grpstmp.lo:            448217088 ; 0x0e8: USEC=0x0 MSEC=0x1d0 SECS=0x2b MINS=0x6
kfdhdb.ub4spare[0]:                   0 ; 0x0ec: 0x00000000
kfdhdb.ub4spare[1]:                   0 ; 0x0f0: 0x00000000
kfdhdb.ub4spare[2]:                   0 ; 0x0f4: 0x00000000
kfdhdb.ub4spare[3]:                   0 ; 0x0f8: 0x00000000
kfdhdb.ub4spare[4]:                   0 ; 0x0fc: 0x00000000
kfdhdb.ub4spare[5]:                   0 ; 0x100: 0x00000000
kfdhdb.ub4spare[6]:                   0 ; 0x104: 0x00000000
kfdhdb.ub4spare[7]:                   0 ; 0x108: 0x00000000
kfdhdb.ub4spare[8]:                   0 ; 0x10c: 0x00000000
kfdhdb.ub4spare[9]:                   0 ; 0x110: 0x00000000
kfdhdb.ub4spare[10]:                  0 ; 0x114: 0x00000000
kfdhdb.ub4spare[11]:                  0 ; 0x118: 0x00000000
kfdhdb.ub4spare[12]:                  0 ; 0x11c: 0x00000000
kfdhdb.ub4spare[13]:                  0 ; 0x120: 0x00000000
kfdhdb.ub4spare[14]:                  0 ; 0x124: 0x00000000
kfdhdb.ub4spare[15]:                  0 ; 0x128: 0x00000000
kfdhdb.ub4spare[16]:                  0 ; 0x12c: 0x00000000
kfdhdb.ub4spare[17]:                  0 ; 0x130: 0x00000000
kfdhdb.ub4spare[18]:                  0 ; 0x134: 0x00000000
kfdhdb.ub4spare[19]:                  0 ; 0x138: 0x00000000
kfdhdb.ub4spare[20]:                  0 ; 0x13c: 0x00000000
kfdhdb.ub4spare[21]:                  0 ; 0x140: 0x00000000
kfdhdb.ub4spare[22]:                  0 ; 0x144: 0x00000000
kfdhdb.ub4spare[23]:                  0 ; 0x148: 0x00000000
kfdhdb.ub4spare[24]:                  0 ; 0x14c: 0x00000000
kfdhdb.ub4spare[25]:                  0 ; 0x150: 0x00000000
kfdhdb.ub4spare[26]:                  0 ; 0x154: 0x00000000
kfdhdb.ub4spare[27]:                  0 ; 0x158: 0x00000000
kfdhdb.ub4spare[28]:                  0 ; 0x15c: 0x00000000
kfdhdb.ub4spare[29]:                  0 ; 0x160: 0x00000000
kfdhdb.ub4spare[30]:                  0 ; 0x164: 0x00000000
kfdhdb.ub4spare[31]:                  0 ; 0x168: 0x00000000
kfdhdb.ub4spare[32]:                  0 ; 0x16c: 0x00000000
kfdhdb.ub4spare[33]:                  0 ; 0x170: 0x00000000
kfdhdb.ub4spare[34]:                  0 ; 0x174: 0x00000000
kfdhdb.ub4spare[35]:                  0 ; 0x178: 0x00000000
kfdhdb.ub4spare[36]:                  0 ; 0x17c: 0x00000000
kfdhdb.ub4spare[37]:                  0 ; 0x180: 0x00000000
kfdhdb.ub4spare[38]:                  0 ; 0x184: 0x00000000
kfdhdb.ub4spare[39]:                  0 ; 0x188: 0x00000000
kfdhdb.ub4spare[40]:                  0 ; 0x18c: 0x00000000
kfdhdb.ub4spare[41]:                  0 ; 0x190: 0x00000000
kfdhdb.ub4spare[42]:                  0 ; 0x194: 0x00000000
kfdhdb.ub4spare[43]:                  0 ; 0x198: 0x00000000
kfdhdb.ub4spare[44]:                  0 ; 0x19c: 0x00000000
......

Using X$KFDAT to confirm the AU information of an ASM file

X$KFDAT (metadata, disk-to-AU mapping table)

[Figure: structure of the x$kfdat view and the meaning of its columns]

Example:

Find the AU information for the spfile:

sys@+ASM1> select GROUP_KFDAT,NUMBER_KFDAT,AUNUM_KFDAT from x$kfdat where
   fnum_kfdat=(select file_number from v$asm_alias where name='spfiletest1.ora');
GROUP_KFDAT NUMBER_KFDAT AUNUM_KFDAT
----------- ------------ -----------
          1            3         101
          1           20         379


From this we can see that the spfile occupies two AUs in diskgroup 1, on disks 3 and 20: AU 101 relative to the start of disk 3, and AU 379 relative to the start of disk 20. The AUNUM_KFDAT column of x$kfdat carries the same information as au_kffxp in x$kffxp. My diskgroup uses normal redundancy, so AU 3.101 and AU 20.379 are mirror copies of each other; this can be verified through x$kffxp.xnum_kffxp, since the two mirrored AUs share the same extent number. Several columns of x$kfdat and x$kffxp correlate in this way.
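
A sketch of that cross-check as a small script (assumes the standard x$kffxp column names and a grid user that can connect as sysasm; pass the ASM file number as the first argument):

#!/bin/bash
# List every extent copy (diskgroup, disk, AU, extent number) for one ASM file
sqlplus -S "/ as sysasm" <<EOF
select group_kffxp, disk_kffxp, au_kffxp, xnum_kffxp
  from x\$kffxp
 where number_kffxp = $1
 order by xnum_kffxp;
EOF

With normal redundancy, mirrored AU copies appear as rows sharing the same xnum_kffxp but with different disk/AU pairs.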

In 12c ASM, Oracle added two asmcmd commands for examining the AU information of an ASM file: mapextent and mapau:

Example:

ASMCMD>  mapextent '+ORCL_MYTEST/ORCL/DATAFILE/mytest.256.8332901607' 1
Disk_Num         AU      Extent_Size
1                211     1
0                211     1

ASMCMD> mapau
usage: mapau [--suppressheader] <dg number> <disk number> <au>
help:  help mapau
ASMCMD> mapau 1 1 107
File_Num         Extent          Extent_Set
261              1273            636

ORA-63999 data file suffered media failure causing an instance crash

KCF: read, write or open error, block=0xb79ab online=1
        file=85 '/dev/vgpmesdb12/rLV_FEM_PRD_I02'
        error=27063 txt: 'HPUX-ia64 Error: 11: Resource temporarily unavailable
Additional information: -1
Additional information: 8192'
Encountered write error

*** 2015-06-17 20:06:22.714
DDE rules only execution for: ORA 1110
----- START Event Driven Actions Dump ----
---- END Event Driven Actions Dump ----
----- START DDE Actions Dump -----
Executing SYNC actions
----- START DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----
Successfully dispatched
----- END DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (SUCCESS, 0 csec) -----
Executing ASYNC actions
----- END DDE Actions Dump (total 0 csec) -----
error 63999 detected in background process
ORA-63999: data file suffered media failure
ORA-01114: IO error writing block to file 85 (block # 752043)
ORA-01110: data file 85: '/dev/vgpmesdb12/rLV_FEM_PRD_I02'
ORA-27063: number of bytes read/written is incorrect
HPUX-ia64 Error: 11: Resource temporarily unavailable
Additional information: -1
Additional information: 8192
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+544<-kjzdssdmp()+400<-kjzduptcctx()+432<-kjzdicrshnfy()+128<-$cold_ksuitm()+5872<-$cold_ksbrdp()+2704<-opirip()+1296<-opidrv()+1152<-sou2o()+256<-opimai_real()+352<-ssthrdmain()+576<-main()+336<-main_opd_entry()+80
----- End of Abridged Call Stack Trace -----

*** 2015-06-17 20:06:23.172
DBW1 (ospid: 5833): terminating the instance due to error 63999
ksuitm: waiting up to [5] seconds before killing DIAG(5807)

A partial I/O error on the device holding the datafile brought the instance down. Since the error was ORA-27063: number of bytes read/written is incorrect, the first suspicion was block corruption, but validation found no corrupt blocks. Taking HPUX-ia64 Error: 11: Resource temporarily unavailable into account, the preliminary suspicion is that some internal HP-UX operation briefly made the device /dev/vgpmesdb12/rLV_FEM_PRD_I02 inaccessible.
A similar bug was then found on MOS: bug 16884689 (DATABASE CRASH DUE TO ORA-27063 HPUX-IA64 ERROR: 11). From the overall diagnosis the root cause is still an I/O problem, and the cluster only noticed the database resource failure after the I/O error occurred, so the HP OS and the I/O stack modules need to be examined; the scenario differs from the bug found.

Whether an instance crashes on a datafile I/O error is controlled by the hidden parameter _datafile_write_errors_crash_instance.

References:
1.


Description

This fix introduces a notable change in behaviour in that
from 11.2.0.2 onwards an I/O write error to a datafile will
now crash the instance.

Before this fix I/O errors to datafiles not in the system tablespace
offline the respective datafiles when the database is in archivelog mode.
This behavior is not always desirable. Some customers would prefer
that the instance crash due to a datafile write error.

This fix introduces a new hidden parameter to control if the instance
should crash on a write error or not:
 _datafile_write_errors_crash_instance



With this fix:
 If _datafile_write_errors_crash_instance = TRUE (default) then
  any write to a datafile which fails due to an IO error causes
  an instance crash.

 If _datafile_write_errors_crash_instance = FALSE then the behaviour
  reverts to the previous behaviour (before this fix) such that
  a write error to a datafile offlines the file (provided the DB is
  in archivelog mode and the file is not in SYSTEM tablespace in
  which case the instance is aborted)
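
Accordingly, the pre-fix behaviour can be restored by setting the parameter to FALSE. A sketch (underscore parameters should only be changed under Oracle Support guidance; with scope=spfile it takes effect at the next restart):

#!/bin/bash
# Revert to the pre-11.2.0.2 behaviour: offline the datafile instead of crashing
sqlplus -S "/ as sysdba" <<EOF
alter system set "_datafile_write_errors_crash_instance" = false scope=spfile sid='*';
EOF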

2.

This is due to a problem with the I/O subsystem.
Issues of this nature are common when there is a problem in the I/O subsystem.
This can include, but is not limited to:

2.1 A bad sector on disk
2.2 An I/O card that is starting to fail
2.3 A bad array cable
2.4 An interruption in network connectivity, in the case of NFS mounts
2.5 Could also be caused by a OS level bug.
etc.
Review the OS messages file, as this will almost certainly reflect errors (for example: Error for Command: write(10)).

Running into scls_scr_create, bug 4632899

Not much to say here. Having just pulled an all-nighter, the customer's database went down; the machine itself had crashed. After a reboot the OS recovered, but trying to start the cluster with crsctl raised this error:

Failure at scls_scr_create with code 1
 Internal Error Information:
 Category: 1234
 Operation: scls_scr_create
 Location: mkdir
 Other: Unable to make user dir
 Dep: 2

Oracle Support describes this as related to the case of the hostname, a Solaris bug, 4632899. And on this host the hostname setup was a mess: the hostname did not match /etc/hosts, and the hostname in sysconfig/network was not set at all!
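
A quick consistency check for this (a sketch for RHEL-style systems; paths differ on Solaris):

#!/bin/bash
# Compare the three places the hostname is configured
echo "hostname command : $(hostname)"
echo "/etc/hosts entry : $(grep -w "$(hostname)" /etc/hosts || echo '<none>')"
echo "sysconfig/network: $(grep '^HOSTNAME' /etc/sysconfig/network || echo '<not set>')"
# Bug 4632899: CSS does not start if the hostname contains capital letters
hostname | grep -q '[A-Z]' && echo "WARNING: hostname contains upper-case letters"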

The bug description:

 

Bug 4632899  CSS does not start if hostname has capital letters

This note gives a brief overview of bug 4632899.
The content was last updated on: 14-OCT-2011

Affects:

Product (Component) Oracle Server (PCW)
Range of versions believed to be affected Versions BELOW 11.1
Versions confirmed as being affected
Platforms affected Generic (all / most platforms affected)

Fixed:

This issue is fixed in 10.2.0.4 CRS bundle #2, 10.2.0.5, and 11.1.0.6

Symptoms:

Related To:

  • (None Specified)

Description

CSS does not start on Solaris if the hostname is in upper case.

eg:
 The following errors might be noticed when using CRS scripts:
  Failure at scls_scr_create with code 1
  Category: 1234
  Operation: scls_scr_create
  Location: mkdir
  Other: Unable to make user dir
  Dep: 2

Workaround
  Change the hostname from upper case to lower case.

Please note: The above is a summary description only. Actual symptoms can vary. Matching to any symptoms here does not confirm that you are encountering this problem. For questions about this bug please consult Oracle Support.

References

Bug:4632899 (This link will only work for PUBLISHED bugs)
Note:245840.1 Information on the sections in this article

11g R2: simulating loss of the OCR and voting disks, the complete recovery procedure, and some caveats

Environment: RHEL 5.8, RAC 11.2.0.3.0

1: Check the OCR and voting disk information

In 11g Release 2 your voting disk data is automatically backed up in the OCR whenever there is a configuration change.
So recovery only requires restoring the backed-up OCR. This differs from 10g: there is no need to back up the voting disks, backing up the OCR is enough.

2: Check OCR integrity with the cluvfy tool
[grid@rac1 ~]$ cluvfy comp ocr -n all
Verifying OCR integrity
Checking OCR integrity...

Checking the absence of a non-clustered configuration...
All nodes free of non-clustered, local-only configurations
ASM Running check passed. ASM is running on all specified nodes
Checking OCR config file "/etc/oracle/ocr.loc"...
OCR config file "/etc/oracle/ocr.loc" check successful
Disk group for ocr location "+CRSDATA" available on all the nodes

NOTE:
This check does not verify the integrity of the OCR contents. Execute 'ocrcheck' as a privileged user to verify the contents of OCR.
OCR integrity check passed
Verification of OCR integrity was successful.

3: Verify the contents of the OCR with ocrcheck
[grid@rac1 ~]$ ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 3
Total space (kbytes) : 262120
Used space (kbytes) : 3016
Available space (kbytes) : 259104
ID : 1236405787
Device/File Name : +CRSDATA
Device/File integrity check succeeded
Device/File not configured
Device/File not configured
Device/File not configured
Device/File not configured

Cluster registry integrity check succeeded

Logical corruption check bypassed due to non-privileged user    -- when ocrcheck is run as root, this line shows 'Logical corruption check succeeded' instead

4: Check the voting disk information
[grid@rac1 ~]$ crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- -----  -----------------  ---------  ---------
1. ONLINE 2b1bd0c122584f5abf72033b2b2d26bd (/dev/asm-b_crs) [CRSDATA]
2. ONLINE 2bc03776cdd94f5cbfb9165c473fdb0e (/dev/asm-c_crs) [CRSDATA]
3. ONLINE 3b43c39513a64f2dbf7083a9510ada89 (/dev/asm-d_crs) [CRSDATA]
Located 3 voting disk(s).

As shown above, the OCR and the voting disks all live in the +CRSDATA diskgroup. Note that +CRSDATA also holds the ASM startup parameter file: at startup, ASM locates the spfile through the kfdhdb.spfile field in the disk header, which points to the AU number of the spfile on that disk.
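
The spfile pointer can be inspected with kfed; a quick sketch (in 11.2 the header fields kfdhdb.spfile and kfdhdb.spfblk hold the AU number and block of the ASM spfile on this disk):

# kfed read /dev/asm-b_crs | grep spf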

5: Take a manual backup of the OCR:
[root@rac1 grid]# ocrconfig -export /tmp/ocr_20130717.dmp
[root@rac1 grid]# ll /tmp/ocr_20130717.dmp -h
-rw------- 1 root root 102K Jul 17 14:45 /tmp/ocr_20130717.dmp

6: Check the automatic OCR backups
[grid@rac1 ~]$ ocrconfig -showbackup

rac1 2013/07/16 15:45:24 /u01/app/11.2.0.3/grid/cdata/ad-cluster/backup00.ocr
rac2 2013/07/16 08:13:38 /u01/app/11.2.0.3/grid/cdata/ad-cluster/backup01.ocr
rac2 2013/07/16 04:14:09 /u01/app/11.2.0.3/grid/cdata/ad-cluster/backup02.ocr
rac2 2013/07/16 00:14:38 /u01/app/11.2.0.3/grid/cdata/ad-cluster/day.ocr
rac2 2013/07/07 04:40:11 /u01/app/11.2.0.3/grid/cdata/ad-cluster/week.ocr
PROT-25: Manual backups for the Oracle Cluster Registry are not available

7: Save a copy of the ASM parameter file. If one was not saved beforehand, a pfile can be taken from $CRS_HOME/dbs/init.ora; the full contents of this startup parameter file are shown later.

[grid@rac1 dbs]$ sqlplus / as sysasm

SQL> create pfile='/tmp/asm_pfile_130717.txt' from spfile;

File created.

8: Corrupt the +CRSDATA diskgroup that holds the OCR
[root@rac1 dev]# dd if=/dev/zero of=/dev/asm-b_crs bs=1024 count=1000
[root@rac1 dev]# dd if=/dev/zero of=/dev/asm-c_crs bs=1024 count=1000

9: After corrupting disks b and c, all the checks above still passed without errors. Stop CRS on rac1 and rac2:
[root@rac1 dev]# crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'
CRS-2673: Attempting to stop 'ora.crsd' on 'rac1'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'rac1'
...
CRS-4133: Oracle High Availability Services has been stopped.

[root@rac2 dev]# crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'
CRS-2673: Attempting to stop 'ora.crsd' on 'rac1'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'rac1'
...
CRS-4133: Oracle High Availability Services has been stopped.

[root@rac1 dev]# ps -ef |grep ora_
root 16189 32265 0 16:26 pts/0 00:00:00 grep ora_
[root@rac1 dev]# ps -ef |grep asm_
root 16195 32265 0 16:26 pts/0 00:00:00 grep asm_

10: Start CRS again; it now fails:
[root@rac1 dev]# crsctl start crs
CRS-4123: Oracle High Availability Services has been started.

[root@rac1 ~]# tail -50f /u01/app/11.2.0.3/grid/log/rac1/alertrac1.log

[cssd(16559)]CRS-1637:Unable to locate configured voting file with ID 2b1bd0c1-22584f5a-bf72033b-2b2d26bd; details at (:CSSNM00020:) in /u01/app/11.2.0.3/grid/log/rac1/cssd/ocssd.log
2013-07-17 16:28:15.947
[cssd(16559)]CRS-1637:Unable to locate configured voting file with ID 2bc03776-cdd94f5c-bfb9165c-473fdb0e; details at (:CSSNM00020:) in /u01/app/11.2.0.3/grid/log/rac1/cssd/ocssd.log
2013-07-17 16:28:15.947
[cssd(16559)]CRS-1705:Found 1 configured voting files but 2 voting files are required, terminating to ensure data integrity; details at (:CSSNM00021:) in /u01/app/11.2.0.3/grid/log/rac1/cssd/ocssd.log
2013-07-17 16:28:15.948
[cssd(16559)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0.3/grid/log/rac1/cssd/ocssd.log
2013-07-17 16:28:16.073
[cssd(16559)]CRS-1603:CSSD on node rac1 shutdown by user.

ocrcheck also reports errors:
[root@rac1 dev]# ocrcheck
PROT-602: Failed to retrieve data from the cluster registry
PROC-26: Error while accessing the physical storage

11: Force-stop CRS:
[root@rac1 dev]# crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'rac1'

[root@rac1 dev]# crsctl stop crs -f
CRS-2797: Shutdown is already in progress for 'rac1', waiting for it to complete
CRS-2797: Shutdown is already in progress for 'rac1', waiting for it to complete
CRS-4133: Oracle High Availability Services has been stopped.

12: Start rac1 in exclusive mode, without CRSD:
[root@rac1 dev]# crsctl start crs -excl -nocrs
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.mdnsd' on 'rac1'
CRS-2676: Start of 'ora.mdnsd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'rac1'
CRS-2676: Start of 'ora.gpnpd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'rac1'
CRS-2672: Attempting to start 'ora.gipcd' on 'rac1'
CRS-2676: Start of 'ora.cssdmonitor' on 'rac1' succeeded
CRS-2676: Start of 'ora.gipcd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'rac1'
CRS-2672: Attempting to start 'ora.diskmon' on 'rac1'
CRS-2676: Start of 'ora.diskmon' on 'rac1' succeeded
CRS-2676: Start of 'ora.cssd' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'rac1'
CRS-2679: Attempting to clean 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2672: Attempting to start 'ora.ctssd' on 'rac1'
CRS-2681: Clean of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'rac1'
CRS-2676: Start of 'ora.ctssd' on 'rac1' succeeded
CRS-2676: Start of 'ora.drivers.acfs' on 'rac1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'rac1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'rac1'
CRS-2676: Start of 'ora.asm' on 'rac1' succeeded

13: Create the CRSVOTEDISK diskgroup and the ASM spfile

[grid@rac1 ~]$ asmcmd

ASMCMD> ls
(empty)

[grid@rac1 ~]$ sqlplus / as sysasm

SQL*Plus: Release 11.2.0.3.0 Production on Wed Jul 17 16:58:18 2013

Copyright (c) 1982, 2011, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> show parameter spfile
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
spfile                               string

SQL> create diskgroup CRSVOTEDISK normal redundancy disk '/dev/asm-b_crs','/dev/asm-c_crs','/dev/asm-d_crs'
  2  attribute 'compatible.asm'='11.2.0.0.0', 'compatible.rdbms'='11.2.0.0.0';
create diskgroup CRSVOTEDISK normal redundancy disk '/dev/asm-b_crs','/dev/asm-c_crs','/dev/asm-d_crs'
*
ERROR at line 1:
ORA-15018: diskgroup cannot be created
ORA-15033: disk '/dev/asm-d_crs' belongs to diskgroup "CRSDATA"   -- this fails because the asm-d_crs disk header was never wiped

Wipe the asm-d_crs disk header:
[root@rac1 dev]# dd if=/dev/zero of=/dev/asm-d_crs bs=1024 count=1000

SQL> create diskgroup CRSVOTEDISK normal redundancy disk '/dev/asm-b_crs','/dev/asm-c_crs','/dev/asm-d_crs'
  2  attribute 'compatible.asm'='11.2.0.0.0', 'compatible.rdbms'='11.2.0.0.0';

Diskgroup created.

SQL> create spfile='+CRSVOTEDISK' from pfile='/tmp/asm_pfile_130717.txt';

File created.

SQL> quit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
[grid@rac1 ~]$ asmcmd
ASMCMD> ls
CRSVOTEDISK/
ASMCMD> ls CRSVOTEDISK
ad-cluster/
ASMCMD> ls CRSVOTEDISK/ad-cluster/
ASMPARAMETERFILE/
ASMCMD> ls CRSVOTEDISK/ad-cluster/ASMPARAMETERFILE
REGISTRY.253.821034567

14: Restore the OCR from backup
Point the OCR location from the original +CRSDATA diskgroup to the newly created +CRSVOTEDISK:
[root@rac1 dev]# vim /etc/oracle/ocr.loc

ocrconfig_loc=+CRSVOTEDISK
local_only=FALSE

[root@rac1 dev]# ocrconfig -restore /u01/app/11.2.0.3/grid/cdata/ad-cluster/backup00.ocr

An OCRFILE directory has now appeared:
ASMCMD> ls CRSVOTEDISK/ad-cluster
ASMPARAMETERFILE/
OCRFILE/
ASMCMD> ls CRSVOTEDISK/ad-cluster/OCRFILE -l
Type Redund Striped Time Sys Name
OCRFILE MIRROR COARSE JUL 17 17:00:00 Y REGISTRY.255.821036449
ASMCMD> ls CRSVOTEDISK/ad-cluster/ASMPARAMETERFILE -l
Type Redund Striped Time Sys Name
ASMPARAMETERFILE MIRROR COARSE JUL 17 17:00:00 Y REGISTRY.253.821034567

The check now succeeds:
[root@rac1 dev]# ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 3
Total space (kbytes) : 262120
Used space (kbytes) : 3016
Available space (kbytes) : 259104
ID : 1236405787
Device/File Name : +CRSVOTEDISK
Device/File integrity check succeeded

Device/File not configured

Device/File not configured

Device/File not configured

Device/File not configured

Cluster registry integrity check succeeded

15: Restore the voting disks:

[root@rac1 dev]# crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- -----  -----------------  ---------  ---------
1. OFFLINE 2b1bd0c122584f5abf72033b2b2d26bd () []
2. OFFLINE 2bc03776cdd94f5cbfb9165c473fdb0e () []
3. ONLINE 3b43c39513a64f2dbf7083a9510ada89 (/dev/asm-d_crs) [CRSDATA]
Located 3 voting disk(s).

[root@rac1 dev]# crsctl replace votedisk +CRSVOTEDISK
CRS-4602: Failed 27 to add voting file 5818c2c531394f45bff13c5a7532c8d4.
CRS-4602: Failed 27 to add voting file 1ce0436528624faabf7d4a1dd8dc978a.
CRS-4602: Failed 27 to add voting file 09def2b244af4f42bf13679a8aa0ff73.
Failure 27 with Cluster Synchronization Services while deleting voting disk.
Failure 27 with Cluster Synchronization Services while deleting voting disk.
Failure 27 with Cluster Synchronization Services while deleting voting disk.
Failed to replace voting disk group with +CRSVOTEDISK.
CRS-4000: Command Replace failed, or completed with errors.

These errors were caused by the asm-d_crs disk header not having been wiped at the beginning.

======================================== At this point recovery of the voting disks has failed; start over and try again ============
Notes for the retry:

After starting with crsctl start crs -excl -nocrs, shut down ASM immediately and restart it with the saved pfile; do not create the CRSVOTEDISK diskgroup right away.

Otherwise, creating the diskgroup may fail with errors like the following:

For example (the steps below are for illustration only, up to "Now retry the recovery:"):

[grid@rac1 ~]$ sqlplus / as sysasm
SQL> create diskgroup CRSVOTEDISK normal redundancy disk '/dev/asm-b_crs','/dev/asm-c_crs','/dev/asm-d_crs'
  2  attribute 'compatible.asm'='11.2.0.0.0', 'compatible.rdbms'='11.2.0.0.0';
create diskgroup CRSVOTEDISK normal redundancy disk '/dev/asm-b_crs','/dev/asm-c_crs','/dev/asm-d_crs'
*
ERROR at line 1:
ORA-15018: diskgroup cannot be created
ORA-15031: disk specification '/dev/asm-d_crs' matches no disks
ORA-15014: path '/dev/asm-d_crs' is not in the discovery set
ORA-15031: disk specification '/dev/asm-c_crs' matches no disks
ORA-15014: path '/dev/asm-c_crs' is not in the discovery set
ORA-15031: disk specification '/dev/asm-b_crs' matches no disks
ORA-15014: path '/dev/asm-b_crs' is not in the discovery set   -- the devices cannot be found for the same reason as below: no discovery path is set

SQL> col PATH for a50
SQL> select group_number, disk_number, mount_status, header_status, path from v$asm_disk;
no rows selected
No disks were recognized. It is now clear why: the discovery parameters were simply not set.

SQL> show parameter asm

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskgroups                       string      DATA
asm_diskstring                       string      /dev/asm*
asm_power_limit                      integer     1
asm_preferred_read_failure_groups    string

(When ASM is started with the default parameter file, asm_diskgroups is empty (and it would not list the diskgroup holding the OCR anyway) and asm_diskstring is empty, so no discovery path is scanned. The values above come from the saved pfile.)

So to be safe: after starting with crsctl start crs -excl -nocrs, shut down ASM immediately, then restart it with the saved pfile before creating the CRSVOTEDISK diskgroup:
SQL> startup pfile='/tmp/asm_pfile_130717.txt';

[grid@rac1 ~]$ cat /tmp/asm_pfile_130717.txt
+ASM1.__oracle_base='/u01/app/grid'#ORACLE_BASE set from in memory value
+ASM1.asm_diskgroups='DATA'#Manual Mount
+ASM2.asm_diskgroups='DATA'#Manual Mount
*.asm_diskstring='/dev/asm*'
*.asm_power_limit=1
*.diagnostic_dest='/u01/app/grid'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'

Querying v$asm_disk again now returns the disks, and the diskgroup can be created smoothly. Precisely because ASM was not shut down immediately and restarted with the corrected parameter file, diskgroup creation kept complaining that no disks could be found, which wasted half a day.

Now retry the recovery:

Stop CRS, then start it again in exclusive mode:
[root@rac1 dev]# crsctl stop crs
[root@rac1 dev]# crsctl start crs -excl -nocrs
[root@rac1 dev]# crsctl query css votedisk
Located 0 voting disk(s).

Shut down ASM on rac1, restart it with the pfile, then create the CRS diskgroup and the spfile:

[grid@rac1 ~]$ sqlplus / as sysasm

SQL>shutdown immediate
ASM diskgroups dismounted
ASM instance shutdown
SQL> startup pfile='/tmp/asm_pfile_130717.txt';
ASM instance started

SQL> col path for a50
SQL> set linesize 130
SQL> select group_number, disk_number, mount_status, header_status, path from v$asm_disk;

GROUP_NUMBER DISK_NUMBER MOUNT_S HEADER_STATU PATH
------------ ----------- ------- ------------ ------------------------------
           0           0 CLOSED  MEMBER       /dev/asm-e_data
           0           3 CLOSED  CANDIDATE    /dev/asm-b_crs
           0           2 CLOSED  CANDIDATE    /dev/asm-c_crs
           0           1 CLOSED  CANDIDATE    /dev/asm-d_crs

SQL> create diskgroup CRSVOTEDISK normal redundancy disk '/dev/asm-b_crs','/dev/asm-c_crs','/dev/asm-d_crs'
  2  attribute 'compatible.asm'='11.2.0.0.0', 'compatible.rdbms'='11.2.0.0.0';

Diskgroup created.

SQL> create spfile='+CRSVOTEDISK ' from pfile='/tmp/asm_pfile_130717.txt';

File created.

SQL> quit

Restore the OCR:
[root@rac1 dev]# ocrconfig -restore /u01/app/11.2.0.3/grid/cdata/ad-cluster/backup00.ocr

Restore the voting disks:

[root@rac1 dev]# crsctl replace votedisk +CRSVOTEDISK

Successful addition of voting disk 1b00b0ec4e504f7fbf1f8d20fbbfaa4b.
Successful addition of voting disk 5a3b646433124fdcbf23c3c290de7fe3.
Successful addition of voting disk 5d27d80b96d74f09bf1756be6dee387f.
Successfully replaced voting disk group with +CRSVOTEDISK .
CRS-4266: Voting file(s) successfully replaced

Verify:
[root@rac1 ~]# ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 3
Total space (kbytes) : 262120
Used space (kbytes) : 3016
Available space (kbytes) : 259104
ID : 1236405787
Device/File Name : +CRSVOTEDISK
Device/File integrity check succeeded

Device/File not configured

Device/File not configured

Device/File not configured

Device/File not configured

Cluster registry integrity check succeeded

Logical corruption check succeeded

[root@rac1 ~]# crsctl query css votedisk
## STATE File Universal Id File Name Disk group
-- -----  -----------------  ---------  ---------
1. ONLINE 1b00b0ec4e504f7fbf1f8d20fbbfaa4b (/dev/asm-b_crs) [CRSVOTEDISK ]
2. ONLINE 5a3b646433124fdcbf23c3c290de7fe3 (/dev/asm-c_crs) [CRSVOTEDISK ]
3. ONLINE 5d27d80b96d74f09bf1756be6dee387f (/dev/asm-d_crs) [CRSVOTEDISK ]
Located 3 voting disk(s).

Stop CRS and start it in the normal way:
[root@rac1 ~]# crsctl stop crs
[root@rac1 ~]# crsctl start crs

At this point the OCR and voting disks are fully recovered, but remember to change ocrconfig_loc=+CRSVOTEDISK in /etc/oracle/ocr.loc on rac2 as well, otherwise startup fails with:

[/u01/app/11.2.0.3/grid/bin/oraagent.bin(19510)]CRS-5019:All OCR locations are on ASM disk groups [CRSDATA], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2.0.3/grid/log/rac1/agent/ohasd/oraagent_grid/oraagent_grid.log".
2013-07-18 00:10:33.678
[/u01/app/11.2.0.3/grid/bin/oraagent.bin(19510)]CRS-5019:All OCR locations are on ASM disk groups [CRSDATA], and none of these disk groups are mounted. Details are at "(:CLSN00100:)" in "/u01/app/11.2.0.3/grid/log/rac1/agent/ohasd/oraagent_grid/oraagent_grid.log".
2013-07-18 00:11:03.614
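
One way to make that change on rac2 (a sketch; run as root and keep a backup copy of the file):

[root@rac2 ~]# sed -i.bak 's/^ocrconfig_loc=.*/ocrconfig_loc=+CRSVOTEDISK/' /etc/oracle/ocr.loc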

[root@rac2 ~]# crsctl start crs

[root@rac1 ~]# crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.CRSDATA.dg ora….up.type ONLINE OFFLINE
ora.DATA.dg ora….up.type ONLINE ONLINE rac1
ora….ER.lsnr ora….er.type ONLINE ONLINE rac1
ora….N1.lsnr ora….er.type ONLINE ONLINE rac1
ora.asm ora.asm.type ONLINE ONLINE rac1
ora.chris.db ora….se.type ONLINE ONLINE rac1
ora.cvu ora.cvu.type ONLINE ONLINE rac1
ora.gsd ora.gsd.type OFFLINE OFFLINE
ora….network ora….rk.type ONLINE ONLINE rac1
ora.oc4j ora.oc4j.type ONLINE ONLINE rac1
ora.ons ora.ons.type ONLINE ONLINE rac1
ora….SM1.asm application ONLINE ONLINE rac1
ora….C1.lsnr application ONLINE ONLINE rac1
ora.rac1.gsd application OFFLINE OFFLINE
ora.rac1.ons application ONLINE ONLINE rac1
ora.rac1.vip ora….t1.type ONLINE ONLINE rac1
ora….SM2.asm application ONLINE ONLINE rac2
ora….C2.lsnr application ONLINE ONLINE rac2
ora.rac2.gsd application OFFLINE OFFLINE
ora.rac2.ons application ONLINE ONLINE rac2
ora.rac2.vip ora….t1.type ONLINE ONLINE rac2
ora….ry.acfs ora….fs.type ONLINE ONLINE rac1
ora.scan1.vip ora….ip.type ONLINE ONLINE rac1

A script for quickly mapping multipath devices to disks

This script checks the multipath situation of disks in a RAC environment. When adding disks to ASM, great care is needed: the mapping between multipath devices and disks must be verified repeatedly, since adding the wrong disk does happen. This script helps the DBA judge the multipath paths of the corresponding LUN when adding disks.

On AIX, change sd to rlv and substitute the appropriate multipath command.

The script:

#!/bin/ksh
for disk in `ls /dev/sd*`
do
disk_short=`basename $disk`
wwid=`scsi_id -g -s /block/$disk_short`
if [ "$wwid" != "" ]
then
  echo -e "Disk:" $disk_short "\tWWID:" $wwid
fi
done

Output:


Disk: sda       WWID: 3600601602b702d006218b7de8130e111
Disk: sdaa      WWID: 3600601602b702d000652b695c648e111
Disk: sdab      WWID: 3600601602b702d000752b695c648e111
Disk: sdac      WWID: 3600601602b702d007f2a73fbc648e111
Disk: sdad      WWID: 3600601602b702d007e2a73fbc648e111
Disk: sdae      WWID: 3600601602b702d006218b7de8130e111
Disk: sdaf      WWID: 3600601602b702d0036d7cf191241e111
Disk: sdag      WWID: 3600601602b702d0037d7cf191241e111
Disk: sdah      WWID: 3600601602b702d0038d7cf191241e111
#!/bin/bash -x
ORACLEASM=/etc/init.d/oracleasm
echo "ASM Diskgroup Mapping luns"
echo "----------------------------------------------------"
for f in `$ORACLEASM listdisks`
do
dp=`$ORACLEASM querydisk -p $f | head -2 | grep /dev | awk -F: '{print $1}'`
echo "$f: $dp"
done

Quickly listing the mapping between ASM disks and OS disk devices

Oracle has an official document with even more convenient kfed-based scripts, quoted below:

That information can be obtained with the following shell script:

#!/bin/bash
for asmlibdisk in `ls /dev/oracleasm/disks/*`
do
echo "ASMLIB disk name: $asmlibdisk"
asmdisk=`kfed read $asmlibdisk | grep dskname | tr -s ' ' | cut -f2 -d' '`
echo "ASM disk name: $asmdisk"
majorminor=`ls -l $asmlibdisk | tr -s ' ' | cut -f5,6 -d' '`
device=`ls -l /dev | tr -s ' ' | grep -w "$majorminor" | cut -f10 -d' '`
echo "Device path: /dev/$device"
done

The script can be run as the OS user that owns the ASM or Grid Infrastructure home (oracle/grid), i.e. it does not need to be run as a privileged user. The only requirement is that the kfed binary exists and is in the PATH.

If an ASMLIB disk was already deleted, it will not show up in /dev/oracleasm/disks. We can check for devices that are (or were) associated with ASM with the following shell script:

#!/bin/bash
for device in `ls /dev/sd*`
do
asmdisk=`kfed read $device | grep ORCL | tr -s ' ' | cut -f2 -d' ' | cut -c1-4`
if [ "$asmdisk" = "ORCL" ]
then
echo "Disk device $device may be an ASM disk"
fi
done

The second script takes a peek at the sd devices in /dev, so in addition to requiring the kfed binary in the PATH, it also needs to be run as a privileged user. Of course we can look at /dev/dm*, /dev/mapper, etc., or at all devices in /dev, although that may not be a good idea.

The kfed binary should be available in RDBMS home (prior to version 11.2) and in the Grid Infrastructure home (in version 11.2 and later). If the binary is not there, it can be built as follows:

cd $ORACLE_HOME/rdbms/lib
make -f ins* ikfed

Where ORACLE_HOME is the RDBMS home (prior to version 11.2) and the Grid Infrastructure home in version 11.2 and later.

The same can be achieved without kfed with a script like this:

#!/bin/bash
for device in `ls /dev/sd*`
do
asmdisk=`od -c $device | head | grep 0000040 | tr -d ' ' | cut -c8-11`
if [ "$asmdisk" = "ORCL" ]
then
echo "Disk device $device may be an ASM disk"
fi
done

Hitting an 11g R2 DRM bug: gcs resource directory to be unfrozen

The customer environment was an 11.2.0.2 cluster on AIX, and a database instance hung. Seeing 'gcs resource', the immediate reaction was DRM and LMON, and the AWR taken before the hang also showed waits concentrated on 'gcs resource directory to be unfrozen'. In this situation the checks usually focus on GCS-related information: the database alert log, the LMON trace, and the LMS traces. What happened here is that DRM was triggered but the resources were never remastered, hanging the instance; the root cause was an oversized buffer cache. The LMON trace matches official bug 12879027, so PSU 11.2.0.2.7 was applied to both the DB and GI homes, with continued monitoring afterwards.

The LMON process trace shows the following:

*** 2014-08-14 21:13:51.87
 CGS recovery timeout = 85 sec
Begin DRM(231) (swin 1)
* drm quiesce

*** 2014-08-14 21:17:06.782
* Request pseudo reconfig due to drm quiesce hang
2012-07-14 21:17:03.752735 : kjfspseudorcfg: requested with reason 5(DRM Quiesce step stall)

*** 2014-08-14 21:17:04.911
kjxgmrcfg: Reconfiguration started, type 6
CGS/IMR TIMEOUTS:
 CSS recovery timeout = 31 sec (Total CSS waittime = 65)
 IMR Reconfig timeout = 75 sec
 CGS rcfg timeout = 85 sec
kjxgmcs: Setting state to 70 0.
 - AWR Top waits are "gcs resource directory to be unfrozen" & "gc remaster"

The official bug note:

Bug 12879027  LMON gets stuck in DRM quiesce causing intermittent pseudo reconfiguration

 This note gives a brief overview of bug 12879027. 
 The content was last updated on: 15-OCT-2013

Affects:

Product (Component) Oracle Server (Rdbms)
Range of versions believed to be affected Versions BELOW 12.1
Versions confirmed as being affected
Platforms affected Generic (all / most platforms affected)

Fixed:

The fix for 12879027 is first included in

Interim patches may be available for earlier versions.

Symptoms:

Related To:

Description

This bug is only relevant when using Real Application Clusters (RAC)

LMON process can get stuck in the DRM quiesce step triggering
pseudo reconfiguration eventually.

Rediscovery Notes:
 DRM quiesce step hangs and triggers pseudoreconfiguration especially
 in single window DRM and when the buffer cache is very large.

Workaround
 None

Getting a Fix
 Use one of the "Fixed" versions listed above
 (for Patch Sets / bundles use the latest version available as
  contents are cumulative - the "Fixed" version listed above is
  the first version where the fix is included)
 or
 You can check for existing interim patches here: Patch:12879027

Please note: The above is a summary description only. Actual symptoms can vary. Matching to any symptoms here does not confirm that you are encountering this problem. For questions about this bug please consult Oracle Support.

References

Bug:12879027 (This link will only work for PUBLISHED bugs)
Note:245840.1 Information on the sections in this article