Skip to content

Oracle恢复 - 3. page

使用kfed修复损坏的asm disk header以及恢复原理测试.(disk header backup in au no.1)

In ASM versions 11.1.0.7 and later, the ASM disk header block is backed up in the second last ASM metadata block in the allocation unit 1.
Kfed parameters

aun – Allocation Unit (AU) number to read from. Default is AU0, or the very beginning of the ASM disk.
aus – AU size. Default is 1048576 (1MB). Specify the aus when reading from a disk group with non-default AU size.
blkn – block number to read. Default is block 0, or the very first block of the AU.
dev – ASM disk or device name. Note that the keyword dev can be omitted, but the ASM disk name is mandatory.
Understanding ASM disk layout

Read ASM disk header block from AU[0]

[root@grac41 Desktop]# kfed read  /dev/asm_test_1G_disk1 | egrep 'name|size|type'
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD   <-- ASM disk header
kfdhdb.dskname:               TEST_0000 ; 0x028: length=9          <-- ASM disk name
kfdhdb.grpname:                    TEST ; 0x048: length=4          <-- ASM DG name
kfdhdb.fgname:                TEST_0000 ; 0x068: length=9          <-- ASM Failgroup
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200            <-- Disk sector size
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000            <-- ASM block size
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000        <-- AU size : 1 Mbyte
kfdhdb.dsksize:                    1023 ; 0x0c4: 0x000003ff        <-- ASM disk size : 1 GByte

Check ASM block types for the first 2 AUs
AU[0] :

[root@grac41 Desktop]# kfed find /dev/asm_test_1G_disk1
Block 0 has type 1
Block 1 has type 2
Block 2 has type 3
Block 3 has type 3
Block 4 has type 3
Block 5 has type 3
Block 6 has type 3
Block 7 has type 3
Block 8 has type 3
Block 9 has type 3
Block 10 has type 3
..
Block 252 has type 3
Block 253 has type 3
Block 254 has type 3
Block 255 has type 3

AU[1] :

[root@grac41 Desktop]#  kfed find /dev/asm_test_1G_disk1 aun=1
Block 256 has type 17
Block 257 has type 17
Block 258 has type 13
Block 259 has type 18
Block 260 has type 13
..
Block 508 has type 13
Block 509 has type 13
Block 510 has type 1
Block 511 has type 19

Summary :

–> Disk header size is 512 bytes
AU size = 1Mbyte –> AU block size = 4096
This translates to 1048576 / 4096 = 256 blocks to read an AU ( start with block 0 – 255 )
Block 510 and block 0 storing an ASM disk header ( == type 1 )

Run the kfed command below if you interested in a certain ASM block type ( use output from kfed find to the type info )
[root@grac41 Desktop]# kfed read /dev/asm_test_1G_disk1 aun=1 blkn=255 | egrep ‘type’
kfbh.type: 19 ; 0x002: KFBTYP_HBEAT

Some ASM block types

[root@grac41 Desktop]# kfed read  /dev/asm_test_1G_disk1 aun=0 blkn=0  | egrep 'type'
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.type:                            2 ; 0x002: KFBTYP_FREESPC
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
kfbh.type:                            5 ; 0x002: KFBTYP_LISTHEAD
kfbh.type:                           13 ; 0x002: KFBTYP_PST_NONE
kfbh.type:                           18 ; 0x002: KFBTYP_PST_DTA
kfbh.type:                           19 ; 0x002: KFBTYP_HBEAT

Repair ASM disk header block in AU[0] with kfed repair

In ASM versions 11.1.0.7 and later, the ASM disk header block is backed up in the second last ASM metadata block in the allocation unit 1.
Verify ASM DISK Header block located in AU[0] and AU[1]
AU[0] :

[root@grac41 Desktop]# kfed read  /dev/asm_test_1G_disk1 aun=0 blkn=0 | egrep 'name|size|type'
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:               TEST_0000 ; 0x028: length=9
kfdhdb.grpname:                    TEST ; 0x048: length=4
kfdhdb.fgname:                TEST_0000 ; 0x068: length=9
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.dsksize:                    1023 ; 0x0c4: 0x000003ff

AU[1] :

[root@grac41 Desktop]# kfed read  /dev/asm_test_1G_disk1 aun=1 blkn=254  | egrep 'name|size|type'
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:               TEST_0000 ; 0x028: length=9
kfdhdb.grpname:                    TEST ; 0x048: length=4
kfdhdb.fgname:                TEST_0000 ; 0x068: length=9
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.dsksize:                    1023 ; 0x0c4: 0x000003ff

Erase Disk header block in first AU ( aun=0 blkn=0 )

# dd if=/dev/zero of=/dev/asm_test_1G_disk1  bs=4096 count=1

Verify ASM disk header

# kfed read /dev/asm_test_1G_disk1 aun=0 blkn=0
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]
--> Corrupted ASM disk header detected in AU [0]

Repair disk header in AU[0] with kfed

[grid@grac41 ASM]$ kfed repair  /dev/asm_test_1G_disk1
[grid@grac41 ASM]$ kfed read /dev/asm_test_1G_disk1 aun=0 blkn=0
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:               TEST_0000 ; 0x028: length=9
kfdhdb.grpname:                    TEST ; 0x048: length=4
kfdhdb.fgname:                TEST_0000 ; 0x068: length=9
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.dsksize:                    1023 ; 0x0c4: 0x000003ff
--> kfed repair worked - Disk header restored

Can kfed repair the Disk header block stored in the 2.nd AU ?

Delete Disk header block in AU[1]
First use dd to figure out whether we are getting the correct block

[grid@grac41 ASM]$  dd if=/dev/asm_test_1G_disk1 of=-  bs=4096 count=1 skip=510 ; strings block1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000464628 s, 8.8 MB/s
ORCLDISK
TEST_0000
TEST
TEST_0000
--> looks like an ASM disk header - go ahead and erase that block

[grid@grac41 ASM]$  dd if=/dev/zero of=/dev/asm_test_1G_disk1  bs=4096 count=1  seek=510
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.00644028 s, 636 kB/s

Verify ASM disk header block in AU[1]

[grid@grac41 ASM]$ kfed read /dev/asm_test_1G_disk1 aun=1 blkn=254
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]
--> Corrupted ASM disk header detected

[grid@grac41 ASM]$ kfed repair  /dev/asm_test_1G_disk1
KFED-00320: Invalid block num1 = [0], num2 = [1], error = [endian_kfbh]
--> kfed repair doesn' work

Repair block with dd

grid@grac41 ASM]$ dd if=/dev/asm_test_1G_disk1  bs=4096  count=1 of=/dev/asm_test_1G_disk1  bs=4096 count=1  seek=510
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.0306682 s, 134 kB/s
[grid@grac41 ASM]$ kfed read /dev/asm_test_1G_disk1 aun=0 blkn=0
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:               TEST_0000 ; 0x028: length=9
kfdhdb.grpname:                    TEST ; 0x048: length=4
kfdhdb.fgname:                TEST_0000 ; 0x068: length=9
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.dsksize:                    1023 ; 0x0c4: 0x000003ff

# kfed read /dev/asm_test_1G_disk1 aun=1 blkn=254
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.dskname:               TEST_0000 ; 0x028: length=9
kfdhdb.grpname:                    TEST ; 0x048: length=4
kfdhdb.fgname:                TEST_0000 ; 0x068: length=9
kfdhdb.capname:                         ; 0x088: length=0
kfdhdb.secsize:                     512 ; 0x0b8: 0x0200
kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000
kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000
kfdhdb.dsksize:                    1023 ; 0x0c4: 0x000003ff

Summary:

to fix the backup block or the ASM disk header in AU 1 block you need to use dd

Reference:

http://laurent-leturgez.com/2012/11/12/how-asm-disk-header-block-repair-works/
http://asmsupportguy.blogspot.fr/2010/04/kfed-asm-metadata-editor.html
http://asmsupportguy.blogspot.co.uk/2011/08/asm-disk-header.html

浅谈Oracle数据库坏块( Database corruption part 1 )

关于坏块这一类的故障,从业这些年遇见得比较多,有的数据还能抢救,有的数据直接就丢了,更甚者数据库因此报废了. 一旦碰到,虽说不一定棘手,但是不免心里求佛一翻.

所谓Oracle数据库坏块顾名思义oracle数据库所在的数据存储介质的内容出现了讹误混乱或者无法访问.在Oracle自己的范围内对坏块有定义,这个可以参考文档[Note 840978.1].

坏块的种类有很多,它有可能因为各种原因出现在Oracle database的几种文件上面,报错也因此各有差异,处理的手段方法也各不相同,很早之前我就承诺过对这部分内容进行梳理更新到ludatou.com上,而现在我和陈辉正在对Mdata针对坏块部分的处理在做研发.所以趁此机会也尽量将文章更新到博客上,大部分上会是原创,有好的文章我也会转发.浅谈坏块这一系列的文章以方法和原理为主,除此之外我会把以往的坏块处理案例以日志方式更新,不作为浅谈部分文章.

浅谈坏块内容主要包含以下几部分:

    坏块分类以及产生的主要原因
    坏块的构造,检测方法以及可能发生的Oracle文件
    坏块的主要错误以及处理方法
    坏块的预防方式

本篇为坏块分类以及产生的主要原因

————————————————

1.坏块的主要分类,来自Oracle官方文档的解释,有助于更直接的了解区分物理坏块和逻辑坏块的区别(Note 840978.1有更加详细的描述):

For purposes of the paper we will categorize corruption under three general areasand give best practices for prevention, detection and repair for each:

    Memory corruption
    Logical corruption(soft corruption)
    Media corruption(Physicalcorruption)

Physicalor structural corruption can be defined as damage to internal data structureswhich do not allow Oracle software to find user data within the database. Logical corruption involves Oracle beingable to find the data, but the data values are incorrect as far as the end useris concerned.
Physica lcorruption due to hardware or software can occur in two general places — inmemory (including various IO buffers and the Oracle buffer cache) or on disk.Operator error such as overwriting a file can also be defined as a physicalcorruption. Logical corruption on theother hand is usually due to end-user error or non-robust(?) applicationdesign. A small physical corruption such as a single bit flip may be mistakenfor a logical error.

2.坏块的产生主要原因
坏块产生的原因很多 ,这里根据资料整理以及历史遭遇梳理出主要产生数据库坏块的原因.

    2.1 硬件问题
    Oracle进程在处理一个数据块时,首先将其读入物理内存空间,在处理完成后,再由特定进程将其写回磁盘;如果在这个过程中,出现内存故障,CPU计算失误,都会导致内存数据块的内容混乱,最后反映到写回磁盘的数据块内容有误。同样,如果存储子系统出现异常,数据块损坏也就随之出现了。比如硬盘出现坏道,那么这部分的数据就处于无法读取的情况,和坏块没有区别.

    2.2 操作系统BUG
    由于Oracle进程对数据块的读写,都是以操作系统内核调用(system call)的方式完成的,如果操作系统在内核调用存在问题,必然导致Oracle进程写入非法的内容。

    2.3 操作系统的I/O错误或缓冲问题
    比如写丢失,io系统缓存掉电,io操作结果被截断

    2.4 内存或paging问题
    Oracle软件有较多的bug能导致坏块问题的出现,

    2.5 非Oracle进程扰乱Oracle共享内存区域
    如上文所述,在当数据块的内容被读入主机的物理内存时,如果其他非Oracle进程,对Oracle使用的共享内存区域形成了扰乱,最终导致写回磁盘的数据块内容混乱。

    2.6 异常关机,掉电,终止服务
    异常关机,掉电,终止服务使进程异常终止,也会导致坏块产生。

    2.7 数据库操作人为失误
    使用nologging选项操作后再做恢复后可能造成坏块,或者人为的破坏文件导致的坏块。