
All posts by Guang Cai Li - page 8

1. amdu (ASM Metadata Dump Utility): an introduction

 

AMDU is a utility that extracts all the available metadata from one or more ASM disks, or produces a formatted printout of individual blocks. It does not require the ASM disk group to be mounted.

Oracle Support may ask you to provide AMDU output, particularly when you run into internal errors related to ASM metadata.

Note: the amdu binary has shipped with the Oracle installation since 11.1. It can also be used with ASM 10g, but must be downloaded separately: Note 553639.1 - Placeholder for AMDU binaries and using with ASM 10g

 

AMDU has three main functions:

  1. Dump the metadata from ASM disks to the file system for analysis
  2. Extract the contents of ASM files and write them out to the OS file system, whether or not the disk group can be mounted
  3. Print block metadata, either as the C structures contained in the block or as a hex dump

 

References:

ASM tools used by Support : KFOD, KFED, AMDU (Doc ID 1485597.1)

Note 553639.1 – Placeholder for AMDU binaries and using with ASM 10g

https://blogs.oracle.com/askmaclean/entry/使用amdu工具从无法mount的diskgroup中抽取数据文件

 

2. Extracting ASM disk metadata with amdu

 

How to find the name of the ASM disk group a given ASM disk belongs to:

kfed read /dev/asm-diskb |grep name
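With many candidate disks, the group name (kfdhdb.grpname) can be pulled out of the kfed output programmatically. A sketch follows; since kfed requires a Grid Infrastructure installation, a captured kfed fragment stands in for the real command here, and in practice you would pipe `kfed read /dev/asm-diskb` instead:

```shell
# Extract the disk group name (kfdhdb.grpname) from kfed output.
# kfed_sample is a captured fragment standing in for `kfed read <disk>`.
kfed_sample='kfdhdb.dskname:               DATA_0000 ; 0x028: length=9
kfdhdb.grpname:                    DATA ; 0x048: length=4
kfdhdb.fgname:                DATA_0000 ; 0x068: length=9'
dg=$(printf '%s\n' "$kfed_sample" | awk '/^kfdhdb\.grpname/ {print $2}')
echo "$dg"    # DATA
```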

2.1 Dump the metadata of a specified disk group:

[grid@luda5 ~]$ amdu -diskstring '/dev/asm-disk*' -dump 'DATA'

amdu_2017_04_13_16_02_22/

AMDU-00204: Disk N0002 is in currently mounted diskgroup DATA

AMDU-00201: Disk N0002: ‘/dev/asm-diskc’

[grid@luda5 ~]$ ls

amdu_2017_04_13_16_02_22  ocm_temp.rsp  oradiag_grid  PatchSearch.xml

[grid@luda5 ~]$ cd amdu_2017_04_13_16_02_22/

[grid@luda5 amdu_2017_04_13_16_02_22]$ ls

DATA_0001.img  DATA.map  report.txt

[grid@luda5 amdu_2017_04_13_16_02_22]$ du -sh ./*

95M     ./DATA_0001.img

8.0K    ./DATA.map

4.0K    ./report.txt

 

Output of the amdu command:

Each amdu run creates a directory at the current location, named amdu_YYYY_MM_DD_HH24_MI_SS; the -directory parameter lets you customize this location.

By default the directory contains the following files:

<diskgroup>_0001.img - a dump of the disk group's contents; each file is limited to 2GB, but there can be multiple files

<diskgroup>.map - can be used to locate the exact position of the ASM metadata on the disks

report.txt - details about the disks that were scanned
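Because the .img files are raw dumps laid out in allocation units, a single AU can be carved out with dd once you know the AU size and the AU number (both reported in report.txt and DATA.map). The following self-contained demo uses a synthetic 4-AU image rather than a real dump:

```shell
# Build a 4-AU synthetic image, mark AU 2, then carve AU 2 back out with dd.
AU_SIZE=1048576    # 1 MB, as reported in report.txt
dd if=/dev/zero of=demo.img bs=$AU_SIZE count=4 2>/dev/null
printf 'AU2MARK' | dd of=demo.img bs=$AU_SIZE seek=2 conv=notrunc 2>/dev/null
dd if=demo.img of=au2.bin bs=$AU_SIZE skip=2 count=1 2>/dev/null
marker=$(head -c 7 au2.bin)
echo "$marker"    # AU2MARK
```

Against a real dump you would replace demo.img with <diskgroup>_0001.img and take skip= from the AU number listed in DATA.map.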

2.2 Inspect the amdu report

[grid@luda5 amdu_2017_04_13_16_02_22]$ cat report.txt

-*-amdu-*-

 

******************************* AMDU Settings ********************************

ORACLE_HOME = /u01/11.2.0/grid

System name:    Linux

Node name:      luda5.luda.com

Release:        3.8.13-16.2.1.el6uek.x86_64

Version:        #1 SMP Thu Nov 7 17:01:44 PST 2013

Machine:        x86_64

amdu run:       13-APR-17 16:02:22

Endianess:      1

=====>> host information for this run

——————————— Operations ———————————

-dump DATA

=====>> the operation that was performed

——————————- Disk Selection ——————————-

-diskstring ‘/dev/asm-disk*’

=====>> the ASM disk string that was used

—————————— Reading Control ——————————-

 

——————————- Output Control ——————————-

 

********************************* DISCOVERY **********************************

 

—————————– DISK REPORT N0001 ——————————

Disk Path: /dev/asm-diskb

Unique Disk ID:

Disk Label:

Physical Sector Size: 512 bytes

Disk Size: 1024 megabytes

** NOT A VALID ASM DISK HEADER. BAD VALUE IN FIELD blk_kfbl **

=====>> the disk header is reported as invalid here; this disk actually belongs to a different disk group (OCR).

—————————– DISK REPORT N0002 ——————————

Disk Path: /dev/asm-diskc   =====>> the ASM disk path

Unique Disk ID:

Disk Label:

Physical Sector Size: 512 bytes   =====>> physical sector size

Disk Size: 4096 megabytes  =====>> disk size

Group Name: DATA  =====>> disk group name

Disk Name: DATA_0000

Failure Group Name: DATA_0000

Disk Number: 0

Header Status: 3

Disk Creation Time: 2017/03/23 20:51:44.637000

Last Mount Time: 2017/04/13 14:56:19.845000

Compatibility Version: 0x0b200000(11020000)   =====>> version information

Disk Sector Size: 512 bytes

Disk size in AUs: 4096 Aus

Group Redundancy: 1

Metadata Block Size: 4096 bytes

AU Size: 1048576 bytes  =====>> the AU size is 1MB

Stride: 113792 AUs

Group Creation Time: 2017/03/23 20:51:44.428000

File 1 Block 1 location: AU 2

OCR Present: NO  =====>> not an OCR disk

 

***************** Slept for 6 seconds waiting for heartbeats *****************

 

************************** SCANNING DISKGROUP DATA ***************************

Creation Time: 2017/03/23 20:51:44.428000

Disks Discovered: 1

Redundancy: 1

AU Size: 1048576 bytes

Metadata Block Size: 4096 bytes

Physical Sector Size: 512 bytes

Metadata Stride: 113792 AU

Duplicate Disk Numbers: 0

 

 

—————————- SCANNING DISK N0002 —————————–

Disk N0002: ‘/dev/asm-diskc’

AMDU-00204: Disk N0002 is in currently mounted diskgroup DATA

AMDU-00201: Disk N0002: ‘/dev/asm-diskc’

** HEARTBEAT DETECTED **

Allocated AU’s: 720

Free AU’s: 3376

AU’s read for dump: 102

Block images saved: 24070

Map lines written: 102

Heartbeats seen: 1

Corrupt metadata blocks: 0

Corrupt AT blocks: 0

 

 

————————- SUMMARY FOR DISKGROUP DATA ————————-

Allocated AU’s: 720

Free AU’s: 3376

AU’s read for dump: 102

Block images saved: 24070

Map lines written: 102

Heartbeats seen: 1

Corrupt metadata blocks: 0

Corrupt AT blocks: 0

 

 

******************************* END OF REPORT ********************************
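A side note on the Compatibility Version line in the report (0x0b200000, printed as 11020000): the value packs the version digits, with the major release in the high byte and one nibble per remaining component. The decoding below is inferred from the values amdu and kfed print, not taken from official documentation:

```shell
# Decode a kfdhdb.compat-style version value (inferred layout):
# high byte = major release, then one nibble per component.
compat=$((0x0b200000))
ver=$(printf '%d.%d.%d.%d' $(( compat >> 24 )) $(( (compat >> 20) & 0xF )) \
      $(( (compat >> 16) & 0xF )) $(( (compat >> 12) & 0xF )))
echo "$ver"    # 11.2.0.0
```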

3. Using amdu to extract data files from an ASM disk that cannot be mounted, then opening the database

On Linux x86-64 with 11.2.0.4 RAC, we simulate an ASM disk group that cannot be mounted, extract the data files from the ASM disk with amdu, restore them outside ASM, and open the database.

 

1. Shut down the HAS clusterware stack and the database;

CRS-2793: Shutdown of Oracle High Availability Services-managed resources on ‘luda5’ has completed

CRS-4133: Oracle High Availability Services has been stopped.

[root@luda5 ~]#

1.1 Simulate ASM disk header corruption by zeroing the first 4KB block; with the header corrupted, amdu can no longer read the disk information.

[grid@luda5 ~]$ dd if=/dev/zero of=/dev/asm-diskc bs=4096 count=1

1+0 records in

1+0 records out

4096 bytes (4.1 kB) copied, 0.00422028 s, 971 kB/s

 

[grid@luda5 ~]$ kfed read /dev/asm-diskc

kfbh.endian:                          0 ; 0x000: 0x00

kfbh.hard:                            0 ; 0x001: 0x00

kfbh.type:                            0 ; 0x002: KFBTYP_INVALID

kfbh.datfmt:                          0 ; 0x003: 0x00

kfbh.block.blk:                       0 ; 0x004: blk=0

kfbh.block.obj:                       0 ; 0x008: file=0

kfbh.check:                           0 ; 0x00c: 0x00000000

kfbh.fcn.base:                        0 ; 0x010: 0x00000000

kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000

kfbh.spare1:                          0 ; 0x018: 0x00000000

kfbh.spare2:                          0 ; 0x01c: 0x00000000

7F0F828DA400 00000000 00000000 00000000 00000000  […………….]

Repeat 255 times

KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]

 

[grid@luda5 ~]$ amdu -diskstring '/dev/asm-diskc' -extract 'DATA.261'

amdu_2017_04_13_13_31_05/

AMDU-00210: No disks found in diskgroup DATA

AMDU-00210: No disks found in diskgroup DATA

 

 

1.2 Repair the corrupted ASM disk header (kfed repair can do this because ASM keeps a backup copy of the disk header elsewhere on the same disk)

[grid@luda5 ~]$ kfed repair /dev/asm-diskc

[grid@luda5 ~]$ kfed read /dev/asm-diskc

kfbh.endian:                          1 ; 0x000: 0x01

kfbh.hard:                          130 ; 0x001: 0x82

kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD

kfbh.datfmt:                          1 ; 0x003: 0x01

kfbh.block.blk:                       0 ; 0x004: blk=0

kfbh.block.obj:              2147483648 ; 0x008: disk=0

kfbh.check:                  1931960167 ; 0x00c: 0x73275f67

kfbh.fcn.base:                        0 ; 0x010: 0x00000000

kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000

kfbh.spare1:                          0 ; 0x018: 0x00000000

kfbh.spare2:                          0 ; 0x01c: 0x00000000

kfdhdb.driver.provstr:         ORCLDISK ; 0x000: length=8

kfdhdb.driver.reserved[0]:            0 ; 0x008: 0x00000000

kfdhdb.driver.reserved[1]:            0 ; 0x00c: 0x00000000

kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000

kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000

kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000

kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000

kfdhdb.compat:                186646528 ; 0x020: 0x0b200000

kfdhdb.dsknum:                        0 ; 0x024: 0x0000

kfdhdb.grptyp:                        1 ; 0x026: KFDGTP_EXTERNAL

kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER

kfdhdb.dskname:               DATA_0000 ; 0x028: length=9

kfdhdb.grpname:                    DATA ; 0x048: length=4

kfdhdb.fgname:                DATA_0000 ; 0x068: length=9

kfdhdb.capname:                         ; 0x088: length=0

kfdhdb.crestmp.hi:             33050356 ; 0x0a8: HOUR=0x14 DAYS=0x17 MNTH=0x3 YEAR=0x7e1

kfdhdb.crestmp.lo:           3469341696 ; 0x0ac: USEC=0x0 MSEC=0x27d SECS=0x2c MINS=0x33

kfdhdb.mntstmp.hi:             33051050 ; 0x0b0: HOUR=0xa DAYS=0xd MNTH=0x4 YEAR=0x7e1

kfdhdb.mntstmp.lo:           2813672448 ; 0x0b4: USEC=0x0 MSEC=0x14f SECS=0x3b MINS=0x29

kfdhdb.secsize:                     512 ; 0x0b8: 0x0200

kfdhdb.blksize:                    4096 ; 0x0ba: 0x1000

kfdhdb.ausize:                  1048576 ; 0x0bc: 0x00100000

kfdhdb.mfact:                    113792 ; 0x0c0: 0x0001bc80

kfdhdb.dsksize:                    4096 ; 0x0c4: 0x00001000

kfdhdb.pmcnt:                         2 ; 0x0c8: 0x00000002

kfdhdb.fstlocn:                       1 ; 0x0cc: 0x00000001

kfdhdb.altlocn:                       2 ; 0x0d0: 0x00000002

kfdhdb.f1b1locn:                      2 ; 0x0d4: 0x00000002

kfdhdb.redomirrors[0]:                0 ; 0x0d8: 0x0000

kfdhdb.redomirrors[1]:                0 ; 0x0da: 0x0000

kfdhdb.redomirrors[2]:                0 ; 0x0dc: 0x0000

kfdhdb.redomirrors[3]:                0 ; 0x0de: 0x0000

kfdhdb.dbcompat:              168820736 ; 0x0e0: 0x0a100000

kfdhdb.grpstmp.hi:             33050356 ; 0x0e4: HOUR=0x14 DAYS=0x17 MNTH=0x3 YEAR=0x7e1

kfdhdb.grpstmp.lo:           3469127680 ; 0x0e8: USEC=0x0 MSEC=0x1ac SECS=0x2c MINS=0x33

kfdhdb.vfstart:                       0 ; 0x0ec: 0x00000000

kfdhdb.vfend:                         0 ; 0x0f0: 0x00000000

kfdhdb.spfile:                        0 ; 0x0f4: 0x00000000

kfdhdb.spfflg:                        0 ; 0x0f8: 0x00000000

kfdhdb.ub4spare[0]:                   0 ; 0x0fc: 0x00000000

kfdhdb.ub4spare[1]:                   0 ; 0x100: 0x00000000

kfdhdb.ub4spare[2]:                   0 ; 0x104: 0x00000000

kfdhdb.ub4spare[3]:                   0 ; 0x108: 0x00000000

kfdhdb.ub4spare[4]:                   0 ; 0x10c: 0x00000000

kfdhdb.ub4spare[5]:                   0 ; 0x110: 0x00000000

kfdhdb.ub4spare[6]:                   0 ; 0x114: 0x00000000

kfdhdb.ub4spare[7]:                   0 ; 0x118: 0x00000000

kfdhdb.ub4spare[8]:                   0 ; 0x11c: 0x00000000

kfdhdb.ub4spare[9]:                   0 ; 0x120: 0x00000000

kfdhdb.ub4spare[10]:                  0 ; 0x124: 0x00000000

kfdhdb.ub4spare[11]:                  0 ; 0x128: 0x00000000

kfdhdb.ub4spare[12]:                  0 ; 0x12c: 0x00000000

kfdhdb.ub4spare[13]:                  0 ; 0x130: 0x00000000

kfdhdb.ub4spare[14]:                  0 ; 0x134: 0x00000000

kfdhdb.ub4spare[15]:                  0 ; 0x138: 0x00000000

kfdhdb.ub4spare[16]:                  0 ; 0x13c: 0x00000000

kfdhdb.ub4spare[17]:                  0 ; 0x140: 0x00000000

kfdhdb.ub4spare[18]:                  0 ; 0x144: 0x00000000

kfdhdb.ub4spare[19]:                  0 ; 0x148: 0x00000000

kfdhdb.ub4spare[20]:                  0 ; 0x14c: 0x00000000

kfdhdb.ub4spare[21]:                  0 ; 0x150: 0x00000000

kfdhdb.ub4spare[22]:                  0 ; 0x154: 0x00000000

kfdhdb.ub4spare[23]:                  0 ; 0x158: 0x00000000

kfdhdb.ub4spare[24]:                  0 ; 0x15c: 0x00000000

kfdhdb.ub4spare[25]:                  0 ; 0x160: 0x00000000

kfdhdb.ub4spare[26]:                  0 ; 0x164: 0x00000000

kfdhdb.ub4spare[27]:                  0 ; 0x168: 0x00000000

kfdhdb.ub4spare[28]:                  0 ; 0x16c: 0x00000000

kfdhdb.ub4spare[29]:                  0 ; 0x170: 0x00000000

kfdhdb.ub4spare[30]:                  0 ; 0x174: 0x00000000

kfdhdb.ub4spare[31]:                  0 ; 0x178: 0x00000000

kfdhdb.ub4spare[32]:                  0 ; 0x17c: 0x00000000

kfdhdb.ub4spare[33]:                  0 ; 0x180: 0x00000000

kfdhdb.ub4spare[34]:                  0 ; 0x184: 0x00000000

kfdhdb.ub4spare[35]:                  0 ; 0x188: 0x00000000

kfdhdb.ub4spare[36]:                  0 ; 0x18c: 0x00000000

kfdhdb.ub4spare[37]:                  0 ; 0x190: 0x00000000

kfdhdb.ub4spare[38]:                  0 ; 0x194: 0x00000000

kfdhdb.ub4spare[39]:                  0 ; 0x198: 0x00000000

kfdhdb.ub4spare[40]:                  0 ; 0x19c: 0x00000000

kfdhdb.ub4spare[41]:                  0 ; 0x1a0: 0x00000000

kfdhdb.ub4spare[42]:                  0 ; 0x1a4: 0x00000000

kfdhdb.ub4spare[43]:                  0 ; 0x1a8: 0x00000000

kfdhdb.ub4spare[44]:                  0 ; 0x1ac: 0x00000000

kfdhdb.ub4spare[45]:                  0 ; 0x1b0: 0x00000000

kfdhdb.ub4spare[46]:                  0 ; 0x1b4: 0x00000000

kfdhdb.ub4spare[47]:                  0 ; 0x1b8: 0x00000000

kfdhdb.ub4spare[48]:                  0 ; 0x1bc: 0x00000000

kfdhdb.ub4spare[49]:                  0 ; 0x1c0: 0x00000000

kfdhdb.ub4spare[50]:                  0 ; 0x1c4: 0x00000000

kfdhdb.ub4spare[51]:                  0 ; 0x1c8: 0x00000000

kfdhdb.ub4spare[52]:                  0 ; 0x1cc: 0x00000000

kfdhdb.ub4spare[53]:                  0 ; 0x1d0: 0x00000000

kfdhdb.acdb.aba.seq:                  0 ; 0x1d4: 0x00000000

kfdhdb.acdb.aba.blk:                  0 ; 0x1d8: 0x00000000

kfdhdb.acdb.ents:                     0 ; 0x1dc: 0x0000

kfdhdb.acdb.ub2spare:                 0 ; 0x1de: 0x0000
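As an aside, the crestmp/mntstmp/grpstmp fields in the header above pack their timestamps into bit fields: hi holds year/month/day/hour and lo holds minutes/seconds/milliseconds/microseconds. The field widths below are inferred from the annotations kfed prints, not taken from official documentation:

```shell
# Decode kfdhdb.crestmp.hi/lo (values taken from the kfed output above).
hi=33050356; lo=3469341696
year=$((   hi >> 14 ))           # 2017
month=$(( (hi >> 10) & 0xF ))    # 3
day=$((   (hi >> 5)  & 0x1F ))   # 23
hour=$((   hi        & 0x1F ))   # 20
mins=$((   lo >> 26 ))           # 51
secs=$((  (lo >> 20) & 0x3F ))   # 44
msec=$((  (lo >> 10) & 0x3FF ))  # 637
ts=$(printf '%d/%02d/%02d %02d:%02d:%02d.%03d' \
     $year $month $day $hour $mins $secs $msec)
echo "$ts"
```

This reproduces the Disk Creation Time shown earlier, 2017/03/23 20:51:44.637000.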

########################

2. Restore the database control file and identify the data files

With the disk header repaired, point amdu at the ASM disks and extract the control file by its ASM file number.

 

2.1 Locate the control file

First, find the parameter file contents from the database's previous startup in the alert log:

Starting up:

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 – 64bit Production

With the Partitioning, Real Application Clusters, Oracle Label Security, OLAP,

Data Mining, Oracle Database Vault and Real Application Testing options.

ORACLE_HOME = /u01/app/oracle/product/11.2.0/dbhome_1

System name:    Linux

Node name:      luda5.luda.com

Release:        3.8.13-16.2.1.el6uek.x86_64

Version:        #1 SMP Thu Nov 7 17:01:44 PST 2013

Machine:        x86_64

Using parameter settings in server-side pfile /u01/app/oracle/product/11.2.0/dbhome_1/dbs/initludarac1.ora

System parameters with non-default values:

processes                = 300

sessions                 = 472

spfile                   = "+DATA/ludarac/spfileludarac.ora"

sga_target               = 600M

control_files            = "+DATA/ludarac/controlfile/current.261.939458967"

control_files            = "+DATA/ludarac/controlfile/current.260.939458967"

 

2.2 Restore the control file with amdu

The listing above shows that the control file is mirrored: there are two copies.

Here we extract file number 261 with the following command.

Note: the -diskstring '/dev/asm-diskc' argument can also be written as -diskstring '/dev/asm-disk*' and the like, i.e. the same value as the asm_diskstring parameter.

[grid@luda5 ~]$ amdu -diskstring '/dev/asm-diskc' -extract 'DATA.261'

amdu_2017_04_13_13_32_02/

[grid@luda5 ~]$ cd amdu_2017_04_13_13_32_02/

[grid@luda5 amdu_2017_04_13_13_32_02]$ ls

DATA_261.f  report.txt

[grid@luda5 amdu_2017_04_13_13_32_02]$ mv DATA_261.f /u01/app/oracle/oradata/

Once extracted, the file is moved to /u01/app/oracle/oradata/, the directory newly designated for the Oracle data files.

 

 

2.3 Identify the data files from the control file

There are two methods:

1. Start the database to MOUNT and query v$dbfile;

Note that in a RAC environment the clusterware must be up before a database instance can start; otherwise you get the following error:

[oracle@luda5 ~]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on Thu Apr 13 13:39:42 2017

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to an idle instance.

 

SQL> startup nomount pfile='/home/oracle/restore.ora';

ORA-29702: error occurred in Cluster Group Service operation

So, after starting the GI cluster, we take the parameter values recorded in the alert log at the previous database startup and build the following parameter file:

[oracle@luda5 ~]$ cat restore.ora

processes                = 300

sessions                 = 472

sga_target               = 600M

control_files            = "/u01/app/oracle/oradata/DATA_261.f"

db_block_size            = 8192

compatible               = "11.2.0.4.0"

log_archive_format       = "%t_%s_%r.dbf"

db_recovery_file_dest_size= 4407M

undo_tablespace          = "UNDOTBS1"

remote_login_passwordfile= "EXCLUSIVE"

audit_file_dest          = "/u01/app/oracle/admin/ludarac/adump"

audit_trail              = "DB"

db_name                  = "ludarac"

open_cursors             = 300

pga_aggregate_target     = 100M

diagnostic_dest          = "/u01/app/oracle"

[oracle@luda5 ~]$ sqlplus / as sysdba

 

SQL*Plus: Release 11.2.0.4.0 Production on Thu Apr 13 14:19:23 2017

 

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

 

Connected to an idle instance.

 

SQL> startup nomount pfile='/home/oracle/recover2.ora'

ORACLE instance started.

 

Total System Global Area  626327552 bytes

Fixed Size                  2255832 bytes

Variable Size             339739688 bytes

Database Buffers          281018368 bytes

Redo Buffers                3313664 bytes

SQL> show parameter cont

 

NAME                                 TYPE        VALUE

———————————— ———– ——————————

control_file_record_keep_time        integer     7

control_files                        string      /u01/app/oracle/oradata/DATA_2

61.f

control_management_pack_access       string      DIAGNOSTIC+TUNING

global_context_pool_size             string

SQL> alter database mount;

alter database mount

*

ERROR at line 1:

ORA-00221: error on write to control file

ORA-00206: error in writing (block 1, # blocks 1) of control file

ORA-00202: control file: ‘/u01/app/oracle/oradata/DATA_261.f’

ORA-27041: unable to open file

Linux-x86_64 Error: 13: Permission denied

Additional information: 3

 

Fix the ownership:

[oracle@luda5 ~]$ ls -al /u01/app/oracle/oradata/DATA_261.f

-rw-r--r-- 1 grid oinstall 18497536 Apr 13 13:32 /u01/app/oracle/oradata/DATA_261.f

 

[root@luda5 ~]# cd /u01/app/oracle/oradata/

[root@luda5 oradata]# chown -R oracle:oinstall DATA_2*

Re-issue the database MOUNT command:

SQL> alter database mount;

 

Database altered.

Query the data file information:

SQL> col name for a60

SQL> select * from v$dbfile;

 

FILE# NAME

———- ————————————————————

4 +DATA/ludarac/datafile/users.259.939458821

3 +DATA/ludarac/datafile/undotbs1.258.939458821

2 +DATA/ludarac/datafile/sysaux.257.939458821

1 +DATA/ludarac/datafile/system.256.939458821

5 +DATA/ludarac/datafile/undotbs2.267.939459205

 

 

2. Read the control file with the strings command to find the data file names:

[oracle@luda5 oradata]$ cp DATA_261.f DATA_261.fbak

[oracle@luda5 oradata]$ strings DATA_261.fbak

... the output includes the following useful entries:

+DATA/ludarac/datafile/users.259.939458821

+DATA/ludarac/datafile/undotbs1.258.939458821

+DATA/ludarac/datafile/sysaux.257.939458821

+DATA/ludarac/datafile/system.256.939458821

+DATA/ludarac/tempfile/temp.266.939458999

+DATA/ludarac/datafile/undotbs2.267.939459205

...

 

 

3. Restore the data files

Using the data file information obtained from the control file in the previous step, read each data file directly off the disk with amdu:

The extraction argument has the format dg_name.asmfile_id.

Reading the ASM disk header with kfed (field kfdhdb.grpname) tells you which disk group a disk belongs to, so the correct disk path can be used;

and the data file names recorded in the control file contain both the disk group name and the file's FILE_ID within that disk group.

3.1 Extract the data files with amdu

 

[grid@luda5 ~]$ amdu -diskstring '/dev/asm-diskc' -extract 'DATA.256'

amdu_2017_04_13_13_48_58/

[grid@luda5 ~]$ cd amdu_2017_04_13_13_48_58/

[grid@luda5 amdu_2017_04_13_13_48_58]$ ls

DATA_256.f  report.txt

[grid@luda5 amdu_2017_04_13_13_48_58]$ du -sh ./*

751M    ./DATA_256.f

4.0K    ./report.txt

[grid@luda5 amdu_2017_04_13_13_48_58]$ mv DATA_256.f /u01/app/oracle/oradata/

[grid@luda5 amdu_2017_04_13_13_48_58]$ cd

[grid@luda5 ~]$ amdu -diskstring '/dev/asm-diskc' -extract 'DATA.257'

amdu_2017_04_13_13_51_22/

[grid@luda5 ~]$ amdu -diskstring '/dev/asm-diskc' -extract 'DATA.258'

amdu_2017_04_13_13_53_45/

[grid@luda5 ~]$ amdu -diskstring '/dev/asm-diskc' -extract 'DATA.259'

amdu_2017_04_13_13_55_45/

[grid@luda5 ~]$ amdu -diskstring '/dev/asm-diskc' -extract 'DATA.267'

amdu_2017_04_13_13_56_20/

 

3.2 Move the data files and fix their permissions

Move the data files written by amdu to the restore directory /u01/app/oracle/oradata/ and change their ownership as follows:

[root@luda5 oradata]# chown -R oracle:oinstall DATA_2*

[root@luda5 oradata]# ls -al

total 1444184

drwxrwxr-x  2 oracle oinstall      4096 Apr 13 13:58 .

drwxrwxr-x 12 oracle oinstall      4096 Apr 13 13:35 ..

-rw-r--r--  1 oracle oinstall 786440192 Apr 13 13:50 DATA_256.f

-rw-r--r--  1 oracle oinstall 545267712 Apr 13 13:52 DATA_257.f

-rw-r--r--  1 oracle oinstall  78651392 Apr 13 13:53 DATA_258.f

-rw-r--r--  1 oracle oinstall   5251072 Apr 13 13:55 DATA_259.f

-rw-r--r--  1 oracle oinstall  18497536 Apr 13 13:32 DATA_261.f

-rw-r--r--  1 oracle oinstall  18497536 Apr 13 13:45 DATA_261.fbak

-rw-r--r--  1 oracle oinstall  26222592 Apr 13 13:56 DATA_267.f

 

########################

4. OPEN the database

4.1 Update the data file locations

First, update the data file paths recorded in the control file; v$dbfile gives the old-to-new mapping:

alter database rename file '+DATA/ludarac/datafile/users.259.939458821'    to '/u01/app/oracle/oradata/DATA_259.f';

alter database rename file '+DATA/ludarac/datafile/undotbs1.258.939458821' to '/u01/app/oracle/oradata/DATA_258.f';

alter database rename file '+DATA/ludarac/datafile/sysaux.257.939458821'   to '/u01/app/oracle/oradata/DATA_257.f';

alter database rename file '+DATA/ludarac/datafile/system.256.939458821'   to '/u01/app/oracle/oradata/DATA_256.f';

alter database rename file '+DATA/ludarac/datafile/undotbs2.267.939459205' to '/u01/app/oracle/oradata/DATA_267.f';
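Since the rename statements follow a fixed pattern, they can also be generated from the v$dbfile names rather than typed by hand. A sketch: DEST is the restore directory used above, and the ASM file number is the second-to-last dot-separated field of each name:

```shell
# Generate 'alter database rename file' statements from ASM file names.
DEST=/u01/app/oracle/oradata
gen_renames() {
  while IFS= read -r name; do
    fid=${name%.*}      # drop the incarnation: +DATA/.../users.259
    fid=${fid##*.}      # keep the ASM file number: 259
    printf "alter database rename file '%s' to '%s/DATA_%s.f';\n" \
           "$name" "$DEST" "$fid"
  done
}
printf '%s\n' '+DATA/ludarac/datafile/users.259.939458821' | gen_renames
```

Feed it the full v$dbfile listing and spool the output into a script.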

Verification:

SQL> select * from v$dbfile order by 1;

 

FILE# NAME

———- ————————————————————

1 /u01/app/oracle/oradata/DATA_256.f

2 /u01/app/oracle/oradata/DATA_257.f

3 /u01/app/oracle/oradata/DATA_258.f

4 /u01/app/oracle/oradata/DATA_259.f

5 /u01/app/oracle/oradata/DATA_267.f

 

Check whether the restored data files will need recovery at OPEN time; this depends on whether the database was shut down cleanly beforehand:

SQL> select ctl.FILE#,ctl.CHECKPOINT_CHANGE# as "ctl-CHECKPOINT_CHANGE#",ctl.LAST_CHANGE# as "ctl-LAST_CHANGE#",

2  dbf.CHECKPOINT_CHANGE# as "dbf-CHECKPOINT_CHANGE#",dbf.status,dbf.fuzzy from v$datafile ctl,v$datafile_header dbf where ctl.file#=dbf.file#;

 

 

FILE# ctl-CHECKPOINT_CHANGE# ctl-LAST_CHANGE# dbf-CHECKPOINT_CHANGE# STATUS  FUZ

———- ———————- —————- ———————- ——- —

1                1153651                                 1153651 ONLINE  YES

2                1153651                                 1153651 ONLINE  YES

3                1153651                                 1153651 ONLINE  YES

4                1153651                                 1153651 ONLINE  YES

5                1153651                                 1153651 ONLINE  YES

The fuzzy flag in the data file headers (v$datafile_header.fuzzy = YES) shows that the database was shut down abnormally (e.g. shutdown abort, a killed critical process, or power loss). The data file header checkpoints are all consistent, so instance recovery is what is required.

 

4.2 Re-create the control file

We do not simulate extracting the redo log files from the disks with amdu here; instead we demonstrate how to open the database with incomplete recovery when no redo logs are available.

So we re-create the control file with the RESETLOGS option and then open the database:

-- after the resetlogs open you still need to create a temporary tablespace and so on; that is not shown here.

SQL>  startup mount pfile='/home/oracle/restore.ora';

ORACLE instance started.

 

Total System Global Area  626327552 bytes

Fixed Size                  2255832 bytes

Variable Size             184550440 bytes

Database Buffers          436207616 bytes

Redo Buffers                3313664 bytes

Database mounted.

 

SQL> alter database backup controlfile to trace as '/home/oracle/ctl.bak';

 

Database altered.

[oracle@luda5 ~]$ sqlplus / as sysdba

 

SQL*Plus: Release 11.2.0.4.0 Production on Thu Apr 13 15:02:35 2017

 

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

 

Connected to an idle instance.

 

SQL>  startup nomount pfile='/home/oracle/restore.ora';

ORACLE instance started.

 

Total System Global Area  626327552 bytes

Fixed Size                  2255832 bytes

Variable Size             184550440 bytes

Database Buffers          436207616 bytes

Redo Buffers                3313664 bytes

SQL> CREATE CONTROLFILE REUSE DATABASE "ludaRAC" RESETLOGS  ARCHIVELOG

2      MAXLOGFILES 192

3      MAXLOGMEMBERS 3

4      MAXDATAFILES 1024

5      MAXINSTANCES 32

6      MAXLOGHISTORY 292

7  LOGFILE

8    GROUP 1 (

9      '/u01/app/oracle/oradata/red01.log'

10    ) SIZE 50M BLOCKSIZE 512,

11    GROUP 2 (

12      '/u01/app/oracle/oradata/red02.log'

13    ) SIZE 50M BLOCKSIZE 512

14  -- STANDBY LOGFILE

15  DATAFILE

16    '/u01/app/oracle/oradata/DATA_256.f',

17    '/u01/app/oracle/oradata/DATA_257.f',

18    '/u01/app/oracle/oradata/DATA_258.f',

19    '/u01/app/oracle/oradata/DATA_259.f',

20    '/u01/app/oracle/oradata/DATA_267.f'

21  CHARACTER SET AL32UTF8

22  ;

 

Control file created.

 

 

4.3 Open the database with OPEN RESETLOGS

SQL> alter database open resetlogs;

 

Database altered.

 

 

SQL> select open_mode from v$database;

 

OPEN_MODE

——————–

READ WRITE

 

The alert log during this phase:

Thu Apr 13 15:03:35 2017

alter database open resetlogs

RESETLOGS after incomplete recovery UNTIL CHANGE 1181942

...

Thu Apr 13 15:03:38 2017

Checker run found 1 new persistent data failures

Clearing online redo logfile 2 complete

Online log /u01/app/oracle/oradata/red01.log: Thread 1 Group 1 was previously cleared

Online log /u01/app/oracle/oradata/red02.log: Thread 1 Group 2 was previously cleared

Thu Apr 13 15:03:41 2017

Setting recovery target incarnation to 2

Initializing SCN for created control file

Database SCN compatibility initialized to 1

Thu Apr 13 15:03:41 2017

Assigning activation ID 2696147166 (0xa0b3f0de)

LGWR: STARTING ARCH PROCESSES

Thu Apr 13 15:03:41 2017

ARC0 started with pid=24, OS id=26972

ARC0: Archival started

LGWR: STARTING ARCH PROCESSES COMPLETE

ARC0: STARTING ARCH PROCESSES

Thu Apr 13 15:03:41 2017

ARC1 started with pid=25, OS id=26974

Thread 1 opened at log sequence 1

Current log# 1 seq# 1 mem# 0: /u01/app/oracle/oradata/red01.log

Successful open of redo thread 1

MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set

Thu Apr 13 15:03:41 2017

SMON: enabling cache recovery

Thu Apr 13 15:03:41 2017

ARC2 started with pid=26, OS id=26976

ARC1: Archival started

ARC2: Archival started

ARC1: Becoming the ‘no FAL’ ARCH

ARC1: Becoming the ‘no SRL’ ARCH

ARC2: Becoming the heartbeat ARCH

Thu Apr 13 15:03:41 2017

ARC3 started with pid=27, OS id=26978

ARC3: Archival started

ARC0: STARTING ARCH PROCESSES COMPLETE

[26786] Successfully onlined Undo Tablespace 2.

Undo initialization finished serial:0 start:18610244 end:18611614 diff:1370 (13 seconds)

Dictionary check beginning

Tablespace ‘TEMP’ #3 found in data dictionary,

but not in the controlfile. Adding to controlfile.

Dictionary check complete

Verifying file header compatibility for 11g tablespace encryption..

Verifying 11g file header compatibility for tablespace encryption completed

SMON: enabling tx recovery

Database Characterset is AL32UTF8

No Resource Manager plan active

Thu Apr 13 15:03:46 2017

replication_dependency_tracking turned off (no async multimaster replication found)

Starting background process QMNC

Thu Apr 13 15:03:48 2017

QMNC started with pid=28, OS id=26989

LOGSTDBY: Validating controlfile with logical metadata

LOGSTDBY: Validation complete

Thu Apr 13 15:04:02 2017

Completed: alter database open resetlogs

An introduction to the Oracle amdu tool and extracting data files from ASM disks

How SCNs are synchronized between database nodes in RAC

Starting with 10.2, the default mechanism is BOC (broadcast-on-commit): the SCN is propagated only when it changes on some node (for example, on commit).
BOC propagates the SCN in one of two modes: direct and indirect.

1. A user commits a transaction (COMMIT); LGWR starts writing the redo, and the user session waits on the log file sync event.
2. LGWR sends the commit SCN to the other nodes in one of two ways:
   - LGWR hands the transaction's commit SCN to a local LMS process (the indirect mode; when there are multiple LMS processes, LGWR picks one by hash, for load balancing);
   - LGWR sends the transaction's commit SCN directly to the LMS processes of the remote nodes (the direct mode; again a hash decides which remote LMS process is used).
3. LGWR writes the transaction's redo entries to the redo log file (this happens concurrently with step 2).
4. The remote nodes receive and acknowledge the commit SCN:
   - direct mode: the remote LMS processes receive the SCN and reply to the local LMS process;
   - indirect mode: the local LMS process sends the SCN to all remote nodes; their LMS processes receive it and reply to the local LMS process.
5. The I/O for LGWR's redo write completes and LGWR is notified (the relative order of steps 4 and 5 depends on network and I/O speed and can vary).
6. The local LMS process notifies LGWR once all remote nodes have acknowledged the commit SCN;
7. LGWR posts the user process that the transaction is complete.

#################
Step 1: the test statements
SQL> alter system set events='trace[GCS_BSCN] disk=low';  -- run on both nodes

System altered.

insert into t2 select dbms_flashback.get_system_change_number from dual;
commit;

SQL> select * from t2;

ID
----------
1536519
1536528
1538467
1538851
1540593

SQL> select to_number('1781f2','xxxxxxxxx') from dual;

TO_NUMBER('1781F2','XXXXXXXXX')
-------------------------------
1540594
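The trace entries that follow print each SCN in hex as wrap.base (e.g. 0x0.1781f2); outside SQL*Plus the base can be converted in the shell:

```shell
# Convert the hex SCN base from a GCS_BSCN trace entry to decimal.
scn_hex=1781f2
scn_dec=$(printf '%d' "0x$scn_hex")
echo "$scn_dec"    # 1540594, the same value to_number() returned above
```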
2. The LGWR trace file on node 1:
1.lgwr
*** 2016-12-17 20:42:03.068
2016-12-17 20:42:03.068586 : kjbbcastscn[0x0.1781f2][2][4] -- direct broadcast of the SCN; 1781f2 is 1540594 in decimal, matching the value from the test above.
2016-12-17 20:42:03.068650 : kjbsentscn[0x0.1781f3][to 2]
2016-12-17 20:42:03.068785 : kjbmscscn(to 2.1)(nrcvr 1)(ack 0)(scn 0x0.1781f2)

3. The LMS trace file on node 2:
2.lms
*** 2016-12-17 20:42:03.069
2016-12-17 20:42:03.069106 : kjbrcvdscn[0x0.1781f3][from 1][idx 2016-12-17 20:42:03.069265 : kjbrcvdscn[no bscn <= rscn 0x0.1781f3][from 1]
2016-12-17 20:42:03.069282 : kjbmpscn(nnodes 2)(from 1)(scn 0x0.1781f2) -- received SCN 1781f2 from node 1
2016-12-17 20:42:03.069327 : kjbmscscn(to 1.1)(nrcvr 1)(ack 1)(scn 0x0.1781f2) -- and acknowledged the message (ACK)
2016-12-17 20:42:03.069370 : kjbsentscn[0x0.1781f3][to 1]

4. The LMS trace file on node 1 (in indirect mode the LMS trace also contains kjbsendscn entries):

*** 2016-12-17 20:42:03.069
2016-12-17 20:42:03.069810 : kjbrcvdscn[0x0.1781f3][from 2][idx 2016-12-17 20:42:03.069872 : kjbrcvdscn[no bscn <= rscn 0x0.1781f3][from 2]
2016-12-17 20:42:03.069886 : kjbmpscnack(nnodes 2)(from 2)(scn 0x0.1781f2) -- received node 2's ACK for SCN 1781f2

*** 2016-12-17 20:42:12.459

5. The LGWR trace file on node 1:
*** 2016-12-17 20:42:03.154
2016-12-17 20:42:03.154085 : kjbmaxackscn[mscn 0x0.1781f3][nnodes 2][pend -2115384694] -- presumably: node 2's ACK for the SCN has been received
2016-12-17 20:42:03.154140 : kcrfw_post[0x0.1781f2][slot 0][mscn 0x0.1781f3][real 1] -- the LGWR write is complete
2016-12-17 20:42:03.154161 : kcrfw_post[lgwr post][lscn 0x0.1781f3][oods 0x0.1781f1][nods 0x0.1781f3]

-------------------------------------
1.lgwr -- in this test run the LGWR trace does not show the remote node's acknowledgment being received; note the scn not acked entry:
*** 2016-12-17 20:08:00.032
2016-12-17 20:08:00.032857 : kjbbcastscn[0x0.177b24][2][4]
2016-12-17 20:08:00.032916 : kjbsentscn[0x0.177b25][to 2]
2016-12-17 20:08:00.033108 : kjbmscscn(to 2.1)(nrcvr 1)(ack 0)(scn 0x0.177b24)
2016-12-17 20:08:00.033644 : kjbmaxackscn[mscn 0x0.177b23][nnodes 2][pend 1]
2016-12-17 20:08:00.033665 : kcrfw_post[0x0.177b24][slot 0][mscn 0x0.177b23][real 1]
2016-12-17 20:08:00.033674 : kcrfw_post[scn not acked][lscn 0x0.177b25][oods 0x0.177b23][nods 0x0.177b24] -- here it says scn not acked

2. 2.lms (LMS on node 2)
*** 2016-12-17 20:08:00.033
2016-12-17 20:08:00.033787 : kjbmpscn(nnodes 2)(from 1)(scn 0x0.177b24)
2016-12-17 20:08:00.033813 : kjbmscscn(to 1.1)(nrcvr 1)(ack 1)(scn 0x0.177b24)
2016-12-17 20:08:00.033839 : kjbsentscn[0x0.177b25][to 1]

3. 1.lms (LMS on node 1)
*** 2016-12-17 20:08:00.034
2016-12-17 20:08:00.034879 : kjbrcvdscn[0x0.177b25][from 2][idx 2016-12-17 20:08:00.034933 : kjbmaxackscn[mscn 0x0.177b25][nnodes 2][pend 1]
2016-12-17 20:08:00.034941 : kjbrcvdscn[old ods 0x0.177b24][new 0x0.177b25][mscn 0x0.177b25]
2016-12-17 20:08:00.034962 : kjb_post_fg[kjbrcvdscn][scn 0x0.177b24][wait time 1319us]
2016-12-17 20:08:00.034971 : kjbmpscnack(nnodes 2)(from 2)(scn 0x0.177b24)

4. 1.lgwr (LGWR on node 1)
2016-12-17 20:08:00.033665 : kcrfw_post[0x0.177b24][slot 0][mscn 0x0.177b23][real 1]
2016-12-17 20:08:00.033674 : kcrfw_post[scn not acked][lscn 0x0.177b25][oods 0x0.177b23][nods 0x0.177b24]

An introduction to and test of SCN synchronization between nodes in an Oracle RAC environment

1. An introduction to the Hang Manager (HM) feature

 

The Hang Manager (HM) is a new feature of Oracle 11g; HM exists only in RAC databases.

When diagnosing database problems, we often run into databases or processes that are hung. Hangs generally have one of two common causes:

A deadlock (cycle): unless the cycle is broken, this kind of hang persists forever.

A blocker: a process holds certain resources and blocks other processes. Depending on the blocking relationship, blockers can be classified as immediate blockers and root blockers.

A root blocker is normally in one of two states:

  1. The root blocker is idle; in this case, terminating the process resolves the problem.

  2. The root blocker is waiting on a resource unrelated to the database (for example, waiting on I/O); terminating the process may resolve the problem, but from the database's point of view this is already beyond its scope.

From the database's own perspective, Oracle has several deadlock detection mechanisms. In this article we introduce the 11g RAC Hang Manager.

The basic workflow of the Hang Manager is:

1. Allocate a chunk of memory to hold hang analyze dump information.

2. Periodically collect hang analyze dumps (both local and global).

  3. Analyze the collected dump information and determine whether a hang exists in the system.
  4. Use the results of the analysis to resolve the hang.

Next, each step in detail.

Step 1: Oracle allocates a memory area, called the hang analysis cache, to store the collected hang analyze dump information. This area exists in the database instance on every node.

Step 2: Oracle collects hang analyze information periodically. Since HM targets RAC databases, the hang analyze levels include both local and global. The background process that collects these dumps is DIA0 (introduced in 11g). By default, a local-level hang analyze dump is collected every 3 seconds and a global-level dump every 10 seconds.

Step 3: Because every node collects hang analyze dumps, every instance has its own DIA0 process performing the local hang analysis. In a RAC database, however, many hangs involve processes on multiple instances, so the DIA0 process of one instance acts as the master and analyzes the information collected from all instances. In 11g, the DIA0 of the instance with the lowest node number becomes the HM master; after an instance-level reconfiguration, a new master DIA0 is elected among the surviving instances.

 

HM detects hangs with the following mechanism: after analyzing several hang analyze dumps (one analysis every 30 seconds, with at least three analyses), it may find a set of processes with wait-for relationships among them (an open chain) that have not changed over that period (for example, all waiting on the same wait events); those processes are then suspected of being hung. If further verification confirms the wait-for relationships, HM locates the root blocker of the chain and tries to resolve the hang by terminating it. For a deadlock, the approach is instead to terminate one of the processes in the wait cycle. The following figure illustrates this basic logic.

 

 

Step 4: Once the hang is confirmed, a resolution is chosen based on the type of hang. HM cannot resolve a hang if the processes involved satisfy any of the following conditions:

  1. Processes from layers other than the database are involved in the hang, for example processes of the ASM instance.
  2. The hang is caused at the user application level, for example a TX lock.
  3. Parallel query.
  4. Manual intervention is required, for example the blocker is waiting on "log file switch" (often caused by a full file system at the archive destination; even if HM identifies the blocker, the hang cannot be resolved by killing it).

If the hang is of a type HM cannot resolve, HM keeps tracking the problem.
For hangs HM can resolve, the resolution is to terminate the root blocker. But if that blocker is one of Oracle's critical background processes, terminating it would crash the instance, so HM operates within a resolution scope, controlled by the hidden parameter "_hang_resolution_scope". The parameter takes three values: off (HM does not resolve hangs at all), process (the default in 11.2.0.4; HM may terminate the blocker as long as it is not a critical background process), and instance (HM may terminate the blocker even if it is a critical background process, which brings the instance down).

 

 

 

The hang detection process

1. DETECT phase

The database allocates a region of memory to store hanganalyze dump information; this region exists in the database instance on every node. In this phase, all local sessions of the instance are scanned for sessions that may be hung. Each scan is called a snap. By default 3 snaps are kept, taken 32 seconds apart; the interval is controlled by the hidden parameter "_hang_detection_interval". Once one or more sessions appear in all 3 snaps (such sessions are considered possibly hung), a REQHM request is sent to the master DIA0 process and HM enters the hang analysis phase. If no hung sessions are detected, HM remains in this phase, apart from periodically entering the HAONLY phase.

2. Hang Analysis phase

After hung sessions are detected, the DIA0 process that issued the REQHM request sends all detected session information to the master DIA0 process. A global hang analysis is started and wait-for graphs (WFGs) are built; this may span nodes in order to find local or remote blockers. When this phase completes, HM moves on to the analysis phase.

The background process that collects these dumps is DIA0 (introduced in 11g). By default, local-level hanganalyze dumps are collected every 3 seconds and global-level dumps every 10 seconds. Every instance has its own DIA0 process responsible for local hang analysis, but in a RAC database many hangs involve processes from multiple instances, so one instance's DIA0 process acts as the master and analyzes the information collected from all instances. In 11g, the DIA0 process of the instance with the lowest node number becomes the HM master; after an instance-level reconfiguration, the master is re-elected among the surviving instances.

 

3. ANALYZE phase

The root waiter and immediate waiter session information is matched against the Hang Signature Cache (HSC). If a matching record is found, its last-seen time and count are updated; otherwise a new hang record is created. After this phase, HM returns to the detection phase. The verification phase is controlled by separate parameters.
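The HSC bookkeeping just described (hit: bump count and timestamp; miss: insert a record) can be sketched like this (hypothetical illustration, not Oracle's data structure):

```python
# Hypothetical sketch of Hang Signature Cache (HSC) bookkeeping: match a hang's
# signature against the cache, update count/last-seen on a hit, insert on a miss.

hsc = {}   # signature -> {"count": n, "last_seen": timestamp}

def record_hang(root_waiter, immediate_waiter, event, now):
    sig = (root_waiter, immediate_waiter, event)
    entry = hsc.get(sig)
    if entry:                     # known signature: bump count, refresh timestamp
        entry["count"] += 1
        entry["last_seen"] = now
    else:                         # unseen signature: create a new hang record
        hsc[sig] = entry = {"count": 1, "last_seen": now}
    return entry["count"]

print(record_hang(3, 2, "enq: TX - row lock contention", now=100.0))  # 1: new record
print(record_hang(3, 2, "enq: TX - row lock contention", now=130.0))  # 2: matched, updated
```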

4. VERIFY phase

The node hosting the suspected sessions checks whether they are still hung and sends the result to the master DIA0 process. If the sessions are still waiting, the hang is confirmed and HM enters the victim phase.

5. VICTIM phase

If the hidden parameter "_HANG_RESOLUTION_SCOPE" is set to process, HM terminates the victim session; if terminating the session fails, it terminates the process.

If "_HANG_RESOLUTION_SCOPE" is set to instance and the victim is a critical background process, the instance is killed.

 

2. HM-related parameters

 

_hang_resolution=TRUE or FALSE. Controls whether HM resolves hangs.

_hang_resolution_scope=OFF, PROCESS or INSTANCE. Controls the scope of HM's hang resolution.

_hang_detection=<number>. The interval at which HM detects hangs; the default is 30 (seconds).

An introduction to the Oracle database Hang Manager (HM) feature

Background:

On a cloud platform, a 12.2 RAC installation failed because, for network reasons, the HAIP addresses (169.254.*.*) on the private interconnect could not communicate between the two hosts. As a result, only one node of the RAC ASM/DB could start, with the typical error PMON ...: terminating the instance due to error 481.

Resolution:

1. Ask the cloud vendor to open up HAIP (169.254.*.*) traffic in the virtualization backend; this was never resolved.
2. Configure the ASM/DB instances not to use HAIP, reverting to the pre-11.2 interconnect address model (HAIP remains enabled at the cluster level and the 169.254.*.* virtual IPs still appear in ifconfig; only the ASM/DB instances are set not to use them).
3. For problems caused by HAIP failures, see the MOS note:
ASM on Non-First Node (Second or Others) Fails to Start: PMON (ospid: nnnn): terminating the instance due to error 481 (Doc ID 1383737.1)
To disable HAIP, see the Disable HAIP section of HOWTO: Remove/Disable HAIP on Exadata (Doc ID 2524069.1).

The official method to disable HAIP:

Disable the HAIP resource and the ora.asm dependencies on it:
crsctl modify res ora.cluster_interconnect.haip -attr "ENABLED=0" -init
crsctl modify res ora.asm -attr "START_DEPENDENCIES='hard(ora.cssd,ora.ctssd)pullup(ora.cssd,ora.ctssd)weak(ora.drivers.acfs)'" -init
crsctl modify res ora.asm -attr "STOP_DEPENDENCIES='hard(intermediate:ora.cssd)'" -init
Then restart the cluster.

===
Check the status:
crsctl stat res ora.cluster_interconnect.haip -init
crsctl start res ora.cluster_interconnect.haip -init

#############################
To restore HAIP, then restart the cluster:
crsctl modify res ora.cluster_interconnect.haip -attr "ENABLED=1" -init

Oracle's recommended procedure:

1. Run "crsctl stop crs" on all nodes to stop the CRS stack.

2. On one node, run the following commands to disable HAIP:
crsctl start crs -excl -nocrs
crsctl stop res ora.asm -init
crsctl modify res ora.cluster_interconnect.haip -attr "ENABLED=0" -init
crsctl modify res ora.asm -attr "START_DEPENDENCIES='hard(ora.cssd,ora.ctssd)pullup(ora.cssd,ora.ctssd)weak(ora.drivers.acfs)',STOP_DEPENDENCIES='hard(intermediate:ora.cssd)'" -init
crsctl stop crs

3. Repeat step 2 on the other nodes.

4. Run "crsctl start crs" on all nodes to restart the CRS stack, then test further.

 

The simplest fix, confirmed by testing

There is no need to disable the HAIP feature; simply set the cluster_interconnects parameter of the ASM/DB instances to the local private interconnect IP.
Steps:
DB:
SQL> alter system set cluster_interconnects='10.100.19.18' scope=spfile sid='bdcsq1';
SQL> alter system set cluster_interconnects='10.100.19.20' scope=spfile sid='bdcsq2';
ASM:
SQL> alter system set cluster_interconnects='10.100.19.18' scope=spfile sid='+ASM1';
SQL> alter system set cluster_interconnects='10.100.19.20' scope=spfile sid='+ASM2';

Check the cluster_interconnects used at startup in the ASM and DB alert logs:
The startup log shows the interconnect in use right after the parameters are read, for example:
2021-11-13T11:34:06.408938+08:00
Cluster Communication is configured to use IPs from: GPnP
IP: 10.100.19.18 Subnet: 10.100.19.0 ===>>> HAIP not used
KSIPC Loopback IP addresses(OSD):
127.0.0.1
KSIPC Available Transports: UDP:TCP
KSIPC: Client: KCL Transport: NONE
KSIPC: Client: DLM Transport: NONE

……………………
NOTE: remote asm mode is remote (mode 0x2; from cluster type)
2021-11-11T09:26:08.588753-05:00
Cluster Communication is configured to use IPs from: GPnP
IP: 169.254.253.252 Subnet: 169.254.0.0 ===>>> HAIP in use
KSIPC Loopback IP addresses(OSD):
127.0.0.1
KSIPC Available Transports: UDP:TCP
KSIPC: Client: KCL Transport: UDP
KSIPC: Client: DLM Transport: UDP
KSIPC CAPABILITIES :IPCLW:GRPAM:TOPO:DLL
KSXP: ksxpsg_ipclwtrans: 2 UDP
cluster interconnect IPC version: [IPCLW over UDP(mode 3) ]
IPC Vendor 1 proto 2
Oracle instance running with ODM: Oracle Direct NFS ODM Library Version 4.0

Handling a cluster problem caused by HAIP when running Oracle RAC in a cloud environment

Symptoms and the VNCR feature

In 12.2 RAC, only one node could register with the SCAN listener (i.e. only the node the SCAN listener was running on could register; the remote node could not).
After some analysis, the cause turned out to be the VNCR feature; adding the invitednodes attribute to the SCAN listener fixes it:
srvctl modify scan_listener -update -invitednodes "test01,test02"

See the MOS note How to Enable VNCR on RAC Database to Register only Local Instances (Doc ID 1914282.1); the VNCR feature is described there as follows:

On 11.2.0.4 RAC databases, the parameter VALID_NODE_CHECKING_REGISTRATION_listener_name is set to off.

However, sometimes this allows other instances in the same subnet to register against these listeners. We want to prevent that and allow only local instances to that RAC database to be registered with these listeners.

Version 12.1.0.2 Change to VNCR

On 12.1 RAC databases, the parameter VALID_NODE_CHECKING_REGISTRATION_listener_name for both local and scan listeners is set by default to ON/1/LOCAL
to specify valid node checking registration is on, and all local IP addresses can register.
12c introduces the option of using srvctl to set 'invitednodes' or 'invitedsubnets'.

Troubleshooting steps:

1. Listener status and configuration file
[grid@test01 admin]$ lsnrctl status listener_scan1

LSNRCTL for Linux: Version 12.2.0.1.0 – Production on 13-NOV-2021 13:42:08

Copyright (c) 1991, 2016, Oracle. All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN1)))
STATUS of the LISTENER
————————
Alias LISTENER_SCAN1
Version TNSLSNR for Linux: Version 12.2.0.1.0 – Production
Start Date 13-NOV-2021 13:16:56
Uptime 0 days 0 hr. 25 min. 11 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /oracle/app/12.2.0/grid/network/admin/listener.ora
Listener Log File /oracle/app/grid/diag/tnslsnr/test01/listener_scan1/alert/log.xml
Listening Endpoints Summary…
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER_SCAN1)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.100.18.252)(PORT=1521)))
Services Summary…
Service "test" has 1 instance(s).
Instance "test1", status READY, has 1 handler(s) for this service...
Service "testXDB" has 1 instance(s).
Instance "test1", status READY, has 1 handler(s) for this service...
The command completed successfully
The configuration file:
[grid@test01 admin]$ cat listener.ora
MGMTLSNR=(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=IPC)(KEY=MGMTLSNR)))) # line added by Agent
# listener.ora Network Configuration File: /oracle/app/12.2.0/grid/network/admin/listener.ora
# Generated by Oracle configuration tools.

ENABLE_GLOBAL_DYNAMIC_ENDPOINT_LISTENER_SCAN1 = ON

VALID_NODE_CHECKING_REGISTRATION_LISTENER_SCAN1 = OFF

VALID_NODE_CHECKING_REGISTRATION_ASMNET1LSNR_ASM = SUBNET

ASMNET1LSNR_ASM =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = IPC)(KEY = ASMNET1LSNR_ASM))
)
)

VALID_NODE_CHECKING_REGISTRATION_LISTENER = SUBNET

LISTENER =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = IPC)(KEY = LISTENER))
)
)

ENABLE_GLOBAL_DYNAMIC_ENDPOINT_ASMNET1LSNR_ASM = ON

ENABLE_GLOBAL_DYNAMIC_ENDPOINT_LISTENER = ON

LISTENER_SCAN1 =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = IPC)(KEY = LISTENER_SCAN1))
)
)

ENABLE_GLOBAL_DYNAMIC_ENDPOINT_MGMTLSNR=ON # line added by Agent
VALID_NODE_CHECKING_REGISTRATION_MGMTLSNR=SUBNET # line added by Agent
[grid@test01 admin]$

2. Check the listener resource configuration
[grid@test01 admin]$ srvctl config scan_listener
SCAN Listener LISTENER_SCAN1 exists. Port: TCP:1521
Registration invited nodes:
Registration invited subnets:
SCAN Listener is enabled.
SCAN Listener is individually enabled on nodes:
SCAN Listener is individually disabled on nodes:
[grid@test01 admin]$ srvctl config scan_listener -a
SCAN Listener LISTENER_SCAN1 exists. Port: TCP:1521
Registration invited nodes:
Registration invited subnets:
SCAN Listener is enabled.
SCAN Listener is individually enabled on nodes:
SCAN Listener is individually disabled on nodes:
[grid@test01 admin]$ olsnodes
test01
test02

3. Modify the SCAN listener resource to add the invited nodes
[grid@test01 admin]$ srvctl modify scan_listener -update -invitednodes "test01,test02"
[grid@test01 admin]$ srvctl config scan_listener -a
SCAN Listener LISTENER_SCAN1 exists. Port: TCP:1521
Registration invited nodes: test01,test02
Registration invited subnets:
SCAN Listener is enabled.
SCAN Listener is individually enabled on nodes:
SCAN Listener is individually disabled on nodes:

After restarting the SCAN listener, the instances register normally:
[grid@test01 admin]$ srvctl stop scan_listener
[grid@test01 admin]$ srvctl start scan_listener
[grid@test01 admin]$ lsnrctl status listener_scan1

LSNRCTL for Linux: Version 12.2.0.1.0 – Production on 13-NOV-2021 13:49:13

Copyright (c) 1991, 2016, Oracle. All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN1)))
STATUS of the LISTENER
————————
Alias LISTENER_SCAN1
Version TNSLSNR for Linux: Version 12.2.0.1.0 – Production
Start Date 13-NOV-2021 13:48:56
Uptime 0 days 0 hr. 0 min. 16 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /oracle/app/12.2.0/grid/network/admin/listener.ora
Listener Log File /oracle/app/grid/diag/tnslsnr/test01/listener_scan1/alert/log.xml
Listening Endpoints Summary…
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER_SCAN1)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.100.18.252)(PORT=1521)))
Services Summary…
Service "-MGMTDBXDB" has 1 instance(s).
Instance "-MGMTDB", status READY, has 1 handler(s) for this service...
Service "_mgmtdb" has 1 instance(s).
Instance "-MGMTDB", status READY, has 1 handler(s) for this service...
Service "test" has 2 instance(s).
Instance "test1", status READY, has 1 handler(s) for this service...
Instance "test2", status READY, has 1 handler(s) for this service...
Service "testXDB" has 2 instance(s).
Instance "test1", status READY, has 1 handler(s) for this service...
Instance "test2", status READY, has 1 handler(s) for this service...
Service "d08316cafb076493e0531212640a3e9e" has 1 instance(s).
Instance "-MGMTDB", status READY, has 1 handler(s) for this service...
Service "gimr_dscrep_10" has 1 instance(s).
Instance "-MGMTDB", status READY, has 1 handler(s) for this service...
The command completed successfully

Analysis of a 12.2 RAC problem where only one node could register with the SCAN listener