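During cluster testing at a bank today we hit a problem that comes down to how the rebootless restart feature, introduced in 11.2.0.2, actually behaves. Along the way I came to feel that the feature is of rather marginal value.
The problem: on several clusters we ran single-node tests, pulling the fibre cable or the network cable to see whether the host would reboot. On one cluster the node did not reboot after the fibre cable was pulled, which had us technical people tearing our hair out. We pestered the vendor's DBA endlessly and still got no reasonable explanation. In the end the blame was pushed onto AIX: supposedly AIX had killed the hung ocssd process, which is why the other clusters rebooted, and not rebooting is the normal behaviour! I was left thoroughly exasperated. So I analysed the matter as follows; if anything here is wrong, corrections are welcome.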
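Overview:
1. What is the rebootless restart feature in 11.2.0.2 for? To reduce the number of OS reboots. OK.
So let us first pose a few questions:
1.1 When is this feature triggered?
1.2 When is this feature ineffective (not triggered)?
1.3 Once the feature is triggered, is the OS still rebooted if the feature cannot achieve its goal? And does the feature have a limit on the number of attempts?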
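2. In 11.2.0.1 and earlier, here are a few cases that lead to a reboot:
2.1 The network heartbeat exceeds the misscount setting. This is caused by the network heartbeat: if heartbeat communication fails for longer than misscount, the cluster splits (split brain). There are two cases after the split. If the sub-clusters have different node counts, the sub-cluster with fewer nodes is rebooted within reboottime. If the sub-clusters have the same node count, the sub-cluster with the larger node id is rebooted within reboottime.
2.2 A node's voting disk I/O timeout exceeds disktimeout (this is the two-node case; with more nodes it differs slightly). CSS then reboots that node's OS within reboottime.
2.3 A member kill is escalated to a node kill.
According to the official documentation, the reboot is avoided in all three cases above in 11.2.0.2. The work is done by the cssd process: GI first restarts the clusterware stack, and cssd notifies the other processes to stop the resources and the I/O-capable processes. Since cssd drives all of this, it follows that if cssd hangs, or if the host is so busy that the instructions issued by cssd.bin are badly delayed, the rebootless restart feature is seriously affected.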
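As a rough illustration of the split-brain rule in 2.1, here is a minimal Python sketch. The function names are made up for this example, and the tie-break (the sub-cluster holding the lowest node id survives) is my reading of "the sub-cluster with the larger node id is rebooted"; it is not taken from Oracle code.

```python
from typing import FrozenSet, Iterable, List


def surviving_subcluster(subclusters: Iterable[Iterable[int]]) -> FrozenSet[int]:
    """Pick the sub-cluster that stays up after a network split.

    Rule as described in the text: the sub-cluster with more nodes wins.
    On a tie, the one holding the smallest node id wins (assumption),
    so the sub-cluster with the larger node ids is the one evicted.
    """
    groups = [frozenset(g) for g in subclusters]
    if not groups or any(not g for g in groups):
        raise ValueError("need at least one non-empty sub-cluster")
    # Sort key: bigger sub-cluster first, then lowest node id first.
    return min(groups, key=lambda g: (-len(g), min(g)))


def evicted_nodes(subclusters: Iterable[Iterable[int]]) -> List[int]:
    """Return the node ids that would be rebooted (pre-11.2.0.2 behaviour)."""
    groups = [frozenset(g) for g in subclusters]
    winner = surviving_subcluster(groups)
    return sorted(n for g in groups if g != winner for n in g)


if __name__ == "__main__":
    # Two-node cluster, interconnect pulled: node 2 is the one that goes.
    print(evicted_nodes([[1], [2]]))
    # Three-node cluster split 1 vs 2: the single node goes.
    print(evicted_nodes([[1], [2, 3]]))
```

For a two-node cluster this reproduces what we saw in every interconnect test: node 2 is the one that dies.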
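3. Before restarting the clusterware, GI first performs a graceful shutdown of it. The basic steps are:
1. Stop all heartbeats on the local node (network heartbeat, disk heartbeat and local heartbeat).
2. Notify the cssd agent that ocssd.bin is about to stop.
3. Stop all I/O-capable processes registered with CSS, for example lmon.
4. cssd tells crsd to stop all resources. If crsd cannot stop all resources successfully, the node is still rebooted.
5. cssd waits for all I/O-capable processes to exit. If they have not all exited within the short I/O timeout, the node is still rebooted.
6. Notify the cssd agent that all I/O-capable processes have exited.
7. ohasd restarts the clusterware.
8. The local node notifies the other nodes to perform a cluster reconfiguration.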
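The two failure exits in the steps above (step 4 and step 5) can be sketched as a small decision function. This is only a model of the description in this post, not of the real cssd logic; the class and field names are hypothetical, and the short I/O timeout is passed in as a parameter because I do not know its actual value.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    REBOOTLESS_RESTART = "clusterware restarted, OS stays up"
    NODE_REBOOT = "node reboot"


@dataclass
class ShutdownObservation:
    """What was observed during one graceful shutdown (hypothetical model)."""
    crsd_stopped_all_resources: bool   # step 4 succeeded?
    io_procs_exit_seconds: float       # time until all I/O-capable processes exited
    short_io_timeout_seconds: float    # limit applied in step 5 (value not known here)


def graceful_shutdown_outcome(obs: ShutdownObservation) -> Outcome:
    # Step 4: crsd must stop every resource, otherwise the node reboots.
    if not obs.crsd_stopped_all_resources:
        return Outcome.NODE_REBOOT
    # Step 5: all I/O-capable processes must exit within the short I/O timeout.
    if obs.io_procs_exit_seconds > obs.short_io_timeout_seconds:
        return Outcome.NODE_REBOOT
    # Steps 6 to 8: ohasd restarts the clusterware and the cluster reconfigures.
    return Outcome.REBOOTLESS_RESTART


if __name__ == "__main__":
    # The numbers are illustrative only.
    ok = ShutdownObservation(True, 1.0, 5.0)
    failed_cleanup = ShutdownObservation(False, 1.0, 5.0)
    print(graceful_shutdown_outcome(ok).value)
    print(graceful_shutdown_outcome(failed_cleanup).value)
```

The point of writing it this way is that a reboot is the default whenever either check fails; the OS only stays up when both succeed.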
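4. Analysis
Analysis 1: what happens when the heartbeat network is cut (TEST14)
As described above, exceeding misscount no longer reboots the host; instead GI restarts the clusterware. That brings us back to section 3, to the steps GI goes through to stop and start the clusterware, i.e. the graceful shutdown.
In practically every case (going by our test results), either the resources are not stopped or the I/O-capable processes are not terminated successfully, and the node reboots. (Across all of this bank's clusters, more than ten two-node clusters, pulling the heartbeat cable kills node 2. That is the pattern.)
If the cleanup does succeed (which drops all client connections), the GI clusterware is restarted. But the heartbeat cable is still unplugged at that point, so the other node still cannot be reached. Does the node then keep retrying the heartbeat indefinitely, that is, without shutting down? And does that not also count as a loss of service?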
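Analysis 2: what happens when the fibre connection is cut (TEST18, 19)
In theory neither the OCR nor the voting disk is accessible at this point. What does the clusterware do when both are unreachable?
First, the OCR is inaccessible, so the crsd process on the node is terminated because it cannot reach the OCR (a foreign document says that crsd aborts when the OCR is inaccessible, although what we saw differs somewhat). Next, ocssd notifies ohasd, which tries to restart crsd 10 times. What if that still fails? As noted earlier, cssd has crsd stop all resources first; now the resources cannot be stopped, so the host has to be rebooted. If I thought this were the final answer I would not have written this much, and this is exactly where the question of this post lies. On node 1, crsd did not abort; it was gracefully shut down by GI only after it had stopped the relevant resources. On node 2, the same operation made crsd abort outright. After ohasd's 10 attempts to restart crsd failed, the resources could not be stopped and the node rebooted.
See the log analysis below for details.
Second, access to the voting disk: once disktimeout is exceeded, GI tries to restart the clusterware, which involves stopping all I/O-capable processes and resource processes. If the cleanup succeeds, GI restarts the clusterware and cssd keeps trying to locate the voting files. If the cleanup fails, the host is rebooted.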
2013-09-24 11:06:57.229
[cssd(2031962)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/rhdiskpower0 will be considered not functional in 99149 milliseconds -- I/O errors begin; the timeout countdown starts
2013-09-24 11:07:33.995
[cssd(2031962)]CRS-1649:An I/O error occured for voting file: /dev/rhdiskpower0; details at (:CSSNM00059:) in /grid/database/11.2.0/log/fxdb1/cssd/ocssd.log.
2013-09-24 11:07:34.364
[/grid/database/11.2.0/bin/oraagent.bin(5046282)]CRS-5011:Check of resource “fxdb” failed: details at “(:CLSN00007:)” in “/grid/database/11.2.0/log/fxdb1/agent/crsd/oraagent_oracle/oraagent_oracle.log”
2013-09-24 11:07:35.000
[cssd(2031962)]CRS-1649:An I/O error occured for voting file: /dev/rhdiskpower0; details at (:CSSNM00060:) in /grid/database/11.2.0/log/fxdb1/cssd/ocssd.log.
2013-09-24 11:07:47.043
[cssd(2031962)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/rhdiskpower0 will be considered not functional in 49335 milliseconds
2013-09-24 11:08:17.318
[cssd(2031962)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file /dev/rhdiskpower0 will be considered not functional in 19060 milliseconds
2013-09-24 11:08:36.449
[cssd(2031962)]CRS-1604:CSSD voting file is offline: /dev/rhdiskpower0; details at (:CSSNM00058:) in /grid/database/11.2.0/log/fxdb1/cssd/ocssd.log.
2013-09-24 11:08:36.449
[cssd(2031962)]CRS-1606:The number of voting files available, 0, is less than the minimum number of voting files required, 1, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /grid/database/11.2.0/log/fxdb1/cssd/ocssd.log -- the voting file going offline causes cssd to terminate
2013-09-24 11:08:36.450
[cssd(2031962)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /grid/database/11.2.0/log/fxdb1/cssd/ocssd.log -- the CSS daemon is terminated
2013-09-24 11:08:36.501
[cssd(2031962)]CRS-1652:Starting clean up of CRSD resources. -- cleanup of the CRSD resources is initiated
2013-09-24 11:08:37.979
[/grid/database/11.2.0/bin/oraagent.bin(3080748)]CRS-5016:Process “/grid/database/11.2.0/opmn/bin/onsctli” spawned by agent “/grid/database/11.2.0/bin/oraagent.bin” for action “check” failed: details at “(:CLSN00010:)” in “/grid/database/11.2.0/log/fxdb1/agent/crsd/oraagent_grid/oraagent_grid.log”
2013-09-24 11:08:38.589
[/grid/database/11.2.0/bin/oraagent.bin(3080748)]CRS-5016:Process “/grid/database/11.2.0/bin/lsnrctl” spawned by agent “/grid/database/11.2.0/bin/oraagent.bin” for action “check” failed: details at “(:CLSN00010:)” in “/grid/database/11.2.0/log/fxdb1/agent/crsd/oraagent_grid/oraagent_grid.log”
2013-09-24 11:08:38.597
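[cssd(2031962)]CRS-1654:Clean up of CRSD resources finished successfully. -- the cleanup of the CRSD resources is reported as successful. What performed the cleanup? No crsd abort has been reported yet, so it should still be the crsd process itself stopping the resources; the crsd abort mentioned in Haonan's document has not shown up at this point.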
2013-09-24 11:08:38.598
[cssd(2031962)]CRS-1655:CSSD on node fxdb1 detected a problem and started to shutdown. -- shutdown begins
2013-09-24 11:08:38.684
[/grid/database/11.2.0/bin/oraagent.bin(3080748)]CRS-5822:Agent ‘/grid/database/11.2.0/bin/oraagent_grid’ disconnected from server. Details at (:CRSAGF00117:) {0:1:8} in /grid/database/11.2.0/log/fxdb1/agent/crsd/oraagent_grid/oraagent_grid.log.
2013-09-24 11:08:38.692
[/grid/database/11.2.0/bin/orarootagent.bin(5112062)]CRS-5822:Agent ‘/grid/database/11.2.0/bin/orarootagent_root’ disconnected from server. Details at (:CRSAGF00117:) {0:3:3784} in /grid/database/11.2.0/log/fxdb1/agent/crsd/orarootagent_root/orarootagent_root.log.
2013-09-24 11:08:38.778
[cssd(2031962)]CRS-1660:The CSS daemon shutdown has completed -- the daemon has shut down
2013-09-24 11:08:39.565
[ohasd(2950012)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb1’. -- crsd is reported as failed; from here on the other processes start to fail as well
2013-09-24 11:08:40.052
[/grid/database/11.2.0/bin/oraagent.bin(3211384)]CRS-5011:Check of resource “+ASM” failed: details at “(:CLSN00006:)” in “/grid/database/11.2.0/log/fxdb1/agent/ohasd/oraagent_grid/oraagent_grid.log”
2013-09-24 11:08:40.225
[ohasd(2950012)]CRS-2765:Resource ‘ora.asm’ has failed on server ‘fxdb1’.
2013-09-24 11:08:40.248
[/grid/database/11.2.0/bin/oraagent.bin(3211384)]CRS-5011:Check of resource “+ASM” failed: details at “(:CLSN00006:)” in “/grid/database/11.2.0/log/fxdb1/agent/ohasd/oraagent_grid/oraagent_grid.log”
2013-09-24 11:08:40.503
[ohasd(2950012)]CRS-2765:Resource ‘ora.ctssd’ has failed on server ‘fxdb1’.
2013-09-24 11:08:40.772
[crsd(3277342)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in /grid/database/11.2.0/log/fxdb1/crsd/crsd.log. -- crsd goes down here. Now the picture is clear: my earlier overview had the order of the CRSD resource cleanup wrong. crsd actually goes down only after the cleanup.
2013-09-24 11:08:40.777
[ohasd(2950012)]CRS-2765:Resource ‘ora.evmd’ has failed on server ‘fxdb1’.
2013-09-24 11:08:41.558
[ohasd(2950012)]CRS-2765:Resource ‘ora.diskmon’ has failed on server ‘fxdb1’.
2013-09-24 11:08:41.572
[ctssd(4064022)]CRS-2402:The Cluster Time Synchronization Service aborted on host fxdb1. Details at (:ctss_css_init1:) in /grid/database/11.2.0/log/fxdb1/ctssd/octssd.log.
2013-09-24 11:08:41.621
[ohasd(2950012)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb1’.
2013-09-24 11:08:41.644
[ohasd(2950012)]CRS-2765:Resource ‘ora.cluster_interconnect.haip’ has failed on server ‘fxdb1’.
2013-09-24 11:08:42.125
[ohasd(2950012)]CRS-2765:Resource ‘ora.cssd’ has failed on server ‘fxdb1’.
2013-09-24 11:08:43.954
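[cssd(2556520)]CRS-1713:CSSD daemon is started in clustered mode -- GI has restarted the clusterware and cssd is working again. It then looks for the voting files, which gives the behaviour below: it keeps searching, but the node no longer reboots.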
2013-09-24 11:08:44.244
[cssd(2556520)]CRS-1714:Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /grid/database/11.2.0/log/fxdb1/cssd/ocssd.log
2013-09-24 11:08:58.247
[ohasd(2950012)]CRS-2765:Resource ‘ora.cssdmonitor’ has failed on server ‘fxdb1’.
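The CRS-1615, CRS-1614 and CRS-1613 warnings in node 1's alert log above each carry a "remaining milliseconds" value. A small Python sketch (the function name and the regular expression are mine, written only for this excerpt) shows that all three point at the same moment, which can then be compared with the time CRS-1604 was actually logged.

```python
import re
from datetime import datetime, timedelta

# (timestamp, message) pairs copied from the node 1 alert log excerpt above.
ALERT_ENTRIES = [
    ("2013-09-24 11:06:57.229",
     "CRS-1615:No I/O has completed after 50% of the maximum interval. "
     "Voting file /dev/rhdiskpower0 will be considered not functional in 99149 milliseconds"),
    ("2013-09-24 11:07:47.043",
     "CRS-1614:No I/O has completed after 75% of the maximum interval. "
     "Voting file /dev/rhdiskpower0 will be considered not functional in 49335 milliseconds"),
    ("2013-09-24 11:08:17.318",
     "CRS-1613:No I/O has completed after 90% of the maximum interval. "
     "Voting file /dev/rhdiskpower0 will be considered not functional in 19060 milliseconds"),
]

WARNING = re.compile(
    r"(CRS-\d+):No I/O has completed after (\d+)% of the maximum interval\."
    r".* in (\d+) milliseconds"
)


def projected_offline_times(entries):
    """Return (code, percent, projected offline time) for each warning."""
    results = []
    for stamp, message in entries:
        match = WARNING.search(message)
        if match is None:
            continue  # not a voting-file countdown warning
        logged_at = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
        remaining = timedelta(milliseconds=int(match.group(3)))
        results.append((match.group(1), int(match.group(2)), logged_at + remaining))
    return results


if __name__ == "__main__":
    for code, percent, deadline in projected_offline_times(ALERT_ENTRIES):
        print(code, f"{percent}%", deadline.strftime("%H:%M:%S.%f")[:-3])
```

All three warnings project 11:08:36.378, and CRS-1604 (voting file offline) was logged at 11:08:36.449, about 71 ms later, so on node 1 the disk timeout ran its full course before cssd terminated.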
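At 11:32:37 the disks were discovered; by then the fibre cable had been plugged back in. In other words, if the host does not reboot, GI keeps rescanning the disks for the voting files, provided that the I/O-capable processes and the resource processes were stopped successfully. After the excerpt below we turn to the crsd log.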
2013-09-24 11:32:37.496: [ SKGFD][1286]OSS discovery with ::
2013-09-24 11:32:37.527: [ SKGFD][1286]Handle 11125b910 from lib :UFS:: for disk :/dev/rhdiskpower9:
2013-09-24 11:32:37.558: [ SKGFD][1286]Handle 11125d1b0 from lib :UFS:: for disk :/dev/rhdiskpower8:
2013-09-24 11:32:37.588: [ SKGFD][1286]Handle 111268970 from lib :UFS:: for disk :/dev/rhdiskpower7:
2013-09-24 11:32:37.618: [ SKGFD][1286]Handle 111268df0 from lib :UFS:: for disk :/dev/rhdiskpower6:
2013-09-24 11:32:37.648: [ SKGFD][1286]Handle 11146bf50 from lib :UFS:: for disk :/dev/rhdiskpower5:
2013-09-24 11:32:37.681: [ SKGFD][1286]Handle 1114e5c30 from lib :UFS:: for disk :/dev/rhdiskpower4:
2013-09-24 11:32:37.712: [ SKGFD][1286]Handle 1114e6390 from lib :UFS:: for disk :/dev/rhdiskpower3:
2013-09-24 11:32:37.743: [ SKGFD][1286]Handle 1114e6af0 from lib :UFS:: for disk :/dev/rhdiskpower2:
2013-09-24 11:32:37.773: [ SKGFD][1286]Handle 1114fa5d0 from lib :UFS:: for disk :/dev/rhdiskpower14:
2013-09-24 11:32:37.803: [ SKGFD][1286]Handle 111501490 from lib :UFS:: for disk :/dev/rhdiskpower13:
From 11:08 onwards crsd was repeatedly being started; the crsd log shows the details:
2013-09-24 11:08:39
Changing directory to /grid/database/11.2.0/log/fxdb1/crsd
2013-09-24 11:08:39
CRSD REBOOT
Attempt to add duplicate debug module CRSUI (old description CRSUI Component, new description CRSUI Component.
Attempt to add duplicate debug module CRSCOMM (old description CRSCOMM Component, new description CRSCOMM Component.
Attempt to add duplicate debug module CRSRTI (old description CRSRTI Component, new description CRSRTI Component.
Attempt to add duplicate debug module CRSMAIN (old description CRSMAIN Component, new description CRSMAIN Component.
Attempt to add duplicate debug module CRSPLACE (old description CRSPLACE Component, new description CRSPLACE Component.
Attempt to add duplicate debug module CRSAPP (old description CRSAPP Component, new description CRSAPP Component.
Attempt to add duplicate debug module CRSRES (old description CRSRES Component, new description CRSRES Component.
Attempt to add duplicate debug module CRSTIMER (old description CRSTIMER Component, new description CRSTIMER Component.
Attempt to add duplicate debug module CRSEVT (old description CRSEVT Component, new description CRSEVT Component.
Attempt to add duplicate debug module CRSD (old description CRSD Component, new description CRSD Component.
Attempt to add duplicate debug module CLUCLS (old description CLUCLS Component, new description CLUCLS Component.
Attempt to add duplicate debug module CLSVER (old description CLSVER Component, new description CLSVER Component.
Attempt to add duplicate debug module COMMCRS (old description COMMCRS, new description COMMCRS.
Attempt to add duplicate debug module COMMNS (old description COMMNS, new description COMMNS.
CRSD exiting: Could not init the CSS context, error: 3
2013-09-24 11:20:05
Changing directory to /grid/database/11.2.0/log/fxdb1/crsd
2013-09-24 11:20:05
CRSD REBOOT
Attempt to add duplicate debug module CRSUI (old description CRSUI Component, new description CRSUI Component.
Attempt to add duplicate debug module CRSCOMM (old description CRSCOMM Component, new description CRSCOMM Component.
Attempt to add duplicate debug module CRSRTI (old description CRSRTI Component, new description CRSRTI Component.
Attempt to add duplicate debug module CRSMAIN (old description CRSMAIN Component, new description CRSMAIN Component.
Attempt to add duplicate debug module CRSPLACE (old description CRSPLACE Component, new description CRSPLACE Component.
Attempt to add duplicate debug module CRSAPP (old description CRSAPP Component, new description CRSAPP Component.
Attempt to add duplicate debug module CRSRES (old description CRSRES Component, new description CRSRES Component.
Attempt to add duplicate debug module CRSTIMER (old description CRSTIMER Component, new description CRSTIMER Component.
Attempt to add duplicate debug module CRSEVT (old description CRSEVT Component, new description CRSEVT Component.
Attempt to add duplicate debug module CRSD (old description CRSD Component, new description CRSD Component.
Attempt to add duplicate debug module CLUCLS (old description CLUCLS Component, new description CLUCLS Component.
Attempt to add duplicate debug module CLSVER (old description CLSVER Component, new description CLSVER Component.
Attempt to add duplicate debug module COMMCRS (old description COMMCRS, new description COMMCRS.
Attempt to add duplicate debug module COMMNS (old description COMMNS, new description COMMNS.
CRSD exiting: Could not init the CSS context, error: 3
2013-09-24 11:33:09
Changing directory to /grid/database/11.2.0/log/fxdb1/crsd
2013-09-24 11:33:09
CRSD REBOOT
Attempt to add duplicate debug module CRSUI (old description CRSUI Component, new description CRSUI Component.
Attempt to add duplicate debug module CRSCOMM (old description CRSCOMM Component, new description CRSCOMM Component.
Attempt to add duplicate debug module CRSRTI (old description CRSRTI Component, new description CRSRTI Component.
Attempt to add duplicate debug module CRSMAIN (old description CRSMAIN Component, new description CRSMAIN Component.
Attempt to add duplicate debug module CRSPLACE (old description CRSPLACE Component, new description CRSPLACE Component.
Attempt to add duplicate debug module CRSAPP (old description CRSAPP Component, new description CRSAPP Component.
Attempt to add duplicate debug module CRSRES (old description CRSRES Component, new description CRSRES Component.
Attempt to add duplicate debug module CRSTIMER (old description CRSTIMER Component, new description CRSTIMER Component.
Attempt to add duplicate debug module CRSEVT (old description CRSEVT Component, new description CRSEVT Component.
Attempt to add duplicate debug module CRSD (old description CRSD Component, new description CRSD Component.
Attempt to add duplicate debug module CLUCLS (old description CLUCLS Component, new description CLUCLS Component.
Attempt to add duplicate debug module CLSVER (old description CLSVER Component, new description CLSVER Component.
Attempt to add duplicate debug module COMMCRS (old description COMMCRS, new description COMMCRS.
Attempt to add duplicate debug module COMMNS (old description COMMNS, new description COMMNS.
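The analysis of node 1's alert log essentially overturns Analysis 2 from my previous email. The whole sequence is GI managing the clusterware processes, not GI itself being restarted: GI performs the graceful shutdown and then the startup. Moreover, crsd was stopped by GI, or aborted by itself, only after it had finished stopping the resources; running prostat against the crsd process would confirm this. That, however, is only how node 1 behaved. The same operation on node 2 gave completely different feedback.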
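Analysis 2 (revised): what happens when the fibre connection is cut (TEST18, 19)
In theory neither the OCR nor the voting disk is accessible at this point. What does the clusterware do when both are unreachable?
First, the OCR is inaccessible.
The crsd process on the node is told to stop the relevant resources and is then shut down by GI. The clusterware is gracefully shut down successfully, and after cssd is restarted it reads the voting disk over and over. This is what happened on node 1.
What if crsd aborts and ohasd cannot restart it, so that the resources cannot be shut down? A reboot. This is what happened on node 2.
Here is the analysis of what happened when the fibre cable was pulled on node 2: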
2013-09-24 11:38:50.475
[crsd(3735838)]CRS-2765:Resource ‘ora.fxdb.db’ has failed on server ‘fxdb2’.
2013-09-24 11:39:08.351
[cssd(3081164)]CRS-1615:No I/O has completed after 50% of the maximum interval. Voting file /dev/rhdiskpower12 will be considered not functional in 99785 milliseconds -- this is when the fibre cable on node 2 was pulled.
2013-09-24 11:39:24.852
[cssd(3081164)]CRS-1649:An I/O error occured for voting file: /dev/rhdiskpower12; details at (:CSSNM00060:) in /grid/database/11.2.0/log/fxdb2/cssd/ocssd.log.
2013-09-24 11:39:58.549
[cssd(3081164)]CRS-1614:No I/O has completed after 75% of the maximum interval. Voting file /dev/rhdiskpower12 will be considered not functional in 49587 milliseconds
2013-09-24 11:40:26.560
[crsd(3735838)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log. -- the OCR cannot be found or accessed
2013-09-24 11:40:26.561
[crsd(3735838)]CRS-1006:The OCR location is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log. -- the OCR cannot be found or accessed; GI now begins the graceful shutdown of the clusterware
2013-09-24 11:40:26.882
[/grid/database/11.2.0/bin/oraagent.bin(3932520)]CRS-5822:Agent ‘/grid/database/11.2.0/bin/oraagent_grid’ disconnected from server. Details at (:CRSAGF00117:) {0:1:6} in /grid/database/11.2.0/log/fxdb2/agent/crsd/oraagent_grid/oraagent_grid.log.
2013-09-24 11:40:26.891
[/grid/database/11.2.0/bin/orarootagent.bin(4063734)]CRS-5822:Agent ‘/grid/database/11.2.0/bin/orarootagent_root’ disconnected from server. Details at (:CRSAGF00117:) {0:2:10662} in /grid/database/11.2.0/log/fxdb2/agent/crsd/orarootagent_root/orarootagent_root.log.
2013-09-24 11:40:26.895
[/grid/database/11.2.0/bin/scriptagent.bin(4456740)]CRS-5822:Agent ‘/grid/database/11.2.0/bin/scriptagent_grid’ disconnected from server. Details at (:CRSAGF00117:) {0:6:7} in /grid/database/11.2.0/log/fxdb2/agent/crsd/scriptagent_grid/scriptagent_grid.log.
2013-09-24 11:40:27.324
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. -- crsd is reported as failed
2013-09-24 11:40:28.768
[cssd(3081164)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file /dev/rhdiskpower12 will be considered not functional in 19369 milliseconds
2013-09-24 11:40:28.964
[crsd(9371958)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:29.005
[crsd(9371958)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage —–crsd abort
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:29.442
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. -- crsd is reported as failed; ohasd restarting CRSD? Attempt 1
2013-09-24 11:40:31.025
[crsd(5243254)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:31.061
[crsd(5243254)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:31.521
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. -- crsd is reported as failed; ohasd restarting CRSD? Attempt 2
2013-09-24 11:40:33.106
[crsd(4981632)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:33.151
[crsd(4981632)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:33.601
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. -- crsd is reported as failed; ohasd restarting CRSD? Attempt 3
2013-09-24 11:40:35.179
[crsd(4457080)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:35.216
[crsd(4457080)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:35.683
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. -- crsd is reported as failed; ohasd restarting CRSD? Attempt 4
2013-09-24 11:40:37.269
[crsd(3146492)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:37.306
[crsd(3146492)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:37.765
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. -- crsd is reported as failed; ohasd restarting CRSD? Attempt 5
2013-09-24 11:40:39.338
[crsd(4260124)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:39.375
[crsd(4260124)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:39.845
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. -- crsd is reported as failed; ohasd restarting CRSD? Attempt 6
2013-09-24 11:40:41.423
[crsd(2556490)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:41.456
[crsd(2556490)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:41.923
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. -- crsd is reported as failed; ohasd restarting CRSD? Attempt 7
2013-09-24 11:40:43.501
[crsd(6357962)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:43.538
[crsd(6357962)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:44.008
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. -- crsd is reported as failed; ohasd restarting CRSD? Attempt 8
2013-09-24 11:40:45.578
[crsd(3146246)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:45.615
[crsd(3146246)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:46.094
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. -- crsd is reported as failed; ohasd restarting CRSD? Attempt 9
2013-09-24 11:40:47.671
[crsd(6750230)]CRS-1013:The OCR location in an ASM disk group is inaccessible. Details in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:47.709
[crsd(6750230)]CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /grid/database/11.2.0/log/fxdb2/crsd/crsd.log.
2013-09-24 11:40:48.171
[ohasd(2949250)]CRS-2765:Resource ‘ora.crsd’ has failed on server ‘fxdb2’. -- crsd is reported as failed; ohasd restarting CRSD? Attempt 10
2013-09-24 11:40:48.172
[ohasd(2949250)]CRS-2771:Maximum restart attempts reached for resource ‘ora.crsd’; will not restart. -- the restart limit for crsd has been reached: ohasd has restarted crsd 10 times without getting it back up, so the resources cannot be shut down.
2013-09-24 11:40:48.878
[cssd(3081164)]CRS-1604:CSSD voting file is offline: /dev/rhdiskpower12; details at (:CSSNM00058:) in /grid/database/11.2.0/log/fxdb2/cssd/ocssd.log.
2013-09-24 11:40:48.878
[cssd(3081164)]CRS-1606:The number of voting files available, 0, is less than the minimum number of voting files required, 1, resulting in CSSD termination to ensure data integrity; details at (:CSSNM00018:) in /grid/database/11.2.0/log/fxdb2/cssd/ocssd.log
2013-09-24 11:40:48.878
[cssd(3081164)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /grid/database/11.2.0/log/fxdb2/cssd/ocssd.log
2013-09-24 11:40:48.930
[cssd(3081164)]CRS-1652:Starting clean up of CRSD resources.
2013-09-24 11:40:49.111
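[cssd(3081164)]CRS-1653:The clean up of the CRSD resources failed. -- with crsd down, the resource cleanup was bound to fail. Node 1, however, showed none of the restart attempts above, and its resource cleanup succeeded.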
2013-09-24 11:47:18.633
[ohasd(1966398)]CRS-2112:The OLR service started on node fxdb2.
2013-09-24 11:47:18.718
[ohasd(1966398)]CRS-1301:Oracle High Availability Service started on node fxdb2.
2013-09-24 11:47:18.768
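[ohasd(1966398)]CRS-8011:reboot advisory message from host: fxdb2, component: cssagent, with time stamp: L-2013-09-24-11:40:50.570 -- at this point my first analysis is fully confirmed. The same operation, yet the two nodes respond differently. Why? This is where all the questions end up.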
[ohasd(1966398)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
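To put numbers on the crsd restart loop in node 2's alert log above, here is a short Python sketch. The timestamps are the CRS-2765 entries for ora.crsd and the CRS-2771 entry copied from the excerpt; the helper names are mine. I treat the first CRS-2765 as the original failure and every later one as a failed restart, which is my interpretation of the log.

```python
from datetime import datetime

# Times of "CRS-2765: Resource 'ora.crsd' has failed" on fxdb2, from the excerpt.
CRSD_FAILED_AT = [
    "11:40:27.324", "11:40:29.442", "11:40:31.521", "11:40:33.601",
    "11:40:35.683", "11:40:37.765", "11:40:39.845", "11:40:41.923",
    "11:40:44.008", "11:40:46.094", "11:40:48.171",
]
# Time of "CRS-2771: Maximum restart attempts reached" for ora.crsd.
RESTART_LIMIT_REACHED_AT = "11:40:48.172"


def _parse(stamp: str) -> datetime:
    return datetime.strptime(stamp, "%H:%M:%S.%f")


def restart_summary(failures, limit_reached):
    """Summarise the restart loop: attempts, spacing, and total duration."""
    times = [_parse(s) for s in failures]
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return {
        # First entry is the original failure; the rest are failed restarts.
        "restart_attempts": len(times) - 1,
        "min_gap_s": min(gaps),
        "max_gap_s": max(gaps),
        "first_failure_to_limit_s": (_parse(limit_reached) - times[0]).total_seconds(),
    }


if __name__ == "__main__":
    for key, value in restart_summary(CRSD_FAILED_AT, RESTART_LIMIT_REACHED_AT).items():
        print(f"{key}: {value}")
```

On node 2 the ten restarts were spaced roughly 2.1 seconds apart and were used up in about 21 seconds, so the restart limit was hit about 0.7 seconds before the voting file was declared offline at 11:40:48.878. By the time cssd asked for the CRSD resources to be cleaned up, there was no crsd left to do it.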
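In the end we are left with a question: why does the same operation produce opposite crsd behaviour on node 1 and on node 2? On the other clusters, every node behaved like node 2; node 1 of this cluster is so far the only exception. As for the reason, I suppose "rebootless restart" really did take effect there. The puzzle makes me lean towards suspecting a bug, but the vendor has not replied and I have not traced any further. Is this where the question ends?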
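Lu, I can't follow this one.
Is there any documentation on it?
It is mentioned in the new-features overview in the 11gr2.2 online documentation.
Looks a bit like a bug.
It does, a bit. But on node 1 the voting file went offline normally, which may be the main reason here. The vendor has not given a reasonable explanation either; I am looking forward to being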