Sunday, November 1, 2015

How to Recover the Voting Disk in a RAC Environment

This post demonstrates voting disk recovery. The exercise was tested on an Oracle 12c test cluster: a two-node RAC whose nodes are named RACTEST1 and RACTEST2.

My voting disk uses external redundancy, so only one copy of the voting disk exists on the ASM disk.

My goal is to corrupt the voting disk and restore it onto a new disk. Here are the high-level steps.

  1. Corrupt the voting disk
  2. Shut down the database
  3. Stop the CRS service on all nodes
  4. Start the CRS service in exclusive mode on ONE node
  5. Create a new disk group or use an existing one
  6. Restore the voting disk to the newly created disk group
  7. Stop the CRS service that was started in exclusive mode
  8. Start the CRS service on both nodes
  9. Start the database and make sure all instances are up
  10. Verify the cluster services and make sure all is good!
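
As a rough checklist, the steps after the deliberate corruption can be laid out as a dry run. The sketch below only prints the commands used later in this post, in order; it executes nothing. The `print_recovery_plan` function and the `DB_NAME` and `NEW_DG` variables are names made up for this illustration, defaulting to the database and disk group names from this test cluster.

```shell
#!/bin/sh
# Dry-run sketch: prints the recovery commands in order and runs none of them.
# DB_NAME and NEW_DG are illustrative variables; defaults match this post.
DB_NAME="${DB_NAME:-govinddb}"
NEW_DG="${NEW_DG:-+VOTE2}"

print_recovery_plan() {
    cat <<EOF
[oracle, one node]   srvctl stop database -db $DB_NAME
[root, every node]   crsctl stop crs -f
[root, one node]     crsctl start crs -excl
[ASM, same node]     create disk group ${NEW_DG#+} or choose an existing one
[oracle, same node]  crsctl replace votedisk $NEW_DG
[root, same node]    crsctl stop crs
[root, every node]   crsctl start crs
[oracle, one node]   srvctl start database -db $DB_NAME
[root, one node]     crsctl check cluster
EOF
}

print_recovery_plan
```

Printing the plan first makes it easy to review which commands run as root, which run on every node, and which run only on the node started in exclusive mode.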


Step 1

Let us check where the Voting disk is present on the cluster.

crsctl query css votedisk







The voting disk is stored on the VOTE1 disk in the VOTE1 disk group.
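
To pull the disk group name out of that listing in a script, something like the following sketch could be used. The sample text follows the column layout that `crsctl query css votedisk` prints later in this post; the File Universal Id is a made-up placeholder, and the `ORCL:VOTE1` file name is assumed by analogy with the VOTE2 output. The `votedisk_groups` function name is also just for illustration.

```shell
# Sample in the layout printed by `crsctl query css votedisk`.
# The File Universal Id is a placeholder, not real output from this cluster.
sample_votedisk_output='##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   00000000000000000000000000000000 (ORCL:VOTE1) [VOTE1]
Located 1 voting disk(s).'

# Print the disk group of every listed voting file, one per line.
# Data rows are the ones whose first field looks like "1.", "2.", ...
votedisk_groups() {
    awk '$1 ~ /^[0-9]+\.$/ { gsub(/\[|\]/, "", $NF); print $NF }'
}

printf '%s\n' "$sample_votedisk_output" | votedisk_groups
# prints: VOTE1
```

On a live cluster the same function would be fed from the real command: `crsctl query css votedisk | votedisk_groups`.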

Step 2

Let me corrupt the voting disk. Here is the command to corrupt the disk.

dd if=/dev/zero of=/dev/oracleasm/disks/VOTE1 bs=4096 count=100
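
The dd command above overwrites bs x count = 4096 x 100 = 409,600 bytes from the start of the device with zeros. To see what that does without touching a real ASM disk, the same overwrite can be rehearsed on a throwaway file. This is only a sketch: the scratch file, its 1 MiB size, and the 'x' fill byte are assumptions for the demonstration, and `conv=notrunc` is added so the file keeps its length the way a block device would.

```shell
# Rehearse the overwrite on a scratch file, never on a real ASM device.
scratch=$(mktemp)

# Fill 1 MiB with the byte 'x' so the zeroed region is easy to measure.
head -c 1048576 /dev/zero | tr '\0' 'x' > "$scratch"

# Same block size and count as in the post; conv=notrunc keeps the file length.
dd if=/dev/zero of="$scratch" bs=4096 count=100 conv=notrunc 2>/dev/null

# Count the bytes that are no longer 'x', and the total file size.
zeroed=$(tr -d 'x' < "$scratch" | wc -c | tr -d ' ')
size=$(wc -c < "$scratch" | tr -d ' ')
echo "zeroed $zeroed of $size bytes"
# prints: zeroed 409600 of 1048576 bytes

rm -f "$scratch"
```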










At this stage, the clusterware should have stopped working.

I rebooted the node and checked the cluster, and the services were down. Rebooting the node is not necessary; this is a test environment and I rebooted only to check the cluster services. In a production environment, it is enough to bounce the CRS service.

Step 3  Check the status of the cluster services

crsctl check crs
crsctl check cluster







Step 4  Shut down the database



[oracle@RACTEST1 ~]$ srvctl status database -db govinddb
Instance govinddb1 is running on node ractest1
Instance govinddb2 is running on node ractest2
[oracle@RACTEST1 ~]$ srvctl stop database -db govinddb
[oracle@RACTEST1 ~]$  srvctl status database -db govinddb
Instance govinddb1 is not running on node ractest1
Instance govinddb2 is not running on node ractest2

[oracle@RACTEST1 ~]$

Step 5  Stop the CRS  on all nodes

[root@RACTEST1 bin]# ./crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'ractest1'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'ractest1'
CRS-2673: Attempting to stop 'ora.evmd' on 'ractest1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'ractest1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'ractest1'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'ractest1'
CRS-2673: Attempting to stop 'ora.gipcd' on 'ractest1'
CRS-2677: Stop of 'ora.drivers.acfs' on 'ractest1' succeeded
CRS-2677: Stop of 'ora.cssdmonitor' on 'ractest1' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'ractest1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'ractest1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'ractest1' succeeded
CRS-2677: Stop of 'ora.gipcd' on 'ractest1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'ractest1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@RACTEST1 bin]#


[root@RACTEST2 bin]# ./crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'ractest2'
CRS-2673: Attempting to stop 'ora.gipcd' on 'ractest2'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'ractest2'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'ractest2'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'ractest2'
CRS-2673: Attempting to stop 'ora.evmd' on 'ractest2'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'ractest2'
CRS-2677: Stop of 'ora.cssdmonitor' on 'ractest2' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'ractest2' succeeded
CRS-2677: Stop of 'ora.gipcd' on 'ractest2' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'ractest2' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'ractest2' succeeded
CRS-2677: Stop of 'ora.evmd' on 'ractest2' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'ractest2' has completed
CRS-4133: Oracle High Availability Services has been stopped.

[root@RACTEST2 bin]#

Step 6  Start CRS on one node in exclusive mode

[root@RACTEST1 bin]# ./crsctl start crs -excl
CRS-4123: Oracle High Availability Services has been started.
CRS-2672: Attempting to start 'ora.evmd' on 'ractest1'
CRS-2672: Attempting to start 'ora.mdnsd' on 'ractest1'
CRS-2676: Start of 'ora.mdnsd' on 'ractest1' succeeded
CRS-2676: Start of 'ora.evmd' on 'ractest1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'ractest1'
CRS-2676: Start of 'ora.gpnpd' on 'ractest1' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'ractest1'
CRS-2672: Attempting to start 'ora.gipcd' on 'ractest1'
CRS-2676: Start of 'ora.cssdmonitor' on 'ractest1' succeeded
CRS-2676: Start of 'ora.gipcd' on 'ractest1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'ractest1'
CRS-2672: Attempting to start 'ora.diskmon' on 'ractest1'
CRS-2676: Start of 'ora.diskmon' on 'ractest1' succeeded
CRS-2676: Start of 'ora.cssd' on 'ractest1' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'ractest1'
CRS-2672: Attempting to start 'ora.ctssd' on 'ractest1'
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'ractest1'
CRS-2676: Start of 'ora.crf' on 'ractest1' succeeded
CRS-2676: Start of 'ora.ctssd' on 'ractest1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'ractest1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'ractest1'
CRS-2676: Start of 'ora.asm' on 'ractest1' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'ractest1'
CRS-2676: Start of 'ora.storage' on 'ractest1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'ractest1'
CRS-2676: Start of 'ora.crsd' on 'ractest1' succeeded
[root@RACTEST1 bin]#

Step 7  Verify that the ASM instance is up and running, and start it if it is down. In my case, the ASM instance was started as part of starting the CRS service. We need the ASM instance to restore the voting disk.

Log in to the ASM instance. If it is down, it can be started from a minimal pfile:

echo INSTANCE_TYPE=ASM >> /u01/app/oracle/init+ASM1.ora 
startup pfile='/u01/app/oracle/init+ASM1.ora';

Step 7  Create the new disk group VOTE2 for restoring the voting disk.

SQL> CREATE DISKGROUP VOTE2 EXTERNAL REDUNDANCY
DISK 'ORCL:VOTE2'
ATTRIBUTE 'au_size' = '4M',
'compatible.asm' = '11.2.0.2.0',
'compatible.rdbms' = '11.2.0.2.0',
'compatible.advm' = '11.2.0.2.0';

Diskgroup created.

SQL>

Step 8  Restore the voting disk onto the newly created disk group VOTE2.

Let us check the current voting disk location. No voting disk is displayed, because the voting disk has already been corrupted.

[oracle@RACTEST1 bin]$ ./crsctl query css votedisk
Located 0 voting disk(s).

Now recover the voting disk as shown below. The voting disk is automatically recovered using the latest available copy of the OCR.

[oracle@RACTEST1 bin]$ ./crsctl replace votedisk +VOTE2
Successful addition of voting disk 5d17422445e54f1abf131f15b967c07f.
Successfully replaced voting disk group with +VOTE2.
CRS-4266: Voting file(s) successfully replaced

Let us check the voting disk again.


[oracle@RACTEST1 bin]$  ./crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   5d17422445e54f1abf131f15b967c07f (ORCL:VOTE2) [VOTE2]
Located 1 voting disk(s).
[oracle@RACTEST1 bin]$
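
In a script, the before and after listings can be told apart with a small check along these lines. It is a sketch, not an Oracle-provided tool: the `votedisk_healthy` function name is made up here, and the two sample variables simply hold the outputs shown above.

```shell
# Outputs copied from this post: before and after `crsctl replace votedisk`.
before_replace='Located 0 voting disk(s).'
after_replace='##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   5d17422445e54f1abf131f15b967c07f (ORCL:VOTE2) [VOTE2]
Located 1 voting disk(s).'

# Succeeds only if at least one voting file is listed and every one is ONLINE.
votedisk_healthy() {
    awk '
        $1 ~ /^[0-9]+\.$/ { total++; if ($2 == "ONLINE") online++ }
        END { exit !(total > 0 && online == total) }
    '
}

if printf '%s\n' "$after_replace" | votedisk_healthy; then
    echo "voting files look healthy"
fi
# prints: voting files look healthy
```

Against the live cluster the call would be `./crsctl query css votedisk | votedisk_healthy`, which fails for the "Located 0 voting disk(s)." case.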

Step 9  Stop the CRS on RACTEST1 which was started in exclusive mode.


[root@RACTEST1 bin]# ./crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'ractest1'
CRS-2673: Attempting to stop 'ora.crsd' on 'ractest1'
CRS-2677: Stop of 'ora.crsd' on 'ractest1' succeeded
CRS-2673: Attempting to stop 'ora.storage' on 'ractest1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'ractest1'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'ractest1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'ractest1'
CRS-2677: Stop of 'ora.storage' on 'ractest1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'ractest1' succeeded
CRS-2673: Attempting to stop 'ora.crf' on 'ractest1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'ractest1'
CRS-2673: Attempting to stop 'ora.evmd' on 'ractest1'
CRS-2673: Attempting to stop 'ora.asm' on 'ractest1'
CRS-2677: Stop of 'ora.gpnpd' on 'ractest1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'ractest1' succeeded
CRS-2677: Stop of 'ora.crf' on 'ractest1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'ractest1' succeeded
CRS-2677: Stop of 'ora.evmd' on 'ractest1' succeeded
CRS-2677: Stop of 'ora.asm' on 'ractest1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'ractest1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'ractest1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'ractest1'
CRS-2677: Stop of 'ora.cssd' on 'ractest1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'ractest1'
CRS-2677: Stop of 'ora.gipcd' on 'ractest1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'ractest1' has completed
CRS-4133: Oracle High Availability Services has been stopped.

Step 10  Start CRS on RACTEST1 and RACTEST2


[root@RACTEST1 bin]# ./crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
[root@RACTEST1 bin]#


[root@RACTEST2 bin]# ./crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
[root@RACTEST2 bin]#

Step 11  Make sure the cluster is up and running. Restart the cluster if it is down. In my case, the cluster was up and running.


[root@RACTEST1 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

[root@RACTEST1 bin]# ./crsctl check cluster
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
[root@RACTEST1 bin]#
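
Reading these status lines by eye works for two nodes, but the same check can be scripted. The sketch below treats the stack as healthy only when every `CRS-nnnn:` line ends in "is online"; the `all_services_online` function name is an assumption for this example, and the sample text is the `crsctl check crs` output shown above.

```shell
# Output of `crsctl check crs` as shown in this post.
check_crs_output='CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online'

# Succeeds only if there is at least one CRS status line
# and every such line ends with "is online".
all_services_online() {
    awk '
        /^CRS-[0-9]+:/ { seen++; if ($0 !~ /is online[ \t\r]*$/) bad++ }
        END { exit !(seen > 0 && bad == 0) }
    '
}

if printf '%s\n' "$check_crs_output" | all_services_online; then
    echo "all cluster services online"
fi
# prints: all cluster services online
```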


Step 12  Start the database and monitor the alert log

[oracle@RACTEST1 ~]$  srvctl status database -db govinddb
Instance govinddb1 is not running on node ractest1
Instance govinddb2 is not running on node ractest2
[oracle@RACTEST1 ~]$ srvctl start database -db govinddb
[oracle@RACTEST1 ~]$ srvctl status database -db govinddb
Instance govinddb1 is running on node ractest1
Instance govinddb2 is running on node ractest2

[oracle@RACTEST1 ~]$
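
To confirm that every instance came back without reading the output line by line, the `srvctl status database` text can be filtered for instances that are not running. This is a minimal sketch; the `instances_down` function name is made up, and the sample variables hold the outputs shown above.

```shell
# Outputs of `srvctl status database -db govinddb` from this post.
status_before='Instance govinddb1 is not running on node ractest1
Instance govinddb2 is not running on node ractest2'
status_after='Instance govinddb1 is running on node ractest1
Instance govinddb2 is running on node ractest2'

# Print the names of instances reported as not running; print nothing if all are up.
instances_down() {
    awk '/^Instance / && / is not running / { print $2 }'
}

printf '%s\n' "$status_before" | instances_down
# prints:
# govinddb1
# govinddb2
```

With the post-start output the function prints nothing, so an empty result means all instances are up.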

Additional notes

We do not need to bring the CRS service down when at least one working copy is intact.

CRS should be up on all nodes for the following operations (per Doc ID 428681.1):

1. Adding an additional voting disk to the disk group
2. Moving the voting disk
3. Deleting one of the voting disks from the disk group
4. Adding another copy of the OCR file on a different disk
5. Moving the OCR file to a different disk
6. Removing one copy of the OCR file

CRS should be up on ONLY one node, in exclusive mode, per Doc ID 1062983.1 in the following situations:

1. We have ONLY one copy of the OCR file and it is corrupted.
2. We have ONLY one copy of the voting disk and it is corrupted.
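
The rule of thumb in these notes can be written down as a tiny helper. It is only a restatement of the two cases above, not anything from the Oracle documents; the `crs_mode_for_repair` name and its "online" / "exclusive" answers are made up for illustration.

```shell
# Usage: crs_mode_for_repair INTACT_COPIES
# Prints "online" when at least one copy of the voting disk or OCR survives
# (CRS stays up on all nodes), and "exclusive" when no copy is left
# (CRS must be started on one node only, in exclusive mode).
crs_mode_for_repair() {
    if [ "$1" -ge 1 ]; then
        echo "online"
    else
        echo "exclusive"
    fi
}

crs_mode_for_repair 0
# prints: exclusive
```

The scenario in this post is the second case: external redundancy means a single voting disk copy, so after the corruption zero intact copies remain.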
