I am taking a break from my regular Beginners Guide series because today I ran
into an issue that was genuinely confusing. So confusing that, once I finally got
past it, I decided to blog about it, so that if you face a similar issue in the
near future you can find a solution quickly and won't go around pulling your hair
out.
We have a Production database that handles heavy OLTP traffic, and for disaster
recovery this server has a Data Guard (DG) standby set up. The Oracle version we
are using here is 10.2.0.4 Enterprise Edition on Sun SPARC.
Yesterday the client reported that the DG was not in sync with Production, and
that is where I came in. Logging into the database, I found that there was indeed
an archive log gap, and because of it the DG had stopped syncing with Production.
STANDBY APPLIED
-----------------
65783 NO
65523 YES
From the above output it is quite clear that there is an archive gap, and that the
recovery services stopped because of it. The first thing to do in such a scenario
is to check whether the missing archives are still present at Production. If they
are, phew! You are saved from a hectic recovery schedule, because you only need to
restore these archives from backup at Production, or transfer them from Production
to DR, and apply them there. If you are not so lucky and the archives are no
longer available at Production, you have to go the other way: take an incremental
RMAN backup at Production and restore it at the DR end. But that is a topic for
another blog.
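As a rough illustration of how to read that output, the two numbers bound the range of log sequences that Production must still be able to supply. The Python sketch below is not part of the original procedure; the function name `unapplied_range` is hypothetical, and it assumes the two values are the highest applied and highest not-yet-applied sequence numbers.

```python
def unapplied_range(max_applied, max_unapplied):
    """Return (first, last, count) of log sequences not yet applied.

    Assumption: max_applied is the highest sequence marked YES and
    max_unapplied is the highest sequence marked NO.
    """
    if max_unapplied <= max_applied:
        return None  # nothing is pending
    first = max_applied + 1
    last = max_unapplied
    return first, last, last - first + 1


# Values taken from the output above: 65523 YES, 65783 NO.
pending = unapplied_range(65523, 65783)
```

Every sequence in that range has to exist on disk or in a backup at Production before the simple copy-and-apply route is an option.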
In my case I was lucky: the retention policy for archives at Production is 3
days, so the archives were still there. So I simply transferred the missing
archives from Production to DR, stopped the media recovery process at DR,
recovered the DR with RMAN, and restarted the media recovery process. That should
have solved my problem, right? Well, I was wrong. So what went wrong? These are
the steps I followed.
Used plain scp to transfer the missing archives, sequences 65524 to 65533, from
Production to DR.
Stopped the media recovery process at DR:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
At first all was well. The recovery process started fine, the gap was filled, and
the sync completed. Then the real problem arose.
I checked the sync status at the DR end, and this was the output:
STANDBY APPLIED
-----------------
65794 YES
Seems fine. But no! The client was not convinced: he ran the query at Production,
and the output there was always NO.
So what is wrong here? Why is the output different at Production and DR when it
should be the same? After a lot of searching, I found that this is a bug in Oracle
10g whereby the media recovery process hangs for some reason, causing this issue.
I won't go into more detail about the bug, but if you have a Metalink account with
Oracle you can log in and go through Note: 1369630.1; if not, you can download
the documentation here.
Note that this error doesn't mean that Production and DR are out of sync. Because
the media recovery process is hung at the Production site, it is not able to
update the internal views. The solution is to upgrade to Oracle 11g or higher
(Oracle 12c has already been released), or, if possible, to restart the Production
database to work around the issue temporarily.
Hope this helps all you folks! Comments are welcome!
==============
Resolving Gaps in Data Guard Apply Using Incremental RMAN Backup
Recently, we had a glitch on a Data Guard (physical standby database) in our
infrastructure. This is not a critical database, so the monitoring was relatively
lax, and the fact that it is done by an outsourcer does not help either. In any
case, the laxness resulted in a failure remaining undetected for quite some time;
it was eventually discovered only when the customer complained. This standby
database is usually opened for read-only access from time to time. This time,
however, the customer saw that the data was significantly out of sync with the
primary and raised a red flag. Unfortunately, by this time it had become a rather
political issue.
Since the DBA in charge couldn't resolve the problem, I was called in. In this
post, I will describe the issue and how it was resolved. In summary, there are two
parts to the problem: what happened, and how to resolve it.
What Happened
Let's look at the first question: what caused the standby to lag behind? First, I
looked at the current SCNs of the primary and standby databases. On the primary:
SQL> select current_scn from v$database;
CURRENT_SCN
-----------
1447102
On the standby:
SQL> select current_scn from v$database;
CURRENT_SCN
-----------
1301571
Clearly there is a difference. But this by itself does not indicate a problem,
since the standby is expected to lag behind the primary (this is an asynchronous,
non-real-time apply setup). The real question is how far it is lagging in terms of
wall-clock time. To find out, I used the scn_to_timestamp function to translate
the SCN to a timestamp:
SQL> select scn_to_timestamp(1447102) from dual;
SCN_TO_TIMESTAMP(1447102)
-------------------------------
18-DEC-09 08.54.28.000000000 AM
I ran the same query to find the timestamp associated with the SCN of the standby
database as well (note that I ran it on the primary database, though, since it
will fail on a standby in mounted mode):
SCN_TO_TIMESTAMP(1301571)
-------------------------------
15-DEC-09 07.19.27.000000000 PM
This shows that the standby is lagging by two and a half days! The data at this
point is not just stale; it must be rotten.
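To put an exact number on that lag, the two timestamps can be subtracted. The following Python sketch is only an illustration: the helper names are hypothetical, it assumes the display format shown above (DD-MON-YY HH.MI.SS.FF AM), assumes two-digit years belong to the 2000s, and drops the fractional seconds, which are all zero here.

```python
import re
from datetime import datetime

_MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

_PATTERN = re.compile(
    r"(\d{2})-([A-Z]{3})-(\d{2}) (\d{2})\.(\d{2})\.(\d{2})\.\d+ (AM|PM)")


def parse_oracle_timestamp(text):
    """Parse a timestamp such as '18-DEC-09 08.54.28.000000000 AM'."""
    match = _PATTERN.fullmatch(text.strip().upper())
    if match is None:
        raise ValueError("unexpected timestamp format: %r" % text)
    day, mon, yy, hh, mi, ss, ampm = match.groups()
    # Convert the 12-hour clock to 24-hour (12 AM -> 0, 12 PM -> 12).
    hour = int(hh) % 12 + (12 if ampm == "PM" else 0)
    # Assumption: two-digit years are in the 2000s.
    return datetime(2000 + int(yy), _MONTHS[mon], int(day),
                    hour, int(mi), int(ss))


def standby_lag(primary_ts, standby_ts):
    """Return the lag of the standby behind the primary as a timedelta."""
    return parse_oracle_timestamp(primary_ts) - parse_oracle_timestamp(standby_ts)


lag = standby_lag("18-DEC-09 08.54.28.000000000 AM",
                  "15-DEC-09 07.19.27.000000000 PM")
```

For the two values in this post, `lag` works out to 2 days, 13 hours, 35 minutes and 1 second, which matches the "two and a half days" estimate.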
The next question is why it would be lagging so far behind. This is a 10.2
database, where the FAL server should automatically resolve any gaps in archived
logs. Something must have happened that caused the FAL (fetch archived log)
process to fail. To get that answer, I first checked the alert log of the standby
instance. I found these lines that showed the issue clearly:
This clearly showed the issue. On December 15th at 17:16:15, the Managed Recovery
Process encountered an error while receiving the log information from the primary.
The error was ORA-12514 TNS:listener does not currently know of service requested
in connect descriptor. This is usually the case when the TNS connect string is
incorrectly specified. The primary is called DEL1 and there is a connect string
called DEL1 in the standby server.
The connect string works well. Actually, right now there is no issue with the
standby getting the archived logs, so the connect string is fine, at least now.
The standby is receiving log information from the primary. There must have been
some temporary hiccup that kept that specific archived log from travelling to the
standby. If that log was somehow skipped (it could be an intermittent problem),
then it should have been picked up by the FAL process later on; but that never
happened. Since sequence# 700 was not applied, none of the logs received later
(701, 702 and so on) were applied either. This has caused the standby to lag
behind ever since.
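The blocking behaviour described above can be sketched in a few lines of Python: apply proceeds strictly in sequence order, so the first sequence that is absent holds back everything after it. This is only an illustration. The function name is hypothetical, and the values 699 and 703 in the example are made up; only 700, 701 and 702 come from the text.

```python
def find_blocking_sequence(last_applied, received):
    """Return (blocking, waiting) for a standby.

    blocking: the first sequence after last_applied that is not on the standby.
    waiting:  sequences already received that cannot be applied until the
              blocking one arrives (empty when there is no gap).
    """
    have = set(received)
    blocking = last_applied + 1
    # Logs that follow on contiguously could be applied, so skip over them.
    while blocking in have:
        blocking += 1
    waiting = sorted(seq for seq in have if seq > blocking)
    return blocking, waiting


# Hypothetical example: 699 was the last log applied, 700 never arrived,
# and 701 to 703 were received afterwards.
blocking, waiting = find_blocking_sequence(699, [701, 702, 703])
```

Here `blocking` is 700 and `waiting` holds 701 to 703. A non-empty `waiting` list is what separates a real gap from a standby that is merely waiting for the next log to be shipped.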
So, the fundamental question was why FAL did not fetch the archived log sequence#
700 from the primary. To get to that, I looked into the alert log of the primary
instance. The following lines were of interest:
The archived log simply was not available. The process could not see the file and
couldn't get it across to the standby site.
Upon further investigation I found that the DBA had actually removed archived
logs to make some room in the filesystem, without realizing that his action had
also removed the most current one, which was yet to be transmitted to the remote
site. The mystery of why FAL did not get that log was finally cleared up.
Solution
Now that I knew the cause, the focus shifted to the resolution. If archived log
sequence# 700 had been available on the primary, I could easily have copied it
over to the standby, registered the log file and let the managed recovery process
pick it up. But unfortunately, the file was gone and I couldn't just recreate it.
Until that log file was applied, the recovery would not move forward. So, what
were my options?
One option is of course to recreate the standby: possible, but not practical
considering the time required. The other option is to apply an incremental backup
of the primary taken from that SCN. That's the key: the backup must be from a
specific SCN. I have described the process since it is not very obvious. The
following shows the step-by-step approach for resolving this problem. I have
marked where each action must be performed: [Standby] or [Primary].
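1. [Standby] Stop the managed standby apply process:
SQL> alter database recover managed standby database cancel;
Database altered.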
2. [Standby] Shut down the standby database
3. [Primary] On the primary, take an incremental backup from the SCN where the
standby has been stuck:
RMAN> run {
2> allocate channel c1 type disk format '/u01/oraback/%U.rmb';
3> backup incremental from scn 1301571 database;
4> }
Database altered.
8. [Standby] Replace the controlfile with the one you just created in primary.
9. $ cp /u01/oraback/DEL1_standby.ctl /u01/oradata/standby_cntfile.ctl
11. [Standby] RMAN does not know about these files yet; so you must let it know by
a process called cataloging. Catalog these files:
$ rman target=/
Do you really want to catalog the above files (enter YES or NO)? yes
cataloging files...
cataloging done
13. After some time, the recovery fails with the message:
This happens because we have come to the last of the archived logs. The expected
archived log with sequence# 8008 has not been generated yet.
14. At this point exit RMAN and start the managed recovery process:
SQL> alter database recover managed standby database disconnect from session;
Database altered.
[Standby] SQL> select current_scn from v$database;
CURRENT_SCN
-----------
1447474
[Primary] SQL> select current_scn from v$database;
CURRENT_SCN
-----------
1447478
Now they are very close to each other. The standby has now caught up.
=================
Resolving Gaps in Data Guard Apply Using Incremental RMAN Backup:-
On the primary:
CURRENT_SCN
-----------
2076457
On the standby:
DATABASE_ROLE SWITCHOVER_STATUS
---------------- --------------------
PHYSICAL STANDBY NOT ALLOWED
CURRENT_SCN
-----------
1998045
Solution:-
1. [Standby] Stop the managed standby apply process:
3. [Primary] On the primary, take an incremental backup from the SCN where the
standby has been stuck:
8. [Standby] Rename the controlfile that already exists on the standby, then
replace it with the one you just created on the primary.
$ cp /u01/tspr/stdb_cont.ctl /u01/app/oracle/oradata/stdb/control01.ctl
Database altered.
10. [Standby] RMAN does not know about these files yet; so you must let it know
by a process called cataloging. Catalog these files:
Do you really want to catalog the above files (enter YES or NO)? yes
cataloging files...
cataloging done
List of Cataloged Files
=======================
File Name: /u01/tspr/29pij5mb_1_1
File Name: /u01/tspr/2apij5p1_1_1
If you get this error, follow the next step; the database must be mounted:
Database altered.
archived log for thread 1 with sequence 373 is already on disk as file
/u01/app/oracle/flash_recovery_area/STDB/orcl_373_1_850048081.arc
archived log file
name=/u01/app/oracle/flash_recovery_area/STDB/orcl_373_1_850048081.arc thread=1
sequence=373
unable to find archived log
archived log thread=1 sequence=374
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 09/15/2014 18:50:43
RMAN-06054: media recovery requesting unknown archived log for thread 1 with
sequence 374 and starting SCN of 2078522
Note:- After some time, the recovery fails with the message above. This happens
because we have reached the last of the archived logs.
12. At this point exit RMAN and start the managed recovery process:
SQL> alter database recover managed standby database disconnect from session;
Database altered.
CURRENT_SCN
-----------
2082113
DATABASE_ROLE SWITCHOVER_STATUS
---------------- --------------------
PRIMARY TO STANDBY