My team and I have been digging deep into Oracle database recovery lately, and we’ve noticed that some mistakes by administrators can jeopardize recoverability. Oracle’s Recovery Manager (RMAN) product does a great job of catching issues, but it’s still possible to induce these failures. This blog post explains how these problems, which we call “range gaps” in the recovery stream, occur and what can be done to correct them.
Restore and Recovery Range Gaps can be induced through improper use of Oracle’s Recovery Manager (RMAN) as well as through improper management of backup data stored in a Media Manager or other backup target. Oracle’s Zero Data Loss Recovery Appliance (ZDLRA) will prevent these problems, or will detect, report and allow them to be corrected as outlined below. Let’s start with a bit of terminology to set the stage.
Restore Range Gaps – Restore Range Gaps refer to gaps in recoverability of data file backups as part of a full plus incremental backup strategy using a combination of LEVEL0 and LEVEL1 backups. Lost Differential Incremental (LEVEL1) backups will essentially invalidate subsequent differential incremental backups.
Recovery Range Gaps – Recovery Range Gaps refer to gaps in recoverability of data due to lost ranges of redo logs.
Ordering Waits in ZDLRA – Oracle’s Zero Data Loss Recovery Appliance (ZDLRA) will place backup pieces into an Ordering Wait status when Restore Range Gaps are detected.
Fetch Archive Log – When Real Time Redo Protection is used with ZDLRA, Recovery Range Gaps can be reconciled by the appliance through the Fetch Archive Log (FAL) process.
Backup Polling – ZDLRA supports use of a “polling location” for ingesting backups. RMAN backups written in Disk Format can be directly ingested into ZDLRA through the polling feature.
How Restore Range Gaps Occur
In a non-ZDLRA environment, gaps in the Restore Range can be induced through improper use of RMAN. The example in the following diagram occurs when a DBA performs a supplementary LEVEL0 backup to disk at SCN 500:
In the example above, the LEVEL0 is either un-tagged or uses the same tag as backups sent to the primary backup solution. This means RMAN treats the LEVEL0 as part of the same backup chain, even though it has been sent to a different location, so the LEVEL1 backups that follow are taken relative to it.
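The effect on the incremental chain can be sketched with a small model. The Python below is an illustration only, not RMAN's actual logic: the `Backup` record, the location names, and every SCN other than the SCN 500 LEVEL0 are hypothetical. It assumes that each differential LEVEL1 captures blocks changed since the most recent LEVEL0 or LEVEL1 with the same tag, wherever that backup was written.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Backup:
    level: int      # 0 = LEVEL0 (full), 1 = differential LEVEL1
    scn: int        # checkpoint SCN of the backup (hypothetical values)
    location: str   # where the backup pieces were written (hypothetical names)


def assign_baselines(backups: List[Backup]) -> List[Tuple[Backup, Optional[int]]]:
    """Pair each backup with the SCN its changed blocks start from.

    Assumption: a differential LEVEL1 is based on the most recent LEVEL0 or
    LEVEL1 with the same tag, regardless of where that backup was sent.
    """
    result = []
    previous_scn = None
    for b in sorted(backups, key=lambda b: b.scn):
        result.append((b, None if b.level == 0 else previous_scn))
        previous_scn = b.scn
    return result


def check_chain(backups: List[Backup], location: str):
    """Walk the chain using only the backups stored at `location`.

    Returns (gap, stranded): `gap` is the (from_scn, to_scn) range whose
    changed blocks are missing at `location`, or None if the chain is intact;
    `stranded` lists the SCNs of LEVEL1 backups that cannot be applied.
    """
    covered_to = None   # SCN up to which the chain at `location` is complete
    gap = None
    stranded = []
    for b, baseline in assign_baselines(backups):
        if b.location != location:
            continue
        if b.level == 0 or baseline == covered_to:
            covered_to = b.scn          # chain extends (or restarts on a LEVEL0)
        else:
            if gap is None:
                gap = (covered_to, baseline)
            stranded.append(b.scn)
    return gap, stranded


backups = [
    Backup(0, 100, "primary"),
    Backup(1, 200, "primary"),
    Backup(1, 300, "primary"),
    Backup(1, 400, "primary"),
    Backup(0, 500, "disk"),     # the supplementary LEVEL0 from the example
    Backup(1, 600, "primary"),
    Backup(1, 700, "primary"),
]
gap, stranded = check_chain(backups, "primary")
print(gap, stranded)  # (400, 500) [600, 700]
```

In this model the blocks changed between SCN 400 and SCN 500 exist only in the disk LEVEL0, so the LEVEL1 backups at SCN 600 and SCN 700 on the primary target have nothing to be applied to.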
Restore Range Gaps in ZDLRA
ZDLRA is susceptible to the same mistake as shown in the previous section, but ZDLRA will detect the gap and allow it to be filled. The example below shows that Virtual L0 backups stop being generated due to the Restore Range Gap.
Gap Detection – ZDLRA will identify Restore Range Gaps, and will place subsequent backup pieces into Ordering Wait status. The above diagram shows a LEVEL0 backup being taken to an auxiliary location such as a disk location. The resulting LEVEL0 backup includes changed blocks that are critical to database recovery. Generation of Virtual L0 backups will stop when a Restore Range Gap is detected. The LEVEL1 backups following the range gap will be placed into Ordering Wait status until the gap is resolved.
Gap Resolution – ZDLRA is able to correct this problem by simply “polling” the LEVEL0 into the Delta Store. The LEVEL0 backup contains un-changed blocks, but also contains the changed blocks required to fill the Restore Range Gap. ZDLRA will de-duplicate the data by simply discarding the un-changed blocks.
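The Ordering Wait behavior and its resolution can be pictured with a toy model. The Python sketch below is not how ZDLRA is implemented; the `DeltaStore` class, its methods, and the SCN values are hypothetical. It assumes only what is described above: a LEVEL1 whose baseline is missing is held, and held backups are released once a backup covering the missing range is ingested.

```python
class DeltaStore:
    """Toy model of ordered ingestion (hypothetical, for illustration only)."""

    def __init__(self):
        self.covered_to = None    # SCN up to which a Virtual L0 could be built
        self.ordering_wait = {}   # baseline SCN -> SCN of the held LEVEL1

    def ingest(self, level, scn, baseline=None):
        """Ingest a backup; a LEVEL1 carries the SCN it is based on."""
        if level == 0:
            # A LEVEL0 holds every block as of `scn`, so it can extend coverage.
            if self.covered_to is None or scn > self.covered_to:
                self.covered_to = scn
        elif baseline == self.covered_to:
            self.covered_to = scn
        else:
            # Baseline not present yet: hold the piece instead of indexing it.
            self.ordering_wait[baseline] = scn
            return
        self._release()

    def _release(self):
        """Index any held LEVEL1 backups whose baseline has now arrived."""
        while self.covered_to in self.ordering_wait:
            self.covered_to = self.ordering_wait.pop(self.covered_to)


store = DeltaStore()
store.ingest(0, 100)
store.ingest(1, 200, baseline=100)
store.ingest(1, 300, baseline=200)
store.ingest(1, 400, baseline=300)
store.ingest(1, 600, baseline=500)   # based on the LEVEL0 that went to disk
store.ingest(1, 700, baseline=600)
before = (store.covered_to, sorted(store.ordering_wait.values()))

store.ingest(0, 500)                 # "poll" the stray LEVEL0 into the store
after = (store.covered_to, sorted(store.ordering_wait.values()))
print(before, after)  # (400, [600, 700]) (700, [])
```

Once the SCN 500 LEVEL0 arrives, both held LEVEL1 backups are released in order and coverage advances to SCN 700.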
Intermediate Recovery Range Gaps
Recovery Range Gaps occur due to improper management of redo logs. Recovery Range Gaps represent an unrecoverable range in the timeline, meaning the database simply cannot be recovered into that gap. The example below shows a Recovery Range Gap approximately from SCN 425 to SCN 575.
Existence of the associated LEVEL1 backups means that the database can be recovered to points prior to SCN 425, or after SCN 600, but cannot be recovered to points between SCN 425 and SCN 600. Recovery Range Gaps like this occur in one of three ways:
- Deletion of Redo on Source Prior to Backup
- Failed Backup Processing
- Deletion of Redo on Backup Target
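Whatever the cause, a gap in the redo stream shows up as a break in the SCN ranges covered by the surviving archived logs. The following is a minimal sketch in Python rather than a query against a real catalog. The `(first_scn, next_scn)` tuples loosely echo the `FIRST_CHANGE#` and `NEXT_CHANGE#` columns of `V$ARCHIVED_LOG`, and the values are hypothetical, chosen to mirror the approximate SCN 425 to SCN 575 gap in the example above.

```python
def find_redo_gaps(logs):
    """Return the SCN ranges not covered by any log in `logs`.

    Each log is a (first_scn, next_scn) tuple holding redo for SCNs in
    [first_scn, next_scn). Only gaps between the earliest and latest
    logs are reported.
    """
    gaps = []
    covered_to = None
    for first_scn, next_scn in sorted(logs):
        if covered_to is not None and first_scn > covered_to:
            gaps.append((covered_to, first_scn))
        covered_to = next_scn if covered_to is None else max(covered_to, next_scn)
    return gaps


# Hypothetical archived log ranges with a missing stretch in the middle.
logs = [(300, 350), (350, 425), (575, 650), (650, 700)]
gaps = find_redo_gaps(logs)
print(gaps)  # [(425, 575)]
```

A check like this only reports that redo is missing. Whether a usable recovery point exists on the far side of the gap still depends on having a data file backup taken after it.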
Redo logs are stored on each database server, either in a Log Archive Destination, or in a Fast Recovery Area. The Log Archive Destination (specified by the LOG_ARCHIVE_DEST_n parameter) configuration is supported for backward compatibility. Customers should implement the Fast Recovery Area feature instead of using the older Log Archive Destination configuration.
The Fast Recovery Area simplifies management of redo logs and is designed to prevent improper deletion of redo that can jeopardize recoverability. For more information on the Fast Recovery Area feature, please see the Oracle Database documentation.
The Fast Recovery Area allows DBAs to manage redo according to policies, and prevents redo from being deleted before it is backed up. Redo is marked as eligible for deletion only after it has been backed up. Administrators can still override this setting and delete redo that hasn’t been backed up, but doing so will create a Recovery Range Gap.
We have seen Recovery Range Gaps generated due to failed backup processing. Some 3rd party backup products have been known to NOT send error messages to RMAN even though the backup data was not saved. One method to detect such problems is to execute the following RMAN command:
RMAN> CROSSCHECK BACKUP;
The crosscheck command checks the contents of the RMAN repository against the 3rd party media catalog and marks any backup pieces it cannot find as EXPIRED.
Trailing Recovery Range Gaps
Trailing Recovery Range Gaps result in the same problem as described above, but are more likely to be caused by improper deletion of redo on the backup target instead of on the source. The following diagram illustrates this type of failure:
The above example indicates that REDO is being deleted too quickly even though it is required for recoverability. Media recovery cannot be performed in any range prior to SCN 400 in the above example. If all redo prior to SCN 380 has been deleted, this seems to indicate an improper deletion policy on the backup target.
Depending on change rates, redo logs can represent significant space consumption. Redo logs tend to compress poorly, and they simply will not de-duplicate because each redo log contains unique (non-duplicate) data by definition.
ZDLRA Resolves Recovery Range Gaps
ZDLRA is able to automatically resolve Recovery Range Gaps, and will ALERT when Recovery Range Gaps are detected. The following diagram shows this capability:
ZDLRA will detect Recovery Range Gaps and fill those gaps using the FAL (Fetch Archive Log) process, which reaches back into the FRA (Fast Recovery Area) of the protected database. For databases not configured for Real Time Redo Protection, any un-transmitted redo will be sent to ZDLRA using the following backup command:
RMAN> BACKUP DATABASE ... PLUS ARCHIVELOG NOT BACKED UP...
The standard RMAN backup command for use with ZDLRA always includes backup of archived redo that has not been backed-up, even when Real Time Redo Protection is enabled. This configuration provides a failsafe that ensures the redo will be swept from the FRA regardless of whether Real Time Redo Protection is functioning or not.
Range Gaps occur in both the “restore” stream as well as in the redo log or “recovery” stream of data sent to a backup target. Any experienced database administrator knows that databases cannot be recovered to SCNs that reside within a range of redo that has been lost. It’s important to also know that loss of incremental LEVEL1 backups will create a similar gap in the Restore Range represented by a set of LEVEL0 and LEVEL1 backups. When such Range Gaps occur, database administrators won’t know these problems exist unless a database recovery is attempted. Oracle’s Zero Data Loss Recovery Appliance (ZDLRA) will identify and alert when Restore and Recovery Range Gaps occur. These gaps can also be corrected using some of the unique features of ZDLRA, preserving database recoverability and meeting business needs for data protection.