Simple Snapshots – Guaranteed Data Loss

This blog post is a continuation of my series on the use of storage snapshots for backup/recovery of Oracle databases.  #SnapshotBackups

Elegant Simplicity Brings Limitations

It’s very easy to see why snapshots seem so magical, especially in the Simple Snapshot implementation.  This is a very simple solution and quite easy to understand.  However, there are several disadvantages as we shall see below.  The simplicity is both a blessing and a curse of this design.

Simple Snapshot Configuration

In a simple snapshot configuration, the entire database and online redo logs reside within a single snapshot or consistency group.  A snapshot is created at a consistent point in time for all of those data structures.

[Figure: a database and its online redo logs in a single snapshot or consistency group, with three snapshots taken over time]

This example shows a database and online redo logs that belong to a single snapshot or consistency group.  Three snapshots are taken over the course of time as indicated in red, orange, and blue.  Snapshots always reside within the same disk array as the primary copy of the database, so the disk array is a single point of failure.  Storage Administrator intervention is required to revert to (restore) a snapshot, so the DBA cannot work alone during recovery.  Finally, the simple snapshot configuration only allows you to REVERT to a snapshot, not to RECOVER to points in time (or SCNs) between snapshots.

Fundamental Challenges

The fundamental challenge is that Oracle databases consist of a set of interrelated structures that are constantly changing.  Oracle databases are never “quiesced” unless they are completely shut down for a COLD backup, and very few customers can tolerate database downtime simply to perform a backup.  The following diagram shows some of the object relationships that exist within an Oracle database.

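[Figure: object relationships within an Oracle database, including the EMP and DEPT tables, their indexes, the Data Dictionary, UNDO, and REDO]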

The diagram above shows a very simple database consisting of 2 tables, which are the EMP (Employee) and DEPT (Department) tables.  Each employee belongs to a department, so these two tables have a relationship defined between them.  Each table typically has one or more indexes defined on it, and indexes are obviously related to the data in the tables.  All of the indexes, tables, and other structures are recorded in the database metadata, which is stored in the Data Dictionary.  As transactions are executed against the database, Oracle stores UNDO information (needed to back-out changes) as well as REDO information, which is needed during database recovery.

All of the data in the database is mapped onto a set of files.  The database is constantly changing, and the underlying files are constantly changing.  Storage snapshots essentially emulate what happens during a system crash.  One of the core features of any database engine is to recover from a system crash without causing corruption in the database.  The Simple Snapshot configuration leverages this core feature of the database engine.

Archived Redo – Not Necessary and Not Usable

The Simple Snapshot configuration gives the ability to REVERT the entire database to specific points in time, but does not allow for RECOVERY to points on the timeline between snapshots.  In this configuration, there is no reason to use ARCHIVELOG MODE in the database, and no reason to create or back up archived redo logs.  The online logs contain all information needed to perform crash recovery when you REVERT to a snapshot.
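To make the exposure concrete, here is a minimal Python sketch of what REVERT-only means.  It is not tied to any storage vendor's tooling; the function names, the snapshot schedule, and the failure time are all made-up values for illustration.

```python
from bisect import bisect_right
from datetime import datetime

def revert_target(snapshot_times, failure_time):
    """Return the newest snapshot taken at or before failure_time.

    snapshot_times must be sorted in ascending order.
    """
    idx = bisect_right(snapshot_times, failure_time)
    if idx == 0:
        raise ValueError("no snapshot exists before the failure")
    return snapshot_times[idx - 1]

def data_loss_window(snapshot_times, failure_time):
    """Work lost by reverting: everything between the snapshot and the failure."""
    return failure_time - revert_target(snapshot_times, failure_time)

# Hypothetical schedule: three snapshots in one day.
snapshots = [
    datetime(2017, 1, 3, 0, 0),
    datetime(2017, 1, 3, 8, 0),
    datetime(2017, 1, 3, 16, 0),
]
failure = datetime(2017, 1, 3, 15, 30)  # hypothetical time of the failure

lost = data_loss_window(snapshots, failure)
print(f"Revert to {revert_target(snapshots, failure)}, losing {lost} of transactions")
```

With these assumed values the only usable target is the 08:00 snapshot, so seven and a half hours of committed work are gone.  Without archived redo there is nothing to apply on top of the snapshot to close that gap.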

Single Point of Failure

Since the database and all snapshots (the backups) reside within a single disk array, the array itself is a single point of failure.  I have personally seen the failure of an entire disk array on virtually every brand of disk array on the market.  It’s possible to lose the entire disk array, and this does happen in reality.  Please be sure to EXTERNALIZE the backup somewhere else if you are using storage snapshots for backups.

Plan for the Worst, but Hope for the Best

Assuming you understand the exposure of having your database and all of the backups (snapshots) residing inside a single disk array, you should understand why it’s important to make a copy of those backups somewhere else (external to the disk array).  While it’s very FAST to simply revert to a snapshot, this is your BEST CASE recovery time.

When you set expectations with your business users, be sure to plan for the worst and hope for the best.  Quote to your users the time it takes to bring the database back from the EXTERNAL location, then be happy if you’re able to use a snapshot instead.


Overview of Backup using Snapshots

This blog posting is one in a series of posts from my years working in the Oracle field support organization known as Advanced Customer Services (ACS).  #DatabaseParamedics

I will also be making a series of blog posts regarding how customers use storage snapshots for backup/recovery of Oracle databases.  #SnapshotBackups

Magical Qualities of Storage Snapshots

Storage Snapshots seem to have a magical quality when it comes to data protection, but not so magical once you’ve seen the dark side.  One of my first managers (before I joined Oracle) used to say that people often need a “Significant Emotional Event” before they truly understand the concept of risk.  In this series of blog posts, I hope everyone will come to understand in more detail why Oracle backup/recovery experts don’t like using snapshots for backups.  I also hope you can avoid having your own Significant Emotional Event and learn from the experience of others.

Snapshots for Cloning & Transient Fallback – Good Idea!

It’s important to begin by saying that storage snapshots have a place in this world.  Snapshots are excellent for cloning Oracle databases, especially with thin-provisioned snapshots.  The typical Oracle EBS development environment might have 6 application modules and 6 project phases, resulting in 36 development instances (6×6=36).  All of these 36 instances only have slight variations in the data they contain, making them perfect candidates for thin-provisioned cloning.
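The storage arithmetic behind that claim can be sketched in a few lines of Python.  The database size and the percentage of blocks each clone changes are hypothetical numbers chosen only to show the shape of the comparison; real thin-provisioning overhead depends on the array.

```python
def full_copy_storage_gb(base_gb, clones):
    """Storage for the source plus one full physical copy per clone."""
    return base_gb * (1 + clones)

def thin_clone_storage_gb(base_gb, clones, changed_pct):
    """Storage for the source plus only the changed blocks of each clone."""
    per_clone_gb = base_gb * changed_pct // 100
    return base_gb + clones * per_clone_gb

modules, phases = 6, 6
clones = modules * phases   # 36 development instances
base_gb = 2000              # hypothetical size of the source database
changed_pct = 2             # hypothetical share of blocks each clone modifies

print("full copies:", full_copy_storage_gb(base_gb, clones), "GB")
print("thin clones:", thin_clone_storage_gb(base_gb, clones, changed_pct), "GB")
```

Under these assumptions, 36 full copies need 74,000 GB while 36 thin clones need 3,440 GB, because the clones share every block they have not changed.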

Transient fallback is another great use-case for storage snapshots, providing an easy method to back-out from a failed upgrade including system (O/S) upgrade, database engine upgrade, application upgrade, etc.  It’s still critical to have a proper backup & recovery solution, but snapshots can provide additional options.  Just don’t rely on snapshots as your only data protection solution.

17+ Variations of Snapshots

I have encountered no fewer than 17 variations of storage snapshot implementations for protecting Oracle databases.  It normally takes a lengthy conversation to fully understand how the customer has implemented snapshots in their environment and what exposure they might have.

  • Simple Snapshot of primary DB storage
  • Simple Snapshot of primary DB storage with Storage Replication
  • Simple Snapshot of primary DB storage with Sweep
  • Recoverable, Multi-Snapshot of primary DB storage
  • Recoverable, Multi-Snapshots of primary DB storage with Storage Replication
  • Recoverable, Multi-Snapshots of primary DB storage with Sweep
  • Snapshot of primary DB storage using User Managed Backup
  • Snapshot of primary DB storage using User Managed Backup with Redo Snapshot
  • Snapshots as RMAN Proxy Copy
  • Snapshots as RMAN Proxy Copy with Storage Replication
  • Snapshots as RMAN Proxy Copy with Sweep
  • Snapshot of RMAN Backup Sets
  • Snapshots of RMAN Backup Sets with Storage Replication
  • Snapshots of RMAN Backup Sets with Sweep
  • Snapshots of RMAN Incremental Merge
  • Snapshots of RMAN Incremental Merge with Storage Replication
  • Snapshots of RMAN Incremental Merge with Sweep

Each of these implementations brings a variety of advantages and disadvantages.  The sheer variety of implementations is one indication that snapshots aren’t so “magical” after all.

Simple Snapshots

The simplest to understand and most “elegant” snapshot solution is what I call the simple snapshot configuration.  I have also called this a guaranteed data loss configuration, since you can only REVERT to a previous snapshot rather than applying redo logs to recover to a specific SCN.  In the simple snapshot configuration, the entire database is contained within a single snapshot, including data files, online redo logs, control files, etc.

In a simple snapshot configuration, there is no reason to run the database in Archive Log Mode because it’s not possible to recover the database.  The only option is to REVERT the database to the time of a previous snapshot.

It’s also important to note that a “valid” simple snapshot emulates a database “crash” by snapshotting all database files (data files, redo logs, control files, etc.) at a specific moment in time.  This technique cannot be used on storage that doesn’t provide consistency across all files or across all disk volumes under the database.

Storage Replication vs. Sweep

There is essentially very little functional difference between use of storage replication versus “sweep” processing, but these are vastly different to implement.  Sweep processing means that you are writing code (usually shell script) to copy files to another location.  Storage Replication is a feature of a disk storage array or NAS filer.  You get more control with a custom-written “sweep” process, but nothing is automated out of the box.
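As an illustration of what a custom “sweep” involves, here is a minimal sketch in Python rather than shell.  It assumes the source directory is a mounted, read-only snapshot (copying live database files this way would not give a consistent image), and the function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path, chunk_size=1024 * 1024):
    """Checksum a file in chunks so large data files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def sweep(source_dir, target_dir):
    """Copy every file under source_dir to target_dir and verify each copy.

    Returns the relative paths that were copied.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    copied = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = target_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy contents and timestamps
        if sha256_of(src) != sha256_of(dst):
            raise IOError(f"checksum mismatch after copying {rel}")
        copied.append(str(rel))
    return copied
```

Everything a replication feature gives you for free (scheduling, retries, retention, alerting) would still have to be written around a function like this, which is the trade-off described above.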

Recoverable Snapshots

The word “recoverable” refers to Oracle database recovery, or the “log apply” process, which moves the database forward in time to the desired SCN.  The recoverable snapshot configuration means that online redo, archived redo, and control files are kept in a separate snapshot from the data files.  You then revert the data files to the desired snapshot and use the redo logs (archived and online) to move the database forward in time to the desired SCN.

One challenge with Recoverable Snapshots is that the Oracle database is never quiesced, and data files can have blocks written at a higher SCN than the data file checkpoint SCN.  This simple fact means that you must determine the highest SCN written across all data files in the database, then recover to an SCN that’s equal to or higher than that high-written SCN.  Oracle 12c includes a feature that can be used to determine the high-written SCN using the timestamp of the snapshot, so this will get easier moving forward.
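The rule itself is simple to state in code.  The sketch below assumes you have already obtained a high-written SCN for each data file in the snapshot by some means (how to obtain it is not shown here); the file names and SCN values are invented for illustration.

```python
def minimum_recovery_scn(high_written_scns):
    """Lowest SCN the database may be recovered to after reverting the data files.

    high_written_scns maps each data file name to the highest SCN found
    in any block of that file within the snapshot.
    """
    if not high_written_scns:
        raise ValueError("no data files supplied")
    return max(high_written_scns.values())

def check_recovery_target(target_scn, high_written_scns):
    """Reject a recovery target that falls below the high-written SCN."""
    floor = minimum_recovery_scn(high_written_scns)
    if target_scn < floor:
        raise ValueError(
            f"target SCN {target_scn} is below the high-written SCN {floor}"
        )
    return target_scn

# Hypothetical per-file values for one snapshot.
snapshot_files = {
    "system01.dbf": 1048210,
    "users01.dbf": 1048305,
    "undotbs01.dbf": 1048297,
}
print(minimum_recovery_scn(snapshot_files))
```

Any target below the maximum would leave at least one file containing blocks “from the future” relative to the rest of the database, which is why the floor is the maximum and not the minimum.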

Snapshots with User Managed Backups

The legacy Oracle database backup method used BEGIN and END commands to signal the start and end of a hot backup.  The BEGIN and END commands were originally done for each tablespace, but Oracle later added the ability to put the entire database into hot backup mode.  This old feature generated a spike in redo log activity, since the DBMS writes entire BLOCKS into the redo log for any blocks written while in hot backup mode.

Oracle supports the old User Managed Backups mainly for backward compatibility.  Some storage technologies still aren’t able to generate a consistent snapshot across all disk volumes that underlie a database, making this legacy feature a necessity.
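The ordering of a snapshot taken under hot backup mode can be sketched as follows.  The SQL statements are the standard whole-database BEGIN and END commands plus a log switch; `execute_sql` and `take_snapshot` are hypothetical callables standing in for your database connection and your storage array's snapshot command.

```python
from contextlib import contextmanager

@contextmanager
def hot_backup_mode(execute_sql):
    """Put the database in hot backup mode and always take it out again."""
    execute_sql("ALTER DATABASE BEGIN BACKUP")
    try:
        yield
    finally:
        # Runs even if the snapshot fails, so the database is not left
        # in backup mode generating extra redo.
        execute_sql("ALTER DATABASE END BACKUP")

def snapshot_in_hot_backup_mode(execute_sql, take_snapshot):
    """Take a storage snapshot while the database is in hot backup mode."""
    with hot_backup_mode(execute_sql):
        snapshot_id = take_snapshot()
    # Archive the current log so the redo covering the backup is available.
    execute_sql("ALTER SYSTEM ARCHIVE LOG CURRENT")
    return snapshot_id

# Demonstration with stand-ins: record the statements instead of running them.
issued = []
snapshot_id = snapshot_in_hot_backup_mode(issued.append, lambda: "snap-001")
print(snapshot_id, issued)
```

Keeping the window between BEGIN and END as short as possible matters here, since the redo spike described above lasts for as long as the database stays in backup mode.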

This old User Managed Backup feature (of course) doesn’t involve RMAN, so doesn’t have any sort of catalog.  You will have to develop your own scheme for locating backups, determining which files belong together as a group, ensuring all of the backup files are handled as a group, etc.
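One possible shape for such a scheme is a small manifest written alongside each backup.  This is only a sketch of the idea, not a replacement for the RMAN catalog; the field names, backup ID, and file names are all hypothetical.

```python
import json

def build_manifest(backup_id, taken_at, files):
    """Record which files belong together as one backup."""
    return {"backup_id": backup_id, "taken_at": taken_at, "files": sorted(files)}

def missing_files(manifest, files_present):
    """List the files the manifest expects but that are not present."""
    return sorted(set(manifest["files"]) - set(files_present))

manifest = build_manifest(
    "hot_20170103_0100",
    "2017-01-03T01:00:00",
    ["users01.dbf", "system01.dbf", "control01.ctl"],
)
print(json.dumps(manifest, indent=2))

# Before restoring, confirm the whole group is available.
print(missing_files(manifest, ["system01.dbf", "users01.dbf"]))
```

In this example the check reports that `control01.ctl` is missing, which is exactly the kind of incomplete group you want to discover before a restore rather than during one.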

Proxy Copy

For storage arrays that support the “proxy copy” feature of RMAN, this is a good way to use snapshots and still get benefits from RMAN such as RMAN Catalog capability.  Third-party vendor solutions such as Commvault’s IntelliSnap implement the Proxy Copy capability of RMAN.  Some conventional Media Manager products also work with RMAN Proxy Copy.

Snapshots of RMAN Backup Sets

An RMAN Backup Set is a discrete set of files that won’t change after they are generated, so this doesn’t seem to be a good use of snapshot technologies.  However, I have seen customers who implemented snapshots of RMAN Backup Sets in order to prevent the DBA from deleting the backups.  The DBA has access to the files of the Backup Set and can freely modify or delete those files, but the DBA cannot modify the snapshot.

Of course once you implement snapshots of RMAN Backup Sets, you can then either replicate or sweep those backups to another location (usually on a different disk array).

Snapshots of Incremental Merge

The most intriguing implementation of snapshots is in combination with the RMAN Incremental Merge feature.  RMAN includes a feature that allows you to generate an Image Copy backup, then generate incremental backups and merge those incrementals into the image copy.  This is functionally very similar to Oracle Data Guard, in that you have a database that is constantly rolling forward.  The main differences are that the database moves forward through incremental apply instead of through log-apply, and the roll forward is typically done using horsepower of the primary database server.  Incremental Merge typically happens once per day, but you can match that capability with Data Guard using a delay setting.

Once you have a database being moved forward in time using Incremental Merge, you then layer storage snapshots on top, providing multiple “restore points” for that copy of the database.  You can also add storage replication, making a remote copy of the Image Copy to another location.
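The interplay between merging and snapshotting can be modeled with a toy example.  This is a conceptual sketch only: it treats the image copy as a map of block IDs to contents and does not call RMAN or any storage API; the class and method names are made up.

```python
class ImageCopy:
    """Toy model of an image copy rolled forward by incremental merges."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)  # block id -> contents
        self.snapshots = {}         # label -> frozen view of the blocks

    def merge_incremental(self, changed_blocks):
        """Apply an incremental backup: overwrite or add only the changed blocks."""
        self.blocks.update(changed_blocks)

    def snapshot(self, label):
        """Freeze the current state of the image copy under a label."""
        self.snapshots[label] = dict(self.blocks)

    def restore_point(self, label):
        """Return the image copy as it looked when the snapshot was taken."""
        return dict(self.snapshots[label])

copy = ImageCopy({1: "a", 2: "b"})
copy.snapshot("day0")

copy.merge_incremental({2: "b2"})           # day 1 incremental
copy.snapshot("day1")

copy.merge_incremental({1: "a2", 3: "c"})   # day 2 incremental
copy.snapshot("day2")

print(copy.restore_point("day1"))
```

Without the snapshots, only the latest merged state would exist.  With them, each day's version of the image copy remains available as its own restore point.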

In my next blog post on this topic, I will get into details of the “simple snapshot” implementation including why this is a guaranteed data loss solution.