Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • Lesson 12 – Tuning Oracle RMAN Jobs
  • Author: Dave Anderson, SkillBuilders
  • www.skillbuilders.com
  • Last Update: October 17, 2007

Other Text:
(Examples or comments displayed on slide, if any).

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • This lesson provides a solid introduction to tuning RMAN backup and recovery operations. You will learn:
    • How to set reasonable goals.
    • How fast RMAN can go – by reviewing statistics from Oracle Corp’s benchmark tests.
    • What factors impact the speed of the operations.
    • How to capture statistics to monitor the performance of the operations. (i.e. What tools to use.)
    • How to check the status of an RMAN operation – including determining if a job is waiting for a resource.
  • Most of the theories presented herein are validated with tests that show the improvement of a change.

Other Text:
(Examples or comments displayed on slide, if any).

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • How fast will your backups go? Well, that depends on factors such as the transfer rates of your disk and tape devices (the connection from the tape/disk device to the computer can also be a limiting factor). You will not be able to make an RMAN operation go faster than the cumulative speed of your disk and/or tape devices.

Other Text:
(Examples or comments displayed on slide, if any).

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • Oracle conducted a performance test a the Sun Customer Benchmark Center in October of 2001. The results are documented in the white paper “Oracle Recovery Manager: Performance Testing at Sun Customer Benchmark Center” (An Oracle White Paper October 2001). This white paper is available at http://technet.oracle.com. (Search using the document title.)
  • Oracle used the “Stripe and Mirror Everything” (S.A.M.E.) volume management as described in the white paper “Optimal Storage Management Made Easy” by Juan Loaiza. Refer to this white paper (also available at http://technet.oracle.com) and the section on S.A.M.E. later in this lesson.

Other Text:
(Examples or comments displayed on slide, if any).

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • In this laboratory environment, Oracle was able to achieve backup speeds of approximately 530 GB per hour using 20 output tape drives. Adding 20 additional tape drives increased the throughput to approximately 1TB per hour (approximately doubling the throughput), showing near linear scalability with the addition of tape drives.

Other Text:
(Examples or comments displayed on slide, if any).

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • Let's discuss the factors that will impact the performance of your RMAN operations. I will list them here (with a brief note on each), then look at each in more detail and incorporate these factors in case studies throughout the remainder of this lesson.
  • The performance factors include:
    • Disk configuration – The number of disks you have containing the Oracle data to be backed up will have a definite impact on RMAN performance. Using an extreme example, imagine the performance bottleneck if only a single device was available to hold all your Oracle data. The backup or restore operation could not go any faster that the transfer rate of that one device. The SAME document summarizes the issue as follows: “stripe data to equalize the workload across disks and eliminate hotspots” and “stripe data to enable many disks to serve requests for any subset of data”. Striping will have a positive impact on performance. Device transfer rate will limit performance. The SAME methodology also finds that, for availability, mirroring will perform better that RAID-5.


  • Notes for this slide continue on the next page…

Other Text:
(Examples or comments displayed on slide, if any).

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

    • Output device configuration – Whether you use disk or tape for your output device, the number of available devices will have a large impact on RMAN performance. Of course, the specifications of the output devices will have an impact. You will not be able to backup or recover faster than the cumulative transfer rate of these devices.
    • RMAN settings – Several RMAN settings such as the degree of parallelism and multiplexing will have a performance impact. Of course, we will study them in this lesson.
    • Database settings – We will need to tune memory areas and allocate I/O slaves if we use synchronous devices such as tapes (to implement asynchronous operations to those devices).
    • Processor speed – How fast can your processor issue I/O requests and verify Oracle data? Though the I/O factors are usually the problem, CPU speed can be a factor.
    • Existing load on server – Of course, the existing load on your server will impact how long an RMAN job will take. Usually there’s not a lot we can do about that.


Other Text:
(Examples or comments displayed on slide, if any).

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • Oracle 10g provides a new view call V$RMAN_BACKUP_JOB_DETAILS. This view contains elapsed time and read rate for RMAN jobs and should make the PL/SQL package runstats_pkg presented later in this chapter unnecessary.
  • I will present several test cases in this lesson. The tools we will use to capture RMAN runtime statistics are:
    • V$BACKUP_SYNC_IO – This view contains statistics on “ongoing and recently completed” RMAN synchronous I/O operations. We will see that we actually do not want to see statistics here because we will prefer asynchronous I/O. So we will just query this view to insure there is nothing here – confirming that we are using asynchronous I/O.
    • V$BACKUP_ASYNC_IO - This view contains statistics on “ongoing and recently completed” RMAN asynchronous I/O operations.
    • RUNSTATS.SQL – This script (supplied with the course book) is derived from a Tom Kyte script (see http://asktom.oracle.com). This script uses V$MYSTAT to measure the total overall elapsed time for the RMAN job.
  • Refer to the Oracle Database Reference manual for detailed information on these V$ views.

Other Text:
(Examples or comments displayed on slide, if any).

Oracle 10g R2 and above<br>

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • RMAN reports its activity to the V$SESSION_LONGOPS view. Therefore we can query this view to determine the status of a job and also determine if a job is hung (perhaps waiting for a tape mount). The example shown above displays activity that is not complete (sofar <> totalwork). It also shows the LAST_UPDATE. If a job is taking longer than you think it should, query again after a couple of minutes. If the SOFAR value or the LAST_UPDATE value has not changed, try to find a reason that the session is hung. The V$SESSION_WAIT view can assist here (though it will not show a tape media manager event).
  • A more detailed version of this query – including a query of V$SESSION_WAIT – is supplied in the script file V$SESSION_LONGOPS.SQL.

Other Text:
(Examples or comments displayed on slide, if any).

1 select sid, opname, sofar, totalwork,
2 to_char(last_update_time, 'mm-dd hh:mi')
3 as last_update
4 from v$session_longops
5* where sofar <> totalwork
system@TEST> /

SID OPNAME SOFAR TOTALWORK LAST_UPDATE
---------- ------------------------- ---------- ----------
15 RMAN: aggregate input 108990 124960 02-02 03:37
16 RMAN: aggregate output 65980 0 02-02 03:37

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • We see in this example the Oracle reports the wait event as “sbtbackup”. I have been waiting for 1000 seconds (16 minutes) for a tape to be mounted.
  • Chapter 12 of the Oracle 10g Backup and Recovery Advanced User’s Guide and Chapter 15 of the Oracle9i Recovery Manager User’s Guide lists all the possible tape-related events:
    • sbtinit sbtopen sbtread sbtwrite sbtbackup.
  • See the supplied script V$SESSION_LONGOPS.SQL for a working sample of this code.

Other Text:
(Examples or comments displayed on slide, if any).

system@TEST> SELECT sid, seconds_in_wait AS sec_wait, event
2 FROM v$session_wait
3 WHERE wait_time = 0
4 ORDER BY sid;

SID SEC_WAIT EVENT
---------- ---------- -------------------------------------
1 82491 pmon timer
. . .
12 1006 SQL*Net message from client <br>
16 0 SQL*Net message from client <br>
17 1006 imm op
18 1006 sbtbackup

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • Though the cost of very large disk devices (e.g. 80GB) is always dropping, and thus it is easy to place more of your data on fewer disks, this is not recommended for best overall Oracle performance. The Oracle benchmark referenced earlier in this lesson showed that the “stripe and mirror everything” (S.A.M.E.) configuration provided the best overall performance. The white paper states “The SAME configuration produces close to optimal performance for all workloads: OLTP, Warehouse and Batch.”
  • The S.A.M.E. methodology calls for:
    • Striping all Oracle files over many disks (the more the better) and using a 1MB stripe size. Note that the 1MB stripe size has to do with disk transfer speed and seek rate and could change as these characteristics change with advancement in disk technology.
    • Placing high use application files and system files (e.g. redo logs) on the outer half of the disk device (because the transfer rate on the outer half of disk devices is faster than the inner half).
    • Using disk mirroring rather than RAID-5 (for availability) because RAID-5 introduces unnecessary overhead.
  • Volume managers such as the “Veritas Volume Manager” should be used because they provide tools for easy striping of files across many disks (up to hundreds of disks).

Author’s Note

Unlike subsequent theories discussed later in this lesson, the limitations of my lab environment did not permit actual testing of the theories in SAME. I am relying on the tests conducted by Oracle and documented in the white paper “Oracle Recovery Manager: Performance Testing at Sun Customer Benchmark Center” for the conclusions described on this page.

Other Text:
(Examples or comments displayed on slide, if any).

Mirror + Stripe = RAID 1 + 0 or “RAID ten”<br>

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • When using RMAN, the number of allocated channels (manually or configured) determines the degree of parallelization. Parallelism in RMAN is the number of server processes (i.e. channels) that are reading and writing Oracle data. If one channel is allocated, the degree of parallelization is one; if two channels are allocated the degree of parallelization is two, etc.
  • The general guideline is to allocate one channel per output device. This is because allocating more channels than output devices will often just create a bottleneck by sending more data (issuing more I/O requests) than the device can handle, thus the additional channels/processes will wait. While one channel per output device is common, this is not always the case. Backup to disk devices that have high transfer rates might warrant more channels. The best advice is to test. See the next page for an example of how to test the performance of an RMAN operation.
  • A point to remember is that each channel allocated to a backup creates a backup set. Too many backup sets can become a management headache.

Other Text:
(Examples or comments displayed on slide, if any).

Performance guideline<br>Manageability issue<br>

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • This test invokes a backup to two disk devices using two channels. Because Oracle9i RMAN did not have the ability to show elapsed time by RMAN job, I execute the routines “RS_START” and “RS_STOP” to capture performance statistics (see the statistics on the following page). These scripts are derived from a script I found on the Tom Kyte web site, http://asktom.oracle.com/~tkyte/runstats.html. The scripts are provided with this course book.
  • To run similar tests on your server, review, modify where necessary and execute the following supplied script and log files:
    • MORE_CHANNELS_THAN_DISKS.SQL
    • TWO_CHANNELS.RMAN
    • TEN_CHANNELS.RMAN
    • TWO_CHANNELS.LOG
    • TEN_CHANNELS.LOG
  • The scripts required to capture and display elapsed time are:
    • RUNSTATS.SQL
    • DISPLAY_ELAPSED_TIME.SQL

  • NOTE: Oracle 10g Release 2 provides a new view V$RMAN_BACKUP_JOB_DETAILS which should negate the need for the technique demonstrated here.

Other Text:
(Examples or comments displayed on slide, if any).

RMAN> # SkillBuilders, 2003
2> # RMAN Tuning
3>
4> # Get elapsed time of full backup (~560Mb) with these specs:
5> # TWO channels
6> # TWO disks
7> # 20 datafiles
8> # FILESPERSET 64 (default)
9> # MAXOPENFILES 16 (default)
10>
11> connect target system/dave@beatles
12> connect catalog rman/rman
13>
14> sql 'begin runstats_pkg.rs_start; end;';
15> run {
16> allocate channel d1 type disk format '/mnt/disk2/%U';
17> allocate channel d2 type disk format '/mnt/disk3/%U';
18> backup full force database; }
19>
20> sql "begin runstats_pkg.rs_stop(P_OP_NAME=>''two_channels''); end;";
21> sql 'commit';
22>
23> exit;

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • My tests showed that adding more channels than available devices increased elapsed time. The first test (script “TWO_CHANNELS.RMAN”) allocated two channels for two output disk devices. The elapsed time for the backup is 84 seconds (rounded). Backing up the same exact data using ten channels required 99 seconds, an 18% increase in elapsed time.
  • Note that these tests were run on a server dedicated to RMAN testing; no other processes where active. The specifications of the my test environment are:
    • Dell Poweredge 2400, 1 Pentium 733 Mhz CPU
    • Linux OS (Red Hat, 8.0)
    • Oracle9i (9.2.0.4)
    • The input devices are three 9 GB Seagate Cheetah disk drives, tied with RAID 0, 40MB/second transfer rate
    • The output devices are two 36 GB Seagate Cheetah disk drives, 40MB/second transfer rate
    • One Quantum (Benchmark) DLT1 tape drive, 40GB cap, 80GB compressed, sustained transfer rate 3MB / second (6MB compressed), burst transfer rate 20MB / second.

Other Text:
(Examples or comments displayed on slide, if any).

sys@TEST> host rman cmdfile=c:\oraclescripts\rman\two_channels.rman

sys@TEST> @display_elapsed_time two_channels

OP_NAME RUN_TIME ELAPSED_TIME ET
-------------------- -------------------- ------------ ----------
two_channels 26-NOV-0312.00.56.1 66701 PM 8364 83.64


sys@TEST> host rman cmdfile=c:\oraclescripts\rman\ten_channels.rman

sys@TEST> @display_elapsed_time ten_channels

OP_NAME RUN_TIME ELAPSED_TIME ET
-------------------- -------------------- ------------ ----------
ten_channels 26-NOV-0312.21.30.0 24133 PM 9906 99.06

Using more channels than output disks increased elapsed time 18%

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • A major performance benefit is the ability of most operating systems support for asynchronous I/O disk devices. This means that a channel can issue an I/O to write a block to the disk, then, while that operation is taking place, move on to the next operation. You will definitely need asynchronous I/O for best performance. To check if you are achieving asynchronous I/O for your disk devices, query the V$BACKUP_SYNC_IO view after running a backup. If it contains data related to the backup operations channels, you are using synchronous I/O. THIS IS AN ISSUE YOU WILL WANT TO ADDRESS. If this is the case, set initialization parameter DBWR_IO_SLAVES to a non-zero value and allocate additional memory to the large pool. This will cause RMAN to spawn four I/O slaves for backup and restore operations, simulating asynchronous I/O. Refer to Chapter 14 of the Oracle9i Recovery Manager User’s Guide for more information on this issue. (Initialization parameter DISK_ASYNCH_IO should be set to TRUE if your system supports asynchronous I/O to disk.)
  • Tape devices, on the other hand, are synchronous devices. Thus, if your backing up or restoring from tapes, you will want to enable support for asynchronous I/O with the BACKUP_TAPE_IO_SLAVES=TRUE initialization parameter. See the next page for an example.

Other Text:
(Examples or comments displayed on slide, if any).

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • This example shows how to enable asynchronous I/O to tape device(s). This is accomplished with the BACKUP_TAPE_IO_SLAVES parameter. The Oracle9i Database Reference says this about this parameter: “BACKUP_TAPE_IO_SLAVES specifies whether I/O server processes (also called slaves) are used by the Recovery Manager to back up, copy, or restore data to tape. When the value is set to true, Oracle uses an I/O server process to write to or read from a tape device.”
  • The BACKUP_TAPE_IO_SLAVES parameter cannot be changed for the current instance. You must restart the database to put the change into effect.
  • Note that it is also important to allocate memory to the large pool for the tape buffers; otherwise the memory is taken from the shared pool, which can cause unnecessary contention in for shared pool resources. Chapter 14 of the Oracle9i Recovery Manager User’s Guide contains this formula for calculating the space required in the shared pool for tape buffers:
      • RMAN large pool requirement* =
      • number_of_allocated_channels * (16 MB + ( 4 * size_of_tape_buffer ) )
  • * Add the result of this calculation to your existing large pool allocation.

Other Text:
(Examples or comments displayed on slide, if any).

sys@TEST> alter system set backup_tape_io_slaves = true
sys@TEST> scope = spfile;

System altered.
sys@TEST> startup force
ORACLE instance started.

Total System Global Area 236000356 bytes
. . .
Database opened.
sys@TEST> show parameter backup_tape

NAME TYPE VALUE
------------------------------------ ----------- ---------
backup_tape_io_slaves boolean TRUE
sys@TEST> show parameter large_pool

NAME TYPE VALUE
------------------------------------ ----------- ---------
large_pool_size big integer 16777216

Alter parameter
Restart database
Insure large pool allocated

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • One of the methods of monitoring performance of RMAN operations is the V$BACKUP_ASYNC_IO view. Each row containing TYPE “AGGREGATE” provides statistics on a channel operation, including elapsed time and data transfer speed (“effective bytes per second”). To test the performance gain of enabling asynchronous I/O to tape, first I ran a backup using synchronous I/O.
  • The supplied script V$BACKUP_SYNC_IO.SQL generated the first report shown in the slide:
  • sys@TEST> select to_char(close_time, 'mm/dd hh:mi') as time, type,
  • 2 elapsed_time as et,effective_bytes_per_second as eb
  • 3 from v$backup_sync_io
  • 4 where close_time > sysdate - (1/24 / 4) /* within last .25 hour

  • The supplied script V$BACKUP_ASYNC_IO.SQL revealed the statistics for the backup using asynchronous I/O:
  • -- Display performance data for a channel:
      • select to_char(close_time, 'mm/dd hh:mi') as time, type,
      • elapsed_time as et, effective_bytes_per_second as eb
      • from v$backup_async_io
      • where close_time > sysdate - (1/24 / 4)
      • and type = 'AGGREGATE';

Other Text:
(Examples or comments displayed on slide, if any).

@v$backup_sync_io

-- TIME TYPE ET EB
-- ----------- --------- ---------- ----------
-- 11/28 11:23 OUTPUT 12900 4362716


@v$backup_async_io
-- sys@TEST> @v$backup_async_io

-- TIME TYPE ET EB
-- ----------- --------- ---------- ----------
-- 11/28 11:46 AGGREGATE 11300 8985827

Async IO doubled “effective” bytes / second. This is GOOD!
However, elapsed time did not drop by ½. Let’s continue tuning.

Slides © 2004-2007 SkillBuilders.

Instructor's Notes

(If applicable)

  • To help identify RMAN processes, use the SET COMMAND ID command to populate the CLIENT_INFO column of the V$SESSION view. By default, the processes might look like this:

  • However, we can see that SID 18 process is not clearly marked. This was the tape IO slave started because I had the parameter BACKUP_TAPE_IO_SLAVE set to TRUE to enable asynchronous writes to the tape device.
  • See the supplied script SET_COMMAND_ID.RMAN for this code.
  • Warning: On my client&