Sunday, January 29, 2012

Disk space layout on your Exadata

This blog post is a product of my last post on Exadata disk usage.

I have multiple exadatas (both full Rack and 1/2 Racks), and I want to know exactly how each one is configured, now that ACS has left.  How do I go about finding how they are set up.

Well let's start with the basics.

Each Storage cell

  • Has 12 physical spinning disks.
  • The first 2 disks contain the os which utilizes ~29g of space
  • The disks come in either 600g (SAS) or 2tb (SATA). The newer model now has 3tb (SATA).
  • Each cell contains 384G of flash cache, made up of 4 96g f20 PCI cards..

Now lets logon to a storage cell and see how it is configuring.

First go to cellcli, and look at the physical disks.
CellCLI| list physicaldisk
         20:0            R0DQF8          normal
         20:1            R1N71G          normal
         20:2            R1NQVB          normal
         20:3            R1N8DD          normal
         20:4            R1NNBC          normal
         20:5            R1N8BW          normal
         20:6            R1KFW3          normal
         20:7            R1EX24          normal
         20:8            R2LWZC          normal
         20:9            R0K8MF          normal
         20:10           R0HR55          normal
         20:11           R0JQ9A          normal
         FLASH_1_0       3047M04YEC      normal
         FLASH_1_1       3047M05079      normal
         FLASH_1_2       3048M052FD      normal
         FLASH_1_3       3047M04YF7      normal
         FLASH_2_0       3047M04WXN      normal
         FLASH_2_1       3047M04YAJ      normal
         FLASH_2_2       3047M04WTR      normal
         FLASH_2_3       3047M04Y9L      normal
         FLASH_4_0       3047M0500W      normal
         FLASH_4_1       3047M0503G      normal
         FLASH_4_2       3047M0500X      normal
         FLASH_4_3       3047M0501G      normal
         FLASH_5_0       3047M050XG      normal
         FLASH_5_1       3047M050XP      normal
         FLASH_5_2       3047M05098      normal
         FLASH_5_3       3047M050UH      normal


From this you can see that there are 12 physical disks (20:0 - 20:11), and 16 flash disks.
Now lets look at the detail from these 2 types of disks.  I will use the command


list physicaldisk {diskname} detail



CellCLI| list physicaldisk 20:0 detail
         name:                   20:0
         deviceId:               19
         diskType:               HardDisk
         enclosureDeviceId:      20
         errMediaCount:          0
         errOtherCount:          0
         foreignState:           false
         luns:                   0_0
         makeModel:              "SEAGATE ST32000SSSUN2.0T"
         physicalFirmware:       0514
         physicalInsertTime:     2011-09-20T10:19:00-04:00
         physicalInterface:      sata
         physicalSerial:         R0DQF8
         physicalSize:           1862.6559999994934G
         slotNumber:             0
         status:                 normal



This is what you would see for a SAS 600g Disk
CellCLI| list physicaldisk 20:0 detail

         name:                   20:9
         deviceId:               17
         diskType:               HardDisk
         enclosureDeviceId:      20
         errMediaCount:          23
         errOtherCount:          0
         foreignState:           false
         luns:                   0_9
         makeModel:              "TEST ST360057SSUN600G"
         physicalFirmware:       0805
         physicalInsertTime:     0000-03-24T22:10:19+00:00
         physicalInterface:      sas
         physicalSerial:         E08XLW
         physicalSize:           558.9109999993816G
         slotNumber:             9
         status:                 normal


This is what the configuration of the FLASH drives are

CellCLI| list physicaldisk FLASH_5_0 detail
         name:                   FLASH_5_0
         diskType:               FlashDisk
         errCmdTimeoutCount:     0
         errHardReadCount:       0
         errHardWriteCount:      0
         errMediaCount:          0
         errOtherCount:          0
         errSeekCount:           0
         luns:                   5_0
         makeModel:              "MARVELL SD88SA02"
         physicalFirmware:       D20Y
         physicalInsertTime:     2011-09-20T10:20:17-04:00
         physicalInterface:      sas
         physicalSerial:         3047M050XG
         physicalSize:           22.8880615234375G
         sectorRemapCount:       0
         slotNumber:             "PCI Slot: 5; FDOM: 0"
         status:                 normal



So this gives me a good idea of what disks the storage is made up of. In my case you can see that the 12 disks are SATA, and they contain 1862 of usable space.
In the case of the SAS, you can see they contain 558g of usable space.

You can also see that the flash disks comprise of 16 separate disks, that are connected through 4 PCI cards. Each card contains 4 22g flashdisks.

For now (and the rest of this post), I will not talk about the flash. It is possible to use these cell disks, and provision them as usable storage, but I won't be discussing that.

Now that we have  the physical disk layout, we can move to next level  First to review.

We have 12 physical disks.  Each disk contains 1862.65 g of space. (22,352g/cell)

Now the next step is to look at the luns that were created out of the physical disks.  The lun, is the amount of usable space left after the disks have been turned into block devices and presented to the server. You can see that is is a small amount, and below is the output(truncated after the first 2 disks, then I've included the flashdisk to show that detail.


CellCLI| list lun detail
         name:                   0_0
         cellDisk:               CD_00_tpfh1
         deviceName:             /dev/sda
         diskType:               HardDisk
         id:                     0_0
         isSystemLun:            TRUE
         lunAutoCreate:          FALSE
         lunSize:                1861.712890625G
         lunUID:                 0_0
         physicalDrives:         20:0
         raidLevel:              0
         lunWriteCacheMode:      WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
         status:                 normal

         name:                   0_1
         cellDisk:               CD_01_tpfh1
         deviceName:             /dev/sdb
         diskType:               HardDisk
         id:                     0_1
         isSystemLun:            TRUE
         lunAutoCreate:          FALSE
         lunSize:                1861.712890625G
         lunUID:                 0_1
         physicalDrives:         20:1
         raidLevel:              0
         lunWriteCacheMode:      WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
         status:                 normal

         name:                   2_2
         cellDisk:               FD_06_tpfh1
         deviceName:             /dev/sdab
         diskType:               FlashDisk
         id:                     2_2
         isSystemLun:            FALSE
         lunAutoCreate:          FALSE
         lunSize:                22.8880615234375G
         overProvisioning:       100.0
         physicalDrives:         FLASH_2_2
         status:                 normal



So from this you can see that we have 1861.7 g of usable space on each drive, and you can see that the LUNS are given names that refer to the server. In this case the tpfh1 is the name of the storage cell, and this is included in the cellDisk name to easily identify the disk.

The next step is to take a look at the cell disks that were created out of these luns.

The items to note on this output is that first 2 disks contain the OS. You will see that the usable space left after the creation of the os partitions is less than the other disks.  The overhead for the cell software on each disk is also taken (though it is a small amount).

Here is what we have next as celldisks.



CellCLI| list celldisk detail
         name:                   CD_00_tpfh1
         comment:
         creationTime:           2011-09-23T00:19:30-04:00
         deviceName:             /dev/sda
         devicePartition:        /dev/sda3
         diskType:               HardDisk
         errorCount:             0
         freeSpace:              0
         id:                     a15671cd-2bab-4bfe
         interleaving:           none
         lun:                    0_0
         raidLevel:              0
         size:                   1832.59375G
         status:                 normal

         name:                   CD_01_tpfh1
         comment:
         creationTime:           2011-09-23T00:19:34-04:00
         deviceName:             /dev/sdb
         devicePartition:        /dev/sdb3
         diskType:               HardDisk
         errorCount:             0
         freeSpace:              0
         id:                     de0ee154-6925-4281
         interleaving:           none
         lun:                    0_1
         raidLevel:              0
         size:                   1832.59375G
         status:                 normal

         name:                   CD_02_tpfh1
         comment:
         creationTime:           2011-09-23T00:19:34-04:00
         deviceName:             /dev/sdc
         devicePartition:        /dev/sdc
         diskType:               HardDisk
         errorCount:             0
         freeSpace:              0
         id:                     711765f1-90cc-4b53
         interleaving:           none
         lun:                    0_2
         raidLevel:              0
         size:                   1861.703125G
         status:                 normal


Now you can see the first 2 disks have 1832.6g available, and the remaining 10 disks have 1861.7g available (I didn't include the last 9 disks in the output).

So to review where we are. There are 12 physical disks, which are carved into luns, then become cell disks.  These cells have (2 x 1832.6) + (10 x 1861.7) = 22,282g of raw disk available.

Now these disks get carved up into Grid disks. The grid disks are what is presented to ASM.  Lets see how my storage cell is carved up.  While looking at the output, notice that the celldisks are named CD_00_{cellname} through  CD_11_{cellname}.  Here is a snippet



CellCLI| list griddisk detail
         name:                   DATA_DMPF_CD_00_tpfh1
         availableTo:
         cellDisk:               CD_00_tpfh1
         comment:
         creationTime:           2011-09-23T00:21:59-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     2f72fb5a-adf5
         offset:                 32M
         size:                   733G
         status:                 active

         name:                   DATA_DMPF_CD_01_tpfh1
         availableTo:
         cellDisk:               CD_01_tpfh1
         comment:
         creationTime:           2011-09-23T00:21:59-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     0631c4a2-2b39
         offset:                 32M
         size:                   733G
         status:                 active
.......
.......
.......

        name:                   DATA_DMPF_CD_11_tpfh1
         availableTo:
         cellDisk:               CD_11_tpfh1
         comment:
         creationTime:           2011-09-23T00:22:00-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     ccd79051-0e24
         offset:                 32M
         size:                   733G
         status:                 active

         name:                   DBFS_DG_CD_02_tpfh1
         availableTo:
         cellDisk:               CD_02_tpfh1
         comment:
         creationTime:           2011-09-23T00:20:37-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     d292062b-0e26
         offset:                 1832.59375G
         size:                   29.109375G
         status:                 active

         name:                   DBFS_DG_CD_03_tpfh1
         availableTo:
         cellDisk:               CD_03_tpfh1
         comment:
         creationTime:           2011-09-23T00:20:38-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     b8c478a9-5ae1
         offset:                 1832.59375G
         size:                   29.109375G
         status:                 active

         name:                   DBFS_DG_CD_04_tpfh1
         availableTo:
         cellDisk:               CD_04_tpfh1
         comment:
         creationTime:           2011-09-23T00:20:39-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     606e3d69-c25b
         offset:                 1832.59375G
         size:                   29.109375G
         status:                 active
.....
.....
.....
         name:                   DBFS_DG_CD_11_tpfh1
         availableTo:
         cellDisk:               CD_11_tpfh1
         comment:
         creationTime:           2011-09-23T00:20:45-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     58af96a8-3fc8
         offset:                 1832.59375G
         size:                   29.109375G
         status:                 active

         name:                   RECO_DMPF_CD_00_tpfh1
         availableTo:
         cellDisk:               CD_00_tpfh1
         comment:
         creationTime:           2011-09-23T00:22:09-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     77f73bbf-09a9
         offset:                 733.046875G
         size:                   1099.546875G
         status:                 active

.....
.....
.....

         name:                   RECO_DMPF_CD_11_tpfh1
         availableTo:
         cellDisk:               CD_11_tpfh1
         comment:
         creationTime:           2011-09-23T00:22:09-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     fad57e10-414f
         offset:                 733.046875G
         size:                   1099.546875G
         status:                 active



Now by looking at this you can see that there are 3 sets of grid disks.

DATA - this carved out of every disk, and contains 733g of storage.  This starts at offset 32m (the beginning of the disks)..

RECO - this is carved out of every disk also, and contains 1099.5g of storage. This starts at offset 733G.

So now we are getting the picture.. Each celldisk is carved into 2 gridisk, starting with Data, followed by reco.

DBFS - This is carved out of the last 10 disks (starting with disk 2) at offset 1832.59, and it  contains 29.1g.  I can only conclude this is the size of the OS parition on the first 2 disks.

So here is what we have for sizing on each Storage cell.

DATA  -  8,796g
RECO - 13,194g
DBFS -       290g

Total   22,280

The thing to keep in mind with this number, is that the OS partitions has caused us a bit of trouble. There are only 10 of these grid disks per cell, and the are only 29g.  If we pull this out, we have ~22tb of disk usable on each storage cell.

Now to figure out how much space is in each disk group (assuming these grid disks will all go directly into 3 disk groups).

The first thing to remember is the redundance level.  Are they going to be normal redundancy (mirrored) or High redundancy (triple mirrored) ?  With normal redundancy, the disk groups are configured with a disk being redundant with a disk on another cell.  With High redundancy the disk is redundant with 2 other disks on 2 other cells. To maintain this level of redundancy, you must set aside 1 storage cells worth of storage for normal redudnacy, and 2 storage cells worth of storage for high redundancy to ensure that you are completely protected.

So what does this mean for sizing ??  The larger your array, the more usable disk you get. With a half rack, you must set aside 1 out of 7 storage cells, or 2 out of 7 storage cells for redudnacy.  For a full rack you need to set aside 1 out of 14 storage cells, or 2 out of 14 storage cells for redundancy.

Now lets run the numbers.


HALF RACK  -

Data -  Normal  (8,796g / 2) * 6 usable racks = 26,388g of usable space
            High       (8,796g / 3) * 5 usable racks = 14,660g of usable space

Reco - Normal (13,194g / 2) * 6 usable racks = 39,562g of usable space
           High      (13,194g / 3) * 5 usable racks = 21,990g of usable space

Dbfs - Normal (290g / 2) * 6 usable racks = 870g of usable space
           High      (290g / 3) * 5 usable racks = 483g of usable space

TOTAL usable (minus DBFS)
    Normal Redundancy - 65.9tb
    High Redundancy        36.6tb


FULL RACK -


Data - Normal (8,796g / 2) * 13 usable racks = 57,174g of usable space
           High (8,796g / 3) * 12 usable racks = 35,184g of usable space


Reco - Normal (13,194g / 2) * 13 usable racks = 85,761g of usable space
           High (13,194g / 3) * 12 usable racks = 52,776g of usable space


Dbfs - Normal (290g / 2) * 13 usable racks = 1885g of usable space
         High (290g / 3) * 12 usable racks = 1160g of usable space

TOTAL usable (minus DBFS)
    Normal Redundancy - 142.9 tb
    High Reundancy        - 87.96tb


So the take I get from this is.

There is a much higher cost for redunancy levels, and this cost is higher for smaller rack systems.
A certain portion of the the cells is a small gid disk, that is only on 10 of the physical disks, and is hard to utilize well.

Tuesday, January 24, 2012

Calculating disk space usage on an exadata.

I have been working on figuring out where all our space is, and how much space is actually available on our Exadata.  First to clarify what all the calculations are based on.

1/2 Rack. Sata drives. normal redundancy

This means we have
  • 7 storage cells
  • Each storage cell contains 12 disks
  • each disk is 2tb (which is about 1.862 tb usable)
  • The first 2 disks in each storage cell has ~30g already partitioned for the OS (which is mirrored).
Next I looked to see how the disks were allocated within each storage cell (they are all consistent)

list griddisk attributes name, celldisk, size
         DATA_DMPF_CD_00_srcell1        CD_00_srcell1  733G
         DATA_DMPF_CD_01_srcell1        CD_01_srcell1  733G
         DATA_DMPF_CD_02_srcell1        CD_02_srcell1  733G
         DATA_DMPF_CD_03_srcell1        CD_03_srcell1  733G
         DATA_DMPF_CD_04_srcell1        CD_04_srcell1  733G
         DATA_DMPF_CD_05_srcell1        CD_05_srcell1  733G
         DATA_DMPF_CD_06_srcell1        CD_06_srcell1  733G
         DATA_DMPF_CD_07_srcell1        CD_07_srcell1  733G
         DATA_DMPF_CD_08_srcell1        CD_08_srcell1  733G
         DATA_DMPF_CD_09_srcell1        CD_09_srcell1  733G
         DATA_DMPF_CD_10_srcell1        CD_10_srcell1  733G
         DATA_DMPF_CD_11_srcell1        CD_11_srcell1  733G
         DBFS_DG_CD_02_srcell1          CD_02_srcell1  29.109375G
         DBFS_DG_CD_03_srcell1          CD_03_srcell1  29.109375G
         DBFS_DG_CD_04_srcell1          CD_04_srcell1  29.109375G
         DBFS_DG_CD_05_srcell1          CD_05_srcell1  29.109375G
         DBFS_DG_CD_06_srcell1          CD_06_srcell1  29.109375G
         DBFS_DG_CD_07_srcell1          CD_07_srcell1  29.109375G
         DBFS_DG_CD_08_srcell1          CD_08_srcell1  29.109375G
         DBFS_DG_CD_09_srcell1          CD_09_srcell1  29.109375G
         DBFS_DG_CD_10_srcell1          CD_10_srcell1  29.109375G
         DBFS_DG_CD_11_srcell1          CD_11_srcell1  29.109375G
         RECO_DMPF_CD_00_srcell1        CD_00_srcell1  1099.546875G
         RECO_DMPF_CD_01_srcell1        CD_01_srcell1  1099.546875G
         RECO_DMPF_CD_02_srcell1        CD_02_srcell1  1099.546875G
         RECO_DMPF_CD_03_srcell1        CD_03_srcell1  1099.546875G
         RECO_DMPF_CD_04_srcell1        CD_04_srcell1  1099.546875G
         RECO_DMPF_CD_05_srcell1        CD_05_srcell1  1099.546875G
         RECO_DMPF_CD_06_srcell1        CD_06_srcell1  1099.546875G
         RECO_DMPF_CD_07_srcell1        CD_07_srcell1  1099.546875G
         RECO_DMPF_CD_08_srcell1        CD_08_srcell1  1099.546875G
         RECO_DMPF_CD_09_srcell1        CD_09_srcell1  1099.546875G
         RECO_DMPF_CD_10_srcell1        CD_10_srcell1  1099.546875G
         RECO_DMPF_CD_11_srcell1        CD_11_srcell1  1099.546875G


This is giving me a lot of information of how things were configured as griddisks.

I can tell from this that there are 3 sets of griddisks (for my diskgroups).

Data - this is composed of 12 disks containing 733g luns
reco  - this is composed of 12 disks containing 1100g luns
dbfs  - this is composed of 10 disks containing 29g luns

Notice that I mentioned previously, that the first 2 disks are used for the os (mirrored), this is why there are only 10 luns of 29g available for the dbfs disk group.

I then run the numbers for each one (7 cells * #disks * luns)

data -  61.572 tb
reco -  92.4 tb
dbfs  -   2.03 tb

Remember this is raw disk available, and I am running in normal reduncy (mirrored), if you are running triple mirrored keep this in mind.

Now this gets me a starting point, and took a look at the what asm is showing for disk usage to try see what is going on..


There are 3 values that I am looking at trying to figure out.

 Disk Group      SIZE         USED              USABLE FREE
data                     61.572   32.692             10.042
reco                     92.4         3.003             38.082
dbfs                       2.03       2.018            -.135 

Now these numbers don't seem to add up.. Only the size seems to match what I was expecting.

These are the things I started wondering about
  • How can I be using 33 tb out of 62tb raw when I am mirrored (unless it is the total raw used)
  • How can my usable free be 10tb if I am using 1/2 of the raw disk ?
  • How can my usable free be negative ???
Well in looking at the number further, and looking at the data I was able to answer the first question. The 32 tb is the raw so to state it again in actual usage...

Disk group   mirrored_used
data               16.346
reco                 1.502
dbfs                  1.009

Ok this makes a little more sense.  Looking at this this the following must be true also....

Disk group      raw left
data                 28.88
reco                 89.397
dbfs                     .019


OK, first number solved.. now lets see the next number.. The usable free must be the amount of mirred storage available (rather then raw), so If I go back to the usable free, and convert back to raw (x2 for mirrored) I get

Disk group       Usable free     Raw usable
data                   10.042          20.082
reco                    38.082         76.164
dbfs                      -.135             -.270

OK, I'm getting close, but why the discrepency, and why the negative number ??? Lets look at the diff

Disk group    Raw left     raw usable      missing raw storage
data                28.88          20.082              8.8
reco                 89.397       76.164            13.233
dbfs                     .019         -.270              -.29

Now lets take a closer look at the numbers...   and what it means to be negative.

TIP The usable free space specifies the amount of space that can be safely used for data. A value above zero means that redundancy can be properly restored after a disk failure.

So we need to reserve some space to absorb a disk loss.. hmm, in this case, it means being able to lose a storage cell, and be able to mirror on a different cell.. So lets take that calculation and see what happens
Lun_size  * disks

Disk group      calculation                   Storage cell usage
data               (.733 x 12)                     8.8
reco                (1.1 x 12)                    13.23
dbfs               (.029 x 10)                     .29

Well there is missing, space and I got answers to all my questions.

Well to summarize.

1) How much space is there to use on a 1/2 rack with 2tb Sata drives mirrored (normal redundancy) ???
    ((29g * 10 disks)     * 6 cells +
    (1833g * 12 disks) * 6 cells)/2

    66.858 tb mirrored

2) What does the values USED and SIZE mean when I am looking at ASM ?
These are the raw space avalailable across all cells, and it is the amount of raw space allocated.

3) What does the USABLE FREE show me ?
This is the amount of space you can safely allocate to your data. this (like the 2 above values) is not measured in raw, but it is measured in usable.


If anyone see's anything wrong with with my calculations let me know.  they seem to add up, and explain all the numbers...





Here is some good information, and the display from the storage cell to comfirm my sizes on whats available from the disks. My numbers match up.

http://blog.enkitec.com/wp-content/uploads/2011/02/Enkitec-Exadata-Storage-Layout11.pdf


CellCLI> list celldisk attributes name, devicePartition, size where diskType = 'HardDisk'
         CD_00_srcell1  /dev/sda3       1832.59375G
         CD_01_srcell1  /dev/sdb3       1832.59375G
         CD_02_srcell1  /dev/sdc        1861.703125G
         CD_03_srcell1  /dev/sdd        1861.703125G
         CD_04_srcell1  /dev/sde        1861.703125G
         CD_05_srcell1  /dev/sdf        1861.703125G
         CD_06_srcell1  /dev/sdg        1861.703125G
         CD_07_srcell1  /dev/sdh        1861.703125G
         CD_08_srcell1  /dev/sdi        1861.703125G
         CD_09_srcell1  /dev/sdj        1861.703125G
         CD_10_srcell1  /dev/sdk        1861.703125G
         CD_11_srcell1  /dev/sdl        1861.703125G

Thursday, January 12, 2012

AWR compare script

I don't know if anyone else has seen this, but with 12c, you can't compare before and after AWR reports.  I found this quite annoying so I used the oracle scripts, and tweaked them.. Here is the script I use.

http://dl.dropbox.com/u/23998484/awr_bsg.sql

Here are the really intesting, that you can use for any AWR period compars.

   define  num_days     = 0;
      define  dbid         =1595564914; 
      define  instance_numbers_or_ALL    = 'ALL';
      define  begin_snap   = 47097;
      define  end_snap     = 47111;
      define  num_days2    = 0;
      define  dbid2        = 1342862996;
      define  instance_numbers_or_ALL2    = 'ALL';
      define  begin_snap2  = 45907;
      define  end_snap2    = 45921; 
      define  report_type  = 'html';
      define  report_name  = /tmp/awr_gg.html
      define top_n_files        = 50;
      define top_n_segments     = 50;
      define top_n_services     = 50;
      define top_n_sql          = 100;
      @@?/rdbms/admin/awrgdrpi


Note all the Juicy goodies..  I am choosing 'ALL' for the instances in my RAC cluster.  Also notice I am overriding the number of top sql, and other things in this file.  The default number of top sql is never enough

This can be very useful to get a more meaningful AWR report in a rac environment.