Sunday, January 29, 2012

Disk space layout on your Exadata

This blog post is a product of my last post on Exadata disk usage.

I have multiple Exadatas (both full racks and half racks), and I want to know exactly how each one is configured now that ACS has left. How do I go about finding out how they are set up?

Well let's start with the basics.

Each Storage cell

  • Has 12 physical spinning disks.
  • The first 2 disks contain the OS, which uses ~29g of space on each.
  • The disks come in either 600g (SAS) or 2tb (SATA). The newer model now has 3tb (SATA).
  • Each cell contains 384g of flash cache, made up of 4 96g F20 PCI cards.

Now let's log on to a storage cell and see how it is configured.

First go to cellcli, and look at the physical disks.
CellCLI> list physicaldisk
         20:0            R0DQF8          normal
         20:1            R1N71G          normal
         20:2            R1NQVB          normal
         20:3            R1N8DD          normal
         20:4            R1NNBC          normal
         20:5            R1N8BW          normal
         20:6            R1KFW3          normal
         20:7            R1EX24          normal
         20:8            R2LWZC          normal
         20:9            R0K8MF          normal
         20:10           R0HR55          normal
         20:11           R0JQ9A          normal
         FLASH_1_0       3047M04YEC      normal
         FLASH_1_1       3047M05079      normal
         FLASH_1_2       3048M052FD      normal
         FLASH_1_3       3047M04YF7      normal
         FLASH_2_0       3047M04WXN      normal
         FLASH_2_1       3047M04YAJ      normal
         FLASH_2_2       3047M04WTR      normal
         FLASH_2_3       3047M04Y9L      normal
         FLASH_4_0       3047M0500W      normal
         FLASH_4_1       3047M0503G      normal
         FLASH_4_2       3047M0500X      normal
         FLASH_4_3       3047M0501G      normal
         FLASH_5_0       3047M050XG      normal
         FLASH_5_1       3047M050XP      normal
         FLASH_5_2       3047M05098      normal
         FLASH_5_3       3047M050UH      normal

From this you can see that there are 12 physical disks (20:0 - 20:11), and 16 flash disks.
Now let's look at the detail for these 2 types of disks.  I will use the command

list physicaldisk {diskname} detail

CellCLI> list physicaldisk 20:0 detail
         name:                   20:0
         deviceId:               19
         diskType:               HardDisk
         enclosureDeviceId:      20
         errMediaCount:          0
         errOtherCount:          0
         foreignState:           false
         luns:                   0_0
         makeModel:              "SEAGATE ST32000SSSUN2.0T"
         physicalFirmware:       0514
         physicalInsertTime:     2011-09-20T10:19:00-04:00
         physicalInterface:      sata
         physicalSerial:         R0DQF8
         physicalSize:           1862.6559999994934G
         slotNumber:             0
         status:                 normal

This is what you would see for a SAS 600g disk:
CellCLI> list physicaldisk 20:9 detail

         name:                   20:9
         deviceId:               17
         diskType:               HardDisk
         enclosureDeviceId:      20
         errMediaCount:          23
         errOtherCount:          0
         foreignState:           false
         luns:                   0_9
         makeModel:              "TEST ST360057SSUN600G"
         physicalFirmware:       0805
         physicalInsertTime:     0000-03-24T22:10:19+00:00
         physicalInterface:      sas
         physicalSerial:         E08XLW
         physicalSize:           558.9109999993816G
         slotNumber:             9
         status:                 normal

This is the configuration of the flash drives:

CellCLI> list physicaldisk FLASH_5_0 detail
         name:                   FLASH_5_0
         diskType:               FlashDisk
         errCmdTimeoutCount:     0
         errHardReadCount:       0
         errHardWriteCount:      0
         errMediaCount:          0
         errOtherCount:          0
         errSeekCount:           0
         luns:                   5_0
         makeModel:              "MARVELL SD88SA02"
         physicalFirmware:       D20Y
         physicalInsertTime:     2011-09-20T10:20:17-04:00
         physicalInterface:      sas
         physicalSerial:         3047M050XG
         physicalSize:           22.8880615234375G
         sectorRemapCount:       0
         slotNumber:             "PCI Slot: 5; FDOM: 0"
         status:                 normal

So this gives me a good idea of what disks the storage is made up of. In my case you can see that the 12 disks are SATA, and each contains about 1862g of usable space.
In the case of the SAS disks, you can see each contains about 558g of usable space.

You can also see that the flash consists of 16 separate flash disks connected through 4 PCI cards. Each card contains 4 flash disks of about 22.9g each.

For now (and the rest of this post), I will not talk about the flash. It is possible to take these flash disks and provision them as usable storage, but I won't be discussing that.

Now that we have the physical disk layout, we can move to the next level. First, to review:

We have 12 physical disks.  Each disk contains 1862.65g of space (22,352g per cell).

The next step is to look at the LUNs that were created out of the physical disks.  The LUN size is the amount of usable space left after the disks have been turned into block devices and presented to the server. You can see that the overhead is small (just under 1g per disk). Below is the output, truncated after the first 2 disks; I've also included one flash disk to show that detail.

CellCLI> list lun detail
         name:                   0_0
         cellDisk:               CD_00_tpfh1
         deviceName:             /dev/sda
         diskType:               HardDisk
         id:                     0_0
         isSystemLun:            TRUE
         lunAutoCreate:          FALSE
         lunSize:                1861.712890625G
         lunUID:                 0_0
         physicalDrives:         20:0
         raidLevel:              0
         lunWriteCacheMode:      WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
         status:                 normal

         name:                   0_1
         cellDisk:               CD_01_tpfh1
         deviceName:             /dev/sdb
         diskType:               HardDisk
         id:                     0_1
         isSystemLun:            TRUE
         lunAutoCreate:          FALSE
         lunSize:                1861.712890625G
         lunUID:                 0_1
         physicalDrives:         20:1
         raidLevel:              0
         lunWriteCacheMode:      WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
         status:                 normal

         name:                   2_2
         cellDisk:               FD_06_tpfh1
         deviceName:             /dev/sdab
         diskType:               FlashDisk
         id:                     2_2
         isSystemLun:            FALSE
         lunAutoCreate:          FALSE
         lunSize:                22.8880615234375G
         overProvisioning:       100.0
         physicalDrives:         FLASH_2_2
         status:                 normal

So from this you can see that we have 1861.7g of usable space on each drive, and that each LUN maps to a cell disk whose name refers to the server. In this case tpfh1 is the name of the storage cell, and it is included in the cellDisk name to make the disk easy to identify.

The next step is to take a look at the cell disks that were created out of these luns.

The item to note in this output is that the first 2 disks contain the OS. You will see that the usable space left on them after the creation of the OS partitions is less than on the other disks.  The overhead for the cell software is also taken out of each disk (though it is a small amount).

Here is what we have next as celldisks.

CellCLI> list celldisk detail
         name:                   CD_00_tpfh1
         creationTime:           2011-09-23T00:19:30-04:00
         deviceName:             /dev/sda
         devicePartition:        /dev/sda3
         diskType:               HardDisk
         errorCount:             0
         freeSpace:              0
         id:                     a15671cd-2bab-4bfe
         interleaving:           none
         lun:                    0_0
         raidLevel:              0
         size:                   1832.59375G
         status:                 normal

         name:                   CD_01_tpfh1
         creationTime:           2011-09-23T00:19:34-04:00
         deviceName:             /dev/sdb
         devicePartition:        /dev/sdb3
         diskType:               HardDisk
         errorCount:             0
         freeSpace:              0
         id:                     de0ee154-6925-4281
         interleaving:           none
         lun:                    0_1
         raidLevel:              0
         size:                   1832.59375G
         status:                 normal

         name:                   CD_02_tpfh1
         creationTime:           2011-09-23T00:19:34-04:00
         deviceName:             /dev/sdc
         devicePartition:        /dev/sdc
         diskType:               HardDisk
         errorCount:             0
         freeSpace:              0
         id:                     711765f1-90cc-4b53
         interleaving:           none
         lun:                    0_2
         raidLevel:              0
         size:                   1861.703125G
         status:                 normal

Now you can see the first 2 disks have 1832.6g available, and the remaining 10 disks have 1861.7g available (I didn't include the last 9 disks in the output).

So to review where we are: there are 12 physical disks, which are carved into LUNs, which then become cell disks.  Each cell has (2 x 1832.6) + (10 x 1861.7) = 22,282g of raw disk available.
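The layer-by-layer arithmetic so far can be checked with a few lines of Python. This is only a sketch using the sizes reported by the 2tb SATA cell above; the variable names are mine and are not anything CellCLI provides.

```python
# Sizes in GB as reported by CellCLI on the 2tb SATA cell shown above.
DISKS_PER_CELL = 12
SYSTEM_DISKS = 2                 # the first 2 disks also hold the OS

physical_size = 1862.656         # list physicaldisk: physicalSize (rounded)
lun_size = 1861.712890625        # list lun: lunSize
celldisk_system = 1832.59375     # CD_00 and CD_01
celldisk_data = 1861.703125      # CD_02 through CD_11

physical_total = DISKS_PER_CELL * physical_size
lun_total = DISKS_PER_CELL * lun_size
celldisk_total = (SYSTEM_DISKS * celldisk_system
                  + (DISKS_PER_CELL - SYSTEM_DISKS) * celldisk_data)

print(f"physical disks: {physical_total:,.1f}g")
print(f"luns:           {lun_total:,.1f}g")
print(f"cell disks:     {celldisk_total:,.1f}g")
print(f"OS overhead per system disk: {celldisk_data - celldisk_system:.2f}g")
```

The cell disk total comes out at roughly 22,282g, matching the figure above, and the difference between a system disk and a non-system disk is about 29.1g.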

Now these cell disks get carved up into grid disks. The grid disks are what is presented to ASM.  Let's see how my storage cell is carved up.  While looking at the output, notice that the celldisks are named CD_00_{cellname} through CD_11_{cellname}.  Here is a snippet:

CellCLI> list griddisk detail
         name:                   DATA_DMPF_CD_00_tpfh1
         cellDisk:               CD_00_tpfh1
         creationTime:           2011-09-23T00:21:59-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     2f72fb5a-adf5
         offset:                 32M
         size:                   733G
         status:                 active

         name:                   DATA_DMPF_CD_01_tpfh1
         cellDisk:               CD_01_tpfh1
         creationTime:           2011-09-23T00:21:59-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     0631c4a2-2b39
         offset:                 32M
         size:                   733G
         status:                 active

         name:                   DATA_DMPF_CD_11_tpfh1
         cellDisk:               CD_11_tpfh1
         creationTime:           2011-09-23T00:22:00-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     ccd79051-0e24
         offset:                 32M
         size:                   733G
         status:                 active

         name:                   DBFS_DG_CD_02_tpfh1
         cellDisk:               CD_02_tpfh1
         creationTime:           2011-09-23T00:20:37-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     d292062b-0e26
         offset:                 1832.59375G
         size:                   29.109375G
         status:                 active

         name:                   DBFS_DG_CD_03_tpfh1
         cellDisk:               CD_03_tpfh1
         creationTime:           2011-09-23T00:20:38-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     b8c478a9-5ae1
         offset:                 1832.59375G
         size:                   29.109375G
         status:                 active

         name:                   DBFS_DG_CD_04_tpfh1
         cellDisk:               CD_04_tpfh1
         creationTime:           2011-09-23T00:20:39-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     606e3d69-c25b
         offset:                 1832.59375G
         size:                   29.109375G
         status:                 active

         name:                   DBFS_DG_CD_11_tpfh1
         cellDisk:               CD_11_tpfh1
         creationTime:           2011-09-23T00:20:45-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     58af96a8-3fc8
         offset:                 1832.59375G
         size:                   29.109375G
         status:                 active

         name:                   RECO_DMPF_CD_00_tpfh1
         cellDisk:               CD_00_tpfh1
         creationTime:           2011-09-23T00:22:09-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     77f73bbf-09a9
         offset:                 733.046875G
         size:                   1099.546875G
         status:                 active


         name:                   RECO_DMPF_CD_11_tpfh1
         cellDisk:               CD_11_tpfh1
         creationTime:           2011-09-23T00:22:09-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     fad57e10-414f
         offset:                 733.046875G
         size:                   1099.546875G
         status:                 active
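If you would rather not read the detail output by eye, a small script can summarize it. The following is a minimal sketch that assumes the output of list griddisk detail has been captured as text; the sample is a shortened copy of three records from above, and the function names and the unit table are my own, not part of any Oracle tool.

```python
from collections import defaultdict

# Shortened copy of three records from the output above.
SAMPLE = """\
         name:                   DATA_DMPF_CD_00_tpfh1
         cellDisk:               CD_00_tpfh1
         offset:                 32M
         size:                   733G
         status:                 active

         name:                   DBFS_DG_CD_02_tpfh1
         cellDisk:               CD_02_tpfh1
         offset:                 1832.59375G
         size:                   29.109375G
         status:                 active
         name:                   RECO_DMPF_CD_00_tpfh1
         cellDisk:               CD_00_tpfh1
         offset:                 733.046875G
         size:                   1099.546875G
         status:                 active
"""

# Assumed unit suffixes; only M and G appear in the output above.
UNITS_IN_GB = {"M": 1 / 1024, "G": 1.0, "T": 1024.0}


def to_gb(value):
    """Convert a size such as '32M' or '733G' to gigabytes."""
    return float(value[:-1]) * UNITS_IN_GB[value[-1]]


def parse_detail(text):
    """Split detail output into one dict per record."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            if current:
                records.append(current)
                current = {}
            continue
        key, _, value = line.strip().partition(":")
        key = key.strip()
        if key == "name" and current:   # record not preceded by a blank line
            records.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        records.append(current)
    return records


def size_by_prefix(records):
    """Sum grid disk sizes (GB) grouped by the name prefix before '_CD_'."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["name"].split("_CD_")[0]] += to_gb(rec["size"])
    return dict(totals)


records = parse_detail(SAMPLE)
print(size_by_prefix(records))
```

A new record is started either at a blank line or at a name: line, so output where two records run together is still split correctly.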

Now by looking at this you can see that there are 3 sets of grid disks.

DATA - this is carved out of every disk, and contains 733g of storage.  It starts at offset 32m (the beginning of the disk).

RECO - this is also carved out of every disk, and contains 1099.5g of storage. It starts at offset 733g.

So now we are getting the picture. Each cell disk is carved into 2 grid disks, starting with DATA, followed by RECO.

DBFS - this is carved out of the last 10 disks (starting with disk 2) at offset 1832.59g, and it contains 29.1g.  I can only conclude this matches the size of the OS partition on the first 2 disks.
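To check how the three grid disks stack on a cell disk, the offsets and sizes can be added up. This is a rough sketch using only the numbers from the output above; the dictionary layout and the end() helper are just my own illustration.

```python
# Offsets and sizes in GB, taken from the grid disk output above.
MB = 1 / 1024

celldisk_size = 1861.703125              # CD_02 through CD_11
data = {"offset": 32 * MB, "size": 733.0}
reco = {"offset": 733.046875, "size": 1099.546875}
dbfs = {"offset": 1832.59375, "size": 29.109375}


def end(griddisk):
    """Return the GB position where a grid disk stops on its cell disk."""
    return griddisk["offset"] + griddisk["size"]


print(f"DATA ends at {end(data):.6f}g, RECO starts at {reco['offset']:.6f}g")
print(f"RECO ends at {end(reco):.6f}g, DBFS starts at {dbfs['offset']:.6f}g")
print(f"DBFS ends at {end(dbfs):.6f}g, cell disk size is {celldisk_size:.6f}g")
```

With these figures RECO ends exactly where DBFS begins, and DBFS ends exactly at the cell disk size of the non-system disks. The DBFS offset (1832.59375g) is also the full cell disk size on the first 2 disks, which is why there is no room for a DBFS grid disk there. There is a 16m gap between the end of DATA and the start of RECO.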

So here is what we have for sizing on each Storage cell.

DATA  -  8,796g
RECO  - 13,194g
DBFS  -    290g

Total - 22,280g

The thing to keep in mind with this number is that the OS partitions have caused us a bit of trouble. There are only 10 of the DBFS grid disks per cell, and they are only 29g each.  If we pull these out, we have ~22tb of usable disk on each storage cell.
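The per-cell totals are just grid disk count times grid disk size. Here is a small sketch of that arithmetic, using the exact sizes from the CellCLI output rather than the rounded ones; the dictionary is my own shorthand.

```python
# Grid disk counts and sizes (GB) per storage cell, from the output above.
griddisks = {
    "DATA": (12, 733.0),
    "RECO": (12, 1099.546875),
    "DBFS": (10, 29.109375),    # not present on the 2 system disks
}

per_cell = {name: count * size for name, (count, size) in griddisks.items()}
for name, total in per_cell.items():
    print(f"{name}: {total:,.1f}g")

without_dbfs = per_cell["DATA"] + per_cell["RECO"]
print(f"DATA + RECO: {without_dbfs:,.1f}g (~{without_dbfs / 1000:.0f}tb)")
```

Using the unrounded 29.109375g, DBFS comes out at about 291g per cell; the 290g figure above uses the rounded 29g per disk.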

Now to figure out how much space is in each disk group (assuming these grid disks will all go directly into 3 disk groups).

The first thing to remember is the redundancy level.  Are the disk groups going to be normal redundancy (mirrored) or high redundancy (triple mirrored)?  With normal redundancy, the disk groups are configured with a disk being redundant with a disk on another cell.  With high redundancy, the disk is redundant with 2 other disks on 2 other cells. To maintain this level of redundancy through a cell failure, you must set aside 1 storage cell's worth of storage for normal redundancy, and 2 storage cells' worth of storage for high redundancy, to ensure that you are completely protected.

So what does this mean for sizing?  The larger your array, the more usable disk you get. With a half rack, you must set aside 1 out of 7 storage cells for normal redundancy, or 2 out of 7 for high redundancy.  For a full rack, you need to set aside 1 out of 14 storage cells, or 2 out of 14.

Now let's run the numbers.


Half rack (7 storage cells)

Data - Normal (8,796g / 2) * 6 usable cells = 26,388g of usable space
       High   (8,796g / 3) * 5 usable cells = 14,660g of usable space

Reco - Normal (13,194g / 2) * 6 usable cells = 39,582g of usable space
       High   (13,194g / 3) * 5 usable cells = 21,990g of usable space

Dbfs - Normal (290g / 2) * 6 usable cells = 870g of usable space
       High   (290g / 3) * 5 usable cells = 483g of usable space

TOTAL usable (minus DBFS)
    Normal Redundancy - 65.97tb
    High Redundancy   - 36.65tb


Full rack (14 storage cells)

Data - Normal (8,796g / 2) * 13 usable cells = 57,174g of usable space
       High   (8,796g / 3) * 12 usable cells = 35,184g of usable space

Reco - Normal (13,194g / 2) * 13 usable cells = 85,761g of usable space
       High   (13,194g / 3) * 12 usable cells = 52,776g of usable space

Dbfs - Normal (290g / 2) * 13 usable cells = 1,885g of usable space
       High   (290g / 3) * 12 usable cells = 1,160g of usable space

TOTAL usable (minus DBFS)
    Normal Redundancy - 142.9tb
    High Redundancy   - 87.96tb
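The same calculation can be scripted so it is easy to rerun for a different rack size. This is a sketch of the rule of thumb used in this post (divide by the number of mirror copies, then hold back 1 or 2 cells' worth of space); it is not an official Oracle sizing formula, and the names are my own.

```python
MIRROR_COPIES = {"normal": 2, "high": 3}
# Cells' worth of space held back, per the reasoning in this post.
RESERVED_CELLS = {"normal": 1, "high": 2}


def usable_gb(per_cell_gb, cells, redundancy):
    """Usable GB for one disk group spread across `cells` storage cells."""
    copies = MIRROR_COPIES[redundancy]
    usable_cells = cells - RESERVED_CELLS[redundancy]
    return per_cell_gb / copies * usable_cells


# Per-cell grid disk totals (GB) from earlier in the post, DBFS left out.
PER_CELL = {"DATA": 8796, "RECO": 13194}

for label, cells in (("half rack", 7), ("full rack", 14)):
    for redundancy in ("normal", "high"):
        total = sum(usable_gb(gb, cells, redundancy) for gb in PER_CELL.values())
        print(f"{label}, {redundancy}: {total / 1000:.2f}tb")
```

For example, usable_gb(8796, 7, "normal") gives the 26,388g shown for DATA on a half rack.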

So the takeaway I get from this is:

There is a much higher cost for the higher redundancy level, and this cost is proportionally higher for smaller rack systems.
A certain portion of each cell is a small grid disk that is only on 10 of the physical disks and is hard to utilize well.


  1. Each Storage cell

    Has 14 physical spinning disks.

    Should be:-

    Each Storage cell

    Has 12 physical spinning disks.

  2. I believe the content needs to be more refined with respect to Exadata. As you mentioned:

    the disk groups are configured with a disk being redundant with a disk on another cell.

    but in Exadata redundancy is at the cell level, not at the disk level.

  3. Great post Bryan! Thank you.

    - Keith Gauvin
