Monday, February 27, 2012

Setting aside a node for maintenance on Exadata

Actually, this isn't Exadata specific, but it becomes even more important on a multi-node cluster.

First the background.

  I have a data warehouse application in which we are loading up lots of data.  At the same time, we have users reporting off that data.  I am finding that we actually have 2 sets of needs, and they are opposed.

USERS -- Their needs
  • Lots of concurrency
  • Small amounts of data
  • Small PGA
  • small temp
  • Large SGA
If the users need more than this, something probably went wrong with their query.

 DBA/ODI jobs

  • Very little concurrency (except for some parallelization)
  • Large amounts of data
  • HUGE PGA
  • HUGE Temp
  • HUGE Undo segments
  • Small SGA

The temp issue is easy enough to fix, with a separate temp tablespace for the load jobs, and by setting up a temporary tablespace group for the users.
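As a sketch of that temp setup (all tablespace and user names here are hypothetical, and sizes are illustrative only), it might look like:

```sql
-- Hypothetical names throughout; sizes are illustrative only.
-- Small temp tablespaces grouped together for the reporting users.
CREATE TEMPORARY TABLESPACE temp_usr1 TEMPFILE '+DATA' SIZE 10G
  TABLESPACE GROUP temp_users;
CREATE TEMPORARY TABLESPACE temp_usr2 TEMPFILE '+DATA' SIZE 10G
  TABLESPACE GROUP temp_users;

-- A large, separate temp tablespace for the ODI load schema.
CREATE TEMPORARY TABLESPACE temp_odi TEMPFILE '+DATA' SIZE 500G;

-- Point each class of user at the right temp.
ALTER USER report_user TEMPORARY TABLESPACE temp_users;
ALTER USER odi_user    TEMPORARY TABLESPACE temp_odi;
```

The tablespace group lets concurrent user sorts spread across multiple temp tablespaces, while the loads get one big dedicated temp.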

But what about the other things my data load jobs need?  The only answer seems to be to set aside 1 (or more) nodes out of my cluster for maintenance/loading.  This node (or nodes) will have a different configuration.  This node, let's say node 8, has the following characteristics.

  • The only services running on this node are my ODI (data load) service, and a service for the DBA's to use for index rebuilds
  • PGA Automatic memory management is not enabled
  • workarea_size_policy is manual
  • sort_area_size=60g
  • hash_area_size=60g
  • undo tablespace size is set to 1tb, much, much larger than the other nodes.  Undo_retention is set to a very large number.
The only work done on Node 8 will be loading of large tables, and  rebuild/creation of indexes.
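A minimal sketch of the session-level settings for those load sessions (values are illustrative; sort_area_size and hash_area_size are specified in bytes and are subject to platform limits, so the 60g targets above may need to be expressed differently on a given port):

```sql
-- Switch a load session on node 8 to manual workarea sizing (illustrative values).
ALTER SESSION SET workarea_size_policy = MANUAL;
ALTER SESSION SET sort_area_size = 2147483647;  -- in bytes; platform limits apply
ALTER SESSION SET hash_area_size = 2147483647;  -- in bytes; platform limits apply
```

These could also be applied automatically via a logon trigger keyed on the ODI service name, so ODI itself needs no changes.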

I was even thinking about getting the memory expansion kit for just this node, to bring it up to 144g from 96g.

Anyone else do this?  Set aside a node specifically for "heavy lifting" with a different configuration?

UPDATE ---  After writing this, and looking at my load process, I noticed that most of my load time is going to temp, both reads and writes, since I am doing huge hashes.  I am considering dedicating SSD luns to the temp area for my ODI jobs only.  I might even try dedicating SSD to the I$ (intermediate staging) tables that ODI uses.

Friday, February 10, 2012

OLTP compression slow for large data set

I am working on loading a large dataset (about 500 Million rows that take up about 100g of space).

I am loading them into a partitioned table, and I was hoping to use HCC compression, or at least OLTP compression.
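As a hedged sketch (table, column, and partition names are hypothetical), the structure is a range-partitioned table with OLTP compression; an HCC variant would swap COMPRESS FOR OLTP for something like COMPRESS FOR QUERY HIGH:

```sql
-- Hypothetical fact table, range partitioned, with OLTP compression.
CREATE TABLE fact_detl (
  load_dt  DATE,
  acct_id  NUMBER,
  amount   NUMBER
)
COMPRESS FOR OLTP
PARTITION BY RANGE (load_dt) (
  PARTITION p2011 VALUES LESS THAN (TO_DATE('2012-01-01','YYYY-MM-DD')),
  PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);
```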

After loading for a while, the inserts seem to go slower and slower.  I was able to test my table structure with OLTP compression and with no compression, and found that there was indeed a bottleneck with compression, but it didn't really get bad until about an hour into the processing.

My process is to do a bulk collect (in pl/sql) of 500 rows, and insert them into my partitioned table.
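A minimal sketch of that load loop (the staging and target table names are hypothetical):

```sql
-- Bulk collect 500 rows at a time from a staging table and array-insert them.
DECLARE
  CURSOR c_src IS SELECT * FROM stage_detl;
  TYPE t_rows IS TABLE OF stage_detl%ROWTYPE;
  l_rows t_rows;
BEGIN
  OPEN c_src;
  LOOP
    FETCH c_src BULK COLLECT INTO l_rows LIMIT 500;  -- 500-row batches
    EXIT WHEN l_rows.COUNT = 0;
    FORALL i IN 1 .. l_rows.COUNT
      INSERT INTO fact_detl VALUES l_rows(i);
    COMMIT;
  END LOOP;
  CLOSE c_src;
END;
/
```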




Below is an example.  From 7:00 am until 9:30 (in red) I was inserting data into the table with compression off.
You can see that the number of rows processed in each interval (15 minutes) was consistently ~30 million, for a throughput of 127 million/hour.
Also take a look at buffer gets/exec, elapsed time/exec, and CPU time/exec.  These values all remain fairly consistent after the first 45 minutes.
At the end of 2.5 hours, 351 million rows were loaded.

Now compare to the blue (10:15 - 10:45), where I was inserting data into the same table with compression on.
You can see that the rows processed started at 22 million (for the first 15 minutes), but kept trending downward.  You will also notice that the reads went up.
Compare the same values (buffer gets, rows processed, CPU time), and you can see the performance just continues to degrade.
Finally, look at the violet.  This is a snapshot of the currently running load after over 250 million rows have been loaded.
Notice that we are processing at about 10 million rows/hour, the buffer gets are up, and the CPU time has increased.

OLTP compression seems to be significantly slowing down the loads once they get moving along.

 
Anyone see this, or have any idea why it slows down?  The only theory I can come up with is "garbage collection" for the partitions: I reach a point where I am inserting into blocks that haven't been compressed, and Oracle is now going back and compressing the blocks to make room.

Here are the performance numbers.  I've also included the AWR logical read output.  If you take number of executions * buffer gets per execution, you find that the logical reads are all from the inserts.
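For reference, the numbers below came from AWR; here is a hedged sketch of the kind of query that produces them (:sqlid stands in for the insert statement's SQL_ID, and the column list is abbreviated):

```sql
SELECT s.end_interval_time                              AS end_time,
       q.elapsed_time_delta / q.executions_delta / 1e6  AS elapsed_per_exec_sec,
       q.executions_delta                               AS executions,
       q.buffer_gets_delta / q.executions_delta         AS buffer_gets_per_exec,
       q.rows_processed_delta                           AS rows_processed,
       q.cpu_time_delta / q.executions_delta            AS cpu_per_exec
FROM   dba_hist_sqlstat  q
JOIN   dba_hist_snapshot s
       ON  s.snap_id         = q.snap_id
       AND s.instance_number = q.instance_number
WHERE  q.sql_id = :sqlid
  AND  q.executions_delta > 0
ORDER  BY s.end_interval_time;
```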


END_TIME         ELAPSED_TIME  EXECUTIONS  TOTAL_READS_PER_EXECUTION  ROWS_PROCESSED  BUFFER_GETS  CPU_TIME
2/10/2012 7:00   0.004         56,711      283                        28,355,500      283          4,316
2/10/2012 7:15   0.004         83,225      262                        41,612,500      262          4,206
2/10/2012 7:30   0.004         81,178      293                        40,589,000      293          4,332
2/10/2012 7:45   0.007         66,630      945                        33,315,000      945          6,821
2/10/2012 8:00   0.007         62,374      1,190                      31,187,000      1,190        7,353
2/10/2012 8:15   0.009         58,031      1,640                      29,015,500      1,640        8,912
2/10/2012 8:30   0.008         59,598      1,442                      29,799,000      1,442        8,292
2/10/2012 8:45   0.009         57,116      1,648                      28,558,000      1,648        8,952
2/10/2012 9:00   0.008         60,477      1,410                      30,238,500      1,410        8,057
2/10/2012 9:15   0.009         56,334      1,710                      28,167,000      1,710        9,060
2/10/2012 9:30   0.008         61,627      1,293                      30,813,500      1,293        7,681

Total rows processed: 351,650,500
Throughput: 127,872,909 rows/hour

2/10/2012 10:15  0.013         45,964      1,940                      22,982,000      1,940        12,878
2/10/2012 10:30  0.019         33,048      3,014                      16,524,000      3,014        19,466
2/10/2012 10:45  0.018         36,192      2,235                      18,096,000      2,235        18,024
2/10/2012 11:00  0.017         37,362      1,737                      18,681,000      1,737        17,507
2/10/2012 11:15  0.018         34,992      1,526                      17,496,000      1,526        17,799
2/10/2012 11:30  0.036         20,757      6,253                      10,378,500      6,253        35,703
2/10/2012 11:45  0.046         16,744      8,714                      8,372,000       8,714        46,436

Total rows processed: 112,529,500
Throughput: 64,302,571 rows/hour













END_TIME         ELAPSED_TIME_DELTA  EXECUTIONS_DELTA  TOTAL_READS_PER_EXECUTION  ROWS_PROCESSED_DELTA  DISK_READS_DELTA  BUFFER_GETS  CPU_TIME
2/9/2012 22:00   0.186               4,572             33,631                     2,286,000             11                33,620       171,338
2/9/2012 22:15   0.188               4,632             33,240                     2,316,000             11                33,229       171,302
2/9/2012 22:30   0.19                4,545             33,574                     2,272,500             10                33,564       174,641
2/9/2012 22:45   0.182               4,762             33,027                     2,381,000             11                33,016       167,433

Total rows processed: 9,255,500


Segments by Logical Reads

  • Total Logical Reads: 159,380,402
  • Captured Segments account for 99.5% of Total
Owner   Tablespace Name  Object Name       Subobject Name   Obj. Type        Logical Reads  %Total
BGRENN  BGRENN_2009      FACT_BGRENN_DETL  ERD_BGRENN_2009  TABLE PARTITION  36,379,968     22.83
BGRENN  BGRENN_2010      FACT_BGRENN_DETL  ERD_BGRENN_2010  TABLE PARTITION  35,459,344     22.25
BGRENN  BGRENN_2011      FACT_BGRENN_DETL  ERD_BGRENN_2011  TABLE PARTITION  34,801,888     21.84
BGRENN  BGRENN_2008      FACT_BGRENN_DETL  ERD_BGRENN_2008  TABLE PARTITION  33,651,856     21.11
BGRENN  BGRENN_2007      FACT_BGRENN_DETL  ERD_BGRENN_2007  TABLE PARTITION  9,641,168      6.05

Sunday, January 29, 2012

Disk space layout on your Exadata

This blog post is a product of my last post on Exadata disk usage.

I have multiple Exadatas (both full racks and 1/2 racks), and I want to know exactly how each one is configured now that ACS has left.  How do I go about finding out how they are set up?

Well let's start with the basics.

Each Storage cell

  • Has 12 physical spinning disks.
  • The first 2 disks contain the OS, which utilizes ~29g of space.
  • The disks come in either 600g (SAS) or 2tb (SATA). The newer model now has 3tb (SATA).
  • Each cell contains 384g of flash cache, made up of four 96g F20 PCI cards.

Now let's log on to a storage cell and see how it is configured.

First go to cellcli, and look at the physical disks.
CellCLI> list physicaldisk
         20:0            R0DQF8          normal
         20:1            R1N71G          normal
         20:2            R1NQVB          normal
         20:3            R1N8DD          normal
         20:4            R1NNBC          normal
         20:5            R1N8BW          normal
         20:6            R1KFW3          normal
         20:7            R1EX24          normal
         20:8            R2LWZC          normal
         20:9            R0K8MF          normal
         20:10           R0HR55          normal
         20:11           R0JQ9A          normal
         FLASH_1_0       3047M04YEC      normal
         FLASH_1_1       3047M05079      normal
         FLASH_1_2       3048M052FD      normal
         FLASH_1_3       3047M04YF7      normal
         FLASH_2_0       3047M04WXN      normal
         FLASH_2_1       3047M04YAJ      normal
         FLASH_2_2       3047M04WTR      normal
         FLASH_2_3       3047M04Y9L      normal
         FLASH_4_0       3047M0500W      normal
         FLASH_4_1       3047M0503G      normal
         FLASH_4_2       3047M0500X      normal
         FLASH_4_3       3047M0501G      normal
         FLASH_5_0       3047M050XG      normal
         FLASH_5_1       3047M050XP      normal
         FLASH_5_2       3047M05098      normal
         FLASH_5_3       3047M050UH      normal


From this you can see that there are 12 physical disks (20:0 - 20:11) and 16 flash disks.
Now let's look at the detail for these 2 types of disks.  I will use the command


list physicaldisk {diskname} detail



CellCLI> list physicaldisk 20:0 detail
         name:                   20:0
         deviceId:               19
         diskType:               HardDisk
         enclosureDeviceId:      20
         errMediaCount:          0
         errOtherCount:          0
         foreignState:           false
         luns:                   0_0
         makeModel:              "SEAGATE ST32000SSSUN2.0T"
         physicalFirmware:       0514
         physicalInsertTime:     2011-09-20T10:19:00-04:00
         physicalInterface:      sata
         physicalSerial:         R0DQF8
         physicalSize:           1862.6559999994934G
         slotNumber:             0
         status:                 normal



This is what you would see for a SAS 600g disk
CellCLI> list physicaldisk 20:9 detail

         name:                   20:9
         deviceId:               17
         diskType:               HardDisk
         enclosureDeviceId:      20
         errMediaCount:          23
         errOtherCount:          0
         foreignState:           false
         luns:                   0_9
         makeModel:              "TEST ST360057SSUN600G"
         physicalFirmware:       0805
         physicalInsertTime:     0000-03-24T22:10:19+00:00
         physicalInterface:      sas
         physicalSerial:         E08XLW
         physicalSize:           558.9109999993816G
         slotNumber:             9
         status:                 normal


This is what the configuration of the FLASH drives looks like

CellCLI> list physicaldisk FLASH_5_0 detail
         name:                   FLASH_5_0
         diskType:               FlashDisk
         errCmdTimeoutCount:     0
         errHardReadCount:       0
         errHardWriteCount:      0
         errMediaCount:          0
         errOtherCount:          0
         errSeekCount:           0
         luns:                   5_0
         makeModel:              "MARVELL SD88SA02"
         physicalFirmware:       D20Y
         physicalInsertTime:     2011-09-20T10:20:17-04:00
         physicalInterface:      sas
         physicalSerial:         3047M050XG
         physicalSize:           22.8880615234375G
         sectorRemapCount:       0
         slotNumber:             "PCI Slot: 5; FDOM: 0"
         status:                 normal



So this gives me a good idea of what disks the storage is made up of.  In my case, you can see that the 12 disks are SATA, and they each contain 1862g of usable space.
In the case of SAS, you can see they contain 558g of usable space.

You can also see that the flash storage comprises 16 separate disks, connected through 4 PCI cards.  Each card contains four 22g flash disks.

For now (and the rest of this post), I will not talk about the flash. It is possible to use these cell disks, and provision them as usable storage, but I won't be discussing that.

Now that we have the physical disk layout, we can move to the next level.  First, to review.

We have 12 physical disks.  Each disk contains 1862.65 g of space. (22,352g/cell)

Now the next step is to look at the LUNs that were created out of the physical disks.  The LUN size is the amount of usable space left after the disks have been turned into block devices and presented to the server; you can see that only a small amount is lost.  Below is the output (truncated after the first 2 disks; I've then included a flash disk to show that detail).


CellCLI> list lun detail
         name:                   0_0
         cellDisk:               CD_00_tpfh1
         deviceName:             /dev/sda
         diskType:               HardDisk
         id:                     0_0
         isSystemLun:            TRUE
         lunAutoCreate:          FALSE
         lunSize:                1861.712890625G
         lunUID:                 0_0
         physicalDrives:         20:0
         raidLevel:              0
         lunWriteCacheMode:      WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
         status:                 normal

         name:                   0_1
         cellDisk:               CD_01_tpfh1
         deviceName:             /dev/sdb
         diskType:               HardDisk
         id:                     0_1
         isSystemLun:            TRUE
         lunAutoCreate:          FALSE
         lunSize:                1861.712890625G
         lunUID:                 0_1
         physicalDrives:         20:1
         raidLevel:              0
         lunWriteCacheMode:      WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
         status:                 normal

         name:                   2_2
         cellDisk:               FD_06_tpfh1
         deviceName:             /dev/sdab
         diskType:               FlashDisk
         id:                     2_2
         isSystemLun:            FALSE
         lunAutoCreate:          FALSE
         lunSize:                22.8880615234375G
         overProvisioning:       100.0
         physicalDrives:         FLASH_2_2
         status:                 normal



So from this you can see that we have 1861.7g of usable space on each drive, and you can see that the LUNs are given names that refer to the server.  In this case tpfh1 is the name of the storage cell, and this is included in the cellDisk name to easily identify the disk.

The next step is to take a look at the cell disks that were created out of these luns.

The item to note in this output is that the first 2 disks contain the OS.  You will see that the usable space left after the creation of the OS partitions is less than on the other disks.  The overhead for the cell software on each disk is also taken out (though it is a small amount).

Here is what we have next as celldisks.



CellCLI> list celldisk detail
         name:                   CD_00_tpfh1
         comment:
         creationTime:           2011-09-23T00:19:30-04:00
         deviceName:             /dev/sda
         devicePartition:        /dev/sda3
         diskType:               HardDisk
         errorCount:             0
         freeSpace:              0
         id:                     a15671cd-2bab-4bfe
         interleaving:           none
         lun:                    0_0
         raidLevel:              0
         size:                   1832.59375G
         status:                 normal

         name:                   CD_01_tpfh1
         comment:
         creationTime:           2011-09-23T00:19:34-04:00
         deviceName:             /dev/sdb
         devicePartition:        /dev/sdb3
         diskType:               HardDisk
         errorCount:             0
         freeSpace:              0
         id:                     de0ee154-6925-4281
         interleaving:           none
         lun:                    0_1
         raidLevel:              0
         size:                   1832.59375G
         status:                 normal

         name:                   CD_02_tpfh1
         comment:
         creationTime:           2011-09-23T00:19:34-04:00
         deviceName:             /dev/sdc
         devicePartition:        /dev/sdc
         diskType:               HardDisk
         errorCount:             0
         freeSpace:              0
         id:                     711765f1-90cc-4b53
         interleaving:           none
         lun:                    0_2
         raidLevel:              0
         size:                   1861.703125G
         status:                 normal


Now you can see the first 2 disks have 1832.6g available, and the remaining 10 disks have 1861.7g available (I didn't include the last 9 disks in the output).

So to review where we are: there are 12 physical disks, which are carved into LUNs, which then become cell disks.  These cell disks have (2 x 1832.6) + (10 x 1861.7) = 22,282g of raw disk available.

Now these disks get carved up into grid disks.  The grid disks are what is presented to ASM.  Let's see how my storage cell is carved up.  While looking at the output, notice that the cell disks are named CD_00_{cellname} through CD_11_{cellname}.  Here is a snippet.



CellCLI> list griddisk detail
         name:                   DATA_DMPF_CD_00_tpfh1
         availableTo:
         cellDisk:               CD_00_tpfh1
         comment:
         creationTime:           2011-09-23T00:21:59-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     2f72fb5a-adf5
         offset:                 32M
         size:                   733G
         status:                 active

         name:                   DATA_DMPF_CD_01_tpfh1
         availableTo:
         cellDisk:               CD_01_tpfh1
         comment:
         creationTime:           2011-09-23T00:21:59-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     0631c4a2-2b39
         offset:                 32M
         size:                   733G
         status:                 active
.......
.......
.......

        name:                   DATA_DMPF_CD_11_tpfh1
         availableTo:
         cellDisk:               CD_11_tpfh1
         comment:
         creationTime:           2011-09-23T00:22:00-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     ccd79051-0e24
         offset:                 32M
         size:                   733G
         status:                 active

         name:                   DBFS_DG_CD_02_tpfh1
         availableTo:
         cellDisk:               CD_02_tpfh1
         comment:
         creationTime:           2011-09-23T00:20:37-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     d292062b-0e26
         offset:                 1832.59375G
         size:                   29.109375G
         status:                 active

         name:                   DBFS_DG_CD_03_tpfh1
         availableTo:
         cellDisk:               CD_03_tpfh1
         comment:
         creationTime:           2011-09-23T00:20:38-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     b8c478a9-5ae1
         offset:                 1832.59375G
         size:                   29.109375G
         status:                 active

         name:                   DBFS_DG_CD_04_tpfh1
         availableTo:
         cellDisk:               CD_04_tpfh1
         comment:
         creationTime:           2011-09-23T00:20:39-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     606e3d69-c25b
         offset:                 1832.59375G
         size:                   29.109375G
         status:                 active
.....
.....
.....
         name:                   DBFS_DG_CD_11_tpfh1
         availableTo:
         cellDisk:               CD_11_tpfh1
         comment:
         creationTime:           2011-09-23T00:20:45-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     58af96a8-3fc8
         offset:                 1832.59375G
         size:                   29.109375G
         status:                 active

         name:                   RECO_DMPF_CD_00_tpfh1
         availableTo:
         cellDisk:               CD_00_tpfh1
         comment:
         creationTime:           2011-09-23T00:22:09-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     77f73bbf-09a9
         offset:                 733.046875G
         size:                   1099.546875G
         status:                 active

.....
.....
.....

         name:                   RECO_DMPF_CD_11_tpfh1
         availableTo:
         cellDisk:               CD_11_tpfh1
         comment:
         creationTime:           2011-09-23T00:22:09-04:00
         diskType:               HardDisk
         errorCount:             0
         id:                     fad57e10-414f
         offset:                 733.046875G
         size:                   1099.546875G
         status:                 active



Now by looking at this you can see that there are 3 sets of grid disks.

DATA - this is carved out of every disk, and contains 733g of storage.  It starts at offset 32m (the beginning of the disk).

RECO - this is also carved out of every disk, and contains 1099.5g of storage.  It starts at offset 733g.

DBFS - this is carved out of the last 10 disks (starting with disk 2) at offset 1832.59g, and it contains 29.1g.  I can only conclude this matches the size of the OS partition on the first 2 disks.

So now we are getting the picture: each cell disk is carved up in order, DATA first, followed by RECO, with the small DBFS piece at the end of the last 10 disks.

So here is what we have for sizing on each Storage cell.

DATA  -  8,796g
RECO - 13,194g
DBFS -       290g

Total   22,280

The thing to keep in mind with this number is that the OS partitions have caused us a bit of trouble.  There are only 10 of these DBFS grid disks per cell, and they are only 29g each.  If we pull those out, we have ~22tb of usable disk on each storage cell.

Now to figure out how much space is in each disk group (assuming these grid disks will all go directly into 3 disk groups).
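Once the grid disks have been dropped into disk groups, ASM can confirm the math.  A simple (hedged) check from any database or ASM instance:

```sql
-- Raw and free space per disk group; TYPE shows NORMAL or HIGH redundancy.
SELECT name,
       type,
       ROUND(total_mb / 1048576, 1) AS total_tb,
       ROUND(free_mb  / 1048576, 1) AS free_tb
FROM   v$asm_diskgroup
ORDER  BY name;
```

Note that TOTAL_MB here is raw space; the usable figures below still have to account for mirroring and the cell-failure reserve.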

The first thing to remember is the redundancy level.  Are the disk groups going to be normal redundancy (mirrored) or high redundancy (triple mirrored)?  With normal redundancy, a disk is made redundant with a disk on another cell.  With high redundancy, a disk is made redundant with 2 other disks on 2 other cells.  To maintain this level of redundancy through a cell failure, you must set aside 1 storage cell's worth of storage for normal redundancy, and 2 storage cells' worth for high redundancy, to ensure that you are completely protected.

So what does this mean for sizing?  The larger your array, the more usable disk you get.  With a half rack, you must set aside 1 out of 7 storage cells (normal) or 2 out of 7 storage cells (high) for redundancy.  For a full rack, you set aside 1 out of 14, or 2 out of 14, storage cells for redundancy.

Now lets run the numbers.


HALF RACK  -

Data -  Normal  (8,796g / 2) * 6 usable cells = 26,388g of usable space
            High       (8,796g / 3) * 5 usable cells = 14,660g of usable space

Reco - Normal (13,194g / 2) * 6 usable cells = 39,582g of usable space
           High      (13,194g / 3) * 5 usable cells = 21,990g of usable space

Dbfs - Normal (290g / 2) * 6 usable cells = 870g of usable space
           High      (290g / 3) * 5 usable cells = 483g of usable space

TOTAL usable (minus DBFS)
    Normal Redundancy - 65.9tb
    High Redundancy   - 36.6tb


FULL RACK -

Data - Normal (8,796g / 2) * 13 usable cells = 57,174g of usable space
           High (8,796g / 3) * 12 usable cells = 35,184g of usable space

Reco - Normal (13,194g / 2) * 13 usable cells = 85,761g of usable space
           High (13,194g / 3) * 12 usable cells = 52,776g of usable space

Dbfs - Normal (290g / 2) * 13 usable cells = 1885g of usable space
         High (290g / 3) * 12 usable cells = 1160g of usable space

TOTAL usable (minus DBFS)
    Normal Redundancy - 142.9tb
    High Redundancy   - 87.96tb


So the takeaways I get from this are:

There is a much higher cost for the higher redundancy level, and this cost is proportionally higher for smaller rack systems.
A certain portion of each cell is a small grid disk that exists on only 10 of the physical disks, and it is hard to utilize well.