Wednesday, September 28, 2011

partitioning Local vs Global

In my last post I talked about creating a function based index on a GTT after my query plans went to hell after partitioning.  Someone asked me to elaborate why my query plans went south.

Well to start with, I deal with very large tables.. Not terribly large (250g 2 billion rows).  We are in the processess of partitioning this data, so we can purge it.. The performance on the data is very good, but we keep eating up disk space.

Seems simple enough right ? partition by date ranges, with some hash partitions thrown in on the column used the most for lookups.  Nice and neat. At this point we have 116 partitions.  Smaller is better right.

Since the whole reason for doing this is being able to purge, we created local indexes on almost all the columns except for the primary key.  Being able to maintain the partitions is critical.

Doing all this I assumed we would be OK with local indexes. The application does index lookups, and the ones that don't use the primary key (or the hash partitioned key) are close to unique.

How long can a lookup take with an index and number of distict values = num_rows.  Easy..

Then the dbreplay came, and the queries were slower.. much slower.. plan was similar but buffer gets was off.
% Total Gets Gets per Exec #Executions Exec Time (ms) per Exec CPU Time (ms) per Exec I/O Time (ms) per Exec Physical Reads per Exec #Rows Processed per Exec #Plans   
SQL Id 1st 1st Total 2nd 2nd Total Diff 1st 2nd 1st 2nd 1st 2nd 1st 2nd 1st 2nd 1st 2nd 1st 2nd 1st/2nd/Both SQL Text 
d76xhcfh5dsrs 0.71 1.42 25.46 50.96 24.75 12,401.18 875,159.54 2,389 2,518 347 2,819 135 2,130 203 351 23.43 82.44 9.34 9.48 1/ 1/ 2 SELECT vpcyd_wrkr_id, vpcyd_cl... 

It's hard to see above, but the 2 critical values are 12,401 buffer gets vs 875,159 buffer gets.. and 347 ms/exec vs 2,819 ms/exex

Buffer gets was making a huge difference with my partitioned tables.. Now to dig into the trace file.

Here is the part that really stood out..147,157 cr (buffer block reads), to get 54 rows of data.. wow..
       545        545        545                     PARTITION RANGE ALL PARTITION: 1 29 (cr=147157 pr=19 pw=0 time=1142973 us cost=232 size=0 card=1)
       545        545        545                      PARTITION HASH ALL PARTITION: 1 4 (cr=147157 pr=19 pw=0 time=1126073 us cost=232 size=0 card=1)
       545        545        545                       INDEX RANGE SCAN PIDX_CUST_ID PARTITION: 1 116 (cr=147157 pr=19 pw=0 time=1032020 us cost=232 size=0 card=

I isolated this lookup, and found that it was a "unique" key (it had no duplicate values).. Why would 545 rows of data take all that time? (this was where the time was going).

I created a small query, and did a index lookup for one row and compared partitioned vs non-partitioned.

Partitioned     gz67xt981w53p  3,540,849,128       7,323          6  472,928      642,455  625,819       0 168,862 
Non-Partitioned gz67xt981w53p    791,655,517          32          6    3,999        4,473    2,847       0 0 


Comparing the index partitioned vs non-partitioned, (with 116 subpartitions), you can see the difference. 3,999 ms vs 472,928 ms .  What caused me the biggest issue is that I didn't realize it was doing a nested loop, 54 times.. this made the difference 36,000 ms vs 4,256,352 ms.  116x longer with a local partion vs global. 

Lesson learned was that with partitioning you need to balance performance with maintainability.. Local indexes can be very expensive.  Especially with nested loops.

Tuesday, September 27, 2011

Why are the developers using functions ?

Hi all,

  I have been working all week on trying to figure out why a query went to hell when we partitioned the tables.  I dug into it, and found one good fix.. But  I can't implement it.

The detail on what happened in my last post.. Keep in mind I found that issue, but working through this one, and moving the bottleneck.

Here is the problem.. They are joining to a GTT (global temporary table), but they are using a function on the column in the table. ARGH.. They are making it impossible for the optimizer to find the best plan.

Here is an example of what's happening...

First here the GTT

  tmp_strt_dt  date,
  tmp_end_dt   date
) ON COMMIT preserve ROWS;

Here is table and lets load 128 rows of data into it.

create table test_table
(  strt_dt   date,
   end_dt    date,
   col1      varchar(1));

insert into test_table values(sysdate-1000,sysdate+1000,'Y');
insert into test_table select * from test_table;
insert into test_table select * from test_table;
insert into test_table select * from test_table;
insert into test_table select * from test_table;
insert into test_table select * from test_table;
insert into test_table select * from test_table;
insert into test_table select * from test_table;

Now lets insert into the temporary table, and analyze both tables.

insert into my_temp_table values(sysdate,sysdate);

exec dbms_stats.gather_table_stats(ownname=> null, tabname=> 'MY_TEMP_TABLE',estimate_percent=>null, cascade=>true, method_opt=> 'FOR ALL COLUMNS SIZE 1');
exec dbms_stats.gather_table_stats(ownname=> null, tabname=> 'TEST_TABLE',estimate_percent=>null, cascade=>true, method_opt=> 'FOR ALL COLUMNS SIZE 1');

Now for my query ..

select * from my_temp_table ,test_table 
where "END_DT">=TRUNC("TMP_STRT_DT") AND                                                                              

and the explain plan.. Notice the cardinality of 1, though there are 128 rows that match

Plan hash value: 1231029307

| Id  | Operation          | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT   |               |     1 |    34 |     4   (0)| 00:00:01 |
|   1 |  NESTED LOOPS      |               |     1 |    34 |     4   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| MY_TEMP_TABLE |     1 |    16 |     2   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| TEST_TABLE    |     1 |    18 |     2   (0)| 00:00:01 |

Predicate Information (identified by operation id):


So what to do ??? I removed the trunc function, and the cardinality was right...

select * from my_temp_table ,test_table 
where "END_DT">="TMP_STRT_DT" AND                                                                              

Plan hash value: 1231029307

| Id  | Operation          | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT   |               |   128 |  4352 |     4   (0)| 00:00:01 |
|   1 |  NESTED LOOPS      |               |   128 |  4352 |     4   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| MY_TEMP_TABLE |     1 |    16 |     2   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| TEST_TABLE    |   128 |  2304 |     2   (0)| 00:00:01 |

Predicate Information (identified by operation id):

   3 - filter("END_DT">="TMP_STRT_DT" AND "STRT_DT"<="TMP_END_DT")

Unfortunately, I can't change the code.. How do I get the optimizer to get the right cardinality ?? Function based indexes to the rescue. Here is what I did.. First create the indexes on the 2 columns.

create index my_temp_table_fbi1 on my_temp_table(TRUNC("TMP_STRT_DT"));
create index my_temp_table_fbi2 on my_temp_table(TRUNC("TMP_END_DT"));

Next insert into the table, and gather stats.. Notice that I am using "hidden" column clause.

insert into my_temp_table values(sysdate,sysdate);

exec dbms_stats.gather_table_stats(ownname=>null, tabname=> 'MY_TEMP_TABLE',estimate_percent=>null, cascade=>true, method_opt=> 'FOR ALL HIDDEN COLUMNS SIZE 1');

Now to run my query and look at the cardinality.

elect * from my_temp_table ,test_table 
where "END_DT">=TRUNC("TMP_STRT_DT") AND                                                                              

Plan hash value: 1231029307

| Id  | Operation          | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT   |               |   128 |  6400 |     4   (0)| 00:00:01 |
|   1 |  NESTED LOOPS      |               |   128 |  6400 |     4   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| MY_TEMP_TABLE |     1 |    32 |     2   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| TEST_TABLE    |   128 |  2304 |     2   (0)| 00:00:01 |

Predicate Information (identified by operation id):


Notice that the index is not used for the query plan, but by having the index, and gathering statistics, the optimizer is able to figure out the correct cardinality even though a function is used for the column. Problem solved without changing the query.

As always, you can find my script here

Saturday, September 24, 2011

My LIO silly little benchmark

I have been working on a benchmark for LIO.  I know there are TPC and TPH transactions numbers that are published, on CPU speeds, but how much does that directly releate to LIO's, the heart of an Oracle database ?

  To help benchmark, I wrote a little pl/sql package.  This packages takes the Zip Code database, and randomly picks  some rows with a cursor (about 1 % of the table).  This package is then called by swing bench, and I put a "think time" in it for each execution of the package.

Ideally, I try to execute it up to what the Server can handle.. This was especially usefull with the benchmarking I did in a previous post on hyperthreading.

I was interested in what anyone else does ?  I try to do a LIO lookup, and compare numbers between servers.  By doing this I have a pretty good idea how many LIO's an AMD server can handle per second, an Intel server can do, and different architectures (2 socket, 4 socket, and 8 socket).. I even benchmark virutalization to see how much of an overhead is caused from the Software.

This may not be the best way (it excludes what happens with updates (redo logs etc), and how much physical I/O's affect the workload.

Any ideas would be appreciated.  I would love to come up with a nice reproducable benchark, and then maybe create a dbcapute of it, and do a dbreplay on different architectures ? Would that be more accurate.

I know many of you will say the line "well it depends on the workload", maybe the benchmarking that comes with swingbench is good enough ??

I'm just tired of reading server bencharks, and finding that for an oracle database, those benchmarks aren't very meaningful.

I would also love to do some benchmarking with Solaris X86, and RHEL/OEL on an 8 socket box.

I would also love to learn what anyone else has learned ?  I am especially interested how 8 socket intel servers compare with 2 socket. I'm seeing some pretty increadable numbers from 2 socket servers (almost 2x the speed of 8 socket).  I'm wondering if anyone else is seeing some measureable differences.

I'm starting to move to "go wide"  camp rather than go high camp for increasing server power.  The blade servers are being more, and more powerful, and you can have more memory local to the CPU. Increasing CPU sockets just increases hops to get those LIO's done, costing time, waits, latches. etc. etc.

So here is a piece of my LIO benchmark...


/*  import 55,000 rows of distinct data */
CREATE PROCEDURE          kill_lio IS
   my_count number := 1;
   my_executions number;
   my_buffer_gets number
   my_cpu_time number;
   my_elapsed_time number;

error_code number;

for i in 1..10000 LOOP

select count(distinct cc_id) into my_count from kill_lio.killer;

end loop;

select executions,buffer_gets,cpu_time,elapsed_time into my_executions,my_buffer_gets,my_cpu_time,my_elapsed_time 
from sys.v_$sqlstats where sql_id='2j5tvp5rdzmym';
 dbms_output.put_line('exectutions:                          ' || to_char(my_executions,'999,999,999'));"
dbms_output.put_line('buffer gets:                          ' || to_char(my_buffer_gets,'999,999,999'));"
dbms_output.put_line('cpu time:                             ' || to_char(my_cpu_time,'999,999,999'));"
dbms_output.put_line('elapsed time:                         ' || to_char(my_elapsed_time,'999,999,999'));"
dbms_output.put_line('elapsed time per execution(ms)   :      ' ||to_char( my_elapsed_time/my_executions/1000,'999,999.9'));"
dbms_output.put_line('buffer_gets/second:                   ' ||to_char( my_buffer_gets/(my_elapsed_time/1000000),'999,999,999'));"

END;  -- exception handlers

and here is the output I use to compare.  I look at the average elapsed time, and buffer_gets/second to benchark systems.

executions:                                 10,000
buffer gets:                             1,190,000
cpu time:                              190,983,974
elapsed time:                          191,374,061
elapsed time per execution(ms)    :           19.1
buffer_gets/second:                          6,218

Here is the AWR report from the execution

Wednesday, September 21, 2011

Oracle Database Appliance

I have spent the day on an oracle call, and reading all the subsequent tweets that follow.  I think the best way to describe the appliance is that it is NOT a mini-exadata, but it is a simple rac appliance.

My impression is that it is a nice product for the small to mid market, but those us working with the bigger toys I don't see the gain.

I know, I've spent more days than I care to remember schooling the SA's on how to set up an interconnect, and ensure that all the IP's are correct.  I've worked with Storage administrators on how to present the disks, and make them available to ASM, and I've worked with networking on the ranges of IP's I need for scan, interconnect, etc. etc.  I'm sure you get the picture.

I also think that people like me that work in a big organization and have a team to handle these tasks, are probably going huh ? what is this? 

Personally, I don't see the big deal in this.. I see lots of dissadvantages.

  • These Appliances cannot be clustered. What they have in them is all they will ever have in them.
  • The 2 database nodes have 96g of memory, not a lot in today server sizes..
  • There is no storage software like the exadata. No HCC, no offloading, no infiniband
  • This is local disk in the appliance, meaning no cloning, storage virtualization, etc.
  • The interconnect is 1ge, not infiniband.
  • You CANNOT hook up fiber to this server, ever.
  • It runs OEL, NOT redhat linux.. the differences are getting greater over time.
  • This is a closed system with specific patch sets that need to be maintained to a short list of acceptable patches.
I know for a small, to midsize, the ideal of creating a new rac system in 2 hours is thing of beauty, but for bigger companies, there isn't a lot there.

Especially without the Exadata candy filling (infiniband, HCC, offloading, storage indexes).

I still think virtualiation is the direction, and this is a step in the opposite direction.  There may be a few takers, but I think companies will realize that virtualization is a better direction than a single closed appliance.

We will see.. just some thoughts.

Saturday, September 17, 2011

Delphix, Solaris and ZFS

I was inspired to write a new blog posting after seeing a great presentation on Delphix.  Delphix is an extremely interesting product to help with the managing of many copies of an oracle database.  This product goes hand-in-hand with a virtualization product like VMWARE. What VMWare is servers, Delphix is to storage.

First I want to babble on a bit.. Why ? Because this is my blog.

I have used Solaris for many years and I am a big fan of Sparc/Solaris, and everything that it became over the years.  I know in the last couple of years, it has kind of fallen down, and Linux x86 has been growing leaps and bounds.  The most impressive piece of Solaris however has been ZFS.  ZFS offers snapshotting, cloning, lots of fantastic options built into the OS.  All very easy for the DBA.

I was very excited when ACFS came out (anyone remember).  It was the coolest thing going.  It looked a lot like ZFS built on top of ASM.   It had the potential to become a fantastic tool.  I was really, really, really hoping that down the road ACFS, and ZFS would combine, and replace the cooked file system with an Oracle File System that Rocked.    Well in case you didn't know ACFS has morphed into the "Cloud File System" or Cloud FS.  This is a licensed product used for the sharing of directories in a "cloud" environment.  Also, if you didn't know, ACFS won't run on the Exadata.  2 nails in the Coffin of this fantastic vision that I had in my head.

I was devastated for a while.

Why the background ? Because Delphix picked up where this vision died.  They offer the ability to do snapshots, along with deduplication, and compression to boot.

Delphix was founded by Jedidiah Yueh, who founded Avamar (dedup software that was sold to EMC),  The company also has  Karthik Rau, ad VP of Products who worked on VMware.  Lots of very bright people that were already involved in the founding technology.

Delphix makes the creation of database copies, easy, and it save space to boot.  I will post the presentation that I saw once I can link to it.

Definitely something worth checking out, and to watch for in the infrastructure Arsenal.

Wednesday, September 14, 2011

My quest for a performance warehouse

I have written on this before, and at openworld last year I've talked to some of the presenters about this.

My goal is to to create a performance data warehouse.  What does this mean ?

I want to take all the AWR data from all the databases in the corporate infrastructure, and create one giant repository. 

Well I just got one step closer.

I have a repository server set up already. It is just an database with lots of space (about 3 tb), running on a Linux blade server.

My first test was to take all the AWR data from an 10g database we just retired running on HPUX. Since this database was moved to Linux (cross endian), transportable tablespaces was used, and all the AWR data was not brought over.

First step.. Get the data out of original source.

In order to do that you can use awrextr.sql in the $ORACLE_HOME/rdbms/admin directory.  I noticed (and found from other blogs) that this just calls sys.dbms_swrf_internal.awr_extract.



  1. Create user "awrextract" identified by xxxxx
  2. grant connect ot awrextract
  3. grant execute on sys.dbms_swrf_internal to awrextract
  4. Login as awrextract
  5. execute awr_extract procedure
exec sys.dbms_swrf_internal.awr_extract(dmpfile=>'awr_data',dmpdir=>'tmp_dir',bid=>9920,eid=>9925);

  • The procedure automagically puts a .dmp on the dump file name
  • You need to know the start and end snapshot you want to export
  • The dmpdir needs to be a precreated directory and it has to be a cooked filesystem (non-asm).  I found that it also puts the logfile here, and that's why it has to be cooked.
  • You cannot use dbms_file_transfer, especially going across endian types
 6. copy the file to my repository server

STEPS on the Destination

  1.  Create schema to import the data into
  2. Import the file in
DBMS_SWRF_INTERNAL.AWR_LOAD ( SCHNAME => 'AWR', dmpfile => 'awr_data', dmpdir => 'TMP_DIR');

  3.  Once imported move the awr data to sys catalogs



All this worked flawlessly, and then I tried to run COE_PROFILE.

IF you didn't know COE_PROFILE.sql  is part of sqlt and allows you to transfer profiles from one database to another by creating a script to build the profile.. It all works from my repository.

I will write about this again, but I am physched to finally get this off the ground.

Monday, September 12, 2011