Bryan's Oracle Blog

Sunday, November 6, 2011

Fancy new Disk array technology

Well, first off, I need to put a big disclaimer down. These are my opinions, and my opinions only. They do not reflect the opinions of my employer, my spouse or my dog.

I was watching some twitter updates go by and this blogpost caught my eye.
http://chucksblog.emc.com/chucks_blog/2011/10/shifts-happen.html

This blog was talking about new disk technology, and part of it covered the idea of FAST technology. If you haven't heard of FAST (this is the EMC name; I'm sure other vendors have their own flavors), it is disk technology that moves blocks to the best tier of storage automagically. Really! The idea is that you buy an array with 3 different tiers of disk: Flash, Fibre Channel, and SATA. The disk array learns the access patterns for the data, and moves the data to the appropriate tier. Sounds great, right? It does make sense.
Let's take an example. Let's say that you are a supplier and you supply parts for 100,000 small businesses. You keep historical data on their orders for 5 years for reference. Whenever they place a new order, you reference their latest orders to find patterns.

So following this workload you can guess what happens. The current data for your customers stays in Fibre Channel (everything starts in Fibre Channel), the old data gets migrated to SATA, and your customer master data will most likely go to Flash. All well and good. Even though customers only order every month, their recent activity gets moved to a higher tier of disk, and all that old history gets moved to SATA.

Now let's throw in a physical standby with Data Guard.

With Data Guard, we are writing the new blocks of history, and they are never read (this is a cold standby). If you mix this data with other applications that are busy, all the data for the standby database is surely going to end up in SATA over time. This makes perfect sense to the array's algorithms: this historical data (or even current data) isn't accessed. For your standby, SATA it is!

Bang... Sinking feeling.... wham.. You do a failover.

Now let's see what happens. All your data is in SATA. You are now accessing it, and trying to give your customers the same performance they are used to. Your system is slow. You have 100,000 businesses that access data over the course of the month. How long do you think it takes to move all the data from SATA to Flash or Fibre Channel? It could take quite a while for your system to learn the new patterns, and during this time your old primary (now the standby) has its data pattern changing. Its data is getting migrated to SATA. You stay in your alternate site for a month, fail back, and guess what: WHAM again. The disk array has to learn the pattern all over again.

As I said, this is all conjecture, and solely my opinion.

Configuring an Exadata (follow up)

Now that we have the exadata and it is up and running, we are working on getting it configured for ease of maintenance. I know there are some notes from metalink that can be helpful.

The first thing I wanted to do was get the machine (and all the hardware) configured with OCM (Oracle Configuration Manager). Like most things with the Exadata, there is a special configuration piece for this called the "Mass Deployment Kit". Here is a link for the latest information on it.

On MOS [ID 1319476.1]

I am still in the process of getting this configured by using the Oracle Support Hub (or repeater). A lot of this information is contained in the PDF mentioned in the My Oracle Support note. As you can imagine, the Exadata is usually installed in a company's core infrastructure, far behind any firewalls. Connecting directly out to the internet isn't always possible, so setting up a repeater (like a proxy) as part of Grid (or separately) will help get your configuration information sent up to Oracle Support.

The second item is Grid/Cloud 12c. I have to say that I set up Cloud 12c for the Exadata about 48 hours after it came out. It was relatively easy. You just add the database nodes (and push out the agents); then, once the database nodes are done, you use the tools in Grid to walk you through discovering all the components (starting with one of the database nodes). It all worked well, and there are some notes on this now: Oracle Enterprise Manager Cloud Control 12c Setup Automation kit for Exadata.

So the Exadata is close to being set up. I believe setting up OCM is one of the most challenging things. One of the first steps is to create a spreadsheet with the configuration information. After that come the steps from the documentation. My one complaint is that a lot of the information for the OCM configuration is the same information provided to the "one" script. I am hoping that down the road the ACS group (or whoever does the configuration) also configures OCM, or at least provides the input for it. OCM isn't necessary, but I think having it configured will save a lot of time when we need to open an SR.

From the manual......

**************************************************************

Use your favorite spreadsheet editor to create the input csv file. To facilitate the use of the input
file, the Mass Deployment document contains a template for you to use in providing the field
values (ocm_companion/distributions/ocm/md/sample_input.csv). See Section 2.4.5 “Input File”
in Mass Deployment documentation for details on the input file format.

Much of the information required as input into Mass Deployment can be retrieved from the
Exadata Database Machine configuration worksheets. Please see Appendix A for examples.

1. Copy/rename the sample_input.csv file (e.g., getinfo_exadata_csi_input.csv). This file
can be used as a template for entering the data for each host on which OCM will be
deployed and/or configured. Add information for all the compute nodes as listed below.

a. Action: Set this column to “get_info” to retrieve information about the state of the
OCM collector in all the Exadata Database (compute node) Oracle homes.

b. Host-Name: Host name of the node.

c. Host-User: OS user that owns the Oracle home.

d. Host-Password: Password for the OS user - set to “__PROMPT__” (two
underscores before and after). See Section 2.4.4 Credentials in the Mass
Deployment Documentation or Appendix B of this document for secure ways of
providing the password. If the same credentials are being used for multiple
hosts, another option is to use a password group name in the password.csv file
as described in Section 2.4.1 of the Oracle Configuration Manager Companion
Distribution Guide.

e. Oracle Home: Oracle Database home location.

f. Db SID: Set the Database SID for the last database host in the input file. This is
required for Mass Deployment to instrument the database for configuration data
collections. This script need only be run on one of the database hosts, but must
be run after the last server is installed.

g. DB Type: Set to 'db' for the last database host in the input file. Specify only for
Install and Instrumentation actions.

h. ML-User: Enter the customer's MOS Account username (email address).

i. ML-CSI: This field holds the Exadata Hardware Customer Support Identifier
(CSI) and can be used in conjunction with the ML-User field to authenticate OCM
uploads. If the CSI is not known, see Appendix B.

j. ML-Pwd: Leave it as blank (should only be used if the CSI is not known).

k. DB-user: Database username required to instrument the database.

l. DB-Pwd: Database user password

Friday, October 14, 2011

Grid control 12c

I've been spending my week playing with Grid Control 12c. I know it has only been out just over a week, but I was very excited to see if it is that much better than Grid 11g. My company is currently rolling out Grid 11g, and I wanted to see if we should be pushing for Grid 12c right on its heels.

I am extremely impressed with this product, so much so that I set up a virtual environment with Grid 12c to check it out.

I've spent the last couple of days getting my Exadata configured in Grid 12c. After a couple of false starts (and reinstalling the agent) I finally got it up and running. These are my lessons learned:

First, discover your database nodes, and make sure the name you use is the default fully qualified name.
Then add the database machine as a target, and make sure you have all the passwords, including the one for nm2user on the IB switches (password is changeme). You also need the id and password for the PDU (admin/admin).

Once you get these all set, Grid 12c will recognize your machine, and you will see wondrous things. Here are 2 example screens from an Exadata.

The first one shows the IB traffic through the switch,

The second one shows the combined load on the Storage cells.

Even if you don't have an exadata, here is my favorite ASH analytics. Notice the timeframe is very small.

It is definitely worth checking out.

Monday, October 3, 2011

10 x 10=100. Larryisms from OOW11

I wanted to put down my impressions of the big announcements at OpenWorld this year.

First is Exalytics: analytics at the speed of thought. This is an intriguing product, and I can definitely see the uses for applications where real-time analytics is key. I think for most of us this appliance is going to be out of our range. I don't know of any business cases for it myself. No price was mentioned either.

Second was EM Grid 12c. Now this was pretty impressive. I was surprised at all the enhancements that were put into it. It really seems to do a nice job of centralizing cloud support. I was especially impressed with the virtualization pieces. The provisioning and support of virtual environments is a great component. It is also a very big carrot for those companies that are turning to virtualization and aren't sure whether to choose VMware or Oracle VM.

Third is the Big Data Appliance. This one I am waiting to see more specs on. It looks interesting, but what is the licensing model? I can't believe that there is no software licensing (other than the OS pieces).

All of these are interesting announcements, but I think the 12c features were the most interesting to me. If only the documentation were available, I would install it right now.

Wednesday, September 28, 2011

partitioning Local vs Global

In my last post I talked about creating a function based index on a GTT after my query plans went to hell after partitioning. Someone asked me to elaborate why my query plans went south.

Well, to start with, I deal with very large tables. Not terribly large (250 GB, 2 billion rows). We are in the process of partitioning this data so we can purge it. The performance on the data is very good, but we keep eating up disk space.

Seems simple enough, right? Partition by date ranges, with some hash partitions thrown in on the column used the most for lookups. Nice and neat. At this point we have 116 partitions. Smaller is better, right?

Since the whole reason for doing this is being able to purge, we created local indexes on almost all the columns except for the primary key. Being able to maintain the partitions is critical.

Doing all this I assumed we would be OK with local indexes. The application does index lookups, and the ones that don't use the primary key (or the hash partitioned key) are close to unique.

How long can a lookup take with an index where the number of distinct values = num_rows? Easy.

Then the dbreplay came, and the queries were slower, much slower. The plan was similar, but buffer gets were way off.

SQL Id: d76xhcfh5dsrs    SQL Text: SELECT vpcyd_wrkr_id, vpcyd_cl...

Metric                               1st           2nd
--------------------------  ------------  ------------
% Total Gets                        0.71         25.46
% Total Gets (Total)                1.42         50.96
% Total Gets (Diff)                24.75
Gets per Exec                  12,401.18    875,159.54
#Executions                        2,389         2,518
Exec Time (ms) per Exec              347         2,819
CPU Time (ms) per Exec               135         2,130
I/O Time (ms) per Exec               203           351
Physical Reads per Exec            23.43         82.44
#Rows Processed per Exec            9.34          9.48
#Plans (1st/2nd/Both)              1/1/2

The 2 critical values above are 12,401 buffer gets vs 875,159 buffer gets per execution, and 347 ms/exec vs 2,819 ms/exec.

Buffer gets was making a huge difference with my partitioned tables.. Now to dig into the trace file.

Here is the part that really stood out..147,157 cr (buffer block reads), to get 54 rows of data.. wow..

       545        545        545                     PARTITION RANGE ALL PARTITION: 1 29 (cr=147157 pr=19 pw=0 time=1142973 us cost=232 size=0 card=1)
       545        545        545                      PARTITION HASH ALL PARTITION: 1 4 (cr=147157 pr=19 pw=0 time=1126073 us cost=232 size=0 card=1)
       545        545        545                       INDEX RANGE SCAN PIDX_CUST_ID PARTITION: 1 116 (cr=147157 pr=19 pw=0 time=1032020 us cost=232 size=0 card=

I isolated this lookup, and found that it was a "unique" key (it had no duplicate values).. Why would 545 rows of data take all that time? (this was where the time was going).

I created a small query, and did a index lookup for one row and compared partitioned vs non-partitioned.

                 SQL_ID         PLAN_HASH_VALUE  BUFFER_GETS  EXECUTIONS  CPU_TIME  ELAPSED_TIME  AVG_HARD_PARSE_TIME  APPLICATION_WAIT_TIME  CONCURRENCY_WAIT_TIME
Partitioned      gz67xt981w53p    3,540,849,128        7,323           6   472,928       642,455              625,819                      0                168,862
Non-Partitioned  gz67xt981w53p      791,655,517           32           6     3,999         4,473                2,847                      0                      0

Comparing the partitioned index vs the non-partitioned one (with 116 subpartitions), you can see the difference: 3,999 ms vs 472,928 ms. What caused me the biggest issue is that I didn't realize it was doing a nested loop 54 times. This made the difference 36,000 ms vs 4,256,352 ms: roughly 116x longer with a local partitioned index vs a global one.

Lesson learned was that with partitioning you need to balance performance with maintainability.. Local indexes can be very expensive. Especially with nested loops.
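
To make the trade-off concrete, here is a minimal sketch of the two index choices. The table, column, partition and index names (orders, cust_id, p_2006_q1 and so on) are made up for illustration; they are not the real schema from this post. The point is only that a local index on a column that cannot prune partitions gets probed once per (sub)partition, while a global index is probed once but has to be maintained when a partition is dropped.

```sql
-- Hypothetical names throughout; a sketch, not the production DDL.

-- Option 1: local index. One index segment per (sub)partition, so dropping a
-- partition is cheap, but a lookup on cust_id that cannot prune partitions
-- probes every segment (116 probes on a table with 116 subpartitions).
CREATE INDEX orders_cust_lidx ON orders (cust_id) LOCAL;

-- Option 2: a global (non-partitioned) index on the same column instead.
-- A single B-tree means one probe per lookup, like the non-partitioned table.
DROP INDEX orders_cust_lidx;
CREATE INDEX orders_cust_gidx ON orders (cust_id);

-- The cost of option 2 shows up at purge time: the global index has to be
-- maintained as part of the drop, or it is left UNUSABLE and must be rebuilt.
ALTER TABLE orders DROP PARTITION p_2006_q1 UPDATE GLOBAL INDEXES;
```

In the single-row test above, that is the difference between 32 and 7,323 buffer gets for the same six executions.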

Tuesday, September 27, 2011

Why are the developers using functions ?

Hi all,

I have been working all week on trying to figure out why a query went to hell when we partitioned the tables. I dug into it, and found one good fix.. But I can't implement it.

The details of what happened are in my last post. Keep in mind that I found that issue, but I am still working through this one; fixing one problem just moved the bottleneck.

Here is the problem.. They are joining to a GTT (global temporary table), but they are using a function on the column in the table. ARGH.. They are making it impossible for the optimizer to find the best plan.

Here is an example of what's happening...

First, here is the GTT:

CREATE GLOBAL TEMPORARY TABLE my_temp_table (
  tmp_strt_dt  date,
  tmp_end_dt   date
) ON COMMIT preserve ROWS;

Here is the table. Let's load 128 rows of data into it.

create table test_table
(  strt_dt   date,
   end_dt    date,
   col1      varchar(1));

insert into test_table values(sysdate-1000,sysdate+1000,'Y');
insert into test_table select * from test_table;
insert into test_table select * from test_table;
insert into test_table select * from test_table;
insert into test_table select * from test_table;
insert into test_table select * from test_table;
insert into test_table select * from test_table;
insert into test_table select * from test_table;
commit;

Now let's insert into the temporary table, and analyze both tables.

insert into my_temp_table values(sysdate,sysdate);

exec dbms_stats.gather_table_stats(ownname=> null, tabname=> 'MY_TEMP_TABLE',estimate_percent=>null, cascade=>true, method_opt=> 'FOR ALL COLUMNS SIZE 1');
exec dbms_stats.gather_table_stats(ownname=> null, tabname=> 'TEST_TABLE',estimate_percent=>null, cascade=>true, method_opt=> 'FOR ALL COLUMNS SIZE 1');

Now for my query ..

select * from my_temp_table ,test_table 
where "END_DT">=TRUNC("TMP_STRT_DT") AND                                                                              
      "STRT_DT"<=TRUNC("TMP_END_DT");

and the explain plan.. Notice the cardinality of 1, though there are 128 rows that match

Plan hash value: 1231029307

------------------------------------------------------------------------------------
| Id  | Operation          | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |               |     1 |    34 |     4   (0)| 00:00:01 |
|   1 |  NESTED LOOPS      |               |     1 |    34 |     4   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| MY_TEMP_TABLE |     1 |    16 |     2   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| TEST_TABLE    |     1 |    18 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------


PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("END_DT">=TRUNC(INTERNAL_FUNCTION("TMP_STRT_DT")) AND
              "STRT_DT"<=TRUNC(INTERNAL_FUNCTION("TMP_END_DT")))

So what to do ??? I removed the trunc function, and the cardinality was right...

select * from my_temp_table ,test_table 
where "END_DT">="TMP_STRT_DT" AND                                                                              
      "STRT_DT"<="TMP_END_DT";

Plan hash value: 1231029307

------------------------------------------------------------------------------------
| Id  | Operation          | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |               |   128 |  4352 |     4   (0)| 00:00:01 |
|   1 |  NESTED LOOPS      |               |   128 |  4352 |     4   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| MY_TEMP_TABLE |     1 |    16 |     2   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| TEST_TABLE    |   128 |  2304 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("END_DT">="TMP_STRT_DT" AND "STRT_DT"<="TMP_END_DT")

Unfortunately, I can't change the code.. How do I get the optimizer to get the right cardinality ?? Function based indexes to the rescue. Here is what I did.. First create the indexes on the 2 columns.

create index my_temp_table_fbi1 on my_temp_table(TRUNC("TMP_STRT_DT"));
create index my_temp_table_fbi2 on my_temp_table(TRUNC("TMP_END_DT"));

Next, insert into the table and gather stats. Notice that I am using the "hidden" columns clause.

insert into my_temp_table values(sysdate,sysdate);

exec dbms_stats.gather_table_stats(ownname=>null, tabname=> 'MY_TEMP_TABLE',estimate_percent=>null, cascade=>true, method_opt=> 'FOR ALL HIDDEN COLUMNS SIZE 1');

Now to run my query and look at the cardinality.

select * from my_temp_table ,test_table 
where "END_DT">=TRUNC("TMP_STRT_DT") AND                                                                              
      "STRT_DT"<=TRUNC("TMP_END_DT");

Plan hash value: 1231029307

------------------------------------------------------------------------------------
| Id  | Operation          | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |               |   128 |  6400 |     4   (0)| 00:00:01 |
|   1 |  NESTED LOOPS      |               |   128 |  6400 |     4   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| MY_TEMP_TABLE |     1 |    32 |     2   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| TEST_TABLE    |   128 |  2304 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("END_DT">=TRUNC(INTERNAL_FUNCTION("TMP_STRT_DT")) AND
              "STRT_DT"<=TRUNC(INTERNAL_FUNCTION("TMP_END_DT")))

Notice that the index is not used for the query plan, but by having the index, and gathering statistics, the optimizer is able to figure out the correct cardinality even though a function is used for the column. Problem solved without changing the query.
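
If you want to confirm what the optimizer has to work with, the data dictionary shows the hidden virtual columns that back the function-based indexes. This is a small sketch using the standard USER_TAB_COLS and USER_IND_EXPRESSIONS views; the system-generated hidden column names will differ on your system.

```sql
-- Hidden virtual columns created for the two function-based indexes,
-- together with the statistics gathered by FOR ALL HIDDEN COLUMNS.
SELECT column_name, hidden_column, virtual_column, num_distinct, last_analyzed
FROM   user_tab_cols
WHERE  table_name = 'MY_TEMP_TABLE'
ORDER  BY internal_column_id;

-- The expression behind each function-based index.
SELECT index_name, column_expression
FROM   user_ind_expressions
WHERE  table_name = 'MY_TEMP_TABLE';
```

If the hidden columns show a LAST_ANALYZED date, the optimizer has statistics on the TRUNC() expressions themselves, which is what lets it estimate the cardinality without the query being changed.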

As always, you can find my script here

Saturday, September 24, 2011

My LIO silly little benchmark

I have been working on a benchmark for LIO. I know there are published TPC-C and TPC-H transaction numbers and CPU speeds, but how directly do they relate to LIOs, the heart of an Oracle database?

To help benchmark, I wrote a little PL/SQL package. This package takes the Zip Code database, and randomly picks some rows with a cursor (about 1% of the table). The package is then called by Swingbench, and I put a "think time" in it for each execution of the package.

Ideally, I try to drive it up to what the server can handle. This was especially useful with the benchmarking I did in a previous post on hyperthreading.

I am interested in what everyone else does. I try to do a LIO lookup, and compare numbers between servers. By doing this I have a pretty good idea how many LIOs per second an AMD server can handle, what an Intel server can do, and how different architectures compare (2 socket, 4 socket, and 8 socket). I even benchmark virtualization to see how much overhead is caused by the software.

This may not be the best way: it excludes what happens with updates (redo logs etc.), and how much physical I/O affects the workload.

Any ideas would be appreciated. I would love to come up with a nice reproducible benchmark, and then maybe create a dbcapture of it, and do a dbreplay on different architectures. Would that be more accurate?

I know many of you will say the line "well it depends on the workload", maybe the benchmarking that comes with swingbench is good enough ??

I'm just tired of reading server benchmarks, and finding that for an Oracle database, those benchmarks aren't very meaningful.

I would also love to do some benchmarking with Solaris X86, and RHEL/OEL on an 8 socket box.

I would also love to learn what anyone else has learned. I am especially interested in how 8 socket Intel servers compare with 2 socket. I'm seeing some pretty incredible numbers from 2 socket servers (almost 2x the speed of 8 socket). I'm wondering if anyone else is seeing measurable differences.

I'm starting to move to the "go wide" camp rather than the "go high" camp for increasing server power. The blade servers are becoming more and more powerful, and you can have more memory local to the CPU. Increasing CPU sockets just increases the hops to get those LIOs done, costing time, waits, latches, etc.

So here is a piece of my LIO benchmark...

CREATE TABLE "KILLER" ("CC_ID" NUMBER(20, 0) NOT NULL ENABLE);

/*  import 55,000 rows of distinct data */
CREATE PROCEDURE kill_lio IS
   my_count        number := 1;
   my_executions   number;
   my_buffer_gets  number;
   my_cpu_time     number;
   my_elapsed_time number;
BEGIN
   -- Drive logical I/O: run the same count 10,000 times.
   for i in 1..10000 LOOP
      select count(distinct cc_id) into my_count from kill_lio.killer;
   end loop;

   -- Pull the cumulative statistics for the statement above.
   -- cpu_time and elapsed_time are reported in microseconds.
   select executions, buffer_gets, cpu_time, elapsed_time
     into my_executions, my_buffer_gets, my_cpu_time, my_elapsed_time
     from sys.v_$sqlstats where sql_id = '2j5tvp5rdzmym';

   dbms_output.put_line('executions:                           ' || to_char(my_executions,'999,999,999'));
   dbms_output.put_line('buffer gets:                          ' || to_char(my_buffer_gets,'999,999,999'));
   dbms_output.put_line('cpu time:                             ' || to_char(my_cpu_time,'999,999,999'));
   dbms_output.put_line('elapsed time:                         ' || to_char(my_elapsed_time,'999,999,999'));
   dbms_output.put_line('elapsed time per execution(ms)   :      ' || to_char(my_elapsed_time/my_executions/1000,'999,999.9'));
   dbms_output.put_line('buffer_gets/second:                   ' || to_char(my_buffer_gets/(my_elapsed_time/1000000),'999,999,999'));
END kill_lio;
/

And here is the output I use to compare. I look at the average elapsed time and buffer_gets/second to benchmark systems.

executions:                                 10,000
buffer gets:                             1,190,000
cpu time:                              190,983,974
elapsed time:                          191,374,061
elapsed time per execution(ms)    :           19.1
buffer_gets/second:                          6,218
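
The last two lines are simple arithmetic on the V$SQLSTATS counters, which report CPU_TIME and ELAPSED_TIME in microseconds. As a rough sketch, the same figures can be pulled in a single query without the procedure. The sql_id is the one hard-coded in the procedure above; you would substitute your own.

```sql
-- Derive the two comparison metrics straight from V$SQLSTATS.
-- NULLIF guards against dividing by zero before the statement has run.
SELECT executions,
       buffer_gets,
       ROUND(elapsed_time / NULLIF(executions, 0) / 1000, 1)       AS elapsed_ms_per_exec,
       ROUND(buffer_gets / NULLIF(elapsed_time / 1000000, 0))      AS buffer_gets_per_sec
FROM   v$sqlstats
WHERE  sql_id = '2j5tvp5rdzmym';
```

Plugging in the run above: 191,374,061 / 10,000 / 1,000 is about 19.1 ms per execution, and 1,190,000 buffer gets over 191.374 seconds is about 6,218 buffer gets per second.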

Here is the AWR report from the execution