Twitter Updates 2.2.1: FeedWitter

Tuesday, May 24, 2011

Oracle and networking latency with small packets

Recently I have run into some performance issues, and trying to track down the cause has been a challenge.  It wanted to share a valuable lesson that I learned.

It started like any other performance issues.  The phone rings, and the database is slow.  Why do they think the database is slow ? Because the app isn't very busy, and the processing isn't moving along at the expected pace.  This is a new app, and they are in beta.  The process goes along as it always does.  Check the database for any bottlenecks, or any performance issues.  The database all looks good.  The queries are all executing in milliseconds. on the new hardware.  The application looks good.  The process is executing in Milliseconds, and then calling out to the database for the next query.  Everything looks good, database is fast, application is fast.  The issue goes to the networking group, and the network group concludes that we have a nice 1ge going between the 2 servers.  The network isn't breaking a sweat.  The issue continues.

So what was the issue ?  It was the number of hops, and the distance of the hops between the application and the database server.  It turns out that this is a chatty app, and it sends a lot of small packets back and forth between the database server, and the app server. 
To make matters worse, the application was just uplifted to new hardware for both the database, and the application.  This uplift caused the applications to be less of a bottleneck, pushing the bottleneck to the network.

The issue turned out to be network latency.  Not a lot of latency, (just a couple of milliseconds), but enough to be noticable for a very chatty application.  It becomes more, and more noticable as servers become faster.  Now that queries run in < 0.00ms, the network is popping up as more of a bottleneck.

The lesson I want to pass on, is that I would highly suggest you measure the network latency, and know how much of a impact it has.. Especially if you going across datacenters, or across many networks in the same datacenters.

My test was a simple one..

1) Create a script full of "select 'x' from dual;

2) put a !date call at the beging and end of the script

3) Run this on a client on your network and compare the difference between the begin and the end time.

4) Run this for multiple scenarios.. I even ran it as IPC, to find out what the network overhead is.

Knowing the expected latency as you go across your datacenter(s) is useful to find out where that missing time is going. So often, it is blamed on the dba (database), and doing this kind of check will let you know what isn't the database.

1 comment:

  1. I encountered a similar situation, but the effect was being played out on even a high performance low latency network. For very chatty applications there is not much you can do at the network level. You just have to refactor.

    The experience led me to look into this in more detail. You can see my results here: