Bryan's Oracle Blog

Saturday, November 19, 2011

Big Data

I wanted to write a little bit about big data, since that is such a common topic these days.

First, the definition. What is considered big data? I had a conversation with a very smart thinker who defined it as “when the data is so big you have to put a lot of thought into how to deal with it”. In other words, it is data for which your normal techniques no longer work at that scale. Large could be 20 TB, 200 TB, or petabytes. It’s about how the data relates to your infrastructure.

The next question is why? Why now?

I think the big thing that has happened in the last couple of years is the explosion of smartphones, faster technology, and cellular data communications. There is a huge amount of data generated from these sources. Probably one of the biggest ingredients in this realm is location. Think about the GPS built into your smartphone (or the ability to triangulate your location from cell phone towers). This adds a whole new dimension to the equation. Your smartphone generates data in more detail than was ever possible. Say you are in a store shopping for an item, and you check the price on Amazon through your phone. There is a ton of data in that “transaction”. It is now possible to mine data about where you were when you looked at the item online. Did you buy it online? When did you go shopping for it?

There is also a desire to mine the data that you already have. You might have order data, but that data is growing, and the desire to derive value from that data is growing. At some point this forces you to look at solutions to handle this problem.

There is a huge proliferation of data from both traditional sources and all these new sources of information. There are so many new devices out there that are internet aware. Look at the new Coke vending machines that allow you to customize your flavor; data about your choices and the machine's health is available to be monitored.

So the next topic is to define the 2 types of “big data”:

Structured Data – Set schema for the data.

Unstructured Data – No set schema for the data.

Structured Data

First let’s talk about structured data. This data typically comes from your current SQL database (Oracle, DB2, MySQL, SQL Server, etc.). Data is stored in a set schema, and the tables are often joined together. This is the most common way to store traditional data for customer/order information. I think we can all relate to the data model.

This is a very simple diagram, but you get the picture. There are multiple tables, and there are relationships between them. You most likely have historical data collected over time. You might want to take all this data and find patterns.

Unstructured Data

A great example of this is resumes. Let's say you are LinkedIn, and you have millions of resumes. The resumes are not in a specific format, and some keywords may be difficult to disambiguate. This is especially true in the computer field, where so many names are derived from other things. Look at the skill set of someone who is versed in a lot of the current technology like Red Hat, Python, and Java. If you go searching through resumes you might get a snake keeper who likes specific colored hats, and coffee. You need to find out how to take this data and make it useful using natural language processing, or other such means.

So how do you handle these 2 types of data?

Structured Data Solutions

I think there are many solutions that have come onto the market in this area. I’m sure you have heard of them: Exadata, Teradata, Netezza, Greenplum, Vertica, and Aster Data. Some of these solutions have been around for a while and are still at the forefront of the data revolution, and some of them are newcomers. These are typically appliances that take large amounts of data and parallelize the processing to get to the answer quickly. Teradata has been at this for years. These are usually MPP (massively parallel processing) solutions that use local disk. They break the query workload down into pieces, run the pieces in parallel, and then bring back the combined result set. They work very well for this.
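To make the "break the query into pieces" idea a little more concrete, here is a minimal Python sketch of a scatter/gather aggregate. It is only an illustration, not how any of the products above are actually implemented. The row layout (`customer_id`, `region`, `amount`) and the function names are made up for this example.

```python
from collections import defaultdict

def partition_rows(rows, num_segments):
    """Spread rows across segments by hashing on customer_id (hypothetical key)."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[row["customer_id"] % num_segments].append(row)
    return segments

def local_totals(segment_rows):
    """The piece of the query each segment runs against its own local data."""
    totals = defaultdict(int)
    for row in segment_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)

def merge_totals(partials):
    """The coordinator combines the partial results into the final result set."""
    merged = defaultdict(int)
    for partial in partials:
        for region, amount in partial.items():
            merged[region] += amount
    return dict(merged)

def total_by_region(rows, num_segments=4):
    """Roughly: SELECT region, SUM(amount) ... GROUP BY region, done in pieces."""
    segments = partition_rows(rows, num_segments)
    return merge_totals(local_totals(seg) for seg in segments)
```

In a real MPP system the segments run on separate servers at the same time; here they run one after another, which is enough to show that the pieces add up to the same answer as a single big scan.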

Unstructured Data Solutions

Undoubtedly the biggest player in this space is Hadoop. Hadoop is an open source framework, usually grouped with the “NoSQL” technologies, that can store and process large amounts of data. So let's get back to our resume example. You have 1 million resumes and you are looking for a specific individual who has experience with GoldenGate. Hadoop is made up of multiple servers that use local disk and split the workload into pieces (sometimes this is called sharding). Commodity hardware is used to accomplish this. This is a scalable solution because you keep adding nodes to get more performance. You would take your 1 million resumes and send them out to your “cluster”. Let’s say you have a 10 node cluster. Hadoop would take the resumes and split them across all 10 nodes, while maintaining redundancy by putting each resume on 2 nodes in the system. After all, these are commodity servers that may break. When you go to run your Hadoop “query”, you have Hadoop tell all 10 nodes to start looking for “goldengate” in the resumes and return the results. Breaking down the work across multiple machines is a huge advantage for scalability. If you need more processing power, you add more nodes.
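Here is a small Python sketch of the resume example, just to show the shape of the idea. It does not use any Hadoop API. The placement rule (primary node plus the next node), the function names, and the thread pool standing in for the cluster are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 10   # the 10 node cluster from the example
REPLICAS = 2     # each resume is stored on 2 nodes

def place(resume_id, num_nodes=NUM_NODES, replicas=REPLICAS):
    """Pick the nodes that hold a resume: a primary plus its next neighbour(s)."""
    primary = resume_id % num_nodes
    return [(primary + i) % num_nodes for i in range(replicas)]

def distribute(resumes):
    """Send every resume (id -> text) out to the nodes chosen by place()."""
    nodes = [dict() for _ in range(NUM_NODES)]
    for resume_id, text in resumes.items():
        for node in place(resume_id):
            nodes[node][resume_id] = text
    return nodes

def search_node(local_resumes, keyword):
    """The work a single node does: scan only the resumes on its own disk."""
    keyword = keyword.lower()
    return {rid for rid, text in local_resumes.items() if keyword in text.lower()}

def search_cluster(nodes, keyword, failed=()):
    """Ask every live node to search in parallel, then combine the answers."""
    live = [node for i, node in enumerate(nodes) if i not in failed]
    with ThreadPoolExecutor(max_workers=len(live)) as pool:
        partials = list(pool.map(lambda node: search_node(node, keyword), live))
    return sorted(set().union(*partials))
```

For example, with `nodes = distribute({0: "Oracle DBA, GoldenGate", 1: "Java developer"})`, calling `search_cluster(nodes, "goldengate", failed={0})` still finds resume 0, because its second copy lives on another node. One simplification to keep in mind: in this sketch every node scans every copy it holds and the duplicates are thrown away at the end, whereas a real scheduler would try to read each piece of data only once.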

Summary –

This is a very interesting time for data. The amount of data generated is skyrocketing. The equipment used to parse through the data is getting faster. In-memory databases are becoming a reality. All this is causing a lot of changes in the market.

There are many opportunities for companies to take advantage of this change in the market. Most large companies are looking at solutions for both of these issues (structured and unstructured data). Oracle has been advertising their interest in Hadoop and their intention to enter this market with a product that will handle data warehousing. This was announced at OpenWorld, but the details haven’t been unveiled yet.

The one item I didn’t cover in this post is the use of an in-memory database. This type of technology is becoming more common with the advent of SAP HANA. Oracle has now announced Exalytics to fill this space.

Analytics and big data are definitely the wave of the future.

Tuesday, November 15, 2011

configurations for multiple instances on 4 nodes

How to handle multiple databases without enough memory.

Let's say we have 2 environments that need to use the same 4 node cluster. Each environment runs 3 applications, each with its own database. For simplicity, let's call the apps

DBFS
MSTDB
DWDB

Now, to separate out the 2 environments, let's give each environment its own set of databases.

DBFSI

MSTDBI

DWDBI

DBFSP

MSTDBP

DWDBP

We have 6 databases from 2 environments that all need to be running on 4 nodes with 96g of memory apiece.

RECOMMENDATION

1) Split the 4 node cluster in half. Put the Imp systems on the first 2 nodes, and the Perf systems on the second 2 nodes.

2) Create 3 different sets of “databases” and “instances” through srvctl: a normal set, a large set (prefixed with L), and a small set (prefixed with S). For any one database, only 1 of its 3 configurations will be up at any time. They all point at the same set of datafiles, just with different configurations. This is possible by overriding the memory settings in the init file and keeping 3 sets of SIDs in the SPFILE.

3) Start up the appropriate databases (and instances) for the configuration you want.

Database	SGA	Instances	Nodes
DBFSI	20g	DBFSI1-DBFSI2	dbnode1/dbnode2
LDBFSI	70g	LDBFSI1-LDBFSI4	dbnode1/dbnode2 dbnode3/dbnode4
SDBFSI	4g	SDBFSI1	dbnode1
MSTDBI	20g	MSTDBI1-MSTDBI2	dbnode1/dbnode2
LMSTDBI	70g	LMSTDBI1-LMSTDBI4	dbnode1/dbnode2 dbnode3/dbnode4
SMSTDBI	4g	SMSTDBI1	dbnode2
DWDBI	20g	DWDBI1-DWDBI2	dbnode1/dbnode2
LDWDBI	70g	LDWDBI1-LDWDBI4	dbnode1/dbnode2 dbnode3/dbnode4
SDWDBI	4g	SDWDBI1	dbnode1
DBFSP	20g	DBFSP1-DBFSP2	dbnode3/dbnode4
LDBFSP	70g	LDBFSP1-LDBFSP4	dbnode1/dbnode2 dbnode3/dbnode4
SDBFSP	4g	SDBFSP1	dbnode3
MSTDBP	20g	MSTDBP1-MSTDBP2	dbnode3/dbnode4
LMSTDBP	70g	LMSTDBP1-LMSTDBP4	dbnode1/dbnode2 dbnode3/dbnode4
SMSTDBP	4g	SMSTDBP1	dbnode4
DWDBP	20g	DWDBP1-DWDBP2	dbnode3/dbnode4
LDWDBP	70g	LDWDBP1-LDWDBP4	dbnode1/dbnode2 dbnode3/dbnode4
SDWDBP	4g	SDWDBP1	dbnode4

OK, now that I have 3 sets of 6 databases defined, what will the actual configuration choices be?

Normal configuration showing memory usage

Database	dbnode1	dbnode2	dbnode3	dbnode4
DBFSI	20	20
MSTDBI	20	20
DWDBI	20	20
DBFSP			20	20
MSTDBP			20	20
DWDBP			20	20

Total	60g	60g	60g	60g

Perf Isolated testing of DWDB

Database	dbnode1	dbnode2	dbnode3	dbnode4
DBFSI	20	20
MSTDBI	20	20
DWDBI	20	20
SDBFSP			4
SMSTDBP				4
LDWDBP			70	70

Total	60g	60g	74g	74g

Perf Full testing of DWDB

Database	dbnode1	dbnode2	dbnode3	dbnode4
SDBFSI	4
SMSTDBI		4
SDWDBI		4
SDBFSP			4
SMSTDBP				4
LDWDBP	70	70	70	70

Total	74g	78g	74g	74g

You can see that with this configuration, it is possible to carefully manage the memory usage of the databases. The above examples can be used to make any one of the databases span the whole machine, while the others sit on one node in a small configuration.
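If you want to sanity check a configuration before starting it, the arithmetic in the three tables above is easy to script. Here is a minimal Python sketch that adds up the SGA per node and compares it to the 96g on each node. The dictionary layout and function names are just my own for this example, and it only counts SGA, so it assumes you leave headroom for PGA and the operating system.

```python
NODE_MEMORY_GB = 96
NODES = ["dbnode1", "dbnode2", "dbnode3", "dbnode4"]

# SGA in GB per database per node, copied from the tables above.
NORMAL = {
    "DBFSI": {"dbnode1": 20, "dbnode2": 20},
    "MSTDBI": {"dbnode1": 20, "dbnode2": 20},
    "DWDBI": {"dbnode1": 20, "dbnode2": 20},
    "DBFSP": {"dbnode3": 20, "dbnode4": 20},
    "MSTDBP": {"dbnode3": 20, "dbnode4": 20},
    "DWDBP": {"dbnode3": 20, "dbnode4": 20},
}

PERF_ISOLATED = {
    "DBFSI": {"dbnode1": 20, "dbnode2": 20},
    "MSTDBI": {"dbnode1": 20, "dbnode2": 20},
    "DWDBI": {"dbnode1": 20, "dbnode2": 20},
    "SDBFSP": {"dbnode3": 4},
    "SMSTDBP": {"dbnode4": 4},
    "LDWDBP": {"dbnode3": 70, "dbnode4": 70},
}

PERF_FULL = {
    "SDBFSI": {"dbnode1": 4},
    "SMSTDBI": {"dbnode2": 4},
    "SDWDBI": {"dbnode2": 4},
    "SDBFSP": {"dbnode3": 4},
    "SMSTDBP": {"dbnode4": 4},
    "LDWDBP": {node: 70 for node in NODES},
}

def node_totals(config):
    """Add up the SGA of every instance placed on each node."""
    totals = {node: 0 for node in NODES}
    for placement in config.values():
        for node, sga_gb in placement.items():
            totals[node] += sga_gb
    return totals

def fits(config, limit_gb=NODE_MEMORY_GB):
    """True if no node is asked for more SGA than it has memory."""
    return all(total <= limit_gb for total in node_totals(config).values())

if __name__ == "__main__":
    for name, config in [("normal", NORMAL),
                         ("perf isolated", PERF_ISOLATED),
                         ("perf full", PERF_FULL)]:
        print(name, node_totals(config), "fits" if fits(config) else "TOO BIG")
```

The same check also shows why you can't just bring up every large configuration at once: two 70g instances on one node would be 140g, well past the 96g available.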
