Saturday, November 19, 2011

Big Data



I wanted to write a little bit about big data, since that is such a common topic these days. 

First, the definition. What is considered big data? I had a conversation with a very smart thinker who defined it as "when the data is so big you have to put a lot of thought into how to deal with it". In other words, it is data large enough that your normal techniques no longer work. Large could be 20 TB, 200 TB, or petabytes; it's about how the data relates to your infrastructure.

The next question is: why? And why now?

I think the big change in the last couple of years is the explosion of smartphones, faster technology, and cellular data communications. There is a huge amount of data generated from these sources, and probably the biggest ingredient in this realm is location. Think about the GPS built into your smartphone (or the ability to triangulate your location from cell phone towers). This adds a whole new dimension to the equation. Your smartphone generates data in more detail than was ever possible before. Say you are in a store shopping for an item, and you check the price on Amazon through your phone. There is a ton of data in that "transaction". It is now possible to mine data about where you were when you looked at the item online. Did you buy it online? When did you go shopping for it?
There is also a desire to mine the data you already have. You might have order data, but that data is growing, and the desire to derive value from it is growing too. At some point this forces you to look at solutions to handle the problem.
There is a huge proliferation of data from both traditional sources and all these new sources of information. So many new devices out there are internet-aware. Look at the new Coke vending machines that let you customize your flavor: data about your choices and the machine's health is available to be monitored.

So the next topic is to define the two types of "big data":


Structured data – a set schema.

Unstructured data – no set schema for the data.

Structured Data

First let's talk about structured data. This data typically comes from your current SQL database (Oracle, DB2, MySQL, SQL Server, etc.). Data is stored in a set schema, and tables are often joined together. This is the most common way to store traditional data such as customer/order information. I think we can all relate to the data model.

[Diagram: a simple customer/order data model with several related tables]
This is a very simple diagram, but you get the picture: there are multiple tables, and there are relationships. You most likely have historical data collected over time, and you might want to take all of it and find patterns.
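To make the idea concrete, here is a minimal sketch of that kind of schema. The table and column names are hypothetical, and SQLite is used purely for illustration; the same pattern applies to any of the databases named above.

```python
import sqlite3

# In-memory database just for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two related tables with a set schema: customers and their orders.
# Amounts are stored as integer cents to avoid floating-point issues.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount_cents INTEGER)""")

cur.execute("INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 1999), (2, 1, 500), (3, 2, 4250)])

# The relationship between the tables lets us join them to ask questions.
rows = cur.execute("""
    SELECT c.name, SUM(o.amount_cents)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 2499), ('Globex', 4250)]
```

The fixed schema is what makes joins and aggregations like this straightforward, and it is exactly what unstructured data lacks.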

Unstructured Data

A great example of this is resumes. Let's say you are LinkedIn, and you have millions of resumes. The resumes are not in a specific format, and some keywords may be difficult to disambiguate. This is especially true in the computer field, where so many names are derived from other things. Look at the skill set of someone who is versed in current technology like Red Hat, Python, and Java: if you go searching through resumes naively, you might get a snake keeper who likes specific colored hats and coffee. You need to figure out how to take this data and make it useful, using natural language processing or other such means.

So how do you handle these two types of data?

Structured Data Solutions

Many solutions have come onto the market in this area, and I'm sure you have heard of them: Exadata, Teradata, Netezza, Greenplum, Vertica, and Aster Data. Some of these solutions have been around for years and are still on the forefront of the data revolution, and some are newcomers. These are appliances that take large amounts of data and parallelize the processing to quickly get to the answer. Teradata has been at this for years. These are usually MPP (massively parallel processing) solutions that use local disk: they break the query workload down into pieces, run the pieces across nodes, and then bring back the combined result set. They work very well for this.
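The scatter-gather idea behind these MPP appliances can be sketched in a few lines: partition the data across nodes, have each node compute a partial answer against only its own slice, and let a coordinator combine the partials. The data and node count below are made up, and each "node" is simulated as a list slice rather than a real machine:

```python
# Toy illustration of an MPP-style query: partition, compute partial
# results per node, then combine at the coordinator.
orders = [("east", 100), ("west", 250), ("east", 75), ("west", 10),
          ("east", 60), ("west", 5)]

NUM_NODES = 3
# Scatter: deal rows out across the nodes (round-robin partitioning).
partitions = [orders[i::NUM_NODES] for i in range(NUM_NODES)]

def node_partial_sum(rows):
    """The work one node does locally: sum order amounts per region."""
    partial = {}
    for region, amount in rows:
        partial[region] = partial.get(region, 0) + amount
    return partial

# In a real appliance each node runs in parallel on its own local disk;
# here we simply run them one after another.
partials = [node_partial_sum(p) for p in partitions]

# Gather: the coordinator merges the partial results into the answer.
totals = {}
for partial in partials:
    for region, amount in partial.items():
        totals[region] = totals.get(region, 0) + amount

print(totals)  # {'east': 235, 'west': 265}
```

Because each node only ever touches its own partition, adding nodes shrinks the per-node slice, which is where the speedup comes from.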

Unstructured Data Solutions

Undoubtedly the biggest player in this space is Hadoop. Hadoop is an open-source framework (often lumped in with the "NoSQL" world) that can process large amounts of data. So let's get back to our resume example. You have a million resumes, and you are looking for a specific individual who has experience with GoldenGate. Hadoop is made up of multiple servers using local disk to split the workload into pieces (sometimes this is called sharding). Commodity hardware is used to accomplish this, and it is a scalable solution because you keep adding nodes to get more performance. You would take your 1 million resumes and send them out to your "cluster". Let's say you have a 10-node cluster. Hadoop would split the resumes across all 10 nodes while maintaining redundancy by putting each resume on 2 nodes in the system; after all, these are commodity servers that may break. When you run your Hadoop "query", all 10 nodes start looking for "GoldenGate" in their resumes and return the results. Breaking the work down across multiple machines is a huge advantage for scalability: if you need more processing power, you add more nodes.
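A rough sketch of that distribution scheme (this is not Hadoop's actual API, just the idea described above, with made-up resume data): place each document on two of the nodes for redundancy, then fan the search out so every node scans only its local shard, and merge the de-duplicated hits.

```python
# Toy sketch of sharding with redundancy: each "node" is a dict acting
# as its local disk, each resume lands on 2 nodes, and a search fans
# out to all nodes and merges the hits.
NUM_NODES = 10

resumes = {
    "r1": "10 years Oracle DBA, GoldenGate replication, RAC",
    "r2": "Java developer, Spring, MySQL",
    "r3": "Data engineer: Hadoop, GoldenGate, ETL pipelines",
}

nodes = [dict() for _ in range(NUM_NODES)]

# "Shard" each resume onto two different nodes (replication factor 2),
# so losing any single commodity server loses no data.
for i, (rid, text) in enumerate(sorted(resumes.items())):
    primary = i % NUM_NODES
    replica = (i + 1) % NUM_NODES
    nodes[primary][rid] = text
    nodes[replica][rid] = text

def search(keyword):
    """Fan the query out to every node; merge and de-duplicate hits."""
    hits = set()
    for local_store in nodes:          # each node scans only its shard
        for rid, text in local_store.items():
            if keyword.lower() in text.lower():
                hits.add(rid)           # set() drops the replica's copy
    return sorted(hits)

print(search("goldengate"))  # ['r1', 'r3']
```

In real Hadoop the query itself is also shipped to the nodes (the map phase) and the merge is the reduce phase, but the scaling story is the same: more nodes, smaller shards, faster scans.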

Summary

This is a very interesting time for data. The amount of data generated is skyrocketing, and the equipment used to parse through it is getting faster. In-memory databases are becoming a reality. All this is causing a lot of changes in the market.

There are many opportunities for companies to take advantage of this change in the market. Most large companies are looking at solutions for both of these issues (structured and unstructured data). Oracle has been advertising its interest in Hadoop and its intention to enter this market with a product that will handle data warehousing. This was announced at OpenWorld, but the details haven't been unveiled yet.

The one item I didn't cover in depth in this post is the use of an in-memory database. This type of technology is becoming more common with the advent of SAP HANA, and Oracle has now announced Exalytics to fill this space.

Analytics and big data are definitely the wave of the future.

Tuesday, November 15, 2011

Configurations for multiple instances on 4 nodes


How to handle multiple databases without enough memory.


Let's say we have 2 environments that need to share the same 4-node cluster, and each environment has 3 databases. For simplicity, let's call the apps:


  • DBFS
  • MSTDB
  • DWDB




Now, to separate the 2 environments, let's give each environment its own set of databases:


DBFSI
MSTDBI
DWDBI
DBFSP
MSTDBP
DWDBP


That gives us 6 databases from the 2 environments that all need to run on 4 nodes with 96 GB of memory apiece.


RECOMMENDATION


1)      Split the 4-node cluster in half. Put the Imp (I) systems on the first 2 nodes and the perf (P) systems on the second 2 nodes.
2)      Create 3 different sets of "databases" and "instances" through srvctl: the normal set, a large "L" set, and a small "S" set. Only 1 of the 3 will be up at any time. They share the same set of datafiles, just with different configurations. By overriding the memory settings with SID-specific entries in the SPFILE, this configuration is possible.
3)      Start up the appropriate databases (and instances) for the desired configuration.
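As a rough sketch of step 2, registering the large configuration of DBFSP might look something like the following. Exact srvctl options vary by Oracle version, and the paths here are made up, so treat this as illustrative only, not a recipe:

```shell
# Illustrative only -- syntax varies by Oracle/Grid Infrastructure
# version. Register a second "large" configuration (LDBFSP) that
# reuses the same datafiles as DBFSP but spans all 4 nodes.
srvctl add database -d LDBFSP -o $ORACLE_HOME
srvctl add instance -d LDBFSP -i LDBFSP1 -n dbnode1
srvctl add instance -d LDBFSP -i LDBFSP2 -n dbnode2
srvctl add instance -d LDBFSP -i LDBFSP3 -n dbnode3
srvctl add instance -d LDBFSP -i LDBFSP4 -n dbnode4

# In the shared SPFILE, give each set of SIDs its own memory settings:
#   alter system set sga_target=20g scope=spfile sid='DBFSP1';
#   alter system set sga_target=20g scope=spfile sid='DBFSP2';
#   alter system set sga_target=70g scope=spfile sid='LDBFSP1';
#   ... and so on for each instance name.

# Only one configuration is up at a time:
srvctl stop database -d DBFSP
srvctl start database -d LDBFSP
```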


Database   SGA   Instances            Nodes
DBFSI      20g   DBFSI1-DBFSI2        dbnode1/dbnode2
LDBFSI     70g   LDBFSI1-LDBFSI4      dbnode1-dbnode4
SDBFSI     4g    SDBFSI1              dbnode1
MSTDBI     20g   MSTDBI1-MSTDBI2      dbnode1/dbnode2
LMSTDBI    70g   LMSTDBI1-LMSTDBI4    dbnode1-dbnode4
SMSTDBI    4g    SMSTDBI1             dbnode2
DWDBI      20g   DWDBI1-DWDBI2        dbnode1/dbnode2
LDWDBI     70g   LDWDBI1-LDWDBI4      dbnode1-dbnode4
SDWDBI     4g    SDWDBI1              dbnode1
DBFSP      20g   DBFSP1-DBFSP2        dbnode3/dbnode4
LDBFSP     70g   LDBFSP1-LDBFSP4      dbnode1-dbnode4
SDBFSP     4g    SDBFSP1              dbnode3
MSTDBP     20g   MSTDBP1-MSTDBP2      dbnode3/dbnode4
LMSTDBP    70g   LMSTDBP1-LMSTDBP4    dbnode1-dbnode4
SMSTDBP    4g    SMSTDBP1             dbnode4
DWDBP      20g   DWDBP1-DWDBP2        dbnode3/dbnode4
LDWDBP     70g   LDWDBP1-LDWDBP4      dbnode1-dbnode4
SDWDBP     4g    SDWDBP1              dbnode4




OK, now that I have 3 sets of 6 databases defined, what will the actual configuration choices be?


Normal configuration showing memory usage


Database   dbnode1  dbnode2  dbnode3  dbnode4
DBFSI      20       20       -        -
MSTDBI     20       20       -        -
DWDBI      20       20       -        -
DBFSP      -        -        20       20
MSTDBP     -        -        20       20
DWDBP      -        -        20       20
Total      60g      60g      60g      60g


Perf Isolated testing of DWDB


Database   dbnode1  dbnode2  dbnode3  dbnode4
DBFSI      20       20       -        -
MSTDBI     20       20       -        -
DWDBI      20       20       -        -
SDBFSP     -        -        4        -
SMSTDBP    -        -        -        4
LDWDBP     -        -        70       70
Total      60g      60g      74g      74g














Perf Full testing of DWDB


Database   dbnode1  dbnode2  dbnode3  dbnode4
SDBFSI     4        -        -        -
SMSTDBI    -        4        -        -
SDWDBI     -        4        -        -
SDBFSP     -        -        4        -
SMSTDBP    -        -        -        4
LDWDBP     70       70       70       70
Total      74g      78g      74g      74g




You can see that with this configuration it is possible to carefully manage database memory usage. The above examples can be used to make any one of the databases span the whole cluster, while the others sit on one node in a small configuration.