Big Data
I wanted to write a little bit about big data, since that is such a common topic these days.
First, the definition. What is considered big data? I had a conversation with a very smart thinker who defined it as “when the data is so big you have to put a lot of thought in how to deal with it”. In other words, it is data so large that your normal techniques no longer work. Large could be 20 TB, 200 TB, or petabytes. It’s about how the data relates to your infrastructure.
The next question is why? Why now?
I think the big thing that has happened in the last couple of years is the explosion of smartphones, faster technology, and cellular data communications. There is a huge amount of data generated from these sources. Probably one of the biggest ingredients in this realm is location. Think about the GPS built into your smartphone (or the ability to triangulate your location from cell phone towers). This adds a whole new dimension to the equation. Your smartphone generates data in more detail than was ever possible. You are in a store shopping for an item, and you check the price on Amazon through your phone. There is a ton of data in that “transaction”. It is now possible to mine data about where you were when you looked at the item online. Did you buy it online? When did you go shopping for it?
There is also a desire to mine the data that you already have. You might have order data, but that data is growing, and the desire to derive value from it is growing too. At some point this forces you to look at solutions to handle the problem.
There is a huge proliferation of data from both traditional sources and all these new sources of information. There are so many new devices out there that are internet aware. Look at the new Coke vending machines that allow you to customize your flavor: data about your choices and the machine’s health is available to be monitored.
So the next topic is to define the 2 types of “big data”:
Structured data - Set schema
Unstructured data - No set schema for the data
Structured Data
First let’s talk about structured data. This data typically comes from your current SQL database (Oracle, DB2, MySQL, SQL Server, etc.). Data is stored in a set schema, and the tables are often joined together. This is the most common way to store traditional data such as customer/order information. I think we can all relate to this kind of data model.
This is a very simple diagram, but you get the picture. There are multiple tables, and there are relationships between them. You most likely have historical data collected over time. You might want to take all this data and find patterns.
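To make the idea of a set schema with relationships concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and rows are made up for illustration; a real customer/order model would have many more tables.

```python
import sqlite3

# Hypothetical two-table schema; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Join the two tables on the customer key and summarize each customer's orders.
rows = conn.execute("""
    SELECT c.name, COUNT(o.id), SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.name
""").fetchall()
print(rows)
```

Because the schema is fixed, the database knows exactly how the tables relate, which is what makes this kind of join and aggregation straightforward.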
Unstructured Data
A great example of this is resumes. Let’s say you are LinkedIn, and you have millions of resumes. The resumes are not in a specific format, and some keywords may be difficult to disambiguate. This is especially true in the computer field, where so many names are derived from other things. Look at the skill set of someone who is versed in a lot of the current technology, like Red Hat, Python, and Java. If you go searching through resumes, you might get a snake keeper who likes hats of a specific color and coffee. You need to find out how to take this data and make it useful, using natural language processing or other such means.
So how do you handle these 2 types of data?
Structured Data Solutions
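The snake keeper problem is easy to reproduce. The sketch below runs a naive, case-insensitive keyword search over two made-up resume snippets; the texts and the naive_match helper are hypothetical and only meant to show why plain keyword matching is not enough.

```python
# Made-up resume snippets, used only for illustration.
resumes = {
    "engineer": "Ten years building services in Java and Python on Red Hat Linux.",
    "zookeeper": "Cared for a ball python; enjoys java from the local coffee shop and owns a red hat.",
}

def naive_match(text, keywords):
    """Return True when every keyword appears somewhere in the text, ignoring case."""
    lowered = text.lower()
    return all(keyword.lower() in lowered for keyword in keywords)

skills = ["Red Hat", "Python", "Java"]
matches = [name for name, text in resumes.items() if naive_match(text, skills)]
print(matches)
```

Both resumes match, even though only one of them describes a technologist. Telling them apart takes context, which is where natural language techniques come in.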
I think there are many solutions that have come onto the market in this area. I’m sure you have heard of them: Exadata, Teradata, Netezza, Greenplum, Vertica, and Aster Data. Some of these solutions have been around for a while and are still at the forefront of the data revolution, and some are newcomers. These are appliances that take large amounts of data and parallelize the processing to get to the answer quickly. Teradata has been at this for years. These are usually MPP (massively parallel processing) solutions that use local disk. They break the query workload down into pieces, run the pieces in parallel, and then bring back the combined result set. They work very well for this.
Unstructured Data Solutions
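The "break the query into pieces, then bring back the result set" idea can be sketched in a few lines of Python. This is a toy illustration, not how any of the products above are implemented: threads stand in for nodes, and the data, the partition helper, and the per-region total are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up order rows: (region, amount).
orders = [("east", 10.0), ("west", 5.0), ("east", 7.5), ("north", 2.5),
          ("west", 20.0), ("north", 4.0), ("east", 1.0), ("west", 3.0)]

def partition(rows, n):
    """Deal rows out round-robin into n slices, one per hypothetical node."""
    return [rows[i::n] for i in range(n)]

def partial_totals(rows):
    """The work one node does: total the amounts per region for its own slice."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals

def merge(partials):
    """The coordinator's work: combine the per-node partial results."""
    combined = {}
    for totals in partials:
        for region, amount in totals.items():
            combined[region] = combined.get(region, 0.0) + amount
    return combined

# Each worker thread plays the role of one node scanning only its local slice.
with ThreadPoolExecutor(max_workers=4) as pool:
    result = merge(pool.map(partial_totals, partition(orders, 4)))
print(result)
```

Each node only ever touches its own slice of the data, so adding nodes shrinks the slice each one has to scan.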
Undoubtedly the biggest player in this space is Hadoop. Hadoop is often lumped in with the “NoSQL” databases, but it is really an open source framework for storing and processing large amounts of data across a cluster of servers. So let’s get back to our resume example. You have a million resumes, and you are looking for a specific individual who has experience with GoldenGate. Hadoop is made up of multiple servers that use local disk and split the workload into pieces (similar to what is sometimes called sharding). Commodity hardware is used to accomplish this. You would take your 1 million resumes and send them out to your “cluster”. Let’s say you have a 10 node cluster. Hadoop would split the resumes across all 10 nodes, while maintaining redundancy by putting each resume on, say, 2 nodes in the system (Hadoop’s file system keeps 3 copies by default). After all, these are commodity servers that may break. When you go to run your Hadoop “query”, Hadoop tells all 10 nodes to start looking for “GoldenGate” in their resumes and return the results. Breaking the work down across multiple machines is a huge advantage for scalability. If you need more processing power, you add more nodes.
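Here is a rough simulation of that resume example in plain Python. It does not use any Hadoop API; the place_resumes and search functions, the round-robin placement, and the sample resumes are all hypothetical, and the 2 copies per resume simply follow the example above.

```python
def place_resumes(resume_ids, node_count, replicas=2):
    """Assign each resume to `replicas` distinct nodes, round-robin."""
    placement = {node: [] for node in range(node_count)}
    for i, resume_id in enumerate(resume_ids):
        for r in range(replicas):
            placement[(i + r) % node_count].append(resume_id)
    return placement

def search(placement, resumes, term, failed=frozenset()):
    """Have every healthy node scan its local resumes, then merge and de-duplicate."""
    hits = set()
    for node, ids in placement.items():
        if node in failed:
            continue  # a broken node contributes nothing; its replicas live elsewhere
        hits.update(rid for rid in ids if term.lower() in resumes[rid].lower())
    return sorted(hits)

# Made-up data: every fourth resume mentions GoldenGate.
resumes = {
    i: "Oracle DBA with GoldenGate experience" if i % 4 == 0 else "Java developer"
    for i in range(20)
}
placement = place_resumes(list(resumes), node_count=10)

all_up = search(placement, resumes, "goldengate")
one_down = search(placement, resumes, "goldengate", failed={3})
print(all_up)
print(one_down)
```

Even with node 3 down, the search still finds every matching resume because a second copy lives on a neighboring node. In this simplified version every copy gets scanned and duplicates are dropped at the end; a real framework schedules the work so that each piece of data is normally processed only once.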
Summary
This is a very interesting time for data. The amount of data generated is skyrocketing. The equipment used to parse through the data is getting faster. In-memory databases are becoming a reality. All this is causing a lot of changes in the market.
There are many opportunities for companies to take advantage of this change in the market. Most large companies are looking at solutions for both of these problems (structured and unstructured data). Oracle has been advertising its interest in Hadoop and its intention to enter this market with a product that will handle data warehousing. This was announced at OpenWorld, but the details haven’t been unveiled yet.
The one item I didn’t cover in detail in this post is the use of an in-memory database. This type of technology is becoming more common with the advent of SAP HANA. Oracle has now announced Exalytics to fill this space.
Analytics and big data are definitely the wave of the future.