Big data and machine learning: definition, importance, differences
The purpose of this survey paper is to define Big data, to explain how it differs from traditional data sets, what purpose it serves, what issues and challenges it raises, and what its defining characteristics are. Machine learning, one of the technologies that uses Big data, is then explored, and two machine learning techniques are studied and compared.
Keywords: Big data, k-means, SVM, Machine learning.
Although the term big data was coined in the 1990s, it has been a buzzword for the last decade, and many large corporations and tech giants are investing in it and developing new technologies for it. In 2012, six national departments and agencies (the National Science Foundation, NIH, the U.S. Geological Survey, DOD, DOE and the Defense Advanced Research Projects Agency) announced a joint research and development initiative that would invest more than $200 million to develop new big data tools and techniques.
So, what is Big data?
Big data, as the term suggests, is about dealing with large amounts of data. Almost everything in this world generates data. Large organizations collect this data to study and understand the behavior of populations, climate and weather, the genome, and much more. Many large companies possess amounts of data that are too voluminous or too unstructured to be analyzed or processed using traditional data management methods. This burgeoning stream of data comes from social media, online activity, sensors, videos, surveillance cameras, voice recordings from calls, GPS data and many other sources.
The impact of Big data can be seen all around us, for example Google predicting the term you are about to search for, or Amazon suggesting products to you. All of this is done by gathering, studying and analyzing the large volumes of data that all of us generate.
What makes Big data so important?
A simple answer is that data-driven decisions are much better than decisions driven by intuition, and Big data makes such decisions possible. Companies already collect a great deal of data; if they can find and understand the patterns in it, their managerial decisions can become much more effective. It is this potential for predictive analysis that has drawn so much attention to Big data.
Issues and Challenges
Big data is categorized into three data types:
- Structured data: more traditional data.
- Semi-structured data: HTML, XML.
- Unstructured data: video data, audio data.
This is where the problem arises: traditional data management techniques can process structured data and, to some extent, semi-structured data, but they cannot process unstructured data. That is why traditional data management techniques cannot be used efficiently on Big data.
Relational databases are best suited to structured data that is transactional in nature. They satisfy the ACID properties. ACID is an acronym for:
- Atomicity: A transaction is “all or nothing” when it is atomic. If any part of the transaction or the underlying system fails, the entire transaction fails.
- Consistency: Only transactions with valid data will be performed on the database. If the data is corrupt or improper, the transaction will not complete and the data will not be written to the database.
- Isolation: Multiple, simultaneous transactions will not interfere with each other. All valid transactions will execute until completed and in the order they were submitted for processing.
- Durability: After the data from the transaction is written to the database, it stays there “forever.”
ACID guarantees are difficult for relational databases to achieve at the scale of Big data.
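As an illustration of the atomicity property described above, here is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the account names and the `transfer` helper are hypothetical and exist only for this example; the sketch shows a small relational database, not a Big data system.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts inside a single transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back
        # if an exception escapes from it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Invalid state: abort, so that neither update is kept.
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # valid: both updates are kept
failed = transfer(conn, "alice", "bob", 500)  # invalid: both updates are undone
print(ok, failed, balances(conn))
```

The second transfer would leave a negative balance, so the whole transaction is rolled back and the balances stay as they were after the first transfer.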
Characteristics of Big data
Size is the first thing that comes to mind when we talk about Big data, but it is not its only characteristic. Big data is characterized by three V’s: volume, velocity and variety. These are what differentiate Big data from being just another word for “analytics”.
Volume: The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s. With the world going digital, as of 2012 about 2.5 exabytes (2.5 × 10^18 bytes) of data were being created every day. This gives companies the opportunity to work with petabytes of data in a single data set. Google alone processes 24 petabytes of data every single day. It is not just online data: Walmart collects around 2.5 petabytes of data every hour from its customer transactions.
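To make these magnitudes easier to compare, the figures quoted above can be worked through in a few lines. This is only a back-of-the-envelope sketch: the helper name `growth_factor` is introduced here for illustration, and decimal units are assumed (1 petabyte = 10^15 bytes, 1 exabyte = 10^18 bytes).

```python
BYTES_PER_PETABYTE = 10 ** 15
BYTES_PER_EXABYTE = 10 ** 18


def growth_factor(months, doubling_period_months=40):
    """Factor by which a quantity grows if it doubles every 40 months."""
    return 2 ** (months / doubling_period_months)


per_year = growth_factor(12)     # about 1.23, i.e. roughly 23% growth per year
per_decade = growth_factor(120)  # 2 ** 3 = 8 times in ten years

# 2.5 exabytes expressed in petabytes (integer arithmetic to stay exact).
exabytes_2_5_in_pb = 25 * BYTES_PER_EXABYTE // (10 * BYTES_PER_PETABYTE)

# 2.5 petabytes per hour, accumulated over 24 hours.
walmart_pb_per_day = 2.5 * 24

print(per_year, per_decade, exabytes_2_5_in_pb, walmart_pb_per_day)
```

Under these assumptions, 2.5 exabytes corresponds to 2,500 petabytes, and 2.5 petabytes per hour adds up to 60 petabytes per day.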
Velocity: The speed of data creation, processing and retrieval is important. To make real-time or near-real-time predictions, speed is a necessary factor. Even milliseconds of data latency can put companies behind their competitors. Rapid analysis gives an obvious advantage to Wall Street companies and Main Street managers alike.
Variety: The sources of data are very diverse. For example, the data collected by social media platforms includes pictures, videos, the pages on which a user spent the most time, the user’s entire social media activity, and what most users are leaning towards. And that is just one example; there can also be sensors collecting different types of data, from temperature readings to pictures and videos of samples. The data type varies from structured to semi-structured to unstructured.
Big data as a decision-making and predictive analytics tool is defined and reviewed by Davenport, Thomas H., Paul Barth, and Randy Bean in “How ‘big data’ is different”.
Machine learning is one of the technologies that uses big data. It learns via different methods such as supervised learning, unsupervised learning and reinforcement learning. Unsupervised learning can use an algorithm called k-means, which is explained in “k-means++: The advantages of careful seeding” by Arthur, David, and Sergei Vassilvitskii. In supervised learning many algorithms are used, several of which are discussed in “Performance analysis of various supervised algorithms on big data” by Unnikrishnan, Athira, Uma Narayanan, and Shelbi Joseph.
In “Predict failures in production lines: A two-stage approach with clustering and supervised learning”, D. Zhang, B. Xu and J. Wood take unlabeled data, use k-means to cluster it, and then pass the clusters through supervised learning algorithms to predict failures in a car manufacturing production line.
As reported by the McKinsey Global Institute in 2011, the main components and ecosystem of Big data are as follows:
- Techniques for analyzing data: A/B testing, machine learning and natural language processing.
- Big data technologies: business intelligence, cloud computing and databases.
- Visualization: charts, graphs and other displays of the data.
In this survey paper we study two different algorithms used in machine learning.
Machine learning is one of the techniques used in Big data to analyze data and find patterns in heaps of data. This is how Amazon, YouTube and other websites show predictions or related products to their users.
Three types of learning algorithms are used in machine learning:
- Supervised learning: the algorithm develops a mathematical model from a given set of labeled training data. Each training example has inputs and a desired output. Supervised algorithms include classification algorithms and regression algorithms. Classification algorithms are used when the desired output is a discrete label. Regression algorithms are used when the output is expected to be a value within a continuous range.
- Unsupervised learning: the algorithm takes data that is not labeled, classified or organized. It learns the commonalities in the given data and reacts to new data based on the presence or absence of those commonalities. Unsupervised learning commonly uses clustering; k-means is one common clustering algorithm.
- Reinforcement learning: the basic principle is that an agent learns how to behave by interacting with its environment and observing the results. This is used in game theory and control theory, and by companies such as DeepMind.
The k-means method is a simple and fast algorithm that attempts to locally improve an arbitrary k-means clustering. It is used to automatically partition a given data set into k groups. It works as follows:
- It starts by selecting k initial random centers, called means.
- It assigns each value to its closest mean, and then recalculates each mean from all the values assigned to it.
- The process is iterated a given number of times, or until the assignments stop changing, to give the clusters.
The outcome may not be optimal. Selecting different initial means and running the algorithm again may yield better clusters.
This is an unsupervised learning method for categorizing unlabeled data and making decisions based on the categories.
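The steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function names, the small two-dimensional `points` data set and the choice of ten restarts are assumptions made for the example. Restarts are compared by the total squared distance of each point to its assigned mean.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def assign(points, centers):
    """Index of the closest center for every point."""
    return [
        min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
        for p in points
    ]


def k_means(points, k, iterations=20, seed=0):
    """Return (centers, assignments, total squared distance)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: k random initial means
    for _ in range(iterations):      # step 3: iterate
        assignments = assign(points, centers)  # step 2: closest mean
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean_point(members) if members else centers[c])
        if new_centers == centers:   # nothing moved: converged
            break
        centers = new_centers
    assignments = assign(points, centers)
    cost = sum(squared_distance(p, centers[a]) for p, a in zip(points, assignments))
    return centers, assignments, cost


def best_of_restarts(points, k, restarts=10):
    """Run k-means from several random starts and keep the cheapest result."""
    runs = [k_means(points, k, seed=s) for s in range(restarts)]
    return min(runs, key=lambda run: run[2])


# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, assignments, cost = best_of_restarts(points, k=2)
print(centers, assignments, cost)
```

On this data the first three points end up in one cluster and the last three in the other. Which cluster receives index 0 depends on the random initial means.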
Support Vector Machine
The original SVM algorithm was invented by Vladimir N. Vapnik and Alexey Yakovlevich Chervonenkis in 1963. It is a supervised learning algorithm. Its decision boundary is determined by the extreme cases, that is, the training points that lie closest to the other class (the support vectors). An SVM is a frontier that best segregates two classes. Given training examples that are each marked as belonging to one of two classes, the algorithm develops a model that determines to which class new data belongs. The SVM model represents the data as points in space, with the two classes separated by as wide a margin as possible. If the given data cannot be separated properly, it is mapped to a higher-dimensional space in which a separation may be possible.
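To make the idea of a wide-margin separator concrete, here is a minimal sketch of a linear SVM trained by sub-gradient descent on the regularized hinge loss. This is one common way to fit a linear SVM and is not necessarily the original 1963 formulation. The function names, the hyperparameters and the small data set (centered on the origin to keep the example simple) are assumptions for illustration, and the mapping to a higher dimension is not shown.

```python
def train_linear_svm(samples, labels, epochs=100, learning_rate=0.1, reg=0.01):
    """Fit weights w and bias b; labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # The point is misclassified or inside the margin:
                # move the boundary away from it.
                w = [wi + learning_rate * (y * xi - reg * wi)
                     for wi, xi in zip(w, x)]
                b += learning_rate * y
            else:
                # Only the regularizer acts; shrinking w widens the margin.
                w = [wi - learning_rate * reg * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return the class (+1 or -1) on whose side of the hyperplane x lies."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical labeled training data: two linearly separable classes.
samples = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0),
           (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(samples, labels)
print(w, b, predict(w, b, (3.0, 3.0)), predict(w, b, (-3.0, -3.0)))
```

The learned hyperplane is the set of points where `w · x + b = 0`; new data is classified by the side of that hyperplane on which it falls.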
Since the SVM algorithm is supervised, it cannot be used without labels. So, at times, clustering algorithms are used to label the data first, and then SVM (supervised learning) algorithms are applied.
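A minimal sketch of this two-stage idea follows, restricted to one-dimensional data so that both stages stay short. The stage-one clustering is k-means with k = 2, seeded deterministically with the smallest and largest value (an assumption made for reproducibility, instead of random seeding). For stage two it relies on the fact that, in one dimension, the maximum-margin separator of two separable classes is the midpoint between their closest points. The readings and function names are hypothetical.

```python
def two_means_1d(values, iterations=20):
    """Stage 1: split 1-D values into two clusters (labels 0 and 1)."""
    low, high = min(values), max(values)  # deterministic initial means
    labels = []
    for _ in range(iterations):
        labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        new_low = sum(group0) / len(group0)
        new_high = sum(group1) / len(group1)
        if (new_low, new_high) == (low, high):  # converged
            break
        low, high = new_low, new_high
    return labels


def max_margin_threshold(values, labels):
    """Stage 2: midpoint between the closest points of the two classes.

    Assumes class 0 lies entirely below class 1, which holds for the
    labels produced by two_means_1d.
    """
    highest_of_class0 = max(v for v, lab in zip(values, labels) if lab == 0)
    lowest_of_class1 = min(v for v, lab in zip(values, labels) if lab == 1)
    return (highest_of_class0 + lowest_of_class1) / 2


def classify(threshold, value):
    return 1 if value > threshold else 0


# Hypothetical unlabeled readings, e.g. from a sensor.
readings = [20.1, 19.8, 20.5, 21.0, 35.2, 36.0, 34.7]
cluster_labels = two_means_1d(readings)                       # clustering labels the data
threshold = max_margin_threshold(readings, cluster_labels)    # supervised step
print(cluster_labels, threshold, classify(threshold, 30.0))
```

The clustering step supplies the labels that the supervised step needs. In more than one dimension, the second stage would be a full SVM rather than a single threshold.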
Before we compare the two algorithms, it should be clear that this is not exactly an apples-to-apples comparison. The two algorithms are different at their core: though both are machine learning algorithms, k-means is an unsupervised learning algorithm and SVM is a supervised learning algorithm.
The difference starts with the very type of data given to these algorithms. K-means is given unlabeled data, whereas SVM is given labeled data.
K-means reads the data, forms categories based on commonalities (the means), and makes decisions about new data based on those commonalities. SVM operates differently: it forms its model from a training data set, draws a hyperplane in the space, and segregates the data.
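K-means is fast, but its result depends on the initial means, so it can yield better results over multiple executions. SVM is slower to train but decisive: for a given training set and settings it yields one decision boundary.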
Realization and future directions:
The best Big data applications would surface patterns or answers even before you ask for them. This means developing machine learning algorithms that recognize and bring out patterns that were not specifically asked for but are hidden deep in the data. So much data is collected every day, and it holds many hidden patterns that are yet to be found. The approach in “Predict failures in production lines: A two-stage approach with clustering and supervised learning,” by D. Zhang, B. Xu and J. Wood, may be a basic case, but if we apply unsupervised learning algorithms such as k-means, or even more complex algorithms, and put the resulting clusters through supervised algorithms, I believe many unnoticed patterns in nature, in mass behavior or in any predictive field can be found.
Through this survey paper we have defined what big data is, how it is different and what its characteristics are. We have also explored the area of machine learning, studied what supervised and unsupervised learning are, and compared two different algorithms used in them.
- Shinde, Manisha. (2015). XML Object: Universal Data Structure for Big Data. International Journal of Research Trends and Development 2394-9333. 2. 107-113.
- Michel Adiba, Juan-Carlos Castrejon-Castillo, Javier Alfonso Espinosa Oviedo, Genoveva Vargas-Solar, José-Luis Zechinelli-Martini. Big Data Management Challenges, Approaches, Tools and their limitations. Shui Yu, Xiaodong Lin, Jelena Misic, and Xuemin Sherman Shen. Networking for Big Data, Chapman and Hall/CRC 2016, 978-1-4822-6349-7. <hal-01270335>
- Saint John Walker (2014) Big Data: A Revolution That Will Transform How We Live, Work, and Think, International Journal of Advertising, 33:1, 181-183, DOI: 10.2501/IJA-33-1-181-183
- Madden, Sam. “From databases to big data.” IEEE Internet Computing 3 (2012): 4-6.
- Arthur, David, and Sergei Vassilvitskii. “k-means++: The advantages of careful seeding.” Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2007.
- Unnikrishnan, Athira, Uma Narayanan, and Shelbi Joseph. “Performance analysis of various supervised algorithms on big data.” 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS). IEEE, 2017.
- Davenport, Thomas H., Paul Barth, and Randy Bean. “How ‘big data’ is different.” MIT Sloan Management Review, 2012.
- Lohr, Steve. “The age of big data.” New York Times 11.2012 (2012).
- McAfee, Andrew, et al. “Big data: the management revolution.” Harvard business review 90.10 (2012): 60-68.
- D. Zhang, B. Xu and J. Wood, “Predict failures in production lines: A two-stage approach with clustering and supervised learning,” 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, 2016, pp. 2070-2074. doi: 10.1109/BigData.2016.7840832
- Manyika, James, Chui, Michael, Brown, Brad, Bughin, Jacques, Dobbs, Richard, Roxburgh, Charles and Byers, Angela Hung. Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute (2011).