The continuing growth of massive and diverse data volumes, and of data-intensive applications, has created a need for effective means of data management across all sectors. According to a recent report, businesses face a huge skill gap in the management of big data, with the gap growing from 400 in 2007 to 4,000 in 2012 in the United Kingdom alone. In addition, there is a general lack of understanding among students of current data analytics processes, which are becoming extremely important for future challenges as the Internet of Things (IoT) and real-time data continue to grow.
As a computer scientist who studies and builds modeling and simulation applications, I was initially perplexed by the attraction of the term big data. Industry seems to focus on Hadoop-related software for data analytics, and having Hadoop-related projects on your resume can be a bonus. As a teacher of cloud computing and software engineering, I decided to assign two students Hadoop-related projects on big data management with a "smart cities" focus, and interviewed them about their learning objectives to see what they thought about the technologies.
As a prerequisite, the students were given full freedom to examine the topic of Hadoop big data processing and asked to explore whichever tools in this area they wanted. Hadoop is a set of tools that supports running big data applications as multiple job executions, allowing massive amounts of data to be processed quickly. It is an environment for running MapReduce jobs, which are usually executed in batches (a minimal sketch of such a job appears after the list of tools below). Hadoop has become one of the most important tools in science projects that require analyzing data. Some of the Hadoop-related tools my students investigated included:
- Apache Ambari: A framework for managing and monitoring Hadoop clusters
- Apache Pig: A platform for writing and running scripts that analyze large datasets.
- Apache Sqoop: A tool for moving data between Hadoop and other data stores.
- Apache ZooKeeper: A tool that provides synchronization and maintains configuration information.
- Apache Spark: A newer engine that can run some types of analysis much faster.
- Apache Flume: A system for gathering data that is later stored in HDFS.
- Apache Hive: A tool that lets users analyze data with an SQL-like language.
- Apache Oozie: A workflow tool that starts analysis jobs, broken into different parts, in the correct sequence.
- Hadoop Distributed File System (HDFS): The file system that divides data between nodes.
- HCatalog: A table and storage management tool that lets users analyze the same data with different processing tools such as Pig, Hive, and MapReduce.
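To make the batch-processing idea concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets plain scripts act as the mapper and reducer of a MapReduce job. The script name, the word-count task, and the paths mentioned below are illustrative assumptions rather than anything the students ran.

```python
#!/usr/bin/env python
# wordcount_streaming.py -- an illustrative Hadoop Streaming job: the same
# script serves as mapper ("map" mode) or reducer ("reduce" mode).
import sys


def mapper(stream):
    # Emit one "word<TAB>1" line per token; Hadoop sorts these by key
    # before handing them to the reducer.
    for line in stream:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))


def reducer(stream):
    # Input arrives grouped by word, so a running total per key suffices.
    current_word, current_count = None, 0
    for line in stream:
        word, count = line.strip().rsplit("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        reducer(sys.stdin)
    else:
        mapper(sys.stdin)
```

The logic can be tested locally with `cat input.txt | python wordcount_streaming.py map | sort | python wordcount_streaming.py reduce`; on a cluster it would be submitted through the hadoop-streaming JAR that ships with the distribution, whose exact path varies by install.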
After the students successfully finished their final year  dissertations, I asked them some questions to understand what they  learned from the experience. Here are the responses from both of my  students, Saudamini Sonalker and Rafiat Olubodun Kadiri, who were doing  independent experiments with Hadoop.
Why did you want to learn Hadoop? Just to learn something new, or were you influenced by industry interest in the project?
Saudamini: I was primarily motivated to work on this topic after reading a book about big data by Viktor Mayer-Schönberger and Kenneth Cukier: Big Data: A Revolution That Will Transform How We Live, Work, and Think. The predictive nature of the tools that assist big data processing is what drew me to learning more about it. Concentrating on smart city data was also an interesting element of this project. I wanted to learn and understand more about how city data can be utilized to make cities efficient, green, and smart.
Rafiat: I chose the topic of Hadoop because it is a new area; it is a buzzword, and it has recently been dominating the market. Different businesses make use of it, including social media websites such as Twitter and Facebook, which use Hadoop to mine data for different purposes, enabling them to make reasonable business decisions.
What do companies use big data for? What kind of questions are they using it to ask?
Saudamini: Companies use big data for numerous  purposes. Amazon utilizes it for recommendations, Skyscanner and Kayak  for adjusting flight prices by monitoring an individual's past searches,  and Google uses it to determine the order of search results. An  interesting use of big data was Amsterdam's Energy Atlas project. It used energy consumption data from within the city to  promote renewable energy by making its citizens aware of their own  usage.
Rafiat: Different companies have different uses for big data. How a company uses big data depends on the type of service it provides to the public. Businesses like eBay and Amazon use big data to predict what customers may want based on their previous purchase history and similar purchases by other customers.
What problems did you have when installing Hadoop while setting up the sandbox environment? What led you to choose Hortonworks Sandbox for your experiments?
Saudamini: I explored a couple of options before deciding on the Hortonworks Data Platform. The major reason for choosing it was that it is open source and free. Other competitors like MapR, Amazon Web Services, and Cloudera, however good their platforms, were expensive. However, there were strict memory requirements to set up the sandbox: a 64-bit processor was necessary to access the sandbox via a virtual machine, and it required at least 4GB of RAM. This slowed the process down for me, and the platform has no flexibility in terms of requirements.
Rafiat: There are quite a number of public Hadoop clusters that have been  designed for storing and analyzing large amounts of unstructured data in  a computing environment. They are available on cloud infrastructures  such as Heroku, Hortonworks Sandbox, Azure, and others.
After a few searches, I decided to use the Hortonworks Data Platform, an open source Apache Hadoop data platform. The system requirements included a Windows or Mac operating system, at least 4GB of RAM, a virtual machine environment, and a 64-bit chip that supports virtualization.
The first step was to download a virtual machine, then download the  sandbox from the Hortonworks website. After this I connected to the  sandbox with the given IP address.
There were some negative aspects to using the Hortonworks sandbox for research, which I still face. I was unable to access the sandbox with the given IP address for a while, but after multiple trials it worked. Second, the virtual machine slowed down my computer the moment it was switched on, and it took a long time for a query to load.
Further, when my machine switches off by itself without allowing me to shut the virtual machine down properly, the next time I switch it on the virtual machine comes up with configuration errors that prevent me from accessing the sandbox. Another issue I face is not being able to access some of the tools at times, which slows down my research.
How does the Hortonworks Data Platform work?
Saudamini: The platform can be divided into three  layers: the data access layer, cluster resource management, and HDFS.  The data access layer is where the user uploads, catalogues, and manages  data; one uses this layer to enter their Hive/Pig jobs for the system  to perform. Cluster resource management (YARN) is an architectural hub  for data processing engines so multiple applications can be run on the  HDFS. This layer essentially works as a translator for the other two.  Finally, HDFS is where the MapReduce jobs are run in parallel between  the master and slave nodes.
Ambari is a web-based GUI that can speak to the underlying machinery and allows users to set up and manage a Hadoop cluster.
Rafiat: When accessing the sandbox, I was directed to a page where I had access to different tools like Hive, the file browser, Pig, the job browser, and others. I could upload different types of files (zip, csv, xml), then create tables from tools like Hive, Pig, and HCatalog with the file that had been uploaded through the file browser icon. I could then create queries to produce different types of tables, with different criteria, to fit a requirement.
Ambari can be used to monitor and manage Hadoop clusters. It monitors the outcome of the queries that have been carried out and shows their effect on CPU usage, memory usage, network usage, and so on.
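Besides the web GUI, Ambari also exposes a REST API, which is one way to pull the same monitoring information programmatically. The short sketch below is an assumption-laden example rather than part of the students' work: it presumes the usual sandbox defaults (Ambari on port 8080, admin/admin credentials, and a cluster registered as "Sandbox"), which differ between installations.

```python
# ambari_services.py -- rough sketch of listing Hadoop services via the
# Ambari REST API; host, port, credentials, and cluster name are assumed
# sandbox defaults, not guaranteed values.
import requests

AMBARI = "http://127.0.0.1:8080/api/v1"   # Ambari server (assumed)
AUTH = ("admin", "admin")                 # default sandbox login (assumed)
CLUSTER = "Sandbox"                       # cluster name in Ambari (assumed)

resp = requests.get(
    "%s/clusters/%s/services" % (AMBARI, CLUSTER),
    auth=AUTH,
    headers={"X-Requested-By": "ambari"},
)
resp.raise_for_status()

# Each item describes one installed service (HDFS, YARN, Hive, and so on).
for item in resp.json().get("items", []):
    print(item["ServiceInfo"]["service_name"])
```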
What tools did you explore, and what were the new things you learned in the process?
Saudamini: Initially, I planned on exploring Pig and Hive, but I had issues running the Pig script on the Hortonworks Sandbox and hence stuck with Hive. Hive Query Language is very similar to SQL, so someone proficient in the latter shouldn't have an issue working with the tool. On the Hortonworks Sandbox, Hive has a graphical user interface called Beeswax. Hive converts the queries you write into MapReduce jobs. Whether one needs multiple options to process data depends on the skill sets of the users working on a large project; Hive diminishes the need to train or hire external resources to fill that gap. This flexibility is useful in scenarios like those.
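To show what working with Hive looks like in practice, here is a small sketch that issues a HiveQL query from Python. It assumes the third-party pyhive client and a reachable HiveServer2 endpoint; the table, columns, and connection details are invented for illustration, and the same query could just as well be typed into Beeswax.

```python
# hive_traffic_query.py -- hedged sketch of a HiveQL query run from Python
# via the third-party pyhive client; table and column names are invented.
from pyhive import hive

conn = hive.Connection(host="127.0.0.1", port=10000, username="hue")
cursor = conn.cursor()

# HiveQL reads like SQL; behind the scenes Hive compiles the query into
# MapReduce jobs that run on the cluster.
cursor.execute(
    """
    SELECT borough, AVG(traffic_flow) AS avg_flow
    FROM london_traffic
    GROUP BY borough
    ORDER BY avg_flow DESC
    LIMIT 10
    """
)

for borough, avg_flow in cursor.fetchall():
    print(borough, avg_flow)
```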
Rafiat: I used Hive, which uses an SQL-like scripting language known as HiveQL. It is suitable for users who are familiar with structured query language. Additionally, Pig was used for data analysis; it is a high-level processing layer on Hadoop and uses a language called Pig Latin.
What kind of files did you process? Smart city datasets?
Saudamini: I concentrated on smart city data, specifically London traffic and social data.
Rafiat: Smart city data were used for these experiments; most of the data was retrieved from the ITU data statistics website and the London Datastore website.
What were the goals of the experiments? What did you achieve?
Saudamini: The goal was to observe the performance of the underlying machinery and cluster loads. After processing different big data files, I compared results for CPU performance, cluster loads, memory usage, and network usage.
Transport and social data were processed on the platform to check the feasibility of implementing smart offices within London to reduce traffic and save people's time. The hypothesis was that there would be a correlation between high-traffic boroughs and the boroughs with the most work destinations. Although that held up in most cases, these boroughs were not in central London as initially imagined.
Rafiat: The goal of the experiment was to analyze sets of data retrieved from different sources such as the ITU (International Telecommunication Union) website, the London Datastore, public datasets on Amazon Web Services, and so on. The aim was to use volume as one of the criteria to consider while analyzing the data. By doing this, the experiment could show how long the data takes to be processed.
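A rough sketch of that volume experiment might look like the following, reusing the hypothetical pyhive connection from the earlier example; the table names are placeholders standing in for the same dataset loaded at different sizes.

```python
# volume_timing.py -- illustrative sketch: run the same query against tables
# of increasing size and record the wall-clock time each run takes.
import time

from pyhive import hive  # third-party client, assumed available

conn = hive.Connection(host="127.0.0.1", port=10000, username="hue")
cursor = conn.cursor()

# Hypothetical tables holding the same data at small, medium, and large volumes.
for table in ["itu_stats_small", "itu_stats_medium", "itu_stats_large"]:
    start = time.time()
    cursor.execute("SELECT COUNT(*) FROM %s" % table)
    (row_count,) = cursor.fetchone()
    elapsed = time.time() - start
    print("%s: %d rows processed in %.1f s" % (table, row_count, elapsed))
```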
If you were given a project now for big data processing, how would you approach it?
Saudamini: If time is not a concern and price is an issue, then I would recommend using the Hortonworks Sandbox, as its flexibility in terms of data sources, its range of data processing tools, and the Ambari environment give a well-rounded data management experience. However, if time is of the essence and money not a factor, then it would be beneficial to look at other options that provide a similar user experience in the cloud.
Rafiat: I would use the Hortonworks Data Platform on a separate machine dedicated to the platform, as my own machine was not very high-spec.
As a computer science student, do you think we should always use tools like these for data management?
Saudamini: If the dataset you are working with is large, then I think it is advisable to use big data tools like these. Their flexibility and quick processing make them ideal to deploy as solutions to smart city issues. However, I am not convinced that we should always use them. We could actually try to avoid these tools if the dataset doesn't demand them; a lot of the analytical functions can be done by other BI tools. Big data tools can have a steep learning curve, and training users should be factored in when deploying systems that utilize them.
Rafiat: Data management is a very important topic. There are different advantages to managing data effectively as a student, individual, or organization. These include preventing data duplication, which saves storage space, and allowing results to be validated if need be. Data management also allows a proper understanding of data and the use of queries to provide the specific information needed, so the data can be understood easily.
In conclusion, we got mixed results on the use of tools to process big data applications. An open Hadoop data platform seemed like the obvious choice at the time. As previously described, MapReduce is at the core of Hadoop and runs over the Hadoop Distributed File System. The Hortonworks Sandbox is equipped with YARN, the second generation of MapReduce, which separates resource management from job scheduling and makes the process more efficient. YARN supports batch as well as real-time processing projects. The Hortonworks Data Platform has the capability to adapt to the user's existing data architecture, which is a huge plus. In addition to the platform being cost-free, efficient, and adaptable, it also has an extensive list of tutorials and user guides on the services it provides.
There are a lot of big data processing platforms available as a result of big data being the current buzzword. Most services, such as Amazon Web Services, Cloudera, and MapR, to name a few, charge the user depending on the traffic and amount of data they process. Cloudera's website claims, "The company's enterprise data hub (EDH) software platform empowers organizations to store, process and analyze all enterprise data, of whatever type, in any volume—creating remarkable cost-efficiencies as well as enabling business transformation."
The current move towards open data is generating massive amounts of data that need real-time processing and intelligent solutions. Having more tools that are open source can fuel further open data research, impacting not only computing but also the social sciences, where economists and governments can make use of big data as well.
 