In this article you will learn the following:
- What is Big data?
- Who uses Big data?
- Big data core tenets
- Associated concerns with Big data
- How can Microsoft Azure help with Big data?
- The Big Data pipeline
- Using Microsoft Azure to get a “Spark” of HDInsight
What is big data?
The volume of digital information being produced today is staggering, to say the least. It includes structured, semi-structured, and unstructured data from disparate sources, in sizes that range from gigabytes to petabytes. In a future article I will address the growing concern about the very real security risks that big data poses to us as Americans and as a people; for now, we will focus on its merits and remain optimistic.
Collecting and analyzing this ever-growing data with traditional processes and tools is challenging. The data can be historical or real-time, and it ranges from streams of tweets (Twitter) to batches of transactional history (Amazon), and from millions of internet server logs to telemetry from residential and industrial sensor equipment. Even searching for blogs on Google or Bing generates data. It is hard to fathom how much digital information is being created as you read this article.
Who uses Big data?
Public and private organizations around the world use this digital information because it has applications in every industry, for both personal and business benefit. Companies from small businesses to large enterprises use innovative forms of information processing that enable enhanced insight, decision making, and automation.
Big data core tenets
The techniques and technologies used to handle data at extreme scale economically are known as “Big Data.” There are some core tenets to consider when deciding whether you need a big data solution. Those tenets are known as Volume, Variety, and Velocity. If the volume of data cannot be stored using traditional vertical scaling and needs a horizontally scaled architecture, then you require a big data solution. Similarly, you have a variety concern when incoming data has a different structure or format than what is already stored, or when it is unstructured. Finally, there is the rate at which data arrives or changes. When the window for processing the data is small, you have a velocity concern, which also calls for a big data architecture.
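To make the three tenets concrete, here is a minimal sketch of how you might check a workload against them. The `Workload` class, the function name, and every threshold below are assumptions made up for illustration; real limits depend on your hardware, budget, and existing data store.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Rough description of a data workload (all fields are illustrative)."""
    size_tb: float               # total data volume, in terabytes
    formats: set                 # formats arriving, e.g. {"csv", "json"}
    processing_window_s: float   # time allowed to process each new arrival


# Hypothetical thresholds, chosen only for this example.
MAX_SINGLE_NODE_TB = 10.0        # beyond this, vertical scaling is assumed impractical
KNOWN_FORMATS = {"csv"}          # formats the existing store already handles
MIN_BATCH_WINDOW_S = 60.0        # below this, batch processing is assumed too slow


def big_data_concerns(w: Workload) -> list:
    """Return which of the three Vs apply to the workload."""
    concerns = []
    if w.size_tb > MAX_SINGLE_NODE_TB:
        concerns.append("volume")
    if w.formats - KNOWN_FORMATS:          # any format we do not already store
        concerns.append("variety")
    if w.processing_window_s < MIN_BATCH_WINDOW_S:
        concerns.append("velocity")
    return concerns


sensor_feed = Workload(size_tb=250.0, formats={"csv", "json"}, processing_window_s=5.0)
print(big_data_concerns(sensor_feed))  # ['volume', 'variety', 'velocity']
```

Any one of the three concerns is enough to start considering a big data solution; a workload that returns an empty list is probably fine on traditional tooling.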
Associated concerns with Big Data
The concern with Big Data is not the storage of data, as storage costs are low. The problem is the amount of data that has to be analyzed; large volumes of data are not easy to analyze with traditional technologies. All of this changed with Apache Hadoop, which enabled us to analyze massive volumes of data using economical hardware. The cloud, in turn, made it much simpler to implement even a moderately sized Hadoop cluster. Now if you want to analyze petabytes of data, you just spin up a Hadoop cluster and terminate it when you are done processing; you only pay for the time you use. This has drastically reduced the cost of data processing and has made it available to all businesses, no matter their size or internal talent base.
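The pay-for-what-you-use point is easy to see with a little arithmetic. The sketch below compares a cluster left running all month with one created only for the daily job. The node count, hourly rate, and job duration are hypothetical numbers picked for the example, not Azure prices.

```python
def cluster_cost(nodes: int, hourly_rate_per_node: float, hours: float) -> float:
    """Cost of running a cluster: nodes x rate x hours."""
    return nodes * hourly_rate_per_node * hours


# Hypothetical figures, for illustration only.
NODES = 8
RATE_PER_NODE_HOUR = 0.50   # currency units per node per hour
JOB_HOURS_PER_DAY = 3       # the cluster exists only while the job runs
DAYS = 30

always_on = cluster_cost(NODES, RATE_PER_NODE_HOUR, 24 * DAYS)
on_demand = cluster_cost(NODES, RATE_PER_NODE_HOUR, JOB_HOURS_PER_DAY * DAYS)

print(f"always on: {always_on:.2f}")               # always on: 2880.00
print(f"on demand: {on_demand:.2f}")               # on demand: 360.00
print(f"saved:     {always_on - on_demand:.2f}")   # saved:     2520.00
```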
How can Microsoft Azure help with Big data?
As we mentioned in the previous section, the cloud has simplified how we plan, prepare, and implement technologies like Apache Hadoop. Microsoft Azure delivers a cloud-scale version of Apache Hadoop, known as HDInsight. Because HDInsight is built on open-source Hadoop, you can run Hadoop applications on it without tedious refactoring. HDInsight includes Apache Hadoop ecosystem frameworks such as Spark, HBase, Storm, Pig, Hive, Sqoop, Oozie, Ambari, and more. Business analysts can continue to use their third-party BI (business intelligence) tools or opt for Microsoft’s Power BI application.
The Big Data pipeline
The pipeline pattern starts with data and should end with actionable insight. Typically, data is ingested using an appropriate application. The data is then staged, based on its structure, in one of Microsoft Azure’s storage offerings. Next, HDInsight processes and analyzes the data in motion, reading it from the initial storage and writing the results to the same or a different storage offering, depending on your project requirements. Optionally, HDInsight processes or analyzes the data again to extract more information from it. Data analysts then connect business intelligence tools to HDInsight to perform exploratory data analysis. Once our hypotheses hold up with a high degree of precision and accuracy, we can confidently present our findings to users, giving them insights they can use to make decisions.
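The stages above can be sketched as a chain of small functions. This is a plain-Python toy, not Azure or HDInsight code: the sample events, the function names, and the "structured"/"unstructured" buckets are all assumptions made for illustration.

```python
import json
from collections import defaultdict

# Hypothetical raw input: three JSON sensor readings and one free-text line.
RAW_EVENTS = [
    '{"device": "thermostat-1", "temp_c": 21.5}',
    '{"device": "thermostat-1", "temp_c": 22.5}',
    '{"device": "thermostat-2", "temp_c": 19.0}',
    'free-text log line that is not JSON',
]


def ingest(lines):
    """Ingest: accept raw records as they arrive."""
    return list(lines)


def stage(records):
    """Stage: separate records by structure, as you would when picking storage."""
    staged = {"structured": [], "unstructured": []}
    for record in records:
        try:
            staged["structured"].append(json.loads(record))
        except json.JSONDecodeError:
            staged["unstructured"].append(record)
    return staged


def process(structured):
    """Process: average the readings per device (the role HDInsight would play)."""
    readings = defaultdict(list)
    for row in structured:
        readings[row["device"]].append(row["temp_c"])
    return {device: sum(vals) / len(vals) for device, vals in readings.items()}


def present(averages):
    """Present: turn the result into lines a decision maker can read."""
    return [f"{device}: average {avg:.1f} C" for device, avg in sorted(averages.items())]


staged = stage(ingest(RAW_EVENTS))
insights = present(process(staged["structured"]))
print(insights)  # ['thermostat-1: average 22.0 C', 'thermostat-2: average 19.0 C']
```

In a real pipeline each function would be a separate service or job, and the hand-off between stages would go through storage rather than in-memory variables.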
Using Microsoft Azure to get a “Spark” of HDInsight
Azure HDInsight is a cloud-scale service running Apache Hadoop that makes it easy to create, configure, and run Apache frameworks like Spark. Apache Spark is a parallel processing framework that delivers in-memory analytics to speed up big data solutions. In short, you can run a Spark job that loads and caches data in memory and then queries it. This in-memory computing is typically faster than running a Hadoop MapReduce job, which shares data between steps through its distributed file system on disk.
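The load-once, query-many idea can be illustrated without a cluster. The toy `Dataset` class below is plain Python, not Spark, and everything in it is hypothetical; it only counts how often data is re-read from "storage" with and without caching. On a real Spark cluster the comparable step would be calling `cache()` on a DataFrame or RDD before running repeated queries.

```python
class Dataset:
    """Toy stand-in for a dataset that is expensive to load from storage."""

    def __init__(self, source_rows):
        self._source_rows = source_rows
        self._cached = None
        self.loads_from_storage = 0   # how many times we went back to storage

    def _load(self):
        self.loads_from_storage += 1
        return list(self._source_rows)

    def cache(self):
        """Load once and keep the rows in memory for later queries."""
        self._cached = self._load()
        return self

    def query(self, predicate):
        """Filter the rows, reloading from storage unless they are cached."""
        rows = self._cached if self._cached is not None else self._load()
        return [r for r in rows if predicate(r)]


uncached = Dataset(range(1000))
cached = Dataset(range(1000)).cache()

for ds in (uncached, cached):
    ds.query(lambda r: r % 2 == 0)
    ds.query(lambda r: r > 900)
    ds.query(lambda r: r < 10)

print(uncached.loads_from_storage)  # 3
print(cached.loads_from_storage)    # 1
```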
What are the benefits of using Spark on HDInsight?
Spark on Microsoft HDInsight delivers benefits such as:
- Implementation convenience using APIs such as:
  - Azure portal
  - Azure CLI
  - Azure SDK
  - Jupyter Notebooks
  - Apache Zeppelin Notebooks
- Interactive data processing
- REST APIs
  - Apache Livy
- Remote jobs (monitoring)
- Spark Core
- Spark SQL
- Spark streaming API
- Integration with Azure
- Integration with BI Tools
- ML Services
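As an example of the REST API benefit, here is a minimal sketch of preparing a remote Spark job submission through Apache Livy's batch endpoint (`POST /batches`). The cluster name, credentials, script path, and arguments are placeholders invented for this example, and the sketch assumes the usual HDInsight Livy address of `https://<cluster>.azurehdinsight.net/livy/batches`; check your own cluster's endpoint before relying on it. The code only builds the request and does not send it.

```python
import base64
import json
import urllib.request

# All values below are hypothetical placeholders.
CLUSTER_NAME = "mysparkcluster"
ADMIN_USER = "admin"
ADMIN_PASSWORD = "change-me"
SCRIPT_PATH = "wasbs:///example/scripts/wordcount.py"   # job file in cluster storage


def build_livy_batch_request(cluster, user, password, file_path, args):
    """Build (but do not send) a Livy POST /batches request."""
    url = f"https://{cluster}.azurehdinsight.net/livy/batches"
    body = json.dumps({"file": file_path, "args": list(args)}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Requested-By": user,              # commonly required by Livy's CSRF check
            "Authorization": f"Basic {token}",   # HTTP basic auth with the cluster login
        },
    )


request = build_livy_batch_request(
    CLUSTER_NAME, ADMIN_USER, ADMIN_PASSWORD, SCRIPT_PATH, ["input.txt"]
)
print(request.get_method(), request.full_url)
print(json.loads(request.data))
```

To actually submit the job you would pass `request` to `urllib.request.urlopen` and read the JSON response, which includes a batch id you can poll to monitor the job remotely.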
Consider the following prerequisites:
- Experience in Spark Scala
- Experience in PySpark
- Azure Account
- Azure portal
- HDInsight Spark
- Sample data
- Jupyter Notebook
Using HDInsight Spark Jupyter Notebooks to perform exploratory data analysis
APIs and tools:
- Azure portal
- Azure CLI
- Azure PowerShell (optional)
- Visual Studio 2017/2019 or later (optional)
- HDInsight Spark Cluster
- Jupyter Notebook running Spark/Scala