
A STUDY ON FEATURE ENGINEERING AT SCALE WITH DATABRICKS SPARK

Posted by David Santana on Apr 23, 2021 4:45:00 PM

Statistical data processing has been around for a very long time. The underpinnings of prominent probability theories have existed for centuries, such as Bayes' theorem (published in 1763), and Alan Turing's proposed learning machine, which would learn from data, is commonly seen as a precursor to the age of artificial intelligence. Undoubtedly, the milestones of our past have guided us into the future and will continue to bring insights to every industry and sector.

Organizations today are generating, storing, preparing, and analyzing large volumes of data. Using cutting-edge machine learning techniques, this data is then used to forecast every aspect of the business process. Recent advances in elastically scaling machine learning compute have accelerated the technology's adoption and growth. Companies of every size increasingly rely on machine learning to provide tailored services to their clients, and more and more businesses use it to optimize daily operations.

Organizations have learned to adopt cloud technologies to increase agility and control the cost of analytical operations. This shift has seen many companies migrate their analytical workloads to Microsoft Azure. For example, the American Cancer Society conducts research that analyzes petabytes of data to deliver value to cancer patients. Its Microsoft Azure digital transformation reduces overhead and stretches each dollar further, so the organization can spend its money where it counts: fighting cancer.

Microsoft Azure's capabilities are extensive: elasticity, automation, accelerated machine learning workloads, and more. These capabilities are delivered through powerful GPU compute and serverless compute resources, along with massively parallel, distributed data storage at a fraction of the price, addressing concerns common to data scientists in any sector or industry.

As mentioned, practitioners collect data for the benefit of all. Collected data helps industry leaders make informed decisions, from improving consumer services to accelerating the health industry's capability to administer life-saving care.

This white paper focuses on machine learning, particularly feature engineering using a cloud-scale Apache Spark service known as Azure Databricks. It also briefly addresses the concerns and value of a multi-staged, iterative machine learning process.

Data

Data scientists view data as observations of the world we all live in. Data is generated organically by any number of systems and processes, and every piece of generated data carries important information. However, there is always data that is missing, semi-structured, unstructured, categorical, or simply raw, whose distribution does not lend itself to modeling at scale.

It is common for data to require normalization and enrichment before it can be explored visually. Microsoft Azure offers both extract, transform, load (ETL) and extract, load, transform (ELT) data processing technologies. To support the various phases of a machine learning workflow, you need both data storage and data processing platform services.

Data Pipeline Overview

Microsoft Azure's various data pipeline technologies help store, classify, and process big data based on its volume, variety, and velocity. The data is ingested into cloud-scale big data analytical storage services, such as Azure Data Lake or Azure's managed Apache Hadoop service, HDInsight, and is processed by Azure Databricks. The data is then staged, sampled, mined, and replicated for fault tolerance. The transformed data, in its malleable format, is then modeled efficiently using an array of programming frameworks at cloud scale.

Data Mining

Data mining and inferencing strategies using methods such as the Cross-Industry Standard Process for Data Mining (CRISP-DM) are common when analyzing large data sets. The two tenets addressed here are business understanding and data understanding. The tasks consist of manual or automatic analysis of unknown variables and patterns, such as groupings of data, outliers, and dependencies. These patterns summarize the data and are used in machine learning.

Data scientists also apply k-fold cross-validation resampling methods to smaller samples of the population data. In addition, there are heuristic methods for estimating a suitable sample size, based on the number of features, to deliver optimal results.
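
To make this concrete, here is a minimal sketch of k-fold cross-validation with Spark ML. It assumes a hypothetical DataFrame train_df that already has a "features" vector column and a binary "label" column; the column names, grid values, and fold count are illustrative only.

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Hypothetical train_df: a Spark DataFrame with a "features" vector column
# and a binary "label" column prepared earlier in the pipeline.
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Small grid of regularization values to search over.
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()

# 5-fold cross-validation: each candidate is trained on four folds and
# evaluated on the held-out fold, and the metrics are averaged.
cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="label"),
    numFolds=5,
)
cv_model = cv.fit(train_df)
print(cv_model.avgMetrics)  # one averaged metric per grid combination
```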

Data mining uses machine learning and data models to discover patterns in large datasets, ensuring we adhere to business requirements and data understanding through classification. The data analysis and inferencing techniques used to evaluate models and form hypotheses about the dataset are not addressed in this article. Data mining is arguably as challenging as feature engineering, so it deserves at least an acknowledgment here. Moreover, inferencing at scale is attainable with Microsoft Azure technologies such as Azure Machine Learning.

Statistical Data Modeling

“Data models exist in the boundary where data and insights meet.”

This paper's main objective is to show how Databricks Spark machine learning accelerates feature engineering, which in turn improves data modeling.

Data modeling derives information from the characteristics of the observed data, such as categorical labels, numerical labels, missing values, or data redundancy. In brief, the statistical model describes various aspects of your data. As an example, the sporadic behavior of Dow Jones stock prices can be analyzed to predict future fluctuations. A data model built from this information would need to capture historical earnings, as well as past and present price correlations from our sample data, and then render them in a mathematical notation that lets us calculate the probability of any event. Note that the data model is a statistical assumption about the data sample.

Feature engineering helps drive insights from our sample data, thus ensuring the capability of our statistical data model. Measuring the model is also important; however, this paper does not explore the frequently used measures for quantifying prediction error, such as root-mean-square deviation. Future papers will cover how to improve statistical assumptions using these key methods. Nevertheless, understanding your data is arduous, and this is where optimizing your statistical models with scaled feature engineering delivers value.

Apache Spark MLLIB Overview

Apache Spark is an open-source analytical engine used for processing large volumes of data in memory to optimize performance. Additionally, Spark distributes data and workloads in parallel, thus improving overall performance. Furthermore, it can ingest and process data interactively and supports various programming languages such as Python, R, .NET, Java, Scala, and SQL.

Spark includes a machine learning library that interoperates with Python, R, and cutting-edge frameworks like ML.NET. Moreover, Spark keeps intermediate results in memory across iterations, which speeds up the repeated passes machine learning algorithms make over the data and yields more efficient results than traditional disk-bound computational services.
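
As a small illustration of that in-memory behavior, the sketch below caches a DataFrame before iterative training; the session name and storage path are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-engineering-demo").getOrCreate()

# Hypothetical source path in cloud storage.
df = spark.read.parquet("/mnt/datalake/transactions.parquet")

# cache() keeps the DataFrame in cluster memory, so the repeated passes made
# by iterative ML algorithms do not re-read and re-parse the source data.
df = df.cache()
df.count()  # an action that materializes the cache
```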

Scaling Analytics With Azure Databricks

Microsoft's Azure Databricks is a cloud service that simplifies implementing, modeling, evaluating, and deploying machine learning statistical data models at cloud scale. Databricks adds value by integrating the MLflow service to track the machine learning project lifecycle.

The Databricks machine learning runtime is a prebuilt environment that delivers preprocessing algorithms, including clustering, classification, regression, and feature extraction and transformation. It also provides various libraries, including XGBoost, scikit-learn, Horovod, PyTorch, and TensorFlow, and you can install additional libraries during cluster creation.

Databricks delivers extensions to improve performance, including GPU acceleration, distributed deep learning with HorovodRunner, and an improved checkpointing process for the Databricks file system.

Feature Engineering With Spark

Features represent characteristics of our data and how they line up with model assumptions. In fact, features are numeric (scalar or vector) representations of data. Feature engineering develops features from the sample data's characteristics. Deciding on the right number of features is just as important as ensuring they are relevant to the data project; otherwise, they add complexity and degrade the statistical data model's evaluation and performance.
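
For example, Spark ML expects features as a single vector column. A minimal sketch, assuming a hypothetical DataFrame raw_df with numeric columns named age, income, and transaction_count:

```python
from pyspark.ml.feature import VectorAssembler

# Assemble individual numeric columns into one "features" vector column,
# the representation Spark ML estimators expect.
assembler = VectorAssembler(
    inputCols=["age", "income", "transaction_count"],  # hypothetical columns
    outputCol="features",
)
features_df = assembler.transform(raw_df)
```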

Here are other factors to consider (a short Spark sketch of several of these transforms follows the list):

  • Binarization: Transforming numerical variables into binary vectors.
  • Log transforms: The log function compresses a wide range of large values into a much narrower range (for example, it maps [1, 10] onto [0, 1] and [10, 100] onto [1, 2]), which tames heavy-tailed distributions.
  • Scaling: Changing the input scale of the feature to control bounds.
    • Min-max scaling
    • Standardization
    • Normalization
  • Flattening: Using a bag-of-words representation, we count how often each relevant word appears and convert the text into count vectors.
  • Filtering: Separates the signal from the noise, for example by removing common words that carry little meaning.
    • Term frequency filtering
    • Stop-words
    • Stemming
  • Feature hashing: Compresses a high-cardinality feature into a fixed-length, lower-dimensional vector.
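
As referenced above, here is a short sketch of a few of these transforms in Spark ML: a log transform, binarization, and min-max scaling. The DataFrame df and its columns clicks and purchase_amount are hypothetical.

```python
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import Binarizer, MinMaxScaler, VectorAssembler

# Hypothetical input: df has numeric columns "clicks" and "purchase_amount".
# Log transform: compress the long tail of large purchase amounts.
df = df.withColumn("log_amount", F.log1p("purchase_amount"))
# Binarizer expects a double-typed column.
df = df.withColumn("clicks", F.col("clicks").cast("double"))

pipeline = Pipeline(stages=[
    # Binarization: 1.0 when clicks exceed the threshold, otherwise 0.0.
    Binarizer(threshold=0.0, inputCol="clicks", outputCol="clicked"),
    # Min-max scaling operates on a vector column, so assemble first.
    VectorAssembler(inputCols=["log_amount"], outputCol="amount_vec"),
    MinMaxScaler(inputCol="amount_vec", outputCol="amount_scaled"),
])
scaled_df = pipeline.fit(df).transform(df)
```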

Another matter to consider is the distribution of the scalar values. Traditionally, training a regression model assumes the scalar values follow a normal (Gaussian) distribution. However, if the values are heavily skewed or fall far outside expected bounds, the normality assumption breaks down. Various scaling methods can mitigate this concern; this paper does not delve into them in depth, but several are listed above.

Feature engineering with Spark reduces the cost of statistical data models because Spark keeps iterative computations in memory, optimizing machine learning performance. The performance benefits produce an efficient model, which decreases scoring time and keeps compute costs under control. Some of the most common Spark feature engineering practices are described below.

Feature filtering processes help remove features that are counterproductive, such as features that do not fit our scaling thresholds. Wrapper methods let us try out subsets of features, and embedded methods select model-specific features during evaluation, striking a balance between compute expense and quality.
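
As a minimal sketch of a filter-style selection in Spark ML, the example below keeps the top features ranked by a chi-squared test against the label; features_df, its column names, and the number of features kept are hypothetical.

```python
from pyspark.ml.feature import ChiSqSelector

# Filter method: rank features by a chi-squared test against the label and
# keep only the strongest ones.
selector = ChiSqSelector(
    numTopFeatures=20,                 # illustrative threshold
    featuresCol="features",
    labelCol="label",
    outputCol="selected_features",
)
selected_df = selector.fit(features_df).transform(features_df)
```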

Feature engineering requires handling the numerical and categorical variables that are common in transactional datasets. For instance, the username on a transaction is a categorical variable. If a categorical value appears many times in a dataset, we can represent it as a count; this is known as bin counting. With Spark machine learning libraries, encoding algorithms and methods transform large categorical variables into numerical variables at cloud scale.
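
A minimal bin-counting sketch using plain DataFrame operations; transactions_df and its username and is_fraud columns are hypothetical.

```python
from pyspark.sql import functions as F

# Bin counting: replace each username with how often it appears and how
# often its transactions were flagged, then join the statistics back.
counts = transactions_df.groupBy("username").agg(
    F.count("*").alias("user_txn_count"),
    F.avg(F.col("is_fraud").cast("double")).alias("user_fraud_rate"),
)
encoded_df = transactions_df.join(counts, on="username", how="left")
```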

ML Transformative Algorithms

Extracting, transforming, and selecting features is essential in feature engineering. Consider categorical variables such as ores: Titanium, Copper, Tungsten, Gold, and Obsidian, among others.

Transformative algorithms are required to convert nonnumeric data into numeric data. Transformation methods like one-hot encoding convert categorical features into a binary vector, where each bit represents a category.

One-hot encoding supports algorithms like logistic regression. Other coding methods, known as dummy coding and effect coding, are similar to one-hot encoding but address some of its caveats.
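
A minimal one-hot encoding sketch with Spark ML, using the ore example above; ores_df and the column names are hypothetical.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

pipeline = Pipeline(stages=[
    # Map each ore name (Titanium, Copper, ...) to a numeric category index.
    StringIndexer(inputCol="ore", outputCol="ore_index"),
    # Expand the index into a sparse binary vector, one bit per category.
    OneHotEncoder(inputCols=["ore_index"], outputCols=["ore_vec"]),
])
encoded_df = pipeline.fit(ores_df).transform(ores_df)
```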

Assimilating larger categorical variables may require more than simple coding methods. Feature hashing uses hash functions to map large input variables into a fixed number of bins, which is ideal for strings and complex data structures. Bin counting, also known as binning, computes statistics between a value and the target label. These methods and algorithms are included in Spark's machine learning libraries.
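
A minimal feature hashing sketch with Spark ML's FeatureHasher; the DataFrame transactions_df, its columns, and the bucket count are hypothetical.

```python
from pyspark.ml.feature import FeatureHasher

# Hash several high-cardinality columns into a fixed-size vector of 1,024
# buckets, so the feature width no longer grows with distinct values.
hasher = FeatureHasher(
    inputCols=["username", "merchant", "device_id"],
    outputCol="hashed_features",
    numFeatures=1024,
)
hashed_df = hasher.transform(transactions_df)
```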

At present, machine learning frameworks are changing rapidly, and data scientists require fast, convenient, on-demand access to various libraries. The Databricks Spark machine learning runtime delivers this and more by streamlining the machine learning pipeline, including powerful feature engineering accelerators such as AutoML features, hyperparameter tuning, and end-to-end monitoring at cloud scale.

Conclusion

The Azure Databricks machine learning runtime provides comprehensive tooling for developing and deploying machine learning statistical models. Spark's feature engineering functions and methods optimize and control R&D operational costs at cloud scale across all industries, thanks to Microsoft Azure's platform-as-a-service offerings.

Key Takeaways

  • Spark delivers ease of use to data scientists because Spark's API interoperates with ML.NET, R, and Python programming libraries, which streamlines the statistical modeling process.
  • Databricks runtime accelerates algorithmic computations which provides high-quality performance at cloud scale.
  • You can run Spark virtually anywhere: on Azure virtual machines, Azure Kubernetes Service (AKS), Azure HDInsight, and Azure Databricks. Integrate, ingest, and run inference on your data from the various data platform technologies hosted in the Microsoft Azure ecosystem, such as Azure Blob Storage or the massively scaled Azure Data Lake service.

Topics: Microsoft Azure, Artificial Intelligence, ITTraining, Machine Learning, Databricks, Apache Spark
