Data Science collaboration is changing. Gone are the days of sharing files over email, and new team members no longer need to download dependencies before they can work on a project. This article covers a few of the options now available for improving the efficiency of your Data Science team and the quality of its workflow.
Databricks is a platform built on Apache Spark that has evolved into an end-to-end product, bringing parallel processing into the machine learning workflow. Environments can be loaded pre-configured and customized for a particular project. Using notebooks, Data Science team members can comment on and add to the project in several languages, including Python, R, Scala and SQL, all in the same notebook. Databricks even allows the team to test and productionize models from the UI. Databricks has also partnered with Microsoft and AWS to integrate with their products.
MLflow is an open source project that focuses on four areas of the ML lifecycle: tracking, projects, models and registry. Many of the products mentioned in this article use its packages and contribute to the project. It is an option to consider if the Data Science team in your organization wants to avoid operational costs and is willing to accept more up-front setup work in exchange. MLflow has API packages in Python, Java and R, and has its own REST API.
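To make the REST API mentioned above more concrete, here is a minimal Python sketch that builds, but does not send, a request to log a single metric to an MLflow tracking server. It uses only the standard library. The server address, the helper name `build_log_metric_request`, the run ID and the metric values are all assumptions made for illustration; a real run ID would come from the server when a run is created.

```python
import json
import time
import urllib.request

# Assumption: a tracking server running locally, for example one started
# with `mlflow server`, which listens on port 5000 by default.
TRACKING_URI = "http://localhost:5000"


def build_log_metric_request(run_id, key, value, step=0, base_url=TRACKING_URI):
    """Build (but do not send) a POST request for MLflow's log-metric endpoint."""
    payload = {
        "run_id": run_id,
        "key": key,
        "value": value,
        "timestamp": int(time.time() * 1000),  # milliseconds since the epoch
        "step": step,
    }
    return urllib.request.Request(
        url=f"{base_url}/api/2.0/mlflow/runs/log-metric",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "abc123" is a placeholder run ID used only for this illustration.
request = build_log_metric_request("abc123", "rmse", 0.42, step=1)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With a tracking server running and a valid run ID, passing `request` to `urllib.request.urlopen` would submit the metric. In day-to-day work, most teams would likely use the Python, Java or R packages rather than raw HTTP calls.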
Azure Machine Learning (Studio) is a data science workflow platform developed by Microsoft. It offers the option of creating ML workflows via a drag-and-drop interface or using Jupyter notebooks, JupyterLab and RStudio. Deployment options include container instances for test and development or Kubernetes for real-time scalable solutions. One standout of this product is its ability to provide explanations via feature importance and data disparity, or bias (in preview). It can also monitor data drift.
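For readers new to the term, data drift means that the data a model sees in production no longer looks like the data it was trained on. The Python sketch below illustrates the idea with the Population Stability Index (PSI), a common general-purpose drift score. It is not the method Azure Machine Learning uses internally; the function name, the bin count and the sample data are assumptions chosen for illustration.

```python
import math


def population_stability_index(baseline, current, bins=10):
    """Compare two numeric samples; larger values mean the distributions differ more."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the baseline range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    expected = bin_shares(baseline)
    actual = bin_shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Made-up data: a training sample, an identical copy, and a shifted copy.
training = [x / 100 for x in range(1000)]
unchanged = list(training)
shifted = [x + 3.0 for x in training]

print(population_stability_index(training, unchanged))
print(population_stability_index(training, shifted))
```

The identical sample scores zero, while the shifted sample scores well above it. A monitoring service would typically compute a score like this on a schedule and raise an alert when it crosses a threshold the team has chosen.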
Amazon’s SageMaker is an IDE platform that, like the other products listed here, supports end-to-end workflows. SageMaker uses notebooks and can perform automatic debugging. It also offers Ground Truth, a labeling service designed to assist in creating high-quality training datasets. Deployment and model monitoring are also straightforward through SageMaker services. SageMaker even offers Augmented AI, which adds human review of model predictions.
These are just a few of the options out there. To find out more about the latest tools in Data Science, check out some of Fast Lane’s Data Science course offerings here!