This is the fifth course in the IBM AI Enterprise Workflow Certification specialization. You are STRONGLY encouraged to complete these courses in order, as they are not independent courses but parts of a workflow in which each course builds on the previous ones.
This course introduces you to an area that few data scientists get to experience: deploying models for use in large enterprises. Apache Spark is a widely used framework for large-scale data processing and machine learning, and this course covers best practices for using it, including best practices for data manipulation, model training, and model tuning. The use case calls for the creation and deployment of a recommender system. The course wraps up with an introduction to model deployment technologies.
By the end of this course you will be able to:
1. Use Apache Spark’s RDDs, DataFrames, and pipelines
2. Employ spark-submit scripts to interface with Spark environments
3. Explain how collaborative filtering and content-based filtering work
4. Build a data ingestion pipeline using Apache Spark and Apache Spark streaming
5. Analyze hyperparameters in machine learning models on Apache Spark
6. Deploy machine learning algorithms using the Apache Spark machine learning interface
7. Deploy a machine learning model from Watson Studio to Watson Machine Learning
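Objective 2 above refers to the spark-submit script, the standard command-line entry point for sending an application to a Spark environment. As a hedged sketch only: the script name and resource sizes below are illustrative and not from the course, although the flags themselves are standard spark-submit options.

```shell
# Submit a PySpark application to a cluster.
# train_recommender.py, the memory setting, and the executor count are
# hypothetical placeholders; --master, --deploy-mode, --executor-memory,
# and --num-executors are standard spark-submit flags.
# (Use --master local[*] instead of yarn to run on a single machine.)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4G \
  --num-executors 4 \
  train_recommender.py
```

The same script runs unchanged whether it is submitted to a local process or a cluster; only the submission flags differ, which is part of what makes Spark attractive for scaling model training.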
Who should take this course?
This course targets existing data science practitioners who have expertise in building machine learning models and who want to deepen their skills in building and deploying AI in large enterprises. If you are an aspiring data scientist, this course is NOT for you, as you need real-world expertise to benefit from the content of these courses.
What skills should you have?
It is assumed that you have completed Courses 1 through 4 of the IBM AI Enterprise Workflow specialization and have a solid understanding of the following topics prior to starting this course:
1. Fundamentals of linear algebra
2. Sampling, probability theory, and probability distributions
3. Descriptive and inferential statistical concepts
4. General machine learning techniques and best practices
5. Practiced understanding of Python and the packages commonly used in data science: NumPy, Pandas, matplotlib, scikit-learn
6. Familiarity with IBM Watson Studio
7. Familiarity with the design thinking process
Today data scientists have more tooling than ever before to create model-driven or algorithmic solutions, and it is important to know when to take the time to make code optimizations. This week we spend a lot of time on hands-on activities. We start by interacting with Apache Spark, then progress to a tutorial on Docker, and wrap up the week by working through a tutorial on Watson Machine Learning.
Deploying Models using Spark
This week is primarily focused on deploying models using Spark. The rationale for moving to Spark almost always has to do with scale, either at the level of model training or at the level of prediction. Although fewer resources are available for building Spark applications than for scikit-learn, Spark gives us the ability to build in an entirely scalable environment. We will also look at recommendation systems. Most recommender systems today are able to leverage both explicit (e.g., numerical ratings) and implicit (e.g., likes, purchases, skips, bookmarks) patterns in a ratings matrix. The majority of modern recommender systems embrace either a collaborative filtering or a content-based approach, and a number of other approaches and hybrids exist, making some implemented systems difficult to categorize. We wrap up the week with our hands-on case study on model deployment.
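The collaborative filtering idea described above can be sketched in a few lines of plain Python: predict a missing entry of the ratings matrix as a similarity-weighted average of other users' ratings for the same item. This is a minimal illustrative sketch, not the course's implementation; the function names and the toy explicit-ratings matrix are invented for this example, and a production system on Spark would use a matrix-factorization approach rather than pairwise similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating vectors (0 means unrated)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict(ratings, user, item):
    """User-based collaborative filtering: estimate `user`'s rating of `item`
    as a similarity-weighted average of other users' ratings for that item."""
    num = den = 0.0
    for other, row in enumerate(ratings):
        # Skip the target user and any user who has not rated the item.
        if other == user or row[item] == 0:
            continue
        sim = cosine(ratings[user], row)
        num += sim * row[item]
        den += abs(sim)
    return num / den if den else 0.0

# Toy explicit-ratings matrix: rows are users, columns are items, 0 = unrated.
ratings = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
]

print(round(predict(ratings, user=1, item=1), 2))  # → 2.34
```

User 1 is most similar to user 0 (they agree on items 0 and 3), so the prediction for item 1 lands close to user 0's rating of 3. The same weighted-average idea extends to implicit signals by replacing ratings with confidence weights derived from behavior such as purchases or skips.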