O'Reilly - Building Pipelines for Natural Language Understanding with Spark
by David Talby, Alex Thomas | Released December 2016 | ISBN: 9781491978122
The course is designed for engineers and data scientists who have some familiarity with Scala, Apache Spark, and machine learning, and who need to process large volumes of natural language text in a distributed fashion.

We will use a sample of posts from the subreddit /r/WritingPrompts, which contains short stories and comments about the short stories.

The course has four parts:
1. Building a natural language processing and entity extraction pipeline on Scala & Spark
2. Machine Learning Applications for Statistical Natural Language Understanding at Scale
3. Topic Modeling on Natural Language with Scala, Spark and MLLib
4. Deep Learning Applications for Natural Language Understanding with Scala, Spark and MLLib

You will learn how to use Apache Spark to process text with annotations, use machine learning with your annotations, create and use topic models, and create and use a word2vec model.
- Welcome to the Course 00:01:37
- Part 1: Building a natural language processing and entity extraction pipeline on Scala & Spark
- Notebook 1: Introduction 00:02:35
- Annotation Library 00:04:15
- Basic Annotators 00:08:59
- Vocabulary Analysis 00:09:30
- Exercise: Building a stopword annotator 00:05:06
- Part 2: Machine Learning Applications for Statistical Natural Language Understanding at Scale
- Notebook 2: Introduction 00:02:14
- Model-based Annotators 00:04:18
- Creating a Binary Classifier 00:14:38
- Exercise: Predicting score or popularity 00:05:30
- Part 3: Topic Modeling on Natural Language with Scala, Spark and MLLib
- Notebook 3: Introduction 00:02:12
- K-Means clustering 00:07:03
- LDA topic modeling 00:07:39
- Exercise: Using topics for score or popularity prediction 00:02:36
- Part 4: Deep Learning Applications for Natural Language Understanding with Scala, Spark and MLLib
- Notebook 4: Introduction 00:02:07
- Word2Vec 00:05:05
- Expanding genre entity lists 00:04:49
- Exercise: Using Word2Vec based features for score or popularity prediction 00:02:44