O'Reilly - Building Data Pipelines with Python
by Katharine Jarmul | Released November 2016 | ISBN: 9781491970263
This course shows you how to build data pipelines and automate workflows using Python 3. From simple task-based messaging queues to complex frameworks like Luigi and Airflow, the course delivers the essential knowledge you need to develop your own automation solutions. You'll learn the architecture basics and receive an introduction to a wide variety of the most popular frameworks and tools.

Designed for the working data professional who is new to the world of data pipelines and distributed solutions, the course requires intermediate-level Python experience and the ability to manage your own system set-ups.

- Acquire a practical understanding of how to approach data pipelining using Python toolsets
- Master the ability to determine when a Python framework is appropriate for a project
- Understand workflow concepts like directed acyclic graphs, producers, and consumers
- Learn to integrate data flows into pipelines, workflows, and task-based automation solutions
- Understand how to parallelize data analysis, both locally and in a distributed cluster
- Practice writing simple data tests using property-based testing

Katharine (AKA Kjam) Jarmul is a Python developer, data consultant, and educator who has worked with Python since 2008. Kjam runs kjamistan UG, a Python consulting, training, and competitive analysis company based in Berlin, Germany. She is the author of several O'Reilly titles, including Data Wrangling with Python: Tips and Tools to Make Your Life Easier. She holds an M.A. from American University and an M.S. from Pace University.
- Introduction
- Welcome To The Course 00:02:53
- About The Author 00:01:55
- Automation 101
- Introduction To Automation 00:02:48
- Adventures With Servers 00:06:37
- Being A Good Systems Caretaker 00:06:03
- What Is A Queue? 00:02:32
- What Is A Consumer? What Is A Producer? 00:02:00
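
The Automation 101 lessons introduce queues and the producer/consumer pattern. As a minimal sketch of those ideas (illustrative only, not code from the course), using just the Python standard library:

```python
import queue
import threading

# Shared FIFO queue: the producer puts work on it, the consumer takes it off.
tasks = queue.Queue()

def consumer():
    # Pull items until the sentinel (None) arrives.
    while True:
        item = tasks.get()
        if item is None:
            break
        print(f"processing {item}")

def producer(n):
    # Enqueue n payloads, then a sentinel telling the consumer to stop.
    for i in range(n):
        tasks.put(f"task-{i}")
    tasks.put(None)

worker = threading.Thread(target=consumer)
worker.start()
producer(5)
worker.join()
```

Task queues like Celery generalize this pattern: the queue lives in a broker, and consumers run as separate worker processes, often on other machines.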
- Easy Task Processing With Celery
- Why Celery? 00:01:49
- Celery Architecture & Set Up 00:05:25
- Writing Your First Tasks 00:07:49
- Deploying Your Tasks 00:06:08
- Scaling Your Workers 00:08:52
- Monitoring With Flower 00:05:05
- Advanced Celery Features 00:06:00
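
To give a flavor of what the Celery lessons cover, here is a minimal task module; the Redis broker URL is an assumption for illustration, not the course's required setup:

```python
# tasks.py -- a minimal Celery app.
from celery import Celery

# Assumes a Redis broker running locally (an assumption, not course setup).
app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def add(x, y):
    """A trivial task executed asynchronously by a worker process."""
    return x + y
```

A worker started with `celery -A tasks worker` picks up calls enqueued via `add.delay(2, 3)`; Flower, covered above, can then monitor those workers.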
- Scaling Data Analysis With Dask
- Why Dask? 00:03:01
- First Steps With Dask 00:10:08
- Dask Bags 00:10:18
- Dask Distributed 00:09:58
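
As a rough sketch of the Dask bag API these lessons explore (illustrative only):

```python
import dask.bag as db

# A Dask bag partitions a collection and processes partitions in parallel.
numbers = db.from_sequence(range(10), npartitions=2)

# Operations build a task graph lazily; .compute() executes it.
result = numbers.map(lambda x: x * 2).filter(lambda x: x > 5).sum().compute()
print(result)  # 84
```

The same lazy task graph can be executed on a cluster via Dask Distributed, which the final lesson in this section covers.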
- Data Pipelines With Luigi & Airflow
- What Are Data Pipelines? What Is A DAG? 00:02:37
- Luigi And Airflow: A Comparison 00:05:50
- First Steps With Luigi 00:07:12
- More Complex Luigi Tasks 00:09:17
- Introduction To Hadoop 00:08:21
- First Steps With Airflow 00:08:07
- Custom Tasks With Airflow 00:09:16
- Advanced Airflow: Subdags And Branches 00:11:17
- Using Luigi With Hadoop 00:10:15
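
For orientation, here is a minimal Luigi task of the kind these lessons build on; the task and file name are illustrative, not from the course:

```python
import luigi

class WriteGreeting(luigi.Task):
    """Luigi builds a DAG from requires()/output() declarations and
    only re-runs tasks whose output target is missing."""
    name = luigi.Parameter(default="world")

    def output(self):
        return luigi.LocalTarget(f"greeting-{self.name}.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write(f"hello {self.name}\n")

if __name__ == "__main__":
    luigi.build([WriteGreeting()], local_scheduler=True)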
- Other Workflow Frameworks
- Apache Spark 00:08:28
- Apache Spark Streaming 00:06:32
- Django Channels 00:09:39
- And Many More 00:05:59
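
As a hedged sketch of the PySpark model touched on in the Apache Spark lesson (a local word count, not course code):

```python
from pyspark.sql import SparkSession

# Assumes a local Spark installation; "local[*]" runs on all local cores.
spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

lines = spark.sparkContext.parallelize(["to be or not", "to be"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())
spark.stop()
```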
- Testing With Pipelines
- Introduction To Testing With Python 00:07:24
- Property-Based Testing With Hypothesis 00:06:09
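
The Hypothesis lesson covers property-based testing: instead of hand-picking test cases, you state a property and let the library generate many inputs that try to falsify it. A minimal example of the style (an illustrative property, not from the course):

```python
from hypothesis import given
from hypothesis import strategies as st

@given(st.lists(st.integers()))
def test_reversing_twice_is_identity(xs):
    # Hypothesis generates many lists of integers and checks the property.
    assert list(reversed(list(reversed(xs)))) == xs
```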
- Conclusion