Agile Machine Learning: Effective Machine Learning Inspired by the Agile Manifesto
Ebook · 408 pages · 4 hours

About this ebook

Build resilient applied machine learning teams that deliver better data products through adapting the guiding principles of the Agile Manifesto.

Bringing together talented people to create a great applied machine learning team is no small feat. With developers and data scientists both contributing expertise in their respective fields, communication alone can be a challenge. Agile Machine Learning teaches you how to deliver superior data products through agile processes and shows, by example, how to organize and manage a fast-paced team challenged with solving novel data problems at scale, in a production environment.

The authors’ approach models the ground-breaking engineering principles described in the Agile Manifesto. The book provides further context, and contrasts the original principles with the requirements of systems that deliver a data product.


What You'll Learn

  • Effectively run a data engineering team that is metrics-focused, experiment-focused, and data-focused
  • Make sound implementation and model exploration decisions based on the data and the metrics
  • Know the importance of data wallowing: analyzing data in real time in a group setting
  • Recognize the value of always being able to measure your current state objectively
  • Understand data literacy, a key attribute of a reliable data engineer, from definitions to expectations


Who This Book Is For

Anyone who manages a machine learning team, or is responsible for creating production-ready inference components. Anyone responsible for the data project workflow (sampling data; labeling, training, testing, improving, and maintaining models; and tracking system and data metrics) will also find this book useful. Readers should be familiar with software engineering and understand the basics of machine learning and working with data.

Language: English
Publisher: Apress
Release date: Aug 21, 2019
ISBN: 9781484251072

    Book preview

    Agile Machine Learning - Eric Carter

    © Eric Carter, Matthew Hurst 2019

    E. Carter, M. Hurst, Agile Machine Learning, https://doi.org/10.1007/978-1-4842-5107-2_1

    1. Early Delivery

    Eric Carter¹ and Matthew Hurst²

    (1) Kirkland, WA, USA

    (2) Seattle, WA, USA

    Our highest priority is to satisfy the customer through early and continuous delivery of valuable [data].

    —agilemanifesto.org/principles

    Data projects, unlike traditional software engineering projects, are almost entirely governed by a resource fraught with unknown patterns, distributions, and biases – the data. To successfully execute a project that delivers value through inference over data sets, a novel set of skills, processes, and best practices needs to be adopted. In this chapter, we look at the initial stages of a project and how we can make meaningful progress through these unknowns while engaging the customer and continuously improving our understanding of the data, its value, and the implications it holds for system design.

    To get started, let’s take a look at a scenario, drawn from the authors’ experience working on Microsoft’s Bing search engine, that introduces the early stages of a project that involved mining local business data from the Web. There are millions of local business locations in the United States. Approximately 50%¹ of these maintain some form of web site, whether a simple, one-page design hosted by a drag-and-drop web hosting service or a sophisticated multi-brand site developed and maintained by a dedicated web team. The majority of these businesses update their sites before any other representation of the business data, driven by an economic incentive to ensure that their customers can find authoritative information about them.² For example, if their phone number is incorrect, then potential customers will not be able to reach them; if they move and their address is not updated, then they risk losing existing clients; if their business hours change with the seasons, then customers may turn away.

    A local search engine is only as good as its data. Inaccurate or missing data cannot be improved by a pretty interface. Our team wanted to go to the source of the data – the business web site – to get direct access to the authority on the business. As an aggregator, we wanted our data to be as good as the data the business itself presented on the Web. In addition, we wanted a machine-oriented strategy that could compete with the high-scale, crowd-sourced methods that competitors benefitted from. Our vision was to build an entity extraction system that could ingest web sites and produce structured information describing the businesses presented on the sites.

    Extraction projects like this require a schema and some notion of quality to deliver a viable product,³ both determined by the customer – that is, the main consumer of the data. Our goal was to provide additional data to an existing system which already ingested several feeds of local data, combining them to produce a final conflated output. With an existing product in place, the schema was predetermined, and the quality of the current data was a natural lower bound on the required quality. The schema included core attributes: business name, address, phone number, as well as extended attributes including business hours, latitude and longitude, menu (for restaurants), and so on. Quality was determined in terms of errors in these fields.

    The Metric Is the Customer

    The first big shift in going from traditional agile software projects to data projects is that much of the role of the customer is shifted to the metric measured for the data. The customer, or product owner, certainly sets things rolling and works in collaboration with the development team to establish and agree to the evaluation process. The evaluation process acts as an oracle for the development team to guide investments and demonstrate progress.

    Just as the customer-facing metric is used to guide the project and communicate progress, any component being developed by the team can be driven by metrics. Establishing internal metrics provides an efficient way for teams to iterate on the inner loop⁴ (generally not observed by the customer). An inner metric will guide some area of work that is intended to contribute to progress of the customer-facing metric.

    A metric requires a data set in an agreed-upon schema derived from a sampling process over the target input population, an evaluation function (that takes data instances and produces some form of score), and an aggregation function (taking all of the instance results and producing some overall score). Each of these components is discussed and agreed upon by the stakeholders. Note that you will want to distinguish metrics of quality (i.e., how correct is the data) from metrics of impact or value (i.e., what is the benefit to the product that is using the data). You can produce plenty of high-quality data, but if it is not in some way an improvement on an existing approach, then it may not have any actual impact.
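    As a minimal sketch of how these three parts might fit together – our illustration, not code from the book, with a hypothetical name/address/phone schema and exact-match scoring standing in for the agreed-upon evaluation – consider:

```python
from dataclasses import dataclass
from typing import Dict, Iterable

@dataclass
class BusinessRecord:
    """Hypothetical agreed-upon schema for this sketch."""
    name: str
    address: str
    phone: str

FIELDS = ("name", "address", "phone")

def evaluate(predicted: BusinessRecord, truth: BusinessRecord) -> Dict[str, bool]:
    """Evaluation function: score one data instance (exact match per field here)."""
    return {f: getattr(predicted, f) == getattr(truth, f) for f in FIELDS}

def aggregate(instance_scores: Iterable[Dict[str, bool]]) -> Dict[str, float]:
    """Aggregation function: roll instance scores up into per-attribute precision."""
    scores = list(instance_scores)
    return {f: sum(s[f] for s in scores) / len(scores) for f in FIELDS}

# `sample` would be drawn from the target input population and judged by hand:
# metric = aggregate(evaluate(pred, truth) for pred, truth in sample)
```

    In practice, the evaluation function encodes the judgment guidelines agreed upon with the customer rather than exact string equality, and the sampling process itself is part of what the stakeholders sign off on.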

    Getting Started

    We began as a small team of two (armed with some solid data engineering skills) with one simple goal – to drive out as many unknowns and assumptions as possible in the shortest amount of time. To this end, we maximized the use of existing components to get data flowing end to end as quickly as possible.

    Inference

    We use the term inference to describe any sort of data transformation that goes beyond simple data manipulation and requires some form of conceptual modeling and reasoning, including the following techniques:

    Classification: Determining into which bucket to place a piece of data

    Extraction: Recognizing and normalizing information present in a document

    Regression: Predicting a scalar value from a set of inputs

    Logical reasoning: Deriving new information from existing data and rules

    Structural transformations of data (e.g., joining tables in a database) are not included as inference, though they may be necessary components of the systems that we describe.

    Adopting a strategy of finding technology rather than inventing technology allowed us to build something in a matter of days that would quickly determine the potential of the approach, as well as identify where critical investments were needed. We quickly learned that studying the design of an existing system is a valuable investment in learning how to build the next version.

    But, before writing a single line of code, we needed to look at the data. Reviewing a uniform sample of web sites associated with businesses, we discovered the following:

    Most business web sites are small, with ten pages or fewer.

    Most sites used static web content – that is to say, all the information is present in the HTML data rather than being dynamically fetched and rendered at the moment the visitor arrives at their site.

    Sites often have a page with contact information on it, though it is common for this information to be present in some form on many pages, and occasionally there is no single page which includes all desired information.

    Many businesses have a related page on social platforms (Facebook, Instagram, Twitter), and a minority have only a social presence.

    These valuable insights, which took a day to derive, allowed us to make quick, broad decisions regarding our initial implementation. In hindsight, we recognized some important oversights, such as a distinction between (large) chain businesses and the smaller singleton businesses. From a search perspective, chain data is of higher value. While chains represent a minority of actual businesses, they are perhaps the most important data segment because users tend to search for chain businesses the most. Chains tend to have more sophisticated sites, often requiring more sophisticated extraction technology. Extracting an address from plain HTML is far easier than extracting a set of entities dynamically placed on a map as the result of a zip code search.

    Every Task Starts with Data

    Developers can gain insights into the fundamentals of a domain by looking at a small sample of data (fewer than 100 records). If there is some aspect of the data that dominates the space, it is generally easy to identify. Best practices for reviewing data include randomizing your data (this helps to remove bias from your observations) and viewing the data in as native a form as possible (ideally seeing data in a form equivalent to how the machinery will view it; viewing should not transform it). To the extent possible, ensure that you are looking at data from production⁵ (this is the only way to ensure that you have the chance to see issues in your end-to-end pipeline).
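    A tiny sketch of pulling such a review sample – assuming a hypothetical production export of (business_id, url) pairs in a CSV file, not any actual file from the project:

```python
import csv
import random

SAMPLE_SIZE = 100  # a small sample is usually enough to surface dominant patterns

# Assumed input: a production export of (business_id, url) pairs.
with open("production_business_sites.csv", newline="") as f:
    rows = list(csv.DictReader(f))

random.seed(7)  # fixed seed so the review set is reproducible and shareable
sample = random.sample(rows, min(SAMPLE_SIZE, len(rows)))

# Write the URLs out so reviewers can open each site in as native a form as possible.
with open("review_sample.txt", "w") as out:
    for row in sample:
        out.write(f"{row['business_id']}\t{row['url']}\n")
```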

    With our early, broad understanding of the data, we rapidly began building out an initial system. We took a pragmatic approach to architecture and infrastructure. We used existing infrastructure that was built for a large number of different information processing pipelines, and we adopted a simple sequential pipeline architecture that allowed us to build a small number of stages connected by a simple data schema. Specifically, we used Bing’s cloud computation platform which is designed to run scripted processes that follow the MapReduce pattern on large quantities of data. We made no assumptions that the problem could best be delivered with this generic architecture, or that the infrastructure and the paradigms of computation that it supported were perfectly adapted to the problem space. The only requirement was that it was available and capable of running some processes at reasonable scale and would allow developers to iterate rapidly for the initial phase of the project.
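    The shape of that pipeline can be illustrated with a short sketch – hypothetical stage names and a toy record schema, not Bing's actual platform code:

```python
from typing import Callable, Iterable, List

# Each stage consumes and produces dicts that follow one simple shared schema.
Stage = Callable[[Iterable[dict]], Iterable[dict]]

def run_pipeline(records: Iterable[dict], stages: List[Stage]) -> List[dict]:
    """Run records through a small number of sequential stages."""
    for stage in stages:
        records = stage(records)
    return list(records)

# Hypothetical stages mirroring the flow described in the text; in reality each
# would wrap a real component (web corpus lookup, extractors, entification).
def fetch_pages(records: Iterable[dict]) -> Iterable[dict]:
    for r in records:
        yield {**r, "html": "<html>...</html>"}   # stand-in for corpus content

def extract_attributes(records: Iterable[dict]) -> Iterable[dict]:
    for r in records:
        yield {**r, "name": None, "address": None, "phone": None}

def entify(records: Iterable[dict]) -> Iterable[dict]:
    for r in records:
        yield {k: r.get(k) for k in ("url", "name", "address", "phone")}

result = run_pipeline([{"url": "http://example.com"}],
                      [fetch_pages, extract_attributes, entify])
```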

    Bias to Action

    In general, taking some action (reviewing data, building a prototype, running an experiment) will always produce some information that is useful for the team to make progress. This contrasts with debating options, being intuitive about data, assuming that something is obvious about a data set, and so on. This principle, however, cannot be applied recklessly – the action itself must be well defined with a clear termination point and ideally a statement of how the product will be used to move things forward.

    If we think about the components needed for web mining local business sites to extract high-quality records representing the name, address, and phone number of these entities, we would need the following:

    A means of discovering the web sites

    A crawler to crawl these sites

    Extractors to locate the names, addresses, and phone numbers on these sites

    Logic to assemble these extracted elements into a record, or records for the site

    A pipeline implemented on some production infrastructure that can execute these components and integrate the results

    Each of these five elements would require design iterations, testing, and so on. In taking an agile discovery approach to building the initial system, we instead addressed the preceding five elements with these five solutions:

    Used an existing corpus of business records with web sites to come up with an initial list of sites to extract from – no need to build a discovery system yet

    Removed the need to crawl data by processing data only found in the production web corpus of the web search engine already running on the infrastructure we were to adopt for the first phase of work

    Used an existing address extractor and built a simple phone number extractor and name extractor

    Implemented a naïve approach to assemble extracted elements into a record which we called entification

    Deployed the annotators and entification components using an existing script-based execution engine, available to the larger Bing engineering team, that had access to the web corpus and was designed to scale over its data

    We discovered as part of this work that there were no existing off-the-shelf approaches to extracting the business name from web sites. This, then, became an initial focus for innovation. Later in the project, there were opportunities to improve other parts of the system, but the most important initial investment to make was name extraction.

    By quickly focusing on critical areas of innovation and leveraging existing systems and naïve approaches elsewhere, we delivered the first version of the data in a very short time frame. This provided the team with a data set that offered a deeper understanding of the viability of the project. This data set could be read to understand the gap between our naïve architecture and commodity extractors and what would be required to meet the requirements of the overall project. Early on, we were able to think more deeply about the relationship between the type of data found in the wild and the specializations required in architecture and infrastructure to ensure a quality production system.

    Going End to End with Off-the-Shelf Components

    Look for existing pieces that can approximate a system so that you can assess the viability of the project, gain insight into the architecture required, recognize components that will need prioritized development, and discover where additional evaluation might be required for the inner loop. In large organizations, many of the components you need to assemble are likely already available. There are also many open source resources that can be used for both infrastructure and inference components.

    Getting up and running quickly is something of a recursive strategy. The team makes high-level decisions about infrastructure choices (you want to avoid any long-term commitments to expensive infrastructure prior to validating the project) and architecture (the architecture will be iterated on and informed by the data and inference ecosystem). This allows the team to discover where innovation is required, at which point the process recurses.

    To build the name extractor, we enlisted a distance learning approach (often called distant supervision). We had available to us a corpus of data with local entity records associated with web site URLs. To train a model to extract names from web sites, we used this data to automatically label web pages. Our approach was to

    1. Randomly create a training set from the entire population of business-site pairs.

    2. Crawl the first tier of pages for each URL.

    3. Generate n-grams from the pages using a reasonable heuristic (e.g., any strings of 1–5 tokens found within an HTML element).

    4. Label each n-gram as positive if it matches the existing name of the business from the record and negative otherwise – some flexibility was required in this match.

    5. Train a classifier to accept the positive cases and reject the negative cases.

    If we think about the general approach to creating a classifier, this method allows for the rapid creation of training data while avoiding the costs of manual labeling: creating and deploying tools, sourcing judges, creating labeling guidelines, and so on.
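    A compact sketch of this labeling-and-training loop is shown below. The helper functions and the generic scikit-learn classifier are stand-ins we chose for illustration; the production extractor and matching logic were more sophisticated, so treat this as an assumption-laden sketch rather than the authors' implementation.

```python
import re
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

def ngrams_from_html(html: str, max_len: int = 5):
    """Step 3: candidate n-grams, e.g., token spans of length 1-5 inside an HTML element."""
    for element_text in re.split(r"<[^>]+>", html):
        tokens = element_text.split()
        for i in range(len(tokens)):
            for n in range(1, max_len + 1):
                if i + n <= len(tokens):
                    yield " ".join(tokens[i:i + n])

def label_candidates(pages, business_name):
    """Step 4: positive if the candidate matches the known business name, else negative."""
    target = business_name.lower().strip()
    for html in pages:
        for cand in ngrams_from_html(html):
            yield cand, int(cand.lower().strip() == target)  # real matching was fuzzier

# Steps 1-2 are assumed done: `training_pairs` is a random sample of
# (crawled_pages, business_name) drawn from business-site records.
def train_name_classifier(training_pairs):
    texts, labels = [], []
    for pages, name in training_pairs:
        for cand, y in label_candidates(pages, name):
            texts.append(cand)
            labels.append(y)
    vec = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.transform(texts), labels)  # Step 5: accept positives, reject negatives
    return vec, clf
```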

    Important Caveat to Commodity System Assembly

    You may state at some point that the code you are writing or the system you are designing or the platform you are adopting is just a temporary measure and that once you have the lay of the land, you will redesign, or throw experimental code away, and build the real system. To throw experimental code away requires a highly disciplined team and good upward management skills. Once running, there is pressure to deliver to the customer in a way that starts building dependencies not only on the output but on the team’s capacity to deliver more. Often this is at the cost of going back and addressing the temporary components and designs that you used to get end to end quickly.

    Data Analysis for Planning

    Now that we have a system in place, it’s time to look at the initial output and get a sense of where we are. There is a lot of value in having formal, managed systems that pipe data through judges and deliver training data, or evaluation data, but that should never be done prior to (or instead of) the development team getting intimate with every aspect of the data (both input and output). To do so would miss some of the best opportunities for the team to become familiar with the nuances of the domain.

    There are two ways in which a data product can be viewed. The first looks at the precision of the data and answers the question “How good are the records coming out of the system?” This is a matter of sampling the output, manually comparing it to the input, and determining what the output should have been – did the machine get it right? The second approach focuses on the recall of the system – for all the input where we should have extracted something, how often did we do so and get it right?
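    As a small illustration (with an assumed per-input judgment format of our own devising) of how these two views become numbers:

```python
from typing import Iterable, Tuple

def precision_recall(judged: Iterable[dict]) -> Tuple[float, float]:
    """`judged` is an assumed list of per-input judgments with keys:
       extracted - did the system emit a record for this input?
       correct   - if emitted, was the record right?
       expected  - should something have been extracted for this input?"""
    judged = list(judged)
    emitted = [j for j in judged if j["extracted"]]
    expected = [j for j in judged if j["expected"]]
    precision = sum(j["correct"] for j in emitted) / len(emitted) if emitted else 0.0
    recall = (sum(j["extracted"] and j["correct"] for j in expected) / len(expected)
              if expected else 0.0)
    return precision, recall
```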

    The insight that we gained from reviewing the initial output validated the project and provided a backlog of work that led to our decision to expand the team significantly:

    Overall precision was promising but not at the level required.

    Name precision was lower than expected, and we needed to determine if the approach was right (and more features were needed) or if the approach was fundamentally flawed.

    The commodity address extractor was overly biased to precision, and so we missed far too many businesses because we failed to find the address.

    Our understanding of the domain was naïve, and through the exposure to the data, we now had a far deeper appreciation of the complexity of the world we needed to model. In particular, we started to build a new schema for the domain that included the notion of singleton businesses, business groups, and chains, as well as simple and complex businesses and sites. These concepts had a strong impact on our architecture and crawling requirements.

    The existing web index we leveraged to get going quickly was not sufficient for our scenario – it lacked coverage, but also failed to accurately capture the view of certain types of pages as a visitor to the site would experience them.

    Now that we had a roughed-in system, we could use it to run some additional exploratory investigations to get an understanding of the broader data landscape that the project had opened up. We ran the system on the Web at large (rather than the subset of the Web associated with our existing corpus of business data records). More about this later.

    Data Evaluation Best Practices

    It is common for data engineering teams, early on, to use ad hoc tools to view and judge data. For example, you might review the output of a component in plain text or in a spreadsheet program like Excel. It soon becomes apparent, however, that even for small data sets, efficiency can be gained by developing specialized tools that remove as much friction as possible from the workflows involved in data analysis.

    Consider reviewing the output of a process extracting data from the Web. If you viewed this output in Excel, you would have to copy the URL for the page and paste it into a browser, and then you would be able to compare the extraction with the original page. The act of copying and pasting can easily be the highest cost in the process. When the activities required to get data in front of you are more expensive than viewing the data itself, the team should consider building (or acquiring) a tool to remove this inefficiency. Teams are distinguished by their acknowledgement of the central importance of high-quality data productivity tools.

    Another consideration is the activity of the evaluation process itself. Generally, there are judgment-based processes (a judge – that is to say, a human – will look at the output and make a decision on its correctness based on some documented guidelines) and data-based processes (a data set – often called a ground truth set – is constructed and can be used to fully automate the evaluation of a process output). Ground truth-based processes can be incredibly performant, allowing the dev team to evaluate continuously at essentially no incremental cost.

    Both tool development and ground truth data development involve real costs. It is a false economy to avoid these investments and attempt to get by with poorly developed tools and iteration loops that require manual evaluation.

    Establishing Value

    With the basics of the system in place, we next determined how to ship the data and measure its value, or impact. The data being extracted from the Web was intended for use in our larger local search system. This system in part created a corpus of local entities by ingesting around 100 feeds of data and merging the results through conflation/merge (or record linkage, as it is often termed). The conflation/merge system had its own machine learning models for determining which data to select at conflation/merge time, and so we wanted to do everything we could to ensure our web-extracted data got selected by this downstream system. We got an idea of the impact of our system by measuring both the quality of the data and how often our data was selected by the downstream system. The more attributes from our web-mined data stream were selected and surfaced to the user, the more impact it had. By establishing this downstream notion of impact, we could chart a path to continuous delivery of increasing impact – the idea being to have the web-mined data account for more and more of the selected content shown to users of the local search feature in Bing.

    We considered several factors when establishing a quality bar for the data we were producing. First was the quality of existing data. Surprisingly, taking a random sample of business records from many of the broad-coverage feeds (i.e., those that cover many types of businesses vs. a vertical feed that covers specific types of businesses like restaurants) showed that the data from the broad-coverage feeds was generally substandard. We could have made this low bar our requirement for shipping. However, vertical and other specialized feeds tended to be of higher quality – around 93% precision per attribute. It made sense, then, to use these boutique vertical data sources as our benchmark for quality.

    In addition, we considered the quality of the data in terms of how well that quality can be estimated by measurement. Many factors determine the accuracy of a measurement, with the most important being sample design, sample size, and human error. These relate to the expense of the judgment (the bigger the sample, the more it costs to judge; the lower the desired error rate, the more judges one generally requires per Human Intelligence Task, or HIT⁶). Taking all this into account, we determined that a per attribute precision of 98% was the limit of practically measurable quality. In other words, we aimed to ship our data when it was of similar quality to the boutique feeds and set a general target of 98% precision for the name, address, and phone number fields in our feed.
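    To see why precision near 98% is already expensive to verify, here is a back-of-the-envelope sketch using the standard normal-approximation sample-size formula (our illustration, not the authors' calculation):

```python
import math

def required_sample_size(p: float, margin: float, z: float = 1.96) -> int:
    """Normal-approximation sample size for estimating a proportion p
    within +/- margin at ~95% confidence (z = 1.96)."""
    return math.ceil((z ** 2) * p * (1 - p) / margin ** 2)

# Estimating 98% per-attribute precision within +/-1% takes ~753 judged records;
# halving the margin roughly quadruples the judging cost, which illustrates why
# very high precision targets quickly become impractical to measure.
print(required_sample_size(0.98, 0.01))    # 753
print(required_sample_size(0.98, 0.005))   # 3012
```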

    Because we had established a simple measurement of impact – the percentage of attributes selected from this feed by the downstream system – our baseline impact was 0%. We therefore set about analyzing the data to determine if there was a simple way to meet the precision goal with no immediate pressure on the volume of data that we would produce. The general strategy was to start small with an intentionally low-coverage but high-quality feed. From there, we could work on delivering incremental value in two ways. The first involved extending the coverage through an iteration of gap analysis (identifying where we failed to deliver data and making the required improvements to ensure that we would succeed in those specific extraction scenarios). The second involved identifying additional properties of local businesses that we could deliver. Local business sites, depending on the type of business, have many interesting properties that benefit local search users, such as business hours, amenities, menus, social media identities, services offered, and so on.

    Discovering the Value Equilibrium

    By establishing the evaluation process, the customer helped frame the notion of value. In many cases, declaring a target – say the data has to be 80% precise with 100% coverage – is acceptable, but a worthy customer would require more. In many projects, the data being delivered will be consumed by another part of a larger system and further inference will be done on it. This downstream process may alter the perceived quality of the data you are delivering and have implications for the goals being set. There is a tension between the lazy customer – for whom the simplest thing to do is to ask for perfect data – and the lazy developer, for whom the simplest thing to do is deliver baseline data. Setting a target is easy, but setting the right target requires an investment. The most important goal to determine is the form and quality of the output that will result in a positive impact downstream. This is the exact point at which the system delivers value. While managing this tension is a key part of the discussion between the customer and the team, it is even better to have the downstream customer engage in consuming the data before any initial goal is set, to better understand and continuously improve the true requirements.

    Of course, determining the exact point at which value is achieved is an ideal. If the downstream consumer is unclear on how to achieve their end goal, then determining the point-of-value for your product would require them to complete their system. You have something of a data science chicken-and-egg problem. In some cases, it is pragmatic to agree on some approximation of value and develop to that baseline.
