Modern Data Mining Algorithms in C++ and CUDA C: Recent Developments in Feature Extraction and Selection Algorithms for Data Science

About this ebook

Discover a variety of data-mining algorithms that are useful for selecting small sets of important features from among unwieldy masses of candidates, or extracting useful features from measured variables.

As a serious data miner you will often be faced with thousands of candidate features for your prediction or classification application, with most of the features being of little or no value. You’ll know that many of these features may be useful only in combination with certain other features while being practically worthless alone or in combination with most others. Some features may have enormous predictive power, but only within a small, specialized area of the feature space. The problems that plague modern data miners are endless. This book helps you solve these problems by presenting modern feature selection techniques and the code to implement them. Some of these techniques are:

  • Forward selection component analysis
  • Local feature selection
  • Linking features and a target with a hidden Markov model
  • Improvements on traditional stepwise selection
  • Nominal-to-ordinal conversion

All algorithms are intuitively justified and supported by the relevant equations and explanatory material. The author also presents and explains complete, highly commented source code. 

The example code is in C++ and CUDA C, but Python or another language can be substituted; what matters is the algorithm, not the language used to implement it.

What You Will Learn

  • Combine principal component analysis with forward and backward stepwise selection to identify a compact subset of a large collection of variables that captures the maximum possible variation within the entire set.
  • Identify features that may have predictive power over only a small subset of the feature domain. Such features can be profitably used by modern predictive models but may be missed by other feature selection methods.
  • Find an underlying hidden Markov model that controls the distributions of feature variables and the target simultaneously. The memory inherent in this method is especially valuable in high-noise applications such as prediction of financial markets.
  • Improve traditional stepwise selection in three ways: examine a collection of 'best-so-far' feature sets; test candidate features for inclusion with cross validation to automatically and effectively limit model complexity; and at each step estimate the probability that our results so far could be just the product of random good luck. We also estimate the probability that the improvement obtained by adding a new variable could have been just good luck.
  • Take a potentially valuable nominal variable (a category or class membership) that is unsuitable for input to a prediction model, and assign to each category a sensible numeric value that can be used as a model input.

 

Who This Book Is For 

Intermediate to advanced data science programmers and analysts.

Language: English
Publisher: Apress
Release date: Jun 5, 2020
ISBN: 9781484259887

    Book preview

    Modern Data Mining Algorithms in C++ and CUDA C - Timothy Masters

    © Timothy Masters 2020

    T. Masters, Modern Data Mining Algorithms in C++ and CUDA C, https://doi.org/10.1007/978-1-4842-5988-7_1

    1. Introduction

    Timothy Masters, Ithaca, NY, USA

    Serious data miners are often faced with thousands of candidate features for their prediction or classification application, with most of the features being of little or no value. Worse still, many of these features may be useful only in combination with certain other features while being practically worthless alone or in combination with most others. Some features may have enormous predictive power, but only within a small, specialized area of the feature space. The problems that plague modern data miners are endless.

    My own work over the last 20+ years in the financial market domain has involved examination of decades of price data from hundreds or thousands of markets, searching for exploitable patterns of price movement. I could not have done it without having a large and up-to-date collection of data mining tools.

    Over my career, I have accumulated many such tools, and I still keep up to date with scientific journals, exploring the latest algorithms as they appear. In my recent book Data Mining Algorithms in C++, published by the Apress division of Springer, I presented many of my favorite algorithms. I now continue this presentation, divulging details and source code of key modules in my VarScreen variable screening program that has been made available for free download and frequently updated over the last few years. The topics included in this text are as follows:

    Hidden Markov models are chosen and optimized according to their multivariate correlation with a target. The idea is that observed variables are used to deduce the current state of a hidden Markov model, and then this state information is used to estimate the value of an unobservable target variable. This use of memory in a time series discourages whipsawing of decisions and enhances information usage.

    Forward Selection Component Analysis uses forward and optional backward refinement of maximum-variance-capture components from a subset of a large group of variables. This hybrid combination of principal components analysis with stepwise selection lets us whittle down enormous feature sets, retaining only those variables that are most important.

    Local Feature Selection identifies predictors that are optimal in localized areas of the feature space but may not be globally optimal. Such predictors can be effectively used by nonlinear models but are neglected by many other feature selection algorithms that require global predictive power. Thus, this algorithm can detect vital features that are missed by other feature selection algorithms.

    Stepwise selection of predictive features is enhanced in three important ways. First, instead of keeping a single optimal subset of candidates at each step, this algorithm keeps a large collection of high-quality subsets and performs a more exhaustive search of combinations of predictors that have joint but not individual power. Second, cross-validation is used to select features, rather than the traditional in-sample performance. This provides an excellent means of complexity control, resulting in greatly improved out-of-sample performance. Third, a Monte-Carlo permutation test is applied at each addition step, assessing the probability that a good-looking feature set may not be good at all, but rather just lucky in its attainment of a lofty performance criterion.
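
    To make the permutation idea concrete, here is a minimal sketch of such a test. It is not the VarScreen implementation; the criterion() routine and the data layout are placeholders standing in for whatever performance measure the selection step uses. The target is shuffled many times, destroying any genuine relationship with the features, and the p-value is the fraction of shuffled trials whose criterion equals or exceeds the original.

    #include <algorithm>
    #include <random>
    #include <vector>

    // Placeholder: any performance criterion computed from the candidate features
    // and the target (for example, the score of the current best-so-far subset).
    double criterion(const std::vector<std::vector<double>> &features,
                     const std::vector<double> &target);

    // Monte-Carlo permutation test: estimate the probability that a criterion as
    // good as original_crit could be the product of random good luck alone.
    double permutation_pvalue(const std::vector<std::vector<double>> &features,
                              std::vector<double> target,   // local copy; shuffled in place
                              double original_crit,
                              int n_reps = 1000)
    {
        std::mt19937 rng(12345);
        int count = 1;                        // the unpermuted trial counts as one
        for (int rep = 0; rep < n_reps; rep++) {
            std::shuffle(target.begin(), target.end(), rng);   // destroy any true relationship
            if (criterion(features, target) >= original_crit)
                ++count;
        }
        return static_cast<double>(count) / (n_reps + 1);
    }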

    Nominal-to-ordinal conversion lets us take a potentially valuable nominal variable (a category or class membership) that is unsuitable for input to a prediction model and assign to each category a sensible numeric value that can be used as a model input.
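
    As a simple illustration of the kind of mapping involved, one common encoding assigns to each category the mean of the target over the cases belonging to that category. The routine below sketches that idea; it is an assumption for illustration only and is not necessarily the procedure developed in the nominal-to-ordinal chapter.

    #include <cstddef>
    #include <map>
    #include <string>
    #include <vector>

    // Illustrative nominal-to-ordinal conversion: map each category label to the
    // mean target value observed for that category.
    std::map<std::string, double>
    category_values(const std::vector<std::string> &category,   // one label per case
                    const std::vector<double> &target)          // one target per case
    {
        std::map<std::string, double> sum;
        std::map<std::string, int> count;
        for (std::size_t i = 0; i < category.size(); i++) {
            sum[category[i]] += target[i];
            ++count[category[i]];
        }
        std::map<std::string, double> value;
        for (const auto &kv : sum)
            value[kv.first] = kv.second / count[kv.first];      // category mean of the target
        return value;
    }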

    As is usual in my books, all of these algorithm presentations begin with an intuitive overview, continue with essential mathematics, and culminate in complete, heavily commented source code. In most cases, one or more sample applications are shown to illustrate the technique.

    The VarScreen program, available as a free download from my website TimothyMasters.info, implements all of these algorithms and many more. This program can serve as an effective variable screening program for data mining applications.

    © Timothy Masters 2020

    T. Masters, Modern Data Mining Algorithms in C++ and CUDA C, https://doi.org/10.1007/978-1-4842-5988-7_2

    2. Forward Selection Component Analysis

    Timothy Masters, Ithaca, NY, USA

    The algorithms presented in this chapter are greatly inspired by the paper Forward Selection Component Analysis: Algorithms and Applications by Luca Puggini and Sean McLoone, published in IEEE Transactions on Pattern Analysis and Machine Intelligence, December 2017, and widely available for free download on various Internet sites (just search). However, I have made several small modifications that I believe make it somewhat more practical for real-life applications. The code for a subroutine that performs all of these algorithms is in the file FSCA.CPP.

    Introduction to Forward Selection Component Analysis

    The technique of principal components analysis has been used for centuries (or so it seems!) to distill the information (variance) contained in a large number of variables down into a smaller, more manageable set of new variables called components or principal components. Sometimes the researcher is interested only in the nature of the linear combinations of the original variables that provide new component variables having the property of capturing the maximum possible amount of the total variance inherent in the original set of variables. In other words, principal components analysis can sometimes be viewed as an application of descriptive statistics. Other times the researcher wants to go one step further, computing the principal components and employing them as predictors in a modeling application.

    However, with the advent of extremely large datasets, several shortcomings of traditional principal components analysis have become seriously problematic. The root cause of these problems is that traditional principal components analysis computes the new variables as linear combinations of all of the original variables. If you have been presented with thousands of variables, there can be issues with using all of them.

    One possible issue is the cost of obtaining all of these variables going forward. Maybe the research budget allowed for collecting a huge dataset for initial study, but the division manager would look askance at such a massive endeavor on an ongoing basis. It would be a lot better if, after an initial analysis, you could request updated samples from only a much smaller subset of the original variable set.

    Another issue is interpretation. Being able to come up with descriptive names for the new variables (even if the name is a paragraph long!) is something we often want to do (or must do to satisfy critics). It’s hard enough putting a name to a linear combination of a dozen or two variables; try understanding and explaining the nature of a linear combination of two thousand variables! So if you could identify a much smaller subset of the original set, such that this subset encapsulates the majority of the independent variation inherent in the original set, and then compute the new component variables from this smaller set, you are in a far better position to understand and name the components and explain to the rest of the world what these new variables represent.

    Yet another issue with traditional principal components when applied to an enormous dataset is the all too common situation of groups of variables having large mutual correlation. For example, in the analysis of financial markets for automated trading systems, we may measure many families of market behavior: trends, departures from trends, volatility, and so forth. We may have many hundreds of such indicators, and among them we may have several dozen different measures of volatility, all of which are highly correlated. When we apply traditional principal components analysis to such correlated groups, an unfortunate effect of the correlation is to cause the weights within each correlated set to be evenly dispersed among the correlated variables in the set. So, for example, suppose we have a set of 30 measures of volatility that are highly correlated. Even if volatility is an important source of variation (potentially useful information) across the dataset (market history), the computed weights for each of these variables will be small, each measure garnering a small amount of the total importance represented by the group as a whole. As a result, we may examine the weights, see nothing but tiny weights for the volatility measures, and erroneously conclude that volatility does not carry much importance. When there are many such groups, and especially if they do not fall into obvious families, intelligent interpretation becomes nearly impossible.

    The algorithms presented here go a long way toward solving all of these problems. They work by first finding the single variable that does the best job of explaining the total variability (considering all original variables) observed in the dataset. Roughly speaking, we say that a variable does a good job of explaining the total variability if knowledge of the value of that variable tells us a lot about the values of all of the other variables in the original dataset. So the best variable is the one that lets us predict the values of all other variables with maximum accuracy.

    Once we have the best single variable, we consider the remaining variables and find the one that, in conjunction with the one we already have, does the best job of predicting all other variables. Then we find a third, and a fourth, and so on, choosing each newly selected variable so as to maximize the explained variance when the selected variable is used in conjunction with all previously selected variables. Application of this simple algorithm gives us an ordered set of variables selected from the huge original set, beginning with the most important and henceforth with decreasing but always optimal importance (conditional on prior selections).
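
    A bare-bones sketch of this greedy loop follows. The explained_variance() routine is a placeholder for the scoring machinery developed later in this chapter (Equations 2.3 and 2.5); the names and data layout here are illustrative rather than those used in FSCA.CPP.

    #include <algorithm>
    #include <vector>

    // Placeholder: fraction of the total variance of X explained by the subset of
    // columns listed in selected (computed as in Equation 2.3, or ranked via 2.5).
    double explained_variance(const std::vector<std::vector<double>> &X,
                              const std::vector<int> &selected);

    // Strict forward selection: at each step add the variable that maximizes the
    // explained variance, conditional on the variables already selected.
    std::vector<int> forward_select(const std::vector<std::vector<double>> &X,  // X[case][variable]
                                    int n_keep)
    {
        int nvars = static_cast<int>(X[0].size());
        std::vector<int> selected;
        while (static_cast<int>(selected.size()) < n_keep) {
            int best_var = -1;
            double best_crit = -1.0;
            for (int i = 0; i < nvars; i++) {
                if (std::find(selected.begin(), selected.end(), i) != selected.end())
                    continue;                       // already in the subset
                selected.push_back(i);              // trial inclusion of variable i
                double crit = explained_variance(X, selected);
                selected.pop_back();
                if (crit > best_crit) {
                    best_crit = crit;
                    best_var = i;
                }
            }
            selected.push_back(best_var);           // commit the best candidate
        }
        return selected;
    }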

    It is well known that a greedy algorithm such as the strictly forward selection just described can produce a suboptimal final set of variables. It is always optimal in a certain sense, but only in the sense of being conditional on prior selections. It can (and often does) happen that when some new variable is selected, a previously selected variable suddenly loses a good deal of its importance, yet nevertheless remains in the selected set.

    For this reason, the algorithms here optionally allow for continual refinement of the set of selected variables by regularly testing previously selected variables to see if they should be removed and replaced with some other candidate. Unfortunately, we lose the ordering-of-importance property that we have with strict forward selection, but we gain a more optimal final subset of variables. Of course, even with backward refinement we can still end up with a set of variables that is inferior to what could be obtained by testing every possible subset. However, the combinatoric explosion that results from anything but a very small universe of variables makes exhaustive testing impossible. So in practice, backward refinement is pretty much the best we can do. And in most cases, its final set of variables is very good.
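
    Schematically, the refinement pass can be pictured as follows: revisit each selected variable in turn, try replacing it with every excluded candidate, keep any swap that raises the explained variance, and repeat until a full pass makes no change. The sketch below uses the same placeholder scoring routine as the forward-selection sketch above and is only an outline of the idea.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    double explained_variance(const std::vector<std::vector<double>> &X,
                              const std::vector<int> &selected);   // placeholder, as before

    // Backward refinement: swap out any selected variable whose replacement by an
    // excluded candidate improves the criterion; repeat until nothing changes.
    void refine(const std::vector<std::vector<double>> &X, std::vector<int> &selected)
    {
        int nvars = static_cast<int>(X[0].size());
        bool improved = true;
        while (improved) {
            improved = false;
            for (std::size_t pos = 0; pos < selected.size(); pos++) {
                double best_crit = explained_variance(X, selected);
                int best_var = selected[pos];
                for (int cand = 0; cand < nvars; cand++) {
                    if (std::find(selected.begin(), selected.end(), cand) != selected.end())
                        continue;                   // candidate already in the subset
                    selected[pos] = cand;           // trial swap
                    double crit = explained_variance(X, selected);
                    if (crit > best_crit) {
                        best_crit = crit;
                        best_var = cand;
                        improved = true;
                    }
                }
                selected[pos] = best_var;           // keep the best variable found for this slot
            }
        }
    }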

    There is an interesting relationship between the choice of whether to refine and the number of variables selected, which we may want to specify in advance. This is one of those issues that may seem obvious but that can still cause some confusion. With strict forward selection, the best variables, those selected early in the process, remain the same, regardless of the number of variables the user decides to keep. If you decree in advance that you want a final set to contain 5 variables but later decide that you want 25, the first 5 in that set of 25 will be the same 5 that you originally obtained. There is a certain intuitive comfort in this property. But if you had specified that backward refinement accompany forward selection, the first 5 variables in your final set of 25 might be completely different from the first 5 that you originally obtained. This is a concrete example of why, with backward refinement, you should pay little attention to the order in which variables appear in your final set.

    Regardless of which choice we make, if we compute values for the new component variables as linear combinations of the selected variables, it’s nice if we can do so in such a way that they have two desirable properties:

    1) They are standardized, meaning that they have a mean of zero and a standard deviation of one.

    2) They are uncorrelated with one another, a property that enhances stability for most modeling algorithms.

    If we enforce the second property, orthogonality of the components, then our algorithm, which seeks to continually increase explained variance by its selection of variables, has a nice attribute: the number of new variables, the components, equals the number of selected variables. The requirement that the components be orthogonal means that we cannot have more components than we have original variables, and the "maximize explained variance" selection rule means that as long as there are still variables in the original set that are not collinear, we can compute a new component every time we select another variable. Thus, all mathematical and program development will compute as many components as there are selected variables. You don’t have to use them all, but they are there for you if you wish to use them.
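
    One conventional way to obtain components having both properties is a Gram-Schmidt pass over the selected columns, standardizing each residual as it is produced. The book develops its own computation of S in a later section, so the routine below is only an illustration of the two properties, not the author's algorithm; it assumes the columns of Z have already been standardized.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Build orthogonal components with zero mean and unit standard deviation from
    // the selected, standardized columns Z[case][variable] by Gram-Schmidt.
    std::vector<std::vector<double>>
    make_components(const std::vector<std::vector<double>> &Z)
    {
        std::size_t m = Z.size();            // cases
        std::size_t k = Z[0].size();         // selected variables
        std::vector<std::vector<double>> S(m, std::vector<double>(k, 0.0));
        for (std::size_t j = 0; j < k; j++) {
            std::vector<double> col(m);
            for (std::size_t r = 0; r < m; r++)
                col[r] = Z[r][j];
            for (std::size_t p = 0; p < j; p++) {    // remove projections on earlier components
                double dot = 0.0;
                for (std::size_t r = 0; r < m; r++)
                    dot += col[r] * S[r][p];
                dot /= m;                            // each component satisfies s's = m
                for (std::size_t r = 0; r < m; r++)
                    col[r] -= dot * S[r][p];
            }
            double ss = 0.0;                         // standardize the residual
            for (std::size_t r = 0; r < m; r++)
                ss += col[r] * col[r];
            double sd = std::sqrt(ss / m);           // a collinear column would make this ~0
            for (std::size_t r = 0; r < m; r++)
                S[r][j] = col[r] / sd;
        }
        return S;
    }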

    The Mathematics and Code Examples

    In the development that follows, we will use the notation shown in the following text, which largely but not entirely duplicates the notation employed in the paper cited at the beginning of this chapter. Because the primary purpose of this book is to present practical implementation information, readers who desire detailed mathematical derivations and proofs should see the cited paper. The following variables will appear most often:

    X – The m by v matrix of original data; each row is a case and each column is a variable. We will assume that the variables have been standardized to have mean zero and standard deviation one. The paper makes only the zero-mean assumption, but also assuming unit standard deviation simplifies many subsequent calculations and is no limitation in practice.

    x_i – Column i of X, a column vector m long.

    m – The number of cases; the number of rows in X.

    v – The number of variables; the number of columns in X.

    k – The number of variables selected from X, with k ≤ v.

    Z – The m by k matrix of the columns (variables) selected from X.

    z_i – Column i of Z, a column vector m long.

    S – The m by k matrix of new component variables that we will compute as a linear transformation of the selected variables Z. The columns of S are computed in such a way that they are orthogonal with zero mean and unit standard deviation.

    s_i – Column i of S, a column vector m long.
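
    The standardization assumed for the columns of X is nothing more exotic than subtracting each column's mean and dividing by its standard deviation. A minimal helper (not taken from FSCA.CPP, and using the population form with divisor m) might look like this:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Standardize each column of X[case][variable] in place: zero mean and unit
    // standard deviation, using divisor m; assumes no column is constant.
    void standardize_columns(std::vector<std::vector<double>> &X)
    {
        std::size_t m = X.size();            // cases
        std::size_t v = X[0].size();         // variables
        for (std::size_t j = 0; j < v; j++) {
            double mean = 0.0;
            for (std::size_t r = 0; r < m; r++)
                mean += X[r][j];
            mean /= m;
            double var = 0.0;
            for (std::size_t r = 0; r < m; r++)
                var += (X[r][j] - mean) * (X[r][j] - mean);
            double sd = std::sqrt(var / m);
            for (std::size_t r = 0; r < m; r++)
                X[r][j] = (X[r][j] - mean) / sd;
        }
    }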

    The fundamental idea behind any principal components algorithm is that we want to be able to optimally approximate the original data matrix X, which has many columns (variables), by using the values in a components matrix S that has fewer columns than X. This approximation is a simple linear transformation, as shown in Equation (2.1), in which Θ is a k by v transformation matrix.

    $$ X \approx S \, \Theta \tag{2.1} $$

    In practice, we rarely need to compute Θ for presentation to the user; we only need to know that, by the nature of our algorithm, such a matrix exists and plays a role in internal calculations.

    We can express the error in our approximation as the sum of squared differences between the original X matrix and our approximation given by Equation (2.1). In Equation (2.2), the double-bar operator just means to square each individual term and sum the squares. An equivalent way to judge the quality of the approximation is to compute the fraction of the total variance in X that is explained by knowing S. This is shown in Equation (2.3).

    $$ \mathrm{Error} = \lVert X - S \, \Theta \rVert^{2} \tag{2.2} $$

    $$ \mathrm{ExplainedVariance} = 1 - \frac{\lVert X - S \, \Theta \rVert^{2}}{\lVert X \rVert^{2}} \tag{2.3} $$

    With these things in mind, we see that we need to intelligently select a set of k columns from X to give us Z, from which we can compute S, the orthogonal and standardized component matrix. Assume that we have a way to compute, for any given S, the minimum possible value of Equation (2.2) or, equivalently, the maximum possible value of Equation (2.3), the explained variance. Then we are home free: for any specified subset of the columns of X, we can compute the degree to which the entire X can be approximated by knowledge of the selected subset of its columns. This gives us a way to score any trial subset. All we need to do is construct that subset in such a way that the approximation error is minimized. We shall do so by forward selection, with optional backward refinement, to continually increase the explained variance.
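
    Stated as code, scoring a trial subset by its explained variance looks roughly like the routine below: compute Θ by regressing every column of X on the components S, form the approximation SΘ, and report the fraction in Equation (2.3). Because the components constructed here are orthogonal with s's = m, the least-squares Θ reduces to S'X/m. This is only a sketch of the arithmetic, not the code in FSCA.CPP.

    #include <cstddef>
    #include <vector>

    // Fraction of the total variance of X explained by the components S
    // (Equation 2.3).  S is assumed orthogonal with s's = m, so Theta = S'X / m.
    double explained_fraction(const std::vector<std::vector<double>> &X,   // m by v
                              const std::vector<std::vector<double>> &S)   // m by k
    {
        std::size_t m = X.size(), v = X[0].size(), k = S[0].size();
        std::vector<std::vector<double>> Theta(k, std::vector<double>(v, 0.0));
        for (std::size_t c = 0; c < k; c++)          // Theta = S'X / m
            for (std::size_t j = 0; j < v; j++) {
                double dot = 0.0;
                for (std::size_t r = 0; r < m; r++)
                    dot += S[r][c] * X[r][j];
                Theta[c][j] = dot / m;
            }
        double err = 0.0, total = 0.0;
        for (std::size_t r = 0; r < m; r++)
            for (std::size_t j = 0; j < v; j++) {
                double approx = 0.0;                 // (S Theta)[r][j]
                for (std::size_t c = 0; c < k; c++)
                    approx += S[r][c] * Theta[c][j];
                double diff = X[r][j] - approx;
                err += diff * diff;                  // Equation (2.2)
                total += X[r][j] * X[r][j];
            }
        return 1.0 - err / total;                    // Equation (2.3)
    }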

    Maximizing the Explained Variance

    In this section, we explore what is perhaps the most fundamental part of the overall algorithm: choosing the best variable to add to whatever we already have in the subset that we are building. Suppose we have a collection of variables in the subset already, or we are just starting out and we have none at all yet. In either case, our immediate task is to compute a score for each variable in the universe that is not yet in the subset and choose the variable that has the highest score.

    The most obvious way to do this for any variable is to temporarily include it in the subset, compute the corresponding component variables S (a subject that will be discussed in a forthcoming section), use Equation (2.1) to compute the least-squares weights for constructing an approximation to the original variables, and finally use Equation (2.3) to compute the fraction of the original variance that is explained by this approximation. Then choose whichever variable maximizes the explained variance.

    As happens regularly throughout life, the obvious approach is not the best approach. The procedure just presented works fine and has the bonus of giving us a numerical value for the explained variance. Unfortunately, it is agonizingly slow to compute. There is a better way, still slow but a whole lot better than the definitional method.

    The method about to be shown computes a different criterion for testing each candidate variable, one with no ready interpretation. However, the authors of the paper cited at the beginning of this chapter present a relatively complex proof that the rank order of this criterion for competing variables is the same as the rank order of the explained variance for the competing variables. Thus, if we choose the variable that has the maximum value of this alternative criterion, we know that this is the same variable that would have been chosen had we maximized the explained variance.

    We need to define a new notation. We have already defined Z as the currently selected columns (variables) in the full dataset X. For the purposes of this discussion, understand that if we have not yet chosen any variables, Z will be a null matrix, a matrix with no columns. We now define Z(i) to be Z with column i of X appended. Of course, if Z already has one or more columns, it is assumed that column i does not duplicate a column of X already in Z. We are doing a trial of a new variable, one that is not already in the set. Define q_j(i) as shown in Equation (2.4), recalling that x_j is column j of X. Then our alternative criterion for trial column (variable) i is given by Equation (2.5).

    $$ q_j(i) = Z(i)^{\mathsf{T}} x_j \tag{2.4} $$

    $$ \mathrm{crit}(i) = \sum_{j=1}^{v} q_j(i)^{\mathsf{T}} \left( Z(i)^{\mathsf{T}} Z(i) \right)^{-1} q_j(i) \tag{2.5} $$

    Let’s take a short break to make sure we are clear on the dimensions of these quantities. The universe of variables matrix X has m rows (cases) and v columns (variables). So the summation indexed by j in Equation (2.5) is across the entire universe of variables. The subset matrix Z(i) also has m rows, and it has k columns, with k=1 when we are selecting the first variable, 2 for the second, and so forth. Thus, the intermediate term q_j(i) is a column vector k long. The matrix product that we are inverting is k by k, and each term in the summation is a scalar.
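
    The sketch below turns Equations (2.4) and (2.5) into code for a single candidate i. The small Gauss-Jordan inverse is adequate for an illustration of the arithmetic; the data layout and names are not those of FSCA.CPP, and production code would use more careful linear algebra.

    #include <cstddef>
    #include <vector>

    typedef std::vector<std::vector<double>> Matrix;   // Matrix[row][col]

    // Invert a small k by k positive definite matrix by Gauss-Jordan elimination
    // (fine for a sketch; production code would prefer, say, Cholesky).
    Matrix invert(Matrix A)
    {
        std::size_t k = A.size();
        Matrix inv(k, std::vector<double>(k, 0.0));
        for (std::size_t i = 0; i < k; i++)
            inv[i][i] = 1.0;
        for (std::size_t col = 0; col < k; col++) {
            double pivot = A[col][col];
            for (std::size_t j = 0; j < k; j++) {
                A[col][j] /= pivot;
                inv[col][j] /= pivot;
            }
            for (std::size_t row = 0; row < k; row++) {
                if (row == col)
                    continue;
                double factor = A[row][col];
                for (std::size_t j = 0; j < k; j++) {
                    A[row][j] -= factor * A[col][j];
                    inv[row][j] -= factor * inv[col][j];
                }
            }
        }
        return inv;
    }

    // Criterion of Equation (2.5) for candidate variable i: append column i of X
    // to the selected set to get Z(i), then sum q_j(i)' (Z(i)'Z(i))^-1 q_j(i)
    // over every column j of X, with q_j(i) = Z(i)'x_j as in Equation (2.4).
    double criterion_for_candidate(const Matrix &X,                 // m by v, standardized
                                   const std::vector<int> &selected,
                                   int i)
    {
        std::size_t m = X.size(), v = X[0].size();
        std::vector<int> cols(selected);
        cols.push_back(i);                                          // columns of Z(i)
        std::size_t k = cols.size();

        Matrix ZtZ(k, std::vector<double>(k, 0.0));                 // Z(i)'Z(i), k by k
        for (std::size_t a = 0; a < k; a++)
            for (std::size_t b = 0; b < k; b++)
                for (std::size_t r = 0; r < m; r++)
                    ZtZ[a][b] += X[r][cols[a]] * X[r][cols[b]];
        Matrix inv = invert(ZtZ);

        double crit = 0.0;
        for (std::size_t j = 0; j < v; j++) {
            std::vector<double> q(k, 0.0);                          // q_j(i) = Z(i)'x_j
            for (std::size_t a = 0; a < k; a++)
                for (std::size_t r = 0; r < m; r++)
                    q[a] += X[r][cols[a]] * X[r][j];
            for (std::size_t a = 0; a < k; a++)                     // accumulate q' inv q
                for (std::size_t b = 0; b < k; b++)
                    crit += q[a] * inv[a][b] * q[b];
        }
        return crit;
    }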

    It’s an interesting and instructive exercise to consider Equation (2.5) when k=1, our search for the first variable. I’ll leave the (relatively easy) details to the reader; just remember that the columns of X (and hence Z) have been standardized to have zero mean and unit standard deviation. It should require
