Ebook632 pages3 hours

Java Data Science Cookbook

Name: Java Data Science Cookbook
Author: Rushdi Shams
ISBN: 9781787127654

By Rushdi Shams

Rating: 0 out of 5 stars

()

Read preview

About this ebook

About This Book

This book provides modern recipes in small steps to help an apprentice cook become a master chef in data science
Use these recipes to obtain, clean, analyze, and learn from your data
Learn how to get your data science applications to production and enterprise environments effortlessly

Who This Book Is For

This book is for Java developers who are familiar with the fundamentals of data science and want to improve their skills to become a pro.

Skip carousel

LanguageEnglish

PublisherPackt Publishing

Release dateMar 28, 2017

ISBN9781787127654

Author

Rushdi Shams

Related authors

Skip carousel

Related to Java Data Science Cookbook

Related ebooks

Skip carousel

Mastering Scala Machine Learning
Ebook
Mastering Scala Machine Learning
byAlex Kozlov
Rating: 0 out of 5 stars
0 ratings
Scala Programming for Big Data Analytics: Get Started With Big Data Analytics Using Apache Spark
Ebook
Scala Programming for Big Data Analytics: Get Started With Big Data Analytics Using Apache Spark
byIrfan Elahi
Rating: 0 out of 5 stars
0 ratings
Instant Jsoup How-to
Ebook
Instant Jsoup How-to
byPete Houston
Rating: 0 out of 5 stars
0 ratings
Spring Boot Persistence Best Practices: Optimize Java Persistence Performance in Spring Boot Applications
Ebook
Spring Boot Persistence Best Practices: Optimize Java Persistence Performance in Spring Boot Applications
byAnghel Leonard
Rating: 0 out of 5 stars
0 ratings
Spring Integration Essentials
Ebook
Spring Integration Essentials
byChandan Pandey
Rating: 3 out of 5 stars
3/5
Natural Language Processing with Java and LingPipe Cookbook
Ebook
Natural Language Processing with Java and LingPipe Cookbook
byKrishna Dayanidhi
Rating: 0 out of 5 stars
0 ratings
Scala Data Analysis Cookbook
Ebook
Scala Data Analysis Cookbook
byManivannan Arun
Rating: 0 out of 5 stars
0 ratings
Lucene 4 Cookbook
Ebook
Lucene 4 Cookbook
byEdwood Ng
Rating: 0 out of 5 stars
0 ratings
Lo-Dash Essentials
Ebook
Lo-Dash Essentials
byBoduch Adam
Rating: 0 out of 5 stars
0 ratings
Mastering Java for Data Science
Ebook
Mastering Java for Data Science
byAlexey Grigorev
Rating: 5 out of 5 stars
5/5
Java for Data Science
Ebook
Java for Data Science
byJennifer L. Reese
Rating: 0 out of 5 stars
0 ratings
Distributed Computing in Java 9
Ebook
Distributed Computing in Java 9
byRaja Malleswara Rao Pattamsetti
Rating: 0 out of 5 stars
0 ratings
Scala Test-Driven Development
Ebook
Scala Test-Driven Development
byGaurav Sood
Rating: 0 out of 5 stars
0 ratings
Instant MapReduce Patterns – Hadoop Essentials How-to
Ebook
Instant MapReduce Patterns – Hadoop Essentials How-to
bySrinath Perera
Rating: 0 out of 5 stars
0 ratings
Java 9 with JShell
Ebook
Java 9 with JShell
byGastón C. Hillar
Rating: 0 out of 5 stars
0 ratings
RESTful Java Web Services Security
Ebook
RESTful Java Web Services Security
byRené Enríquez
Rating: 0 out of 5 stars
0 ratings
Apache Mahout Clustering Designs
Ebook
Apache Mahout Clustering Designs
byGupta Ashish
Rating: 0 out of 5 stars
0 ratings
Learning Modular Java Programming
Ebook
Learning Modular Java Programming
byTejaswini Mandar Jog
Rating: 0 out of 5 stars
0 ratings
Talend Open Studio Cookbook
Ebook
Talend Open Studio Cookbook
byRick Barton
Rating: 2 out of 5 stars
2/5
Getting Started with Hazelcast - Second Edition
Ebook
Getting Started with Hazelcast - Second Edition
byMat Johns
Rating: 0 out of 5 stars
0 ratings
Apache Hive Cookbook
Ebook
Apache Hive Cookbook
byShrey Mehrotra
Rating: 0 out of 5 stars
0 ratings
AWS Key Management Service and AWS CloudHSM Third Edition
Ebook
AWS Key Management Service and AWS CloudHSM Third Edition
byGerardus Blokdyk
Rating: 0 out of 5 stars
0 ratings
Building Python Real-Time Applications with Storm
Ebook
Building Python Real-Time Applications with Storm
byBhatnagar Kartik
Rating: 0 out of 5 stars
0 ratings
ASP.NET 3.5 CMS Development
Ebook
ASP.NET 3.5 CMS Development
byCurt Christianson
Rating: 0 out of 5 stars
0 ratings
Learning Continuous Integration with Jenkins
Ebook
Learning Continuous Integration with Jenkins
byNikhil Pathania
Rating: 0 out of 5 stars
0 ratings
Chaos Engineering A Clear and Concise Reference
Ebook
Chaos Engineering A Clear and Concise Reference
byGerardus Blokdyk
Rating: 0 out of 5 stars
0 ratings
AWS CloudFormation A Complete Guide - 2021 Edition
Ebook
AWS CloudFormation A Complete Guide - 2021 Edition
byGerardus Blokdyk
Rating: 0 out of 5 stars
0 ratings
Restlet in Action: Developing RESTful web APIs in Java
Ebook
Restlet in Action: Developing RESTful web APIs in Java
byThierry Templier
Rating: 0 out of 5 stars
0 ratings
Java Complete Self-Assessment Guide
Ebook
Java Complete Self-Assessment Guide
byGerardus Blokdyk
Rating: 0 out of 5 stars
0 ratings
Microservices with Azure A Complete Guide - 2019 Edition
Ebook
Microservices with Azure A Complete Guide - 2019 Edition
byGerardus Blokdyk
Rating: 0 out of 5 stars
0 ratings

Programming For You

Skip carousel

Data Science from Scratch: The #1 Data Science Guide for Everything A Data Scientist Needs to Know: Python, Linear Algebra, Statistics, Coding, Applications, Neural Networks, and Decision Trees
Ebook
Data Science from Scratch: The #1 Data Science Guide for Everything A Data Scientist Needs to Know: Python, Linear Algebra, Statistics, Coding, Applications, Neural Networks, and Decision Trees
bySteven Cooper
Rating: 4 out of 5 stars
4/5
HTML & CSS: Learn the Fundaments in 7 Days
Ebook
HTML & CSS: Learn the Fundaments in 7 Days
byMichael Knapp
Rating: 4 out of 5 stars
4/5
Python Programming for Beginners: A Comprehensive Crash Course With Practical Exercises to Quickly Learn Coding and Programming for Data Analysis and Machine Learning
Ebook
Python Programming for Beginners: A Comprehensive Crash Course With Practical Exercises to Quickly Learn Coding and Programming for Data Analysis and Machine Learning
byAnthony Adams
Rating: 4 out of 5 stars
4/5
Grokking Algorithms: An illustrated guide for programmers and other curious people
Ebook
Grokking Algorithms: An illustrated guide for programmers and other curious people
byAditya Bhargava
Rating: 4 out of 5 stars
4/5
Excel Essentials: A Step-by-Step Guide with Pictures for Absolute Beginners to Master the Basics and Start Using Excel with Confidence
Ebook
Excel Essentials: A Step-by-Step Guide with Pictures for Absolute Beginners to Master the Basics and Start Using Excel with Confidence
byNigel Tillery
Rating: 0 out of 5 stars
0 ratings
Python Projects for Beginners: A Ten-Week Bootcamp Approach to Python Programming
Ebook
Python Projects for Beginners: A Ten-Week Bootcamp Approach to Python Programming
byConnor P. Milliken
Rating: 0 out of 5 stars
0 ratings
Learn to Code. Get a Job. The Ultimate Guide to Learning and Getting Hired as a Developer.
Ebook
Learn to Code. Get a Job. The Ultimate Guide to Learning and Getting Hired as a Developer.
byGwendolyn Faraday
Rating: 5 out of 5 stars
5/5
SQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL
Ebook
SQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL
byWalter Shields
Rating: 4 out of 5 stars
4/5
Python Programming : How to Code Python Fast In Just 24 Hours With 7 Simple Steps
Ebook
Python Programming : How to Code Python Fast In Just 24 Hours With 7 Simple Steps
byJason Scotts
Rating: 4 out of 5 stars
4/5
Coding All-in-One For Dummies
Ebook
Coding All-in-One For Dummies
byNikhil Abraham
Rating: 4 out of 5 stars
4/5
Python Programming For Beginners: Learn The Basics Of Python Programming (Python Crash Course, Programming for Dummies)
Ebook
Python Programming For Beginners: Learn The Basics Of Python Programming (Python Crash Course, Programming for Dummies)
byJames Tudor
Rating: 5 out of 5 stars
5/5
Learn PowerShell in a Month of Lunches, Fourth Edition: Covers Windows, Linux, and macOS
Ebook
Learn PowerShell in a Month of Lunches, Fourth Edition: Covers Windows, Linux, and macOS
byTravis Plunk
Rating: 0 out of 5 stars
0 ratings
Java for Beginners: A Crash Course to Learn Java Programming in 1 Week
Ebook
Java for Beginners: A Crash Course to Learn Java Programming in 1 Week
byBrady Ellison
Rating: 5 out of 5 stars
5/5
Python for Beginners. A Smarter Way to Learn Python in 5 Days and Remember it Longer. With Easy Step by Step Guidance and Hands on Examples. (Python Crash Course-Programming for Beginners)
Ebook
Python for Beginners. A Smarter Way to Learn Python in 5 Days and Remember it Longer. With Easy Step by Step Guidance and Hands on Examples. (Python Crash Course-Programming for Beginners)
byArthur T. Brooks
Rating: 0 out of 5 stars
0 ratings
The Unofficial Guide to Open Broadcaster Software: OBS: The World's Most Popular Free Live-Streaming Application
Ebook
The Unofficial Guide to Open Broadcaster Software: OBS: The World's Most Popular Free Live-Streaming Application
byPaul Richards
Rating: 0 out of 5 stars
0 ratings
PYTHON: Practical Python Programming For Beginners & Experts With Hands-on Project
Ebook
PYTHON: Practical Python Programming For Beginners & Experts With Hands-on Project
byMark Chan
Rating: 5 out of 5 stars
5/5
Hacking: Ultimate Beginner's Guide for Computer Hacking in 2018 and Beyond: Hacking in 2018, #1
Ebook
Hacking: Ultimate Beginner's Guide for Computer Hacking in 2018 and Beyond: Hacking in 2018, #1
byDexter Jackson
Rating: 4 out of 5 stars
4/5
The JavaScript Workshop: Learn to develop interactive web applications with clean and maintainable JavaScript code
Ebook
The JavaScript Workshop: Learn to develop interactive web applications with clean and maintainable JavaScript code
byJoseph Labrecque
Rating: 5 out of 5 stars
5/5
SQL All-in-One For Dummies
Ebook
SQL All-in-One For Dummies
byAllen G. Taylor
Rating: 3 out of 5 stars
3/5
The Advanced Roblox Coding Book: An Unofficial Guide, Updated Edition: Learn How to Script Games, Code Objects and Settings, and Create Your Own World!
Ebook
The Advanced Roblox Coding Book: An Unofficial Guide, Updated Edition: Learn How to Script Games, Code Objects and Settings, and Create Your Own World!
byHeath Haskins
Rating: 5 out of 5 stars
5/5
The Little SAS Book: A Primer, Sixth Edition
Ebook
The Little SAS Book: A Primer, Sixth Edition
byLora D. Delwiche
Rating: 5 out of 5 stars
5/5
Excel : The Ultimate Comprehensive Step-By-Step Guide to the Basics of Excel Programming: 1
Ebook
Excel : The Ultimate Comprehensive Step-By-Step Guide to the Basics of Excel Programming: 1
byKevin Clark
Rating: 5 out of 5 stars
5/5
SQL: For Beginners: Your Guide To Easily Learn SQL Programming in 7 Days
Ebook
SQL: For Beginners: Your Guide To Easily Learn SQL Programming in 7 Days
byi Code Academy
Rating: 5 out of 5 stars
5/5
CODING FOR ABSOLUTE BEGINNERS: How to Keep Your Data Safe from Hackers by Mastering the Basic Functions of Python, Java, and C++ (2022 Guide for Newbies)
Ebook
CODING FOR ABSOLUTE BEGINNERS: How to Keep Your Data Safe from Hackers by Mastering the Basic Functions of Python, Java, and C++ (2022 Guide for Newbies)
byEric Vargas
Rating: 0 out of 5 stars
0 ratings
Python: For Beginners A Crash Course Guide To Learn Python in 1 Week
Ebook
Python: For Beginners A Crash Course Guide To Learn Python in 1 Week
byTimothy C. Needham
Rating: 4 out of 5 stars
4/5
Python Machine Learning - Third Edition: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2, 3rd Edition
Ebook
Python Machine Learning - Third Edition: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2, 3rd Edition
bySebastian Raschka
Rating: 5 out of 5 stars
5/5
HTML & CSS QuickStart Guide: The Simplified Beginners Guide to Developing a Strong Coding Foundation, Building Responsive Websites, and Mastering the Fundamentals of Modern Web Design
Ebook
HTML & CSS QuickStart Guide: The Simplified Beginners Guide to Developing a Strong Coding Foundation, Building Responsive Websites, and Mastering the Fundamentals of Modern Web Design
byDavid DuRocher
Rating: 4 out of 5 stars
4/5
Teach Yourself C++
Ebook
Teach Yourself C++
byAl Stevens
Rating: 4 out of 5 stars
4/5
Pokemon Go: Guide + 20 Tips and Tricks You Must Read Hints, Tricks, Tips, Secrets, Android, iOS
Ebook
Pokemon Go: Guide + 20 Tips and Tricks You Must Read Hints, Tricks, Tips, Secrets, Android, iOS
byGame Guidez
Rating: 5 out of 5 stars
5/5
Microsoft Office 365 Bible: 10:1 Mastery | Excel in Your Profession, Enhance Time Management, and Foster Exceptional Collaboration [III EDITION]: Career Elevator
Ebook
Microsoft Office 365 Bible: 10:1 Mastery | Excel in Your Profession, Enhance Time Management, and Foster Exceptional Collaboration [III EDITION]: Career Elevator
byKevin Pitch
Rating: 5 out of 5 stars
5/5

Related podcast episodes

Skip carousel

Episode 98. It's HERE, FINALLY HERE! Java 17 LTS Release: So is time to celebrate! We got a new box of toys with the new release of Java! This is also a Long-Term-Support release which means that's usually a "good one" to jump into! Switch Expressions! Helpful Nullpointers, Sealed Classes... there is a TON...
Podcast episode
Episode 98. It's HERE, FINALLY HERE! Java 17 LTS Release: So is time to celebrate! We got a new box of toys with the new release of Java! This is also a Long-Term-Support release which means that's usually a "good one" to jump into! Switch Expressions! Helpful Nullpointers, Sealed Classes... there is a TON...
byJava Pub House
0 ratings
0% found this document useful
25: Selenium, pytest, Mozilla – Dave Hunt: Interview with Dave Hunt @davehunt82. We Cover: Selenium Driver: http://www.seleniumhq.org/ pytest: http://docs.pytest.org/ pytest plugins: pytest-selenium: http://pytest-selenium.readthedocs.io/ pytest-html: https://pypi.python.
Podcast episode
25: Selenium, pytest, Mozilla – Dave Hunt: Interview with Dave Hunt @davehunt82. We Cover: Selenium Driver: http://www.seleniumhq.org/ pytest: http://docs.pytest.org/ pytest plugins: pytest-selenium: http://pytest-selenium.readthedocs.io/ pytest-html: https://pypi.python.
byTest and Code
0 ratings
0% found this document useful
#457: [INTRODUCING] AWS BugBust: Today Nicki is joined by Alex Bush, Head of AI Services and Vishnu Parimi, Principal Product Manager
Podcast episode
#457: [INTRODUCING] AWS BugBust: Today Nicki is joined by Alex Bush, Head of AI Services and Vishnu Parimi, Principal Product Manager
byAWS Podcast
0 ratings
0% found this document useful
433: Falling for FastAPI: Mike's falling in love with FastAPI and gives us a hint at the next project he's building.
Podcast episode
433: Falling for FastAPI: Mike's falling in love with FastAPI and gives us a hint at the next project he's building.
byCoder Radio
0 ratings
0% found this document useful
How ChatGPT Changes Tech + The End of Remote Work? — With Aaron Levie
Podcast episode
How ChatGPT Changes Tech + The End of Remote Work? — With Aaron Levie
byBig Technology Podcast
100%
100% found this document useful
Ali Ghodsi – The Past, Present, and Future of Big Data – [Founder’s Field Guide, EP.18]: My Guest today is Ali Ghodsi, founder and CEO of Databricks, a data analytics platform for data scientists and developers. He's also the founder of Apache Spark, the open-source project that Databricks is built on, and is an accomplished researcher at...
Podcast episode
Ali Ghodsi – The Past, Present, and Future of Big Data – [Founder’s Field Guide, EP.18]: My Guest today is Ali Ghodsi, founder and CEO of Databricks, a data analytics platform for data scientists and developers. He's also the founder of Apache Spark, the open-source project that Databricks is built on, and is an accomplished researcher at...
byInvest Like the Best with Patrick O'Shaughnessy
0 ratings
0% found this document useful
Speed Up And Simplify Your Streaming Data Workloads With Red Panda - Episode 152: An interview with Vectorized founder Alexander Gallego about the Red Panda streaming engine and building a drop-in replacement for Kafka with better performance and throughput.
Podcast episode
Speed Up And Simplify Your Streaming Data Workloads With Red Panda - Episode 152: An interview with Vectorized founder Alexander Gallego about the Red Panda streaming engine and building a drop-in replacement for Kafka with better performance and throughput.
byData Engineering Podcast
0 ratings
0% found this document useful
040: Graph Databases: Traditional relational databases like MySQL or Postgres are really good at providing many solutions to the problem of persisting state. But these types of database are really horrible at querying highly connected models in an efficient way. Graph datab...
Podcast episode
040: Graph Databases: Traditional relational databases like MySQL or Postgres are really good at providing many solutions to the problem of persisting state. But these types of database are really horrible at querying highly connected models in an efficient way. Graph datab...
byPHPRoundtable Podcast
0 ratings
0% found this document useful
063: Effective Java for Android Developers – Item #13: Minimize the accessibility of classes and members: In this mini-Fragment episode, Donn talks about Item #13 of the Effective Java series - Minimize the accessibility of classes and members. You'll learn why it's important to limit the access on your public API, how it can help you with development and per
Podcast episode
063: Effective Java for Android Developers – Item #13: Minimize the accessibility of classes and members: In this mini-Fragment episode, Donn talks about Item #13 of the Effective Java series - Minimize the accessibility of classes and members. You'll learn why it's important to limit the access on your public API, how it can help you with development and per
byFragmented - An Android Developer Podcast
0 ratings
0% found this document useful
TypeScript Fundamentals — Getting a Bit Deeper: In this episode of Syntax, Scott and Wes continue their discussion of TypeScript Fundamentals with a deeper diver into more advanced use cases. Deque - Sponsor Deque’s axe DevTools makes accessibility testing easy and doesn’t require special...
Podcast episode
TypeScript Fundamentals — Getting a Bit Deeper: In this episode of Syntax, Scott and Wes continue their discussion of TypeScript Fundamentals with a deeper diver into more advanced use cases. Deque - Sponsor Deque’s axe DevTools makes accessibility testing easy and doesn’t require special...
bySyntax - Tasty Web Development Treats
0 ratings
0% found this document useful
State In React: In this episode of Syntax, Scott and Wes talk about state in React: local state, global state, UI state, data state, caching, API data and more! LogRocket - Sponsor LogRocket lets you replay what users do on your site, helping you reproduce bugs and...
Podcast episode
State In React: In this episode of Syntax, Scott and Wes talk about state in React: local state, global state, UI state, data state, caching, API data and more! LogRocket - Sponsor LogRocket lets you replay what users do on your site, helping you reproduce bugs and...
bySyntax - Tasty Web Development Treats
0 ratings
0% found this document useful
Kubernetes Registry with Benjamin Elder: Benjamin Elder is a Senior Software Engineer at Google, a Kubernetes SIG Testing Chair & Tech Lead, and a Kubernetes Steering Committee member. In this episode we got to chat with Benjamin about the new kubernetes registry migration from k8s.gcr.io to...
Podcast episode
Kubernetes Registry with Benjamin Elder: Benjamin Elder is a Senior Software Engineer at Google, a Kubernetes SIG Testing Chair & Tech Lead, and a Kubernetes Steering Committee member. In this episode we got to chat with Benjamin about the new kubernetes registry migration from k8s.gcr.io to...
byKubernetes Podcast from Google
0 ratings
0% found this document useful
Engineering interview tips & tricks: with Emma Draper & Jonas
Podcast episode
Engineering interview tips & tricks: with Emma Draper & Jonas
byGo Time: Golang, Software Engineering
0 ratings
0% found this document useful
React Hooks - 1 Year Later: In this episode of Syntax, Scott and Wes talk about React Hooks, one year later — what’s changed, how to use them, and more! Sanity - Sponsor is a real-time headless CMS with a fully customizable Content Studio built in React. Get a Sanity...
Podcast episode
React Hooks - 1 Year Later: In this episode of Syntax, Scott and Wes talk about React Hooks, one year later — what’s changed, how to use them, and more! Sanity - Sponsor is a real-time headless CMS with a fully customizable Content Studio built in React. Get a Sanity...
bySyntax - Tasty Web Development Treats
0 ratings
0% found this document useful
Hasty Treat - Webhooks: In this Hasty Treat, Scott and Wes talk about webhooks — one of those concepts that seems a lot scarier than it actually is. Linode - Sponsor Whether you’re working on a personal project or managing enterprise infrastructure, you deserve simple,...
Podcast episode
Hasty Treat - Webhooks: In this Hasty Treat, Scott and Wes talk about webhooks — one of those concepts that seems a lot scarier than it actually is. Linode - Sponsor Whether you’re working on a personal project or managing enterprise infrastructure, you deserve simple,...
bySyntax - Tasty Web Development Treats
0 ratings
0% found this document useful
55: Go on The Web: Summary Andrew Gerrand (@enneff), Developer Advocate at Google & Go core contributor, talks about GoLang and how it is being used in Web Development today as well as the plans for the future of the Go as a platform for the web. Resources Go...
Podcast episode
55: Go on The Web: Summary Andrew Gerrand (@enneff), Developer Advocate at Google & Go core contributor, talks about GoLang and how it is being used in Web Development today as well as the plans for the future of the Go as a platform for the web. Resources Go...
byThe Web Platform Podcast
100%
100% found this document useful
Python, Django, and Channels: with Andrew Godwin, creator of Django Channels
Podcast episode
Python, Django, and Channels: with Andrew Godwin, creator of Django Channels
byThe Changelog: Software Development, Open Source
0 ratings
0% found this document useful
Serverless, Deno and TypeScript with Brian Leroux: In this episode of Syntax, Scott and Wes talk with Brian Leroux about severless, Deno, Typescript, and more! Netlify - Sponsor Netlify is the best way to deploy and host a front-end website. All the features developers need right out of the box:...
Podcast episode
Serverless, Deno and TypeScript with Brian Leroux: In this episode of Syntax, Scott and Wes talk with Brian Leroux about severless, Deno, Typescript, and more! Netlify - Sponsor Netlify is the best way to deploy and host a front-end website. All the features developers need right out of the box:...
bySyntax - Tasty Web Development Treats
0 ratings
0% found this document useful
Making Automated Machine Learning More Accessible With EvalML: An interview with Angela Lin and Jeremy Shih about the open source EvalML framework for building automated machine learning workflows.
Podcast episode
Making Automated Machine Learning More Accessible With EvalML: An interview with Angela Lin and Jeremy Shih about the open source EvalML framework for building automated machine learning workflows.
byThe Python Podcast.__init__
100%
100% found this document useful
Revisit The Fundamental Principles Of Working With Data To Avoid Getting Caught In The Hype Cycle: The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed it can be easy to be overcome by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.
Podcast episode
Revisit The Fundamental Principles Of Working With Data To Avoid Getting Caught In The Hype Cycle: The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed it can be easy to be overcome by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.
byData Engineering Podcast
0 ratings
0% found this document useful
Managing non-REST APIs like GraphQL and gRPC with Nandan Sridhar and David Feuer: Alexandrina Garcia-Verdin and Stephanie Wong host this week's episode all about managing non-REST APIs.
Podcast episode
Managing non-REST APIs like GraphQL and gRPC with Nandan Sridhar and David Feuer: Alexandrina Garcia-Verdin and Stephanie Wong host this week's episode all about managing non-REST APIs.
byGoogle Cloud Platform Podcast
0 ratings
0% found this document useful
Cloud Dataflow with Eric Anderson: Batch and stream processing systems have been evolving for the past decade. From MapReduce to Apache Storm to Dataflow, the best practices for large volume data processing have become more sophisticated as the industry and open source communities have ...
Podcast episode
Cloud Dataflow with Eric Anderson: Batch and stream processing systems have been evolving for the past decade. From MapReduce to Apache Storm to Dataflow, the best practices for large volume data processing have become more sophisticated as the industry and open source communities have ...
byCloud Engineering Archives - Software Engineering Daily
0 ratings
0% found this document useful
Every commit is a gift: celebrating Maintainer Week with Brett Cannon
Podcast episode
Every commit is a gift: celebrating Maintainer Week with Brett Cannon
byThe Changelog: Software Development, Open Source
0 ratings
0% found this document useful
Cloud Dependencies with Mya Pitzeruse: New software abstractions always take advantage of the abstractions that have been built before. Software libraries allow us to import code that sits on the same host as a new program. Open source software let us copy and paste existing code,
Podcast episode
Cloud Dependencies with Mya Pitzeruse: New software abstractions always take advantage of the abstractions that have been built before. Software libraries allow us to import code that sits on the same host as a new program. Open source software let us copy and paste existing code,
byCloud Engineering Archives - Software Engineering Daily
0 ratings
0% found this document useful
Putting Airflow Into Production With James Meickle - Episode 43: Lessons Learned While Building A Data Science Platform With Airflow (Interview)
Podcast episode
Putting Airflow Into Production With James Meickle - Episode 43: Lessons Learned While Building A Data Science Platform With Airflow (Interview)
byData Engineering Podcast
0 ratings
0% found this document useful
Cloud Native Security Con with Emily Fox: is a security engineer @Apple Cloud Services, a CNCF Technical Oversight Committee member and co-chair for a bunch of CNCF events including recently the Cloud Native Security Conference in Seattle. We had a chance to talk to Emily about the first...
Podcast episode
Cloud Native Security Con with Emily Fox: is a security engineer @Apple Cloud Services, a CNCF Technical Oversight Committee member and co-chair for a bunch of CNCF events including recently the Cloud Native Security Conference in Seattle. We had a chance to talk to Emily about the first...
byKubernetes Podcast from Google
0 ratings
0% found this document useful
Level Up Your Data Platform With Active Metadata: A conversation with Atlan co-founder Prukalpa Sankar about the idea of active metadata and how it can reduce the toil involved in managing a data platform
Podcast episode
Level Up Your Data Platform With Active Metadata: A conversation with Atlan co-founder Prukalpa Sankar about the idea of active metadata and how it can reduce the toil involved in managing a data platform
byData Engineering Podcast
0 ratings
0% found this document useful
#159 - Leveling Up Your Code Reviews from 'Good Enough' to Great - Adrienne Tacke
Podcast episode
#159 - Leveling Up Your Code Reviews from 'Good Enough' to Great - Adrienne Tacke
byTech Lead Journal
0 ratings
0% found this document useful
Managed Kafka with Tom Crayford: Kafka is a distributed log for producers and consumers to publish messages to each other. We’ve done many shows about Kafka as a key building block for distributed systems, but we often leave out the discussion of the complexities of setting up Kafka a...
Podcast episode
Managed Kafka with Tom Crayford: Kafka is a distributed log for producers and consumers to publish messages to each other. We’ve done many shows about Kafka as a key building block for distributed systems, but we often leave out the discussion of the complexities of setting up Kafka a...
byCloud Engineering Archives - Software Engineering Daily
0 ratings
0% found this document useful
Hasty Treat - Why should I use React Hooks?: In this Hasty Treat, Scott and Wes talk about React Hooks and why you might want to use them instead of class components. Sentry - Sponsor If you want to know what’s happening with your errors, track them with . Sentry is open-source error...
Podcast episode
Hasty Treat - Why should I use React Hooks?: In this Hasty Treat, Scott and Wes talk about React Hooks and why you might want to use them instead of class components. Sentry - Sponsor If you want to know what’s happening with your errors, track them with . Sentry is open-source error...
bySyntax - Tasty Web Development Treats
0 ratings
0% found this document useful

Skip carousel

Grafana Terminology
Linux Format
Article
Grafana Terminology
Jan 14, 2020
A Grafana data source is a database, file or service that provides data to Grafana – it cannot operate without data. A Grafana panel is the basic building block of Grafana. Panels are made of visualisations or queries. A Grafana query is used for req
1 min read
MapReduce: The ‘Big Data’ Idea Inside Your Android Phone
APC
Article
MapReduce: The ‘Big Data’ Idea Inside Your Android Phone
Dec 2, 2019
4 min read
Build A Static Analysis Development Pipeline
Linux Format
Article
Build A Static Analysis Development Pipeline
Jul 27, 2021
9 min read
Monitor Systems And Docker Deployments
Linux Format
Article
Monitor Systems And Docker Deployments
Jun 30, 2020
Welcome to Netdata, software for distributed real-time performance and health monitoring of UNIX machines. Don’t you dare turn that page! A key advantage of Netdata is that it collects all of its metrics without introducing too much load on to the Li
8 min read
Elasticsearch And Kibana Basics
Linux Format
Article
Elasticsearch And Kibana Basics
Dec 15, 2020
1 min read
“We’re Learning As We Go And Accepting Any False Starts As Being A Part Of The Process”
PC Pro Magazine
Article
“We’re Learning As We Go And Accepting Any False Starts As Being A Part Of The Process”
Jul 8, 2021
6 min read
Grafana, Telegraf And Influxdb
Linux Format
Article
Grafana, Telegraf And Influxdb
Jun 30, 2020
If you don’t like Netdata or if you want to try something else, you can give Grafana (https://grafana.com), Telegraf (www.influxdata.com/time-series-platform/telegraf) and InfluxDB (www.influxdata.com/products/influxdb-overview) a try. Grafana can’t
1 min read
Three Low-code Options
PC Pro Magazine
Article
Three Low-code Options
Nov 12, 2020
Counting Intel, Vodafone and VW among its customers, OutSystems helps businesses create cloudbased, on-premises and hybrid applications for mobile and web. Its development environment is predominantly drag-and-drop, with views for processes, data and
3 min read
Route Traffic Between Networks Using A Pi
Linux Format
Article
Route Traffic Between Networks Using A Pi
Jun 2, 2020
A deep-dive into Pi networking solutions resulted in this tutorial. The goal was to uncover a Pi configuration that would enable the routing of network traffic from a wired network to a wireless network. The aim is to build a network router using a R
10 min read
Rokoko Studio 2.0
3D World
Article
Rokoko Studio 2.0
Feb 23, 2021
1 min read
Build A Search And Analytic Engine
Linux Format
Article
Build A Search And Analytic Engine
Mar 10, 2020
7 min read
Your First Steps In Grafana
Linux Format
Article
Your First Steps In Grafana
Nov 17, 2020
The easiest way to get hold of Grafana and begin using it as soon as possible is by downloading and executing its official Docker image. This means that apart from the Docker image, you won’t need to download, set up or install anything else for Graf
1 min read
DJANGO Create A Database-driven Website
Linux Format
Article
DJANGO Create A Database-driven Website
Jun 4, 2019
The Django web framework was named after the famous guitarist Django Reinhardt and was first created by web developers at a small newspaper in Kansas. The main goals of Django is to enable fast development of complex websites with database needs. It
7 min read
Build A Dynamic App Security Pipeline
Linux Format
Article
Build A Dynamic App Security Pipeline
Sep 21, 2021
8 min read
How To Develop A RESTful Client In Go
Linux Format
Article
How To Develop A RESTful Client In Go
Nov 16, 2021
Mihalis Tsoukalos is a systems engineer and technical writer. He’s the author of Go Systems Programming and Mastering Go. You can reach him at @mactsouk. The subject of this month’s tutorial is RESTful services. In particular, you’re going to learn h
9 min read
Year Of The Linux Desktop (on Windows)
TechLife
Article
Year Of The Linux Desktop (on Windows)
Nov 15, 2021
3 min read
Updating Sites
Linux Format
Article
Updating Sites
Oct 19, 2021
1 min read
Docker vs Podman
APC
Article
Docker vs Podman
Apr 19, 2021
When Cockpit was first developed, it had plug-in support for administering your Docker containers remotely via its user-friendly web interface. But then Red Hat OS became a major backer of Cockpit, and when Red Hat developed its own alternative to Do
1 min read
What Systems And Software Are Used By The Falcon 9?
Techfastly
Article
What Systems And Software Are Used By The Falcon 9?
Oct 21, 2020
Last summer, SpaceX embarked on the first US-manned spaceflight in almost a decade. What made this event even more historic is that it successfully took NASA astronauts into orbit on a privately-manned spacecraft and delivered them to the Internation
3 min read
Traefik Configuration
Linux Format
Article
Traefik Configuration
Mar 10, 2020
In this tutorial we have configured Traefik using command-line switches in our Docker Compose file (the section starting command:). This is the equivalent of starting the application with a whole bunch of command options each time, and while this wou
1 min read
An easy-to-Understand Overview of Popular extended BPF Tools: BCC, Falco, and More
Techfastly
Article
An easy-to-Understand Overview of Popular extended BPF Tools: BCC, Falco, and More
Apr 1, 2022
7 min read
IT For A New World
Business Today
Article
IT For A New World
Jun 10, 2021
6 min read
Basic Concepts
Linux Format
Article
Basic Concepts
Jul 2, 2019
A messaging system such as Kafka enables you to send messages between processes, applications and servers. Applications connect to Kafka to send or get data. Strictly speaking, a Kafka ‘topic’ is a unit of storage in Kafka: data in Kafka is stored in
1 min read
Get Into Coding!
Linux Format
Article
Get Into Coding!
Aug 23, 2022
1 min read
What Is The Future Of Game Streaming Now That Stadia Is Dead?
APC
Article
What Is The Future Of Game Streaming Now That Stadia Is Dead?
Oct 31, 2022
Once hyped as being ‘the future of gaming’, the Google Stadia game streaming service was officially, just three years after launch and before even making it to Australian shores. When game streaming first launched we did have some apprehension about
2 min read
Workflow
Linux Format
Article
Workflow
Nov 17, 2020
3 min read
What is ELT?
Techfastly
Article
What is ELT?
Apr 1, 2021
It stands for extract, load, and transform- the processes a data pipeline uses for replicating the data from a source system into a target system such as a cloud data warehouse. 1. Extraction is the first step in which data is copied from the source
6 min read
4 Windows Command Prompt Tricks Everyone Should Know
PCWorld
Article
4 Windows Command Prompt Tricks Everyone Should Know
Feb 8, 2017
3 min read
Scikit-Learn: The Ultimate Python Library
APC
Article
Scikit-Learn: The Ultimate Python Library
Jul 15, 2019
4 min read
Google Answer Box Strategy
Techfastly
Article
Google Answer Box Strategy
Sep 21, 2020
Leveraging the Google PAA (People Also Ask) element on a Search Results Page for Targeted Content Creation with a Python Scraper All businesses that are online today are creating content at a furious pace. According to Technavio, a research firm, con
7 min read

Related categories

Skip carousel

Reviews for Java Data Science Cookbook

Rating: 0 out of 5 stars

0 ratings

0 ratings0 reviews

Book preview

Java Data Science Cookbook - Rushdi Shams

Java Data Science Cookbook

Credits

About the Author

About the Reviewer

www.PacktPub.com

Why subscribe?

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Sections

Getting ready

How to do it...

How it works...

There's more...

See also

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

1. Obtaining and Cleaning Data

Introduction

Retrieving all filenames from hierarchical directories using Java

Getting ready

How to do it...

Retrieving all filenames from hierarchical directories using Apache Commons IO

Getting ready

How to do it...

Reading contents from text files all at once using Java 8

How to do it...

Reading contents from text files all at once using Apache Commons IO

Getting ready

How to do it...

Extracting PDF text using Apache Tika

Getting ready

How to do it...

Cleaning ASCII text files using Regular Expressions

How to do it...

Parsing Comma Separated Value (CSV) Files using Univocity

Getting ready

How to do it...

Parsing Tab Separated Value (TSV) file using Univocity

Getting ready

How to do it...

Parsing XML files using JDOM

Getting ready

How to do it...

Writing JSON files using JSON.simple

Getting ready

How to do it...

Reading JSON files using JSON.simple

Getting ready

How to do it ...

Extracting web data from a URL using JSoup

Getting ready

How to do it...

Extracting web data from a website using Selenium Webdriver

Getting ready

How to do it...

Reading table data from a MySQL database

Getting ready

How to do it...

2. Indexing and Searching Data

Introduction

Indexing data with Apache Lucene

Getting ready

How to do it...

How it works...

Searching indexed data with Apache Lucene

Getting ready

How to do it...

3. Analyzing Data Statistically

Introduction

Generating descriptive statistics

How to do it...

Generating summary statistics

How to do it...

Generating summary statistics from multiple distributions

How to do it...

There's more...

Computing frequency distribution

How to do it...

Counting word frequency in a string

How to do it...

How it works...

Counting word frequency in a string using Java 8

How to do it...

Computing simple regression

How to do it...

Computing ordinary least squares regression

How to do it...

Computing generalized least squares regression

How to do it...

Calculating covariance of two sets of data points

How to do it...

Calculating Pearson's correlation of two sets of data points

How to do it...

Conducting a paired t-test

How to do it...

Conducting a Chi-square test

How to do it...

Conducting the one-way ANOVA test

How to do it...

Conducting a Kolmogorov-Smirnov test

How to do it...

4. Learning from Data - Part 1

Introduction

Creating and saving an Attribute-Relation File Format (ARFF) file

How to do it...

Cross-validating a machine learning model

How to do it...

Classifying unseen test data

Getting ready

How to do it...

Classifying unseen test data with a filtered classifier

How to do it...

Generating linear regression models

How to do it...

Generating logistic regression models

How to do it...

Clustering data points using the KMeans algorithm

How to do it...

Clustering data from classes

How to do it...

Learning association rules from data

Getting ready

How to do it...

Selecting features/attributes using the low-level method, the filtering method, and the meta-classifier method

Getting ready

How to do it...

5. Learning from Data - Part 2

Introduction

Applying machine learning on data using Java Machine Learning (Java-ML) library

Getting ready

How to do it...

Classifying data points using the Stanford classifier

Getting ready

How to do it...

How it works...

Classifying data points using Massive Online Analysis (MOA)

Getting ready

How to do it...

Classifying multilabeled data points using Mulan

Getting ready

How to do it...

6. Retrieving Information from Text Data

Introduction

Detecting tokens (words) using Java

Getting ready

How to do it...

Detecting sentences using Java

Getting ready

How to do it...

Detecting tokens (words) and sentences using OpenNLP

Getting ready

How to do it...

Retrieving lemma, part-of-speech, and recognizing named entities from tokens using Stanford CoreNLP

Getting ready

How to do it...

Measuring text similarity with Cosine Similarity measure using Java 8

Getting ready

How to do it...

Extracting topics from text documents using Mallet

Getting ready

How to do it...

Classifying text documents using Mallet

Getting ready

How to do it...

Classifying text documents using Weka

Getting ready

How to do it...

7. Handling Big Data

Introduction

Training an online logistic regression model using Apache Mahout

Getting ready

How to do it...

Applying an online logistic regression model using Apache Mahout

Getting ready

How to do it...

Solving simple text mining problems with Apache Spark

Getting ready

How to do it...

Clustering using KMeans algorithm with MLib

Getting ready

How to do it...

Creating a linear regression model with MLib

Getting ready

How to do it...

Classifying data points with Random Forest model using MLib

Getting ready

How to do it...

8. Learn Deeply from Data

Introduction

Creating a Word2vec neural net using Deep Learning for Java (DL4j)

How to do it...

How it works...

There's more

Creating a Deep Belief neural net using Deep Learning for Java (DL4j)

How to do it...

How it works...

Creating a deep autoencoder using Deep Learning for Java (DL4j)

How to do it...

How it works...

9. Visualizing Data

Introduction

Plotting a 2D sine graph

Getting ready

How to do it...

Plotting histograms

Getting ready

How to do it...

Plotting a bar chart

Getting ready

How to do it...

Plotting box plots or whisker diagrams

Getting ready

How to do it...

Plotting scatter plots

Getting ready

How to do it...

Plotting donut plots

Getting ready

How to do it...

Plotting area graphs

Getting ready

How to do it...

Java Data Science Cookbook

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: March 2017

Production reference: 1240317

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-78712-253-6

www.packtpub.com

Credits

About the Author

Rushdi Shams has a PhD on application of machine learning in Natural Language Processing (NLP) problem areas from Western University, Canada. Before starting work as a machine learning and NLP specialist in industry, he was engaged in teaching undergrad and grad courses. He has been successfully maintaining his YouTube channel named Learn with Rushdi for learning computer technologies.

I would like to acknowledge the Almighty Allah for giving me the strength, support, and knowledge to finish the book.

I extend my thanks to my family members, friends, and colleagues for continuous support, encouragement, and constructive criticism.

I would also like to thank Ajith and Cheryl from Packt for their continuous and spontaneous collaboration with me.

About the Reviewer

Prashant Verma started his IT career in 2011 as a Java developer at Ericsson, working in the telecom domain. After a couple of years of Java EE experience, he moved into the big data domain, and has worked on almost all the popular big data technologies such as Hadoop, Spark, Kafka, Flume, Mongo, Cassandra, and so on. He has also worked in Scala and Python. Currently, he works with QA Infotech as Lead Data Engineer, working on solving e-learning domain problems using data analytics and machine learning.

Prashant has also worked on Apache Spark for Java Developers, Packt as a Technical Reviewer.

I want to thank Packt Publishing for giving me the chance to review the book, as well as my employer and my family for their patience while I was busy working on this book.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787122530.

If you'd like to join our team of regular reviewers, you can e-mail us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

To my lovely wife, Mah-Zereen, and adorable daughter, Ruayda.

Preface

Data science is a popular field for specialization nowadays and covers the broad spectrum of artificial intelligence, such as data processing, information retrieval, machine learning, natural language processing, big data, deep neural networks, and data visualization. In this book, we will understand the techniques that are both modern and smart and presented as easy-to-follow recipes for over 70 problems.

Keeping in mind the high demand for quality data scientists, we have compiled recipes using core Java as well as well-known, classic, and state-of-the-art data science libraries written in Java. We start with the data collection and cleaning process. Then we see how the obtained data can be indexed and searched. Afterwards, we cover statistics both descriptive and inferential and their application to data. Then, we have two back-to-back chapters on the application of machine learning on data that can be foundation for building any smart system. Modern information retrieval and natural language processing techniques are also covered. Big data is an emerging field, and a few aspects of this popular field are also covered. We also cover the very basics of deep learning using deep neural networks. Finally, we learn how to represent data and information obtained from data using meaningful visuals or graphs.

The book is aimed at anyone who has an interest in data science and plans to apply data science using Java to understand underlying data better.

What this book covers

Chapter 1, Obtaining and Cleaning Data, covers different ways to read and write data as well as to clean it to get rid of noise. It also familiarizes the readers with different data file types, such as PDF, ASCII, CSV, TSV, XML, and JSON. The chapter also covers recipes for extracting web data.

Chapter 2, Indexing and Searching Data, covers how to index data for fast searching using Apache Lucene. The techniques described in this chapter can be seen as the basis for modern-day search techniques.

Chapter 3, Analyzing Data Statistically, covers the application of Apache Math API to collect and analyze statistics from data. The chapter also covers higher level concepts such as the statistical significance test, which is the standard tool for researchers when they compare their results with benchmarks.

Chapter 4, Learning from Data - Part 1, covers basic classification, clustering, and feature selection exercises using the Weka machine learning Workbench.

Chapter 5, Learning from Data - Part 2, is a follow-up chapter that covers data import and export, classification, and feature selection using another Java library named the Java Machine Learning (Java-ML) Library. The chapter also covers basic classification with the Stanford Classifier and Massive Online Access (MOA).

Chapter 6, Retrieving Information from Text Data, covers the application of data science to text data for information retrieval. It covers the application of core Java as well as popular libraries such as OpenNLP, Stanford CoreNLP, Mallet, and Weka for the application of machine learning to information extraction and retrieval tasks.

Chapter 7, Handling Big Data, covers the application of big data platforms for machine learning, such as Apache Mahout and Spark-MLib.

Chapter 8, Learn Deeply from Data, covers the very basics of deep learning using the Deep Learning for Java (DL4j) library. We cover the word2vec algorithm, belief networks, and auto-encoders.

Chapter 9, Visualizing Data, covers the GRAL package to generate an appealing and informative display based on data. Among the many functionalities of the package, fundamental and basic plots have been selected.

What you need for this book

We have used Java to solve real-world data science problems. Our focus was to deliver content that can be effective for anyone who wants to know how to solve problems with Java. A minimum knowledge of Java is required, such as classes, objects, methods, arguments and parameters, exceptions, and exporting Java Archive (JAR) files. The code is well supported with narrations, information, and tips to help the readers understand the context and purpose. The theories behind the problems solved in this book, on many occasions, are not thoroughly discussed, but references for interested readers are provided whenever necessary.

Who this book is for

The book is for anyone who wants to know how to solve real-world problems related to data science using Java. The book, as it is very comprehensive in terms of coverage, can also be very useful for practitioners already engaged with data science and looking for solving issues in their projects using Java.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There's more..., and See also).

To give clear instructions on how to complete a recipe, we use these sections as follows:

Getting ready

This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.

How to do it...

This section contains the steps required to follow the recipe.

How it works...

This section usually consists of a detailed explanation of what happened in the previous section.

There's more...

This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: Among them, you will find a folder named lib, which is the folder of interest.

A block of code is set as follows:

classVals = new ArrayList();

for (int i = 0; i < 5; i++){

classVals.add(class + (i + 1));

}

Any command-line input or output is written as follows:

@relation MyRelation

@attribute age numeric

@attribute name string

@attribute dob date yyyy-MM-dd

@attribute class {class1,class2,class3,class4,class5}

@data

35,'John Doe',1981-01-20,class3

30,'Harry Potter',1986-07-05,class1

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Select System info from the Administration panel.

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors .

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

Log in or register to our website using your e-mail address and password.

Hover the mouse pointer on the SUPPORT tab at the top.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box.

Select the book for which you're looking to download the code files.

Choose from the drop-down menu where you purchased this book from.

Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Java-Data-Science-Cookbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/ . Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/JavaDataScienceCookbook_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address the problem.

Chapter 1. Obtaining and Cleaning Data

In this chapter, we will cover the following recipes:

Retrieving all file names from hierarchical directories using Java

Retrieving all file names from hierarchical directories using Apache Commons IO

Reading contents from text files all at once using Java 8

Reading contents from text files all at once using Apache Commons IO

Extracting PDF text using Apache Tika

Cleaning ASCII text files using Regular Expressions

Parsing Comma Separated Value files using Univocity

Parsing Tab Separated Value files using Univocity

Parsing XML files using JDOM

Writing JSON files using JSON.simple

Reading JSON files using JSON.simple

Extracting web data from a URL using JSoup

Extracting web data from a website using Selenium Webdriver

Reading table data from MySQL database

Introduction

Every data scientist needs to deal with data that is stored on disks in several formats, such as ASCII text, PDF, XML, JSON, and so on. Also, data can be stored in database tables. The first and foremost task for a data scientist before doing any analysis is to obtain data from these data sources and of these formats, and apply data-cleaning techniques to get rid of noises present in them. In this chapter, we will see recipes to accomplish this important task.

We will be using external Java libraries (Java archive files or simply JAR files) not only for this chapter

Enjoying the preview?

Page 1 of 1

Java Data Science Cookbook

About this ebook

Rushdi Shams

Related authors

Related to Java Data Science Cookbook

Related ebooks

Programming For You

Related podcast episodes

Related articles

Related categories

Reviews for Java Data Science Cookbook

What did you think?

Book preview

Java Data Science Cookbook - Rushdi Shams

Table of Contents

Java Data Science Cookbook

Java Data Science Cookbook

About the Author

About the Reviewer

www.PacktPub.com

Why subscribe?

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Sections

Getting ready

There's more...

See also

Conventions

Note

Tip

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Chapter 1. Obtaining and Cleaning Data

Introduction