Mastering Machine Learning with R - Second Edition
By Cory Lesmeister
- Publisher: Packt Publishing
- Released: Apr 24, 2017
- ISBN: 9781787284487
- Format: Book
Description
This book is for data science professionals, data analysts, or anyone with a working knowledge of machine learning with R, who now want to take their skills to the next level and become an expert in the field.
Title Page
Mastering Machine Learning with R
Second Edition
Advanced prediction, algorithms, and learning methods with R 3.x
Cory Lesmeister
BIRMINGHAM - MUMBAI
Copyright
Mastering Machine Learning with R
Second Edition
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2015
Second Edition: April 2017
Production reference: 1140417
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78728-747-1
www.packtpub.com
Credits
About the Author
Cory Lesmeister has over a dozen years of quantitative experience and is currently a Senior Quantitative Manager in the banking industry, responsible for building marketing and regulatory models. Cory spent 16 years at Eli Lilly and Company in sales, market research, Lean Six Sigma, marketing analytics, and new product forecasting. A former U.S. Army active duty and reserve officer, Cory was in Baghdad, Iraq, in 2009 serving as the strategic advisor to the 29,000-person Iraqi Oil Police, where he supplied equipment to help the country secure and protect its oil infrastructure. An aviation aficionado, Cory has a BBA in aviation administration from the University of North Dakota and a commercial helicopter license.
About the Reviewers
Doug Ortiz is an independent consultant who has been architecting, developing, and integrating enterprise solutions throughout his career. Organizations that leverage his skill set have been able to rediscover and reuse their underutilized data via existing and emerging technologies, such as the Microsoft BI stack, Hadoop, NoSQL databases, SharePoint, and related toolsets and technologies.
He is the founder of Illustris, LLC, and can be reached at dougortiz@illustris.org.
Interesting aspects of his profession are listed here:
Has experience integrating multiple platforms and products
Helps organizations gain a deeper understanding and value of their current investments in data and existing resources, turning them into useful sources of information
Has improved, salvaged, and architected projects by utilizing unique and innovative techniques
His hobbies include yoga and scuba diving.
Miroslav Kopecky has been a passionate JVM enthusiast since the moment he joined Sun Microsystems in 2002. He truly believes in distributed system design, concurrency, and parallel computing. One of Miro's favorite hobbies is the development of autonomic systems. He is one of the co-authors of, and main contributors to, the open source Java IoT/robotics framework Robo4J.
Miro is currently working on the online energy trading platform for enmacc.de as a senior software developer.
I would like to thank my family and my wife, Tanja, for their great support while I was reviewing this book.
Packt Upsell
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787287475
If you'd like to join our team of regular reviewers, you can e-mail us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Table of Contents
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
A Process for Success
The process
Business understanding
Identifying the business objective
Assessing the situation
Determining the analytical goals
Producing a project plan
Data understanding
Data preparation
Modeling
Evaluation
Deployment
Algorithm flowchart
Summary
Linear Regression - The Blocking and Tackling of Machine Learning
Univariate linear regression
Business understanding
Multivariate linear regression
Business understanding
Data understanding and preparation
Modeling and evaluation
Other linear model considerations
Qualitative features
Interaction terms
Summary
Logistic Regression and Discriminant Analysis
Classification methods and linear regression
Logistic regression
Business understanding
Data understanding and preparation
Modeling and evaluation
The logistic regression model
Logistic regression with cross-validation
Discriminant analysis overview
Discriminant analysis application
Multivariate Adaptive Regression Splines (MARS)
Model selection
Summary
Advanced Feature Selection in Linear Models
Regularization in a nutshell
Ridge regression
LASSO
Elastic net
Business case
Business understanding
Data understanding and preparation
Modeling and evaluation
Best subsets
Ridge regression
LASSO
Elastic net
Cross-validation with glmnet
Model selection
Regularization and classification
Logistic regression example
Summary
More Classification Techniques - K-Nearest Neighbors and Support Vector Machines
K-nearest neighbors
Support vector machines
Business case
Business understanding
Data understanding and preparation
Modeling and evaluation
KNN modeling
SVM modeling
Model selection
Feature selection for SVMs
Summary
Classification and Regression Trees
An overview of the techniques
Understanding the regression trees
Classification trees
Random forest
Gradient boosting
Business case
Modeling and evaluation
Regression tree
Classification tree
Random forest regression
Random forest classification
Extreme gradient boosting - classification
Model selection
Feature Selection with random forests
Summary
Neural Networks and Deep Learning
Introduction to neural networks
Deep learning, a not-so-deep overview
Deep learning resources and advanced methods
Business understanding
Data understanding and preparation
Modeling and evaluation
An example of deep learning
H2O background
Data upload to H2O
Create train and test datasets
Modeling
Summary
Cluster Analysis
Hierarchical clustering
Distance calculations
K-means clustering
Gower and partitioning around medoids
Gower
PAM
Random forest
Business understanding
Data understanding and preparation
Modeling and evaluation
Hierarchical clustering
K-means clustering
Gower and PAM
Random Forest and PAM
Summary
Principal Components Analysis
An overview of the principal components
Rotation
Business understanding
Data understanding and preparation
Modeling and evaluation
Component extraction
Orthogonal rotation and interpretation
Creating factor scores from the components
Regression analysis
Summary
Market Basket Analysis, Recommendation Engines, and Sequential Analysis
An overview of a market basket analysis
Business understanding
Data understanding and preparation
Modeling and evaluation
An overview of a recommendation engine
User-based collaborative filtering
Item-based collaborative filtering
Singular value decomposition and principal components analysis
Business understanding and recommendations
Data understanding, preparation, and recommendations
Modeling, evaluation, and recommendations
Sequential data analysis
Sequential analysis applied
Summary
Creating Ensembles and Multiclass Classification
Ensembles
Business and data understanding
Modeling evaluation and selection
Multiclass classification
Business and data understanding
Model evaluation and selection
Random forest
Ridge regression
MLR's ensemble
Summary
Time Series and Causality
Univariate time series analysis
Understanding Granger causality
Business understanding
Data understanding and preparation
Modeling and evaluation
Univariate time series forecasting
Examining the causality
Linear regression
Vector autoregression
Summary
Text Mining
Text mining framework and methods
Topic models
Other quantitative analyses
Business understanding
Data understanding and preparation
Modeling and evaluation
Word frequency and topic models
Additional quantitative analysis
Summary
R on the Cloud
Creating an Amazon Web Services account
Launch a virtual machine
Start RStudio
Summary
R Fundamentals
Getting R up-and-running
Using R
Data frames and matrices
Creating summary statistics
Installing and loading R packages
Data manipulation with dplyr
Summary
Sources
Preface
A man deserves a second chance, but keep an eye on him
-John Wayne
It is not so often in life that you get a second chance. I remember that, only days after we stopped editing the first edition, I kept asking myself, Why didn't I...?, or, What the heck was I thinking saying it like that?, and on and on. In fact, the first project I started working on after it was published had nothing to do with any of the methods in the first edition. I made a mental note that, if given the chance, it would go into a second edition.
When I started the first edition, my goal was to create something different, maybe even a work that was a pleasure to read, given the constraints of the topic. After all the feedback I received, I think I hit the mark. However, there is always room for improvement, and if you try to be everything to all people, you become nothing to everybody. I'm reminded of one of my favorite quotes from Frederick the Great: He who defends everything, defends nothing. So, I've tried to provide enough of the skills and tools, but not all of them, to get a reader up and running with R and machine learning as quickly and painlessly as possible. I think I've added some interesting new techniques that build on what was in the first edition. There will probably always be detractors who complain that it does not offer enough math or does not do this, that, or the other thing, but my answer to that is: those books already exist! Why duplicate what was already done, and very well, for that matter? Again, I have sought to provide something different, something that will keep the reader's attention and allow them to succeed in this competitive field.
Before I provide a list of the changes/improvements incorporated into the second edition, chapter by chapter, let me explain some universal changes. First of all, I have surrendered in my effort to fight the usage of the assignment operator <- versus just using =. As I shared more and more code with others, I realized I was out on my own using = and not <-. The first thing I did when under contract for the second edition was go line by line through the code and change it. The more important part, perhaps, was to clean and standardize the code. This is also important when you have to share code with coworkers and, dare I say, regulators. The most recent versions of RStudio facilitate this standardization. What sort of standards? Well, the first thing is to properly space the code. For instance, in the past I would not hesitate to write c(1,2,3,4,5,6). Not anymore! Now I will write it as c(1, 2, 3, 4, 5, 6), with a space after each comma, which makes it easier to read. If you want other ideas, please have a look at Google's R style guide, https://google.github.io/styleguide/Rguide.xml/. I also received a number of e-mails saying that the data I scraped off the Web wasn't available. The National Hockey League decided to launch a completely new version of their statistics, so I had to start from scratch. Problems such as that led me to put the data on GitHub.
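To make these conventions concrete, here is a minimal sketch (an illustrative snippet, not code taken from the book) showing the style used throughout the second edition:

```r
# Old habit: '=' for assignment and no spaces after commas
# x = c(1,2,3,4,5,6)

# Second-edition style: '<-' for assignment and a space after each comma
x <- c(1, 2, 3, 4, 5, 6)
x_mean <- mean(x)
x_mean
```

Small as it looks, consistent spacing and a single assignment operator make shared code far easier to review.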
All in all, I put forth a rather large effort to put the best possible tool in your hands to get you going. On another note, in February 2017, much attention on the Web was given to these comments from entrepreneur Mark Cuban:
Artificial Intelligence, deep learning, machine learning--whatever you’re doing if you don’t understand it--learn it. Because otherwise you’re going to be a dinosaur within 3 years.
I personally think there's going to be a greater demand in 10 years for liberal arts majors than there were for programming majors and maybe even engineering, because when the data is all being spit out for you, options are being spit out for you, you need a different perspective in order to have a different view of the data. And so is having someone who is more of a freer thinker.
Besides the fact that these comments created a bit of a stir in the blogosphere, they also seem, at first glance, mutually exclusive. But think about what he is saying here. I think he gets to the core of why I felt compelled to write this book. Here is what I believe: machine learning needs to be embraced and utilized, to some extent, by the masses: the tired, the poor, the hungry, the proletariat, and the bourgeoisie. The increasing availability of computational power and information will make machine learning something for virtually everyone. However, the flip side of that, and what, in my mind, has been and will continue to be a problem, is the communication of results. What are you going to do when you describe true positive rate and false positive rate and receive blank stares? How do you quickly tell a story that enlightens your audience? If you think it can't happen, please drop me a note; I'd be more than happy to share my story.
We must have people who can lead these efforts and influence their organization. If a degree in history or music appreciation helps in that endeavor, then so be it. I study history every day, and it has helped me tremendously. Cuban's comments have reinforced my belief that, in many ways, the first chapter is the most important in this book. If you are not asking your business partners what they plan to do differently, you'd better start tomorrow. There are far too many people working far too hard to complete analyses that are completely irrelevant to the organization and its decisions.
What this book covers
Here is a list of changes from the first edition by chapter:
Chapter 1, A Process for Success, has the flowchart redone to correct an unintended typo and add additional methodologies.
Chapter 2, Linear Regression – the Blocking and Tackling of Machine Learning, has the code improved, and better charts have been provided; other than that, it remains relatively close to the original.
Chapter 3, Logistic Regression and Discriminant Analysis, has the code improved and streamlined. One of my favorite techniques, multivariate adaptive regression splines, has been added; it performs well, handles non-linearity, and is easy to explain. It is my base model, with others becoming challengers that try to outperform it.
Chapter 4, Advanced Feature Selection in Linear Models, has techniques not only for regression but also for a classification problem included.
Chapter 5, More Classification Techniques – K-Nearest Neighbors and Support Vector Machines, has the code streamlined and simplified.
Chapter 6, Classification and Regression Trees, has the addition of the very popular techniques provided by the XGBOOST package. Additionally, I added the technique of using random forest as a feature selection tool.
Chapter 7, Neural Networks and Deep Learning, has been updated with additional information on deep learning methods and has improved code for the H2O package, including hyper-parameter search.
Chapter 8, Cluster Analysis, has the methodology of doing unsupervised learning with random forests added.
Chapter 9, Principal Components Analysis, uses a different dataset, and an out-of-sample prediction has been added.
Chapter 10, Market Basket Analysis, Recommendation Engines, and Sequential Analysis, has the addition of sequential analysis, which, I'm discovering, is more and more important, especially in marketing.
Chapter 11, Creating Ensembles and Multiclass Classification, has completely new content, using several great packages.
Chapter 12, Time Series and Causality, has a couple of additional years of climate data added, along with a demonstration of different methods of causality testing.
Chapter 13, Text Mining, has additional data and improved code.
Chapter 14, R on the Cloud, is another chapter of new content, allowing you to get R on the cloud, simply and quickly.
Appendix A, R Fundamentals, has additional data manipulation methods.
Appendix B, Sources, has a list of sources and references.
What you need for this book
As R is free and open source software, you will only need to download and install it from https://www.r-project.org/. Although it is not mandatory, it is highly recommended that you also download the RStudio IDE from https://www.rstudio.com/products/RStudio/.
Who this book is for
This book is for data science professionals, data analysts, or anyone with a working knowledge of machine learning with R, who now want to take their skills to the next level and become an expert in the field.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: The data frame is available in the R MASS package under the biopsy name.
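As an example of such a reference, the biopsy data frame mentioned above can be loaded as follows (a minimal sketch; MASS is a recommended package that ships with standard R distributions):

```r
library(MASS)   # the MASS package provides the biopsy data frame
data(biopsy)    # load it into the workspace
dim(biopsy)     # 699 observations of 11 variables
```

Inspecting a dataset with dim() or str() right after loading it is a quick sanity check before any modeling begins.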
Any command-line input or output is written as follows:
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: In order to download new modules, we will go to Files | Settings | Project Name | Project Interpreter.
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Machine-Learning-with-R-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringMachineLearningwithRSecondEdition_ColorImages.pdf.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report it to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to the list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address the problem.
A Process for Success
If you don't know where you are going, any road will get you there.
- Robert Carrol
If you can't describe what you are doing as a process, you don't know what you're doing.
- W. Edwards Deming
At first glance, this chapter may seem to have nothing to do with machine learning, but it has everything to do with machine learning (specifically, its implementation and making change happen). The smartest people, best software, and best algorithms do not guarantee success, no matter how well the problem is defined.
In most, if not all, projects, the key to successfully solving problems or improving decision-making is not the algorithm, but the softer, more qualitative skills of communication and influence. The problem many of us have with this is that it is hard to quantify how effective one is around these skills. It is probably safe to say that many of us ended up in this position because of a desire to avoid it. After all, the highly successful TV comedy The Big Bang Theory was built on this premise. Therefore, the goal of this chapter is to set you up for success. The intent is to provide a process, a flexible process no less, where you can become a change agent: a person who can influence and turn their insights into action without positional power. We will focus on Cross-Industry Standard Process for Data Mining (CRISP-DM). It is probably the most well-known and respected of all processes for analytical projects. Even if you use another industry process or something proprietary, there should still be a few gems in this chapter that you can take away.
I will not hesitate to say that this all is easier said than done; without question, I'm guilty of every sin (both commission and omission) that will be discussed in this chapter. With skill and some luck, you can avoid the many physical and emotional scars I've picked up over the last 12 years.
Finally, we will also have a look at a flow chart (a cheat sheet) that you can use to help you identify what methodologies to apply to the problem at hand.
The process
The CRISP-DM process was designed specifically for data mining. However, it is flexible and thorough enough to be applied to any analytical project, whether it is predictive analytics, data science, or machine learning. Don't be intimidated by the numerous lists of tasks as you can apply your judgment to the process and adapt it for any real-world situation. The following figure provides a visual representation of the process and shows the feedback loops that make it so flexible:
Figure 1: CRISP-DM 1.0, Step-by-step data mining guide
The process has the following six phases:
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
For an in-depth review of the entire process with all of its tasks and subtasks, you can examine the paper by SPSS, CRISP-DM 1.0, step-by-step data mining guide, available at https://the-modeling-agency.com/crisp-dm.pdf.
I will discuss each of the steps in the process, covering the important tasks. However, the discussion will not be as detailed as the guide; it will be more high-level. We will not skip any of the critical details but will focus more on the techniques that one can apply to the tasks. Keep in mind that these process steps will be used in later chapters as a framework in the actual application of the machine learning methods in general, and the R code in particular.
Business understanding
One cannot overstate how important this first step of the process is in achieving success. It is the foundational step, and failure or success here will likely determine failure or success for the rest of the project. The purpose of this step is to identify the requirements of the business so that you can translate them into analytical objectives. It has the following four tasks:
Identifying the business objective.
Assessing the situation.
Determining analytical goals.
Producing a project plan.
Identifying the business objective
The key to this task is to identify the goals of the organization and frame the problem. An effective question to ask is, What are we going to do differently? This may seem like a benign question, but it can really challenge people to work out what they need from an analytical perspective, and it can get to the root of the decision that needs to be made. It can also prevent you from going out and doing a lot of unnecessary work on some kind of fishing expedition.
As such, the key for you is to identify the decision. A working definition of a decision can be put forward to the team as the irrevocable choice to commit or not commit the resources. Additionally, remember that the choice to do nothing different is indeed a decision.
This does not mean that a project should not be launched if the choices are not absolutely clear. There will be times when the problem is not, or cannot be, well defined; to paraphrase former Defense Secretary Donald Rumsfeld, there are known-unknowns. Indeed, there will probably be many times when the problem is ill defined and the project's main goal is to further the understanding of the problem and generate hypotheses; again calling on Secretary Rumsfeld, unknown-unknowns, which means that you don't know what you don't know. However, with ill-defined problems, one could go forward with an understanding of what will happen next in terms of resource commitment based on the various outcomes from hypothesis exploration.
Another thing to consider in this task is the management of expectations. There is no such thing as perfect data, no matter what its depth and breadth are. This is not the time to make guarantees but to communicate what is possible, given your expertise.
I recommend a couple of outputs from this task. The first is a mission statement. This is not the touchy-feely mission statement of an organization, but it is your mission statement or, more importantly, the mission statement approved by the project sponsor. I stole this idea from my years of military experience and I could write volumes on why it is effective, but that is for another day. Let's just say that, in the absence of clear direction or guidance, the mission statement, or whatever you