
Al. I. Cuza University of Iași
Faculty of Economics and Business Administration
Department of Accounting, Information Systems and Statistics

Data Analysis & Data Science with R
Functions for dealing with NAs, NULLs, dates, strings, regular expressions, etc.
By Marin Fotache

R script associated with this presentation
04a_functions_for_NAs_strings_etc.R
http://1drv.ms/1E1m81i

Web sites with R tutorials for system functions
http://www.sr.bham.ac.uk/~ajrs/R/r-function_list.html

Two main types of missing values

NA:
  Stands for Not Available
  Is the equivalent of NULL in relational databases
  When importing data from Excel, tab-delimited files, etc., unknown values are usually represented by NAs
  Not to be confused with the string "NA" (which sometimes occurs when importing)
  Main function: is.na()

NULL:
  Completely different from NULL in relational databases
  Within a vector, an element can be NA but not NULL (NULL is an empty object of length zero); if used inside a vector, a NULL element simply disappears
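The difference between a real NA and the string "NA" can be sketched with a small vector. The vector `raw` and its values are hypothetical, chosen only to imitate what an import step might produce.

```r
# Hypothetical values as they might arrive from a text import:
# the second element is the string "NA", the fourth is a real NA.
raw <- c("12", "NA", "7", NA)

is.na(raw)        # FALSE FALSE FALSE  TRUE  (the string "NA" is not missing)
raw == "NA"       # FALSE  TRUE FALSE    NA

# Turn the "NA" strings into real missing values; which() skips the NA result.
fixed <- raw
fixed[which(fixed == "NA")] <- NA
is.na(fixed)      # FALSE  TRUE FALSE  TRUE

values <- as.numeric(fixed)   # 12 NA 7 NA
```

When reading files, the na.strings argument of read.csv/read.table can usually do this conversion at import time.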

Function is.na
Create a very simple vector:
> y <- c(1, 2, 3, NA)
Function is.na applied to a vector shows, for each element, whether that element is NA:
> is.na(y)
[1] FALSE FALSE FALSE  TRUE
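A few related base R calls are often used together with is.na. The sketch below reuses the vector y from this slide; replacing NA with 0 is only an illustrative choice, not a recommendation.

```r
y <- c(1, 2, 3, NA)

y == NA                # NA NA NA NA: comparing with NA never gives TRUE
which(is.na(y))        # 4, the position of the missing element
anyNA(y)               # TRUE
mean(y)                # NA, because the missing value propagates
mean(y, na.rm = TRUE)  # 2, the NA is dropped before averaging

# Replace missing values with 0 in a copy (illustrative choice only)
y_filled <- y
y_filled[is.na(y_filled)] <- 0
```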

Function na.fail
Function na.fail checks whether there are NA values in a data set; it generates an error if there is at least one NA in any column of the data set, and otherwise returns the data set unchanged
Data frames student_gi, patientdata, mpi, FuelEfficiency.new and ToyotaCorolla do not contain NA values

> na.fail(student_gi)
                  name age scholarship lab_assessment final_grade
1001 Popescu I. Vasile  23      Social           Bine        9.00
1002  Ianos W. Adriana  19     Studiu1    Foarte bine        9.45
1003   Kovacz V. Iosef  21     Studiu2       Excelent        9.75
1004  Babadag I. Maria  22       Merit           Bine        9.00
1005        Pop P. Ion  31     Studiu1           Slab        6.00

Function na.fail (cont.)
Data frame leadership contains at least one NA value, so na.fail will generate an error:
> na.fail(leadership)
Error in na.fail.default(leadership) : missing values in object
> leadership
  manager       date country gender age q1 q2 q3 q4 q5
1       1 2010/10/24      US      M  32  5  4  5  5  5
2       2 1995/10/28      US      F  45  3  5  2  5  5
3       3  1985/10/1      UK      F  25  3  5  5  5  2
4       4 2000/12/10      UK      M  59  3  3  4
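Because na.fail stops with an error, a script may want to catch that error rather than abort. A minimal sketch, assuming a small hypothetical data frame named scores (not one of the course data sets):

```r
# Hypothetical data frame with one missing answer
scores <- data.frame(id = 1:3, q1 = c(5, NA, 4))

# na.fail() raises an error here; tryCatch() turns it into a NULL result
checked <- tryCatch(
  na.fail(scores),
  error = function(e) {
    message("na.fail stopped: ", conditionMessage(e))
    NULL
  }
)
is.null(checked)   # TRUE, the check failed

# On rows without NAs, na.fail() simply returns its input
complete_rows <- na.fail(scores[c(1, 3), ])
```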

Checking NAs for a range of elements
Display, for each student, whether the e-mail is missing or not:
> head(studs2014)
> is.na(studs2014[,"email"])
Display only the students with a missing e-mail address:
> studs2014[is.na(studs2014$email),]
Display, for each observation, whether variables q1:q5 are NA in data frame leadership:
> is.na(leadership[,6:10])
        q1    q2    q3    q4    q5
[1,] FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE

Counting NAs
Counting the number of NA values within an entire data frame is possible with function sum and the following syntax:
sum(is.na(the.data.frame))
How many NA values are there in data frame leadership?
> sum(is.na(leadership))
[1] 2
How many NA values are there in data frame comp?
> sum(is.na(comp))
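The same idea extends to counts per column or per row with colSums and rowSums. The sketch below uses a small hypothetical data frame, survey, since the course data sets are not reproduced here.

```r
# Hypothetical data frame with three missing values
survey <- data.frame(
  age = c(32, NA, 25, 59),
  q1  = c(5, 3, NA, NA),
  q2  = c(4, 5, 5, 3)
)

total_na      <- sum(is.na(survey))       # 3 NAs in the whole data frame
na_per_column <- colSums(is.na(survey))   # age 1, q1 2, q2 0
na_per_row    <- rowSums(is.na(survey))   # 0 1 1 1
```

This works because is.na returns a logical matrix and TRUE counts as 1 when summed.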

Function complete.cases
Display the observations/rows which have at least one NA:
> leadership[!complete.cases(leadership),]
  manager       date country gender age q1 q2 q3 q4 q5
4       4 2000/12/10      UK      M  59  3  3  4 NA NA
> comp[!complete.cases(comp),]
Count how many observations/rows have at least one NA and how many have no NAs (are complete cases):
> table(complete.cases(leadership))

FALSE  TRUE 
    1     4 
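A compact sketch of how complete.cases can be used to split a data frame, and how it relates to na.omit. The data frame answers is hypothetical.

```r
# Hypothetical data frame: rows 2 and 3 contain at least one NA
answers <- data.frame(
  id = 1:4,
  q1 = c(5, 3, NA, 4),
  q2 = c(4, NA, NA, 2)
)

ok <- complete.cases(answers)      # TRUE FALSE FALSE  TRUE

complete_part   <- answers[ok, ]   # rows 1 and 4
incomplete_part <- answers[!ok, ]  # rows 2 and 3
counts <- table(ok)                # FALSE: 2, TRUE: 2

# na.omit() keeps exactly the complete rows
kept <- na.omit(answers)
```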

Counting/displaying NAs for variables and ranges
How many students have no e-mail address?
Using sum...
> sum(is.na(studs2014$email))
[1] 1617
...or table
> table(is.na(studs2014$email))

FALSE  TRUE 
 4875  1617 
Display the observations in which at least one of the variables q1:q5 is NA in data frame leadership:
> leadership[!complete.cases(leadership[6:10]),]
  manager       date country gender age q1 q2 q3
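Restricting complete.cases to a subset of columns means that NAs in the other columns are ignored. A minimal sketch, assuming a hypothetical data frame staff and selecting the columns by name rather than by position:

```r
# Hypothetical data frame: e-mail is missing in rows 2 and 4,
# answers are missing in rows 3 (q1) and 4 (q2)
staff <- data.frame(
  manager = 1:4,
  email   = c("a@x.org", NA, "c@x.org", NA),
  q1      = c(5, 3, NA, 4),
  q2      = c(4, 5, 2, NA)
)

q_cols <- c("q1", "q2")
incomplete_q <- !complete.cases(staff[, q_cols])   # FALSE FALSE TRUE TRUE

# Row 2 is not selected: its only NA is in email, outside q_cols
staff[incomplete_q, ]
```

Selecting by name is a design choice that keeps the code valid if the column order changes.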

NULLs
Completely different from NULL in databases (NA is pretty close to the concept of NULL in relational databases)
Within a vector, an element can be NA but not NULL (NULL is an empty object of length zero)
When used inside a vector, a NULL element simply disappears
> v.with.na <- c(1, 3, 4, NA, 5, 6)
> v.with.null <- c(1, 3, 4, NULL, 5, 6)
> v.with.na
[1]  1  3  4 NA  5  6
> v.with.null
[1] 1 3 4 5 6

NULLs (cont.)
> length(v.with.na)
[1] 6
> length(v.with.null)
[1] 5
> sum(v.with.na)
[1] NA
> sum(v.with.null)
[1] 19
In a data frame, a variable/column set to NULL also disappears:
> names(adl2013_stud)
[1] "Nr"       "NumePren" "Matricol" "Email"
> adl2013_stud$Nr <- NULL
> names(adl2013_stud)
[1] "NumePren" "Matricol" "Email"
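A short sketch of how NULL behaves in vectors, lists and data frames. All object names below (v, items, students) are hypothetical.

```r
# In an atomic vector, NULL contributes nothing
v <- c(1, NULL, 3)
length(v)        # 2

# NULL itself has length zero and is tested with is.null(), not is.na()
is.null(NULL)    # TRUE
length(NULL)     # 0

# A list created with list() can hold NULL as an element...
items <- list(a = 1, b = NULL, c = NA)
length(items)    # 3

# ...but assigning NULL to an existing element removes it
items$a <- NULL
names(items)     # "b" "c"

# The same assignment drops a column from a data frame
students <- data.frame(Nr = 1:2, Email = c("a@x.org", "b@x.org"))
students$Nr <- NULL
names(students)  # "Email"
```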

Functions for managing string variables

Base R
  nchar
  substr
  strsplit
  paste / paste0 / sprintf

Package stringr
  str_c() - string concatenation ~ paste()
  str_length() - number of characters ~ nchar()
  str_sub() - extracts substrings ~ substring()
  str_dup() - duplicates characters ~ strrep() (R 3.3.0 and later)
  str_trim() - removes leading and trailing whitespace ~ trimws() (R 3.2.0 and later)
  str_pad() - pads a string ~ no direct equivalent
  str_wrap() - wraps a string paragraph ~ strwrap()
  word() - extracts words from a string ~ no equivalent
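The base R functions listed above can be tried on a single string. This is a minimal sketch using only base R; the input string is a hypothetical value modelled on the student names shown earlier.

```r
# Hypothetical input with stray whitespace
full_name <- "  Popescu I. Vasile "

clean <- trimws(full_name)                       # "Popescu I. Vasile"
n_chars <- nchar(clean)                          # 17
family <- substr(clean, 1, 7)                    # "Popescu"

# strsplit() returns a list; [[1]] takes the pieces for the first string
parts <- strsplit(clean, " ", fixed = TRUE)[[1]] # "Popescu" "I." "Vasile"

short <- paste(parts[3], parts[1])               # "Vasile Popescu"
code  <- paste0("id_", 1001)                     # "id_1001"
label <- sprintf("%s is %d years old", parts[1], 23L)
                                                 # "Popescu is 23 years old"
```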

Some web pages for processing strings in R
Gaston Sanchez - Handling and Processing Strings in R
http://gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf
John Myles White - Text Processing in R
http://www.johnmyleswhite.com/notebook/2009/02/25/text-processing-in-r/
Elana J. Fertig - Processing strings with R
https://www.academia.edu/1744442/String_processing_with_R

Regular expressions
Vital for text/string/web searching
Implemented in almost every programming language
In SQL the basic mechanism is rudimentary and based on operators: LIKE, ILIKE, SIMILAR TO
R has full support for regular expressions

Functions in base R:
  grep, grepl
  sub, gsub
  regexpr, gregexpr

Functions in package stringr:
  str_detect
  str_extract, str_extract_all, str_match, str_match_all
  str_locate, str_locate_all
  str_replace, str_replace_all
  str_split, str_split_fixed
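The base R functions listed above can be sketched on a few strings. The e-mail addresses, the phone text and the deliberately simple pattern are hypothetical illustrations, not a complete e-mail validator.

```r
# Hypothetical inputs
emails <- c("ana@uaic.ro", "not an e-mail", "ion.pop@feaa.uaic.ro", NA)

# Simplified pattern: local part, "@", domain, dot, lowercase suffix
pattern <- "^[[:alnum:]._-]+@[[:alnum:].-]+\\.[a-z]+$"

is_email <- grepl(pattern, emails)   # TRUE FALSE TRUE FALSE (NA gives FALSE)
positions <- grep(pattern, emails)   # 1 3

# sub() replaces the first match, gsub() replaces every match
users <- sub("@.*$", "", emails[positions])   # "ana" "ion.pop"
no_vowels <- gsub("[aeiou]", "", "regular")   # "rglr"

# gregexpr() locates all matches; regmatches() extracts them
phone_text <- "Phone: 0232 201000"
digits <- regmatches(phone_text, gregexpr("[0-9]+", phone_text))[[1]]
                                              # "0232" "201000"
```

The stringr functions listed above cover the same tasks with a more uniform naming scheme.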

Web sites/video tutorials for regular expressions (general and R-specific)
Basics of regular expressions (in general and in R)
http://www.rexegg.com/regex-quickstart.html
http://www.r-bloggers.com/regular-expression-and-associated-functions-in-r/
http://www.r-bloggers.com/r-talk-on-regular-expressions-regex/
Regular Expressions
https://www.youtube.com/watch?v=NvHjY
