September 11, 2018 → Big data and Intro to R

Today’s Topics
- Review big data & discuss the 6 V’s
- Installation of R and R Studio
- Intro to R objects
- Intro to tidyverse data transformations
- Piping data operators
- Exercise in Class

Big Data
2001: original 3 V’s → goal: define what big data is
- Volume → tiered storage/hub and spoke, selective data retention, statistical sampling,
redundancy elimination, offloading “cold” data, outsourcing
- Velocity → operational data stores, data caches, point-to-point data routing, balancing data latency
with decision cycles
- Variety → inconsistency resolution, XML translation, application-aware EAI adapters,
middleware, distributed query management, metadata management
Need to get added value out of data (4th V)

Volume
- Big data implies enormous amounts of structured and unstructured data that is
generated by social and sensor networks, transaction and search history, and manual
data collection
Variety
- Data comes from a variety of sources and contains both structured and unstructured data
- Data types include numbers, text, emails, images, text messages, web pages, audio, etc.
Velocity
- Flow of data that needs to be stored and analyzed is continuous
- Analyzed in real time to gain a strategic advantage
- Sampling can help mitigate some of the problems with large data volume and velocity
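
As a rough illustration of the sampling point, the sketch below draws a simple random sample from a large vector and compares summary statistics. The data are simulated with rnorm() purely for illustration (an assumption, not real class data), and the names big and small are hypothetical.

```r
# Simulated "large" data set; values are made up for illustration only.
set.seed(42)
big <- rnorm(1e6, mean = 50, sd = 10)

# Draw a simple random sample of 10,000 positions without replacement.
idx <- sample(length(big), size = 10000)
small <- big[idx]

# The sample is 1% of the data, yet its mean should be close to the full mean.
mean(big)
mean(small)
```

Working on small instead of big trades a little precision for a large drop in storage and processing cost.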
Veracity
- Data veracity characterizes inherent noise, biases, abnormalities, and mistakes present
in all data streams
- Data must be cleaned in real time, and processes must be established to keep “dirty data”
from accumulating
Validity
- Data may not be valid for its intended use
- Valid data for the intended use is essential to making decisions based on the data
- Ex. data from one state isn’t representative of whole U.S.
Volatility
- Data changes over time; volatility characterizes the degree to which data changes
- Decisions and analyses are based on data that has an “expiration date”
- Data scientists must define at what point in time a data stream is no longer relevant
Value
- Value provided by big data

R
Packages in R
- Enormous advantage -- new techniques are available without delay
- Allows you to build a customized statistical program suited to your own needs
- Downside: as the number of packages increases, it gets hard to choose the package that best suits your needs

R Object
- Almost all things in R are objects
- Graphics are written out and not stored as objects
- Scripts are used to make objects. Write a script that by its end, creates the objects and
graphics you need
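- Objects are described by their mode and class, and every object also has a length:
- MODE: how objects are stored in R (character, numeric, logical, list, function); a factor is stored as integer codes, so its mode is numeric
- LENGTH: number of elements; all objects in R have a length, returned by length()
- CLASS: how objects are treated by functions (matrix, array, data.frame, factor, etc.)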
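
A minimal sketch of inspecting mode, length, and class with the base R functions mode(), length(), and class(). The objects v, m, f, and df are hypothetical examples, not objects from the lecture.

```r
# Hypothetical objects used only for illustration.
v  <- c(1.5, 2, 3)                               # numeric vector
m  <- matrix(1:6, nrow = 2)                      # 2 x 3 matrix
f  <- factor(c("a", "b", "a"))                   # factor
df <- data.frame(x = 1:3, y = c("a", "b", "c"))  # data frame

mode(v);  length(v);  class(v)    # storage mode, element count, class
mode(m);  length(m);  class(m)    # a matrix has length nrow * ncol
mode(f);  length(f);  class(f)    # factors are stored as integer codes
mode(df); length(df); class(df)   # a data frame is a list of columns
```

Note that length() of a data frame counts its columns, because a data frame is stored as a list of column vectors.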

R Mode of a Variable (data type)
- Numeric
- Logical (TRUE, FALSE)
- Character (text strings)
- Factor (small set of character values, stored as integer codes)
- Date (represented as the number of days since 1970-01-01)
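
The sketch below creates one value of each type listed above and peeks at how factors and dates are stored internally. The variable names and values are made up for illustration.

```r
n <- 3.14                              # numeric
b <- TRUE                              # logical
s <- "hello"                           # character
f <- factor(c("low", "high", "low"))   # factor with two levels
d <- as.Date("1970-01-10")             # date

class(n); class(b); class(s); class(f); class(d)

levels(f)       # the distinct values, sorted alphabetically: "high" "low"
as.integer(f)   # the integer codes behind the factor: 2 1 2
as.numeric(d)   # days since 1970-01-01: 9
```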

R Object Definition
- Object: collection of atomic variables and/or other objects that belong together
- Advantages:
- Encapsulation (use objects and methods someone else has written w/o caring
about internals)
- Generic functions
- Inheritance
- Disadvantages:
- Overcomplicated, baroque program architecture
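
One way to see generic functions and inheritance at work is R's S3 system, sketched below. The classes "animal" and "dog" and the generics speak() and describe() are hypothetical names invented for this example; only UseMethod(), structure(), and inherits() are base R.

```r
# Generic functions dispatch on the class of their first argument.
speak    <- function(x, ...) UseMethod("speak")
describe <- function(x, ...) UseMethod("describe")

# Methods for the hypothetical parent class "animal".
speak.animal    <- function(x, ...) paste(x$name, "makes a sound")
describe.animal <- function(x, ...) paste("Animal named", x$name)

# The hypothetical child class "dog" overrides speak() only.
speak.dog <- function(x, ...) paste(x$name, "barks")

pet <- structure(list(name = "Pet"), class = "animal")
rex <- structure(list(name = "Rex"), class = c("dog", "animal"))

speak(pet)               # uses speak.animal
speak(rex)               # uses speak.dog
describe(rex)            # no describe.dog, so it inherits describe.animal
inherits(rex, "animal")  # TRUE
```

The caller only needs to know speak() and describe(); which method runs is decided by the object's class vector.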

R Vectors
- Simplest R data object
- What is a single scalar variable in other languages is a vector of length 1 in R
- Vector: ordered collection of data of the same mode
- 1d, same mode (character, numeric, integer, etc.), create a vector using c()
- Ex. a <- c(1, 2, 3, 4)
- R will coerce the mode of a vector to fit the data it is given (e.g. mixing numbers and text gives a character vector)
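
A short sketch of vector creation and mode coercion, using only base R. The names x, mixed, and nums are hypothetical.

```r
a <- c(1, 2, 3, 4)
length(a)   # 4
mode(a)     # "numeric"

# A single value is still a vector, just of length 1.
x <- 5
length(x)   # 1

# Mixing modes forces R to pick one mode that can hold every element.
mixed <- c(1, "two", TRUE)
mode(mixed)   # "character"
mixed         # "1" "two" "TRUE"

nums <- c(1, TRUE, FALSE)
mode(nums)    # "numeric"; TRUE becomes 1 and FALSE becomes 0
```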

R Matrices
- Matrix: rectangular table of data where all components/elements are of the same type
- 2d, convert a vector to a matrix with the dim function
- Ex. dim(a) <- c(2, 2)
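
A small sketch of turning a vector into a matrix by assigning dim(). The alternative matrix() call with byrow = TRUE is added here for comparison and is not part of the original notes.

```r
a <- c(1, 2, 3, 4)
dim(a) <- c(2, 2)   # reshape in place: 2 rows, 2 columns
a
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4

# R fills matrices column by column. To fill row by row, use matrix().
b <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
b[1, 2]   # 2
```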
Accessing data in vectors, matrices, arrays
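- Individual elements of a vector, matrix, or array are accessed with "[]"
- Positive value → selects specific elements
- Negative value → excludes specific elements
- Logical vector → selects the elements where the index is TRUE
- x <- sample(1:3, 9, replace=TRUE)
- [1] 1 2 1 3 1 2 3 1 2 (example output; the draw is random)
- IndexOne <- x == 1
- TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
- x[IndexOne] <- 0 (sets every element that equals 1 to 0)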
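
The indexing rules can be sketched as follows. A fixed vector stands in for the random sample() draw so the results are reproducible; index_one is a hypothetical name.

```r
x <- c(1, 2, 1, 3, 1, 2, 3, 1, 2)

x[2]         # positive index: the second element, 2
x[c(1, 4)]   # several positions at once: 1 3
x[-1]        # negative index: everything except the first element

# Logical indexing: build a TRUE/FALSE mask, then use it to select or assign.
index_one <- x == 1
x[index_one]        # the elements equal to 1
x[index_one] <- 0   # overwrite them with 0
x                   # 0 2 0 3 0 2 3 0 2
```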
R Data Frames
- Rectangular data object made of column vectors that all have the same length
- Columns can be of different data types
- All elements in a column are of the same data type
- Data frames are a commonly used object; each row represents an observation and each column is a
variable
- Function data.frame() converts a matrix or a collection of vectors into a data frame
- Unlike a matrix or array, a data frame can hold different data types (one per column)
- Rows and columns may have names
- data_frame() → creates a tibble, the tidyverse’s specific type of data frame (newer code uses tibble())
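
A minimal base R sketch of building a data frame from vectors and from a matrix. The column names and values are invented for illustration.

```r
# Three vectors of equal length, each with its own data type.
df <- data.frame(
  name    = c("Ann", "Bob", "Cy"),
  age     = c(25, 31, 47),
  married = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE   # keep text columns as character
)

str(df)          # one line per column, showing its type
nrow(df)         # 3 observations
ncol(df)         # 3 variables
df$age           # one column as a vector
df[2, "name"]    # row 2, column "name"

# A matrix can be converted as well; every column then shares one type.
m   <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("u", "v")))
mdf <- data.frame(m)
names(mdf)       # "u" "v"
```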

R Lists
- Lists are the most general type of object
- Ordered collection of data of different data types
- list() creates a list → ex. subject1 <- list(name="riley", age=25, married=FALSE)
- Access to components is typically through the name
- subject1$name or subject1[[1]] returns the component itself; subject1[1] returns a list of length 1
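
The sketch below shows the common ways to reach into a list, including the difference between single and double brackets. It reuses the subject example from the notes.

```r
subject1 <- list(name = "riley", age = 25, married = FALSE)

subject1$name        # by name: "riley"
subject1[["age"]]    # by name with double brackets: 25
subject1[[1]]        # by position: "riley"

# Single brackets keep the list wrapper around the selected component.
class(subject1[1])     # "list"
class(subject1[[1]])   # "character"
length(subject1)       # 3 components
```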

R Formulas
- A formula is a data object with class "formula"
- y ~ x
- y is modeled as a function of x
- x is the predictor variable and y is the response variable
- Operators on the right-hand side:
- +x: include variable x in the model
- -x: remove variable x from the model
- z:x: include the interaction between z and x
- z*x: include x, z, and their interaction in the model
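
A hedged sketch of the formula operators in use with lm(). The built-in mtcars data set and the choice of mpg, wt, and hp as variables are assumptions made for illustration; the notes do not name a data set.

```r
f <- mpg ~ wt
class(f)   # "formula"

# + adds predictors; * expands to main effects plus the interaction.
fit_main <- lm(mpg ~ wt + hp, data = mtcars)
fit_full <- lm(mpg ~ wt * hp, data = mtcars)
fit_long <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)   # same model as fit_full

names(coef(fit_full))   # "(Intercept)" "wt" "hp" "wt:hp"

# - removes a term; here it drops the interaction from the full model.
fit_drop <- update(fit_full, . ~ . - wt:hp)
names(coef(fit_drop))   # "(Intercept)" "wt" "hp"
```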
