Anda di halaman 1dari 8

Step 3: Explore and Visualize the Data In

Database Advanced Analytics Tutorial


SQL Server 2016 and later

Updated: April 19, 2016


Applies To: SQL Server 2016
Developing a data science solution usually includes intensive data exploration and data visualization. In this step, you'll review
the sample data, and then generate some plots using R functions. These R functions are already included in R Services In
database; in this walkthrough, you'll practice calling R functions from TransactSQL.

Review the Data


First, take a minute to review the sample data, if you haven't already.
In the original dataset, the taxi identifiers and trip records were provided in separate files. However, to make the sample data
easier to use, the two original datasets have been joined on the columns medallion, hack_license, and pickup_datetime. The
records were also sampled to get just 1% of the original number of records. The resulting downsampled dataset has 1,703,957
rows and 23 columns.
Taxi identifiers
The medallion column represents the taxis unique id number.
The hack_license column contains the taxi driver's license number anonymized .
Trip and fare records
Each trip record includes the pickup and dropoff location and time, and the trip distance.
Each fare record includes payment information such as the payment type, total amount of payment, and the tip amount.
The last three columns can be used for various machine learning tasks. The tip_amount column contains continuous
numeric values and can be used as the label column for regression analysis. The tipped column has only yes/no values
and is used for binary classification. The tip_class column has multiple class labels and therefore can be used as the label
for multiclass classification tasks.
This walkthrough demonstrates only the binary classification task; you are welcome to try building models for the other
two machine learning tasks, regression and multiclass classification.
The values used for the label columns are all based on the tip_amount column, using these business rules:

Derived column name

Rule

tipped

If tip_amount > 0, tipped = 1, otherwise tipped = 0

tip_class

Class 0: tip_amount = $0
Class 1: tip_amount > $0 and tip_amount <= $5
Class 2: tip_amount > $5 and tip_amount <= $10
Class 3: tip_amount > $10 and tip_amount <= $20
Class 4: tip_amount > $20

Create Plots using R in T-SQL


Because visualization is such a powerful tool for understanding the distribution of the data and outliers, R provides many
packages for visualizing data. The standard opensource distribution of R also includes many functions for crating histograms,
scatter plots, box plots, and other data exploration graphs.
R typically creates images using an R device for graphical output. You can capture the output of this device and store the image
in a varbinary data type for rendering in application, or you can save the images to any of the support file formats .JPG, .PDF,
etc..
In this section, you'll learn how to work with each type of output using stored procedures.
Storing plots as varbinary data type
Saving plots in files .JPG, .PDF on the server

Storing plots as varbinary data type


You'll use rxHistogram, one of the enhanced R functions provided in R Services Indatabase, to plot a histogram based on
data from a TransactSQL query. To make it easier to call the R function, you will wrap it in a stored procedure, PlotHistogram.
The stored procedure returns the image as a stream of varbinary data, which obviously you cannot view directly. However, you
can use the bcp utility to get the varbinary data and save it as an image file on a client computer.
To create the stored procedure PlotHistogram
1. In SQL Server Management Studio, open a new query window.
2. Select the database for the walkthrough, and create the procedure using this statement.

CREATEPROCEDURE[dbo].[PlotHistogram]
AS
BEGIN
SETNOCOUNTON;
DECLARE@querynvarchar(max)=
N'SELECTtippedFROMnyctaxi_sample'
EXECUTEsp_execute_external_script@language=N'R',
@script=N'

image_file=tempfile();
jpeg(filename=image_file);
#Plothistogram
rxHistogram(~tipped,data=InputDataSet,col=''lightgreen'',
title=''TipHistogram'',xlab=''Tippedornot'',ylab=''Counts'');
dev.off();
OutputDataSet<data.frame(data=readBin(file(image_file,"rb"),what=raw(),n=1e6));
',
@input_data_1=@query
WITHRESULTSETS((plotvarbinary(max)));
END
GO

Be sure to modify the code to use the correct table name, if needed.
The variable @query defines the query text 'SELECTtippedFROMnyctaxi_sample', which is passed to the R
script as the argument to the script input variable, @input_data_1.
The R script is fairly simple: an R variable image_file is defined to store the image, and then the rxHistogram
function is called to generate the plot.
The R device is set to off.
In R, when you issue a highlevel plotting command, R will open a graphics window, called a device. You can
change the size and colors and other aspects of the window, or you can turn the device off if you are writing to a
file or handling the output some other way.
The R graphics object is serialized to an R data.frame for output. This is a temporary workaround for CTP3.
To output varbinary data to viewable graphics file
1. In Management Studio, run the following statement:

EXEC[dbo].[PlotHistogram]

Results
plot
0xFFD8FFE000104A4649...
2. Open a PowerShell command prompt and run the following command, providing the appropriate instance name,
database name, username, and credentials as arguments:

bcp"execPlotHistogram"queryout"plot.jpg"S<SQLServerinstancename>d<database
name>U<username>P<password>

Note

Command switches for bcp are casesensitive.


3. If the connection is successful, you will be prompted to enter more information about the graphic file format. Press ENTER
at each prompt to accept the defaults, except for these changes:
For prefixlength of field plot, type 0
Type Y if you want to save the output parameters for later reuse.

Enterthefilestoragetypeoffieldplot[varbinary(max)]:
Enterprefixlengthoffieldplot[8]:0
Enterlengthoffieldplot[0]:
Enterfieldterminator[none]:
Doyouwanttosavethisformatinformationinafile?[Y/n]
Hostfilename[bcp.fmt]:

Results
Starting copy...
1 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 3922 Average : (0.25 rows per sec.)

Tip

If you save the format information to file bcp.fmt, the bcp utility generates a format definition that you can apply to similar
commands in future without being prompted for graphic file format options. To use the format file, add fbcp.fmt to the
end of any command line, after the password argument.
4. The output file will be created in the same directory where you ran the PowerShell command. To view the plot, just open
the file plot.jpg.

Saving plots in files (jpg, pdf) on the server


Outputting an R plot to a binary data type might be convenient for consumption by applications, but it is not very useful to a
data scientist who needs the rendered plot during the data exploration stage. Typically the data scientist will generate multiple
data visualizations, to get insights into the data from different perspectives.
To generate graphs that can be more easily viewed, you can use a stored procedure that creates the output of R in popular
formats such as .JPG, .PDF, and .PNG. After the stored procedure creates the graphic, simply open the file to visualize the plot.
In this step, you'll create a new stored procedure, PlotInOutputFiles, that demonstrates how to use R plotting functions to create
histograms, scatterplots, and more in .JPG and .PDF format. The graphics files are saved to local files; that is, in a folder on the
SQL Server instance.
To create the stored procedure PlotInOutputFiles
1. In SQL Server Management Studio, open a new query window, and paste in the following TransactSQL statement.

CREATEPROCEDURE[dbo].[PlotInOutputFiles]
AS
BEGIN
SETNOCOUNTON;
DECLARE@querynvarchar(max)=
N'SELECTcast(tippedasint)astipped,tip_amount,fare_amountFROM[dbo].
[nyctaxi_sample]'
EXECUTEsp_execute_external_script@language=N'R',
@script=N'
#Setoutputdirectoryforfilesandcheckforexistingfileswithsamenames
mainDir<''C:\\temp\\plots''
dir.create(mainDir,recursive=TRUE,showWarnings=FALSE)
setwd(mainDir);
print("Creatingoutputplotfiles:",quote=FALSE)

#Openajpegfileandoutputhistogramoftippedvariableinthatfile.
dest_filename=tempfile(pattern=''rHistogram_Tipped_'',tmpdir=mainDir)
dest_filename=paste(dest_filename,''.jpg'',sep="")
print(dest_filename,quote=FALSE);
jpeg(filename=dest_filename);
hist(InputDataSet$tipped,col=''lightgreen'',xlab=''Tipped'',
ylab=''Counts'',main=''Histogram,Tipped'');
dev.off();

#Openapdffileandoutputhistogramsoftipamountandfareamount.
#Outputstwoplotsinonerow
dest_filename=tempfile(pattern=''rHistograms_Tip_and_Fare_Amount_'',tmpdir=
mainDir)
dest_filename=paste(dest_filename,''.pdf'',sep="")
print(dest_filename,quote=FALSE);
pdf(file=dest_filename,height=4,width=7);
par(mfrow=c(1,2));
hist(InputDataSet$tip_amount,col=''lightgreen'',
xlab=''Tipamount($)'',
ylab=''Counts'',
main=''Histogram,Tipamount'',xlim=c(0,40),100);
hist(InputDataSet$fare_amount,col=''lightgreen'',
xlab=''Fareamount($)'',
ylab=''Counts'',
main=''Histogram,
Fareamount'',
xlim=c(0,100),100);
dev.off();
#Openapdffileandoutputanxyplotoftipamountvs.fareamountusinglattice;
#Only10,000sampledobservationsareplottedhere,otherwisefileislarge.
dest_filename=tempfile(pattern=''rXYPlots_Tip_vs_Fare_Amount_'',tmpdir=mainDir)
dest_filename=paste(dest_filename,''.pdf'',sep="")
print(dest_filename,quote=FALSE);
pdf(file=dest_filename,height=4,width=4);
plot(tip_amount~fare_amount,
data=InputDataSet[sample(nrow(InputDataSet),10000),],
ylim=c(0,50),
xlim=c(0,150),
cex=.5,
pch=19,
col=''darkgreen'',
main=''TipamountbyFareamount'',
xlab=''FareAmount($)'',
ylab=''TipAmount($)'');
dev.off();',
@input_data_1=@query
END

The output of the SELECT query within the stored procedure is stored in the default R data frame, InputDataSet.
Various R plotting functions can then be called to generate the actual graphics files.
Most of the embedded R script represents options for these graphics functions, such as plot or hist.

All files are saved to the local folder C:\temp\Plots\.


The destination folder is defined by the arguments provided to the R script as part of the stored procedure. You
can change the destination folder by changing the value of the variable, mainDir.
2. Run the statement to create the stored procedure.
To generate the graphics files
1. In SQL Server Management Studio, run the following SQL query:

EXECPlotInOutputFiles

Results
STDOUT message(s) from external script:
[1] Creating output plot files:
[1]C:\\temp\\plots\\rHistogram_Tipped_18887f6265d4.jpg[1]C:\\temp\\plots\\rHistograms_Tip_and_Fare_Amount_1888441e542c.p
df[1]C:\\temp\\plots\\rXYPlots_Tip_vs_Fare_Amount_18887c9d517b.pdf
2. Open the destination folder and review the files that were created by the R code in the stored procedure. The numbers in
the file names are randomly generated.
rHistogram_Tipped_nnnn.jpg: Shows the number of trips that got a tip 1 vs. the trips that got no tip 0. This histogram is
much like the one you generated in the previous step.
rHistograms_Tip_and_Fare_Amount_nnnn.pdf: Shows the distribution of values in the tip_amount and fare_amount
columns.

![](../Image/RSQL%20images/rsql_devtut_TipAmtFareAmt.PNG)

rXYPlots_Tip_vs_Fare_Amount_nnnn.pdf: A scatterplot with the fare amount on the xaxis and the tip amount on the y
axis.

![](../Image/RSQL%20images/rsql_devtut_TipAmtByFareAmt.PNG)

3. To output the files to a different folder, change the value of the mainDir variable in the R script embedded in the stored
procedure.
You can also modify the script to output different formats, more files, and so on.

Next Step
Step 4: Create Data Features using TSQL

Previous Step
Step 2: Import Data to SQL Server using PowerShell

See Also
InDatabase Advanced Analytics for SQL Developers Tutorial
SQL Server R Services Tutorials
2016 Microsoft

Anda mungkin juga menyukai