Big Insights

IBM InfoSphere BigInsights Version 2.
Tutorials
GC19-4104-01
IBM InfoSphere BigInsights Version 2.1
Tutorials
GC19-4104-01
Note Before using this information and the product that it supports, read the information in Notices and trademarks on page 87.
Copyright IBM Corporation 2013. US Government Users Restricted Rights Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
Chapter 1. InfoSphere BigInsights Tutorials . . . . . . . . . . . . . . 1 Chapter 2. Tutorial: Managing your big data environment. . . . . . . . . . . 3
Lesson 1: Starting to use the InfoSphere BigInsights Console . . . . . . . . . . . . . . . . 3 Lesson 2: Exploring the InfoSphere BigInsights Console . . . . . . . . . . . . . . . . 4 Summary of managing your big data environment . 5 Lesson 6: Upgrading your application . . Summary of developing your first big data application . . . . . . . . . . . . . . . . 29 . 31
Chapter 6. Tutorial: Developing Big SQL queries to analyze big data . . . . 33

Lesson 1: Connecting to the IBM Big SQL server . . Lesson 2: Creating a project and an SQL script file Lesson 3: Creating tables and loading data . . . . Lesson 4: Running basic Big SQL queries . . . . Lesson 5: Analyzing the data with Big SQL . . . . Lesson 6: Publishing a Big SQL application . . . . Lesson 7: Analyzing the IBM Big SQL data in BigSheets . . . . . . . . . . . . . . . Lesson 8: Analyzing BigSheets data in IBM Big SQL tables . . . . . . . . . . . . . . . . Lesson 9: Analyzing the Big SQL data in the Big SQL Console . . . . . . . . . . . . . . Lesson 10: Analyzing the Big SQL data in a client spreadsheet application . . . . . . . . . . Lesson 11: Writing advanced Big SQL queries and including optimization hints . . . . . . . . . Summary of developing Big SQL queries to analyze big data . . . . . . . . . . . . . . . 34 36 37 42 45 48 51 54 58 59 62 64
Chapter 3. Tutorial: Importing data for analysis . . . . . . . . . . . . . . . 7

Lesson 1: Managing your data . . . . . . . . 8 Lesson 2: Importing data by using the BoardReader application . . . . . . . . . . . . . . . 8 Lesson 3: Importing data by using the Distributed File Copy application . . . . . . . . . . . 10 Summary of importing data to the distributed file system . . . . . . . . . . . . . . . . 11
Chapter 4. Tutorial: Analyzing big data with BigSheets . . . . . . . . . . . 13

Lesson 1: Creating master workbooks from social media data . . . . . . . . . . . . . Lesson 2: Tailoring your data by creating child workbooks . . . . . . . . . . . . . Lesson 3: Combining the data from two workbooks Lesson 4: Creating columns by grouping data . . Lesson 5: Viewing data in BigSheets diagrams . . Lesson 6: Visualizing and refining the results in charts . . . . . . . . . . . . . . . Lesson 7: Exporting data from your workbooks . Summary of analyzing data with BigSheets tutorial . 13 . 15 18 . 19 . 20 . 20 . 23 24
Chapter 7. Tutorial: Creating an extractor to derive valuable insights from text documents. . . . . . . . . 65
Lesson 1: Setting up your project . . . . . . Lesson 2: Selecting input documents and labeling examples . . . . . . . . . . . . . . Lesson 3: Writing and testing AQL . . . . . Summary - the basic lessons . . . . . . . Lesson 4: Writing and testing AQL for candidates Lesson 5: Writing and testing final AQL . . . . Lesson 6: Finalizing and exporting the extractor . Lesson 7: Publishing the AQL module . . . . Summary of creating your first text analytics application . . . . . . . . . . . . . . 66 . . . . . . . 67 70 77 77 82 84 84
Chapter 5. Tutorial: Developing your first big data application . . . . . . . 25

Lesson 1: Creating an InfoSphere BigInsights project Lesson 2: Creating and populating a Jaql file with application logic . . . . . . . . . . . . . Lesson 3: Testing your application . . . . . . . Lesson 4: Publishing your application in the InfoSphere BigInsights applications catalog . . . . Lesson 5: Deploying and running your application on the cluster . . . . . . . . . . . . . . 25 26 27 28 29
. 85
Notices and trademarks . . . . . . . 87 Providing comments on the documentation . . . . . . . . . . . 91
Copyright IBM Corp. 2013
iii
iv
IBM InfoSphere BigInsights Version 2.1: Tutorials
Chapter 1. InfoSphere BigInsights Tutorials
Manage
Import
Analyze Delve into BigSheets, an intuitive spreadsheet-like tool, to create analytic queries without any previous programming experience.
Within minutes, dive into the Collect and import data for world of big data with exploration and analysis that robust, browser-based helps you make sense of control. seemingly unrelated data.
Develop Easily develop your first big data application by using the InfoSphere BigInsights Eclipse plugin.
Query
Extract
Quickly master the intricacies Discover the power of Text of SQL queries for Hadoop Analytics by creating with IBM Big SQL. extractors to derive valuable insights from text documents.
Related information: InfoSphere BigInsights, Version 2.1: Tutorials (PDF)
Chapter 2. Tutorial: Managing your big data environment

Learn how to use the InfoSphere BigInsights Console to check the status of services, view the health of your system, and monitor the status of applications. Within minutes, you will be able to quickly navigate and use the InfoSphere BigInsights Console to manage your big data environment. This tutorial does not cover real-time monitoring of dashboards, or application linking. These topics are for more advanced users.
Learning objectives
After completing the lessons in this tutorial, you will have learned how to complete the following tasks: v Use the InfoSphere BigInsights Console to inspect the status of your cluster, start and stop components, and access tools that are available for open source components. v Work with the distributed file system. In particular, you will explore the Hadoop Distributed File System (HDFS) directory structure, create subdirectories, and upload files to HDFS. v Launch applications and inspect their status. You will also learn how to view output in BigSheets, a spreadsheet-like tool.
Time required
This module should take approximately 20 minutes to complete.
Lesson 1: Starting to use the InfoSphere BigInsights Console

In this lesson, you log into the InfoSphere BigInsights Console to explore the Welcome page, and ensure that all InfoSphere BigInsights services are running. These services, such as MapReduce and Hadoop Distributed File System (HDFS) or General Parallel File System (GPFS), are required for InfoSphere BigInsights to function correctly. You can start, stop, and manage these services directly from the InfoSphere BigInsights Console. In addition, you can use the InfoSphere BigInsights Console, to view the health of your cluster, deploy applications, manage cluster instances, work with BigSheets, manage files, and schedule workflows, jobs, and tasks from a single location. Procedures 1. Log in to the InfoSphere BigInsights Console.
Option In a non-SSL installation Description Enter the following URL in your browser: http://host_name:8080 host_name is the name of the host where the InfoSphere BigInsights Console is running.
Option In an SSL installation
Description Enter the following URL in your browser: https://host_name:8443 host_name is the name of the host where the InfoSphere BigInsights Console is running.
2. Explore each section of the Welcome page to learn more about the tasks and resources that are available.
Option Understand IBM big data tools Description Click the graphic to open an interactive model in the InfoSphere BigInsights Information Center. Quick access to commonly used InfoSphere BigInsights tasks. Links to internal and external quick links and downloads to enhance your environment. Online resources available to learn more about InfoSphere BigInsights.
Tasks Quick Links
Learn More
3. In the Welcome page, under Tasks, click View, start or stop a service. 4. Ensure that all services are running. If some of the InfoSphere BigInsights services are not running, click the service that is not running and then click Start. After you start these services, they should remain active for the remainder of this tutorial.
Lesson 2: Exploring the InfoSphere BigInsights Console

In this lesson, you navigate the different sections of the InfoSphere BigInsights Console to familiarize yourself with the capabilities that are available to you. Administrators use the InfoSphere BigInsights Console to inspect the overall health of the system, as well as complete basic functions such as starting and stopping specific servers and components, and adding nodes to the cluster. Other users can interact with files in the Hadoop Distributed File System (HDFS), manage applications, and work with BigSheets. You use BigSheets in a later module to work with social media data and create a BigSheets workbook. Procedures 1. On the Cluster Status page, click the Hive service to display detailed information for this component. From here, you can start or stop the Hive service, or any Hadoop service, depending on your needs. 2. Click the Welcome tab, and then in the Quick Links section, click Access secure cluster servers. A list of URLs displays. Click the hive link, which opens the Hive Web Interface into a new browser window. You see an open source tool that is provided with Hive for administration purposes, such as displaying Hive tables and drilling into the database schema. 3. In the InfoSphere BigInsights Console, click the Files tab to begin exploring your distributed file system. You use the Files page to explore the contents of
your file system, create new subdirectories, upload small files for test purposes, and complete other file-related functions. 4. Become familiar with the functions that are provided by using the icons at the top of the pane in the Files page. These icons are used throughout the tutorial. Hover over an icon with your cursor to learn its function.
5. In the Files page, expand the directory tree in the left navigation. If you already uploaded files to the HDFS, you can navigate through the directory to locate them. Otherwise, you can browse the sample files and directories that are included with IBM InfoSphere BigInsights Quick Start Edition. 6. Click the Applications tab, and then click Manage to view applications that are available in your cluster. From this tab, you can deploy applications to and delete applications from the cluster. 7. Select the Boardreader application and then click Deploy. In the Deploy Application window, click Deploy. You use the Boardreader application in a later module to import data into your HDFS. 8. To view the status of applications, click the Applications Status tab. If this is the first time that you are using the InfoSphere BigInsights Console, no applications, workflows, or jobs display. After you run applications, workflows, or jobs, you can view their status from this page.
Summary of managing your big data environment

In this tutorial, you learned about the different sections of the InfoSphere BigInsights Console and how you can use them to get started with managing your big data environment.
Lessons learned
You now have a good understanding of the following tasks: v Getting started with common tasks in the InfoSphere BigInsights Console v Starting and stopping InfoSphere BigInsights services v Managing and interacting with files in your distributed file system v Deploying applications to the cluster
Additional resources
To learn more about the tasks that you can complete by using InfoSphere BigInsights, use the interactive conceptual models. These models provide insight into some of the other tutorials that you can complete by using the product. v Overview of InfoSphere BigInsights v Developing applications by using the InfoSphere BigInsights Tools for Eclipse v Creating text extractors by using Text Analytics For more information about how your organization can use InfoSphere BigInsights to efficiently manage and mine big data for valuable insights, read this data sheet.
Chapter 2. Tutorial: Managing your big data environment
Chapter 3. Tutorial: Importing data for analysis

Learn how to import data into your distributed file system from your local system or network by using the InfoSphere BigInsights Console and IBM provided applications. Business data is stored in various formats and sources. Before you import your data into the InfoSphere BigInsights distributed file system, you must determine what questions you want to answer through analysis, identify the data type of your sources, and use the tools and procedures that best fit your business need. You can use InfoSphere BigInsights with your existing infrastructure or data warehouse to import data and content in its original formats, or you can import huge volumes of at-rest (static) data or incoming data in motion (continually updated data). After you import your data, you can explore the data separately or combine the data to complete exploration and analysis. Many businesses might want to examine the popularity of a specific brand or service in social media. The data that is provided for this lesson is the result of a BoardReader application search for the instances of the phrase "IBM Watson" on the Internet. This search is detailed in the developerWorks article, Analyzing social media and structured data with InfoSphere BigInsights: Get a quick start with BigSheets. IBM Watson is a research project that uses complex analytics to answer questions that are presented in a natural language. For this tutorial, and the related tutorial on BigSheets, only news and blog data that was returned by the search is used. The returned data was slightly modified to contain only a subset of the information that the BoardReader application collects from blogs and news feeds. The full-text/HTML content of posts, news items, and certain metadata, was removed to keep the size of each file manageable. The BoardReader application requires a license for use. If you have a license, you can choose to follow the steps in the lesson on using the BoardReader application (Lesson 2), or download the data to your computer and import it to the InfoSphere BigInsights distributed file system for use with the Distributed File Copy application (Lesson 3). To obtain a license, see the BoardReader website.
Learning objectives
After you complete the lessons in this tutorial, you will understand the concepts and know how to: v Create a folder for your sample data in the InfoSphere BigInsights distributed file system. v Collect and import data by using the BoardReader application. v Import data from your local system or network by using the Distributed File Copy application. v Locate imported data in the distributed file system for use in BigSheets, Big SQL, and Text Analytics.
Time required
The time required to complete this tutorial depends on which method you choose to use to import your data, and the cluster configuration and the number of nodes
available for your use. If you choose to complete the BoardReader lesson, this tutorial will take approximately 20 minutes to complete. If you choose to use only the Distributed File Copy application, this tutorial will take approximately 5 minutes to complete.
Prerequisites
Before you begin this tutorial, ensure that you installed the InfoSphere BigInsights tools for Eclipse, and that you have access to the application through the InfoSphere BigInsights Console.
Lesson 1: Managing your data

Before you import your data, it is important to determine how you want to manage your data in the InfoSphere BigInsights distributed file system. About this task For this module, there are two options for gathering your data. However, to best manage your information you should first create a folder to store the data. Procedures 1. Open the InfoSphere BigInsights Console. 2. From the Files tab, select the HDFS tab. 3. Create a directory to store this data in the distributed file system. Click the ) in the Files toolbar. Create Directory folder icon ( 4. Name your directory. For this lesson, in the HDFS directory, create the directory bi_sample_data. If you are using the InfoSphere BigInsights Quick Start Edition, the home directory is /user/biadmin/. 5. In the bi_sample_data directory, create a subdirectory named bigsheets where you can store and access this same IBM Watson data for the BigSheets tutorial. You now have a directory to store all of your source data files and application results.
Lesson 2: Importing data by using the BoardReader application

The data that is used in this tutorial is gathered by using the BoardReader application. This application is just one method of collecting data and importing it into the InfoSphere BigInsights distributed file system. Prerequisites To use the BoardReader application, each customer must contact BoardReader to obtain a valid license key. To obtain a license, see the BoardReader website. If you do not have access to a BoardReader license, you can follow along to learn the steps to use the application, or skip to the next lesson to download the finished data by using the Distributed File Copy application. The following is an example of what the properties file for BoardReader might look like, and your_key_value is your license key:
boardreaderkey=your_key_value
You must create a credential file with the BoardReader key.There are private and public files in the credentials store. The private credentials store contains your private information in the /user/username/credstore/private directory. If you want to import data by using an SFTP or FTP connection, make sure that this connection is running on your system. About this task Collecting social media data can be challenging because each site can hold different information and use varying data structures. Also, visiting numerous sites to gather your information is a time-consuming process. For this lesson, the BoardReader sample application that is provided with InfoSphere BigInsights can search blogs, news feeds, discussion boards, and video sites. Procedures 1. Deploy the BoardReader application to make it available for your use. a. In the InfoSphere BigInsights Console, in the Applications tab, click Manage. b. From the navigation tree, expand the Import directory. c. Select the BoardReader application, and click the Deploy button (
). d. In the Deploy Application window, select Deploy. 2. From the toolbar on the top of the hierarchy tree window, select Run. 3. Select the BoardReader application. 4. Define the Execution name of your project. This step creates a project, and you can track the results and reuse the project later. For example, enter the Execution name br_ibmwatson. 5. Define your application parameters. a. In the Results path field, specify the directory for the application's output. Use the Browse button to locate the file /bi_sample_data/bigsheets in the Hadoop Distributed File System (HDFS) directory. If you are using the InfoSphere BigInsights Quick Start Edition, the directory is /user/biadmin/bi_sample_data/bigsheets. b. Define the Maximum matches that you want to be returned from the search. Since you want to be able to use this data for full scale analysis, use the range 1,000. c. Select a Start date and an End date. Define a specific past time frame for the BoardReader to search. To search for this Watson data, define the start date as January 1, 2011. Define the end date as March 31, 2012. d. Select a Properties file. The Properties file references the file in the InfoSphere BigInsights credentials store that was populated with the BoardReader license key. e. In the Search terms field, enter the term "IBM Watson" as the subject of this search. This string causes the BoardReader application to search for any instance of both terms appearing together. 6. Select Run to run the search in the BoardReader application. The data is imported to the specified results path. 7. Verify that the BoardReader application conducted a successful search. You can examine the status in the Application History panel. Return to the Files tab,
and locate the /bi_sample_data/bigsheets directory to locate your search results. If you are using the InfoSphere BigInsights Quick Start Edition, the directory is /user/biadmin/bi_sample_data/bigsheets.
Lesson 3: Importing data by using the Distributed File Copy application

The Distributed File Copy application copies files to and from a remote source to the InfoSphere BigInsights distributed file system by using Hadoop Distributed File System (HDFS), GPFS, FTP, or SFTP. You can also copy files to and from your local file system. Prerequisites To use the Distributed File Copy application with SFTP, you can create a credential file. There are private and public files in the credentials store. The private credentials store contains the private information for each user that is in the /user/username/credstore/private directory. The following is an example of what the properties file for SFTP might look like:
database=db2inst2 dbuser=pascal password=[base64]LDo8LTor
Note: The Distributed File Copy application is designed to move large amounts of data. This application is designed to run on a Linux platform. To upload smaller data sets (less than 2G), you can use the Upload function from the Files tab in the InfoSphere BigInsights Console. For more information about this import method, see Tutorial: Analyzing big data with BigSheets About this task For this lesson, you will download the IBM Watson data that was the result of the BoardReader application search to your local system, and then upload it to the file system for analysis. Before you begin, you must first download the data to your local system. The data is in the Download section of the developerWorks article, "Analyzing social media and structured data with InfoSphere BigInsights: Get a quick start with BigSheets". Accept the terms and conditions and save the file article_sampleData to your local system. After you unzip the file, the article_sampleData folder should contain the files RDBMS_data.csv, blogs-data.txt, news-data.txt, and a README.txt file that details the data output. Procedures 1. Deploy the Distributed File Copy application to make it available for your use. a. In the InfoSphere BigInsights Console, in the Applications tab, click Manage. b. From the navigation tree, expand the Import directory. c. Select the Distributed File Copy application, and click the Deploy button ( ) d. In the Deploy Application window, select Deploy. 2. From the toolbar on the top of the hierarchy tree window, select Run.
10
3. Select the Distributed File Copy application. 4. Define your application parameters. a. Specify an Execution name. This step creates a project, and you can track the results and reuse the project later. Name the execution dc_ibmwatson. b. In the Input path field, specify the fully qualified path to the article_sampleData file on your local file system. For example, sftp://username:password@localhost/file/path/article_sampleData The default is HDFS, if an SFTP or FTP connection, or a GPFS file system, is not specified c. In the Output path field, specify the fully qualified path to where you want to store the data, for example /bi_sample_data/bigsheets/ article_sampleData. If you are using the InfoSphere BigInsights Quick Start Edition, the directory is /user/biadmin/bi_sample_data/bigsheets/ article_sampleData. Make sure to include the name of the file that you want to import in the file path to prevent the folder from being mistaken as the name of the data file. d. Optional: If you are using SFTP to connect to your local file system, use the Browse button to specify the fully qualified path to your properties file in the InfoSphere BigInsights credentials store. 5. Select Run to import the file. 6. Verify that the Distributed File Copy application conducted a successful import. To verify the import, you can examine the status in the Application History panel. Return to the Files tab, and locate the /bi_sample_data/bigsheets directory to locate your import results. If you are using the InfoSphere BigInsights Quick Start Edition, the directory is /user/biadmin/bi_sample_data/ bigsheets/article_sampleData.
Summary of importing data to the distributed file system

In this module, you learned about how to use the Distributed File Copy application and the BoardReader application to import data into the InfoSphere BigInsights distributed file system.
Lessons learned
You now have a good understanding of the following tasks: v Creating a new directory in the InfoSphere BigInsights distributed file system. v How to deploy an IBM-provided application. v Collecting and importing data with the BoardReader application. v Importing data with the Distributed File Copy application. v Locating your data for use with BigSheets, Big SQL, or Text Analytics.
11
12
Chapter 4. Tutorial: Analyzing big data with BigSheets

Learn how to use BigSheets, a browser-based tool that is included in the InfoSphere BigInsights Console, to analyze and visualize big data. BigSheets uses a spreadsheet-like interface that can model, filter, combine, explore, and chart data collected from multiple sources, such as from an application that collects social media data by crawling the Internet. Data is categorized and formatted by creating master workbooks, read-only representations of your complete original data set. From these master workbooks, you can derive child workbooks, editable versions of the master workbooks, in which you can create specific sheets to manipulate and tailor your data meet your analysis needs. In this tutorial, you link social media data about IBM Watson with simulated internal IBM data about media outreach efforts. Your goal, to analyze the visibility, coverage, and sentiment around IBM Watson is a common requirement of analysts and data scientists. This tutorial teaches you the key aspects of BigSheets so that you can quickly begin analyzing your own big data.
Learning objectives
After you complete the lessons in this module, you will understand the concepts and processes that are associated with: v Creating master workbooks from files that you upload into your distributed file system cluster v Creating child workbooks to tailor and explore the data v Merging data from two sources into one workbook v Creating columns to group and sort data v Viewing data in diagrams to see relationships between workbooks and the process of getting a workbook into its current state v Charting and refining the results of your analysis v Exporting your results
Time required
This module takes approximately 60 minutes to complete.
Lesson 1: Creating master workbooks from social media data

In this lesson, you upload two social media data files from the Internet to your cluster and use these files to create two new master workbooks. Note: For the purposes of this tutorial, you are uploading small sample data files that are less than 2 GB. To load files larger than 2 GB, you must use the Import feature. For more information, see the Importing data for analysis tutorial. Master workbooks protect and preserve the raw data in its original form. If, during your data explorations, you accidentally remove a column, you can recreate the child workbook from the master workbook without reloading the original data.
13
Master workbooks also model the format for the data. This format is determined by applying a reader, a data format translator that maps data into the spreadsheet-like structure necessary for BigSheets. BigSheets provides several built-in readers for working with common data formats. Procedures 1. Collect the social media files: a. In your web browser, enter the following URL: http://www.ibm.com/ developerworks/data/library/techarticle/dm-1206socialmedia/. This URL takes you to a BigSheets article on IBM developerWorks. b. Scroll down until you see the Download section. Click the HTTP link, review the terms and conditions, and then click I ACCEPT THE TERMS AND CONDITIONS. c. In the Opening sampleData.zip window, select Save File, and click OK. The sampleData.zip file is saved to the default location of your downloaded files. For example, on a Windows system, the default download directory is often C:\Documents and Settings\Administrator\My Documents\Downloads. d. If your InfoSphere BigInsights Console is open in the same web browser, close the tab that contains the deveoperWorks article. Otherwise, you can close or minimize your web browser. 2. Extract and upload the files to your cluster: Typically, you create master workbooks from the BigSheets tab from files that are already in your cluster. a. Go to the location of your downloaded files, and open the sampleData.zip file. b. Extract the files to a local directory. For example, on a Windows system, you may extract the files to C:\temp. c. Open the InfoSphere BigInsights Console by pointing your browser to http://<host>:<port>/, and click the Files tab. d. Expand the main hdfs:// directory and open biginsights > sheets directories by clicking the plus sign ( ) next to each directory. e. Make sure that the sheets directory is highlighted and click the Create ). Directory icon ( f. In the Name field of the Create Directory window, enter Watson_data, and click OK. g. Click the Upload icon ( ). h. In the Upload window, click Browse, and navigate to the extracted files location. Select the SampleData/article_sampleData/blogs-data.txt file, and click Open. The blogs-data.txt file is listed under Files to Upload. i. Click Browse again, select the news-data.txt file, and click Open. The news-data.txt file is listed under Files to Upload. j. In the Upload window, click OK to upload the files. It might take a minute to load the files. The window refreshes, and you can see the two files that you uploaded in the Watson_data directory. 3. Select the blogs-data.txt file, and click the Sheet radio button. In the Preview area of the window, you see that the data is not displayed properly. It is formatted in a JSON Array structure. 4. Select a new reader to map the data to the spreadsheet format:
14
a. Click the Edit icon ( ). b. Select JSON Array from the drop-down list, and click the green check ) to apply the reader. You immediately see the data map to the mark ( columns and rows of the spreadsheet-like interface in the Preview area. c. Since the data columns exceed the viewing space, click Fit column(s). The first eight columns display in the Preview area. Note: Depending on the size of your web browser window, you might need to scroll to see Fit column(s). d. Click Save as Master Workbook. e. In the Name field, enter Watson_Blogs. Spaces are valid characters for workbook names. f. In the Description field, enter Watson blog data from blogs-data.txt, then click Save. 5. Click the Workbooks link in the breadcrumb at the top of the window. You are moved to the BigSheets tab, and you see your new master workbook, Watson_Blogs. 6. Click New Workbook. 7. In the Name field, enter Watson_News. 8. In the Description field, enter Watson news feed data from news-data.txt. 9. Under File, navigate to the /biginsights/sheets/Watson_data directory and select the news-data.txt file. The right side of the window displays the file name and contents. This data is also in JSON Array format. 10. Click the Edit icon ( ), select JSON Array from the drop-down list, and
) to apply the reader. click the green check mark ( 11. Since the data columns exceed the viewing space, click Fit column(s). The first eight columns display in the Preview area. 12. Save the master workbook by clicking the green check mark ( lower right corner of the screen. ) in the
Note: Depending on the size of your web browser window, you might need ). to scroll to see the green check mark ( You are moved to the BigSheets tab, and you see your new workbook, Watson_news. 13. To see both master workbooks that you created, click the Workbooks link. You are now ready to explore the data that you loaded.
Lesson 2: Tailoring your data by creating child workbooks

Typically, before you analyze and investigate data, you must tailor its format and content. In this lesson, you create child workbooks from each master workbook and remove unwanted columns to refine the amount and type of your data. In addition to protecting the original data, master workbooks set the data format (including the data types for the columns). Therefore, you must create child workbooks in which to modify your data. Child workbooks inherit their format and data from their master workbooks, but you can tailor their attributes to display only necessary data.
15
Procedures 1. From the BigSheets tab of the InfoSphere BigInsights Console, select the Watson_News master workbook. 2. Click Build new workbook. A new workbook is created with the name: Watson_News(1). 3. Rename the workbook by clicking the Edit icon ( ), entering the new name
). of Watson News Revised, and clicking the green check mark ( 4. To see columns A through H within your web browser, click Fit column(s). 5. For your analysis, you do not need the IsAdult column (column E). Remove it by clicking the down arrow in the column heading and selecting Remove. Learn more about column actions: Notice all the column actions that are available to you in the drop-down list. You can rename, hide, and remove a column; insert a new column; sort the data in a column; and organize the columns. When you remove columns from a child workbook, you delete only the data from the child workbook. The master workbook on which this child workbook is based always contains the original data as it was loaded. If you decide later that you want the IsAdult data in your analysis, you can create another child workbook from the Watson_News master workbook. Why not just hide the IsAdult column: When you hide a column, the data in that column is still included when you run the workbook or create a chart. The only way to remove the data from the analysis or chart is to remove the column. 6. As you review the data in this Watson News Revised child workbook, you decide that you do not need several other columns. You can use the same method, as in the previous step, to remove them one at a time or remove multiple columns at once: a. Click the down arrow in any column heading, and select Organize Columns. b. Click the red X ( ) next to the following columns to mark them for removal: v Crawled v Inserted v MoveoverUrl v PostSize c. Click the green check mark ( ) to remove the columns.
Important: When you click the green check mark, there is no undo option to restore the columns. If you accidentally removed more columns than you intended, you cannot undo your last action to restore them. If you make a mistake when you are removing the columns, you must create another child workbook. In this case, you would create another child workbook from the Watson_News master workbook and restart this lesson at step 1. 7. Click Fit column(s) to resize the remaining columns. You now see columns A through H:
16
A Country
B FeedInfo
F Tags
G Type
H Url
Language PublishedSubjectHtml
8. Save and exit the workbook by clicking Save and selecting Save & Exit. If you are prompted with a Save workbook window, you can save the workbook with or without entering a description. 9. You are prompted with the message This workbook has never been run. Press Run to run it or Close to dismiss this message. Click Run. You see a progress indicator in the upper right corner of the window. Until now, you have been working with a subset of the data. BigSheets keeps only a limited number of rows in memory. The lower right corner displays a message that indicates you are seeing only a simulated sample of 50 rows of data. When you run the data, you apply all changes that you made since the last time you saved the workbook to the full data set. The progress bar monitors the progress of the job. Behind the scenes, Pig scripts initiate MapReduce jobs. The runtime performance depends upon the volume of data that is associated with your data collection and the system resources that are available. 10. Now, create a child workbook from the Watson_Blogs master workbook and remove the columns that are not needed for your analysis: a. To return to the page that displays all your workbooks, click the Workbooks link. b. Select the Watson_Blogs master workbook, and click Build new workbook. A new workbook is created with the name: Watson_blogs(1). Learn more about the differences in icons for master workbooks and child workbooks: Notice that the Watson News Revised workbook has a child workbook icon ( ) next to it, whereas the Watson_Blogs and
) . You can Watson_News master workbooks have a different icon ( quickly distinguish master workbook from child workbooks by these icons. The child workbook icon looks like a mini spreadsheet, whereas the master workbook icon has what looks like a lock that requires a key over the spreadsheet image, indicating that the master workbook is read-only. c. Rename the new child workbook by clicking the Edit icon ( Watson Blogs Revised, then clicking the green check mark ( ), typing
). d. Use the Organize Columns function to remove the following columns: v Crawled v Inserted v IsAdult v PostSize Remember to select the green check mark in the Organize Columns window. To merge workbooks, each workbook must contain the same data columns and data types, or schema. For this reason, you must remove columns so that both child workbooks contain the same columns: v Column A: Country v Column B: FeedInfo
17
v v v v v
Column Column Column Column Column
C: Language D: Published E: SubjectHtml F: Tags G: Type
v Column H: Url e. Save and exit the workbook. f. When prompted, click Run to apply the changes that you made to the child workbook. Both new child workbooks that you created have the exact same columns and data types. Now you can merge the two child workbooks into a new workbook, where you can explore and analyze your data.
Lesson 3: Combining the data from two workbooks

In this lesson, you combine the data from the two child workbooks into a single data collection. By merging the data, you have a central place to explore, analyze, and chart the coverage of the IBM Watson data. To merge the data, create a new workbook from an existing workbook, load the data from the existing workbook into the new workbook, and then merge the data. Procedures 1. In the InfoSphere BigInsights Console click the BigSheets tab, and select the Watson News Revised workbook. 2. Click Build new workbook. The name of the workbook is Watson News Revised(1), indicating that it is a child workbook of Watson News Revised. You change the name of this workbook later when you save and exit the workbook. 3. Click Add sheets, and then select Load. Learn more about types of sheets: Each type of sheet provides different predefined logic for analyzing data. Use the Load sheet to include the data of another workbook as a sheet in the current workbook. For more information about the types of sheets, see Types of sheets. 4. In the Load window, select the Watson Blogs Revised workbook link from the list of existing workbooks. 5. In the Sheet Name field, enter Watson Blogs Revised, and click the green check ). In the Load window, you see details of the columns and the first mark ( few rows of data in that workbook. At the bottom of your workbook, you see two tabs, Watson News Revised and Watson Blogs Revised. 6. Click Add sheets, and select Union. 7. In the Sheet Name field of the New Sheets: Union dialog, enter News and Blogs to indicate that this sheet contains the merged data. 8. From the Select sheet drop-down list, select the Watson News Revised sheet, click the green plus sign ( ) to add the sheet (you see the sheet move to the
). Your bottom of the dialog), and then click the green check mark ( workbook now displays the new tab, News and Blogs, at the bottom of your screen.
18
9. To save the workbook, click Save. When prompted for a name and description, enter Watson News Blogs in the Name field. Enter a description, such as Combined news and blogs data in the Description text box, and click Save. You successfully combined the blog and news data into one workbook, which you can use to analyze and explore the data. Next, you group similar data from multiple columns into one column.
Lesson 4: Creating columns by grouping data

In this lesson, you learn how to create columns by grouping similar information. You want to discover how many news articles and blog posts are written in each language. You accomplish this goal by using the Pivot sheet and its functions to combine, calculate, and sort the language data. First, use the Calculate function to count the total number of articles and posts by language. Then, sort the column by language to display the most popular languages first. Procedures 1. Make sure that the Watson News Blogs workbook is open. If the workbook is not open, from the BigSheets tab, click the Edit link next to Watson News Blogs. Clicking the Edit link from the main BigSheets tab opens the selected workbook in edit mode, so you do not have to click the Edit icon ( 2. Click Add Sheets, and select Pivot. ).
3. In the New Sheet: Pivot window, complete the required information: a. In the Sheet name field, enter Pivot by language. b. From the Group by columns drop-down list, select Language and click the ) to add the column. The Language column name green plus sign ( displays in the bottom of the dialog. c. At the bottom of the window, click the Calculate tab. d. In the Create columns based on groups text box, enter ). NumberArticlesandPosts and then click the green plus sign ( e. From the NumberArticlesandPosts drop-down list, select COUNT. f. To select the column to count, in the Column drop-down list, select Language. g. Click the green check mark ( ) to create the Pivot sheet.
On the Pivot by language sheet, you see two columns, Language and NumberArticlesandPosts. The Language column on the new sheet displays all the languages from the News and Blogs sheet. The NumberArticlesandPosts column counts the number of posts in each language. 4. Now you sort the Pivot sheet by the number of posts to see the languages that most people use to post about IBM Watson. Click the drop-down arrow to the right of the NumberArticlesandPosts column, select Sort, and then click Descending. You can see that English is the most popular language for posts and articles with 3169, followed by Russian, Spanish, and Chinese - Simple. But notice that Chinese (spelling) and Chinese - Traditional are also near the top of the list. You combine these values into one Chinese language value later, when you create a chart. 5. Click Save & Exit to save and close the workbook.
19
6. Click Run to save, sort, and process the entire data set for the workbook. You see a progress indicator in the upper right corner of the window. After you run the workbook, you see different results for the number of English posts in the NumberArticlesandPosts column, 5464. Now, view your worksheets and sheets in the BigSheets diagrams. There, you visualize the results of your analysis by creating and refining charts.
Lesson 5: Viewing data in BigSheets diagrams

In this lesson, you view the two diagrams in BigSheets to understand the relationships between workbooks and sheets and the process of modifying the data in a workbook. The Workbook diagram shows the sheets and processes that created the selected workbook, the relationships between workbooks, and on which master or child workbook the current workbook is based. Procedures ). 1. From the Watson News Blogs workbook, click the Workflow Diagram icon ( The diagram shows how you used the Watson Blogs Revised and Watson News Revised child workbooks to create the Watson News Blogs workbook. You can also see the source (master workbook) for each child workbook. in the upper 2. When you are finished looking at the diagram, click the red right corner. ). In the workbook diagram, you can 3. Click the Workbook Diagram icon ( see the types of sheets and associated processes that you used to get to the current workbook. in the upper 4. When you are finished looking at the diagram, click the red right corner. You now know how the two diagrams in BigSheets can help you visualize the relationships between workbooks and sheets and the process of creating a workbook. Now you are ready to explore and analyze the data in the Watson News and Blogs workbook.
Lesson 6: Visualizing and refining the results in charts

In this lesson, you visualize the results of the sorted and combined Watson blogs and news data in the Watson News Blogs workbook by creating a simple horizontal bar chart. Then, you refine your chart to improve your data. BigSheets provides various charts and maps. A chart plots data points in a grid, such as a typical pie chart or a bar chart. A cloud shows the importance of the values by displaying the size of the words relative to their importance. A map contains charts that represent geographic data, such as a heat map that shows the concentration of data points geographically. For more information about the types of charts, see Chart and map types. Procedure 1. Open the Watson News Blogs workbook, click Add chart, and then select chart > Horizontal Bar. It might take a few minutes to populate the categories of charts the first time.
20
2. In the New chart: Horizontal Bar window, enter or select the following values: a. In the Chart Name field, enter Language Coverage. The chart name is the name that displays on the tab at the bottom of the worksheet. b. In the Title field, enter IBM Watson Coverage by Language. The title of the chart displays on the chart. c. From the X Axis drop-down list, select NumArticlesandPosts. d. In the X Axis Label, enter Number of posts. e. From the Y Axis drop-down list, select Language. f. In the YAxis Label, enter Language of post. g. From the Sort By drop-down list, select X Axis. You want to sort by the number of posts. h. From the Occurrence Order drop-down list, select Descending. You want to see the language with the highest number of posts first. i. In the Limit field, enter 12. You want to see only the top 12 languages by the number of posts. j. Click the green check mark ( sample data. ) to see a preview of the chart with
3. Click Run to generate the chart from the full set of workbook data. Even though you see the preview chart immediately, remember that the actual chart is not displayed until you see 100% on the progress bar. It might take some time to generate the chart from the full set of data. Use the progress bar to monitor the status of the completed chart. After the bar chart is generated, you can see that Russian is the second most popular language for posts. You also see that the fifth and sixth most popular languages are variations on the Chinese language. By combining these values, the Chinese language is actually the second most popular language for posts. This situation is common, especially when you combine data that is collected from various sources such as different social media sites. 4. To clean up the data, combine all the Chinese languages and post numbers into one value: a. Click Edit to edit the workbook. You return to the Pivot by language sheet. b. Select the News and Blogs sheet by clicking the tab name at the bottom of the window. c. Insert a new column by clicking the down arrow next to the Language column and selecting Insert Right > New Column. d. Enter Language_Revised for the name of the new column, and then click the green check mark to create the column. Your cursor moves to the fx (or function) area for you to provide the function to generate the contents of the new column. e. Enter the following formula as the function IF(SEARCH(Chin*, #Language) >0, Chinese, #Language) and then click the green check ) to apply the formula and generate the values for the mark ( Language Revised column. This formula searches the Language column (indicated by #column_name) for any value that starts with Chin and combines those values into one value in the Language Revised column. The wildcard asterisk character ensures that all variations of the Chinese language, regardless of spelling or words that follow the word Chinese (such as Chinese Simple), are included. If the value does not start with Chin, then the formula copies the value, as is, into the Language Revised column.
21
Learn more about BigSheets functions and formulas: To understand how to use the BigSheets functions, and to see some examples of using formulas, see Formulas. f. Now, change the settings for the Pivot by language sheet to use the new column: Select the Pivot by language sheet, click the down arrow to the right of the sheet name, and select Sheet Settings. g. In the Pivot window, from the Group by Columns drop-down list, select ) to add the column. Language_Revised, and click the green plus sign ( h. Click the red X ( ) next to the Language column to remove it. You want to group and calculate the number of posts by the Language_Revised column instead of the Language column. i. Click the Calculate tab. In the Column drop-down list, select Language_Revised. j. Click the green check mark ( ) to apply your changes.
The new column named Language_Revised replaces the Language column to the right of the NumArticlesandPosts column. Click the B at the top of the Language_Revised column and drag it to the left of the NumArticlesandPosts column. 5. Save and exit the workbook. Then, click Run to update the entire data set for the workbook. When the data is updated, you see a message that an error occurred while you sample the chart. This error occurred because you updated the Pivot by language chart to use the Language_Revised column. The current Language Coverage chart is based on the Language column. 6. Click OK to close the error message. 7. Delete the previous Language Coverage chart: Click the triangle to the right of the Language Coverage sheet, and select Delete chart. 8. Click Add chart, and then select chart > Horizontal Bar to create another chart that is based on the updated data. 9. In the New chart: Horizontal Bar window, enter or select the following values: a. In the Chart Name field, enter Language Coverage. The chart name is the name that displays on the tab at the bottom of the worksheet. b. In the Title field, enter IBM Watson Coverage by Language. The title of the chart displays on the chart. c. From the X Axis drop-down list, select NumArticlesandPosts. d. In the X Axis Label, enter Number of posts. e. From the Y Axis drop-down list, select Language_Revised. f. In the YAxis Label, enter Language of post. g. From the Sort By drop-down list, select X Axis. You want to sort by the number of posts. h. From the Occurrence Order drop-down list, select Descending. You want to see the language with the highest number of posts first. i. In the Limit field, enter 12. ) to see a preview of the chart with j. Click the green check mark ( sample data. 10. Click Run to generate a new chart. After the chart completes, all Chinese languages are combined into one bar in the bar chart, which shows Chinese as
22
the second most popular language for posts. Posts in the Russian language are the third most popular language. If you hover over the bars in the chart, you can see the actual numbers of posts. You used BigSheets to generate a simple horizontal bar chart from your social media data collections. You also analyzed the bar chart and refined the data to determine the 12 most commonly used languages to generate posts about IBM Watson. Next, you learn how to export data from your workbooks.
Lesson 7: Exporting data from your workbooks

In this lesson, you export the data from the Watson News and Blogs workbook into a browser tab and into a CSV file. You might want to share the results of your BigSheets analysis with colleagues, who do not have direct access to IBM InfoSphere BigInsights. You can export your analysis results in various data formats, including CSV (comma-separated values), JSON Array, JSON Object, and TSV (tab-separated values). You can also export the data to a new tab of your browser so that you can import a subset of your data into another application. Or you might want to save the data, if your privileges include saving files to the cluster. Procedures 1. If the Watson News Blogs workbook is not open, open it. From the BigSheets tab, select Watson News Blogs. Do not open the workbook in Edit mode. You cannot export workbooks that are currently in Edit mode. If you do open the workbook in Edit mode, you do not see Export as, but you see Add sheets. 2. Click Export as. 3. The Export To: option is set by default to Browser Tab. Click OK to export the data from the current Result sheet into a new tab on your web browser. 4. Click the IBM InfoSphere BigInsights tab in your web browser to return to the InfoSphere BigInsights Console and the Watson News Blogs workbook. 5. Click Export as again. 6. In the Format Type drop-down list, select CSV, which produces a comma-separated value file. 7. In the Export To option, select File. 8. Click Browse and enter or select the following parameters: a. Set the path by opening the main hdfs:// folder and then selecting the tmp folder. b. In the file name text box at the bottom of the window, enter the file name watson_news_blogs. File names can contain spaces. Avoid entering special characters in the file name, or you receive an error message. c. Click OK. 9. Select the Include Headers check box to include column names in the file. You can select this check box only if you selected either the CSV or TSV format. 10. Click OK. You receive a message that the Workbook has been successfully exported. Click OK to close the message. You can check your results by clicking the Files tab and opening the main hdfs:// folder and then the tmp folder. You see the watson_news_blogs.csv file listed.
23
You just exported the results of the Watson News Blogs workbook into both a web browser tab and a CSV file on your Hadoop Distributed File System (HDFS) cluster.
Summary of analyzing data with BigSheets tutorial

In this tutorial, you analyzed social media data from two sources by creating master and child workbooks, tailored that data to your analysis goals, generated charts to visualize and refine your results, and exported your results.
Lessons learned
You now have a good understanding of the following tasks: v How to use data files that are on your cluster to create master workbooks v How to create child workbooks from both master workbooks and existing child workbooks v How to tailor and explore the data in your workbooks by removing unneeded data and combining two workbooks using both the Load and Union sheets v How to group similar data and then create columns that calculate and sort the data using the Pivot sheet v How to view the BigSheets diagrams and see the relationships between workbooks and the operations that you used to modify a workbook v How to visualize your analysis results in a simple horizontal bar chart v How to export BigSheets workbook data into both a tab in your web browser and a CSV file
Extra resources
To learn more about how to use BigSheets to analyze your big data, see the following resources: v Overview of BigSheets v Analyzing data with BigSheets
24
Chapter 5. Tutorial: Developing your first big data application

Learn how to develop your first big data application, which writes data to a file and stores the results in your distributed file system. Learn more about the sample code: The code that is shown in this tutorial is intended for educational purposes only, and is not intended for use in a production application. You develop the application in Jaql, a query and scripting language that uses a data model based on the Javascript Object Notation (JSON) format. You learn how to create, publish, and deploy the application by using the InfoSphere BigInsights Tools for Eclipse so that you can run the application from the InfoSphere BigInsights Console. This module does not cover the full syntax and usage of the Jaql language, which is explained in the InfoSphere BigInsights Information Center. However, you can apply many of the application development techniques in this module to other applications.
Learning objectives
After completing the lessons in this tutorial, you will have learned how to complete the following tasks: v v v v v Create an InfoSphere BigInsights project. Create and populate a Jaql file with application logic. Test your application. Publish your application to the InfoSphere BigInsights catalog. Deploy and run your application on the cluster.
v Upgrade your application to accept input parameters.
Time required
This tutorial should take approximately 40 minutes to complete.
Prerequisites
The InfoSphere BigInsights Tools for Eclipse must be installed in your Eclipse environment. Experience with Eclipse is not required, but understanding the concepts and the development environment might be helpful when working with the InfoSphere BigInsights Tools for Eclipse.
Lesson 1: Creating an InfoSphere BigInsights project

In this lesson, you create an InfoSphere BigInsights project by using the InfoSphere BigInsights Tools for Eclipse. The project that you create will contain the files, applications, programs, and modules that your application requires to run. After you create a project, you can create an InfoSphere BigInsights program.
25
Procedures 1. Open Eclipse. 2. Set the perspective to BigInsights. a. Click Window > Open Perspective > Other. b. Select BigInsights, and then click OK. 3. Click Help > Task Launcher for Big Data to open the Task Launcher for Big Data. 4. From the Develop tab, under Quick Links, click Create a new BigInsights project. 5. Enter WriteMessage as the project name, and then click Finish. The project that you created, WriteMessage, displays in the Project Explorer pane. Now that your project is created, you can create a program. In this module, you are creating a Jaql application.
Lesson 2: Creating and populating a Jaql file with application logic

In this lesson, you create a Jaql file within your new project and include Jaql statements that determine how the application logic operates. You create a simple Jaql application with no input parameters. Prerequisites Before you begin, ensure that the InfoSphere BigInsights Tools for Eclipse is open. Procedures 1. From the Task Launcher for Big Data, click the Develop tab. 2. Under Tasks, click Create a BigInsights program. 3. In the Create a BigInsights program window, select JAQL Script, and then click OK. 4. Select the parent folder, WriteMessage, enter MyJaql.jaql as the file name, and then click Finish. Your new file, MyJaql.jaql, opens in an editor within Eclipse. 5. Copy and paste the following code into the MyJaql.jaql file. The following code writes the results of the Jaql query as a text file (myMsg.txt) to the /user/biadmin/sampleData directory in your distributed file system. You might need to modify the specified directory to match your environment. Ensure that your user ID has write access to the directory that you specify.
// sample message term=Hello World; // Location of the output file. Modify this location to fit your environment. output=/user/biadmin/sampleData/myMsg.txt; // Write the sample message as a text file in HDFS. write ([term], lines(location=output)); // Alternatively, you can hard code input parameters as shown in the following example. // write ([Hello World!], lines(location="/user/biadmin/sampleData/myMsg.txt"));
6. Save and close the MyJaql.jaql file.
26
Now that your Jaql application contains logic, you can test it to ensure that the program runs as designed.
Lesson 3: Testing your application

In this lesson, you define a connection to an existing InfoSphere BigInsights cluster and configure the runtime properties for your application. Prerequisites Before you begin this lesson, ensure that the InfoSphere BigInsights Tools for Eclipse is open. Procedures 1. Create a server connection to the InfoSphere BigInsights Console. Because Jaql is run from the Eclipse environment, your operating system user ID from where you run the Jaql shell must be the same as your InfoSphere BigInsights user ID. a. In the Overview tab of the Task Launcher for Big Data, under First Steps, click Create a BigInsights server connection. Tip: If you upgrade the InfoSphere BigInsights server to a newer version, or upgrade your InfoSphere BigInsights Tools for Eclipse, you might need to refresh the configuration files of your InfoSphere BigInsights sever in Eclipse. Go to the InfoSphere BigInsights server view in Eclipse, and either delete and register the InfoSphere BigInsights server again, or expand the InfoSphere BigInsights server node, right-click Configuration Files and select Refresh from the server. b. Enter the URL for your InfoSphere BigInsights Console, including the server name, user ID, and password. c. Click Test connection to verify your server connection. d. When you see a message indicating that the system successfully tested the connection, click Finish. 2. In the Task Launcher for Big Data, from the Develop tab, under Tasks, click Create a configuration and run a BigInsights program. 3. Select JAQL as the program type that you want to create, and then click OK. 4. Enter the input parameters for your program. a. Enter MyJaqlProgram for the name of your program. b. Select WriteMessage as the project name. c. Select MyJaql.jaql as the Jaql script. d. Select the server that you want to connect to. 5. Click Apply and then Run. 6. Review the output in the Eclipse Console pane and verify that no errors were reported. If errors were reported, fix the errors and run the application again. 7. Log in to the InfoSphere BigInsights Console.
Option In a non-SSL installation Description Enter the following URL in your browser: http://host_name:8080 host_name is the name of the host where the InfoSphere BigInsights Console is running.
27
Option In an SSL installation
Description Enter the following URL in your browser: https://host_name:8443 host_name is the name of the host where the InfoSphere BigInsights Console is running.
8. In the InfoSphere BigInsights Console, from the Files tab, expand the directory from step 5 in Lesson 2 in your distributed file system tree to locate the .txt file that your application created. Tip: After you identify a server connection and create a Jaql program configuration, you can test Jaql statements directly from your MyJaql.jaql file. Highlight the statement that you want to run, right click, and select Run the JAQL statement. Now that your Jaql program is working, you can publish it as an application in the InfoSphere BigInsights applications catalog.
Lesson 4: Publishing your application in the InfoSphere BigInsights applications catalog

In this lesson, you publish your application in the InfoSphere BigInsights applications catalog. When packaging and publishing, you identify the icon and name for your application, define the application workflow, and complete related tasks. Procedures 1. In the Task Launcher for Big Data, under the Publish and run tab, click Publish a BigInsights application. 2. On the Location panel of the BigInsights Application Publish wizard, select the WriteMessage project, specify the server that you want to publish the application to, and then click Next. 3. On the Application panel, select the Create New Application radio button. Optionally, enter a description for your application, provide a custom icon file, and specify a category for your application (such as test). Click Next. 4. On the Type panel, select Workflow for your Application Type, and then click Next. 5. On the Workflow panel, select the Create a new single action workflow.xml file radio button, and select Jaql as your Action Type. a. In the Properties field, select the script that you created, and then click Edit. b. Accept the supplied value (script) that is shown in the Name field. c. Enter MyJaql.jaql as the name of your Jaql file in the Value field. d. Click OK and then click Next. 6. On the Parameters panel, accept the default properties, and then click Next. The Parameters panel will be empty because your application currently does not have any input parameters. 7. On the Publish panel, verify that the MyJaql.jaql file displays under the workflow folder of your application package, and then click Finish. Your application is published to the InfoSphere BigInsights applications catalog.
28
Now that your application is published to the applications catalog, you can deploy it on the cluster so that other users can access the application from the InfoSphere BigInsights Console.
Lesson 5: Deploying and running your application on the cluster

In this lesson, you deploy your application to the cluster so that other users can run it from the InfoSphere BigInsights Console. Procedures 1. Log in to the InfoSphere BigInsights Console in a web browser. Ensure that the user that you log in as has administrative access for deploying and running applications. 2. Click the Applications tab, and then click Manage. 3. Select the WriteMessage application that you created in the list of applications, and then click Deploy. In the Deploy Applications window, click Deploy. Your application is now deployed to the cluster and can be run. 4. On the Applications tab, click Execute. 5. Select the WriteMessage application. 6. In the Execution Name field, enter WriteMessageTrial, and then click Run. The application displays in the Application History pane, and shows the progress of the application run. 7. When the application finishes running, click the arrow icon in the Details column to display further information about the workflow and your application run.
Lesson 6: Upgrading your application

In this lesson, you upgrade your application to make it more flexible by modifying the code to accept two input parameters. Procedures 1. Log in to the InfoSphere BigInsights Console. a. Click Applications, and then click Manage. b. Select your application, and then click Undeploy. You must undeploy applications to replace them with newer versions. 2. In the InfoSphere BigInsights Tools for Eclipse, back up your WriteMessage project. 3. Expand your WriteMessage project and then open the MyJaql.jaql file. 4. Delete the contents of the MyJaql.jaql file, and then copy and paste the following code into the file. Block 1 declares two external variables, TERM and OUTPUT, that represent input parameters that will become part of the user interface for your application. Each parameter is assigned to a Jaql variable (term and output). Block 2 uses the output file information that you provide to write the results to HDFS.
// Block 1 // Define the search parameters extern TERM; extern OUTPUT; // Search term that the user enters as input. term=[TERM];
29
// The full path and file name that the user enters for the output. output=[OUTPUT]; // Block 2 // The following statement writes the input message as a text file in HDFS. write ([term[0]],lines(location=output[0]));
5. Test your Jaql script locally with the parameters that you added. a. In Eclipse, click Run > Run Configurations. b. From the list of configurations, expand JAQL, and then select MyJaqlProgram. c. In the Run Configurations window, click the Arguments tab. d. Copy the following statement and paste it into the Program arguments field.
-e "TERM=Hello World!;OUTPUT=/user/biadmin/sampleData/myMsg.txt"
e. Click Run to test your application. 6. Publish your application to the InfoSphere BigInsights Console applications catalog. a. In the Project Explorer pane, right-click your project and select BigInsights Application Publish. b. Select the same InfoSphere BigInsights server that you used when publishing your application previously, and then click Next. c. On the Specify Applications panel, ensure that the Replace Existing Application check box is selected. Accept the existing values for the remaining items, and then click Next. d. On the Type panel, select Workflow, and then click Next. e. On the Workflow panel, select Create a new single action workflow.xml file, and select Jaql as your Action Type. Because you are introducing new parameters, you cannot accept the default setting to use the existing workflow. 1) In the Properties field, select the script that you created, and then click Edit. 2) Accept the supplied value (script) that is shown in the Name field. 3) Enter MyJaql.jaql as the name of your Jaql file in the Value field. f. Click New to create a new property. In the New Property window, select eval for the Name field from the dropdown menu, and then enter the following statement in the Value field.
TERM="${jsonEncode(term)}";OUTPUT="${jsonEncode(output)}";
When you run your application, you want to provide values for the TERM and OUTPUT variables. To provide these values, you enter the previous statement to assign an Oozie variable to each Jaql variable. Oozie is the workflow engine that runs the application, and each Oozie parameter is enclosed within a dollar symbol and braces, ${}. To easily correlate the Oozie variables with the Jaql parameters, the same variable names are used for the Oozie parameters (term and output). The jsonEncode() function is used to escape special characters and avoid code injection when users enter input in the InfoSphere BigInsights Console. g. Click OK and then click Next. On the Parameters panel, all Oozie parameters that you specified in the workflow are listed. You must select each parameter and edit its properties to provide information about how each parameter displays in the InfoSphere BigInsights Console.
30
h. For the term parameter, set the display name to Search term and the type to string. Enter Hello World! as the default value, provide a brief description, ensure that the Required check box is selected, and then click OK. i. For the output parameter, set the display name to Output file and the type to File Path. Enter a path name for the default value, provide a brief description, ensure that the Required check box is selected, and then click OK. j. On the Publish panel, verify your parameters, click Next, and then click Finish. 7. In the InfoSphere BigInsights Console, on the Applications page, click Manage, refresh the applications catalog, select your application, and then click Deploy. 8. In the Configuration column, click the settings icon. Under Security, select all available groups and then click Save. 9. On the Applications panel, click Run, and then select your application from the list. Your application prompts the user to specify a search term or phrase and includes the default value that you specified in your input parameters. You can browse to select an existing file in the distributed file system, or enter a new location for the output file. Accept the default search term, and then under Execution, click Run. 10. After your application runs, open the InfoSphere BigInsights Console, click Files, and then navigate to the location that your specified for the output file. 11. Optional: In the InfoSphere BigInsights Tools for Eclipse, expand the WriteMessage project, and then expand BIApp > application. a. Open the application.xml file that the InfoSphere BigInsights Tools for Eclipse generated. b. On the Design tab of the XML editor, expand workflow > action > jaql to view the values that you set for the workflow properties. c. Expand application-template > properties > property to view the values that you set for your input parameters.
Summary of developing your first big data application

In this tutorial, you created your first application, published it to the InfoSphere BigInsights applications catalog, deployed your application to the cluster, and then ran the application from the InfoSphere BigInsights Console.
Lessons learned
You now have a good understanding of the following tasks: v Creating an InfoSphere BigInsights project by using the InfoSphere BigInsights Tools for Eclipse. v Establishing a server connection and testing your application. v Publishing your application to the InfoSphere BigInsights applications catalog. v Deploying your application to the cluster and running it in the InfoSphere BigInsights Console. v Upgrading your application to accept input parameters and redeploying it to the InfoSphere BigInsights Console.
31
For more information about developing applications with InfoSphere BigInsights, see this video on the IBM Big Data channel on YouTube.
32
Chapter 6. Tutorial: Developing Big SQL queries to analyze big data

Learn how to use IBM Big SQL, an SQL language processor, to summarize, query, and analyze data in a data warehouse system for Hadoop. Big SQL provides broad SQL support that is typical of commercial databases. You can issue queries using JDBC or ODBC drivers to access data that is stored in InfoSphere BigInsights, in the same way that you access databases from your enterprise applications. You can use the Big SQL server to execute standard SQL queries. Multiple queries can be executed concurrently. Big SQL provides support for large ad hoc queries by using MapReduce parallelism and point queries, which are low-latency queries that return information quickly to reduce response time and provide improved access to data. The Big SQL server is multi-threaded, so scalability is limited only by the performance and number of CPUs on the computer that runs the server. If you want to issue larger queries, you can increase the hardware performance of the server computer that Big SQL runs on. This tutorial uses data from the fictional Sample Outdoor Company. The Sample Outdoor Company sells and distributes outdoor products to third-party retailer stores around the world. The company also sells directly to consumers through its online store. For the last several years, the company has steadily grown into a worldwide operation, selling their line of products to retailers in nearly every part of the world. The fictional Sample Outdoor company began as a business-to-business operation. The company does not manufacture its own products; rather, the products are manufactured by a third party, then the products are sold to third-party retailers. The company is expanding its business by creating a presence on the web. Now they also sell directly to consumers through their online store. You will learn more about the products and sales of the fictional Sample Outdoor Company by running Big SQL queries and doing the analysis on the data in the following lessons.
Learning objectives
You will use the InfoSphere BigInsights Tools for Eclipse to create Big SQL queries so that you can extract large subsets of data for analysis. After you complete the lessons in this module, you will understand the concepts and know how to do the following actions: v Connect to the Big SQL server by using the InfoSphere BigInsights Tools for Eclipse. v Use the InfoSphere BigInsights Tools for Eclipse to load sample data and to create queries. v Use BigSheets to analyze data generated from Big SQL queries. v Use the InfoSphere BigInsights Console to run Big SQL queries. v Use the InfoSphere BigInsights Tools for Eclipse to export data.
33
Time required
This module should take approximately 1 hour to complete, depending on whether you also complete the optional lessons.
Lesson 1: Connecting to the IBM Big SQL server

In this lesson, you view the JDBC connection, learn how to navigate to the Data Management view in Eclipse, and how to open the SQL editor. Big SQL is installed with the IBM InfoSphere BigInsights product in the following directory: $BIGSQL_HOME/bin/bigsql. Prerequisites Ensure that you complete the following actions before you begin the tutorial: 1. Review the status of Big SQL. From the InfoSphere BigInsights Console, click the Cluster Status page. Then, verify that the Big SQL node is in a Running state on the Big SQL Server Summary page. 2. Start the Big SQL service if it is not already started with one of the following actions: From the command-line type the following command:
<$BIGSQL_HOME>/bin/bigsql start
Tip: $BIGSQL_HOME usually refers to /opt/ibm/biginsights/bigsql. From the InfoSphere BigInsights Console a. Click the Cluster Status page. b. Select Big SQL Server and click Start in the Big SQL Server Summary page if the Big SQL server is not started. 3. Download Eclipse Version 3.6.x and then enable your Eclipse environment for InfoSphere BigInsights application development. Learn more about another way to interface with the Big SQL server: For the purposes of this tutorial, Eclipse is used as a client for the Big SQL server. Outside of this tutorial, you can also use Java SQL Shell (JSqsh), a command-line client. a. Download and install the Eclipse IDE 3.6.x for Java EE developers from the Eclipse website. For more information about installing Eclipse, see Installing the InfoSphere BigInsights Tools for Eclipse. b. From the InfoSphere BigInsights Console Welcome page, find the Quick Links section, and click Enable your Eclipse development environment for BigInsights application development. c. Choose one of the options to install the InfoSphere BigInsights Tools for Eclipse based on whether you want to install from the web server directly or download an archive with the plug-ins to your computer first. 4. Start Eclipse, and complete the instructions in Installing the InfoSphere BigInsights Tools for Eclipse to add the InfoSphere BigInsights Tools for Eclipse. 5. Ensure that a JDBC driver definition exists. Generally, either a JDBC or an ODBC driver definition must exist before you can create a connection to a Big SQL server. For the purposes of this tutorial, only a JDBC driver definition is required. If you created an InfoSphere BigInsights server connection to a particular InfoSphere BigInsights server, the Big SQL JDBC driver connection and the Big
34
SQL JDBC connection profile were both created for you automatically.For more information about creating a connection profile, click Help when you are in the InfoSphere BigInsights Eclipse tools, and find Developing Big SQL queries > Creating a connection profile. Procedures 1. From the IBM InfoSphere BigInsights Eclipse environment, click Window > Open Perspective > Other > Database Development. Ensure that the Data Source Explorer view is open. 2. From the Data Source Explorer view, expand the Database Connections folder. 3. Right-click the Big SQL connection, and select Properties. Verify the host name and other connection information. 4. Click Test Connection to make sure that you have a valid connection to the Big SQL server. If you do not have a connection, follow these steps: a. From the Data Source Explorer view, right-click the Database Connections directory and click Add Repository. b. In the New Connection Profile window, select Big SQL JDBC in the Connection Profile Types list. c. Optional: Edit the text in the Name field or the Description field, and click Next. d. Optional: In the Specify a Driver and Connection Details window, select a Big SQL Driver from the Drivers list. You can add or modify a driver by clicking the New Driver Definition icon or the Edit Driver Definition icon. e. On the General page, complete the connection properties: Schema Use the default name, which is default, or specify a valid schema or database name. The catalog (syscat) stores the valid schema and database names. Host The name of the host where the Big SQL server is running, such as my.server.com
Port number The port number that is used by the Big SQL server. The default port number is 7052. User name The name of the user ID that is used to connect to the Big SQL server. Password The password that is associated with the user name. Note: If the InfoSphere BigInsights server is installed with security, you must provide a valid InfoSphere BigInsights user ID and password to connect when you use the Big SQL JDBC driver. Otherwise, enter any user ID and password in the User name and Password fields to enable the Finish or Next and Test Connection buttons. For example, you can type the following information: User name user Password user
35
f. Specify the Save password check box to save the password for the Big SQL server login. g. The Connection URL field displays the URL that is generated. For example,
jdbc:bigsql://<host_name>:7052/default
h. If the Big SQL server is configured with SSL authentication, open the Optional page to view or add an SSL property according to the $BIGSQL_HOME/bigsql/conf/bigsql-conf.xml configuration file from the target Big SQL server. i. Specify the Connect when the wizard completes check box to connect to the Big SQL server after you finish defining this connection profile. j. Specify the Connect every time the workbench is started to connect to the Big SQL server automatically by using this connection profile when you launch this Eclipse workspace. k. Click Test Connection to ping the server and to verify that the connection profile is working. l. Click Finish to create the connection profile. For information about how to create a simple JDBC application that opens a database connection, and runs a Big SQL query, see Lesson 4: Running basic Big SQL queries on page 42.
Lesson 2: Creating a project and an SQL script file

You can create Big SQL scripts and run them from the InfoSphere BigInsights Eclipse environment. Procedures 1. Create an InfoSphere BigInsights project in Eclipse. a. From the Eclipse menu bar, click File > New > Other. b. In the Select a wizard window, expand the BigInsights folder, select BigInsights Project, and then click Next. c. Type myBigSQL in the Project name field, and then click Finish. d. If you are not already in the BigInsights perspective, a Switch to the BigInsights perspective window opens. Click Yes to switch to the BigInsights perspective. 2. Create an SQL script file. a. From the Eclipse menu bar, click File > New > Other. b. In the Select a wizard window, expand the BigInsights folder, and select SQL Script, and then click Next. c. In the New SQL File window, in the Enter or select the parent folder field, select myBigSQL. Your new SQL file is stored in this project folder. d. In the File name field, type aFirstFile. The .sql file extension is added automatically. e. Click Finish. 3. In the Select Connection Profile window, select the Big SQL connection. The properties of the selected connection display in the Properties field. When you select the Big SQL connection, the Big SQL database-specific context assistant and syntax checks are activated in the editor that is used to edit your SQL file. 4. Click Finish to close the Select Connection Profile window. 5. In the SQL Editor that opens the aFirstFile.sql SQL file you created, add the following Big SQL comments:
36
--This is a beginning SQL script --These are comments. Any line that begins with two -- dashes is a comment line, -- and is not part of the processed SQL statements.
Some lines in the file contain two dashes in front of text. Those dashes mark the text as comment lines. Comment lines are not part of the processed code. It is always useful to comment your Big SQL code as you write the code, so that the reasons for using some statements are clear to all who must use the file. You will be using this file in later lessons. 6. Save aFirstFile.sql by using the keyboard short, CTRL-S.
Lesson 3: Creating tables and loading data

In this lesson, you will create the tables, and load the data that you will use to run queries and create reports about the fictional Sample Outdoor Company. The examples in this tutorial use Big SQL tables and data that are managed by Hive. This tutorial uses sample data that is provided in the $BIGSQL_HOME/samples folder on the local file system of the InfoSphere BigInsights server. By default, the $BIGSQL_HOME environment variable is set to the installed location, which is /opt/ibm/biginsights/bigsql/. The time range of the fictional Sample Outdoor Company data is 3 years and 7 months, starting January 1, 2004 and ending July 31, 2007. The 43-month period reflects the history that is made available for analysis. The schema that is used in this tutorial is the GOSALESDW. It contains fact tables for the following areas: v Distribution v Finance v Geography v Marketing v Organization v Personnel v Products v Retailers v Sales v Time. Procedures 1. Access the sample SQL files by using one of the following methods: Running from a Linux command-line On the Linux machine that is associated with the cluster, complete the following steps: a. Start jsqsh to ensure that it has performed its initial configuration: 1) Change your directory to $BIGSQL_HOME/bin. 2) Type ./jsqsh. 3) If it is the first time that you have run jsqsh, you will be prompted to enter the jsqsh setup process. Just click Enter. 4) Choose (Q)uit at the JSQSH SETUP WIZARD screen prompt and then type quit.
37
b. Change your directory to $BIGSQL_HOME/samples, or /opt/ibm/biginsights/bigsql/samples if you installed in the default location. c. Open and review the README file. d. Type the setup command:
./setup.sh -u <user_name> -p <password> -s <host_name> -n <port_number>
The host_name and port_number parameters are optional unless you are running on a distributed cluster or if authentication is enabled. Then, you must use at least the -u, -p, and -s parameters. e. The script runs when you see this message
Loading the data .. will take a few minutes to complete .. Please check /var/ibm/biginsights/bigsql/ temp/GOSALESDW_LOAD.OUT file for progress information.
The script runs three files: GOSALESDW_drop.sql, GOSALESDW_ddl.sql, and GOSALESDW_load.sql. f. You can see the tables in the InfoSphere BigInsights Console Files tab: hdfs://biginsights/hive/warehouse/gosalesdw.db Running from an InfoSphere BigInsights Eclipse project a. On the InfoSphere BigInsights Console, open the Files tab. b. Create a directory that you can use to hold the DDL. c. On the InfoSphere BigInsights Console, open the Applications tab. d. Run the Distributed File Copy with these parameters, making sure that you resolve the $BIGSQL_HOME environment variable to /opt/ibm/biginsights/bigsql/ if you installed in the default location: Note: If the Distributed File Copy application is not yet deployed on your cluster, see Lesson 3: Importing data by using the Distributed File Copy application on page 10 for instructions. In the following command, the Input path is for the Linux file system that is connected to the InfoSphere BigInsights cluster. You use the user name and password of that cluster. For example, if the user name is user and the password is mypw, the input path is sftp://user:mypw@my.server.com:22/opt/ibm/biginsights/bigsql/ samples/queries. Do not use 8080 as the port number because it is not available.
Execution name: someName Input path: sftp://<user_name>:<password>@hostname:<port>/ <$BIGSQL_HOME>/samples/queries Output path: hdfs://<path_to_directory_that_you_created>
Note: For information about using a credential file with the Distributed File Copy application, see Lesson 3: Importing data by using the Distributed File Copy application. A credential file might be required depending on your particular environment. e. Click Run. The queries folder, which was the folder that you requested in the input path of the Distributed File Copy application, is uploaded to the Hadoop Distributed File System. f. From the queries folder in Hadoop Distributed File System (HDFS), download the GOSALESDW_drop.sql, GOSALESDW_ddl.sql, and
38
GOSALESDW_load.sql files to a local directory. These files are the only SQL scripts that you need to drop the tables, or create and populate the tables. There are more SQL scripts in this folder, but the remaining files are for your use when you want to explore with more samples. g. Then, copy each of the three *.sql files from your local path so that you can use them in Eclipse. h. In InfoSphere BigInsights Eclipse, right-click the myBigSQL project, and click Paste. i. From the Project Explorer in the InfoSphere BigInsights Eclipse tools, expand the myBigSQL project. Double-click a file to open it in the Big SQL editor. You can then run the statements in the file from the editor: 1) To drop any tables in the GOSALESDW schema that might have already been created, from the Big SQL editor, click the Run SQL icon in file GOSALESDW_drop.sql. The Run SQL icon is in the menu bar at the top of the file. 2) To create the tables and the GOSALESDW schema, from the Big SQL editor, click the Run SQL icon in file GOSALESDW_ddl.sql. 3) To load the data in the GOSALESDW tables, from the Big SQL editor, click the Run SQL icon in file GOSALESDW_load.sql. Learn more about loading data: In some cases, you might want to access data that is available to you from the HDFS file system. Change the LOAD statement to the following example, assuming that data for table SLS_PRODUCT_DIM is in a directory that is called testSQL in the tmp folder of the Files page of the InfoSphere BigInsights Console:
load hive data inpath /tmp/testSQL/SLS_PRODUCT_DIM.txt overwrite into table <someSchema>.SLS_PRODUCT_DIM;
For information about all the Big SQL LOAD statement options, see LOAD statement Learn more about the SQL editor window: In the SQL editor window, you can run SQL statements as you edit them, select connection profiles, and import or export SQL statements. To run an individual statement, highlight the whole statement and press F5. To run the entire SQL file, click the Run SQL icon that is in the menu bar at the top of the SQL file. You can use the context assistant to help you complete the statements. A syntax checker adds a red indicator next to any invalid line. You can hover over that indicator to see why a problem occurred. Running from the Big SQL Console (by copying and pasting statements into the window) a. On the InfoSphere BigInsights Console, open the Files tab. b. Create a directory that you can use to hold the DDL. c. On the InfoSphere BigInsights Console, open the Applications tab.
39
d. Run the Distributed File Copy with these parameters, making sure that you resolve the $BIGSQL_HOME environment variable to /opt/ibm/biginsights/bigsql/ if you installed in the default location: Note: If the Distributed File Copy application is not yet deployed on your cluster, see Lesson 3: Importing data by using the Distributed File Copy application on page 10 for instructions. In the following command, the Input path is for the Linux file system that is connected to the InfoSphere BigInsights cluster. You use the user name and password of that cluster. For example, if the user name is user and the password is mypw, the input path is sftp://user:mypw@my.server.com:22/opt/ibm/biginsights/bigsql/ samples/queries. Do not use 8080 as the port number because it is not available.
Execution name: someName Input path: sftp://<user_name>:<password>@hostname:<port>/ <$BIGSQL_HOME>/samples/queries Output path: hdfs://<path_to_directory_that_you_created>
Note: For information about using a credential file with the Distributed File Copy application, see Lesson 3: Importing data by using the Distributed File Copy application. A credential file might be required depending on your particular environment. e. Click Run. The queries folder, which was the folder that you requested in the input path of the Distributed File Copy application, is uploaded to the Hadoop Distributed File System. f. From the HDFS path, select the GOSALESDW_drop.sql file to open it in the right pane of the Console. Copy all of the statements from the file. Open the Big SQL Console, and paste the statements into the top pane of the Console. Click Run. g. From the HDFS path, download the GOSALESDW_ddl.sql file to your local file system. This file is too large to display in the InfoSphere BigInsights Console window. Open the file and copy all of the statements from the file. Open the Big SQL Console, and paste the statements into the top pane of the Console. Click Run. h. From the HDFS path, select the GOSALESDW_load.sql file to open it in the right pane of the Console. Copy all of the statements from the file. Open the Big SQL Console, and paste the statements into the top pane of the Console. Click Run. 2. Now that you have run the GOSALESDW_ddl.sql file, examine it to learn something about what you just did. a. The first line of this file creates the database.
create database if not exists gosalesdw;
A Big SQL database schema is a way to logically group objects, such as tables or functions. b. The second line of this file declares that the database is to be used to contain the tables.
use gosalesdw;
The USE clause establishes a default schema or database for the session. All unqualified table names that are referenced in Big SQL statements and DDL statements default to this schema. The default schema is default.
40
In the later lessons, the USE clause is not used, because all of the tables that are referenced are fully qualified. That means that you include an unambiguous schema name as part of the table name. Therefore, instead of running the statement as SELECT * FROM go_region_dim;, use the fully qualified reference as SELECT * FROM GOSALESDW.go_region_dim, where GOSALESDW is the schema or database name. c. The remaining statements create the tables and the columns within each table that will be used during this tutorial. 3. Since you connected to an InfoSphere BigInsights server when you first opened the file, the tables are created on that server. If the results of the create statements are successful, you can view the tables in the cluster. Click the Files tab in the InfoSphere BigInsights Console and look for the GOSALESDW database at hdfs/biginsights/hive/warehouse/gossalsdw.db. In the next lesson, you will see another way to verify that tables actually exist, by using the SELECT * FROM <table_name> statement. You can view the results of each Big SQL statement that you run from the Eclipse tools in the SQL Results view in the current perspective of the Eclipse tools. Any errors are listed and any result tables are displayed. Learn more about the SQL Results view: The SQL Results view contains the results of your SQL statements or scripts. You can change the look and feel of the results page, and the number of rows that get returned from each query (the default is 500). Follow these steps if you want to change the number of rows that get returned: a. From the Eclipse menu bar, click Window > Preferences. b. From the Preferences window, click Data Management > SQL Development > SQL Results View Options. c. In the SQL Results View Options window, find the Max row count field and increase the value from the default of 500. This value controls the number of rows that are retrieved. A value of zero retrieves all rows. d. In the Max display row count field, increase the value from the default of 500. This value controls the number of rows that you see. A value of zero displays all rows. Be aware that making this number too large can produce performance problems. e. Click OK to save your changes. By default, the view contains two panes. v The left pane contains the Status column and the Operation column. Status View the status of the current statement or the status of previous statements in the Status column. Possible values include the following values: Started, Warning, Succeeded, Terminated, or Failed. If you run multiple statements, you can see the status in a tree structure. Expand the status line to see the results for each individual statement. In the cases where you received a Warning, your statement might still show a Succeeded as the final status in the expanded list of status. Operation The statement that was run. v The right pane contains three tabs: Status Shows the statement that was run and any messages that were generated. The query execution time is also included.
41
Parameters For a routine procedural object, displays the input and output parameter names, data types, values, and parameter types. Result1 Shows the tables and columns that are generated from the statement, if the statement produces output. Right-click this result to Save, Export, or Print the results. For more information about accessing the InfoSphere BigInsights Console, see Lesson 1: Starting to use the InfoSphere BigInsights Console on page 3 in Tutorial: Managing your big data environment.
Lesson 4: Running basic Big SQL queries

You now have tables with data that represents the fictional Sample Outdoor Company. In this lesson, you will explore some of the basic Big SQL queries, and begin to understand the sample data. You will use the SQL file that you created in Lesson 2: Creating a project and an SQL script file on page 36 to test the tables and data that you created in the previous lesson. In this lesson, several sample SQL files that are included with the downloaded Big SQL samples can be used to examine the data, and to learn how to manipulate Big SQL. Procedures 1. From the Eclipse Project Explorer, open the myBigSQL project, and double-click the aFirstFile.sql file. 2. In the SQL editor pane, type the following statement:
SELECT * FROM GOSALESDW.GO_REGION_DIM;
Each complete SQL statement must end with a semi-colon. The statement selects, or fetches, all the rows that exist in the GO_REGION_DIM table. 3. Click the Run SQL icon. Depending on how much data is in the table, a SELECT * statement might take some time to complete. Your result should contain 21 records or rows. You might have a script that contains several queries. When you want to run the entire script, click the Run SQL icon. When you want to run a specific statement, or set of statements, and you include the schema name with the table name (gosalesdw.go_region_dim), highlight the statement or statements that you want to run and press F5. 4. Improve the SELECT * statement by adding a predicate to the second statement to return fewer rows. A predicate is a condition on a query that reduces and narrows the focus of the result. A predicate on a query with a multi-way join can improve the performance of the query.
SELECT * FROM gosalesdw.go_region_dim WHERE region_en LIKE Amer%;
5. Click Run SQL to run the entire script. This query results in four records or rows. It might run more efficiently than the original statement because of the predicate. 6. You can learn something more about the structure of the table with some queries to the syscat table. The Big SQL catalog tables provide metadata support to the database. The Big SQL catalog consists of four tables in the schema SYSCAT: schemas, tables, columns, and indexcolumns. Type the following query, and this time, select the statement, and press F5.
42
SELECT * FROM syscat.columns WHERE tablename=go_region_dim AND schemaname=gosalesdw;
This query uses two predicates in a WHERE clause. The query finds all of the information from the syscat.columns table as long as the tablename is 'go_region_dim' and the schemaname is 'gosalesdw'. Since you are using AND, both predicates must be true to return a row. Use single quotation marks around string values. The result of the query to the syscat.columns table is the metadata, or the structure of the table. Look at the SQL Results tab in Eclipse to see your output. The SQL Results tab in Eclipse shows 54 rows as your output. That means that there are 54 columns in the table go_region_dim. 7. In addition to the query to learn about the structure of a table, you can also run a query that returns the number of rows in a table. Type the following query, select the statement, and then press F5.
SELECT COUNT(*) FROM gosalesdw.go_region_dim;
The COUNT aggregate function returns the number of rows in the table, or the number of rows that satisfy the WHERE clause in the SELECT statement, if a WHERE clause was part of the statement. The result is the number of rows in the set. A row that includes only null values is included in the count. In this example, there are 21 rows in the go_region_dim table. Learn more about aggregate functions: The COUNT and COUNT(*) statements are part of a group of statements that are called the aggregate functions. Aggregate functions return a single scalar value per group. Other aggregate functions are AVG, MAX, MIN, and SUM. To learn more about aggregate functions, see Aggregate functions 8. Another form of the count function is the COUNT (distinct <expression>) statement. As the name implies, you can determine the number of unique values in a column, such as region_en. Type this statement in your SQL file:
SELECT COUNT (distinct region_en) FROM gosalesdw.go_region_dim;
The result is 5. This result means that there are five unique region names in English (the column name region_en). 9. Another useful statement in Big SQL is the LIMIT clause. The LIMIT clause specifies a limit on the number of output rows that are produced by the SELECT statement for a given document. Type this statement in your SQL file, select the statement, and then press F5:
SELECT * FROM GOSALESDW.DIST_INVENTORY_FACT limit 50;
The statement returns 53837 rows without the limit clause and depending on the setting in the Eclipse preferences for the SQL Results View. Refer to the previous lesson about the SQL Results view to verify your settings. If there are fewer rows than the LIMIT value, then all of the rows are returned. This statement is useful to see some output quickly. 10. Save your SQL file. Learn more about using Big SQL from a JDBC client application: You can use Big SQL through a JDBC application to open a database connection, run a Big SQL query, and then display the results of the query. a. In the IBM InfoSphere BigInsights Eclipse environment, create a Java project by clicking File > New > Project. From the New Project window, select Java Project. Click Next. b. Type a name for the project in the Project Name field, such as MyJavaProject. Click Next.
43
c. Open the Libraries tab and click Add External Jars. Select the Big SQL JDBC driver from your local path (such as bigsql-jdbc-driver.jar). d. Click Finish. Click No when you are asked if you want to open a different perspective. e. Right-click the MyJavaProject project, and click File > New > Package. In the Name field, in the New Java Package window, type a name for the package, such as aJavaPackage4me. Click Finish. f. Right-click the aJavaPackage4me package, and click File > New > Class. g. In the New Java Class window, in the Name field, type SampApp. Select the public static void main(String[] args) check box. Click Finish. h. Copy or type the following code into the SampApp.java file:
package aJavaPackage4me; i //Import required package(s) import java.sql.*; public class SampApp { /** * @param args */ ii // set JDBC & database info static final String db = "jdbc:bigsql://<host_name>:7052/<schema_name>"; static final String user = "<user_name>"; static final String pwd = "<user_password>"; public static void main(String[] args) { Connection conn = null; Statement stmt = null; System.out.println("Started sample JDBC application."); try{ iii // Register JDBC driver Class.forName("com.ibm.biginsights.bigsql.jdbc.BigSQLDriver"); iv // Get a connection conn = DriverManager.getConnection(db, user, pwd); System.out.println("Connected to the database."); // Execute a query stmt = conn.createStatement(); System.out.println("Created a statement."); String sql; sql = "select * from gosalesdw.sls_product_dim " + "where product_key=30001"; ResultSet rs = stmt.executeQuery(sql); System.out.println("Executed a query.");
vi
// Obtain results System.out.println("Result set: "); while(rs.next()){ //Retrieve by column name int product_key = rs.getInt("product_key"); int product_number = rs.getInt("product_number"); //Display values System.out.print("* Product Key: " + product_key + "\n"); System.out.print("* Product Number: " + product_number + "\n"); } vii // Close open resources rs.close(); stmt.close(); conn.close(); }catch(SQLException sqlE){ // Process SQL errors
44
sqlE.printStackTrace(); }catch(Exception e){ // Process other errors e.printStackTrace(); }finally{ // Ensure resources are closed before exiting try{ if(stmt!=null) stmt.close(); }catch(SQLException sqle2){ } // nothing we can do try{ if(conn!=null) conn.close(); }catch(SQLException sqlE){ sqlE.printStackTrace(); }// end finally block }// end try block System.out.println("Application complete"); } }
1) After the package declaration, ensure that you include the packages that contain the JDBC classes that are needed for database programming. 2) Set up the database information so that you can refer to it. 3) Then, register the JDBC driver so that you can open a communications channel with the database. 4) Open the connection. 5) Run a query by submitting an SQL statement to the database. 6) Extract data from result set. 7) Clean up the environment by closing all of the database resources. i. Save the file and right-click the Java file and click Run as > Java Application. j. The results show in the Console view of Eclipse:
Started sample JDBC application. Connected to the database. Created a statement. Executed a query. Result set: * Product Key: 30001 * Product Number: 1110 Application complete
Lesson 5: Analyzing the data with Big SQL

In this lesson, you create and run Big SQL queries so that you can better understand the products and market trends of the fictional Sample Outdoor Company. Procedures 1. Create an SQL file. a. From the Eclipse menu, click File > New > Other. b. In the Select a wizard window, expand the BigInsights folder, select SQL Script from the list of wizards, and then click Next.
45
c. In the New SQL File window, select the myBigSQL project folder in the Enter or select the parent folder field. d. In the File name field, type companyInfo. The .sql file extension is added automatically. e. Click Finish. 2. To learn what products were ordered from the fictional Sample Outdoor Company, and by what method they were ordered, you must join information from multiple tables in the gosalesdw database because it is a relational database where not everything is in one table. a. Type or copy the following comments and statement into the companyInfo.sql file:
--Fetch the product name and the quantity and -- the order method. --Product name has a key that is part of other -- tables that we can use as a join predicate. --The order method has a key that we can use -- as another join predicate. --Query 1 SELECT pnumb.product_name, sales.quantity, meth.order_method_en FROM gosalesdw.sls_sales_fact sales, gosalesdw.sls_product_dim prod, gosalesdw.sls_product_lookup pnumb, gosalesdw.sls_order_method_dim meth WHERE pnumb.product_language=EN AND sales.product_key=prod.product_key AND prod.product_number=pnumb.product_number AND meth.order_method_key=sales.order_method_key;
Because there is more than one table reference in the FROM clause, the query can join rows from those tables. A join predicate specifies a relationship between at least one column from each table to be joined. v The predicates such as prod.product_number=pnumb.product_number help to narrow the results to product numbers that match in two tables. v This query also uses an alias in the SELECT and FROM clauses, such as pnumb.product_name. pnumb is the alias for the gosalesdw.sls_product_lookup table. That alias can now be used in the where clause so that you do not need to repeat the complete table name, and the WHERE clause is not ambiguous. v The use of the predicate and pnumb.product_language=EN helps to further narrow the result to only English output. This database contains thousands of rows of data in various languages, so restricting the language provides some optimization. b. Highlight the statement, beginning with the keyword SELECT and ending with the semi-colon, and press F5. Review the results in the SQL Results page. You can now begin to see what products are sold, and how they are ordered by customers.
Table 1. A portion of the results product_name Course Pro Putter Blue Steel Max Putter quantity 587 214 order_method_en Telephone Telephone
46
Table 1. A portion of the results (continued) product_name Course Pro Gloves Glacier Deluxe quantity 576 129 order_method_en Telephone Sales visit
c. By default, the Eclipse SQL Results page limits the output to only 500 rows. You can change that value in the Data Management preferences. However, to find out how many rows the query actually returns in a full Big SQL environment, type the following query into the companyInfo.sql file, then select the query, and then press F5:
--Query 2 SELECT COUNT(*) --(SELECT pnumb.product_name, sales.quantity, -- meth.order_method_en FROM gosalesdw.sls_sales_fact sales, gosalesdw.sls_product_dim prod, gosalesdw.sls_product_lookup pnumb, gosalesdw.sls_order_method_dim meth WHERE pnumb.product_language=EN AND sales.product_key=prod.product_key AND prod.product_number=pnumb.product_number AND meth.order_method_key=sales.order_method_key;
The result for the query is 446,023 rows. 3. Update the query that is labeled --Query 1 to restrict the order method to equal only 'Sales visit'. Add the following string just before the semi-colon:
AND order_method_en=Sales visit
4. Select the entire --Query 1 statement, and press F5. The result of the query displays in the SQL Results page in Eclipse. The results now show the product and the quantity that is ordered by customers actually visiting a retail shop. 5. To find out which method of all the methods has the greatest quantity of orders, you must add a GROUP BY clause (group by pll.product_line_en, md.order_method_en). You will also use a SUM aggregate function (sum(sf.quantity)) to total the orders by product and method. In addition, you can clean up the output a bit by using an alias (as Product) to substitute a more readable column header.
SELECT pll.product_line_en AS Product, md.order_method_en AS Order_method, sum(sf.QUANTITY) AS total FROM gosalesdw.sls_order_method_dim AS md, gosalesdw.sls_product_dim AS pd, gosalesdw.sls_product_line_lookup AS pll, gosalesdw.sls_product_brand_lookup AS pbl, gosalesdw.sls_sales_fact AS sf WHERE pd.product_key = sf.product_key AND md.order_method_key = sf.order_method_key AND pll.product_line_code = pd.product_line_code AND pbl.product_brand_code = pd.product_brand_code GROUP BY pll.product_line_en, md.order_method_en;
6. Select the complete statement, and press F5. Your results in the SQL Results page show 35 rows. The result table should look like the following output:
47
Table 2. Total quantity ordered by product and order method Product Camping Equipment Camping Equipment Camping Equipment Camping Equipment Camping Equipment Camping Equipment Camping Equipment Golf Equipment Golf Equipment Golf Equipment Golf Equipment Golf Equipment Golf Equipment Golf Equipment Mountaineering Equipment Mountaineering Equipment Mountaineering Equipment Mountaineering Equipment Mountaineering Equipment Mountaineering Equipment Mountaineering Equipment Outdoor Protection Outdoor Protection Outdoor Protection Outdoor Protection Outdoor Protection Outdoor Protection Outdoor Protection Personal Accessories Personal Accessories Personal Accessories Personal Accessories Personal Accessories Personal Accessories Personal Accessories Order_method E-mail Fax Mail Sales visit Special Telephone Web E-mail Fax Mail Sales visit Special Telephone Web E-mail Fax Mail Sales visit Special Telephone Web E-mail Fax Mail Sales visit Special Telephone Web E-mail Fax Mail Sales visit Special Telephone Web total 1413084 413958 348058 2899754 203528 2792588 19230179 333300 102651 80432 263788 38585 601506 3693439 199214 292408 81259 1041237 93856 549811 7642306 905156 311583 328098 1601526 183075 1836347 6848660 791905 359414 115208 1007107 117758 1472592 31043721
7. Save the companyInfo.sql file.
Lesson 6: Publishing a Big SQL application

In this optional lesson, you will learn how to publish an application that can run a simple Big SQL script.
48
After you create scripts that contain Big SQL statements, you can create applications that can run those Big SQL scripts with variables. For more information about developing InfoSphere BigInsights applications, click Help in the InfoSphere BigInsights Eclipse environment, and search for BigInsights > Developing BigInsights Applications. Procedures 1. In the Project Explorer, right-click your Big SQL project, and select BigInsights Application Publish. 2. From the BigInsights Application Publish window, complete the wizard with the following information:
Option Location page Description Verify the server name and click Next. If you do not have a valid connection to a Big SQL server, you can create one by clicking Create. For information about creating a connection to a Big SQL server, click Help from the InfoSphere BigInsights Eclipse, and search for Developing Big SQL queries > Creating a connection profile. See also, Lesson 1: Connecting to the IBM Big SQL server on page 34. Application page 1. Specify to Create New Application. 2. Type a unique application name in the Name field, such as myBigSQLApp. 3. You can type some text in the Description field. This is optional, but can help others to understand how to run your application when it is deployed on the server. 4. Optionally, select an icon that contains a visual identifier for this application file from the local file path by clicking Browse. You can use a default icon that is provided by the server. 5. List one or more categories that you want the application to be listed under. This is a helpful tag when you search for new applications to deploy. 6. Click Next. Type page Select Workflow as the Application Type, and click Next.
49
Option Workflow page
Description 1. Specify to Create a new single action workflow.xml file from a workflow. Learn more about creating applications: You can create an application from the Eclipse Project Explorer. When you do, the system automatically creates a workflow.xml file in the BIApp folder inside your project. Then you can double-click the workflow.xml file and the Workflow editor opens with a Workflow Elements page and the workflow.xml page. The Workflow Elements page is a graphical user interface to create the actions for your application. Click the Help in Eclipse Tools for more information. For more information about creating applications, see InfoSphere BigInsights applications. 2. Select the Action Type from the list, in this case Big SQL. 3. In the Properties pane, list the properties that you need. For the purposes of this lesson, you need to list the name of the script that contains your Big SQL statements so that you can run that script from the InfoSphere BigInsights server. Select Script from the list. The value is companyInfo.sql. You will also need the credentials properties file for connecting to Big SQL. Add the property credentials_prop and the value /user/biadmin/credstore/private/ bigsql.properties, assuming that the user name for the cluster you are on is biadmin. Remember: Before running this application, you must create this credentials property file, bigsql.properties. Its contents include the properties user, password, server, and port. For more information about the Big SQL elements, and credentials properties, see Big SQL action. 4. Click Next.
Parameters page Publish page
Click New if you have parameters to add. In this case, click Next. You see the structure of your application in the preview of the BIApp.zip file. You can click Add to add other scripts or supporting information. Click Finish
The application that you just published from Eclipse is now in the InfoSphere BigInsights Console. Click the Application page, and click Manage to see the new application. From here you can deploy so that it is available to users.
50
Lesson 7: Analyzing the IBM Big SQL data in BigSheets

In this lesson, you will run queries and then export the results to other applications, such as BigSheets. You will also see how to use a BigSheets workbook as input to a Big SQL table. Each feature of InfoSphere BigInsights provides powerful insights from manipulating and analyzing data. But even more powerful is the working relationship between the features, such as BigSheets and IBM Big SQL. Procedures 1. Create an SQL file called org.sql: a. From the Eclipse menu, click File > New > Other. b. In the Select a wizard window, expand the BigInsights folder, select SQL Script from the list of wizards, and then click Next. c. In the New SQL File window, select the myBigSQL project folder in the Enter or select the parent folder field. d. In the File name field, type org. The .sql file extension is added automatically. e. Click Finish. 2. Double-click the file to open it in the SQL editor. 3. To see the result of sales by year, you can make use of some unique features of Big SQL. The first feature is the WITH clause, which creates an inline view with a table name. You can refer to that table name in any FROM clause of the fullselect that follows the WITH clause. The second feature is the RANK OVER function, which efficiently simplifies the ranking of a column value, such as sale_total. For example, type or copy the following code into the org.sql file:
WITH sales ( year, total_sales, ranked_sales) AS ( SELECT EXTRACT(YEAR FROM CAST(CAST (order_day_key AS varchar(100)) AS timestamp)) AS year, SUM (sale_total) AS total_sales, RANK () OVER (ORDER BY SUM (sale_total) DESC) AS ranked_sales FROM gosalesdw.sls_sales_fact GROUP BY EXTRACT(YEAR FROM CAST(CAST (order_day_key AS varchar(100)) AS timestamp)) ) SELECT year, total_sales, ranked_sales FROM sales ORDER BY year, ranked_sales DESC;
4. Click the Run SQL icon. The statement shows the inline view, sales, which simplifies the final SELECT statement. In addition, the nested aggregate functions demonstrate how the data types and presentation can be manipulated.
Table 3. Results of a query to obtain the top sales by year year 2004 2005 2006 2007 total_sales 914352803.72 1159195590.16 1495891100.90 1117336274.07 ranked_sales 4 2 1 3
51
5. To see the quantity of products that are ordered by the brand and organization, create an inline view by using a WITH clause. You can create multiple inline views, such as SALES and INVENTORY by using a single WITH clause. Type the following statement:
WITH sales AS (SELECT sf.product_key AS prod_key, sf.quantity AS quantity, pll.product_line_en AS product, pbl.product_brand_en as brand FROM gosalesdw.sls_order_method_dim AS md, gosalesdw.sls_product_dim AS pd, gosalesdw.sls_product_line_lookup AS pll, gosalesdw.sls_product_brand_lookup AS pbl, gosalesdw.emp_employee_dim AS ed, gosalesdw.sls_sales_fact AS sf WHERE pd.product_key = sf.product_key AND md.order_method_key = sf.order_method_key AND ed.employee_key = sf.employee_key AND pll.product_line_code = pd.product_line_code AND pbl.product_brand_code = pd.product_brand_code ), inventory AS (SELECT distinct if.product_key AS prod_key, od.organization_code1 AS org_code, od.organization_level AS org FROM gosalesdw.go_branch_dim AS bd, gosalesdw.go_org_dim AS od, gosalesdw.dist_inventory_fact AS if WHERE if.branch_key = bd.branch_key AND od.organization_key = if.organization_key) SELECT sales.product AS Product, SUM(sales.quantity) AS total, sales.brand AS brand, inventory.org_code FROM sales, inventory WHERE sales.prod_key = inventory.prod_key GROUP BY sales.product, sales.brand, inventory.org_code;
6. Highlight the new statement and press F5. The query returns 30 records. You can see the output in the Eclipse SQL Results page, The results show the brand for each product line and the total sales.
Table 4. Results of a query to select the brand for each product line and the total sales Product Camping Equipment Camping Equipment Camping Equipment Camping Equipment Camping Equipment Camping Equipment Camping Equipment Golf Equipment Golf Equipment Golf Equipment Mountaineering Equipment total 5227686 4066514 1254682 5256058 5668086 6328570 26800702 950424 7858210 1418768 1792852 brand Canyon Mule EverGlow Extreme Firefly Hibernator Star TrailChef Blue Steel Course Pro Hailstorm Extreme org_code GOCON GOCON GOCON GOCON GOCON GOCON GOCON GOCON GOCON GOCON GOCON
52
Table 4. Results of a query to select the brand for each product line and the total sales (continued) Product Mountaineering Equipment Mountaineering Equipment Mountaineering Equipment Outdoor Protection Outdoor Protection Outdoor Protection Outdoor Protection Personal Accessories Personal Accessories Personal Accessories Personal Accessories Personal Accessories Personal Accessories Personal Accessories Personal Accessories Personal Accessories Personal Accessories Personal Accessories Personal Accessories total 3696300 12948680 1362350 6268500 5333428 1658708 10768254 4102522 3047227 9883602 3279654 1218420 2947248 1307362 1606494 7495876 1258014 7080489 791367 brand Firefly Granite Husky BugShield Extreme Relief Sun Alpha Antoni Edge Epoch Extreme Glacier Mountain Man Polar Relax Seeker Trakker Xray org_code GOCON GOCON GOCON GOCON GOCON GOCON GOCON GOCON GOCON GOCON GOCON GOCON GOCON GOCON GOCON GOCON GOCON GOCON GOCON
To do some further analysis on the results, you can use BigSheets to create charts that you can show on the dashboard in your InfoSphere BigInsights server. 7. Export the output from your query to a CSV file that you can use in BigSheets. a. In the SQL Results page, right-click the Result1 tab. Select at least one row, and then click Export > Current Result. b. In the Select Export Format window, click the Browse button to locate a destination directory in your local system. The default name is <path>/result.<filetype> in a Linux environment, or <path>\ result.<filetype> in a Windows environment. Change the result.<filetype> file name to SampleResults. c. In the Format field, select CSV file (*.csv). You should be aware that BigSheets can handle different readers. You should be prepared to select the correct reader in BigSheets to see this data in tabular format. For more information, see Changing the data reader for workbooks. d. Click Finish. 8. To make this SampleResults.csv file available to BigSheets, you must upload the file to the InfoSphere BigInsights server: a. Open the InfoSphere BigInsights Console, and click the Files page.
53
b. To create a directory on the server, select an existing folder, such as the tmp . directory, and click the Create Directory icon c. In the Create Directory window, name the directory SamplesOutput, and click OK. . d. Select the SamplesOutput directory, and click the Upload icon e. In the Upload window, click Browse and navigate to the SampleResults.csv file. Click Open. Click OK to upload the file to the server. The file is available to users on the cluster. From here you can click the file, and select Sheets. Continue the steps to create a workbook, and charts, by reading about BigSheets in Master workbooks, workbooks, and sheets. 9. Optional: Instead of using the Export and Upload features that are just described, consider using the CREATE TABLE AS... function of Big SQL. a. In the same org.sql file, add this line in front of the WITH clause:
CREATE TABLE gosalesdw.myprod_brand (Product, Total, Brand, Org_code) ROW FORMAT DELIMITED FIELDS TERMINATED BY \t AS
b. Click the Run SQL icon. The new table is created in the GOSALESDW schema. c. Open the InfoSphere BigInsights Console, and click the Files page. d. Locate your new table in hdfs/biginsights/hive/warehouse/gosalesdw.db/ myprod_brand/part-00000. e. Click the file to see the columns in the new table. Click the Sheet radio button. f. Click Save as Master Workbook. The BigSheets tab of the InfoSphere BigInsights Console opens in the View Results page. From there you can continue with BigSheets functions.
Lesson 8: Analyzing BigSheets data in IBM Big SQL tables

In this optional lesson, you can use workbooks that you create in BigSheets as input to Big SQL tables. This lesson shows you how to export to a tab-separated value (TSV) format and a JSON array format and then use those outputs as input to a Big SQL table. Prerequisites This optional lesson assumes that you downloaded the data that was referred to in Lesson 3: Importing data by using the Distributed File Copy application, and have access to the blogs-data.txt file. If you gathered social media data and created workbooks from that data, you might find that further analysis in a table format is warranted. For example, assume that you examine the blogs data in the blogs-data.txt file. You can create a workbook and use that data for two new Big SQL tables. Note: You do not need to complete the BigSheets tutorial to do this lesson, but some knowledge of BigSheets is recommended. Procedures 1. Create and modify a BigSheets workbook from the blogs-data.txt.
54
a. b. c. d.
From the InfoSphere BigInsights Console, click the BigSheets tab. Click New Workbook. In the Name field, type WatsonBlogData. In the File field, expand the Hadoop Distributed File System (HDFS) folders to get to ..\bigsheets\blogs-data-txt file. Select that file. e. In the Preview area of the screen, select a new reader to map the data to the spreadsheet format. Click the edit icon that looks like a pencil. The data in the blogs-data.txt is formatted in a JSON Array structure. Select the JSON Array reader from the list, and click the check mark inside the Select a reader box to apply the reader. Since the data columns exceed the viewing space, click Fit column(s). The first eight columns display in the Preview area. Click the check mark to save the workbook. f. Click Build new workbook. Rename the workbook by clicking the edit icon, entering the new name of WatsonBlogDataRevised, and clicking the green check mark. g. To more easily see the columns, click Fit column(s). Now columns A through H fit within the width of the sheet. h. There are several columns that you do not need to use in your Big SQL table. Remove multiple columns by following these steps: 1) Click the down arrow in any column heading and select Organize columns. 2) Click the X next to the following columns to mark them for removal:
v Crawled v Inserted v IsAdult v PostSize 3) Click the green check mark to remove the marked columns. i. Click Save > Save to save the workbook. In the Save workbook dialog, click Save. Click Exit to start the run process. Click Run to run the workbook. 2. In the menu bar of the WatsonBlogDataRevised workbook, click Export as. 3. In the drop-down window, select the TSV type in the Format Type field. Big SQL can also use the output from the other selections, but extra steps might be needed. For example, if you select CSV, the quotation marks might be retained in the table. To use just the data without quotations, you can perform a SELECT or CREATE TABLE AS with the TRIM(both '"' from <column name>) function. You can see more information about this technique in a later step. 4. Specify Export to File. 5. Click Browse to select a destination directory in the HDFS file system. Select your path, and then type the name for the new file, such as WatsonBlogs. Click OK. 6. Make sure that the Include Headers check box is cleared. Click OK. 7. A message dialog shows that the workbook is successfully exported. Click OK to close that dialog. 8. Make a note of the column names and the type of data from the BigSheets workbook that you want to define in Big SQL. You exported these columns from BigSheets: v Country - contains a two-letter country identifier. v FeedInfo - contains information from web feeds, with varying lengths. v Language - contains the string that identifies the language of the feed.
55
Published - contains a date and timestamp. SubjectHtml - contains a subject that is of varying length. Tags - contains a string of varying length that provides categories. Type - contains the source of the web feed, whether a news blog or a public feed. v URL - contains the web address of the feed, with varying length. 9. In the InfoSphere BigInsights Eclipse environment, create a project and a new SQL script by following the steps in Lesson 2: Creating a project and an SQL script file on page 36. Name the project MyBigSheetsAnalysis, and name the file NewsBlogs. 10. In the NewsBlogs.sql file, copy or type the following code: v v v v
create database if not exists BigSheetsAnalysis; use BigSheetsAnalysis; create table BigSheetsAnalysis.sheetsOut (country char(2),FeedInfo varchar(300), countryLang char(25),published char(25), subject varchar(300),tags varchar(100), type char(20),url varchar(100)) row format delimited fields terminated by \t; load hive data inpath /<HDFS path>/WatsonBlogs.tsv overwrite into table BigSheetsAnalysis.sheetsOut; select * from BigSheetsAnalysis.sheetsOut;
Replace the HDFS path with a path from your own InfoSphere BigInsights environment. Learn more about trimming unwanted quotation marks from imported data: If you selected another format to which to export from BigSheets, or if you import data into BigSQL from a DBMS such as DB2, your values might retain quotation marks. Use the following technique to trim any unwanted quotation marks from your table. a. After you load the data into the BigSQL table, select your data by using the TRIM function, or create another table as a select with the TRIM function.
create table BigSheetsAnalysis.NewsheetsOut (country, feedinfo, alanguage, published, subject, tags, type, url) as select trim(both " from country), trim(both " from feedinfo), trim(both " from alanguage), trim(both " from published), trim(both " from subject), trim(both " from tags), trim(both " from type), trim(both " from URL) from BigSheetsAnalysis.sheetsOut;
The following tables show a portion of the output from the BigSheets output after it is used in Big SQL:
56
Table 5. Part 1: A portion of the output from the BigSheets workbook after it is selected from a Big SQL table country ... ES FeedInfo ... {"Title":"TechWeekEurope Espaa","Id":"32191721",... countryLang ... Spanish published ... 2011-08-30 10:31:10
Table 6. Part 2: A portion of the output from the BigSheets workbook after it is selected from a Big SQL table subject ... <Keyword>IBM Watson</Keyword> volver a competir en Jeopardy! tags ... NULL type ... blog url ... http:// www.eweekeurope.es/ noticias/ibmwatson-....
11. You can also export data from a BigSheets workbook as a JSON Array and make it available to a Big SQL table. By using a web search engine, locate and download a SerDe .jar file. Search for the string JSON Serde. In this step, you use a SerDe application (Serializer/Deserializer) to process JSON data, which can be used to transform a JSON record into something that Hive and subsequently Big SQL can process. By using the SerDe interface, you instruct Hive as to how a record should be processed. You can write your own SerDe for processing JSON data, or you can use a package that is available from the web. For the purposes of this example, you are going to download a JAR file that helps you with the conversion. SerDe applications (JAR files) can be downloaded from any open source host. 12. Add the SerDe .jar file to the $BIGSQL_HOME/userlib directory. a. Stop the Big SQL server. b. Copy the SerDe .jar file to $BIGSQL_HOME/userlib . c. Restart the Big SQL server. The .jar file is available to the Big SQL JVM and the map/reduce JVMs. Open the JAR file and note the class file name so that you can add that to your create statement. 13. Stop and restart the Big SQL service from the command line or from the InfoSphere BigInsights Console Cluster Status page. 14. From the same workbook that you used in Step 2, click Export as and in the drop-down window, select the JSON Array type in the Format Type field. Name the file WatsonBlogsData. 15. In the InfoSphere BigInsights Eclipse environment, create a table that accesses the appropriate data in the JSON output from BigSheets, and that references the SerDe class.
create table BigSheetsAnalysis.watson_json ( Country String, FeedInfo String, CountryLanguage String, Published String, SubjectHtml String, Tags String, Type String, Url String) row format serde org.apache.hadoop.hive.contrib.serde2.JsonSerde stored as textfile;
57
16.
Use the LOAD statement to populate the Big SQL table with the data from the JSON file, and then SELECT from the table.
load hive data inpath </hdfs_path>/WatsonBlogsData.json overwrite into table BigSheetsAnalysis.watson_json; select * from BigSheetsAnalysis.watson_json;
This example illustrates the flexibility and synergy of the InfoSphere BigInsights components. Your output should look similar to the output from the TSV file.
Lesson 9: Analyzing the Big SQL data in the Big SQL Console
In this lesson, you will learn how to run queries in the Big SQL Console of InfoSphere BigInsights. You can run one or more valid Big SQL statements from the Big SQL Console. Procedures 1. Open the InfoSphere BigInsights Console and open the Welcome page. 2. In the Quick Links pane, click Run BigSQL Queries. A Big SQL Console opens in your browser where you can enter one or more queries. 3. Copy and paste the following query, which is from one of the previous lessons, in the query entry field, the middle pane of the Big SQL window in the Big SQL Console.
SELECT pll.product_line_en AS Product, md.order_method_en AS Order_method, sum(sf.QUANTITY) AS total FROM gosalesdw.sls_order_method_dim AS md, gosalesdw.sls_product_dim AS pd, gosalesdw.sls_product_line_lookup AS pll, gosalesdw.sls_product_brand_lookup AS pbl, gosalesdw.sls_sales_fact AS sf WHERE pd.product_key = sf.product_key AND md.order_method_key = sf.order_method_key AND pll.product_line_code = pd.product_line_code AND pbl.product_brand_code = pd.product_brand_code GROUP BY pll.product_line_en, md.order_method_en;
4. Click Run. This Big SQL Console feature is useful for small queries or output. It is ideal for testing what if kinds of queries. The output appears in the lower half of the Console window, or the query result field. There is one Status tab, and a Result tab for each statement that you ran. Results are limited to 200 rows. 5. Copy the following statements into the top pane and then click Run.
select order_day_key from gosalesdw.sls_sales_fact; select sale_total from gosalesdw.sls_sales_fact;
The output shows one Status tab, which contains information for both of the statements. There are two result tabs, one for the output of each statement. 6. To see statements that you ran previously, expand the list box of previously run statements, the top pane that is above your current statement. Any previously run statements can be recalled from the drop-down list. You can rerun a previously run statement, or set of statements that are in the list, by clicking the
58
statement. This action copies the statement back into the current statement window. Click Run to perform the query again. 7. Since you can run any valid statement from the Big SQL Console, you can create a table from a query to save the output to the Hadoop Distributed File System (HDFS) file system. Type or copy the following statement into the Big SQL Console active statement pane:
CREATE TABLE gosalesdw.myprod_brand1 (Product, Total, Brand, Org_code) ROW FORMAT DELIMITED FIELDS TERMINATED BY \t AS WITH sales AS (SELECT sf.product_key AS prod_key, sf.quantity AS quantity, pll.product_line_en AS product, pbl.product_brand_en as brand FROM gosalesdw.sls_order_method_dim AS md, gosalesdw.sls_product_dim AS pd, gosalesdw.sls_product_line_lookup AS pll, gosalesdw.sls_product_brand_lookup AS pbl, gosalesdw.emp_employee_dim AS ed, gosalesdw.sls_sales_fact AS sf WHERE pd.product_key = sf.product_key AND md.order_method_key = sf.order_method_key AND ed.employee_key = sf.employee_key AND pll.product_line_code = pd.product_line_code AND pbl.product_brand_code = pd.product_brand_code ), inventory AS (SELECT distinct if.product_key AS prod_key, od.organization_code1 AS org_code, od.organization_level AS org FROM gosalesdw.go_branch_dim AS bd, gosalesdw.go_org_dim AS od, gosalesdw.dist_inventory_fact AS if WHERE if.branch_key = bd.branch_key AND od.organization_key = if.organization_key) SELECT sales.product AS Product, SUM(sales.quantity) AS total, sales.brand AS brand, inventory.org_code FROM sales, inventory WHERE sales.prod_key = inventory.prod_key GROUP BY sales.product, sales.brand, inventory.org_code;
This is the same statement that you ran in Lesson 7: Analyzing the IBM Big SQL data in BigSheets on page 51, so the table name must be changed to gosalesdw.myprod_brand1 to remain unique. Your output, which is a new table in the HDFS file system, can be used in BigSheets, as previously described.
Lesson 10: Analyzing the Big SQL data in a client spreadsheet application
In this optional lesson, you will learn how to examine data in a client spreadsheet application. There are many open source spreadsheet applications. For this lesson, you will use the Microsoft Excel application.
59
Tip: You might need to adjust some of the specific instructions in this lesson, such as the names of some of the user interface controls, depending on the spreadsheet application that you use. Big SQL provides connectivity for some applications through either a 32-bit or a 64-bit ODBC driver, on either Linux or Windows, that conforms to the Microsoft Open Database Connectivity 3.0.0 specification. Depending on the spreadsheet application that you use, you might need to select the ODBC driver that you install from the operating system, or from the spreadsheet application itself. Refer to information in your particular spreadsheet application about importing data from external data sources on the specific steps. Procedures 1. Download the 32-bit ODBC driver from InfoSphere BigInsights Console. a. Open the InfoSphere BigInsights Console. Tip: If you must attach to a remote ODBC client from your Linux machine, follow these steps: 1) From a Linux command line, type the following command to determine the IP address of your current InfoSphere BigInsights cluster:
cat /etc/hosts
2) Open a browser outside of the cluster environment by typing the following in the URL address field:
<ip address>:8080 or <ip address>:8443 if you are running secure protocol
The InfoSphere BigInsights Console opens in the non-cluster location. You can continue with the steps to download the driver and attach the ODBC driver to the correct location. b. In the Quick Links section, select Download the Big SQL Client drivers. c. From the Save As window, specify the file path and the name of the Big SQL client package (which is by default, big-sql-client-kit.zip). Click Save. d. Extract the contents of the client package under a directory of your choice. 2. Install a 32-bit ODBC driver to connect to your spreadsheet application. a. Navigate to the folder where you downloaded and extracted the big-sql-client-kit.zip. b. Open the odbc folder and then open the appropriate operating system folder (linux or windows). For the purposes of this lesson, open the windows folder. c. Start the IBM Big SQL ODBC Client Setup installation program by right-clicking the biginsights_odbc.exe file and selecting Run as administrator. d. Accept the license agreement, and then click Next. e. Select to install the 32-bit ODBC driver, and then click Next. f. Specify the folder into which you want to install the driver, and then click Next. g. In the ODBC DSN Creation window, in the Enter DSN Name field, type MyDSN.
60
h. In the Big SQL Server configuration window, complete the fields with the following information: Database gosalesdw Host Port <server_name.abc.com> 7052
User ID <user_name that you use to connect to the InfoSphere BigInsights cluster> Password <password that you use to connect to the InfoSphere BigInsights cluster> i. Click Next. j. Click Install. Click Finish. 3. Optional: If you want to add more DSN entries, do the following steps: a. Open the odbc.ini file as an administrator. b. Type the additional entries or modifications. The following example assumes that your new properties are identified as MyDSN2:
[MyDSN2] DATABASE=gosalesdw HOST=my.host2.com PORT=7052 UID=myUSERID2 PWD=myPASSWORD2
c. Save the odbc.ini file. d. For any modification or addition to the odbc.ini file, run the sql_dsn.vbs script file as an administrator from a Windows command line at the <ODBC driver installed path>\IBM Big SQL Driver\win32 path to register the modified DSN information.
sql_dsn.vbs MyDSN2
You should get a confirmation message that your DSN was added to the register successfully. 4. Open the spreadsheet application and click Data > Import External Data > Import Data. 5. In the Select Data Source window, click New Source. 6. In the Data Connection Wizard, select ODBC DSN. 7. In the Connect to ODBC Data Source window, select the ODBC data source that you defined when you installed the driver, MyDSN. 8. Click Next. 9. In the Select Database and Table window, select the table with which you want to work inside your client spreadsheet application. From Lesson 7: Analyzing the IBM Big SQL data in BigSheets on page 51, step 9 included an optional way of creating a table by using the CREATE TABLE AS structure. Select the table name that you created in that step, myprod_brand. 10. Click Finish. 11. From the Select Data Source window, make sure that the table you selected appears in the file name field. Click Open. 12. Select to import the data into a new worksheet or use the current worksheet. For the purposes of this lesson, select Existing worksheet.
61
13. Click OK. The data is imported. If you select another table, you might get a message that there is more data than can fit in your worksheet. Click OK. The spreadsheet application contains the result of the table that you selected. It is the equivalent of the query, select * from <table_name>. The table can be used in your client application as you would use any spreadsheet data for further analysis.
Lesson 11: Writing advanced Big SQL queries and including optimization hints
In this lesson, you create more optimized queries by adding Big SQL inline hints. Big SQL does not normally need hints to run queries properly. But they can be useful in more advanced situations where you want more control over how the statement is run. Procedures 1. Create an SQL file named advanced.sql in the myBigSQL project. You can use the steps in the previous lesson. 2. To open the advanced.sql file, double-click it. 3. To understand how the products that are sold rank in comparison with the products that are shipped, type the following statement into the advanced.sql file:
WITH sales AS (SELECT sf.* FROM gosalesdw.sls_order_method_dim AS md, gosalesdw.sls_product_dim AS pd, gosalesdw.emp_employee_dim AS ed, gosalesdw.sls_sales_fact AS sf WHERE pd.product_key = sf.product_key AND pd.product_number > 10000 AND pd.base_product_key > 30 AND md.order_method_key = sf.order_method_key AND md.order_method_code > 5 AND ed.employee_key = sf.employee_key AND ed.manager_code1 > 20), inventory AS (SELECT if.* FROM gosalesdw.go_branch_dim AS bd, gosalesdw.dist_inventory_fact AS if WHERE if.branch_key = bd.branch_key AND bd.branch_code > 20) SELECT sales.product_key AS PROD_KEY, SUM(CAST (inventory.quantity_shipped AS BIGINT)) AS INV_SHIPPED, SUM(CAST (sales.quantity AS BIGINT)) AS PROD_QUANTITY, RANK() OVER ( ORDER BY SUM(CAST (sales.quantity AS BIGINT)) DESC) AS PROD_RANK FROM sales, inventory WHERE sales.product_key = inventory.product_key GROUP BY sales.product_key;
Tip: In most cases, Big SQL joins tables together in the order that they are provided, so it is important to take care when you determine the order of the tables. Specifically, when you choose the order of the tables in the query, remember to eliminate rows as early as possible. Tables that are highly selective (such as those tables that use predicates that filter out many rows, or those tables with rows that are removed as a result of the join) should be located early in the query. Ordering the tables in this way reduces the number of rows that must be moved to the next step of the query.
62
4. Click the Run SQL icon. The result contains 165 rows. The output shows the product by its product key, and how many units were shipped and how many units were sold.
Table 7. Partial results that show how many units were shipped and how many units were sold PROD_KEY 30090 30107 30111 ... INV_SHIPPED 355119027 257820246 178931852 ... PROD_QUANTITY 121531134 114559248 67018510 ... PROD_RANK 1 2 3 ...
The INV_SHIPPED column is derived from an aggregate function, SUM, and a CAST function.
SUM(CAST (INVENTORY.QUANTITY_SHIPPED AS BIGINT))AS INV_SHIPPED
a. The original column, QUANTITY_SHIPPED is created as an integer. The CAST function converts the output to another data type, in this case a BIGINT. b. Then, the SUM function returns a single summed value for the column. The PROD_RANK column is derived from the RANK function.
RANK() OVER ( ORDER BY SUM(CAST (SALES.QUANTITY AS BIGINT)) DESC) AS PROD_RANK
a. The sales.quantity column is CAST from an integer to a BIGINT. b. The SUM function is used on that column. c. The sales.quantity column is then sorted in descending order. d. The RANK function produces a number that is based on the sorted order. For more information about other statistics functions within Big SQL, see Statistical functions. 5. Use the same query that you used in the previous step, but this time add some hints to the statements to better optimize the query. Query hints are special comments that appear next to the part of the query that they apply to. The following format is an example of how to write query hints: /*+ name=value[, name=value ..] +*/ . a. Find the following statements in the sales table expression:
gosalesdw.sls_product_dim AS pd , gosalesdw.emp_employee_dim AS ed,
b. Add hint comments to the end of each statement, so that your FROM clause looks like the following code:
gosalesdw.sls_product_dim /*+ tablesize=small +*/ AS pd, gosalesdw.emp_employee_dim /*+ tablesize=small +*/ AS ed,
This hint is an optimizer hint that indicates that the table is to be joined by using a memory join (hash join) if possible. The value small indicates that the table is sufficiently small enough in size to fit into memory and can be hashed. The small table hint can also be used if the predicates applied to the table remove enough rows so that it can fit into memory. 6. Save the advanced.sql file. 7. Click the Run SQL icon. This query restricts the output by using additional predicates, such as AND pd.product_number > 10000. It produces the same 165 rows with the same ranking by quantity sold. But, you have more control over the implementation of the statement by using the hints, and generally there is an improvement in efficiency.
63
Summary of developing Big SQL queries to analyze big data

This tutorial demonstrated some Big SQL statements and some techniques for analyzing data.
Lessons learned
You now have a good understanding of the following tasks: v How to create an InfoSphere BigInsights project in Eclipse. v How to create Big SQL tables and load data into them. v How to use the Big SQL editor in Eclipse to create queries. v How to see the results of queries. v How to export the output of a query for use in other applications. v How to use Big SQL data in BigSheets and how to use BigSheets data in Big SQL. v How to access Big SQL data in client spreadsheet applications.
Read the pertinent articles on IBM developerWorks: v What's the big deal about Big SQL? Introducing relational DBMS users to IBM's SQL technology for Hadoop. Related information: IBM Big SQL Reference Analyzing data with BigSheets
64
Chapter 7. Tutorial: Creating an extractor to derive valuable insights from text documents
Learn how to use IBM InfoSphere BigInsights Text Analytics, an information extraction system, to extract information from IBM quarterly reports. Using InfoSphere BigInsights Text Analytics you define programs written in Annotation Query Language (AQL) to extract structured information from unstructured and semi-structured documents. You can apply Text Analytics to big data at rest in InfoSphere BigInsights and big data in motion in IBM InfoSphere Streams. By using text analytics tooling, you can develop, run, and publish extractors that glean structured information from unstructured documents. The extracted information can then be analyzed, aggregated, joined, filtered, and managed by using other InfoSphere BigInsights tools. In this tutorial, you will extract business information from a series of IBM quarterly reports, such as the revenue for each IBM division. You can then use that information in other tools, such as BigSheets, to understand and analyze trends, and visualize the results in charts or graphs. The Welcome page of the InfoSphere BigInsights Console includes information about how to enable your Eclipse environment for developing Text Analytics. For more information, click Help in the InfoSphere BigInsights Eclipse tools. We will extract useful information from text documents by using a 5-step process. The tasks that are associated with this process are supported in the Extraction Tasks view in the Eclipse tools. This gives us a workflow that we can follow as we build extractors. The following steps are included in these lessons: 1. Identify the collection of documents from which you want to extract information. 2. Analyze the documents to identify examples of the information that you want to extract. 3. Write AQL statements to extract the identified information. 4. Test and refine the AQL statements. 5. Export the final extractor and deploy to a runtime environment such as InfoSphere BigInsights or IBM InfoSphere Streams. Lessons 1, 2, and 3 will introduce you to the Text Analytics features and tooling. These introductory lessons teach you how to use some basic AQL statements, and how to manipulate the Text Analytics Workflow perspective and the Extraction Plan. In the more advanced lessons (Lessons 4, 5, 6, and 7), you refine the AQL, finalize the extractor, and export the Text Analytics Module (TAM) so that it is ready to deploy to a runtime system.
Learning objectives
After you complete the lessons in this tutorial, you will understand the concepts and know how to do the following actions: v Navigate a Text Analytics project in Eclipse.
65
v v v v
Import documents into a project. Understand the Text Analytics development process. Use the tooling to write and test AQL statements. Export an extractor ready to be deployed to a runtime system.
Time required
Allow 30 minutes to complete the basic parts of this tutorial. Allow another 45 minutes to complete the more advanced lessons.
Lesson 1: Setting up your project

In this lesson, you explore the InfoSphere BigInsights Text Analytics Workflow perspective in Eclipse, including the Extraction Task view and the Extraction Plan view. You also create a Text Analytics project and set up the input data. Prerequisites 1. Make sure that you install Eclipse, and enable it to work with InfoSphere BigInsights. See Installing the InfoSphere BigInsights Tools for Eclipse 2. Download the sample data by following these steps: a. Open the Welcome page of the InfoSphere BigInsights Console. b. In the Quick Links section, click Download applications (Eclipse projects). c. Select SampleTextAnalyticsProject_eclipse.zip. Click Download. Select to save the file in your local system. Click Close to close the Map Reduce Sample Applications window. d. Navigate to the download location and extract the compressed file. You will now create a text analytics project and import the documents. Procedures 1. From your desktop, start Eclipse. Click OK to use the default workspace. The Task Launcher for Big Data opens. 2. Close the Help pane for now if it is visible. You can always get help by pressing F1, or by selecting Help > Help Contents from the menu bar. 3. In the Task Launcher for Big Data, click the Develop tab, and click Create a text extractor from the Tasks panel. 4. Create a project called TA_Training. a. In the New BigInsights Project window, specify TA_Training as the project name, then click Finish. b. Click Yes in the message box to switch to the InfoSphere BigInsights perspective. The Extraction Plan pane is usually visible on the right of the window. It is the design pane for Text Analytics projects. The Extraction Tasks pane is usually visible on the left of the window. It is the workflow for Text Analytics projects. The actual location of the views might depend on your Eclipse environment. Learn more about adding views: If the Extraction Tasks view is not visible, add that view. From the Eclipse menu click Window > Show view > Extraction Tasks. You can follow the same steps for the Extraction Plan if it is not visible. 5. Before you can start working with the sample documents, you must bring them into Eclipse. Open the Project Explorer and expand the TA_Training project.
66
a. From the Eclipse menu bar select Window > Show view > Project Explorer. b. Expand the project TA_Training and open the folder textAnalytics. c. Access the input documents in one of the following ways: Importing 1) Right-click the project TA_Training and click File > Import. 2) In the Select window, click General > File System. Click Next. 3) In the From directory field, click Browse. Locate and select the ibmQuarterlyReports folder that you downloaded and extracted at the beginning of this lesson. Click OK. 4) In the File System window, select the ibmQuarterlyReports to access all of the files within the folder, then click Finish. Dragging and dropping 1) In your local file system, navigate to the SampleTextAnalyticsProject_eclipse/data/ folder that you downloaded and extracted at the beginning of this lesson. 2) Open the folder and drag the ibmQuarterlyReports folder from your local file system onto the textAnalytics folder in your Eclipse project. Specify Copy files and folders and click OK in the File and Folder Operation dialog.
Lesson 2: Selecting input documents and labeling examples

In this lesson, you analyze the input documents to identify examples of the information to be extracted and to add examples to the Extraction Plan. By labeling examples, you also start creating an extraction plan, which is a view of the design of your extractor. In the extraction plan, you identify, organize, manage, and navigate elements of the extractor. As you create the extraction plan by labeling snippets of interest and their associated clues, you are developing an understanding of the input documents. It is a good idea to work with someone who is familiar with the documents during this part of the process. Since you are interested in extracting revenue by division, you must search the documents for snippets of text that contain this information. As you find and label examples, be aware of patterns and clues in the text that can help improve the accuracy of the extractor. An example that you might find is a phrase such as Revenues from Software were $3.9 billion. If you labeled this example, you might notice that it has three important features: v The term "Software", which is a division name. v The term $3.9 billion, which is a revenue amount. v The term revenue. You will use all of these features as context to identify instances of revenue by division. Labels are meaningful identifiers of the text that you want to extract. Labels also serve to categorize various clues that help you develop an extractor. There are two types of labels:
67
Top level or parent A snippet that contains the information that you want to extract. An example of a top level identifier is Revenues from Software were $3.9 billion, which contains clues to a division and the revenue that is associated with it. Clues You decompose the top level identifiers into features and clues. Basic features are usually parts of the top level or parent that you must extract. A clue is typically supporting text that provides additional context. In our example, we would consider the word revenue to be a clue and the division names and revenue amount would be features. The process of labeling the document is an iterative process. It helps if you can work with a subject matter expert who can help you decide if you have identified enough examples, features, and clues to reliably extract the required information. It would be unusual to find the same information presented the same way across a broad set of documents. More often than not, something causes things to change. It might be a change in the business, a change in regulations or reporting requirements, a change of writer or editor, a new template for the document, or simply a change in writing style. Ideally your subject matter expert can alert you to the changes and variations that you must deal with. When you read some of the sample input documents, you will see that you have two basic patterns to deal with: revenues for division were $x.x and division revenues were $x.x. There are a number of additional variations in the information around the basic features and clues, but only two basic patterns. Procedures 1. Before you start your analysis, you must set up the input documents in the Extraction Tasks view. a. Click the Extraction Tasks tab in the left pane of the Text Analytics Workflow perspective. b. Expand Step 1 of the Extraction Tasks wizard, Select Data Collection. Click Browse Workspace and navigate to the ibmQuarterlyReports folder in your project (TA_Training/textAnalytics/ibmQuarterlyReports). Select the ibmQuarterlyReports folder, and click OK. c. From the Language list, select en. d. Select 4Q2006.txt in the Extraction Tasks wizard. Click Open. 2. Examine the text in the document you just opened by looking for examples that report revenue by division. 3. Identify RevenueByDivision as the first clue in which you are interested. a. Search the file until you see the phrase Revenues from the Software segment were $5.6 billion. Highlight that phrase, right-click, and click Add example with New Label. b. In the Add New Label window, type RevenueByDivision in the Label Name field and leave a Parent Label field blank to make RevenueByDivision the top level label. c. Click Finish. 4. Look again at the text from the 4Q2006.txt file. Search for the phrase Revenues from the Systems and Technology Group (S&TG) segment totaled $7.1 billion. Add as another example. a. Right-click that phrase and click Label Example As. b. Select RevenueByDivision.
68
5. You have found two examples of the pattern revenues for division were $x.x. Now, find an example that refers to the other pattern in which you were interested. Search for and highlight Global Financing segment revenues increased 3 percent (flat, adjusting for currency) in the fourth quarter to $620 million. in the 4Q2006.txt file. a. Right-click that phrase and click Label Example As. b. Select RevenueByDivision. 6. If you look at the Extraction Plan view, you see the three examples that you labeled. If you click either of the examples under the parent label, such as Revenues from the Software segment were $5.6 billion, the text from which it came is highlighted. These snippets of text contain some useful clues for extraction, such as revenues, division names such as Systems and Technology Group (S&TG), and amounts such as $5.6 billion. Now you want to record clues from these examples as additional labels in the Extraction plan. Learn more about the Extraction Plan: You can think of the Extraction Plan as an interactive design view of your extractor. It helps you to identify, organize, and navigate the elements that you want to extract. It also helps you write the associated AQL statements, which makes the Extraction Plan a powerful part of the design and development process. a. In the 4Q2006.txt file, select the snippet Revenues from the Software segment were $5.6 billion. Highlight and right-click the term Revenues, and select Add Example with New Label. b. In the Add New Label window, type revenues in the Label Name field. Type RevenuebyDivision as the parent label. Click Finish. c. In the same snippet, find the phrase $5.6 billion. Right-click that phrase and click Add example with New Label. d. In the Add New Label window, type Money in the Label Name field. Type RevenueByDivision as the parent label. You can also double-click the RevenueByDivision parent label to use that name as the parent. Click Finish. 7. It is a good idea to decompose clues to the lowest level. In this way, you can let the powerful text analytics engine and optimizer do more of the work, rather than writing complex expressions in your code. This action of decomposing clues can also give you a more robust and flexible solution. Money, which you labeled in the previous step, is a good example. Money has three basic features: a currency sign, followed by a number, followed by a quantifier such as million or billion. Go ahead and create labels for these three features: a. In the 4Q2006.txt file, find the snippet $5.6 billion which was part of the original phrase in a previous step. You have already labeled this phrase Money. b. Right-click only the currency symbol, $, and click Add example with New Label. Type Currency in the Label Name field. In the Parent Label field, type Money. c. Right-click 5.6 of the same phrase, and select Add example with New Label and type Number in the Label Name field. In the Parent Label field, type or select Money. d. Right-click billion and select Add example with New Label and type Quantifier in the Label Name field. In the Parent Label field, type or select Money. 8. You would usually continue analyzing documents, labeling additional examples and clues until you had seen enough to be confident that you understood the features, clues, and patterns for which you will code. To save time with the
69
additional examples and clues that you should label, use Table 8 as a guide. Search the documents that are identified and add the labels, noting of which parent the label is a child. a. Open the document that is listed in the File column of the table in the editor. b. Press Ctrl+F to search for the string that is listed in the Search term column of the table. c. For each clue to add as a label, right-click the word or phrase and click Add example with New Label. Specify the suggested label name in the Label name column of the table, type the appropriate parent label name, and click Finish. If you already added the label and want to add an example of the label, click Label Example As. d. Close the file.
Table 8. Additional clues to strengthen your extractor Label name as child to RevenueByDivision unless otherwise noted Money Division Division Quantifier as a child to Money Number as a child to Money Number as a child to Money Metric Money Number as a child to Money Number as a child to Money
File 4Q2006.txt 4Q2006.txt 4Q2006.txt 4Q2006.txt 4Q2007.txt 4Q2009.txt 4Q2010.txt 4Q2010.txt 4Q2010.txt 4Q2010.txt
Search term $7.1 billion Systems and Technology Group (S&TG) Global Technology Services million 12.5 27.2 Revenue $29.0 billion 8.7 5.3
Lesson 3: Writing and testing AQL

In this lesson, you write AQL statements to extract the basic low-level features that you identified in the previous lesson. Now you are going to write AQL statements to extract the basic features that you identified during the document analysis process. You will see how you can use a simple pattern to put the basic features in context to give you candidates. In subsequent lessons, you use similar techniques to combine features to create concepts, and expand your AQL to further consolidate and filter the results. Extractors are written in the Annotation Query Language (AQL), which is the core of text analytics in InfoSphere BigInsights and InfoSphere Streams. You code custom extractors in AQL. Text Analytics also includes a library of pre-built extractors and a sophisticated set of tools. The AQL language was designed by using SQL-like expressions, which makes it familiar and easy to learn. Learn more about writing AQL: An extractor is a program that is written in AQL that extracts structured information from unstructured or semistructured text. AQL
70
is a declarative language, with a syntax that is similar to that of the Structured Query Language (SQL). For more information about writing AQL, see the AQL Reference. If you look at the labels that you created in the Extraction Plan, you see that the lowest level basic features that you labeled are the three elements of Money: the currency symbol, a number, and a quantifier. You are now going to write AQL statements to extract those elements by using simple extract statements that use dictionaries and regular expressions. As you will see, AQL involves the creation of views that use extract and select expressions. These statements are the three fundamental elements of AQL. So, it is worth repeating: by using AQL statements, your data is managed through views, and views are created by using extract and select expressions. Your input data set is referenced as a view called Document and its contents are referenced as a column called text. Procedures 1. You will now create views that use extract expressions. You create one view for each of the three basic features of Money. a. In the extraction plan, right-click the Currency label that you created in the previous lesson. b. From the menu, select New AQL Statement > Basic Feature AQL Statement. c. In the Create AQL Statement dialog, in the View Name field, specify Currency. d. In the AQL Module field, select RevenuebyDivision_BasicFeatures. e. In the AQL script, specify RevenueBasic.aql for the name of the AQL script that you will be writing. f. In the Type field, select Dictionary. g. Select the Output view check box. h. Click OK. 2. The RevenueBasic AQL file opens in the editor. The file is populated with templates to create a dictionary and a view. Learn more about views: Views are the primary data structures that are used with AQL statements. AQL statements create views by selecting, extracting, and transforming information from other views. AQL views are like the views in a relational database. They have rows and columns just like a database view. However, AQL views are not materialized by default. In other words, the result of the views is not viewable output. To see your output, you must include an output view statement. You reference input data as a view called Document with one column called text. Think of each document in your input data set as one row in the Document view with the document content mapped onto the text column. 3. Complete the AQL template to create the dictionary and the view. a. In the create dictionary line, type or copy the following code to replace the dictionary template:
create dictionary CurrencyDict as ($);
Make sure that you deleted the lines that begin from file and with language. Learn more about dictionaries:
71
To extract elements from text, you can use regular expressions and dictionaries. When you want to match text that is based on a pattern, you use a regular expression. When you can match on defined words, use a dictionary. AQL dictionaries are more efficient than regular expressions, so it is a good idea to use dictionaries whenever possible, even in cases where there is just a single entry. You usually create dictionaries with more entries that are stored in an external file, which makes it easier to add and change entries without having to open up the code. End each AQL statement with a semi-colon. You are changing the dictionary declaration to be a simple inline declaration. In the example, when the statement is run, the string is the entry in the CurrencyDict dictionary. Learn more about another way to add terms to a dictionary: Instead of typing each entry manually, you can use the features of the Extraction Plan view to add terms into a dictionary file: 1) In the Extraction Plan, expand the top-level label, RevenueByDivision, and expand Labels. Inside that label, expand the Currency label. 2) Click Examples to open that folder. You see the clues for Currency that you labeled in the previous lesson. 3) Select all of the entries in the Examples folder, and right-click, and select Add to dictionary. 4) In the Select Dictionary window, click Browse Workspace. 5) In the Select a file window, select the src/ RevenuebyDivision_BasicFeatures folder in the TA_Training project, and click Create Dictionary. The NewDictionary.dict file is created. 6) Click OK. Then, click OK to close the Select Dictionary window. The terms are now added into a dictionary file that you can use in an extract statement. 7) Save the file. 8) You can rename the file by selecting NewDictionary.dict in the Project Explorer, and clicking F2. In the Rename Resource window, type a new name for the file. By using a dictionary file instead of inline terms, you can more easily modify terms without modifying the code. The create dictionary statement would change as follows:
create dictionary CurrencyDict from file NewDictionary.dict;
b. In the create view template, replace the template with the following code:
create view Currency as extract dictionary CurrencyDict on R.text as match from Document R;
The create view statement uses an extract expression that finds all matches of terms in the dictionary that you created. The dictionary matches are stored in a column named match. 4. Do not change the output view line. The output view statement materializes the view. By default, views are not materialized. They are also likely to be removed when you optimize for better performance. But, during development, you are likely to want to look at the
72
contents of intermediate views like this one for debugging purposes, then later you can comment out the output view statements that are not required. 5. Click File > Save from the menu to save your changes. Verify that your AQL looks like the following code:
module RevenueByDivision_BasicFeatures; create dictionary CurrencyDict as ($); create view Currency as extract dictionary CurrencyDict on R.text as match from Document R; output view Currency;
6. You are now ready to test the extractor. You run the AQL queries and then view the results. There are three primary options to run the extractor. You can run against all of the documents, on selected documents only, or on documents with labeled information only. In the Extraction Plan, right-click RevenueByDivision and click Run > Run the extraction plan on the entire document collection. 7. When the run is complete the results are shown in the Annotation Explorer. The Annotation Explorer shows each extracted field in the Span Attribute Value column. You can also see the text from the left and right of the extracted text, which is known as the left and right context. Double-click one of the rows to see the extracted text in the original document in the edit pane. The Span Attribute Value column in the middle of the Annotation Explorer shows the basic features that are picked up by the extractor. The output that shows in the Annotation Explorer is by view name. 8. After you complete the code for the Currency view, right-click the Currency label in the Extraction Plan and select Mark Completed. The label icon changes to a check mark. This marker is a visual reminder that you have created a view for this label. This process of checking as you progress through the Extraction Plan is part of the workflow of text analytics. You are building up from the simpler labels by creating views that extract the information you need. 9. Enhance the RevenueBasic.aql file by adding views for the additional basic features of Money. Use the Number and Quantifier labels in new views within the same module. For each view that you create, begin by right-clicking the label that corresponds to the view that you want to create in the Extraction Plan. Then, from the menu, select New AQL Statement > Basic Feature AQL Statement.
Option create view Number Description View Name Number AQL Module RevenuebyDivision_BasicFeatures AQL script RevenueBasic.aql Type Regular expression
Output view Enabled

73
Option create view Quantifier
Description View Name Quantifier AQL Module RevenuebyDivision_BasicFeatures AQL script RevenueBasic.aql Type Dictionary
Output view Enabled
10. Modify the RevenueBasic.aql file to correct the two templates that were added. a. Update the Number view to look like the following code:
create view Number as extract regex /\d+(\.\d+)?/ on R.text as match from Document R; output view Number;
Learn more about another way to add regular expressions: Instead of typing the regular expression manually, you can use the features of the Extraction Plan view to add an expression into your statement: 1) In the Extraction Plan, expand the top-level label, RevenueByDivision, and expand Labels. Inside that label, expand the Number label. Click Examples to open that folder. You see the clues for numbers that you labeled in the previous lesson. 3) Select all of the entries in the Examples folder, and right-click, and select Generate Regular Expression. 4) In the Regular Expression Generator window, the samples you selected are already loaded in the Samples pane. Click Generate regular expression. 2) 5) You might get several suggestions, but in this case, there is one suggestion that is based on the samples:
(\d{1,2})?(\.)?\d
6) Click Next. In the Regular Expression Generator window, you can refine the expression. You might find that because of the clues that you labeled, the generated expression is more or less restrictive than you want. Experiment with the options, and when you are satisfied, click Finish. 7) A confirmation window shows the expression that was generated. The expression is placed on the clipboard. Click OK. 8) Navigate to the statement that begins extract regex in the RevenueBasic.aql file, and right-click, and select Paste to add the generated expression to your code. b. Update the Quantifier view by first using the Extraction Plan menu to create a dictionary file, instead of creating an inline dictionary. 1) In the Extraction Plan, expand the Quantifier label and then expand Examples. Highlight all of the entries in the Examples folder.
74
2) Right-click and select Add to Dictionary. 3) In the Select Dictionary window, click Browse Workspace. 4) Select the /textAnalytics/src/RevenuebyDivision_BasicFeatures, and click Create Dictionary. 5) A NewDictionary.dict file is created. Click OK. 6) In the Select Dictionary, click OK. The clues that you labeled as Quantifier entries are now in the dictionary file. There should be at least two entries from the labeling that you did in the previous lesson: million and billion. Save the file. 7) Open the Project Explorer view, and right-click the dictionary file. Click Rename. In the Rename Resource window, type Quantifier.dict in the New Name field. Click OK. 8) In the RevenueBasic.aql file, in the create dictionary template for QuantifierDict, replace the code <path to your dictionary here> with the name of the dictionary that you just created:
from file Quantifier.dict
If you put the dictionary file in a location outside of the module folder, then you must include the path relative to the project name. Complete the create view statement by pointing to the dictionary that you just created, and ensure that the view is case insensitive. Adding the IgnoreCase parameter ensures that the terms million and Million are both found. The create dictionary and the create view statements should look like the following code:
create dictionary QuantifierDict from file Quantifier.dict with language as en; create view Quantifier as extract dictionary QuantifierDict with flags IgnoreCase on R.text as match from Document R; output view Quantifier;
c. Click File > Save. 11. Run the extractor. a. In the Extraction Plan, right-click RevenueByDivision and click Run > Run the extraction plan on the entire document collection. b. View the results in the Annotation Explorer. When it successfully completes, there are three views that are output and are displayed in the Annotation Explorer. These views are the three views that you materialized with the output view statement. Select the view that you want to see in tabular form from the list in the header of the Annotation Explorer. 12. To mark the Number and Quantifier labels as complete in the extraction plan, right-click each label (Number and Quantifier) in the Extraction Plan and select Mark Completed. 13. You are now going to extract instances where these three basic features occur together, which gives you Money. You will do that extraction by using a pattern to extract candidates for revenue. In the Create AQL Statement dialog, complete the fields necessary to create a view: a. Right-click the Money label, select New AQL Statement > Candidate Generation AQL statement. AQL is modular, which means that you can
75
package your statements into modules that can then be packaged and reused. One way to modularize your code is by the type of AQL statement. By using this design, you would package all basic feature statements in one module, all candidate generation statements in another module. The Text Analytics tooling creates default modules to support this type of modularization. But since the extractor that you are building in this tutorial is simple, you will package all of your statements into the RevenuebyDivision_BasicFeatures module. Learn more about AQL modules: For more information about modules, see AQL modules Type Money in the View Name field. In the AQL Module field, make sure to specify RevenuebyDivision_BasicFeatures as the module name. In the AQL script field, type or select RevenueBasic. Specify Pattern in the Type field. Specify the Output view check box.
b. c. d. e. f.
g. Click OK. 14. You are going to use the Currency, Number and Quantifier views in this new view, and you will reference those views by assigning the variables C, N, and Q to the Currency, Number, and Quantifier views in the FROM clause. The pattern specification looks for the currency symbol, followed by a number, followed by a unit. As a result, the view contains the following code:
create view Money as extract pattern <C.match> <N.match> <Q.match> return group 0 as match from Currency C, Number N, Quantifier Q; output view Money;
Learn more about patterns: For more information about patterns in AQL, see Sequence patterns 15. Save the file and run the extractor in the usual way. a. In the Extraction Plan, right-click RevenueByDivision and click Run > Run the extraction plan on the entire document collection. b. View the results in the Annotation Explorer. You see the Money view, with sequential occurrences of a currency sign, followed by number, followed by a unit. You extracted entities by using a pattern over the input document and the existing annotations. The Money view returned 333 rows. 16. To mark the Money label as complete in the extraction plan, right-click the label Money in the Extraction Plan and select Mark Completed. 17. Optional: From the Annotation Explorer, you can export the extracted views as HTML or CSV files, and you can highlight any of the extracted entities in the annotated document view and get the drilldown of the views to which they belong. a. Click the Export Results icon in the Annotation Explorer. b. In the Export Results dialog, in the Path for the exported results field, type the name of a valid directory, or click Browse File System to designate a target output location, and click Finish.
76
In the target directory, a CSV folder and an HTML folder are created. The CSV folder contains a <view name>.csv file for each view. The HTML folder contains a <view name>.html file for each view. That file also contains the input document. You can upload these simple CSV files to an IBM InfoSphere BigInsights server and use them in another component of InfoSphere BigInsights. If you choose to continue with the advanced lessons, you will learn to create and publish a deployable extractor that can be used by anyone with access to your server.
Summary - the basic lessons

At this point, you have successfully extracted three basic features (currency, number, and quantifier) by using dictionaries and a regular expression. Then, you extracted instances of money by using a pattern to put the basic features in context. In the next lesson, you will use this same technique to extract candidates of revenue by division. You will do that by putting instances of money into context with instances of the basic features of revenue and division. You can use this pattern in many situations. If you decide to not continue with the more advanced lessons, you have learned how to extract basic features and how to use a pattern to extract candidates. In these first three lessons, you extracted the basic features of money: a currency symbol, a number, and a quantifier. You used dictionaries, regular expressions, and patterns. You created and output views and ran the extractor and examined the output in the annotation explorer. With these lessons, you were introduced to the fundamentals of Text Analytics and some key AQL statements. You have also successfully used the tools to identify instances of Money in IBM quarterly reports.
Lesson 4: Writing and testing AQL for candidates

In this lesson, you write AQL queries that add context to the basic features. You will build on the basic features that you defined in previous lessons to extract revenue by division that is based on the two patterns that you identified during your initial analysis: revenues for division were $x.x and division revenues were $x.x. So far, you successfully extracted all instances of Money. Now you will extract the basic features of revenue and division. To generate candidates, you use the extract pattern statement, and build on the code that you created in the previous lessons. Procedures 1. In the previous lesson you extracted Money. Now, you need to extract instances of revenue and divisions. You extract these basic features by using dictionaries: a. Right-click the revenues label and click New AQL Statement. Select Basic Features AQL statement. b. Type Revenue in the View Name field. c. In the AQL Module field, make sure to specify RevenuebyDivision_BasicFeatures as the module name.
77
d. e. f. g.
In the AQL script field, type RevenueBasic. Specify Dictionary in the Type field. Specify the Output view check box. Click OK.
Copy of paste the following code to replace the template:

create dictionary RevenueDict as (revenues, revenue); create view Revenue as extract dictionary RevenueDict with flags IgnoreCase on R.text as match from Document R; output view Revenue;
2. Run the extractor in the usual way. Your output from the Revenue view should be limited to those spans of information that contain the terms revenue or revenues in upper and lowercase. 3. Next, you want to use a dictionary to extract division names: a. Right-click the Division label and click New AQL Statement. Select Basic Features AQL statement. b. Type Division in the View Name field. c. In the AQL Module field, make sure to specify RevenuebyDivision_BasicFeatures as the module name. d. In the AQL script field, type RevenueBasic. e. Specify Dictionary in the Type field. f. Specify the Output view check box. g. Click OK. 4. Copy or type the following code to complete the Division view:
create dictionary DivisionDict as (Global Technology Services,Systems and Technology, S&TG,Software,Global Financing,Global Business Services ); create view Division as extract dictionary DivisionDict on R.text as match from Document R; output view Division;
5. Save the file and run the extractor in the usual way. The output from the Division view should contain references to the division names in the inline dictionary. This view contains 139 rows. 6. Notice in the Annotation Explorer view that in the Division view, the terms software and global financing are being picked up incorrectly as division names. Since these terms are in lowercase, the chances are good that they do not represent division names. This problem can be fixed by modifying the create dictionary statement to use the Exact flag to ensure that the text string matches the dictionary entry exactly, including case. Modify the create dictionary statement for Division in the RevenueBasic.aql script so that it looks like the following code:
78
... create dictionary DivisionDict with case exact as (Global Services,Global Technology Services, S&TG segment,Software,Global Financing, Systems and Technology Group); ...
7. Save the file, and run the extractor in the usual way. In the Annotation Explorer, the division names now look correct. There are now 95 rows returned. 8. Mark the labels revenues and Division as complete. 9. You have now extracted the three key basic features: money, revenue, and division. The next step is to extract candidates that match the two patterns that you identified earlier. 10. You will use patterns in your code to put the information from the three views MoneyRevenue and Division in context. If you remember, in Lesson 2: Selecting input documents and labeling examples on page 67, part of your goal was to find both of the following patterns: revenues for division were $x.x and division revenues were $x.x. The first pattern looks for examples where the word revenue is followed by a division name and then a money amount, with some number of tokens in between each basic feature. For example, Revenues from the System and Technology Group (S&TG) segment totaled $7.1 billon
extract pattern <R.match><Token>{1,2}<D.match><Token>{1,20}<M.match>
The second pattern looks for examples where a division name is followed by the word revenue and a money amount, with some number of tokens in between each basic feature. For example, Global Financing segment revenues increased 3 percent (flat, adjusting for currency) in the fourth quarter to $620 million.
extract pattern <D.match><Token>{1,3}<R.match><Token>{1,30}<M.match>
After you have matched both patterns and have a full set of candidates, you can union them together into a single view. a. Right-click the RevenueByDivision label and click New AQL Statement. Select Candidate Generation AQL statement. Complete the Create AQL Statement dialog with the following information: View name RevenueAndDivision Module name RevenuebyDivision_BasicFeatures Script name RevenueCandidate.aql Note: You will be using a new script to contain your candidate views, but you can continue to use the same module for all scripts. Type Pattern
Output view Specify the Output view check box. b. Click OK. c. Copy or type the following code to replace the template:
79
create view RevenueAndDivision as extract pattern <R.match> <Token>{1,2} (<D.match>) <Token>{1,20} (<M.match>) return group 0 as match and group 1 as Division and group 2 as Amount from Revenue R, Division D, Money M; output view RevenueAndDivision;
d. Save the file and run the extractor in the usual way. e. Create the view for the second pattern. Right-click the RevenueByDivision label and click New AQL Statement. Select Candidate Generation AQL statement. Complete the Create AQL Statement dialog with the following information: View name DivisionAndRevenue Module name RevenuebyDivision_BasicFeatures Script name RevenueCandidate.aql Type Pattern
Output view Specify the Output view check box. f. Click OK. g. Copy or type the following code to replace the template:
create view DivisionAndRevenue as extract pattern (<D.match>) <Token>{1,3} <R.match> <Token>{1,30} (<M.match>) return group 0 as match and group 1 as Division and group 2 as Amount from Revenue R, Division D, Money M; output view DivisionAndRevenue;
When you find all three clues (revenue, division, and money) in close proximity, there is a good chance that you have found your goal, which is revenue by division. The Token keyword and the minimum and maximum arguments limit the gaps between the revenue feature, the division feature, and the money feature. The specific words in the gaps are not important for the purposes of this lesson. But you do need to limit the number of tokens in the gap to make sure that the revenue feature, the division feature, and the money feature are in close proximity. For more information about tokenization, see Tokenization . For more information about patterns in AQL, see Sequence patterns h. Save the file, and run the extractor. 11. Now you are ready to see the results of the combination of the two patterns: a. Right-click the RevenueByDivision label and click New AQL Statement. Select Candidate Generation AQL statement. Complete the Create AQL Statement dialog with the following information: View name AllRevenueByDivision Module name RevenuebyDivision_BasicFeatures Script name RevenueCandidate.aql
80
Type
Union all
Output view Specify the Output view check box. b. Click OK. Copy or type the following code to replace the template:
create view AllRevenueByDivision as (select DR.* from DivisionAndRevenue DR) union all (select RD.* from RevenueAndDivision RD); output view AllRevenueByDivision;
12. Click File > Save. 13. Run the extractor. In the result, you see mentions of division names and their revenues. The next step in the development of this extractor is to finalize the output, such as removing duplicates and unnecessary numbers. Learn more about the value of the AQL templates: The AQL templates reduce the need to look up syntax, retype the same expressions multiple times, and debug spelling mistakes. 14. As you finalize the extractor, you no longer need the intermediate views. If users of your AQL module must materialize, or use output view statements in any of your externalized views, they can do so in their own code. You can comment out the intermediate views so that the optimizer knows that they do not need to be materialized. From the Project Explorer, edit the RevenueBasic.aql and the RevenueCandidate.aql files and comment out the output view statements: a. From the Project Explorer, find the RevenueBasic.aql file and open it. b. Add two dashes before the words output view for each of those statements. This comments the entire line and it is not compiled. c. Click File > Save. d. From the Project Explorer, find the RevenueCandidate.aql file and repeat the process of adding comments in front of the output view statements. e. Click File > Save. Learn more about extending your extractor with pre-built extractors: To see examples of extending your extractor with the pre-built extractors, see Pre-built extractor libraries. You can use the pre-built extractor libraries to enhance your custom extractors. For example, to use a view that is exported by the pre-built extractors, do the following steps: 1. Inside your InfoSphere BigInsights Eclipse environment, on a file system that is connected to the InfoSphere BigInsights cluster, right-click the TA_Training project, and select Properties. 2. In the Properties for TA_Training window, click BigInsights > Text Analytics. 3. On the General tab, click Browse File System to specify your extractor libraries. 4. Select the path to $TEXTANALYTICS_HOME/data/tam/ to select BigInsightsWesternNERMultilingual.jar. To find the $TEXTANALYTICS_HOME path, from your local file system that is associated with your InfoSphere BigInsights cluster, type echo $TEXTANALYTICS_HOME. 5. Click OK. 6. After you specify the pre-built extractor libraries, you can extend your Revenue extractor by using one of the Named entity views, such as Organization.
81
Include the following statement at the top of the RevenueCandidate.aql script, immediately after the module declaration:
import view Organization from module BigInsightsExtractorsExport as Organization;
For more information on the IMPORT statement, see The import statement. The Organization extractor identifies mentions of organization names. After importing the view, then add this view in your RevenueCandidate.aql script:
create view myOrg as select GetText (R.organization) as TheOrg from Organization R; output view myOrg;
The result shows you all of the organizations that are mentioned in the input text.
Lesson 5: Writing and testing final AQL

In this lesson, you further refine the AQL to deliver the required results. Procedures 1. The next step in AQL development is to consolidate and filter the candidates that you just extracted. The first part of consolidating the output is to get rid of duplicate information. From the Extraction Plan, right-click the RevenueByDivision label. Click New AQL Statement > Filter and Consolidate AQL Statement. 2. Complete the Create AQL Statement dialog to create a view: a. In the View Name field, type RevenuePerDivision. b. In the AQL Module field, make sure that the name is RevenueByDivison_BasicFeatures. c. In the AQL script field, type RevenueFilter as the new AQL file name. d. In the Type field, specify Consolidate. e. Enable the Output view check box. f. Click OK. The AQL file opens in the editor pane. 3. Type or copy the following code to replace the template:
create view RevenuePerDivision as select R.* from AllRevenueByDivision R consolidate on R.match using NotContainedWithin; output view RevenuePerDivision;
In this code, you are consolidating the output from the view AllRevenueByDivision to remove duplicate entries. Learn more about consolidation: You use consolidation strategies to refine candidate results by removing invalid annotations, and resolve overlap between annotations. The consolidate on clause specifies how overlapping spans are resolved across tuples that are output by a select or extract statement. For more information about consolidation strategies, see The consolidate on clause 4. Click File > Save. 5. From the Extraction Plan, right-click the parent label, and click Run the extraction plan on the entire document collection.
82
The output from this view shows some results for an entire year, which is duplicating the quarterly results. Now you need to need to filter by using a select statement with a predicate. 6. Create a view that filters out amounts that are not relevant for the quarterly numbers. a. From the Extraction Plan, right-click the RevenueByDivision label. Click New AQL Statement > Filter and Consolidate AQL Statement. b. In the View Name field, type RevenueByDivision. c. In the AQL Module field, make sure that the name is RevenueByDivision_BasicFeatures. d. In the AQL script field, select RevenueFilter as file name. e. In the Type field, specify Predicate-based filter. f. Enable the Output view check box. g. Click OK. The AQL file opens in the editor pane. Type or copy the following code to replace the template:
create view RevenueByDivision as select R.* from RevenuePerDivision R where Not(ContainsRegex(/Full-Year \d{4} Results/, LeftContextTok(R.Amount,1200))); output view RevenueByDivision;
7. Save the file and run the extractor as usual. The output should show 25 rows. This view contains exactly the information that you need for further analysis. When you apply text analytics to more complex documents, and when you are extracting more sophisticated information, you would expect to spend time improving the precision and recall of your extractor. You can also profile your extractor to understand and improve its performance characteristics. There are utilities in the Text Analytics Workflow perspective to help with both of these tasks. Learn more about some of the Text Analytics utilites: In the InfoSphere BigInsights Eclipse Text Analytics Workflow perspective, you can find help with several of the Text Analytics utilities. The following is a list of some of the utilities that you might want to explore in the Help contents: Annotation Difference Viewer Displays a side-by-side comparison of the extracted results from the same input file. You can use the Annotation Difference Viewer to understand how modifying the AQL statements in an extractor affects the results. Also, you can use the Annotation Difference Viewer to understand how the extracted results compare with a labeled data collection. Provenance View Displays the results from viewing the lineage of analysis results and is useful for understanding the results of an extractor. It explains in detail the provenance, or lineage of an output tuple, that is, how that output tuple is generated by the extractor. You access the Provenance View through the Result Table View. Profiler View Helps you to troubleshoot performance problems in the AQL code. The Profiler also calculates the throughput of the extractor (in KB/seconds) by dividing the size of the data that was processed by the total duration of the Profiler execution.
83
Pattern Discovery View Displays results from discovering patterns in text input. Pattern discovery identifies contextual clues from documents in the data collection that help you refine the accuracy and coverage of an extractor. Explain Module View Displays the metadata of the module and the compiled form of the extractor.
Lesson 6: Finalizing and exporting the extractor

In this lesson, you finalize and export your extractor. You now want to export the AQL module to make it available. Procedures 1. Before you export, in the RevenueFilter.aql file, comment out the output view statement in the RevenuePerDivision view. Keep the output view statement in the RevenueByDivision view. You now have one view that you want to materialize. Save the file. 2. Export the module. a. In Project Explorer, right-click the project TA_Training, and click Export. b. In the Export dialog, expand BigInsights. c. Click Export Text Analytics Extractor and click Next. d. In the Modules to be exported list, select the module RevenueByDivision_BasicFeatures. e. Select the Export dependent modules check box. f. Click Browse File System and select a destination for the export. g. Specify Export to a jar or zip archive under the destination directory. Type RevenueByDivision.zip in the File Name field. h. Click Finish. 3. If the export process was successful, you see a confirmation message. Click OK to close the message. You can now use this extractor in new modules that you create in the Eclipse environment. You can also publish and then deploy the extractor to the InfoSphere BigInsights Console, as you will learn in the next lesson.
Lesson 7: Publishing the AQL module

In this lesson, you will learn to publish your extractor to the InfoSphere BigInsights Console so that it can be deployed and run by anyone with access to that server. Prerequisites Complete Lesson 6: Finalizing and exporting the extractor, and have access to an IBM InfoSphere BigInsights server. Procedures 1. From the Text Analytics Workflow perspective, open the Project Explorer. 2. Right-click the TA_Training project. 3. Click BigInsights Application Publish.
84
4. In the BigInsights Application Publish wizard, complete the workflow information: a. In the Location page, specify an IBM InfoSphere BigInsights server as the destination of the extractor. If you did not register a server, you can click Create to create a connection. Click Next. b. In the Application page, specify Create New Application. c. In the Name field, type a unique application name, such as TA_TrainingV1. d. Optional: In the Description field, enter text that you want the users of your application to see. This text might be instructions or hints on how to start. e. Optional: Select an icon that you want to associate with your application. A default application icon is used if you select nothing. f. In the Categories field, type a keyword by which you can identify this application. For the purposes of this tutorial, type extractors. g. Click Next. h. On the Type page, specify Text Analytics as the application type. Click Next. i. In the Text Analytics page, select the AQL module to publish, and the output views to use. For the purposes of this tutorial, select RevenueByDivision_BasicFeatures in the Module field. Select RevenueByDivision_BasicFeatures.RevenueByDivision in the Output Views field. j. In the BigSheets page, specify the plug-in definitions. Accept the defaults, and click Next. k. In the Parameters page, locate any valid parameters that will be used in the InfoSphere BigInsights Console, and add values if they are needed as defaults. For the purposes of this tutorial, accept the defaults. Click Next. l. In the Publish page, click Add to select the TA_Training project. m. Click Finish. The application is placed in the IBM InfoSphere BigInsights server. Open the InfoSphere BigInsights Console and open the Applications tab. Click Manage to find your application.
Summary of creating your first text analytics application

In this module, you analyzed text about the IBM quarterly reports from various input files. These steps summarize what you did to complete your extractor. 1. You first identified the collection of documents from which you wanted to extract information. 2. You analyzed the documents to identify examples of the information from which you wanted to extract. 3. You wrote AQL statements to extract the identified information. 4. You tested and refined the AQL statements. 5. Then, you exported the final extractor and published the extractor to the InfoSphere BigInsights Console.
85
Lessons learned
You now have a good understanding of the following tasks: v How to create a Text Analytics project in Eclipse. v How to use the Text Analytics development process and the supporting tools. v How to analyze text documents to populate an extraction plan by identifying interesting text and clues. v How to create and test AQL scripts to extract candidates. v How to create AQL statements to filter the candidates to extract useful insights.
There are articles on IBM developerWorks that give you further information about Text Analytics. v Analyzing social media and structured data with InfoSphere BigInsights v Analyze text from social media sites with InfoSphere BigInsights: Use Eclipse-based tools to create, test, and publish text extractors . Related information: Text Analytics Lifecycle Text Analytics AQL Reference eclipse.org
86
Notices and trademarks

This information was developed for products and services offered in the U.S.A.
Notices
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing IBM Corporation North Castle Drive Armonk, NY 10504-1785 U.S.A. For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to: Intellectual Property Licensing Legal and Intellectual Property Law IBM Japan Ltd. 1623-14, Shimotsuruma, Yamato-shi Kanagawa 242-8502 Japan The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web
87
sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact: IBM Corporation J46A/G4 555 Bailey Avenue San Jose, CA 95141-1003 U.S.A. Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee. The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us. Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. This information is for planning purposes only. The information herein is subject to change before the products described become available. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to
88
IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are provided "AS IS", without warranty of any kind. IBM shall not be liable for any damages arising out of your use of the sample programs. Each copy or any portion of these sample programs or any derivative work, must include a copyright notice as follows: (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. Copyright IBM Corp. _enter the year or years_. All rights reserved. If you are viewing this information softcopy, the photographs and color illustrations may not appear.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at www.ibm.com/legal/ copytrade.shtml. The following terms are trademarks or registered trademarks of other companies: Adobe is a registered trademark of Adobe Systems Incorporated in the United States, and/or other countries. Intel and Itanium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows and Windows NT are trademarks of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. The United States Postal Service owns the following trademarks: CASS, CASS Certified, DPV, LACSLink, ZIP, ZIP + 4, ZIP Code, Post Office, Postal Service, USPS and United States Postal Service. IBM Corporation is a non-exclusive DPV and LACSLink licensee of the United States Postal Service. Other company, product or service names may be trademarks or service marks of others.
Notices and trademarks
89
90
Providing comments on the documentation

You can provide comments to IBM about this information or other documentation.
About this task

Your feedback helps IBM to provide quality information. You can use any of the following methods to provide comments:
Procedure
v Send your comments by using the online readers' comment form at www.ibm.com/software/awdtools/rcf/. v Send your comments by e-mail to comments@us.ibm.com. Include the name of the product, the version number of the product, and the name and part number of the information (if applicable). If you are commenting on specific text, include the location of the text (for example, a title, a table number, or a page number).
91
92
Printed in USA
GC19-4104-01

Big Insights

Diunggah oleh

Informasi Dokumen

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

Big Insights

Diunggah oleh

Hak Cipta:

Format Tersedia

IBM InfoSphere BigInsights Version 2.

IBM InfoSphere BigInsights Version 2.1

Chapter 6. Tutorial: Developing Big SQL queries to analyze big data . . . . 33

Chapter 3. Tutorial: Importing data for analysis . . . . . . . . . . . . . . . 7

Chapter 4. Tutorial: Analyzing big data with BigSheets . . . . . . . . . . . 13

Chapter 5. Tutorial: Developing your first big data application . . . . . . . 25

Notices and trademarks . . . . . . . 87 Providing comments on the documentation . . . . . . . . . . . 91

Copyright IBM Corp. 2013

IBM InfoSphere BigInsights Version 2.1: Tutorials

Chapter 1. InfoSphere BigInsights Tutorials

Related information: InfoSphere BigInsights, Version 2.1: Tutorials (PDF)

Copyright IBM Corp. 2013

IBM InfoSphere BigInsights Version 2.1: Tutorials

Chapter 2. Tutorial: Managing your big data environment

Lesson 1: Starting to use the InfoSphere BigInsights Console

Copyright IBM Corp. 2013

Option In an SSL installation

Tasks Quick Links

Lesson 2: Exploring the InfoSphere BigInsights Console

IBM InfoSphere BigInsights Version 2.1: Tutorials

Summary of managing your big data environment

Chapter 2. Tutorial: Managing your big data environment

IBM InfoSphere BigInsights Version 2.1: Tutorials

Chapter 3. Tutorial: Importing data for analysis

Lesson 1: Managing your data

Lesson 2: Importing data by using the BoardReader application

IBM InfoSphere BigInsights Version 2.1: Tutorials

Lesson 3: Importing data by using the Distributed File Copy application

IBM InfoSphere BigInsights Version 2.1: Tutorials

Summary of importing data to the distributed file system

Chapter 3. Tutorial: Importing data for analysis

IBM InfoSphere BigInsights Version 2.1: Tutorials

Chapter 4. Tutorial: Analyzing big data with BigSheets

Lesson 1: Creating master workbooks from social media data

IBM InfoSphere BigInsights Version 2.1: Tutorials

Lesson 2: Tailoring your data by creating child workbooks

IBM InfoSphere BigInsights Version 2.1: Tutorials

Column Column Column Column Column

C: Language D: Published E: SubjectHtml F: Tags G: Type

Lesson 3: Combining the data from two workbooks

IBM InfoSphere BigInsights Version 2.1: Tutorials

Lesson 4: Creating columns by grouping data

Lesson 5: Viewing data in BigSheets diagrams

Lesson 6: Visualizing and refining the results in charts

IBM InfoSphere BigInsights Version 2.1: Tutorials

IBM InfoSphere BigInsights Version 2.1: Tutorials

Lesson 7: Exporting data from your workbooks

Summary of analyzing data with BigSheets tutorial

IBM InfoSphere BigInsights Version 2.1: Tutorials

Chapter 5. Tutorial: Developing your first big data application

v Upgrade your application to accept input parameters.

Lesson 1: Creating an InfoSphere BigInsights project

Copyright IBM Corp. 2013

Lesson 2: Creating and populating a Jaql file with application logic

6. Save and close the MyJaql.jaql file.

IBM InfoSphere BigInsights Version 2.1: Tutorials

Lesson 3: Testing your application

Chapter 5. Tutorial: Developing your first big data application

Option In an SSL installation

Lesson 4: Publishing your application in the InfoSphere BigInsights applications catalog

IBM InfoSphere BigInsights Version 2.1: Tutorials

Lesson 5: Deploying and running your application on the cluster

Lesson 6: Upgrading your application

IBM InfoSphere BigInsights Version 2.1: Tutorials