
Data Science on Bluemix

This document outlines the potential benefits of, and possible roadblocks to, using the Bluemix cloud
platform to deliver Data Science as a service within the context of a Rapid Analytics Results (RAR)
project. It compares Bluemix with the current RAR platform in terms of functionality, deployability
and flexibility, and draws conclusions on where Bluemix can add value.

Proposed RAR – Bluemix Architecture

Bluemix Services
Bluemix has recently emerged as IBM's proposition for cloud application development. It offers a
pay-per-use model: users pay for the resources they consume, be it gigabytes transferred, API calls
made, etc. Bluemix provides a fully managed cloud environment in which developers can, in principle,
easily pull resources and services together through APIs to build applications quickly.

Among the services available in Bluemix, a number are clearly geared towards analytics and offer
functionality that resembles, or even matches, stand-alone IBM technologies and open source software.
The main analytics-oriented services in Bluemix include:

• Storage and database technologies such as DashDB, Cloudant, MongoDB and Object Storage.

• Analytics runtimes such as Node-RED, Spark, and R and Python notebooks. Related to this is the
new Data Science Experience, a platform to develop and share Spark, Python and R notebooks.
You can sign on to the Data Science Experience at:
https://apsportal.ibm.com/analytics

• Scoring services such as the “Predictive Analytics” service, which allows you to run an SPSS model
for scoring. Note that this functionality is limited to scoring already developed models; all
model development still needs to be conducted in the SPSS Modeler client.

• Big Data services such as BigInsights for Hadoop.

• Virtualization services such as Docker Containers and Virtual Servers (Beta).

• Watson services: Bluemix offers a wealth of Watson services that can be easily integrated into your
applications, including Speech to Text (and vice versa), AlchemyAPI, Personality Insights,
Tone Analyzer and Natural Language Classifier.

For a complete list of all Bluemix Services and their availability by region refer to the following link:

https://console.ng.bluemix.net/docs/services/index.html#experimental_services

Using Bluemix's Analytics Services


Bluemix offers quite detailed documentation on its many services, and it is not the intention of this
document to explain each one in detail. The objective is to present an overview of the services most
likely to prove helpful in a Data Science project, to evaluate in which contexts a Data Science team
can benefit from them, and to assess whether they can be used within a client engagement.

Within the context of analytics, one service that emerges as a clear alternative to the current RAR
setup (refer to the RAR overview documentation) is DashDB. DashDB can be deployed as a stand-alone
database or as a data warehouse that forms part of a Cloudant database deployment.

Some of the characteristics that make DashDB especially attractive are:

• It is largely based on DB2, uses many of the same commands and offers the same basic
functionality.

• It includes features very similar to those that come with the BLU Acceleration option for DB2,
meaning that it offers columnar indexing.

• Because of its columnar indexing, DashDB allows the user to use Netezza-like operations to
conduct in-database analytics, which results in increased speed and computation power.

• DashDB in Bluemix is closely linked to the R runtime and readily offers an interface where R
code can be executed against the database tables and processed directly in-database.

• It can be used as a data source in the SPSS Modeler client via an ODBC data source
configuration. The steps to do this are quite straightforward and can be found at the following
link: https://www.ibm.com/...dashDB.../connecting/connect_connecting_spss_statistics.html

DashDB – SPSS Integration
A number of considerations must be taken into account when using DashDB in combination with SPSS for
projects with a significant data load, chiefly that not all SPSS operations can be pushed back to the
database. For operations that cannot be pushed back, SPSS will pull all the necessary data from the
database and perform the operations outside it (locally). This is acceptable if the amount of data is
not too large (think less than 5 GB of raw data), but for significant data volumes it would be
necessary to have Modeler Server available to hold the data.
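
The difference between an operation that is pushed back to the database and one that is executed
locally can be sketched as follows. This is an illustration only, not how SPSS Modeler implements SQL
pushback: it uses Python's built-in sqlite3 module as a stand-in for DashDB, and the table name
"sales" and its columns are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for DashDB (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 15.0), ("south", 7.5)],
)


def totals_in_database(connection):
    """Pushed-back style: the database aggregates; only summary rows are transferred."""
    rows = connection.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    return dict(rows)


def totals_locally(connection):
    """Local style: every row is pulled to the client and aggregated there."""
    totals = {}
    for region, amount in connection.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


in_db = totals_in_database(conn)
local = totals_locally(conn)
```

Both functions return the same totals, but the second transfers one row per record rather than one row
per region, which is why the local route becomes impractical as data volumes grow.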

We are currently investigating whether a Modeler Server can be instantiated and deployed using the
Virtual Server capability in Bluemix or whether it would need to be deployed separately in SoftLayer.

The following link points to a PDF document that offers a comprehensive overview of which types of
operations can be pushed back to the database from SPSS Modeler and which cannot. It is intended to
cover the SPSS – Netezza integration, but since DashDB uses very similar processes and in fact shares
many packages with Netezza, the expectation is that the same conditions and constraints will apply.

Link: https://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwitv92Mk-vNAhXDVxQKHaMbCAcQFggjMAE&url=https%3A%2F%2Fwww.ibm.com%2Fdeveloperworks%2Fcommunity%2Ffiles%2Fform%2Fanonymous%2Fapi%2Flibrary%2F1b6a2624-dc86-4856-b4ed-cdda6bfdecda%2Fdocument%2F575e9b6a-dd06-4b74-aec4-f15ae2c5d4b9%2Fmedia%2FOverview%2520of%2520SPSS%2520Modeler%2520Integration%2520with%2520IPDA.pdf&usg=AFQjCNEfFOTkiNNoznDUZUHsJNvyhj1SbA

Lastly, the steps needed to create a database connection between DashDB and the Modeler client are
available via the link in the previous section. The DashDB credentials needed to create the ODBC
connection are available under the “Connections” section of the main DashDB web interface.
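
As a rough sketch of how those credentials fit together, the snippet below assembles a DB2-style ODBC
connection string. The keywords (DATABASE, HOSTNAME, PORT, PROTOCOL, UID, PWD) follow the usual DB2
CLI/ODBC convention. The driver name, the dictionary keys and all credential values are hypothetical
placeholders and should be replaced with whatever the “Connections” page and the locally installed
driver actually provide.

```python
def build_dashdb_connection_string(credentials, driver="IBM DB2 ODBC DRIVER"):
    """Assemble a DB2-style ODBC connection string from a credentials dict.

    The driver name is an assumption; it depends on the driver installed locally.
    """
    return (
        f"DRIVER={{{driver}}};"
        f"DATABASE={credentials['db']};"
        f"HOSTNAME={credentials['hostname']};"
        f"PORT={credentials['port']};"
        "PROTOCOL=TCPIP;"
        f"UID={credentials['username']};"
        f"PWD={credentials['password']};"
    )


# Hypothetical placeholder values, not real credentials.
example_credentials = {
    "db": "BLUDB",
    "hostname": "dashdb.example.com",
    "port": 50000,
    "username": "user123",
    "password": "secret",
}

conn_str = build_dashdb_connection_string(example_credentials)
```

A string of this shape could then be handed to an ODBC client library (for example pyodbc) or used as a
reference when filling in the ODBC data source dialog for the Modeler client.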

Moving Data from External sources into Bluemix


There are a number of ways to move data into Bluemix, the simplest of which is to upload directly from
the local file system using drag and drop. However, it will sometimes be impossible or inconvenient to
move large amounts of data using this method.

One option is the “Object Storage” service, from which data can then be pulled by other Bluemix
services. For example, the Spark service in Bluemix mainly works with files that have been stored in
Object Storage.

However, for many large customers, the core data that drives their business resides in established
database systems behind their firewall, accessed through classic middleware (e.g., an Oracle, MySQL or
DB2 database).

In this case the Secure Gateway service can be used as a means to open a communication channel from
behind the firewall to the IBM Cloud. The high-level architecture of such a solution would look
something like this:

A comprehensive tutorial on how to set up such a Gateway Service can be found through this link:
https://github.com/data-henrik/Bluemix-onprem-data
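
Once a gateway has been set up, one simple sanity check is to confirm that the gateway endpoint accepts
TCP connections from the cloud side. The sketch below is only an illustration using Python's standard
socket module. It is not taken from the tutorial, the host name and port in the usage comment are
hypothetical placeholders, and a successful TCP connection does not by itself prove that the on-prem
database behind the gateway is reachable.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Hypothetical usage; replace with the endpoint shown for your gateway destination:
# is_reachable("gateway.example.com", 15000)
```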

In addition, we are currently exploring other ways of moving data from external on-prem sources into
Bluemix that do not require a direct connection. One such method would be to use a configuration
similar to the one currently used in the RAR platform, with a proxy and SFTP server serving as a secure
landing zone before the data is moved into Bluemix.

Another alternative currently being researched is the use of Aspera. More information is available
here: http://www-03.ibm.com/software/products/en/high-speed-file-transfer

Aspera is already used, in Beta, within the DashDB service as one of the options for uploading data
from a local file system. A more advanced version of the service would include client and server
packages, and their integration would still need to be assessed.

Bluemix Security Overview


This Cloud Service follows IBM’s data security and privacy principles for IBM SaaS which are available at
https://www.ibm.com/cloud/resourcecenter/content/80 and any additional terms provided in this
section. Any change to IBM’s data security and privacy principles will not degrade the security of the
Cloud Service.

Client can review available Cloud Service regulatory compliance certifications at
https://www.ng.bluemix.net/docs/security/index.html#compliance. Except for available regulatory
compliance certifications, the Cloud Service is not designed to any specific security requirements for
regulated data, such as PI or SPI. Client is responsible to determine if this Cloud Service meets Client's
needs with regard to the type of content Client or Client's end users may use in connection with the
Cloud Service or any resulting application.

On the topic of data collection, the client agrees that IBM may, as part of the normal operation and
support of the Cloud Service, collect logon information from Client (Client’s employees and contractors)
related to the use of the Cloud Service, through tracking and other technologies.

IBM does so to gather usage statistics for billing and other purposes and information about effectiveness
of our Cloud Service for the purpose of improving user experience and/or tailoring interactions with
Client. All data is used solely for the purpose of supporting the specific environment, which includes
assuring the security, availability, performance, capacity and health of that environment. This data will
not be used or shared for other purposes.

Client is responsible to obtain or have obtained consent to allow IBM, other IBM companies and
subcontractors to process the collected personal information, such as name, email, or IP address, for the
above purpose wherever we and our subcontractors do business, in compliance with applicable law. IBM
will comply with requests from Client’s employees and contractors to access, update, correct or delete
their collected personal information.

IBM will not collect or access data stored by Client applications, services, or end users who access the
Client applications or other personally identifiable end customer information except as directed by
Client.
