
TABLE OF CONTENTS

1. ABOUT THE COMPANY 1

1.1. INTRODUCTION 1

1.1.1. INTEGRATED SOLUTIONS 1

1.1.2. INCREASED AGILITY 1

1.1.3. A PAY-AS-YOU-USE MODEL 2

1.1.4. PERSONALIZED SOLUTIONS 2

1.1.5. AUTO UPGRADES 2

1.1.6. ENHANCED BUSINESS CONTINUITY 2

2. ABOUT THE DEPARTMENT 3

2.1. EDUCATION 3

2.1.1. K12 3

2.1.2. HIGHER EDUCATION 3

2.1.3. PROFESSIONAL TRAINING 4

2.2. MANUFACTURING 4

2.2.1. AUTOMOTIVE COMPONENT 4

2.2.2. PHARMACEUTICAL 4

2.2.3. PROCESS INDUSTRY 5

2.3. EXAM BOARDS 5

3. OBJECTIVE OF THE INTERNSHIP 6

4. SCHEDULE-WEEK WISE 7

5. METHODS, SKILLS AND TECHNOLOGY DEVELOPMENT 8

5.1. STATISTICS 8
5.1.1. INFERENTIAL STATISTICS 9

5.1.2. NORMAL DISTRIBUTION 10

5.1.3. CONFIDENCE LEVEL 12

5.1.4. HYPOTHESIS TESTING 12

5.1.5. REGION OF REJECTION AND NON-REJECTION 13

5.2. PYTHON 13

5.2.1. FEATURES OF PYTHON 14

5.2.2. PYTHON DATA TYPES 15

5.2.3. PYTHON NUMBERS 15

5.2.4. PYTHON STRINGS 16

5.2.5. PYTHON LISTS 16

5.2.6. PYTHON TUPLES 17

5.2.7. PYTHON DICTIONARY 17

5.3. JUPYTER NOTEBOOK 18

5.4. DATA ANALYTICS 18

5.5. MACHINE LEARNING 19

5.5.1. HOW ML WORKS? 20

5.5.2. CLASSIFICATION OF MACHINE LEARNING 20

5.6. PERCEPTRON: THE ARTIFICIAL NEURON 22

6. LEARNING OUTCOME OF INTERNSHIP 27

6.1. INTRODUCTION 27

6.2. PROPOSED METHOD 28

6.3. DATA EXPLORATION 30

7. CONCLUSION 35
REFERENCES 36
CHAPTER 1

ABOUT THE COMPANY

1.1. INTRODUCTION
TCS iON is a strategic unit of Tata Consultancy Services focused on Manufacturing
Industries (SMB), Educational Institutions and Examination Boards. TCS iON provides
technology by means of a unique IT-as-a-Service model, offering end-to-end business
solutions. It caters to the needs of multiple industry segments, through innovative, easy-to-
use, secured, integrated, hosted solutions in a build-as-you-grow, pay-as-you-use business
model. TCS iON serves its clients with the help of best practices gained through TCS' global
experience, domestic market reach, skills, and delivery capabilities. TCS iON's cloud-based
solution is highly modular, scalable and configurable, giving businesses and educational
institutions the benefits of increased efficiency, faster go-to-market, predictability of both
technology and spend, and better business results.

1.1.1. INTEGRATED SOLUTIONS

As a cloud-based solution, TCS iON offers single-window IT with a pre-integrated suite of
hardware, network, software and services. TCS iON ensures that the functions are digitized,
automated and connected. For example, if you are using a CRM solution along with a core
ERP (e.g. a Manufacturing ERP) and have a document management system to organize
supporting files and an HRMS, TCS iON ensures that these solutions are connected and work
as one. So for the customers, it is simply one IT and not multiple applications. Integrated
applications thus provide a comprehensive view of the business, enabling better decisions.

1.1.2. INCREASED AGILITY

TCS iON brings in the agility to keep pace with changing processes or a new line of
business. They help you configure the processes to work as you currently do, or the software
recommends industry best practices that you can choose based on your business parameters.
TCS iON gives you increased convenience, allowing you to perform various tasks from your
mobile device, no matter where you are. Being automatically compliant with statutory
requirements, the solution ensures your company is always audit ready and legally compliant.

1.1.3. A PAY-AS-YOU-USE MODEL

Their model eliminates up-front capital investment, as TCS iON facilitates
procurement of the IT infrastructure and software on rent for the duration of the
contract. Additionally, you only pay for the number of users who actually use the
solution. Thus, you pay as you use on a monthly basis, which includes maintenance
and training. Typically, with the TCS iON cloud-based solutions, the ROI exceeds the
rental within three months when best practices are well followed.

1.1.4. PERSONALIZED SOLUTIONS

Although TCS iON is a cloud service for education, exam boards and
manufacturing, the software is configurable to each sector. You will always get the
flavour of your business by picking and choosing the processes you need.
Furthermore, the multilingual capability of the solution allows you to customize the
solution label names to read in vernacular languages (like Hindi, Marathi, Tamil, etc.),
enabling users to learn and operate the solution with ease.

1.1.5. AUTO UPGRADES

TCS iON continuously invests in its cloud-based solutions to incorporate best
practices. The solution is constantly enriched based on user feedback and industry and
statutory changes. You will get the upgrades without disrupting your business
operations and at no additional cost. Being in perpetual beta ensures that there is no
technology obsolescence.

1.1.6. ENHANCED BUSINESS CONTINUITY

Their solution offers optimal performance over normal broadband connectivity,
along with a stringent security mechanism to ensure your data privacy is maintained.
The capacity of the TCS iON cloud-based solution grows with your increasing
computing needs and reduces the need for IT staff. The solution is resilient to failures,
as the service works from back-up data centers in the event of a disaster, ensuring
continuity of business operations.

CHAPTER 2

ABOUT THE DEPARTMENT

2.1. EDUCATION

2.1.1. K12

In the competitive world, every institution faces varied challenges while
working on its main goal of imparting quality education. As an institute head, you not
only have to ensure the admissions for the financial year are conducted on time, but
also obtain accreditation for your institution, seek affiliation, and ensure compliance.
As an educator, you have to create lesson plans, keep track of students' attendance,
prepare the daily timetable, and monitor the quality of content delivered to learners.

Addressing the Challenges

 To help educational institutions overcome the challenges, we begin by
digitising all student lifecycle processes such as admissions, exams and
grading, lesson planning, and community collaboration.
 This allows key stakeholders to view the entire transactional history of the
student from admission to separation on a single screen.
 The solution's availability on the cloud allows stakeholders to have seamless
access to such information from anywhere across the globe.
 It also enables institutions to conduct digital evaluations and offers
communications tools to facilitate seamless communication between students
and the institute.

2.1.2. HIGHER EDUCATION



With an increasing emphasis on personalised learning, the education sector is
witnessing a radical transformation in the technology landscape. The adoption of
online learning platforms and the move to cloud-based solutions are some of the
changes. In the process of imparting progressive and quality education, institutions
face several challenges including:

 Streamlining the admission process.


 Achieving accreditation, affiliation, and regulatory compliance.
 Ensuring accountability.
 Optimising costs.
 Creating an active learning environment to improve learning outcomes.

2.1.3. PROFESSIONAL TRAINING

Professional education prepares students to work in the real world. It aims to
hone the practical skills of students, helping them pursue jobs in the fields of their
choosing, unlike classroom training, where the focus is generally on academics.
However, professional education has not yet evolved enough to effectively address
these challenges:

 Lack of job-ready youth with updated skills


 Lack of trainers who can provide quality training
 An insufficient number of career service centres that can provide career
counselling, disseminate information and guide job seekers.

2.2. MANUFACTURING

2.2.1.AUTOMOTIVE COMPONENT
Automotive component manufacturers need to improve operational
efficiencies and implement best-in-class shop floor practices to meet high quality
standards and ensure faster turn-around and JIT delivery. These parameters need to be
met even while working on high volumes.
The iON Manufacturing Solution for the automotive component industry helps:



 Automate your production planning and control process (PPC) and enable
material requirement planning
 Trigger job orders and purchase orders to run just-in-time inventory
 Reduce cost by making your production faster and leaner
 Improve product quality through stringent quality control at different levels,
including raw material, semi-finished goods, and processes.

2.2.2.PHARMACEUTICAL
The pharmaceutical industry in India is faced with challenges such as cheap
imports and increased scrutiny from US FDA. As a result, the industry requires a
balanced mix of imported and domestically purchased active pharmaceutical
ingredients (API), efficient supply chain management and on-going R&D. The TCS
iON Manufacturing Solution is a domain-centric, analytics-driven cloud solution
which offers offers an integrated view of operations acrossthe organization.
The iON Manufacturing Solution for the pharmaceutical industry helps

 Manage procurement efficiently and maintain the right level of inventory


 Get real-time information and complete visibility into demand cycles
 Facilitate better planning
 Manage compliance through automatic generation of certificate of analysis
 Track licensing status.

2.2.3. PROCESS INDUSTRY

Process manufacturers face dynamic industry challenges such as:

 Variations in bills of material (BoM)
 Complex inventory management processes
 Inaccurate raw material consumption measurement
 Insufficient safety stock

The iON Manufacturing Solution offers consulting-driven ERP implementation to
overcome these challenges and accelerate business performance.



2.3. EXAM BOARDS
Exam boards comprise school boards and examination bodies that are tasked
with the crucial role of conducting large-scale, high-stakes assessments. Some of
the common challenges faced by examination boards include:

 Time consuming manual processes


 Lack of subject matter experts to prepare question banks
 Question paper leakage or damage during transit
 Complex and time-consuming compliance processes

CHAPTER 3

OBJECTIVE OF THE INTERNSHIP

An internship provides a variety of benefits for young workers who want to


broaden their chances for landing a job and jump-starting their careers. Internships give
you a taste of what a profession is like, help you build your resume and let you meet
people who can help you in your career. Don’t be passive during an internship and miss
opportunities to expand your business background. Take advantage of the many benefits
of holding an internship.

1. Apply business concepts and theories to real-world decision-making.

2. Increase proficiency in specific business disciplines, such as human
resources management, operations management, marketing, accounting,
statistics, economics, finance, and business law.

3. Develop and improve business skills in communication, technology,


quantitative reasoning, and teamwork.

4. Observe and participate in business operations and decision-making.



5. Meet professional role models and potential mentors who can provide
guidance, feedback, and support

6. Expand network of professional relationships and contacts

This internship helped me to learn about the practical scenarios of -

 Statistics for Data Science


 How to work in Jupyter Notebook
 How to analyse the dataset
 Python for Machine Learning
 Machine Learning Algorithms

CHAPTER 4

SCHEDULE-WEEK WISE

1. Week 1 Statistics for Data Science

I. Descriptive Statistics
II. Inferential Statistics

2. Week 2 Python for Machine Learning

I. Python Basics
II. Data Types
III. Python Libraries



IV. Plotting

3. Week 3 Machine Learning Algorithms

I. Linear Regression
II. Logistic Regression
III. Decision Tree
IV. Random Forest
V. Clustering

4. Week 4 Mini-Project

CHAPTER 5

METHODS, SKILLS AND TECHNOLOGY DEVELOPMENT

5.1. STATISTICS
Statistics is the science of collecting, organizing, analysing, and interpreting
data in order to make decisions.

Statistics is branched into Descriptive and Inferential Statistics.



Descriptive statistics comprise a collection of techniques for understanding
what a group of people looks like in terms of the measure or measures you’re
interested in. In general, there are four classes of descriptive techniques.

First, frequency distributions are used to display information about where
the scores in a data set fall along the scale going from the lowest score to the highest
score. Second, measures of central tendency, or averages, provide the best single
numbers to use in representing all of the scores on a particular measure. Third,
measures of variability provide information about the amount of dispersion or
scattering of values away from a central value. Fourth, the original raw scores one
collects are often transformed to other types of scores in order to provide the
investigator with different types of information about the research participants in a
study. A standard score is a very good example of a transformed score that
provides much more information about an individual subject than a raw score can.

Figure 5.1: Types of data.
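As a small illustration of these descriptive techniques, the sketch below uses NumPy on a set of hypothetical exam scores (made up for illustration) to compute a frequency distribution, the mean, the standard deviation and standard (z) scores:

import numpy as np

# Hypothetical raw exam scores (illustrative data only)
scores = np.array([45, 52, 58, 61, 61, 67, 70, 74, 79, 85])

# Frequency distribution: how many scores fall in each 10-point band
counts, bin_edges = np.histogram(scores, bins=range(40, 100, 10))
print("Frequency distribution:", dict(zip(bin_edges[:-1], counts)))

# Measures of central tendency and variability
mean = scores.mean()
std = scores.std(ddof=1)            # sample standard deviation
print("Mean:", mean, "Standard deviation:", round(std, 2))

# Standard (z) scores: transformed scores measuring distance from the mean
z_scores = (scores - mean) / std
print("z-scores:", np.round(z_scores, 2))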

5.1.1. INFERENTIAL STATISTICS

POPULATION AND SAMPLE

The “population” in statistics includes all members of a defined group that we
are studying or collecting information on for data-driven decisions.

A “sample” is defined as the portion of a population that has been selected
for analysis. Rather than selecting every item in the population, statistical sampling
procedures focus on collecting a small representative group of the larger population.



Figure 5.2: Population and Sample.

SAMPLING METHODS-

Figure 5.3: Types of Samples.

5.1.2. NORMAL DISTRIBUTION



Whenever a random experiment is replicated, the random variable that equals
the average (or total) result over the replicates tends to have a normal distribution as
the number of replicates becomes large. It is one of the cornerstones of probability
theory and statistics, because of the role it plays in the Central Limit Theorem, and
because many real-world phenomena involve random quantities that are approximately
normal (e.g., errors in scientific measurement).

Figure 5.4: Normal distribution.

It can be observed from the above graph that the distribution is symmetric
about its center, which is also the mean (0 in this case). This makes events at equal
deviations from the mean equally probable. The density is highly concentrated around
the mean, which translates to lower probabilities for values away from the mean.

The probability density function of the general normal distribution is given as:

f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²))

In the above formula, σ is the standard deviation and µ is the mean. It is easy to get
overwhelmed by the formula while trying to understand everything in one glance, but
we can break it down into smaller pieces so as to get an intuition of what is going on.



The z-score is a measure of how many standard deviations away a data point
is from the mean. Mathematically,

z = (x − µ) / σ

The exponent of e in the above formula is the square of the z-score times −1/2.
This is in accordance with the observations that we made above. Values away
from the mean have a lower probability compared to the values near the mean. Values
away from the mean will have a higher z-score and consequently a lower probability,
since the exponent is negative. The opposite is true for values closer to the mean.

This gives rise to the 68-95-99.7 rule, which states that the values lying within
bands around the mean of width two, four and six standard deviations (that is, within
one, two and three standard deviations on either side) comprise 68%, 95% and 99.7%
of all the values. The figure given below shows this rule.

Figure 5.5: Division of Normal Distribution
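The rule can be checked numerically with SciPy's normal distribution; this short verification is only an illustrative addition, not part of the original report:

from scipy.stats import norm

# Probability mass within 1, 2 and 3 standard deviations of the mean
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"Within {k} standard deviation(s): {p:.4f}")
# Prints approximately 0.6827, 0.9545 and 0.9973, i.e. the 68-95-99.7 rule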

THE CENTRAL LIMIT THEOREM



The Central Limit Theorem states that as the sample size gets large enough, the
sampling distribution of the mean is approximately normally distributed. This is true
regardless of the shape of the distribution of the individual values in the population.
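A quick simulation, included here only as an illustration with NumPy, shows the theorem in action: means of samples drawn from a strongly skewed exponential population are themselves approximately normally distributed around the population mean.

import numpy as np

rng = np.random.default_rng(0)

# A skewed (exponential) population
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples of size 50 and record each sample mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

# The sample means cluster around the population mean with spread sigma/sqrt(n)
print("Population mean:", round(population.mean(), 3))
print("Mean of sample means:", round(float(np.mean(sample_means)), 3))
print("Std of sample means:", round(float(np.std(sample_means)), 3))
print("Population std / sqrt(50):", round(population.std() / np.sqrt(50), 3))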

5.1.3. CONFIDENCE LEVEL

The confidence interval is a type of interval estimate from the sampling
distribution which gives a range of values in which the population statistic may lie.
Let us understand this with the help of an example.

We know that 95% of the values lie within 2 (1.96 to be more accurate)
standard deviations of the mean of a normal distribution curve. So, for such a curve,
the blue shaded portion in the figure below represents the confidence interval for a
sample mean of 0.

Figure 5.6: Confidence Level
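For example, a 95% confidence interval for a sample mean can be computed from the sample mean and its standard error. The sketch below does this with SciPy on a small hypothetical sample (the numbers are made up):

import numpy as np
from scipy import stats

# Hypothetical sample measurements
sample = np.array([4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9])

mean = sample.mean()
sem = stats.sem(sample)   # standard error of the mean
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"Sample mean = {mean:.3f}, 95% confidence interval = ({low:.3f}, {high:.3f})")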

5.1.4. HYPOTHESIS TESTING

Hypothesis testing lets us check whether an observed difference is real or merely
due to chance. It lets a sample statistic be checked against a population statistic, or
against the statistic of another sample, in order to study the effect of an intervention.
Hypothesis testing is defined in terms of two statements – the Null Hypothesis and the
Alternate Hypothesis. The Null Hypothesis states that the sample statistic is equal to
the population statistic, while the Alternate Hypothesis states that it is not.

Hypothesis testing is done at different levels of confidence and makes use of the
z-score to calculate the probability. So, for a 95% confidence level, anything beyond
the z-threshold for 95% would reject the null hypothesis.

Points to be noted:

 We cannot accept the Null hypothesis, only reject it or fail to reject it.
 As a practical tip, the Null hypothesis is generally the statement that we want to
disprove.
 For example, you want to prove that students performed better on their exam
after taking extra classes. The Null Hypothesis, in this case, would be that the
marks obtained after the classes are the same as before the classes. A small
worked example of this test is sketched below.
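The marks below are hypothetical, and a paired t-test from SciPy is used instead of the z-test described above because the sample is small; the logic of rejecting or failing to reject the null hypothesis is the same:

from scipy import stats

# Hypothetical marks for the same students before and after the extra classes
before = [62, 58, 71, 65, 55, 60, 68, 64]
after = [70, 63, 75, 72, 61, 66, 74, 69]

# Null hypothesis: the mean difference between "after" and "before" is zero
t_stat, p_value = stats.ttest_rel(after, before)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# At a 95% confidence level, reject the null hypothesis if p < 0.05
if p_value < 0.05:
    print("Reject the null hypothesis: the marks changed after the classes")
else:
    print("Fail to reject the null hypothesis")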

5.1.5. REGION OF REJECTION AND NON-REJECTION

Figure 5.7: Region of R and NR

The region of rejection consists of values of the test statistic that are unlikely
to occur if the null hypothesis is true. These values are more likely to occur if the null
hypothesis is false. Therefore, if a value of the test statistic falls into the rejection
region, you reject the null hypothesis, because that value is unlikely if the null
hypothesis is true.

5.2. PYTHON
Python is a high-level, interpreted, interactive and object-oriented scripting
language. Python is designed to be highly readable. It uses English keywords
frequently whereas other languages use punctuation, and it has fewer syntactical
constructions than other languages.



 Python is Interpreted − Python is processed at runtime by the interpreter.
You do not need to compile your program before executing it. This is similar
to PERL and PHP.
 Python is Interactive − You can actually sit at a Python prompt and interact
with the interpreter directly to write your programs.
 Python is Object-Oriented − Python supports Object-Oriented style or
technique of programming that encapsulates code within objects.
 Python is a Beginner's Language − Python is a great language for the
beginner-level programmers and supports the development of a wide range of
applications from simple text processing to WWW browsers to games.

5.2.1. FEATURES OF PYTHON

 Easy-to-learn − Python has few keywords, simple structure, and a clearly


defined syntax. This allows the student to pick up the language quickly.

 Easy-to-read − Python code is more clearly defined and visible to the eyes.

 Easy-to-maintain − Python's source code is fairly easy-to-maintain.

 A broad standard library − The bulk of Python's library is very portable and
cross-platform compatible on UNIX, Windows, and Macintosh.

 Interactive Mode − Python has support for an interactive mode which allows
interactive testing and debugging of snippets of code.

 Portable − Python can run on a wide variety of hardware platforms and has
the same interface on all platforms.

 Extendable − You can add low-level modules to the Python interpreter.


These modules enable programmers to add to or customize their tools to be
more efficient.

 Databases − Python provides interfaces to all major commercial databases.

 GUI Programming − Python supports GUI applications that can be created


and ported to many system calls, libraries and windows systems, such as
Windows MFC, Macintosh, and the X Window system of Unix.



 Scalable − Python provides a better structure and support for large programs
than shell scripting.

Apart from the above-mentioned features, Python has a big list of good features; a few
are listed below −

 It supports functional and structured programming methods as well as OOP.

 It can be used as a scripting language or can be compiled to byte-code for
building large applications.

 It provides very high-level dynamic data types and supports dynamic type
checking.

 It supports automatic garbage collection.

 It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.

5.2.2. PYTHON DATA TYPES

The data stored in memory can be of many types. For example, a person's age
is stored as a numeric value and his or her address is stored as alphanumeric
characters.

Python has various standard data types that are used to define the operations
possible on them and the storage method for each of them.

Python has five standard data types −

1. Numbers
2. String
3. List
4. Tuple
5. Dictionary

5.2.3. PYTHON NUMBERS-

Number data types store numeric values. Number objects are created when
you assign a value to them. For example −

var1 = 1



var2 = 10

You can also delete the reference to a number object by using the del statement. The
syntax of the del statement is –

del var1[,var2[,var3[....,varN]]]

You can delete a single object or multiple objects by using the del statement. For
example −

del var

del var_a, var_b

Python supports four different numerical types −

 int (signed integers)
 long (long integers, which can also be represented in octal and hexadecimal;
Python 2 only – merged into int in Python 3)
 float (floating point real values)
 complex (complex numbers)

5.2.4. PYTHON STRINGS-


Strings in Python are identified as a contiguous set of characters represented in
quotation marks. Python allows either pairs of single or double quotes. Subsets of
strings can be taken using the slice operator ([ ] and [:]) with indexes starting at 0 at
the beginning of the string and -1 at the end.

The plus (+) sign is the string concatenation operator and the asterisk (*) is the
repetition operator, as the short example below shows.
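s = "Hello World"

print(s[0])       # 'H'   (indexing starts at 0)
print(s[2:5])     # 'llo' (slice from index 2 up to, but not including, 5)
print(s[-1])      # 'd'   (negative indexes count from the end)
print(s + "!")    # 'Hello World!'           (+ concatenates strings)
print(s * 2)      # 'Hello WorldHello World' (* repeats the string)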

5.2.5.PYTHON LISTS-

Lists are the most versatile of Python's compound data types. A list contains
items separated by commas and enclosed within square brackets ([]). To some extent,
lists are similar to arrays in C. One difference between them is that the items
belonging to a list can be of different data types. The values stored in a list can be
accessed using the slice operator ([ ] and [:]) with indexes starting at 0 at the
beginning of the list and -1 at the end. The plus (+) sign is the list concatenation
operator, and the asterisk (*) is the repetition operator, as illustrated below.
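numbers = [10, "twenty", 30.0, 40]   # items may be of different data types

print(numbers[0])       # 10  (first element)
print(numbers[1:3])     # ['twenty', 30.0]  (slice)
numbers[1] = 20         # lists are mutable, so items can be changed
numbers.append(50)      # the list can also grow in place
print(numbers + [60])   # + concatenates two lists
print([0] * 3)          # [0, 0, 0]  (* repeats a list)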



5.2.6. PYTHON TUPLES-

A tuple is another sequence data type that is similar to the list. A tuple
consists of a number of values separated by commas. Unlike lists, however, tuples
are enclosed within parentheses.

The main differences between lists and tuples are: lists are enclosed in brackets ([ ])
and their elements and size can be changed, while tuples are enclosed in parentheses
(( )) and cannot be updated. Tuples can be thought of as read-only lists.
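A short example showing that tuples behave like read-only lists:

point = (3, 4, "label")   # a tuple is created with parentheses

print(point[0])           # 3      (indexing works like a list)
print(point[0:2])         # (3, 4) (slicing works too)

try:
    point[0] = 99         # tuples cannot be updated
except TypeError as err:
    print("Error:", err)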

5.2.7.PYTHON DICTIONARY-

Python's dictionaries are a kind of hash table. They work like associative
arrays or hashes found in Perl and consist of key-value pairs. A dictionary key can be
almost any Python type, but keys are usually numbers or strings. Values, on the other
hand, can be any arbitrary Python object.

Dictionaries are enclosed by curly braces ({ }) and values can be assigned and
accessed using square braces ([]).
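A short example of creating and accessing a dictionary (the values are made up):

student = {"name": "Asha", "age": 21, 101: "a numeric key is allowed"}

print(student["name"])                        # access a value by its key
student["dept"] = "ECE"                       # add or update a key-value pair
print(list(student.keys()))                   # all keys
print(student.get("grade", "not recorded"))   # safe lookup with a default value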

5.3. JUPYTER NOTEBOOK


Jupyter Notebook is an open source web interface that lets users combine text,
video, audio and images with executable code from different programming languages.
This execution is accomplished by way of communication with a computational core
(kernel). By default, Jupyter Notebook only includes the Python kernel. However,
being an open source project, it has been possible to increase the number of available
kernels [1]. These include kernels for Octave, Julia, R, Haskell, Ruby, C/C++, Fortran,
Java, SageMath, Scala, MATLAB and Mathematica. Thus, this interface can mark a
milestone in the standardization of ways to deliver scientific content, without being
circumscribed to a single language. In fact, the name Jupyter is a combination of the
most popular open source programming languages for scientific computing: Julia,
Python and R. This versatility has allowed its use to spread into both educational and
research environments. In research, it is becoming more and more common to include
Jupyter Notebooks with calculations, data and additional figures alongside scientific
papers. For instance, the LIGO team, which experimentally detected gravitational
waves, has provided its data and calculations in this format. In the educational area,
the course AeroPython by Lorena Barba from The George Washington University
stands out as a course entirely delivered in Jupyter Notebooks.

5.4. DATA ANALYTICS


Data analytics refers to qualitative and quantitative techniques and processes
used to enhance productivity and business gain. Data is extracted and categorized to
identify and analyse behavioural data and patterns, and techniques vary according to
organizational requirements. Data analytics is also known as data analysis.

Data analytics is primarily conducted in business-to-consumer (B2C)
applications. Global organizations collect and analyse data associated with customers,
business processes, market economics or practical experience.

Evolving data facilitates thorough decision-making. For example, a social
networking website collects data related to user preferences and community interests,
and segments users according to specified criteria such as demographics, age or
gender. Proper analysis reveals key user and customer trends and facilitates the social
network's alignment of content, layout and overall strategy.



Figure 5.8: Data Science Process.

Data analytics is commonly divided into three types:

 Descriptive Analytics – what has happened or is happening?
 Predictive Analytics – what could happen in the future?
 Prescriptive Analytics – what should we do to make that happen or not happen?

Figure 5.9: Data Analytics Divisions.

5.5. MACHINE LEARNING



Machine Learning (ML) can be explained as the “field of study that gives
computers the capability to learn without being explicitly programmed”. That is,
automating and improving the learning process of computers based on their
experiences, without being explicitly programmed, i.e. without any human assistance.
The process starts with feeding in good quality data and then training our machines
(computers) by building machine learning models using the data and different
algorithms. The choice of algorithm depends on what type of data we have and what
kind of task we are trying to automate.

Figure 5.10: Basics of ML

Basic Difference between ML and Traditional Programming

 Traditional Programming: We feed in DATA (input) + PROGRAM (logic),
run it on the machine and get the output.

 Machine Learning: We feed in DATA (input) + OUTPUT, run it on the machine
during training, and the machine creates its own program (logic), which can be
evaluated while testing.

5.5.1. HOW ML WORKS?

 Gathering past data in the form of text files, Excel files, images or audio data.
The better the quality of the data, the better the model learning will be.

 Data Processing – sometimes the data collected is in raw form and needs to be
rectified.
Example: if the data has some missing values, then they have to be rectified. If the
data is in the form of text or images, then converting it to numerical form (a list,
array or matrix) will be required. Simply put, the data has to be made relevant to
and understandable by the machine.

 Building up models with suitable algorithms and techniques and then training
it.

 Testing our prepared model with data that was not fed in at the time of
training, and evaluating the performance (score and accuracy) of the model. A
minimal end-to-end sketch of these steps is shown below.
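The sketch uses scikit-learn's bundled Iris data as an illustrative stand-in rather than the internship dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Gather data (here, the built-in Iris dataset)
X, y = load_iris(return_X_y=True)

# 2. Data processing: hold back data the model will not see during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Build and train a model with a suitable algorithm
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# 4. Test on the unseen data and evaluate the score
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))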

5.5.2. CLASSIFICATION OF MACHINE LEARNING

There are various categories in the field of artificial intelligence. The
classifications of machine learning systems are:

1. SUPERVISED MACHINE LEARNING:

Supervised learning, as the name indicates, involves the presence of a supervisor
acting as a teacher. Basically, supervised learning is learning in which we teach or
train the machine using data that is well labelled, meaning some data is already tagged
with the correct answer. After that, the machine is provided with a new set of examples
(data) so that the supervised learning algorithm analyses the training data (the set of
training examples) and produces a correct outcome from the labelled data.

For instance, suppose you are given a basket filled with different kinds of
fruits. The first step is to train the machine with all the different fruits one by one, like
this:

 If the shape of the object is rounded with a depression at the top and its colour
is red, then it will be labelled as Apple.

 If the shape of the object is a long curving cylinder and its colour is green-
yellow, then it will be labelled as Banana.

Now suppose that, after training, the machine is given a new fruit from the
basket, say a banana, and asked to identify it.



Since the machine has already learnt from the previous data, it now has to use that
knowledge wisely. It will first classify the fruit by its shape and colour, then
confirm the fruit name as BANANA and put it in the Banana category. Thus, the
machine learns from the training data (the basket containing fruits) and then applies
that knowledge to the test data (the new fruit).

Supervised learning is classified into two categories of algorithms:

 Classification: A classification problem is when the output variable is a
category, such as "red" or "blue", or "disease" and "no disease".
 Regression: A regression problem is when the output variable is a real value,
such as "dollars" or "weight".

2. UNSUPERVISED LEARNING:

Unsupervised learning is the training of a machine using information that is
neither classified nor labelled, allowing the algorithm to act on that information
without guidance. Here the task of the machine is to group unsorted information
according to similarities, patterns and differences without any prior training on the data.

Unlike supervised learning, no teacher is provided, which means no training will
be given to the machine. Therefore, the machine is restricted to finding the hidden
structure in unlabelled data by itself.

For instance, suppose the machine is given an image containing both dogs and
cats that it has never seen before. The machine has no idea about the features of dogs
and cats, so it cannot categorize them as "dogs" and "cats". But it can categorize them
according to their similarities, patterns and differences; i.e., the picture can easily be
split into two parts: the first may contain all the pictures having dogs in them, and the
second may contain all the pictures having cats in them. Nothing was learnt beforehand,
which means there is no training data or examples.

Unsupervised learning is classified into two categories of algorithms:

 Clustering: A clustering problem is where you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behaviour
(see the sketch after this list).



 Association: An association rule learning problem is where you want to
discover rules that describe large portions of your data, such as people that buy
X also tend to buy Y.
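The customer-grouping example mentioned above can be sketched with k-means clustering in scikit-learn; the purchase data here is entirely made up for illustration:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [number of purchases, average basket value]
customers = np.array([
    [2, 15], [3, 18], [2, 20],      # occasional low-spend shoppers
    [20, 90], [22, 95], [19, 85],   # frequent high-spend shoppers
])

# Discover the inherent groupings without any labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print("Cluster labels:", kmeans.labels_)
print("Cluster centres:", kmeans.cluster_centers_)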

5.6. PERCEPTRON: THE ARTIFICIAL NEURON


It is very well known that the most fundamental unit of deep neural networks is
called an artificial neuron/perceptron. But the very first step towards
the perceptron we use today was taken in 1943 by McCulloch and Pitts, by mimicking
the functionality of a biological neuron.

BIOLOGICAL NEURONS: AN OVERLY SIMPLIFIED ILLUSTRATION:

Figure 5.11:Neuron

Dendrite: Receives signals from other neurons

Soma: Processes the information

Axon: Transmits the output of this neuron

Synapse: Point of connection to other neurons

Basically, a neuron takes an input signal (dendrite), processes it like the CPU
(soma), and passes the output through a cable-like structure to other connected neurons
(axon to synapse to another neuron's dendrite). Now, this might be biologically
inaccurate, as there is a lot more going on out there, but at a higher level this is what is
going on with a neuron in our brain: it takes an input, processes it, and throws out an
output. Our sense organs interact with the outer world and send the visual and sound
information to the neurons.

The first computational model of a neuron was proposed by Warren McCulloch
(neuroscientist) and Walter Pitts (logician) in 1943.

Figure 5.12: MP Neuron

It may be divided into two parts. The first part, g, takes an input (dendrite) and
performs an aggregation; based on the aggregated value, the second part, f, makes a
decision.

Let's suppose that I want to predict my own decision, whether to watch a
random football game on TV or not. The inputs are all Boolean, i.e. {0, 1}, and the
output variable is also Boolean {1: will watch it, 0: won't watch it}.

 So, x_1 could be isPremierLeagueOn


 x_2 could be isItAFriendlyGame
 x_3 could be isNotHome
 x_4 could be isManUnitedPlaying



These inputs can either be excitatory or inhibitory. Inhibitory inputs are those
that have maximum effect on the decision making irrespective of other inputs i.e.,
if x_3 is 1 (not home) then my output will always be 0 i.e., the neuron will never fire,
so x_3 is an inhibitory input. Excitatory inputs are NOT the ones that will make the
neuron fire on their own, but they might fire it when combined together. Formally, this
is what is going on:

g(x_1, x_2, ..., x_n) = x_1 + x_2 + ... + x_n
y = f(g(x)) = 1 if g(x) >= theta, and 0 otherwise

We can see that g(x) is just doing a sum of the inputs, i.e. a simple aggregation.
And theta here is called the thresholding parameter. For example, if I always watch
the game when the sum turns out to be 2 or more, then theta is 2 here. This is called
Thresholding Logic.
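The thresholding logic can be written in a few lines of Python. This is only a sketch of the football example; the input names follow the ones assumed in the text above:

def mp_neuron(inputs, inhibitory, theta):
    # McCulloch-Pitts neuron: fire (1) when the sum of inputs reaches theta,
    # unless an inhibitory input is active
    if any(inputs[name] for name in inhibitory):
        return 0                   # an active inhibitory input forces the output to 0
    g = sum(inputs.values())       # simple aggregation
    return 1 if g >= theta else 0  # thresholding logic

x = {"isPremierLeagueOn": 1, "isItAFriendlyGame": 0,
     "isNotHome": 0, "isManUnitedPlaying": 1}
print(mp_neuron(x, inhibitory=["isNotHome"], theta=2))   # 1 -> will watch the game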

The perceptron is an essential upgrade to the McCulloch–Pitts neuron. Frank
Rosenblatt, an American psychologist, proposed the classical perceptron model in
1958. The most fundamental unit of a deep neural network is called an artificial
neuron, which takes an input, processes it, passes it through an activation function and
returns the activated output.

The perceptron model, later refined by Minsky and Papert, is a more general
computational model than the McCulloch-Pitts neuron. It overcomes some of the
limitations of the MP neuron by introducing the concept of numerical weights (a
measure of importance) for inputs, and a mechanism for learning those weights. Inputs
are no longer limited to Boolean values as in the case of the M-P neuron; the perceptron
supports real inputs as well, which makes it more useful and generalized.

Now, this is very similar to an M-P neuron, but we take a weighted sum of the
inputs and set the output to one only when the sum is more than an arbitrary threshold
(theta). However, by convention, instead of hand-coding the thresholding parameter
theta, we add it as one of the inputs with the weight -theta, as shown below, which
makes it learnable:

y = 1 if w_1*x_1 + w_2*x_2 + ... + w_n*x_n >= theta, and 0 otherwise
or equivalently, with x_0 = 1 and w_0 = -theta:
y = 1 if w_0*x_0 + w_1*x_1 + ... + w_n*x_n >= 0, and 0 otherwise
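A minimal perceptron sketch with the bias folded in as w_0; the weights below are made up for illustration rather than learned from real viewing history:

import numpy as np

def perceptron(x, w):
    # Fire (1) if the weighted sum, including the bias term, is >= 0
    x = np.insert(x, 0, 1)             # x_0 = 1, so w_0 acts as the bias (-theta)
    return 1 if np.dot(w, x) >= 0 else 0

# Inputs: [isPremierLeagueOn, isManUnitedPlaying, isItAFriendlyGame]
w = np.array([-2.0, 1.5, 1.5, -1.0])   # w_0 = -theta = -2, then one weight per input

print(perceptron(np.array([1, 1, 0]), w))  # 1: premier league, Man United playing
print(perceptron(np.array([0, 0, 1]), w))  # 0: only a friendly game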

Consider the task of predicting whether I would watch a random game of
football on TV or not (the same example as in the M-P neuron discussion above) using
the behavioural data available. Let's assume the decision depends solely on three
binary inputs (binary for simplicity).

Here, w_0 is called the bias because it represents the prior (prejudice). A
football freak may have a very low threshold and may watch any football game
irrespective of the league, club or importance of the game [theta = 0]. On the other
hand, a selective viewer like me may only watch a football game that is a premier
league game, features Man United and is not a friendly [theta = 2]. The point is that
the weights and the bias will depend on the data (my viewing history in this case).
Based on the data, if needed, the model may have to give a lot of importance (a high
weight) to the isManUnitedPlaying input and penalize the weights of the other inputs.

CHAPTER 6

LEARNING OUTCOME OF INTERNSHIP

6.1. INTRODUCTION

In today's era, the impact of social networking media such as
Facebook, Google Plus, YouTube, blogs, and Twitter is increasing rapidly day by
day. Millions of people are connected with each other on social networking sites and
express their sentiments and opinions through tweets and comments. This motivates
the automatic mining and classification of the views, emotions, opinions, and feelings
of people on social networking websites. Sentiment analysis is the process of analysing
data in order to extract sentiments or opinions. It is also known as subjectivity
analysis, opinion mining and sentiment classification. An example of sentiment
analysis is the experience of US airlines: before flying with an airline, a customer
reads reviews about that airline's services. With the help of sentiment analysis,
customers can find the opinions of other people, whether or not they are satisfied with
the quality of the services. Sentiment analysis is a type of natural language processing
task that tracks the views of people about a certain thing or topic and categorizes these
views into two classes, i.e. positive and negative. The positive class reflects a positive
opinion of the author, like "this is a great pleasure." The negative class reflects a
negative opinion of the author, like "this journey was the worst."

Sentiment analysis can be done using either supervised learning techniques or
unsupervised learning techniques. Supervised learning, also known as the machine
learning approach, uses a data set that is divided into a training set and a test set. Naïve
Bayes, Maximum Entropy, and Support Vector Machines are techniques that
come under the domain of supervised techniques. In unsupervised learning, there is
no need to divide the data set into training and test sets. Among these
techniques, the Support Vector Machine has proven to be an excellent approach for
text categorization. The main aim of this work is to apply a Support Vector Machine
to the analysis of tweets about US airlines and categorize the tweets into positive and
negative categories. Feature extraction has also been employed with the support vector
machine to improve the performance of the classifier.

6.2 PROPOSED METHOD



The following steps describe the process of the proposed system:
1. Retrieval of tweets
2. Pre-processing of extracted data
3. Parallel processing
4. Sentiment scoring module
5. Output sentiment
These steps are explained below:

1. Retrieval of tweets: As Twitter is one of the most active social networking
sites, it contains posts related to various topics worldwide. Instead of taking all posts,
we search on a particular topic and download the matching tweets, then extract them in
the form of text files by using a mining tool, i.e. Weka, which provides a sentiment
classifier.

2. Pre-processing of extracted data: After the retrieval of tweets, a sentiment
analysis tool could be applied directly to the raw tweets, but in most cases this results
in very poor performance. Therefore, preprocessing techniques are necessary for
obtaining better results. We extract tweets, i.e. short messages from Twitter, which are
used as raw data. This raw data needs to be preprocessed.

So, preprocessing involves the following steps (a minimal sketch in code follows the list):

A. Filtering

B. Tokenization

C. Removal of stopwords
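The sketch below uses NLTK and a regular expression; the example tweet is invented, and the punkt and stopwords resources must be downloaded once before use:

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads (uncomment on first run):
# import nltk; nltk.download("punkt"); nltk.download("stopwords")

def preprocess(tweet):
    # A. Filtering: drop URLs and @mentions, and strip the '#' from hashtags
    tweet = re.sub(r"http\S+|@\w+", "", tweet.lower()).replace("#", "")
    # B. Tokenization: split the tweet into individual words
    tokens = word_tokenize(tweet)
    # C. Removal of stopwords (and leftover punctuation)
    stops = set(stopwords.words("english"))
    return [t for t in tokens if t.isalpha() and t not in stops]

print(preprocess("@united the flight was delayed AGAIN! #frustrated http://t.co/x"))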

3. Parallel processing: The sentiment classifier is built using a multinomial Naïve
Bayes classifier or Support Vector Machines (SVMs). Training the classifier is the main
objective of this step. Every database has hidden information which can be used for
decision-making. Classification and prediction are two forms of data analysis which can
be used to extract models describing important data classes and future trends.
Classification is the process of finding a set of models or functions that describe and
distinguish data classes or concepts, for the purpose of being able to use the model to
predict the class of objects whose class label is unknown. The derived model is based on
the analysis of a set of training data. Training data consist of data objects whose class
labels are known. The derived model can be represented in various forms, such as
classification (IF-THEN) rules, decision trees, or mathematical formulae.

4. Sentiment scoring module: The prior polarity of words is the basis of a number of
our features. A dictionary is used which assigns a score to every English word, ranging
from negative to positive. This scoring module then determines the sentiment score of
the data.

5. Output sentiment: Based on the dictionary assignment of scores, the proposed
system interprets whether the tweet is positive or negative.

6.3 DATA EXPLORATION

Toolkits used

 Numpy. Adds Python support for large, multi-dimensional arrays and


matrices, along with a large library of high-level mathematical functions to
operate on these arrays.

 Pandas. Software library written for data manipulation and analysis in


Python. Offers data structures and operations for manipulating numerical
tables and time series.

 Scikit-learn is a Python module for machine learning built on top of


SciPy and distributed under the 3-Clause BSD license.

 NLTK is a leading platform for building Python programs to work with
human language data. It provides easy-to-use interfaces to over 50 corpora
and lexical resources such as WordNet, along with a suite of text processing
libraries for classification, tokenization, stemming, tagging, parsing, and
semantic reasoning, wrappers for industrial-strength NLP libraries, and an
active discussion forum.



Reading the csv file by importing it from the specified path

We can observe that the data set contains 14640 rows and 15 columns.
'airline_sentiment' is the column we are going to predict, which takes the values
positive, negative and neutral.

We take only the tweets whose sentiment label we are very confident about. We use the
BeautifulSoup library to process the HTML encoding present in some tweets, left over
from scraping.
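A sketch of this loading and cleaning step; the file name and column names follow the public Kaggle "Twitter US Airline Sentiment" dataset, so adjust the path and the confidence threshold as needed:

import pandas as pd
from bs4 import BeautifulSoup

# Read the csv file from its path (assumed local file name)
df = pd.read_csv("Tweets.csv")
print(df.shape)   # expected to be (14640, 15)

# Keep only the tweets whose sentiment label we are confident about
df = df[df["airline_sentiment_confidence"] > 0.5]

# Undo the HTML encoding left over from scraping
df["text"] = df["text"].apply(lambda t: BeautifulSoup(t, "html.parser").get_text())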



We are going to distinguish two cases: tweets with negative sentiment and
tweets with non-negative sentiment.

We split the data into training and testing sets.

I suspect that tokenization is even more important in sentiment analysis than it is in
other areas of NLP, because sentiment information is often sparsely and unusually
represented.

We are going to use cross validation and grid search to find good hyperparameters for
our SVM model. We need to build a pipeline so that features are not computed from the
validation folds when building each training model.
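Continuing the sketch above, the pipeline keeps the bag-of-words vectorizer inside the cross-validation loop so its vocabulary is never learned from a validation fold; the vectorizer choice and the parameter grid are illustrative:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Binary target: 1 for negative tweets, 0 for non-negative tweets
y = (df["airline_sentiment"] == "negative").astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], y, test_size=0.2, random_state=37)

pipeline = Pipeline([
    ("bow", CountVectorizer()),   # Bag of Words features
    ("svm", LinearSVC()),         # linear Support Vector Machine
])
grid = GridSearchCV(pipeline,
                    param_grid={"svm__C": [0.01, 0.1, 1, 10]},
                    cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_, "CV AUC:", grid.best_score_)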



Finding out the accuracy results for the test data using the SVM model.

A confusion matrix is a table that is often used to describe the performance of a


classification model (or “classifier”) on a set of test data for which the true values are
known. It allows the visualization of the performance of an algorithm.
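Continuing the sketch above, the test accuracy and the confusion matrix can be obtained as follows:

from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = grid.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))

# Rows are the true classes, columns are the predicted classes
print(confusion_matrix(y_test, y_pred))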



An ROC curve demonstrates several things:

1. An increase in sensitivity will be accompanied by a decrease in specificity.


2. The closer the curve follows the left-hand border and then the top border of
the ROC space, the more accurate the test.
3. The closer the curve comes to the 45-degree diagonal of the ROC space, the
less accurate the test.
4. The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for
that value of the test.



This type of graph is called a Receiver Operating Characteristic curve (or ROC
curve.) It is a plot of the true positive rate against the false positive rate for the
different possible cut points of a diagnostic test.
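Continuing the sketch above, the ROC curve and its AUC can be produced from the classifier's decision scores; the AUC of 0.92 quoted in the conclusion is a property of the original model and is not guaranteed by this illustrative sketch:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Decision scores for the positive (negative-sentiment) class
scores = grid.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", auc(fpr, tpr))

plt.plot(fpr, tpr, label="SVM")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance (45-degree diagonal)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()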

CHAPTER 7

CONCLUSION

This internship program helped me to gain important knowledge of Machine
Learning algorithms, Python programming and Statistics. I was able to apply this
knowledge in a mini-project, SENTIMENT ANALYSIS OF TWITTER DATA, using
the SVM algorithm. We faced the problem of predicting tweet sentiment.
We coded the text as a Bag of Words and applied an SVM model. We built a
pipeline to check different hyperparameters using cross-validation. In the end, we
obtained a good model which achieves an AUC of 0.92.

REFERENCES

[1] https://www.kaggle.com/learn/deep-learning

[2] http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29

[3] https://github.com

[4] https://padhai.onefourthlabs.in



[5] https://www.geeksforgeeks.org
