
SUMMER PROJECT REPORT

PROGRAMMING LANGUAGE RECOGNITION USING NLP

Submitted By:

KATABATTULA BHOGESWARA SAI KUMAR


BATHULA DURGA PRASAD
CHENNUPALLI SASIDHAR
BOLLA MAHESH
SHAIK AKHIBASHA
KANCHUMARTHI NAGENDRA KRISHNA
Certificate

This is to certify that K BHOGESWARA SAI KUMAR, B DURGAPRASAD, CH SASIDHAR,
B MAHESH, SK AKHILBASHA and K NAGENDRA KRISHNA, enrolled in the SUMMER
INTERNSHIP PROGRAM 2019 provided by SMARTBRIDGE in collaboration with IBM, have
successfully completed the project entitled “PROGRAMMING LANGUAGE RECOGNITION
USING NLP” during the period from 27th MAY 2019 to 26th JUNE 2019 under the guidance of
______________________________

(Signature)
Acknowledgment

I hereby thank SMARTBRIDGE for giving me the great opportunity to
work on this project. They have been a great source of inspiration, and their
timely support and guidance helped in the successful completion of the
project. I also thank Mr. Anil Reddy, whose creative suggestions and timely
interventions helped us a lot during the process.

I would also like to thank Mr. Charan Reddy for his enthusiastic approach
and dedication, which have been a great source of inspiration and
support to us. He deserves a real acknowledgment at this juncture.
TABLE OF CONTENTS

1. Introduction
2. Objectives
3. Methods and Concepts
4. Results and Discussions
5. Conclusion
6. Future Perspective
7. References
Introduction

Natural language processing (NLP) is a subfield of computer science, information
engineering, and artificial intelligence concerned with the interactions between computers
and human (natural) languages, in particular how to program computers to process and
analyze large amounts of natural language data.

Though natural language processing tasks are closely intertwined, they are frequently
subdivided into categories for convenience. A coarse division is given below.

Grammar induction
Generate a formal grammar that describes a language's syntax.
Lemmatization
Remove inflectional endings only and return the base dictionary form of a
word, which is also known as a lemma.
Morphological segmentation
Separate words into individual morphemes and identify the class of the morphemes. The
difficulty of this task depends greatly on the complexity of the morphology (i.e. the
structure of words) of the language being considered. English has fairly simple
morphology, especially inflectional morphology, and thus it is often possible to ignore
this task entirely and simply model all possible forms of a word (e.g. "open, opens,
opened, opening") as separate words. In languages such as Turkish or Meitei, a highly
agglutinated Indian language, however, such an approach is not possible, as each
dictionary entry has thousands of possible word forms.
Part-of-speech tagging
Given a sentence, determine the part of speech (POS) for each word. Many words,
especially common ones, can serve as multiple parts of speech. For example, "book" can
be a noun ("the book on the table") or verb ("to book a flight"); "set" can be a noun, verb
or adjective; and "out" can be any of at least five different parts of speech. Some
languages have more such ambiguity than others. Languages with little
inflectional morphology, such as English, are particularly prone to such ambiguity.
Chinese is also prone to such ambiguity because it is a tonal language when spoken,
and tone is not readily conveyed by the characters used in its writing system.
Parsing
Determine the parse tree (grammatical analysis) of a given sentence. The grammar for
natural languages is ambiguous and typical sentences have multiple possible analyses. In
fact, perhaps surprisingly, for a typical sentence there may be thousands of potential
parses (most of which will seem completely nonsensical to a human). There are two
primary types of parsing, Dependency Parsing and Constituency Parsing. Dependency
Parsing focuses on the relationships between words in a sentence (marking things like
Primary Objects and predicates), whereas Constituency Parsing focuses on building out
the Parse Tree using a Probabilistic Context-Free Grammar (PCFG). See also: Stochastic
grammar.
Sentence breaking (also known as sentence boundary disambiguation)
Given a chunk of text, find the sentence boundaries. Sentence boundaries are often
marked by periods or other punctuation marks, but these same characters can serve
other purposes (e.g. marking abbreviations).
Stemming
The process of reducing inflected (or sometimes derived) words to their root form (e.g.
"close" will be the root for "closed", "closing", "close", "closer", etc.).
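The stemming task described above can be illustrated with a toy suffix stripper. This is only a sketch for illustration: the function name naive_stem and its suffix list are assumptions made here, and it is far cruder than the stemmers found in NLP libraries.

```python
def naive_stem(word: str) -> str:
    """Strip one common English suffix; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "er", "e", "s"):
        # Keep at least three characters so that short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["closed", "closing", "closer", "close"]
stems = [naive_stem(w) for w in words]
print(stems)
```

All four words collapse to the same stem, which is what allows them to be counted as one feature. The stem produced here is not a dictionary word, which is the practical difference between stemming and lemmatization.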

Since NLP deals with the processing of languages, the aim of our project is to detect the
programming language in which a given piece of code is written.
The input code may be in any programming language, and the language is identified using
the stack_overflow dataset.
Objectives

The main objective is to recognize the programming language to which a given input code
belongs.
The goal of natural language processing (NLP) is to design and build computer systems that are
able to analyze natural languages like German or English, and that generate their outputs in a
natural language, too. Typical applications of NLP are information retrieval, language
understanding, and text classification. The development of statistical approaches for these
applications is one of the research activities at Lehrstuhl für Informatik 6.
Information retrieval (IR) deals with the representation, storage, organization of, and access to
information items. Given a query the goal is to extract a subset of documents from a large data
collection that satisfies a user's information need. Besides written texts the database may also
contain multimedia documents, e.g. audio and video data.
In natural language understanding, the objective is to extract the meaning of an input sentence
or an input text. Usually, the meaning is represented in a suitable formal representation language
so that it can be processed by a computer.
The goal in text classification is to assign a text document to one out of several text classes. For
newspaper articles, such classes are sports reports, finances, and politics.
Information Retrieval
Natural Language Understanding
Spoken Dialogue Systems
Text Classification and Clustering
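To make the text classification objective concrete for source code, here is a minimal keyword-counting sketch. It is not the method used in this project, which trains a model on the stack_overflow dataset; the KEYWORDS table and the classify function are hypothetical and only show what assigning a text to one of several classes means.

```python
# Hypothetical keyword sets; a trained model learns such cues from data instead.
KEYWORDS = {
    "python": {"def", "import", "self", "elif"},
    "java": {"public", "static", "void", "class"},
    "javascript": {"function", "var", "const", "let"},
}


def classify(code: str) -> str:
    """Return the language whose keyword set overlaps most with the input."""
    tokens = set(code.replace("(", " ").replace(")", " ").split())
    scores = {lang: len(tokens & words) for lang, words in KEYWORDS.items()}
    return max(scores, key=scores.get)


print(classify("public static void main(String[] args)"))
print(classify("def add(a, b): return a + b"))
```

A fixed keyword table breaks down quickly on real posts, which is the motivation for learning the classes from labelled examples.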
Methods and Concepts

The methods and concepts used in this project are briefly explained below.
The project uses pandas and Keras, together with the Keras preprocessing modules.
train: this step trains the model on the given dataset; attributes with common features are
grouped together.
NumPy is the fundamental package for scientific computing with Python.
The project starts by importing the required modules from the Anaconda distribution, as
shown below:
import pandas as pd
This command imports the pandas package; the other packages are imported in the same way.
Next, we download the stack_overflow dataset from the provided URL.
We then specify the path of the downloaded dataset and save it in CSV form:
data.to_csv("stack-overflow-data.csv")
This writes the dataset in CSV format. We can look at the head of the data with
df.head(), which displays the first rows of the data in the post and tag format.
df["post"]: the dataset contains posts and tags, and this expression gives the
posts that are found in the dataset.
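The inspection steps above can be sketched as follows. The two rows are a made-up stand-in for the downloaded file, and the column names post and tags are assumed from the description of the data; the real dataset is much larger.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the downloaded CSV file
csv_text = (
    "post,tags\n"
    '"print(len(items))",python\n'
    '"System.out.println(x);",java\n'
)

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())                       # first rows, one post and one tag per row
posts = df["post"]                     # the text of every post
tag_counts = df["tags"].value_counts() # number of posts per language tag
print(tag_counts)
```

Counting the tags is a quick way to check how many examples each language has before training.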
Tkinter: this is Python's de facto standard GUI package. It is a thin object-oriented
layer on top of Tcl/Tk.
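from tkinter import Button, Label

# top (the Tk root window), E1 (the Entry widget), model, tokenize and
# encoder are created earlier in the program.
def predict():
    print("Prediction in progress..")
    entered_input = E1.get()
    print("Entered Input", entered_input)
    # Convert the entered text into a matrix row for the model
    sample = tokenize.texts_to_matrix([entered_input])
    # Index of the predicted class for the single input row
    k = model.predict_classes(sample)
    print(k)
    # Map the class index back to its language label
    l = encoder.classes_[k[0]]
    L2 = Label(top, text="Prediction: " + l)
    L2.pack()

B = Button(top, text="Predict", command=predict)
B.pack(pady=10)

top.mainloop()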
The predict() function above produces the required output, i.e. the programming language of the given code.
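The last step of predict(), turning a class index into a language name, can be sketched without the model. This assumes that encoder behaves like scikit-learn's LabelEncoder, whose classes_ attribute holds the sorted unique labels; the helper names fit_classes and decode and the probability values are hypothetical.

```python
def fit_classes(tags):
    """Sorted unique labels, mirroring how LabelEncoder orders classes_."""
    return sorted(set(tags))


def decode(probabilities, classes):
    """Pick the class with the highest predicted probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best]


classes = fit_classes(["python", "java", "c#", "python"])
# Hypothetical model output: one probability per class, in classes order
probabilities = [0.1, 0.2, 0.7]
print(decode(probabilities, classes))
```

Because the labels are sorted when the encoder is fitted, the same encoder must be used at training and prediction time so that each index refers to the same language.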
Tokenization: this converts paragraphs into sentences and sentences into
words; finally, the words are converted into literals.
sample=tokenize.texts_to_matrix([entered_input]): this builds a
matrix from the converted words.
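The idea behind building a matrix from tokenized text can be shown with a simplified, pure-Python stand-in. It is not the Keras implementation: the function names, the whitespace-only splitting and the sample posts are assumptions made for illustration. Each text becomes one row, with a 1 in the column of every known word.

```python
from collections import Counter


def fit_vocabulary(texts, num_words):
    """Map the most frequent words to columns 1..num_words-1 (column 0 unused)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = [word for word, _ in counts.most_common(num_words - 1)]
    return {word: index for index, word in enumerate(most_common, start=1)}


def texts_to_binary_matrix(texts, vocabulary, num_words):
    """Return one row per text with a 1 in the column of every known word."""
    matrix = [[0] * num_words for _ in texts]
    for row, text in zip(matrix, texts):
        for word in text.lower().split():
            column = vocabulary.get(word)
            if column is not None:
                row[column] = 1
    return matrix


train_posts = ["import numpy as np", "public static void main", "import pandas as pd"]
vocabulary = fit_vocabulary(train_posts, num_words=6)
sample = texts_to_binary_matrix(["import pandas"], vocabulary, num_words=6)
print(sample)
```

Words outside the vocabulary are simply dropped, so the row width stays fixed regardless of the input length, which is what lets a model with a fixed input size consume arbitrary text.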
RESULTS

Given an input, the model should process it and predict the
programming language to which the input belongs.

The predicted output will be as shown above.


CONCLUSION

Natural Language Processing (NLP) is useful for human-machine interaction.
Because a machine can only understand binary language, NLP converts the given input
into binary form and sends it to the system; according to this input, the system generates
the output.
The output is then converted to a user-understandable format by NLP. This process
is very useful and commonly implemented. The outcome of the project is a system that
determines/predicts the programming language to which a given input code belongs.
FUTURE PERSPECTIVE

1. Practical implications
– A limited literature is available on the classification of maintenance optimization
models and on its associated case studies. The paper classifies the literature on
maintenance optimization models on different optimization techniques and based on
emerging trends it outlines the directions for future research in the area of maintenance
optimization.
2. Information extraction purposes
3. Industry monitoring
4. Educational perspective
REFERENCES

1. stack-overflow-data: https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv
2. Hands-on Data Science With Anaconda by James Yan, Dr. Yuxing Yan
3. Data Analysis with NumPy and pandas by Curtis Miller
