

INTRODUCTION
Machine Translation is a great example of how cutting-edge research and world-class infrastructure come together at Google. We focus our research efforts on developing statistical translation techniques that improve with more data and generalize well to new languages. Our large-scale computing infrastructure allows us to rapidly experiment with new models trained on web-scale data to significantly improve translation quality. This cutting-edge research backs the translations served by Google Translate, allowing our users to translate text, web pages and even speech. Deployed within a wide range of Google services like GMail, Books, Android and web search, Google Translate is a high-impact, research-driven product that bridges the language barrier and makes it possible to explore the multilingual web in 63 languages. Exciting research challenges abound as we pursue human-quality translation and develop machine translation systems for new languages. However, adopting an always-on strategy has several disadvantages. In this work we present two extensions to the well-known dynamic programming beam search in phrase-based statistical machine translation (SMT), aiming at increased efficiency of decoding by minimizing the number of language model computations and hypothesis expansions. Our results show that language model based pre-sorting yields a small improvement in translation quality and a speedup by a factor of 2. Two look-ahead methods are shown to further increase translation speed by a factor of 2 without changing the search space and by a factor of 4 with the side effect of some additional search errors. We compare our approach with Moses and observe the same performance, but a substantially better trade-off between translation quality and speed. At a speed of roughly 70 words per second, Moses reaches 17.2% BLEU, whereas our approach yields 20.0% with identical models.
When trained on very large parallel corpora, the phrase table component of a machine translation system grows to consume vast computational resources. In this paper, we introduce a novel pruning criterion that places phrase table pruning on a sound theoretical foundation. Systematic experiments on four language pairs under various data conditions show that our principled approach is superior to existing ad hoc pruning methods. We propose an unsupervised method for clustering the translations of a word, such that the translations in each cluster share a common semantic sense. Words are assigned to clusters based on their usage distribution in large monolingual and parallel corpora using the soft KMeans algorithm. In addition to describing our approach, we formalize the task of translation sense clustering and describe a procedure that leverages WordNet for evaluation. By

comparing our induced clusters to reference clusters generated from WordNet, we demonstrate that our method effectively identifies sense-based translation clusters and benefits from both monolingual and parallel corpora. Finally, we describe a method for annotating clusters with usage examples. Our Contributions We present a simple and effective infrastructure for domain adaptation for statistical machine translation (MT). To build MT systems for different domains, it trains, tunes and deploys a single translation system that is capable of producing adapted domain translations and preserving the original generic accuracy at the same time. The approach unites automatic domain detection and domain model parameterization into one system. Experiment results on 20 language pairs demonstrate its viability. Translating compounds is an important problem in machine translation. Since many compounds have not been observed during training, they pose a challenge for translation systems. Previous decompounding methods have often been restricted to a small set of languages as they cannot deal with more complex compound forming processes. We present a novel and unsupervised method to learn the compound parts and morphological operations needed to split compounds into their compound parts. The method uses a bilingual corpus to learn the morphological operations required to split a compound into its parts. Furthermore, monolingual corpora are used to learn and filter the set of compound part candidates. We evaluate our method within a machine translation task and show significant improvements for various languages to show the versatility of the approach. 1.1 Overview Machine Translation is an important technology for localization, and is particularly relevant in a linguistically diverse country like India. In this document, we provide a brief survey of Machine Translation in India. Human translation in India is a rich and ancient tradition. 
Works of philosophy, arts, mythology, religion, science and folklore have been translated among the ancient and modern Indian languages. Numerous classic works of art, ancient, medieval and modern, have also been translated between European and Indian languages since the 18th century.
Problem Statement
In the current era, human translation finds application mainly in administration, media and education, and to a lesser extent in business, the arts, and science and technology.

India is a linguistically rich area: it has 18 constitutional languages, which are written in 10 different scripts. Hindi is the official language of the Union. English is very widely used in the media, commerce, science and technology, and education. Many of the states have their own regional language, which is either Hindi or one of the other constitutional languages. Only about 5% of the population speaks English. In such a situation, there is a big market for translation between English and the various Indian languages. Currently, this translation is essentially manual. Use of automation is largely restricted to word processing. Two specific examples of high-volume manual translation are the translation of news from English into local languages, and the translation of annual reports of government departments and public sector units among English, Hindi and the local language. As is clear from the above, the market is largest for translation from English into Indian languages, primarily Hindi. Hence, it is no surprise that a majority of the Indian Machine Translation (MT) systems are for English-Hindi translation. As is well known, natural language processing presents many challenges, of which the biggest is the inherent ambiguity of natural language. MT systems have to deal with ambiguity and various other NL phenomena. In addition, the linguistic diversity between the source and target language makes MT a bigger challenge. This is particularly true of widely divergent languages such as English and Indian languages. The major structural differences between English and Indian languages can be summarized as follows. English is a highly positional language with rudimentary morphology and a default sentence structure of SVO. Indian languages are highly inflectional, with a rich morphology, relatively free word order, and a default sentence structure of SOV. In addition, there are many stylistic differences.
For example, it is common to see very long sentences in English, using abstract concepts as the subjects of sentences, and stringing several clauses together (as in this sentence!). Such constructions are not natural in Indian languages, and present major difficulties in producing good translations. As is recognized the world over, with the current state of art in MT, it is not possible to have Fully Automatic, High Quality, and General-Purpose Machine Translation. Practical systems need to handle ambiguity and the other complexities of natural language processing, by relaxing one or more of the above dimensions.

1.2 LITERATURE SURVEY Natural Language Processing (NLP) is an area of research and application that explores how computers can be used to understand and manipulate natural language text or speech to do useful things. NLP researchers aim to gather knowledge on how human beings understand and use language so that appropriate tools and techniques can be developed to make computer systems understand and manipulate natural languages to perform the desired tasks. The foundations of NLP lie in a number of disciplines, viz. computer and information sciences, linguistics, mathematics, electrical and electronic engineering, artificial intelligence and robotics, psychology, etc. Applications of NLP include a number of fields of studies, such as machine translation, natural language text processing and summarization, user interfaces, multilingual and cross language information retrieval (CLIR), speech recognition, artificial intelligence and expert systems, and so on. One important area of application of NLP that is relatively new and has not been covered in the previous ARIST chapters on NLP has become quite prominent due to the proliferation of the world wide web and digital libraries. Several researchers have pointed out the need for appropriate research in facilitating multi- or cross-lingual information retrieval, including multilingual text processing and multilingual user interface systems, in order to exploit the full benefit of the www and digital libraries. Scope Several ARIST chapters have reviewed the field of NLP. The most recent ones include that by Warner in 1987, and Haas in 1996. Reviews of literature on large-scale NLP systems, as well as the various theoretical issues have also appeared in a number of publications (see for example, Jurafsky & Martin, 2000; Manning & Schutze, 1999; Mani & Maybury, 1999; Sparck Jones, 1999; Wilks, 1996). 
Smeaton (1999) provides a good overview of the past research on the applications of NLP in various information retrieval tasks. Several ARIST chapters have appeared on areas related to NLP, such as machine-readable dictionaries (Amsler, 1984; Evans, 1989), speech synthesis and recognition (Lange, 1993), and cross-language information retrieval (Oard & Diekema, 1998). Research on NLP is regularly published in a number of conferences such as the annual proceedings of ACL (Association for Computational Linguistics) and its European counterpart EACL, biennial proceedings of the International Conference on Computational Linguistics (COLING), annual

proceedings of the Message Understanding Conferences (MUCs), Text Retrieval Conferences (TRECs) and ACM-SIGIR (Association of Computing Machinery Special Interest Group on Information Retrieval) conferences. The most prominent journals reporting NLP research are Computational Linguistics and Natural Language Engineering. Articles reporting NLP research also appear in a number of information science journals such as Information Processing and Management, Journal of the American Society for Information Science and Technology, and Journal of Documentation. Several researchers have also conducted domain-specific NLP studies and have reported them in journals specifically dealing with the domain concerned, such as the International Journal of Medical Informatics and Journal of Chemical Information and Computer Science. Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST chapter in 1996 (Haas, 1996), including:
i. natural language text processing systems: text summarization, information extraction, information retrieval, etc., including domain-specific applications;
ii. natural language interfaces;
iii. NLP in the context of www and digital libraries; and
iv. evaluation of NLP systems.
Linguistic research in information retrieval has not been covered in this review, since this is a huge area and has been dealt with separately in this volume by David Blair. Similarly, NLP issues related to the information retrieval tools (search engines, etc.) for web search are not covered in this chapter, since a separate chapter on indexing and retrieval for the Web has been written in this volume by Edie Rasmussen. Tools and techniques developed for building NLP systems have been discussed in this chapter along with the specific areas of application for which they have been built.
Although machine translation (MT) is an important part, and in fact the origin, of NLP research, this paper does not cover the topic in sufficient detail, since it is a huge area and demands a separate chapter of its own. Similarly, cross-language information retrieval (CLIR), although a very important and large area in NLP research, is not covered in great detail in this chapter. A separate chapter on CLIR research appeared in ARIST (Oard & Diekema, 1998). However, MT and CLIR have become two important areas of research in the context of the www and digital libraries. This chapter reviews some works on MT and CLIR in the context of NLP and IR in digital

libraries and www. Artificial Intelligence techniques, including neural networks etc., used in NLP have not been included in this chapter.

Some Theoretical Developments
Previous ARIST chapters (Haas, 1996; Warner, 1987) described a number of theoretical developments that have influenced research in NLP. The most recent theoretical developments can be grouped into four classes: (i) statistical and corpus-based methods in NLP, (ii) recent efforts to use WordNet for NLP research, (iii) the resurgence of interest in finite-state and other computationally lean approaches to NLP, and (iv) the initiation of collaborative projects to create large grammars and NLP tools. Statistical methods are used in NLP for a number of purposes, e.g., for word sense disambiguation, for generating grammars and parsing, for determining stylistic evidence of authors and speakers, and so on. Charniak (1995) points out that 90% accuracy can be obtained in assigning a part-of-speech tag to a word by applying simple statistical measures. Jelinek (1999) is a widely cited source on the use of statistical methods in NLP, especially in speech processing. Rosenfield (2000) reviews statistical language models for speech processing and argues for a Bayesian approach to the integration of linguistic theories of data. Mihalcea & Moldovan (1999) mention that although statistical approaches have thus far been considered the best for word sense disambiguation, they are useful only in a small set of texts.
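One simple statistical measure of the kind referred to above is to give every word the tag it carries most often in a tagged corpus. The following is a minimal sketch of that baseline, not a reconstruction of Charniak's experiment: the toy corpus, the tag names and the function names are all assumptions made for illustration, and a corpus this small says nothing about the 90% figure.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (word, tag) pairs; not taken from any real treebank.
TOY_CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ended", "VERB")],
    [("dogs", "NOUN"), ("run", "VERB")],
    [("they", "PRON"), ("run", "VERB")],
]


def train_most_frequent_tag(tagged_sentences):
    """Return (lexicon, default_tag): each word's most frequent tag,
    plus the overall most frequent tag for unseen words."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag_name in sentence:
            counts[word.lower()][tag_name] += 1
            all_tags[tag_name] += 1
    lexicon = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]
    return lexicon, default_tag


def tag(words, lexicon, default_tag):
    """Tag each word with its most frequent training tag (or the default)."""
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]


lexicon, default_tag = train_most_frequent_tag(TOY_CORPUS)
print(tag(["the", "dogs", "run"], lexicon, default_tag))
```

The word "run" appears once as a noun and twice as a verb in the toy corpus, so the sketch tags it as a verb regardless of context, which is exactly the kind of error that context-sensitive methods try to remove.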

Figure 1 Language Translation - Using NLP

Mihalcea & Moldovan (1999) propose the use of WordNet to improve the results of statistical analyses of natural language texts. WordNet is an online lexical reference system developed at Princeton University. This is an excellent NLP tool containing English nouns, verbs, adjectives and adverbs organized into synonym sets, each representing one underlying lexical concept. Details of WordNet are available in Fellbaum (1998) and on the web.

WordNet is now used in a number of NLP research projects and applications. One of the major applications of WordNet in NLP has been in Europe, with the formation of EuroWordNet in 1996. EuroWordNet is a multilingual database with WordNets for several European languages including Dutch, Italian, Spanish, German, French, Czech and Estonian, structured in the same way as the WordNet for English.

The finite-state automaton is the mathematical device used to implement regular expressions, the standard notation for characterizing text sequences. Variations of automata such as finite-state transducers, Hidden Markov Models, and n-gram grammars are important components of speech recognition and speech synthesis, spell-checking, and information extraction, which are important applications of NLP. Different applications of finite-state methods in NLP have been discussed by Jurafsky & Martin (2000), Kornai (1999) and Roche & Schabes (1997). The work of NLP researchers has been greatly facilitated by the availability of large-scale grammars for parsing and generation. Researchers can get access to large-scale grammars and tools through several websites, for




Computational and



Phonetics project




Another significant development in recent years is the formation of various national and international consortia and research groups that can facilitate, and help share expertise in, NLP research. LDC (Linguistic Data Consortium) at the University of Pennsylvania is a typical example that creates, collects and distributes speech and text databases, lexicons, and other resources for research and development among universities, companies and government research laboratories. The Parallel Grammar project is another example of international cooperation. This is a collaborative effort involving researchers from Xerox PARC in California, the University of Stuttgart and the University of Konstanz in Germany, the University of Bergen in Norway, and Fuji Xerox in Japan. The aim of this project is to produce wide-coverage grammars for English, French, German, Norwegian, Japanese, and Urdu, which are written collaboratively with a commonly agreed-upon set of grammatical features. The recently formed Global WordNet Association is yet another example of cooperation. It is a noncommercial organization that provides a platform for discussing, sharing and connecting WordNets for all languages in the world. The first international WordNet conference to be held in India in early 2002 is expected to address various problems of NLP by researchers from different parts of the world.
2. SYSTEM ANALYSIS
2.1 Existing System
At the core of any NLP task there is the important issue of natural language understanding. The process of building computer programs that understand natural language involves three major problems: the first one relates to the thought process, the second one to the representation and meaning of the linguistic input, and the third one to the world knowledge. Thus, an NLP system may begin at the word level to determine the morphological structure, nature (such as part-of-speech, meaning) etc.
of the word and then may move on to the sentence level to determine the word order, grammar, meaning of the entire sentence, etc. and then to the context and the overall environment or domain. A given word or a sentence may have a specific meaning or connotation in a given context or domain, and may be related to many other words and/or sentences in the given context.

Liddy (2011) and Feldman (2012) suggest that, in order to understand natural languages, it is important to be able to distinguish among the following seven interdependent levels that people use to extract meaning from text or spoken language: (i) the phonetic or phonological level, which deals with pronunciation; (ii) the morphological level, which deals with the smallest parts of words that carry a meaning, and with suffixes and prefixes; (iii) the lexical level, which deals with the lexical meaning of words and part-of-speech analyses; (iv) the syntactic level, which deals with the grammar and structure of sentences; (v) the semantic level, which deals with the meaning of words and sentences; (vi) the discourse level, which deals with the structure of different kinds of text using document structures; and (vii) the pragmatic level, which deals with the knowledge that comes from the outside world, i.e., from outside the contents of the document. A natural language processing system may involve all or some of these levels of analysis.

2.1.1 Disadvantages
When translation is required from one language to another, for example from French to English, there are three basic methods that can be employed: the translation of each phrase on a word-for-word basis, hiring someone who speaks both languages, or using translation software.

Using simple dictionaries for a word by word translation is very time consuming and can often result in errors. Many words have different meanings in various contexts. And if the reader of the translated material finds the wording funny, that can be a bad reflection on your business. Allowing the gist of your material to be lost in translation can therefore mean the loss of clients. Hiring someone who speaks a couple of languages generally leads to much better results. Therefore this option can be fine for small projects with an occasional need for translations. When you need your information translated to several different languages

however, things get more complicated. In that situation you will probably need to find more than one translator. Moreover, the translated output has the following drawbacks:
- No meaningful sentences
- Uses garbage-collected variables for special characters
- Not very reliable in document conversions
- Takes a long time to load on lower bandwidth
- Minimal support in the oldest web browsers
- Requires a third-party scripting language to process
2.2 Proposed System
The proposed system is an apparatus for translating a series of source words in a first language to a series of target words in a second language. For an input series of source words, at least two target hypotheses, each including a series of target words, are generated. Each target word has a context comprising at least one other word in the target hypothesis. For each target hypothesis, a language model match score is computed, including an estimate of the probability of occurrence of the series of words in the target hypothesis. At least one alignment connecting each source word with at least one target word in the target hypothesis is identified. For each source word and each target hypothesis, a word match score is computed, including an estimate of the conditional probability of occurrence of the source word, given the target word in the target hypothesis which is connected to the source word and given the context in the target hypothesis of the target word which is connected to the source word. For each target hypothesis, a translation match score is computed, including a combination of the word match scores for the target hypothesis and the source words in the input series of source words. A target hypothesis match score is then formed, including a combination of the language model match score for the target hypothesis and the translation match score for the target hypothesis. The target hypothesis having the best target hypothesis match score is output.
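The scoring scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the proposed system itself: the bigram table, the word match table, the floor value and all function names are hypothetical, the "context" of a target word is taken to be just the preceding target word, each source word is aligned to exactly one target word, and the scores are combined by adding log probabilities.

```python
import math

# Hypothetical toy probabilities, chosen only for illustration (not trained).
BIGRAM_LM = {
    ("<s>", "the"): 0.5,
    ("the", "house"): 0.4,
    ("the", "home"): 0.1,
}
# P(source word | aligned target word, previous target word as context)
WORD_MATCH = {
    ("la", "the", "<s>"): 0.9,
    ("maison", "house", "the"): 0.8,
    ("maison", "home", "the"): 0.6,
}
FLOOR = 1e-6  # assumed back-off probability for unseen events


def language_model_score(target):
    """Log probability of the target word series under the bigram model."""
    score, prev = 0.0, "<s>"
    for word in target:
        score += math.log(BIGRAM_LM.get((prev, word), FLOOR))
        prev = word
    return score


def translation_match_score(source, target, alignment):
    """Sum of log word match scores; alignment[i] is the index of the
    target word connected to source word i."""
    score = 0.0
    for i, src in enumerate(source):
        j = alignment[i]
        context = target[j - 1] if j > 0 else "<s>"
        score += math.log(WORD_MATCH.get((src, target[j], context), FLOOR))
    return score


def hypothesis_score(source, target, alignment):
    """Combine the language model and translation match scores."""
    return language_model_score(target) + translation_match_score(
        source, target, alignment
    )


def best_hypothesis(source, hypotheses):
    """Return the (target, alignment) pair with the best combined score."""
    return max(hypotheses, key=lambda h: hypothesis_score(source, h[0], h[1]))


source = ["la", "maison"]
hypotheses = [(["the", "home"], [0, 1]), (["the", "house"], [0, 1])]
print(best_hypothesis(source, hypotheses)[0])
```

With these toy numbers both the language model and the word match table prefer "house" after "the", so the second hypothesis is selected. Working in log space avoids numerical underflow when many small probabilities are multiplied.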
The technique of creating interpolated language models for different contexts has been used with success in a number of conversational interfaces [1, 2, 3]. In this case, the pertinent context is the system's dialogue state, and it is typical to group transcribed utterances by dialogue state and build one language model per state. Typically, states with little data are merged, and the state-specific language models are interpolated, or otherwise merged. Language models corresponding to multiple states may also be interpolated, to share information across similar states. The technique we develop here differs in two key respects.
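Before turning to those two differences, the basic idea of interpolating language models per context can be sketched as a weighted sum of component model probabilities. This is a minimal sketch under assumptions: the component models are reduced to hypothetical unigram tables for brevity, and the state names, weights and probabilities are invented for illustration.

```python
# Hypothetical component models: unigram probability tables (illustrative values).
COMPONENT_MODELS = {
    "general": {"call": 0.02, "navigate": 0.001, "home": 0.01},
    "navigation": {"call": 0.001, "navigate": 0.05, "home": 0.03},
}

# Hypothetical interpolation weights per dialogue state; each row sums to 1.
STATE_WEIGHTS = {
    "dialer": {"general": 0.9, "navigation": 0.1},
    "maps": {"general": 0.3, "navigation": 0.7},
}


def interpolated_probability(word, state):
    """P(word | state) = sum_i lambda_i(state) * P_i(word)."""
    weights = STATE_WEIGHTS[state]
    return sum(
        weight * COMPONENT_MODELS[name].get(word, 0.0)
        for name, weight in weights.items()
    )


for state in STATE_WEIGHTS:
    print(state, interpolated_probability("navigate", state))
```

Because the weighted sum is evaluated per query, nothing in this sketch requires a merged model to be built ahead of time for each state; only the small weight table differs between states.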

First, we derive interpolation weights for thousands of recognition contexts, rather than a handful of dialogue states. This makes it impractical to create each interpolated language model offline and swap in the desired one at runtime. Our language models are large, and we only learn the recognition context for a particular utterance when the audio starts to arrive. Second, rather than relying on transcribed utterances from each recognition context to train state-specific language models, we instead interpolate a small number of language models trained from large corpora.
2.2.1 Advantage
1. Understandable. For instance, if we translate an English text into our mother language, which is Malay, it is much more understandable to us.
2. Gain knowledge. Some say, no pain no gain. Translating a literary text is no pain, but we do have to put effort into doing it. Literature is one of the branches of learning a language. Therefore, we can learn more about literary texts such as Shakespeare's poems.
3. Widen vocabulary. Literary texts use Shakespeare's bombastic classic English words, and translating them into other languages might also call for equally bombastic words, hence increasing our vocabulary indirectly.
4. Discipline your mind. Those who are in the literature field can discipline their minds by studying, researching and discovering new words and even cultures that are in the texts that they translate. As a result, we will have our own experts in translating literary texts, so that we do not have to import them.
5. Knowing history. We can learn the history contained in the literary text itself. For example, foreigners can learn more about the history of Malaysia by reading the Hikayat Tun Sri Lanang and so forth and, vice versa, we can learn about other countries' cultures by studying their literary texts. This will also lead to knowledge of cultures, politics and customs.
6.
An efficient packet classification algorithm was introduced by hashing flow IDs held in digest caches for reduced memory requirements at the expense of a small amount of packet misclassification.
2.3 Feature Work
Register, or context of situation, is the set of vocabularies and their meanings and the configuration of semantic patterns, along with the words and structures (such as the double negative among Black Americans) used in the realization of these meanings. It relates variation of

language use to variations of social context. Every context has its distinctive vocabularies. You can see a great difference in vocabularies used by mechanics in a garage and that of doctors. Selection of meanings constitute variety to which a text belongs. Halliday discusses the term Register in detail. This term refers to the relationship between language (and other semiotic forms) and the features of the context. All along, we have been characterizing this relationship (which we can now call register) by using the descriptive categories of Field, Tenor, and Mode. Registers vary. There are clues or indices in the language that help us predict what the register of a given text (spoken or written) is. Halliday uses the example of the phrase "Once upon a time" as an indexical feature that signals to us that we're about to read or hear a folk tale. Halliday also distinguishes between register and another kind of language variety, dialect. For Halliday, dialect variety is a variety according to the user. Dialects can be regional or social. Register is a variety according to use, or the social activity in which you are engaged. Halliday says, "dialects are saying the same thing in different ways, whereas registers are saying different things."

Register variables delineate the relationship between language function and language form. To give a clear understanding of language form and function, consider the words cats and dogs. The final s in both has the same written form. In cats it is pronounced /s/, but in dogs it is pronounced /z/, so they have different spoken forms. It functions the same in both because it turns them into the plural form. Language functions are also of great importance. Some language functions are vocative, aesthetic, phatic, metalingual, informative, descriptive, expressive and social. Among them the last four are the most important here, so let's take a brief look at them. The descriptive function gives actual information. You can test this information, then accept or reject it. (It's 10 outside. If it's winter it can be accepted, but in summer it will be rejected in a normal situation.) The expressive function supplies information about the speaker and his/her feelings. (I don't invite her again. It is implied that the speaker didn't like her in the first meeting.) Newmark believes the core of the expressive function is the mind of the speaker, the writer or the originator of the utterance. He uses the utterance to express his feelings irrespective of any response.

The social function shows a particular relationship between speaker and listener. (Will that be all, sir? The sentence implies the context of a restaurant.) Informative function: Newmark believes the core of the informative function of language is the external situation, the facts of a topic, reality outside language, including reported ideas or theories. The format of an informative text is often standard: a textbook, a technical report, an article in a newspaper or a periodical, a scientific paper, a thesis, or the minutes or agenda of a meeting.
2.4 Feasibility Study
A feasibility study, also known as feasibility analysis, is an analysis of the viability of an idea. It describes a preliminary study undertaken to determine and document a project's viability. The results of this analysis are used in deciding whether to proceed with the project or not. This analytical tool, used during the project planning phase, shows how a business would operate under a set of assumptions, such as the technology used, the facilities and equipment, the capital needs, and other financial aspects. The study is the first point in the project development process that shows whether the project is a technically and economically feasible concept. As the study requires a strong financial and technical background, outside consultants conduct most studies.

A feasible project is one that could generate an adequate amount of cash flow and profits, withstand the risks it will encounter, remain viable in the long term and meet the goals of the business. The venture can be a start-up of a new business, a purchase of an existing business, or an expansion of the current business. Consequently, costs and benefits are estimated with greater accuracy at this stage.
Feasibility Considerations: Three key considerations are involved in the feasibility study.
1. Economic feasibility
2. Technical feasibility
3. Operational feasibility
2.4.1 Economic Feasibility

Economic analysis could also be referred to as cost/benefit analysis. It is the most frequently used method for evaluating the effectiveness of a new system. In economic analysis the procedure is to determine the benefits and savings that are expected from a candidate system and compare them with costs. If benefits outweigh costs, then the decision is made to design and implement the system. An entrepreneur must accurately weigh the cost versus benefits before taking an action. Possible questions raised in economic analysis are:
- Is the system cost effective?
- Do benefits outweigh costs?
- The cost of doing full system study
- The cost of business employee time
- Estimated cost of hardware
- Estimated cost of software/software development
- Is the project possible, given the resource constraints?
- What are the savings that will result from the system?
- Cost of employees' time for study
- Cost of packaged software/software development
- Selection among alternative financing arrangements (rent/lease/purchase)

The concerned business must be able to see the value of the investment it is pondering before committing to an entire system study. If short-term costs are not overshadowed by long-term gains or produce no immediate reduction in operating costs, then the system is not economically feasible, and the project should not proceed any further. If the expected benefits equal or exceed costs, the system can be judged to be economically feasible. Economic analysis is used for evaluating the effectiveness of the proposed system. The economic feasibility will review the expected costs to see if they are in-line with the projected budget or if the project has an acceptable return on investment. At this point, the projected costs will only be a rough estimate. The exact costs are not required to determine economic feasibility. It is only required to determine if it is feasible that the project costs will fall within the target budget or return on investment. A rough estimate of the project schedule is required to determine if it would be feasible to complete the systems project within a required timeframe. The required timeframe would need to be set by the organization.
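The cost/benefit comparison described above amounts to simple arithmetic, which can be sketched as follows. All figures, category names, the budget values and the function names are hypothetical and serve only to illustrate the rule stated in the text: the system is judged economically feasible when expected benefits equal or exceed costs and the costs fall within the target budget.

```python
# Hypothetical rough estimates; the text notes that exact costs are not required.
ESTIMATED_COSTS = {
    "system_study": 5_000,
    "employee_time": 8_000,
    "hardware": 12_000,
    "software_development": 15_000,
}
ESTIMATED_BENEFITS = {
    "operating_cost_savings": 30_000,
    "productivity_gains": 20_000,
}


def net_benefit(costs, benefits):
    """Total expected benefits minus total expected costs."""
    return sum(benefits.values()) - sum(costs.values())


def return_on_investment(costs, benefits):
    """Net benefit expressed as a fraction of total cost."""
    return net_benefit(costs, benefits) / sum(costs.values())


def is_economically_feasible(costs, benefits, budget):
    """Feasible if benefits equal or exceed costs and costs fit the budget."""
    within_budget = sum(costs.values()) <= budget
    return within_budget and net_benefit(costs, benefits) >= 0


print(net_benefit(ESTIMATED_COSTS, ESTIMATED_BENEFITS))
print(return_on_investment(ESTIMATED_COSTS, ESTIMATED_BENEFITS))
print(is_economically_feasible(ESTIMATED_COSTS, ESTIMATED_BENEFITS, budget=45_000))
```

The sketch ignores the timing of costs and gains; an analysis that distinguishes short-term costs from long-term gains would need to discount future amounts, which is outside what the text specifies.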

2.4.2 Technical Feasibility
A large part of determining resources has to do with assessing technical feasibility. It considers the technical requirements of the proposed project. The technical requirements are then compared to the technical capability of the organization. The systems project is considered technically feasible if the internal technical capability is sufficient to support the project requirements. The analyst must find out whether current technical resources can be upgraded or added to in a manner that fulfils the request under consideration. This is where the expertise of system analysts is beneficial, since using their own experience and their contact with vendors they will be able to answer the question of technical feasibility. The essential questions that help in testing the technical feasibility of a system include the following:
- Is the project feasible within the limits of current technology?
- Does the technology exist at all?
- Is it available within given resource constraints?
- Is it a practical proposition?
- Manpower: programmers, testers and debuggers
- Software and hardware
- Are the current technical resources sufficient for the new system?
- Can they be upgraded to provide the level of technology necessary for the new system?
- Do we possess the necessary technical expertise, and is the schedule reasonable?
- Can the technology be easily applied to current problems?
- Does the technology have the capacity to handle the solution?
- Do we currently possess the necessary technology?

2.4.3 Operational Feasibility

Operational feasibility depends on the human resources available for the project and involves projecting whether the system will be used if it is developed and implemented. It is a measure of how well a proposed system solves the problems and takes advantage of the opportunities identified during scope definition, and of how it satisfies the requirements identified in the requirements analysis phase of system development.

Operational feasibility reviews the willingness of the organization to support the proposed system. This is probably the most difficult of the feasibilities to gauge. In order to determine this feasibility, it is important to understand the management commitment to the proposed project. If the request was initiated by management, it is likely that there is management support and the system will be accepted and used. However, it is also important that the employee base accepts the change.

The essential questions that help in testing the operational feasibility of a system include the following:
* Does the current mode of operation provide adequate throughput and response time?
* Does the current mode provide end users and managers with timely, pertinent, accurate and usefully formatted information?
* Does the current mode of operation provide cost-effective information services to the business? Could there be a reduction in cost and/or an increase in benefits?
* Does the current mode of operation offer effective controls to protect against fraud and to guarantee accuracy and security of data and information?
* Does the current mode of operation make maximum use of available resources, including people, time, and flow of forms?
* Does the current mode of operation provide reliable services?
* Are the services flexible and expandable?
* Are the current work practices and procedures adequate to support the new system?
* If the system is developed, will it be used?
* Manpower problems; labour objections; manager resistance
* Organizational conflicts and policies
* Social acceptability; government regulations
* Does management support the project?
* Are the users unhappy with current business practices?
* Will it reduce the operation time considerably?
* Have the users been involved in the planning and development of the project?
* Will the proposed system really benefit the organization?
* Does the overall response increase?
* Will accessibility of information be lost?
* Will the system affect the customers in a considerable way?
* How do the end-users feel about their role in the new system?
* What end-users or managers may resist or not use the system?
* How will the working environment of the end-user change?
* Can or will end-users and management adapt to the change?

3 INTRODUCTION

3.1 Hardware Requirements

Processor - Intel Pentium dual core
RAM - 1 GB
Hard disk - 80 GB
Monitor - 17 inches
Keyboard - Logitech
Mouse - Optical mouse (Logitech)

3.2 Software Requirements

Front end - Java
Back end - MS SQL Server
Operating system - Windows 7
Tools used - NetBeans

4 SOFTWARE DESCRIPTION

4.1 FRONT END

4.1.1 Java introduction

Java is an object-oriented programming language developed by Sun Microsystems, and it is also a powerful internet programming language. Java is a high-level programming language which has the following features:
1. Object oriented
2. Portable
3. Architecture-neutral
4. High-performance
5. Multithreaded
6. Robust
7. Secure

Java is an efficient application programming language. It has APIs to support GUI-based application development. These features of Java make it well suited to implementing this project. Initially the language was called Oak, but it was renamed Java in 1995. The primary motivation for this language was the need for a platform-independent language that could be used to create software to be embedded in various consumer electronic devices. Java is a programmer's language. Java is cohesive and consistent. Except for those constraints imposed by the internet environment, Java gives the programmer full control. The excitement of the Internet attracted software vendors, so Java development tools from many vendors quickly became available. That same excitement has provided the impetus for a multitude of software developers to discover Java and its many features.

With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, you first translate a program into an intermediate language called Java byte codes: the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed.

You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it's a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make "write once, run anywhere" possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or an iMac.
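As a small illustration of the compile-once, run-anywhere idea, the sketch below asks the running Java VM to describe itself. The class name PlatformInfo is an assumption made for this example; the system property names are standard ones. The same compiled class file should run unchanged on different operating systems and report a different host each time.

```java
// Compile once:  javac PlatformInfo.java   (produces PlatformInfo.class, i.e. byte codes)
// Run anywhere:  java PlatformInfo         (the local Java VM executes the byte codes)
class PlatformInfo {

    /** Describes the Java VM and the host platform it is running on. */
    static String describe() {
        return System.getProperty("java.vm.name")
                + " on " + System.getProperty("os.name")
                + " (" + System.getProperty("os.arch") + ")";
    }

    public static void main(String[] args) {
        // The output differs per machine, but the byte codes are identical.
        System.out.println(describe());
    }
}
```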

The Java Platform

A platform is the hardware or software environment in which a program runs. We've already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it's a software-only platform that runs on top of other hardware-based platforms. The Java platform has two components:
* The Java Virtual Machine (Java VM)
* The Java Application Programming Interface (Java API)

You've already been introduced to the Java VM. It's the base for the Java platform and is ported onto various hardware-based platforms. The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights the functionality that some of the packages in the Java API provide. The following figure depicts a program that's running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.

Native code is code that, once compiled, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.

What Can Java Technology Do?

The most common types of programs written in the Java programming language are applets and applications. If you've surfed the Web, you're probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser. However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs.

An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet. A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.

How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features:

The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.

Applets: The set of conventions used by applets.

Networking: URLs, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language.

Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates.

Software components: Known as JavaBeans, can plug into existing component architectures.

Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI).

Java Database Connectivity (JDBC): Provides uniform access to a wide range of relational databases.
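To make the last item more concrete, here is a minimal sketch of a query issued through JDBC. It uses only standard java.sql calls, but the table name, the column name, and the connection URL are hypothetical placeholders, and no particular driver is assumed to be installed. Without a matching driver on the classpath, the program simply reports that the database is unavailable.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class JdbcSketch {
    // Hypothetical table and column names, used only for illustration.
    static final String QUERY = "SELECT name FROM users WHERE id >= ?";

    /** Runs a simple parameterized SELECT on an already open connection. */
    static List<String> namesFrom(Connection conn, int minId) throws SQLException {
        List<String> names = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(QUERY)) {
            ps.setInt(1, minId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString("name"));
                }
            }
        }
        return names;
    }

    /** Opens a connection for the given JDBC URL and reports the result as text. */
    static String tryQuery(String url) {
        try (Connection conn = DriverManager.getConnection(url)) {
            return "rows: " + namesFrom(conn, 1);
        } catch (SQLException e) {
            // Also reached when no installed driver accepts the URL.
            return "database unavailable: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Placeholder URL; a real one depends on the database vendor's driver.
        String url = args.length > 0 ? args[0] : "jdbc:example:placeholder";
        System.out.println(tryQuery(url));
    }
}
```

The application code stays the same whichever database is used; only the URL and the driver on the classpath would change.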

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.

How Will Java Technology Change My Life?

We can't promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better and to require less effort than other languages. We believe that Java technology will help you do the following:

Get started quickly: Although the Java programming language is a powerful object-oriented language, it's easy to learn, especially for programmers already familiar with C or C++.

Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++.

Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people's tested code and introduce fewer bugs.

Develop programs more quickly: Your development time may be as much as twice as fast versus writing the same program in C++. Why? You write fewer lines of code and it is a simpler programming language than C++.

Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online.

Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform.

Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the feature of allowing new classes to be loaded on the fly, without recompiling the entire program.
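The ability to load new classes on the fly, mentioned in the last item, can be sketched with the standard reflection API. The class and method names below are assumptions made for this example; the class to load is chosen by name at run time rather than being fixed when the program is compiled.

```java
import java.util.List;

class DynamicLoading {

    /**
     * Loads a List implementation by its fully qualified class name and
     * creates an instance through its public no-argument constructor.
     */
    @SuppressWarnings("unchecked")
    static List<String> newListByName(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return (List<String>) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        // The class name could equally come from a configuration file or a server.
        List<String> list = newListByName("java.util.ArrayList");
        list.add("loaded at run time");
        System.out.println(list.getClass().getName() + " -> " + list);
    }
}
```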

ODBC

Microsoft Open Database Connectivity (ODBC) is a standard programming interface for application developers and database systems providers. Before ODBC became a de facto standard for Windows programs to interface with database systems, programmers had to use proprietary languages for each database they wanted to connect to. Now, ODBC has made the choice of the database system almost irrelevant from a coding perspective, which is as it should be. Application developers have much more important things to worry about than the syntax that is needed to port their program from one database to another when business needs suddenly change.

Through the ODBC Administrator in Control Panel, you can specify the particular database that is associated with a data source that an ODBC application program is written to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to a particular database. For example, the data source named Sales Figures might be a SQL Server database, whereas the Accounts Payable data source could refer to an Access database. The physical database referred to by a data source can reside anywhere on the LAN.

The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you set up a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program, and each maintains a separate list of ODBC data sources.

From a programming perspective, the beauty of ODBC is that the application can be written to use the same set of function calls to interface with any data source, regardless of the database vendor. The source code of the application doesn't change whether it talks to Oracle or SQL Server. We only mention these two as an example. There are ODBC drivers available for several dozen popular database systems. Even Excel spreadsheets and plain text files can be turned into data sources. The operating system uses the Registry information written by ODBC Administrator to determine which low-level ODBC drivers are needed to talk to the data source (such as the interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the ODBC application program. In a client/server environment, the ODBC API even handles many of the network issues for the application programmer.

The advantages of this scheme are so numerous that you are probably thinking there must be some catch. The only disadvantage of ODBC is that it isn't as efficient as talking directly to the native database interface. Many detractors have charged that ODBC is too slow. Microsoft has always claimed that the critical factor in performance is the quality of the driver software that is used. In our humble opinion, this is true. The availability of good ODBC drivers has improved a great deal recently. And anyway, the criticism about performance is somewhat analogous to those who said that compilers would never match the speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers get faster every year.
JDBC

In an effort to set an independent database standard API for Java, Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of plug-in database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, the vendor must provide the driver for each platform that the database and Java run on. To gain wider acceptance of JDBC, Sun based JDBC's framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC allows vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution.

JDBC was announced in March of 1996. It was released for a 90-day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after. The remainder of this section will cover enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC. That would fill an entire book.

JDBC Goals

Few software packages are designed without goals in mind. JDBC is one whose many goals drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java. The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The design goals for JDBC are as follows:

1. SQL Level API
The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows future tool vendors to generate JDBC code and to hide many of JDBC's complexities from the end user.

2. SQL Conformance
SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC will allow any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.

3. JDBC must be implementable on top of common database interfaces

The JDBC SQL API must sit on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.

4. Provide a Java interface that is consistent with the rest of the Java system
Because of Java's acceptance in the user community thus far, the designers feel that they should not stray from the current design of the core Java system.

5. Keep it simple
This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.

6. Use strong, static typing wherever possible
Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.

7. Keep the common cases simple
Because, more often than not, the usual SQL calls used by the programmer are simple SELECTs, INSERTs, DELETEs and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible.

Finally, we decided to proceed with the implementation using Java Networking, and to use an MS Access database for dynamically updating the cache table.

Java is two things: a programming language and a platform. You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it's a Java development tool or a Web browser that can run Java applets, is an implementation of the Java VM. The Java VM can also be implemented in hardware. Java byte codes help make "write once, run anywhere" possible. You can compile your Java program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. For example, the same Java program can run on Windows NT, Solaris, and Macintosh.

Networking

TCP/IP stack

The TCP/IP stack is shorter than the OSI one:

TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams

The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that covers its own header. The header includes the source and destination addresses. The IP layer handles routing through an internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.

UDP

UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram, and port numbers. These are used to give a client/server model (see later).
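The connectionless nature of UDP can be sketched with the standard java.net datagram classes. This is only an illustration: both ends run in one process over the loopback interface, and the class and method names are assumptions made for this example. Note that no connection is set up; the sender simply addresses a datagram to a host and port number.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

class UdpLoopback {

    /** Sends one datagram to a receiver on the loopback interface and returns what arrived. */
    static String sendAndReceive(String message) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the system to pick any free port for the receiver.
        try (DatagramSocket receiver = new DatagramSocket(0, loopback);
             DatagramSocket sender = new DatagramSocket()) {
            // UDP gives no delivery guarantee, so do not wait forever.
            receiver.setSoTimeout(2000);

            byte[] payload = message.getBytes(StandardCharsets.UTF_8);
            sender.send(new DatagramPacket(payload, payload.length,
                    loopback, receiver.getLocalPort()));

            byte[] buffer = new byte[1024];
            DatagramPacket received = new DatagramPacket(buffer, buffer.length);
            receiver.receive(received);
            return new String(received.getData(), received.getOffset(),
                    received.getLength(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("received: " + sendAndReceive("hello over UDP"));
    }
}
```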

TCP

TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.

Internet addresses

In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32-bit integer which gives the IP address. This encodes a network ID and more addressing. The network ID falls into various classes according to the size of the network address.

Network address

Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16-bit network addressing. Class C uses 24-bit network addressing. Class D addresses are reserved for multicast groups rather than for networks of hosts.

Subnet address

Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.

Host address

8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.

Total address

The 32-bit address is usually written as 4 integers separated by dots.
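The dotted notation can be sketched as simple bit arithmetic on the 32-bit value. The class and method names are assumptions made for this example, and the address class is derived from the first octet using the classful ranges (class E, the reserved range above class D, is included for completeness although the text does not discuss it).

```java
class DottedQuad {

    /** Formats a 32-bit IPv4 address as four decimal integers separated by dots. */
    static String format(int address) {
        return ((address >>> 24) & 0xFF) + "."
                + ((address >>> 16) & 0xFF) + "."
                + ((address >>> 8) & 0xFF) + "."
                + (address & 0xFF);
    }

    /** Returns the classful address class, decided by the first octet. */
    static char addressClass(int address) {
        int firstOctet = (address >>> 24) & 0xFF;
        if (firstOctet < 128) return 'A';   // 8-bit network address
        if (firstOctet < 192) return 'B';   // 16-bit network address
        if (firstOctet < 224) return 'C';   // 24-bit network address
        if (firstOctet < 240) return 'D';   // multicast
        return 'E';                         // reserved
    }

    public static void main(String[] args) {
        int address = 0xC0A80001;
        System.out.println(format(address) + " is class " + addressClass(address));
        // prints "192.168.0.1 is class C"
    }
}
```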

Port addresses

A service exists on a host, and is identified by its port. This is a 16-bit number. To send a message to a server, you send it to the port for that service on the host that it is running on. This is not location transparency! Certain of these ports are "well known".

Sockets

A socket is a data structure maintained by the system to handle network connections. A socket is created using the call socket. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.

    #include <sys/types.h>
    #include <sys/socket.h>
    int socket(int family, int type, int protocol);

Here "family" will be AF_INET for IP communications, "protocol" will be zero, and "type" will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe, but the actual pipe does not yet exist.

JFreeChart

JFreeChart is a free 100% Java chart library that makes it easy for developers to display professional quality charts in their applications. JFreeChart's extensive feature set includes:
* A consistent and well-documented API, supporting a wide range of chart types;
* A flexible design that is easy to extend, and targets both server-side and client-side applications;
* Support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG).

JFreeChart is "open source" or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public License (LGPL), which permits use in proprietary applications.

1. Map Visualizations

Charts showing values that relate to geographical areas. Some examples include: (a) population density in each state of the United States, (b) income per capita for each country in Europe, (c) life expectancy in each country of the world.

The tasks in this project include:
* Sourcing freely redistributable vector outlines for the countries of the world and for states/provinces in particular countries (USA in particular, but also other areas);
* Creating an appropriate dataset interface (plus default implementation) and a renderer, and integrating these with the existing XYPlot class in JFreeChart;
* Testing, documenting, testing some more, documenting some more.

2. Time Series Chart Interactivity

Implement a new (to JFreeChart) feature for interactive time series charts: display a separate control that shows a small version of ALL the time series data, with a sliding "view" rectangle that allows you to select the subset of the time series data to display in the main chart.

3. Dashboards

There is currently a lot of interest in dashboard displays. Create a flexible dashboard mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars, and lines/time series) that can be delivered easily via both Java Web Start and an applet.

4. Property Editors

The property editor mechanism in JFreeChart only handles a small subset of the properties that can be set for charts. Extend (or re-implement) this mechanism to provide greater end-user control over the appearance of the charts.

J2ME (Java 2 Micro Edition)

Sun Microsystems defines J2ME as "a highly optimized Java run-time environment targeting a wide range of consumer products, including pagers, cellular phones, screen-phones, digital set-top boxes and car navigation systems." Announced in June 1999 at the JavaOne Developer Conference, J2ME brings the cross-platform functionality of the Java language to smaller devices, allowing mobile wireless devices to share applications. With J2ME, Sun has adapted the Java platform for consumer products that incorporate or are based on small computing devices.

1. General J2ME architecture

J2ME uses configurations and profiles to customize the Java Runtime Environment (JRE). As a complete JRE, J2ME comprises a configuration, which determines the JVM used, and a profile, which defines the application by adding domain-specific classes. The configuration defines the basic run-time environment as a set of core classes and a specific JVM that run on specific types of devices. We'll discuss configurations in detail later. The profile defines the application; specifically, it adds domain-specific classes to the J2ME configuration to define certain uses for devices. The following graphic depicts the relationship between the different virtual machines, configurations, and profiles.

It also draws a parallel with the J2SE API and its Java virtual machine. While the J2SE virtual machine is generally referred to as a JVM, the J2ME virtual machines, KVM and CVM, are subsets of JVM. Both KVM and CVM can be thought of as a kind of Java virtual machine; it's just that they are shrunken versions of the J2SE JVM and are specific to J2ME.

2. Developing J2ME applications

Introduction

In this section, we will go over some considerations you need to keep in mind when developing applications for smaller devices. We'll take a look at the way the compiler is invoked when using J2SE to compile J2ME applications. Finally, we'll explore packaging and deployment and the role pre-verification plays in this process.

3. Design considerations for small devices

Developing applications for small devices requires you to keep certain strategies in mind during the design phase. It is best to strategically design an application for a small device before you begin coding. Correcting the code because you failed to consider all of the "gotchas" before developing the application can be a painful process. Here are some design strategies to consider:
* Keep it simple. Remove unnecessary features, possibly making those features a separate, secondary application.
* Smaller is better. This consideration should be a "no brainer" for all developers. Smaller applications use less memory on the device and require shorter installation times. Consider packaging your Java applications as compressed Java Archive (jar) files.
* Minimize run-time memory use. To minimize the amount of memory used at run time, use scalar types in place of object types. Also, do not depend on the garbage collector. You should manage the memory efficiently yourself by setting object references to null when you are finished with them. Another way to reduce run-time memory is to use lazy instantiation, only allocating objects on an as-needed basis. Other ways of reducing overall and peak memory use on small devices are to release resources quickly, reuse objects, and avoid exceptions.

4. Configurations overview

The configuration defines the basic run-time environment as a set of core classes and a specific JVM that run on specific types of devices. Currently, two configurations exist for J2ME, though others may be defined in the future:
* Connected Limited Device Configuration (CLDC) is used specifically with the KVM for 16-bit or 32-bit devices with limited amounts of memory.
This is the configuration (and the virtual machine) used for developing small J2ME applications. Its size limitations make CLDC more interesting and challenging (from a development point of view) than CDC.

CLDC is also the configuration that we will use for developing our drawing tool application. An example of a small wireless device running small applications is a Palm hand-held computer.
* Connected Device Configuration (CDC) is used with the C virtual machine (CVM) and is used for 32-bit architectures requiring more than 2 MB of memory. An example of such a device is a Net TV box.

J2ME profiles

What is a J2ME profile?

As we mentioned earlier in this tutorial, a profile defines the type of device supported. The Mobile Information Device Profile (MIDP), for example, defines classes for cellular phones. It adds domain-specific classes to the J2ME configuration to define uses for similar devices. Two profiles have been defined for J2ME and are built upon CLDC: KJava and MIDP. Both KJava and MIDP are associated with CLDC and smaller devices. Profiles are built on top of configurations. Because profiles are specific to the size of the device (amount of memory) on which an application runs, certain profiles are associated with certain configurations. A skeleton profile upon which you can create your own profile, the Foundation Profile, is available for CDC.

Profile 1: KJava

KJava is Sun's proprietary profile and contains the KJava API. The KJava profile is built on top of the CLDC configuration. The KJava virtual machine, KVM, accepts the same byte codes and class file format as the classic J2SE virtual machine. KJava contains a Sun-specific API that runs on the Palm OS. The KJava API has a great deal in common with the J2SE Abstract Windowing Toolkit (AWT). However, because it is not a standard J2ME package, its main package is com.sun.kjava. We'll learn more about the KJava API later in this tutorial when we develop some sample applications.

Profile 2: MIDP

MIDP is geared toward mobile devices such as cellular phones and pagers. The MIDP, like KJava, is built upon CLDC and provides a standard run-time environment that allows new applications and services to be deployed dynamically on end user devices. MIDP is a common, industry-standard profile for mobile devices that is not dependent on a specific vendor. It is a complete and supported foundation for mobile application development. MIDP contains core CLDC packages plus MIDP-specific packages, including the following:
* java.lang
* java.util
* javax.microedition.lcdui
* javax.microedition.midlet
* javax.microedition.rms
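The memory strategies listed under the design considerations for small devices (scalar types, lazy instantiation, and releasing references early) can be sketched in plain Java. The class below is a hypothetical illustration only; it uses no J2ME-specific API, and the name LazyBuffer and the idea of a pixel buffer are assumptions made for this example.

```java
class LazyBuffer {
    private final int size;
    private int[] pixels;   // scalar (int) storage, not allocated until first use

    LazyBuffer(int size) {
        this.size = size;
    }

    /** Reports whether the backing array currently occupies memory. */
    boolean isAllocated() {
        return pixels != null;
    }

    /** Lazy instantiation: the array is created only when first requested. */
    int[] pixels() {
        if (pixels == null) {
            pixels = new int[size];
        }
        return pixels;
    }

    /** Drops the reference so the memory can be reclaimed as soon as possible. */
    void release() {
        pixels = null;
    }

    public static void main(String[] args) {
        LazyBuffer buffer = new LazyBuffer(64);
        System.out.println("allocated before use: " + buffer.isAllocated());
        buffer.pixels()[0] = 1;
        System.out.println("allocated after use: " + buffer.isAllocated());
        buffer.release();
        System.out.println("allocated after release: " + buffer.isAllocated());
    }
}
```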

5 PROJECT DESCRIPTION

5.1 System Architecture

Information retrieval has been a major area of application of NLP, and consequently a number of research projects dealing with the various applications of NLP in IR have taken place throughout the world, resulting in a large volume of publications. Lewis and Sparck Jones (2013) comment that the generic challenge for NLP in the field of IR is whether the necessary NLP of texts and queries is doable, and that the specific challenges are whether non-statistical and statistical data can be combined and whether data about individual documents and whole files can be combined. They further comment that there are major challenges in making the NLP technology operate effectively and efficiently, and also in conducting appropriate evaluation tests to assess whether and how far the approach works in an environment of interactive searching of large text files. Feldman (2013) suggests that in order to achieve success in IR, NLP techniques should be applied in conjunction with other technologies, such as visualization, intelligent agents and speech recognition.

Arguing that syntactic phrases are more meaningful than statistically obtained word pairs, and thus are more powerful for discriminating among documents, Narita and Ogawa (2012) use shallow syntactic processing instead of statistical processing to automatically identify candidate phrasal terms from query texts. Comparing the performance of Boolean and natural language searches, Paris and Tibbo (2013) found that in their experiment, Boolean searches had better results than freestyle (natural language) searches. However, they concluded that neither could be considered the best for every query. In other words, their conclusion was that different queries demand different techniques.

Variations in presenting subject matter greatly affect IR, and hence linguistic variation of document texts is one of the greatest challenges to IR. In order to investigate how consistently newspapers choose words and concepts to describe an event, Lehtokangas & Jarvelin (2011) chose articles on the same news from three Finnish newspapers. Their experiment revealed that for short newswire the consistency was 83% and for long articles 47%. It was also revealed that the newspapers were very consistent in using concepts to represent events, with a level of consistency varying between 92% and 97%.

Natural Language Interfaces

A natural language interface is one that accepts query statements or commands in natural language and sends data to some system, typically a retrieval system, which then produces appropriate responses to the commands or query statements. A natural language interface should be able to translate the natural language statements into appropriate actions for the system. A large number of natural language interfaces that work reasonably well in narrow domains have been reported in the literature. Much of the effort in natural language interface design to date has focused on handling rather simple natural language queries. A number of question answering systems are now being developed that aim to provide answers to natural language questions, as opposed to documents containing information related to the question. Such systems often use a variety of IE and IR operations using NLP tools and techniques to get the correct answer from the source texts. Breck et al. (2012) report a question answering system that uses techniques from knowledge representation, information retrieval, and NLP. The authors claim that this combination enables domain independence and robustness in the face of text variability, both in the question and in the raw text documents used as knowledge sources.

Research reported in the Question Answering (QA) track of TREC (Text Retrieval Conferences) shows some interesting results. The basic technology used by the participants in the QA track included several steps. First, cue words or phrases like "who" (as in "Who is the prime minister of Japan") and "when" (as in "When did the Jurassic period end") were identified to guess what was needed; then a small portion of the document collection was retrieved using standard text retrieval technology. This was followed by a shallow parsing of the returned documents to identify the entities required for an answer. If no appropriate answer type was found, then the best matching passage was retrieved. This approach works well as long as the query types recognized by the system have broad coverage, and the system can classify questions reasonably accurately. In TREC-8, the first QA track of TREC, the most accurate QA systems could answer more than 2/3 of the questions correctly. In the second QA track (TREC-9), the best performing QA system, the Falcon system from Southern Methodist University, was able to answer 65% of the questions (Voorhees, 2000). These results are quite impressive in a domain-independent question answering environment. However, the questions were still simple in the first two QA tracks. In the future, more complex questions requiring answers to be obtained from more than one document will be handled by QA track researchers.
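The first step of the QA pipeline described above, guessing the expected answer type from a cue word, can be illustrated with a minimal sketch. The cue-word table, the answer type labels and the function name are assumptions made for illustration; they are not taken from any TREC system.

```python
import re

# Hypothetical cue-word table: maps a question word to the kind of entity
# the answer is expected to be. The labels are illustrative only.
CUE_TO_ANSWER_TYPE = {
    "who": "PERSON",
    "when": "DATE",
    "where": "LOCATION",
}


def expected_answer_type(question):
    """Return the answer type suggested by the first cue word, or None."""
    words = re.findall(r"[a-z]+", question.lower())
    for word in words:
        if word in CUE_TO_ANSWER_TYPE:
            return CUE_TO_ANSWER_TYPE[word]
    # No cue word found: a system would fall back to returning the
    # best matching passage instead of a typed entity.
    return None


if __name__ == "__main__":
    print(expected_answer_type("Who is the prime minister of Japan"))
    print(expected_answer_type("When did the Jurassic period end"))
```

A real system would use the returned type to filter the entities found by shallow parsing of the retrieved documents.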
The Natural Language Processing Laboratory, Centre for Intelligent Information Retrieval at the University of Massachusetts, distributes source code and executables to support IE system development efforts at other sites. Each module is designed to be used in a domain-specific and task-specific customizable IE system. Available software includes (Natural Language, n.d.):

MARMOT Text Bracketing Module, a text file translator which segments arbitrary text blocks into sentences, applies low-level specialists such as date recognizers, associates words with part-of-speech tags, and brackets the text into annotated noun phrases, prepositional phrases, and verb phrases.

BADGER Extraction Module that analyses bracketed text and produces case frame instantiations according to application-specific domain guidelines.

CRYSTAL Dictionary Induction Module, which learns text extraction rules, suitable for use by BADGER, from annotated training texts.

ID3-S Inductive Learning Module, a variant on ID3 which induces decision trees on the basis of training examples.


* Section Splitter
* Section Filter
* Text Tokenizer
* Part-of-Speech (POS) Tagger
* Noun Phrase Finder
* UMLS Concept Finder
* Negation Finder
* Regular Expression-based Concept Finder
* Sentence Splitter
* N-Gram Tool
* Classifier (e.g. Smoking Status Classifier)

5.2.1 Section Splitter

For this project, you will write the lexical analysis phase (i.e., the "scanner") of a simple compiler for a subset of the language "Tubular". We will start with only one type of variable ("int"), basic math, and the print command to output results; basically it will be little more than a standard calculator. Over the next two projects we will turn this into a working compiler, and in the four projects following that, we will expand the functionality and efficiency of the language.

The program you turn in must load in the source file (as a command-line argument) and process it line-by-line, removing whitespace and comments and categorizing each word or symbol as a token. You will then output (to standard out) a tokenized version of the file, as described in more detail below.

A token is a type of unit that appears in a source file. A pattern is a rule by which a token is identified. It will be up to you as part of this project to identify the patterns associated with each token. A lexeme is an instance of a pattern from a source file. For example, "42" is a lexeme that would be categorized as the token STATIC_INT. On the other hand, "my_var" is a lexeme that would get identified as a token of the type ID.

When multiple patterns match the text being processed, choose the one that produces the longest lexeme that starts at the current position. If two different patterns produce lexemes of the same length, choose the one that comes first in the token list in this section. For example, the string "print" might be incorrectly read as the ID "pr" followed by the TYPE "int", but "print" is longer than "pr", so it should be chosen. Likewise, the lexeme "print" could match either the pattern for COMMAND or the pattern for ID, but COMMAND should be chosen since it comes first in the list.
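The longest-match rule with first-listed tie-breaking can be sketched as follows. The token names and their priority order follow the token list in this section; the regular expressions, the function name and the choice to treat UNKNOWN as a single character are assumptions made for illustration, not a reference solution.

```python
import re

# Patterns in priority order (earlier entries win ties). The regular
# expressions themselves are assumptions derived from the token descriptions.
TOKEN_PATTERNS = [
    ("TYPE", r"int"),
    ("COMMAND", r"print"),
    ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("STATIC_INT", r"[0-9]+"),
    ("OPERATOR", r"[-+*/%]=|[-+*/%()=]"),
    ("SEPARATOR", r","),
    ("ENDLINE", r";"),
    ("WHITESPACE", r"[ \t\n]+"),
    ("COMMENT", r"#[^\n]*"),
    ("UNKNOWN", r"."),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_PATTERNS]


def tokenize(source):
    """Return (token, lexeme) pairs, skipping whitespace and comments."""
    tokens = []
    pos = 0
    while pos < len(source):
        best_name, best_lexeme = None, ""
        for name, regex in COMPILED:
            match = regex.match(source, pos)
            # Only a strictly longer lexeme replaces the current best,
            # so the earliest pattern keeps ties.
            if match and len(match.group()) > len(best_lexeme):
                best_name, best_lexeme = name, match.group()
        if best_name not in ("WHITESPACE", "COMMENT"):
            tokens.append((best_name, best_lexeme))
        pos += len(best_lexeme)
    return tokens


if __name__ == "__main__":
    for token, lexeme in tokenize("print my_var += 42; # total"):
        print(token, lexeme)
```

Trying every pattern at each position keeps the sketch close to the rule as stated; a generated scanner would normally combine the patterns into a single automaton instead.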
Each student must write this project on their own, with no help from other students or any other individuals, though you may use whatever pre-existing web resources you like. I will use your score on this project and your programming style as factors in assembling groups for future projects. As such, it's well worth putting extra effort into this project.

Your lexer should be able to identify each of the following tokens that it encounters:

Token        Description
TYPE         Data types: currently just "int", but more types will be introduced in future projects.
COMMAND      Any built-in commands: currently just "print".
ID           A sequence beginning with a letter or underscore ('_'), followed by a sequence of zero or more characters that may contain letters, numbers and underscores. Currently just variable names.
STATIC_INT   Any static integer. We will implement static floating point numbers in a future project, as well as other static types.
OPERATOR     Math operators: + - * / % ( ) = += -= *= /= %=
SEPARATOR    List separation: ,
ENDLINE      Signifies the end of a statement (semicolon): ;
WHITESPACE   Any number of consecutive spaces, tabs, or newlines.
COMMENT      Everything on a line following a pound-sign, '#'.
UNKNOWN      An unknown character or a sequence that does not match any of the tokens above.

5.2.2 Section Filter

The fundamental component of a tagger is a POS (part-of-speech) tagset, or list of all the word categories that will be used in the tagging process. The tagsets used to annotate large corpora in the past have usually been fairly extensive. The pioneering Brown Corpus distinguishes 87 simple tags. Subsequent projects have tended to elaborate the Brown Corpus tagset. For instance, the Lancaster-Oslo/Bergen (LOB) Corpus uses about 135 tags, the Lancaster UCREL group about 165 tags, and the London-Lund Corpus of Spoken English 197 tags. The rationale behind developing such large, richly articulated tagsets is to approach "the ideal of providing distinct codings for all classes of words having distinct grammatical behaviour". Choosing an appropriate set of tags for an annotation system has a direct influence on the accuracy and the usefulness of the tagging system.
The larger the tag set, the lower the accuracy of the tagging system since the system has to be capable of making finer distinctions. Conversely, the smaller the tag set, the higher the accuracy. However, a very small tag set tends to make the tagging system less useful since it provides less information. So, there is a trade-off here. Another issue in tag-set design is the consistency of the tagging system. Words of the same meaning and same functions should be tagged with the same tags.

5.2.3 Text Tokenizer

Machine learning methods usually require supervised data to learn a concept. Labelling data is time consuming, tedious, error prone and expensive. The research community has looked at semi-supervised and unsupervised learning techniques in order to obviate the need for labelled data to a certain extent. In addition to the above-mentioned problems with labelled data, not all examples are equally informative or equally easy to label. For instance, examples similar to what the learner has already seen are not as useful as new examples. Moreover, different examples may require different amounts of labelling effort from the user; for instance, a longer sentence is likely to have more ambiguities and hence would be harder to parse manually.

Active learning is the task of reducing the amount of labelled data required to learn the target concept by querying the user for labels for the most informative examples, so that the concept is learnt with fewer examples. An active learning problem setting typically consists of a small set of labelled examples and a large set of unlabelled examples. An initial classifier is trained on the labelled examples and/or the unlabelled examples. From the pool of unlabelled examples, selective sampling is used to create a small subset of examples for the user to label. This iterative process of training, selective sampling and annotation is repeated until convergence.

Cross-cultural communication, like many scholarly fields, is a combination of many other fields. These fields include anthropology, cultural studies, psychology and communication. The field has also moved both toward the treatment of interethnic relations, and toward the study of communication strategies used by co-cultural populations, i.e., communication strategies used to deal with majority or mainstream populations.

Software-oriented mechanisms are less expensive and more flexible in filter lookups when compared with their hardware-centric counterparts. Such mechanisms are abundant, commonly involving efficient algorithms for quick packet classification with the aid of caching or hashing. Their classification speeds rely on the efficiency of searching the rule set using keys formed from the corresponding header fields. Several representative software classification techniques are reviewed in sequence.

5.2.4 Part-of-Speech (POS) Tagger

An active learning experiment is usually described by five properties: number of bootstrap examples, batch size, supervised learner, data set and a stopping criterion. The supervised learner is trained on the bootstrap examples, which are labelled by the user initially. Batch size is the number of examples that are selectively sampled from the unlabelled pool and added to the training pool in each iteration. The stopping criterion can be either a desired performance level or a number of iterations. Performance is evaluated on the test set in each iteration. Active learners are usually evaluated by plotting a learning curve of performance vs. number of labelled examples, as shown in Figure 1. Success of an active learner is demonstrated by showing that it achieves better performance than a traditional learner given the same number of labelled examples; i.e., to achieve the desired performance, the active learner needs fewer examples than the traditional learner.

5.2.5 Noun Phrase Finder

Active learning aims at reducing the number of examples required to achieve the desired accuracy by selectively sampling the examples for the user to label and to train the classifier with. Several different strategies for selective sampling have been explored in the literature. In this review, we present some of the selective sampling techniques used for active learning in NLP. Uncertainty-based sampling selects examples that the model is least certain about and presents them to the user for correction or verification. A lot of work on active learning has used uncertainty-based sampling. In this section, we describe some of this work.
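Uncertainty-based sampling inside the train, sample, annotate loop can be sketched as below. This is a minimal illustration under stated assumptions: uncertainty is measured as least confidence (one of several possible measures), the stopping criterion is a fixed number of iterations, and `train` and `ask_user` are hypothetical callables supplied by the caller.

```python
def least_confident(probabilities, batch_size):
    """Return indices of the batch_size examples the model is least sure of.

    probabilities[i] is the predicted class distribution for unlabelled
    example i; confidence is taken to be the probability of the top class.
    """
    confidence = [max(dist) for dist in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:batch_size]


def active_learning(labelled, unlabelled, train, ask_user, batch_size, iterations):
    """Repeat training, selective sampling and annotation.

    train(labelled) must return a function that maps an example to a class
    distribution; ask_user(example) must return the label for that example.
    """
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(iterations):
        if not pool:
            break
        predict = train(labelled)
        chosen = least_confident([predict(x) for x in pool], batch_size)
        chosen_set = set(chosen)
        # Move the selected examples, with their new labels, into the training set.
        labelled.extend((pool[i], ask_user(pool[i])) for i in chosen)
        pool = [x for i, x in enumerate(pool) if i not in chosen_set]
    return train(labelled), labelled
```

The batch size and the number of bootstrap examples passed in as `labelled` correspond to the experiment properties listed for active learning experiments.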

5.2.6 UMLS Concept Finder

Parsers that parameterize over wider scopes are generally more accurate than edge-factored models. For graph-based non-projective parsers, wider factorizations have so far implied large increases in the computational complexity of the parsing problem. This paper introduces a crossing-sensitive generalization of a third-order factorization that trades off complexity in the model structure (i.e., scoring with features over multiple edges) with complexity in the output structure (i.e., producing crossing edges). Under this model, the optimal 1-Endpoint-Crossing tree can be found in O(n^4) time, matching the asymptotic runtime of both the third-order projective parser and the edge-factored 1-Endpoint-Crossing parser. The crossing-sensitive third-order parser is significantly more accurate than the third-order projective parser under many experimental settings and significantly less accurate on none.

5.2.7 Negation Finder

Although many NLP systems are moving toward entity-based processing, most still identify important phrases using classical keyword-based approaches. To bridge this gap, we introduce the task of entity salience: assigning a relevance score to each entity in a document. We demonstrate how a labeled corpus for the task can be automatically generated from a corpus of documents and accompanying abstracts. We then show how a classifier with features derived from a standard NLP pipeline outperforms a strong baseline by 34%. Finally, we outline initial experiments on further improving accuracy by leveraging background knowledge about the relationships between entities.

5.2.8 Regular Expression-based Concept Finder

Frame semantics is a linguistic theory that has been instantiated for English in the FrameNet lexicon. We solve the problem of frame-semantic parsing using a two-stage statistical model that takes lexical targets (i.e., content words and phrases) in their sentential contexts and predicts frame-semantic structures. Given a target in context, the first stage disambiguates it to a semantic frame. This model employs latent variables and semi-supervised learning to improve frame disambiguation for targets unseen at training time. The second stage finds the target's locally expressed semantic arguments. At inference time, a fast exact dual decomposition algorithm collectively predicts all the arguments of a frame at once in order to respect declaratively stated linguistic constraints, resulting in qualitatively better structures than naive local predictors. Both components are feature-based and discriminatively trained on a small set of annotated frame-semantic parses. On the SemEval 2007 benchmark dataset, the approach, along with a heuristic identifier of frame-evoking targets, outperforms the prior state of the art by significant margins. Additionally, we present experiments on the much larger FrameNet 1.5 dataset. We have released our frame-semantic parser as open-source software.

5.2.9 Sentence Splitter

Mobile is poised to become the predominant platform over which people are accessing the World Wide Web. Recent developments in speech recognition and understanding, backed by high bandwidth coverage and high quality speech signal acquisition on smartphones and tablets, are presenting users with the choice of speaking their web search queries instead of typing them. A critical component of a speech recognition system targeting web search is the language model. The chapter presents an empirical exploration of the query stream with the end goal of high quality statistical language modeling for mobile voice search.
Our experiments show that after text normalization the query stream is not as "wild" as it seems at first sight. One can achieve out-of-vocabulary rates below 1% using a one million word vocabulary, and excellent n-gram hit ratios of 77/88% even at high orders such as n=5/4, respectively. A more careful analysis shows that a significantly larger vocabulary (approx. 10 million words) may be required to guarantee at most 1% out-of-vocabulary rate for a large percentage (95%) of users. Using large scale, distributed language models can improve performance significantly, with up to 10% relative reductions in word-error-rate over conventional models used in speech recognition. We also find that the query stream is non-stationary, which means that adding more past training data beyond a certain point provides diminishing returns, and may even degrade performance slightly. Perhaps less surprisingly, we have shown that locale matters significantly for English query data across USA, Great Britain and Australia. In an attempt to leverage the speech data in voice search logs, we successfully build large-scale discriminative N-gram language models and derive small but significant gains in recognition performance.
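The out-of-vocabulary rate quoted above is the fraction of word tokens in held-out queries that fall outside a fixed-size vocabulary. A minimal sketch of that measurement follows; the whitespace tokenization and the function names are assumptions, and the toy queries stand in for a real query stream.

```python
from collections import Counter


def build_vocabulary(training_queries, size):
    """Keep the `size` most frequent words seen in the training queries."""
    counts = Counter(word for query in training_queries for word in query.split())
    return {word for word, _ in counts.most_common(size)}


def oov_rate(test_queries, vocabulary):
    """Fraction of test word tokens that are not in the vocabulary."""
    words = [word for query in test_queries for word in query.split()]
    if not words:
        return 0.0
    missing = sum(1 for word in words if word not in vocabulary)
    return missing / len(words)


if __name__ == "__main__":
    vocab = build_vocabulary(["weather in paris", "weather in rome"], size=2)
    print(oov_rate(["weather in oslo", "in rome"], vocab))
```

Computing the same rate separately for each user's queries, rather than over the pooled stream, gives the per-user view mentioned in the text.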

5.2.10 N-Gram Tool

Empty categories (EC) are artificial elements in Penn Treebanks motivated by the government-binding (GB) theory to explain certain language phenomena such as pro-drop. ECs are ubiquitous in languages like Chinese, but they are tacitly ignored in most machine translation (MT) work because of their elusive nature. In this paper we present a comprehensive treatment of ECs by first recovering them with a structured MaxEnt model with a rich set of syntactic and lexical features, and then incorporating the predicted ECs into a Chinese-to-English machine translation task through multiple approaches, including the extraction of EC-specific sparse features. We show that the recovered empty categories not only improve the word alignment quality, but also lead to significant improvements in a large-scale state-of-the-art syntactic MT system.

5.2.11 Classifier (e.g. Smoking Status Classifier)

Many highly engineered NLP systems address the benchmark tasks using linear statistical models applied to task-specific features. In other words, the researchers themselves discover intermediate representations by engineering ad-hoc features. These features are often derived from the output of pre-existing systems, leading to complex runtime dependencies. This approach is effective because researchers leverage a large body of linguistic knowledge. On the other hand, there is a great temptation to over-engineer the system to optimize its performance on a particular benchmark at the expense of the broader NLP goals. In this contribution, we describe a unified NLP system that achieves excellent performance on multiple benchmark tasks by discovering its own internal representations. We have avoided engineering features as much as possible and we have therefore ignored a large body of linguistic knowledge. Instead we reach state-of-the-art performance levels by transferring intermediate representations discovered on massive unlabelled datasets. We call this approach "almost from scratch" to emphasize this reduced (but still important) reliance on a priori NLP knowledge.


Software testing is a critical element of software quality assurance and represents the ultimate review of specification, design and coding. In fact, testing is the one step in the software engineering process that could be viewed as destructive rather than constructive.

A strategy for software testing integrates software test case design methods into a well-planned series of steps that result in the successful construction of software. Testing is the set of activities that can be planned in advance and conducted systematically. The underlying motivation of program testing is to affirm software quality with methods that can be applied economically and effectively to both large and small-scale systems.

8.2. STRATEGIC APPROACH TO SOFTWARE TESTING

The software engineering process can be viewed as a spiral. Initially, system engineering defines the role of software and leads to software requirement analysis, where the information domain, functions, behavior, performance, constraints and validation criteria for software are established. Moving inward along the spiral, we come to design and finally to coding. To develop computer software we spiral in along streamlines that decrease the level of abstraction on each turn.

A strategy for software testing may also be viewed in the context of the spiral. Unit testing begins at the vertex of the spiral and concentrates on each unit of the software as implemented in source code. Testing progresses by moving outward along the spiral to integration testing, where the focus is on the design and the construction of the software architecture. Taking another turn outward on the spiral, we encounter validation testing, where requirements established as part of software requirements analysis are validated against the software that has been constructed. Finally we arrive at system testing, where the software and other system elements are tested as a whole.


8.3. UNIT TESTING

Unit testing focuses verification effort on the smallest unit of software design, the module. The unit testing we have performed is white box oriented, and for some modules the steps were conducted in parallel.

1. WHITE BOX TESTING

This type of testing ensures that:

* All independent paths have been exercised at least once
* All logical decisions have been exercised on their true and false sides
* All loops are executed at their boundaries and within their operational bounds
* All internal data structures have been exercised to assure their validity

To follow the concept of white box testing, we have tested each form we have created independently to verify that:

* Data flow is correct
* All conditions are exercised to check their validity
* All loops are executed on their boundaries

2. BASIC PATH TESTING

The established technique of flow graphs with cyclomatic complexity was used to derive test cases for all the functions. The main steps in deriving test cases were:

* Use the design of the code and draw the corresponding flow graph.
* Determine the cyclomatic complexity of the resultant flow graph, using the formula V(G) = E - N + 2, or V(G) = P + 1, or V(G) = number of regions, where V(G) is the cyclomatic complexity, E is the number of edges, N is the number of flow graph nodes, and P is the number of predicate nodes.
* Determine the basis set of linearly independent paths.

3. CONDITIONAL TESTING

In this part of the testing, each of the conditions was tested for both its true and false outcomes, and all the resulting paths were tested, so that each path that may be generated by a particular condition is traced to uncover any possible errors.

4. DATA FLOW TESTING

This type of testing selects the paths of the program according to the locations of definitions and uses of variables. This kind of testing was used only when some local variables were declared. The definition-use chain method was used in this type of testing. It was particularly useful in nested statements.

5. LOOP TESTING

In this type of testing all the loops are tested to all the limits possible. The following exercise was adopted for all loops:

* All the loops were tested at their limits, just above them and just below them.
* All the loops were skipped at least once.
* For nested loops, the innermost loop was tested first, working outwards.
* For concatenated loops, the values of dependent loops were set with the help of the connected loop.
* Unstructured loops were resolved into nested loops or concatenated loops and tested as above.

Each unit has been separately tested by the development team itself and all the inputs have been validated.
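For the basic path testing step in this section, the cyclomatic complexity formulas can be checked against each other on a small flow graph. The sketch below is illustrative only: the example graph (a loop containing an if/else) and the function names are assumptions, and predicate nodes are taken to be nodes with more than one outgoing edge, which suits two-way decisions.

```python
def cyclomatic_complexity(edges, nodes):
    """V(G) = E - N + 2 for a flow graph with one entry and one exit."""
    return len(edges) - len(nodes) + 2


def predicate_nodes(edges):
    """Nodes with more than one outgoing edge, i.e. decision points."""
    out_degree = {}
    for source, _target in edges:
        out_degree[source] = out_degree.get(source, 0) + 1
    return {node for node, degree in out_degree.items() if degree > 1}


def complexity_from_predicates(edges):
    """V(G) = P + 1, valid when every predicate is a two-way decision."""
    return len(predicate_nodes(edges)) + 1


# Hypothetical flow graph: node 1 is a loop test, node 2 an if test,
# nodes 3 and 4 the two branches, node 5 the join, node 6 the exit.
NODES = [1, 2, 3, 4, 5, 6]
EDGES = [(1, 2), (1, 6), (2, 3), (2, 4), (3, 5), (4, 5), (5, 1)]

if __name__ == "__main__":
    print(cyclomatic_complexity(EDGES, NODES))
    print(complexity_from_predicates(EDGES))
```

Both formulas give the same value for this graph, which is also the number of linearly independent paths that the basis set of test cases needs to cover.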

System Security

The protection of computer-based resources, including hardware, software, data, procedures and people, against unauthorized use or natural disaster is known as System Security. System Security can be divided into four related issues:

* Security
* Integrity
* Privacy
* Confidentiality

SYSTEM SECURITY refers to the technical innovations and procedures applied to the hardware and operating systems to protect against deliberate or accidental damage from a defined threat.

DATA SECURITY is the protection of data from loss, disclosure, modification and destruction.

SYSTEM INTEGRITY refers to the proper functioning of hardware and programs, appropriate physical security and safety against external threats such as eavesdropping and wiretapping.

PRIVACY defines the rights of users or organizations to determine what information they are willing to share with or accept from others, and how the organization can be protected against unwelcome, unfair or excessive dissemination of information about it.

CONFIDENTIALITY is a special status given to sensitive information in a database to minimize the possible invasion of privacy. It is an attribute of information that characterizes its need for protection.

9.3 SECURITY SOFTWARE

System security refers to various validations on data in the form of checks and controls to keep the system from failing. It is always important to ensure that only valid data is entered and only valid operations are performed on the system. The system employs two types of checks and controls:

CLIENT SIDE VALIDATION

Various client side validations are used to ensure on the client side that only valid data is entered. Client side validation saves the server the time and load of handling invalid data. Some checks imposed are:

* VBScript is used to ensure that required fields are filled with suitable data only.
* Maximum lengths of the form fields are appropriately defined.
* Forms cannot be submitted without filling in the mandatory data, so that mistakes such as submitting empty mandatory fields are caught at the client side, saving server time and load.
* Tab indexes are set according to need, taking into account the ease of the user while working with the system.

SERVER SIDE VALIDATION

Some checks cannot be applied at the client side. Server side checks are necessary to keep the system from failing and to inform the user that an invalid operation has been performed or that the attempted operation is restricted. Some of the server side checks imposed are:

* Server side constraints have been imposed to check the validity of primary keys and foreign keys. A primary key value cannot be duplicated. Any attempt to duplicate a primary key value results in a message informing the user. Values entered through forms that use a foreign key can be updated only to existing foreign key values.
* The user is informed through appropriate messages about successful operations or exceptions occurring at the server side.
* Various access control mechanisms have been built so that one user may not interfere with another. Access permissions for the various types of users are controlled according to the organizational structure. Only permitted users can log on to the system and can have access according to their category. User names, passwords and permissions are controlled on the server side.
* Using server side validation, constraints on several restricted operations are imposed.
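The primary key and foreign key checks described above can be illustrated with a small sketch. The report does not name its database or schema, so everything here is an assumption: an in-memory SQLite database accessed through Python's standard `sqlite3` module, with hypothetical `department` and `employee` tables.

```python
import sqlite3


def demo_constraints():
    """Return the outcome of three inserts against a small in-memory schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE employee ("
        "id INTEGER PRIMARY KEY, "
        "dept_id INTEGER NOT NULL REFERENCES department(id))"
    )
    conn.execute("INSERT INTO department VALUES (1, 'Records')")

    attempts = [
        ("INSERT INTO employee VALUES (10, 1)", "valid row"),
        ("INSERT INTO department VALUES (1, 'Duplicate')", "duplicate primary key"),
        ("INSERT INTO employee VALUES (11, 99)", "unknown foreign key"),
    ]
    outcomes = []
    for statement, label in attempts:
        try:
            conn.execute(statement)
            outcomes.append((label, "accepted"))
        except sqlite3.IntegrityError:
            # A real application would turn this into a message for the user.
            outcomes.append((label, "rejected"))
    conn.close()
    return outcomes


if __name__ == "__main__":
    for label, result in demo_constraints():
        print(label, "->", result)
```

Declaring the constraints in the schema means the database rejects invalid rows regardless of which form or client submitted them.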

We use two orthogonal methods to utilize automatically detected human attributes to significantly improve content-based face image retrieval. Attribute-enhanced sparse coding exploits the global structure and uses several human attributes to construct semantic-aware code words in the offline stage. Attribute-embedded inverted indexing further considers the local attribute signature of the query image and still ensures efficient retrieval in the online stage. The experimental results show that using the code words generated by the proposed coding scheme, we can reduce the quantization error and achieve salient gains in face retrieval on two public datasets; the proposed indexing scheme can be easily integrated into inverted index, thus maintaining a scalable framework. During the experiments, we also discover certain informative attributes for face retrieval across different datasets and these attributes are also promising for other applications. Current methods treat all attributes as equal. We will investigate methods to dynamically decide the importance of the attributes and further exploit the contextual relationships between them.