
Natural Browsing

Pierpaolo Basile and Luigi Intonti


Bridging the gap between NLP and the Semantic Web!

Natural Browsing
Pierpaolo Basile (basilepp@di.uniba.it)
Dipartimento di Informatica, Università degli Studi di Bari Aldo Moro, Via E. Orabona 4, 70125 Bari (Italy)

Luigi Intonti (intontil@sudsistemi.it)


Sud Sistemi s.r.l., Via A. Omodeo 5, 70125 Bari (Italy)

[Figure: Natural Browsing architecture, shown with a sample Italian text about a city in the County of Tyrol as input. Labelled components: Integration and Normalization; Semantic Enrichment (Natural Language Processing, Semantic Space, Clustering, Cluster Labelling); Browsing; User Editing and User Generated Content; Ontology (SUMO + Linguistic Ontology + Domain Ontology).]

This research was funded by Regione Puglia under the contract POR PUGLIA 2007-2013 - Asse I Linea 1.1 Azione 1.1.2 - Bando 'Aiuti agli Investimenti in Ricerca per le PMI' - Fondo per le Agevolazioni alla Ricerca, project title: Natural Browsing.

Natural Browsing is a research project funded by POR PUGLIA 2007/2013 Asse I Linea 1.1 Azione 1.1.2 Bando Aiuti agli Investimenti in Ricerca per le PMI

In every organization, most knowledge is held by people and e-documents (web pages, spreadsheets, Word documents, databases, emails, etc.). Such heterogeneous data are hard for the organization's members to use. The Natural Browsing project stems from the idea of extracting and formalizing all this knowledge and making it accessible to all members of the organization, so that the interaction between users and the system can in turn refine and extend the knowledge base.

Scenario

Browsing and searching are done through search engines, which return links strictly tied to the query words and give no emphasis to user participation. Most of this information is unstructured, so if there is a common substratum for all data sources, it is human language. Technologies now exist for formalizing the information extracted from textual documents in a data structure capable of describing the semantics of every single word. The only possible common element is therefore a formalization of semantics through ontologies.

This project also benefits from the fundamental academic contribution of: University of Bari, Informatics Department; University of Bari, Michelangelo Merlin Physics Department; Polytechnic of Bari, Mechanical and Management Engineering Department.


[Figure: sources of organizational knowledge. Application: specific cases. Web. Database: application data, data from external organizations. Document: folder monitoring, acquisition from memory devices.]

Making the consultation of knowledge in any organization comprehensible and easy.

Targets

The project proposal consists in researching and structuring innovative processes for data extraction from heterogeneous sources, semantic enrichment, and browsing. In the near future, these processes will allow users to use the Internet by means of natural language. The aims of Natural Browsing are:

- state-of-the-art analysis;
- study of a process for gathering data, information and content from heterogeneous sources (document repositories, forums, web pages, blogs, etc.) following a reference standard model (normalization into a common semantic format);
- realization of a system for consolidating semantic information in a repository;
- creation of a semantic dictionary, namely a semantic enrichment process that acts as a semantics-based navigation infrastructure;
- channeling into a single repository;
- identification of a typical navigation model on the semantic net;
- selection of processes for collecting new contributions from users;
- realization of demonstration prototypes.

Data are usually heterogeneous:

- Documents: monitoring of folders and memory devices
- Web: blogs, forums, intranet applications
- Database: application data, data from external organizations
- Applications: specific cases

[Figure labels: web, data, documents; edit, insert, correction; semantic clustering, full text; view, search.]

A further level of interaction between the portal and its users will be possible, adding other elements to the information found, for example locations, events, and their georeferencing.

Working procedures

The Natural Browsing project defines four processes:

1. finding information from different information sources, then normalizing these data in RDF following semantic rules and, finally, consolidating them in a repository;
2. semantic enrichment of the relations between information units and creation of the semantic net;
3. user navigation of the extracted data and further enrichment of the information;
4. cyclic repetition of the second process, so that the semantic relations and the semantic net can grow, also taking into account the content enriched by users (process 3).

The first, second and fourth processes are fully transparent to users: the knowledge base they consult keeps being enriched with new and better-structured information by system processes that involve no interaction with the human component. These processes are service prototypes, rather than web services or agents. User navigation (process 3) benefits from this separation of the data processing. Each user profile has specific access to a web application prototype for navigating the data, in which the user may prefer a classic method, such as consulting categories, to being guided by the ad hoc semantic net built from the initial natural-language request.
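As an illustration of the first process (normalization of heterogeneous records into RDF and consolidation in a repository), the following is a minimal sketch and not the project's actual implementation. The namespace http://example.org/nb/, the property names, the field mappings and the function names are all assumptions made for this example; the output uses the standard N-Triples line syntax, and the "repository" is just an in-memory set.

```python
from urllib.parse import quote

# Hypothetical namespace: the project's real vocabulary is not given in the text.
BASE = "http://example.org/nb/"

# Assumed mapping from source-specific field names to common property names.
FIELD_MAP = {
    "document": {"filename": "title", "body": "text"},
    "web": {"page_title": "title", "content": "text"},
}


def unit_iri(unit_id):
    """IRI of an information unit, percent-encoding unsafe characters."""
    return "<" + BASE + "unit/" + quote(unit_id, safe="") + ">"


def prop_iri(name):
    """IRI of a property in the assumed namespace."""
    return "<" + BASE + "prop/" + name + ">"


def literal(value):
    """Serialize a plain string literal with N-Triples escaping."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def normalize(record, source_kind):
    """Turn one source record into N-Triples lines with a common shape."""
    subject = unit_iri(record["id"])
    triples = [f"{subject} {prop_iri('source')} {literal(source_kind)} ."]
    for field, common in FIELD_MAP[source_kind].items():
        if field in record:
            triples.append(f"{subject} {prop_iri(common)} {literal(record[field])} .")
    return triples


def consolidate(repository, triples):
    """Add triples to the repository; a set drops exact duplicates."""
    repository.update(triples)
    return repository


if __name__ == "__main__":
    repo = set()
    doc = {"id": "doc-1", "filename": "report.txt", "body": "Quarterly report"}
    page = {"id": "web-7", "page_title": "News", "content": "Line one\nLine two"}
    consolidate(repo, normalize(doc, "document"))
    consolidate(repo, normalize(page, "web"))
    for line in sorted(repo):
        print(line)
```

Keeping the field mapping in a table means that a new kind of source only needs a new entry in FIELD_MAP, while everything downstream sees the same property names.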

The Region of Puglia offers an excellent field of study for future developments of Natural Browsing.

Application fields

From the beginning, the project proposal has been designed to be applicable in different contexts and to support situations in which it is important to make the most of data sources and of the contributions coming from navigation and consultation processes:

- managing and highlighting knowledge in an organization;
- improving decision-making procedures in public administrations that want to test different forms of participatory democracy;
- supporting all members of an organization in acquiring knowledge through a single access point.

Developments

News and improvements are available on the web site: www.naturalbrowsing.it

Natural Browsing
Pierpaolo Basile (1) and Luigi Intonti (2)

(1) Dept. of Computer Science, University of Bari, via Orabona, I-70125, Bari, Italy, basilepp@di.uniba.it
(2) Sud Sistemi S.r.l., via A. Omodeo 5, I-70125, Bari, Italy, intontil@sudsistemi.it

Abstract. Natural Browsing is an ongoing industrial research project (3) that aims to develop a framework able to automatically build a knowledge base from unstructured data. The project relies on NLP methods and Semantic Web technologies to mine facts from data.

Background and Motivation

Nowadays, the World Wide Web (WWW) takes the form of a huge set of linked information. An essential characteristic of the WWW is its universality: any concept can be linked to any other, building a network that offers endless possibilities of navigation. We can identify two different kinds of link: the hyperlink, encoded in the WWW page using a formal language (generally these links refer to a URI); and the semantic link, an implicit link that hides a semantic reasoning. Generally, only humans are able to discover semantic links: when we surf the WWW and decide to connect one piece of content with another, we follow a semantic reasoning that is not easy for a machine to understand. This implies that humans browse the WWW using their experience and their ability to reason about words and concepts.

The Semantic Web tries to address this kind of problem by providing a set of technologies able to understand the meaning of information on the WWW. The main idea behind the Semantic Web is to provide machine-readable metadata that enable agents and other software to access the WWW in an intelligent way. Many Semantic Web technologies already exist and are used in several projects and software systems; however, their application to the WWW remains largely unrealized.

Natural Browsing was conceived as a tool able to mine semantic links from a large collection of heterogeneous information sources, such as textual documents, web pages, RSS feeds and blogs. The project is enterprise oriented and its approach involves several areas: Natural Language Processing, Machine Learning and the Semantic Web.
(3) This research was funded by Regione Puglia under the contract POR PUGLIA 2007-2013 - Asse I Linea 1.1 Azione 1.1.2 - Bando Aiuti agli Investimenti in Ricerca per le PMI - Fondo per le Agevolazioni alla Ricerca, project title: Natural Browsing.


The project in a nutshell

The core of the project is the knowledge base (KB), which contains the mined facts. Moreover, the content of the KB is used during the knowledge extraction process. The KB is defined in the OWL language and consists of three parts:

1. Upper-level ontology: contains very general concepts that are the same across all knowledge domains. Its main function is to provide a semantic interoperability layer between the Linguistic ontology and the Domain ontology. We use SUMO [1] as the upper-level ontology because SUMO concepts are mapped to WordNet synsets. As a result, it is easy to link the concepts of the Linguistic ontology to those of the Upper-level ontology.
2. Linguistic ontology: provides information about words and meanings. We build the Linguistic ontology using information provided by WordNet.
3. Domain ontology: contains information about the domain in which Natural Browsing is used. Concepts in the Domain ontology are linked to concepts in the Upper-level ontology.

We identify four main processes to mine the unstructured data and to provide tools for browsing the KB:

Integration and Normalization: in this step, unstructured data are collected from heterogeneous information sources. The goal of this step is to query and retrieve data from each information source; the data are then normalized according to a defined structure.

Semantic Enrichment: the goal of this step is to mine facts from textual data using NLP techniques. We adopt a Named Entity Recognition tool and then map the recognized entities to ontological concepts. Moreover, we apply an unsupervised method, based on space reduction and clustering, to mine clusters of entities. Finally, we try to map each cluster to a set of ontological concepts by means of an automatic cluster labelling based on the ontology structure. This process is implemented as a pipeline, which makes it possible to add further semantic enrichment tasks.

Browsing: in this step, users can query the KB using statements in natural language or in structured form. Moreover, this process provides tools to easily navigate and modify the KB. KB browsing is mainly built on a graph structure, which makes it possible to show concepts close to those relevant to the query. Only some kinds of users will be able to modify the KB; in particular, only expert users can supervise the facts extracted by the Semantic Enrichment process.

User Enrichment: this process allows user-generated content to be added to the KB. For example, a user can add a new concept or remove a relation between concepts. In this step it is also possible to add tags to the retrieved resources.
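To make the Semantic Enrichment step more concrete, the following is a minimal sketch of a pipeline with pluggable tasks, written under stated assumptions rather than taken from the project. The toy ontology, the gazetteer lookup that stands in for a Named Entity Recognition tool, and all function names are hypothetical. The space-reduction and clustering stage is omitted: the entities found in one text are simply treated as a given cluster. The labelling rule used here (the most specific concept shared by all cluster members) is one possible way of exploiting the ontology structure, not necessarily the authors' method.

```python
# Toy ontology: child concept -> parent concept. All names are hypothetical.
PARENT = {
    "City": "GeographicArea",
    "Region": "GeographicArea",
    "GeographicArea": "Entity",
    "Organization": "Entity",
}

# Toy gazetteer standing in for a Named Entity Recognition tool.
GAZETTEER = {"Bari": "City", "Puglia": "Region", "Sud Sistemi": "Organization"}


def ancestors(concept):
    """Return the path from a concept up to the ontology root, concept first."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def label_cluster(concepts):
    """Label a cluster with the most specific concept shared by all members."""
    if not concepts:
        return None
    common = ancestors(concepts[0])
    for concept in concepts[1:]:
        others = set(ancestors(concept))
        # Keep only shared ancestors, preserving most-specific-first order.
        common = [c for c in common if c in others]
    return common[0] if common else None


def recognize_entities(unit):
    """Task 1: find known entity names in the text (substring lookup)."""
    unit["entities"] = [name for name in GAZETTEER if name in unit["text"]]
    return unit


def map_to_concepts(unit):
    """Task 2: map each recognized entity to an ontological concept."""
    unit["concepts"] = [GAZETTEER[name] for name in unit["entities"]]
    return unit


def label_unit(unit):
    """Task 3: treat the unit's entities as one cluster and label it."""
    unit["label"] = label_cluster(unit["concepts"])
    return unit


def run_pipeline(unit, tasks):
    """Apply enrichment tasks in order; new tasks can be appended to the list."""
    for task in tasks:
        unit = task(unit)
    return unit


if __name__ == "__main__":
    tasks = [recognize_entities, map_to_concepts, label_unit]
    result = run_pipeline({"text": "Bari is the capital of Puglia."}, tasks)
    print(result["entities"], result["concepts"], result["label"])
```

Because each task takes and returns the same dictionary, a further enrichment task can be added by appending one more function to the task list, which mirrors the extensibility the pipeline is meant to provide.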

References
1. Niles, I., Pease, A.: Towards a standard upper ontology. In: Welty, C., Smith, B. (eds.) Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001) (2001)
