
Volume 2, Issue 3, March 2012

ISSN: 2277 128X

International Journal of Advanced Research in Computer Science and Software Engineering


Research Paper Available online at: www.ijarcsse.com

Web Mining: An Integrated Approach


N. Senthil Kumar
Assistant Professor (Junior), School of Information Technology & Engineering, VIT University, Vellore, India
Mail: senthilkumar.n@vit.ac.in

P.M. Durai Raj Vincent
Assistant Professor, School of Information Technology & Engineering, VIT University, Vellore, India

Abstract: With the proliferation of the Web as a prevalent tool for e-activities such as e-commerce, e-learning, e-government and e-science, it has become central to everyday work. The Web is an enormous, widely scattered, global source of information services, hyperlink information, access and usage information, and website content and organization. It has also had a huge impact on the role and availability of information. The Web is growing at a staggering rate, which makes it increasingly difficult, and increasingly crucial, for both people and programs to have quick and accurate access to Web information and services. Buried in the enormous, multi-dimensional, heterogeneous and distributed information on the Web is knowledge of great potential value. With the rapid development of the Web, it is imperative to provide users with tools for efficient and effective resource and knowledge discovery. Search engines have assumed a central role in the World Wide Web's infrastructure as its scale and impact have escalated.

Keywords: personalization, recommendation, interested domains, collaborative filtering.

I. INTRODUCTION
The rapid growth in the number of documents on the web makes it difficult to identify the documents most relevant to a specific community when a general query is issued. Current search engines rank pages using vector space and link analysis techniques based on the hypertext structure of the web. Searching the web nevertheless remains complex and faces several problems: (1) indexing all the pages on the web is cumbersome and sometimes complicated; (2) because the web changes dynamically and grows exponentially, an index can fall out of date; (3) because of this exponential growth, search engines need hardware with ever greater storage capacity; and (4) conventional search engines cannot always return results that reflect current information.

Web personalization is becoming increasingly important as a way to overcome these difficulties: it tailors the content and structure of a website to the requirements of its users, based on an understanding of their access activities and behavior. To that end, promising research has been carried out in web mining. Web mining aims to find the necessary and most sought-after information or patterns in the web hyperlink structure, page content and web usage logs. Web mining can be divided into three main areas: web content mining, web structure mining and web usage mining. Web content mining is the application of data mining techniques to content published on the Internet, usually as semi-structured, unstructured or structured documents. Web document text mining, resource discovery based on concept indexing, and agent-based technology also fall into this category. Web structure mining operates on the hyperlink structure and produces a graph that provides information about page ranking; it infers knowledge from the links and references between web pages. Web usage mining analyses the results of user interaction, including web logs, clickstreams and other recorded transactions. Recent research combines content and structure mining to strengthen the techniques and improve their results.

II. DATA COLLECTION
The data for this task come from three different locations:
Server side collection - the browsing behavior of the web user is recorded in the log file of the site server.
Client side collection - a client side application, such as a remote agent, collects data on the user's navigation.
Proxy side collection - like server side collection, proxy side collection gathers the data in a log file. It is useful for characterizing a group of users that share the same proxy server.
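As an illustration of server side collection, the sketch below parses one line of a server log into named fields. It assumes the Combined Log Format written by many web servers; the paper does not name a log format, and the function name `parse_log_line` and the sample line are hypothetical.

```python
import re
from datetime import datetime

# Combined Log Format (an assumption; the paper does not specify a format).
# The trailing referrer and user-agent fields are optional so that plain
# Common Log Format lines are accepted too.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the timestamp and status code; other fields stay as strings.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record


# Hypothetical sample line, for illustration only.
sample = ('192.0.2.10 - - [10/Mar/2012:13:55:36 +0530] '
          '"GET /index.html HTTP/1.1" 200 2326 '
          '"http://example.org/start" "Mozilla/5.0"')
record = parse_log_line(sample)
```

The resulting records carry the IP address, timestamp, requested URL, status code, referrer and browser fields on which later preparation steps would typically operate.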

III. WEB LOG DATA PREPARATION
The overall data preparation process is briefly described in the following sections.

A. Data Cleaning
Not every access to the content should be taken into consideration. We need to remove accesses to irrelevant items (such as button images), accesses by Web crawlers (i.e. non-human accesses), and failed requests.

B. Efficient User Identification
Many users can be assigned the same IP address and, conversely, one user can have several different IP addresses even within the same session. The first problem is usually a side effect of intermediary proxy devices and local network gateways (also, many users can have access to the same computer). The second problem occurs when the ISP performs load balancing over several proxies. All this prevents us from easily identifying and tracking the user. Using the information in the referrer and browser fields, we can distinguish between some users that share an IP address; however, a complete distinction is not possible. Cookies can be used for better user identification. Users can block or delete cookies, but it is estimated that well over 90% of users have cookies enabled. Another reliable means of user identification is assigning usernames and passwords. However, requiring users to authenticate is inappropriate for Web browsing in general.

C. Session Identification and Path Completion
Session identification is carried out on the assumption that if a certain predefined period of time between two accesses is exceeded, a new session starts at that point. Sessions can have missing parts, owing to the browser's own caching mechanism and to intermediate proxy caches. The missing parts can be inferred from the site's structure.

D. Transaction Identification
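One approach associated with this step, transaction identification by maximal forward reference, can be sketched as follows. The sketch follows the usual description of the heuristic: a session is read page by page, and whenever the user moves back to a page already on the current path, the forward path travelled so far is emitted as one transaction. The function name and the sample session are illustrative assumptions, not taken from the paper.

```python
def maximal_forward_references(session):
    """Split one session (page identifiers in visit order) into
    maximal forward references."""
    transactions = []
    path = []               # pages on the current forward path
    moving_forward = False  # True if the last step added a new page
    for page in session:
        if path and page == path[-1]:
            continue  # page reload: no movement in either direction
        if page in path:
            # Backward reference: the path so far cannot be extended,
            # so record it once, then back up to the revisited page.
            if moving_forward:
                transactions.append(list(path))
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    # The session may end while the user is still moving forward.
    if moving_forward:
        transactions.append(list(path))
    return transactions


# Hypothetical session: each letter stands for one page.
session = list("ABCDCBEGHGWAOUOV")
transactions = maximal_forward_references(session)
```

For the sample session the sketch yields the paths ABCD, ABEGH, ABEGW, AOU and AOV. Each path ends at the last page reached before the user turned back, or at the final page of the session.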

Some authors propose dividing or joining the sessions into meaningful clusters, i.e. transactions. Pages visited within a session can be categorized as auxiliary or content pages. Auxiliary pages are used for navigation, i.e. the user is not interested in the content (at the time) but is merely trying to navigate from one page to another. Content pages, on the other hand, are pages that seem to provide some useful content to the user. The transaction generation process usually tries to distinguish between auxiliary and content pages to produce the so-called auxiliary-content transactions (consisting of auxiliary pages up to and including the first content page) and the so-called content-only transactions (consisting of only content pages). Several approaches, such as transaction identification by reference length and transaction identification by maximal forward reference, are available for this purpose.

IV. WEB PERSONALIZATION
The objective of the personalization process is to generate user models. Conventional user models are simplistic, representing the user as a vector of ratings. Users rate different items for different reasons and in different contexts, and their interests and needs change with time. Identifying these changes and adapting to them is a key goal of personalization. We suggest that the personalization process be taken to a new level, one where the user does not need to be actively involved in it. All the user needs is an active profile file: when the user logs onto a web site, the browser checks for that profile file just as it checks for cookies. The profile file describes the user's interests and the level at which the user wants each personalizable feature. Since the profile file is in a standardized format, web sites would be able to provide content according to it. This would enhance the user's personalization process without their active involvement.

V. DETECTING NOISE IN THE WEB PAGES
In a nutshell, a web page is composed of many blocks, such as content areas, navigation areas and advertisement areas. Separating these areas is useful for several practical applications and user needs. Noise detection is the process of identifying the main content area, or of removing noisy blocks such as advertisements and navigational panels, which most users do not care for when they surf. The information contained in noisy blocks can seriously harm Web data mining.
Another application is Web browsing on a small-screen device, such as a PDA. Identifying the different content blocks allows one to re-arrange the layout of the page so that the main content can be seen easily without losing any other information from the page.

VI. BINDING THE WEB INFORMATION

Due to the sheer scale of the Web and its diverse authorship, different Web sites may use different syntaxes to express similar or related information. In order to use or extract information from multiple sites to provide value-added services (e.g., metasearch or deep Web search), one needs to semantically integrate information from multiple sources. Recently, several researchers have attempted this task. Two popular problems related to the Web are (1) Web query interface integration, to enable querying multiple Web databases, and (2) schema matching, e.g., integrating Yahoo's and Google's directories to match concepts in the hierarchies.
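A very rough sketch of the schema matching idea is given below: category names from two directories are paired when their word sets overlap enough. Jaccard similarity over words is a simple stand-in chosen here for illustration; the paper does not prescribe a matching method, and the function names, threshold and category names are all hypothetical.

```python
import re


def tokens(name):
    """Lower-case word set of a category name, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_concepts(source, target, threshold=0.5):
    """Map each source category to its best-scoring target category,
    keeping only pairs whose similarity reaches the threshold."""
    matches = {}
    for s in source:
        scored = [(jaccard(tokens(s), tokens(t)), t) for t in target]
        score, best = max(scored, default=(0.0, None))
        if best is not None and score >= threshold:
            matches[s] = best
    return matches


# Made-up category names for two directories, for illustration only.
directory_a = ["Computers & Internet", "Arts & Humanities", "Recreation & Sports"]
directory_b = ["Computers", "Arts", "Sports", "Science"]
matches = match_concepts(directory_a, directory_b)
```

Name overlap alone misses synonyms and ignores the position of a concept in its hierarchy, so a practical matcher would need additional evidence beyond what this sketch uses.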

2012, IJARCSSE All Rights Reserved


The ability to query multiple deep Web databases is attractive and interesting because the deep Web contains a huge amount of information or data that is not indexed by general search engines.

VII. WEB DATABASE STRATEGIES

Database approaches to web mining have focused on techniques for organizing the semi-structured data on the Web into more structured collections of resources, and on using standard database querying mechanisms and data mining techniques to analyze it. Database techniques on the Web address the problems of managing and querying the information on the Web. There are three classes of tasks related to those problems: modeling and querying the Web, information extraction and integration, and Web site construction and restructuring. The first two tasks are related to Web content mining applications. The database view tries to infer the structure of a Web site, or to transform a Web site into a database, so that better information management and querying on the Web become possible. Many applications use multilevel databases (MLDB), in which each level is obtained by generalization of the level below it, together with a special-purpose query language for Web mining that extracts knowledge from the MLDB of Web documents. The main idea of a multilevel database is that the lowest level contains semi-structured information stored in various Web repositories, such as hypertext documents. At the higher levels, metadata or generalizations are extracted from the lower levels and organized in structured collections, i.e. relational or object-oriented databases. Most Web-based query systems and languages utilize a standard database query language such as SQL, structural information about Web documents, and even natural language processing for the queries used in Web searches.

VIII. ADAPTIVE WEBSITES AND WEB PERSONALIZATION SYSTEMS
Adaptive websites adapt to the way in which users access them in order to improve their organization and presentation. Web personalization is the action that adapts the information and services provided by a web application to the needs of a user or a group of users.
A personalization system must be able to provide users with the information they need without them having to request it explicitly. In order to use the data residing in log files, it is absolutely necessary that these be cleaned and filtered. The analysis of log files raises a series of problems, namely: the high number of records that are irrelevant to web usage mining; the difficulty of identifying users and sessions; the lack of information about the content of the accessed pages; and the fact that the data is processed in batches, which takes time and resources. The many difficulties connected with the identification of users and sessions have led to the development of so-called reactive and proactive strategies. Reactive strategies try to associate requests with users, based on web server logs, after they have interacted with the website. Proactive strategies, on the other hand, try to associate requests with users during their interaction with the website.

IX. APPLIED TECHNIQUES ON WEB USAGE DATA
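To make the association rule technique applied to usage data concrete, the sketch below treats each visit as a set of requested URLs and derives rules of the form "visitors who requested X also requested Y" from URL pairs. It is a minimal illustration restricted to pairs, not a full frequent itemset miner; the function name, thresholds and sample visits are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def pair_rules(visits, min_support=0.5, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) rules over URL pairs.

    support    = share of visits that contain both URLs
    confidence = share of visits with lhs that also contain rhs
    """
    n = len(visits)
    url_counts = Counter()
    pair_counts = Counter()
    for visit in visits:
        urls = sorted(set(visit))  # each URL counts once per visit
        url_counts.update(urls)
        pair_counts.update(combinations(urls, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # A frequent pair can give a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / url_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical visits, each a list of requested URLs.
visits = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
rules = pair_rules(visits)
```

With these sample visits, the only rule that passes both thresholds is "/cart" implies "/products": the pair occurs in half of the visits, and every visit that includes "/cart" also includes "/products".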

The techniques most commonly applied to Web usage data are:
Statistical Analysis: This method gives a clear description of the traffic on a web site, such as the most visited pages and average daily hits, and lets the system act upon it. This form of analysis is easily carried out by many existing tools, many of which are available free of charge.
Association Rules: This technique provides a mechanism for understanding user behavior by treating every URL requested by a user in a particular visit as an item and then discovering relationships between items that meet a minimum support level, as in market basket analysis.
Sequential Patterns: This technique finds the time-ordered sequences of URLs followed by past users in order to predict future ones; it is widely used for web advertising.
Clustering: This technique groups meaningful URLs and discovers the characteristics they share with regard to user behavior. User activities are segmented and the URLs are clustered on that basis, adding a new dimension to the analysis.

X. CONCLUSION
In this paper, we present a trend discovery system for dynamic web content mining. This system extends the capabilities of traditional web content mining approaches in order to analyse constantly changing web sites containing information about multiple topics (such as online news sites). With the continued growth of the Web as an information source and as a medium for providing web services, Web mining continues to play an ever-expanding role. Web mining has adapted techniques from the fields of data mining, database mining and information retrieval, and has developed some techniques of its own, e.g. path analysis. A lot of work remains to be done in adapting known mining techniques as well as in developing new ones.