
Volume 3, Issue 10, October 2013

ISSN: 2277 128X

International Journal of Advanced Research in Computer Science and Software Engineering


Research Paper Available online at: www.ijarcsse.com

Web Mining and Pre-processing of Web Usage Data


Sanjay Tiwari *
Associate Professor, CSE Deptt.
Arya Institute of Engineering & Technology, India

Renu Tilwani
M. Tech. Research Scholar, CSE Deptt.
Arya Institute of Engineering & Technology, India

Abstract-- With the rapid growth of Web services and Web-based information systems, the amount of Web data has grown enormously. Many kinds of data therefore have to be handled and organized so that they can be accessed by many users effectively and efficiently. As a result, Web mining methods and knowledge discovery on the Web are now attracting a growing number of researchers. Web Data Mining is an application of Data Mining that deals with the extraction of interesting or hidden knowledge from the World Wide Web. Web Data Mining can be categorized into Web Content Mining, Web Structure Mining, and Web Usage Mining. Web usage mining is a kind of data mining method that can be useful in discovering and recommending Web usage patterns on the basis of user sessions and behaviour. Web usage mining includes three processes, namely preprocessing, pattern discovery, and pattern analysis. This paper focuses mainly on the preprocessing approach.

Keywords-- Data Mining, Web Data Mining, Web Content Mining, Web Structure Mining, Web Usage Mining.

I. INTRODUCTION

Data mining is the process of extracting knowledge from the bulk of data stored in large databases, data warehouses, or other repositories. It is also known as Knowledge Discovery in Databases (KDD). Data Mining involves data analysis and knowledge discovery algorithms with satisfactory computational efficiency that produce useful patterns and knowledge which can be used in different areas [1]. With the rapid growth of the World Wide Web, Data Mining methods can be used to extract information from Web content, structure, and usage data.
Web mining is the application of data mining techniques to extract knowledge from Web data, where at least one of structure or usage data is used in the mining process (with or without other types of Web data). Web mining has been gaining popularity because it extracts interesting and useful knowledge from Web data [2].

II. WEB DATA

One of the most important steps in knowledge discovery in databases is to construct a proper target data set for the data mining task. In Web data mining, data can be gathered from Web servers, client sites, and proxy servers, or obtained from an organization's database. Different types of data are collected from different locations. Many types of data can be used in Web mining [3, 4, 5].

A. Web Content
The data present on Web pages that provides information to users. Examples of Web content data are text, HTML, audio, video, and images.

B. Web Structure
Web pages are connected with each other through hyperlinks, i.e. the various HTML tags used to link one page to another and one Web site to another Web site.

C. Web Usage
These data reflect the usage of the Web and are collected on Web servers, proxy servers, and client browsers, together with the IP address, date, time, etc.

D. Web User Profile
The data that provides demographic information about users of Web sites, i.e. user registration data and customer profile information.

III. WEB MINING TAXONOMY

Web Mining can be broadly divided into three categories, as shown in Fig. 1, according to the kinds of data to be mined:

A. Web Content Mining
Web Content Mining is the process of discovering useful information from Web content data. Web content data such as images, video, audio, text, and structured records is placed on a Web page to convey information to users. Text mining and its application to Web content has been the most widely researched area [2]. Some of the research topics in text mining are topic discovery, extraction of association patterns, clustering of Web documents, and classification of Web pages. Research activities in this area also involve methods from other disciplines such as Information Retrieval (IR) and Natural Language Processing (NLP). While a considerable amount of work exists on extracting information from images in the areas of image processing and computer vision, the application of these techniques to Web content mining has not been very rapid.

2013, IJARCSSE All Rights Reserved
Tiwari et al., International Journal of Advanced Research in Computer Science and Software Engineering 3(10), October - 2013, pp. 629-634

B. Web Structure Mining
The structure of a Web graph comprises Web pages as nodes and hyperlinks as edges connecting two related pages. Web Structure Mining can be defined as the process of extracting structure information from the Web. This type of mining is further divided into two types on the basis of the structural data used.

1) Hyperlinks: A hyperlink is a structural unit that connects Web pages. A hyperlink that connects a Web page to a different location on the same page is called an Intra-Document Hyperlink, whereas a hyperlink that connects two different pages is called an Inter-Document Hyperlink. There has been a large amount of work on hyperlink analysis [6].

2) Document Structure: The content of a Web page can be organized in a tree-structured format on the basis of the various HTML and XML tags within the page. Here mining is used to extract the document object model (DOM) structure out of documents [6].

C. Web Usage Mining
Web Usage Mining is the application of data mining techniques to extract interesting usage patterns and knowledge from Web usage data, in order to understand and serve the needs of Web-based applications [6]. Usage data consist of the identity, origin, and browsing behaviour of Web users at a Web site. Web usage mining can be further divided on the basis of the usage data as follows:

1) Web Server Data: This data is collected in user logs at the Web server. Typical user log data at a Web server are the IP addresses, page references, and access times of the users.

Fig. 1. Web Mining Taxonomy

2) Application Server Data: There are various commercial application servers, e.g. WebLogic [BEA], BroadVision [BV], StoryServer [VIGN], etc. A key feature of these servers is the ability to capture various kinds of business events and record them in application server logs.

3) Application Level Data: Finally, new kinds of events can always be defined in an application, and logging can be turned on for them, generating histories of these specially defined events.

The usage data can also be split into three kinds on the basis of where it is collected: on the server side, on the client side, and on the proxy side. The key issue is that the server side gives an aggregate picture of the usage of a service by all users, while the client side gives a complete picture of the usage of all services by a particular client, with the proxy side being somewhere in the middle.

IV. WEB USAGE MINING PROCESS

The Web usage mining process consists of the following three inter-dependent stages: data collection and pre-processing, pattern discovery, and pattern analysis. In the preprocessing stage, the clickstream data is cleaned and divided into a set of user transactions that represent the behaviour of each user during different sessions. In the pattern discovery stage, statistical, database, and machine learning operations are executed to obtain hidden patterns that reveal the usual behaviour of users, as well as summary statistics on Web resources, sessions, and users [6]. In the final stage of the process, the extracted patterns and statistics are further analysed and filtered, which results in aggregate user models that are used as input to applications such as recommendation engines, visualization tools, and Web analytics and report generation tools. The overall process is depicted in Fig. 2.

Fig. 2. Web Usage Mining process

V. PREPROCESSING IN WEB USAGE MINING

Preprocessing of Web data is important in Web usage mining because of the characteristics of clickstream data and its relationship with other related data gathered from different sources. Data preparation is the most time consuming step in the Web usage mining process, and it often requires special algorithms and heuristics that are not usually used in other domains. This process plays a key role in the successful extraction of useful patterns from the data. Preprocessing involves integrating data from different sources and then converting the integrated data into a form appropriate for input into specific data mining operations. Collectively, this process is referred to as data preparation. For data mining methods to be applied successfully to Web usage data, the preprocessing task must be implemented correctly. Fig. 2. provides a summary of the primary tasks and elements in usage data pre-processing.

A. Data Fusion & Cleaning
In some cases multiple servers are used to reduce the load on a particular server. Data fusion is defined as the process of merging log files from various Web and application servers [7]. The data fusion process is depicted in Fig. 3. Data cleaning includes removing irrelevant and erroneous references to embedded objects [8] [9]. It is used to discard records that do not provide useful information for the analysis or data mining tasks [7].
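The fusion and cleaning steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout `(minutes, ip, url)`, the function names `fuse_logs` and `clean_log`, the sample URLs, and the list of embedded-object suffixes are all assumptions made for illustration.

```python
import heapq
from typing import Iterable, List, Tuple

# Assumed record layout: (minutes since midnight, client IP, requested URL).
Record = Tuple[int, str, str]

# Hypothetical list of embedded-object suffixes; a real site would tune this.
EMBEDDED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js")


def fuse_logs(logs: Iterable[List[Record]]) -> List[Record]:
    """Merge per-server logs (each already sorted by time) into one ordered log."""
    return list(heapq.merge(*logs, key=lambda record: record[0]))


def clean_log(log: Iterable[Record]) -> List[Record]:
    """Drop requests for embedded objects such as images and style sheets."""
    return [r for r in log if not r[2].lower().endswith(EMBEDDED_SUFFIXES)]


# Two made-up server logs for the same site.
server_a = [(4, "192.168.100.101", "/A.html"), (12, "192.168.100.102", "/A.html")]
server_b = [(5, "192.168.100.101", "/logo.gif"), (10, "192.168.100.101", "/B.html")]

merged = fuse_logs([server_a, server_b])
cleaned = clean_log(merged)
```

Merging before cleaning keeps the two steps independent, so the cleaning rule can be changed without touching the merge.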

[Figure: Server Logs (three sources) -> Merge Log File (Data Fusion) -> User Identification, Session etc.]


Fig. 3. Data Fusion

B. User Identification
Web usage mining cannot rely on a single record per user to discover knowledge, because users visit the server (or send requests to it) more than once. For each visit, a new session is generated for the user. The logged activity of a user is also referred to as the user activity record [7]. There are various mechanisms for identifying users on the Web, such as cookies, embedded session IDs, and software agents. Another method is to identify users by the IP address and User Agent recorded in the log files [8] [10]. When a client sends a request to the server, the IP address and user agent of the client system are recorded in the log files. Consider the example of Fig. 4. On the left, the figure shows a portion of a partly pre-processed log file (the time stamps are given as hours and minutes only). Using a combination of the IP and Agent fields in the log file, we are able to identify the activity records of three separate users (depicted on the right) who visited the site. Sessions are captured in two ways [7, 9]:
1) Time oriented heuristics
2) Navigation oriented heuristics
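As a rough illustration of the IP + Agent method described above, the following Python sketch groups log entries by the (IP, Agent) pair. The function name `identify_users` and the tuple layout are assumptions; the sample records are taken from the log of Fig. 8 later in this paper, with "-" standing for an undefined referrer.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed entry layout: (time, ip, url, referrer, agent).
LogEntry = Tuple[str, str, str, str, str]

LOG: List[LogEntry] = [
    ("0:04", "192.168.100.101", "A", "-", "IE5; Win2k"),
    ("0:10", "192.168.100.101", "B", "A", "IE5; Win2k"),
    ("0:12", "192.168.100.102", "A", "-", "IE6;Xp"),
    ("0:15", "192.168.100.102", "B", "A", "IE6;Xp"),
    ("0:20", "192.168.100.102", "C", "B", "IE6;Xp"),
    ("0:25", "192.168.100.102", "D", "C", "IE6;Xp"),
    ("0:48", "192.168.100.101", "C", "B", "IE5; Win2k"),
    ("0:52", "192.168.100.101", "D", "C", "IE5; Win2k"),
    ("0:58", "192.168.100.102", "D", "C", "IE6;Xp"),
]


def identify_users(log: List[LogEntry]) -> Dict[Tuple[str, str], List[LogEntry]]:
    """Group entries by (IP, Agent); each group is one user activity record."""
    users: Dict[Tuple[str, str], List[LogEntry]] = defaultdict(list)
    for entry in log:
        ip, agent = entry[1], entry[4]
        users[(ip, agent)].append(entry)  # request order is preserved
    return dict(users)


users = identify_users(LOG)
```

Two requests from the same IP with different agents would land in different groups, which is the point of combining the two fields rather than using the IP address alone.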


Fig. 4. User Identification using IP + Agent

C. Sessionization
Sessionization is the process of segmenting the user activity record of each user into sessions, each representing a single visit to the site [7].

1) Time Oriented Heuristics: Time oriented heuristics are based on the time stamps, i.e. the date and time of each request in the server log file [7]. In time oriented sessionization, the following two rules are used to identify the sessions of a user:
i) The difference between the first request and the last request of a session must be <= 30 minutes.
ii) The difference between a request and the next request must be <= 10 minutes.
Using these two rules we identify time oriented sessions. In Fig. 5 below, the first request of User 2 is at time 0:12 and a later request is at 0:35; the difference between the two is <= 30 minutes and the difference between consecutive requests is <= 10 minutes, so these requests form one session.

TIME  IP               URL  REFF  AGENT
0:12  192.168.100.102  A    -     IE6;Xp
0:15  192.168.100.102  B    A     IE6;Xp
0:20  192.168.100.102  C    B     IE6;Xp
0:25  192.168.100.102  D    C     IE6;Xp
0:35  192.168.100.102  D    C     IE6;Xp
0:45  192.168.100.102  E    D     IE6;Xp
0:49  192.168.100.102  F    C     IE6;Xp
0:55  192.168.100.102  G    F     IE6;Xp

Fig. 5. Log file

On the basis of the time oriented heuristic we find two sessions in the log file of Fig. 5, shown in Fig. 6 and Fig. 7. The request at 0:45 comes 33 minutes after the first request at 0:12, which exceeds the 30-minute limit, so it starts a new session.

TIME  IP               URL  REFF  AGENT
0:12  192.168.100.102  A    -     IE6;Xp
0:15  192.168.100.102  B    A     IE6;Xp
0:20  192.168.100.102  C    B     IE6;Xp
0:25  192.168.100.102  D    C     IE6;Xp
0:35  192.168.100.102  D    C     IE6;Xp

Fig. 6. Session 1

Session 2 is as follows:

TIME  IP               URL  REFF  AGENT
0:45  192.168.100.102  E    D     IE6;Xp
0:49  192.168.100.102  F    C     IE6;Xp
0:55  192.168.100.102  G    F     IE6;Xp

Fig. 7. Session 2
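The two time oriented rules can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and it assumes the time stamps belong to a single user, are sorted, and are given as hours and minutes as in Fig. 5.

```python
from typing import List

MAX_SESSION_MINUTES = 30  # rule i: first-to-last request span
MAX_GAP_MINUTES = 10      # rule ii: gap between consecutive requests


def to_minutes(stamp: str) -> int:
    """Convert an 'h:mm' time stamp to minutes."""
    hours, minutes = stamp.split(":")
    return int(hours) * 60 + int(minutes)


def time_oriented_sessions(stamps: List[str]) -> List[List[str]]:
    """Split one user's sorted time stamps into sessions using rules i and ii."""
    sessions: List[List[str]] = []
    for stamp in stamps:
        now = to_minutes(stamp)
        if sessions:
            current = sessions[-1]
            within_total = now - to_minutes(current[0]) <= MAX_SESSION_MINUTES
            within_gap = now - to_minutes(current[-1]) <= MAX_GAP_MINUTES
            if within_total and within_gap:
                current.append(stamp)
                continue
        # Either rule is violated (or this is the first request): open a new session.
        sessions.append([stamp])
    return sessions


# Time stamps of the log file in Fig. 5.
fig5_times = ["0:12", "0:15", "0:20", "0:25", "0:35", "0:45", "0:49", "0:55"]
sessions = time_oriented_sessions(fig5_times)
```

On the Fig. 5 time stamps this split reproduces the sessions of Fig. 6 and Fig. 7: the request at 0:45 is within 10 minutes of the previous one but 33 minutes after the session start, so rule i opens the second session.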

2) Navigation Oriented Heuristics: Navigation oriented heuristics rely on the referrer fields of the server logs [7]. They assign a request to a session on the basis of whether the page must have been reached from a previous page in the same session; if the referrer is undefined, the request is kept in the session only when the time elapsed since the last request is below 10 seconds.

TIME  IP               URL  REFF  AGENT
0:04  192.168.100.101  A    -     IE5; Win2k
0:10  192.168.100.101  B    A     IE5; Win2k
0:12  192.168.100.102  A    -     IE6;Xp
0:15  192.168.100.102  B    A     IE6;Xp
0:20  192.168.100.102  C    B     IE6;Xp
0:25  192.168.100.102  D    C     IE6;Xp
0:48  192.168.100.101  C    B     IE5; Win2k
0:52  192.168.100.101  D    C     IE5; Win2k
0:58  192.168.100.102  D    C     IE6;Xp

Fig. 8. Login status for 101 and 102 IP

[Figure annotations: "IP 102 login" (twice)]

On the basis of the navigation oriented heuristic we identify one session for IP 192.168.100.102 in the log file of Fig. 8, shown in Fig. 9.

TIME  IP               URL  REFF  AGENT
0:12  192.168.100.102  A    -     IE6;Xp
0:15  192.168.100.102  B    A     IE6;Xp
0:20  192.168.100.102  C    B     IE6;Xp
0:25  192.168.100.102  D    C     IE6;Xp
0:58  192.168.100.102  D    C     IE6;Xp

Fig. 9. Session 1
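The navigation oriented heuristic can be sketched along the same lines. The Python below is a simplified illustration under stated assumptions, not the authors' implementation: it handles one user's time-sorted requests, it only tests membership in the most recent session, the function names are hypothetical, and `None` stands for an undefined referrer.

```python
from typing import List, Optional, Tuple

# Assumed request layout: (time "h:mm", url, referrer or None if undefined).
Request = Tuple[str, str, Optional[str]]

MAX_GAP_SECONDS = 10  # applies only to requests with an undefined referrer


def to_seconds(stamp: str) -> int:
    """Convert an 'h:mm' time stamp to seconds."""
    hours, minutes = stamp.split(":")
    return (int(hours) * 60 + int(minutes)) * 60


def navigation_oriented_sessions(requests: List[Request]) -> List[List[Request]]:
    """Split one user's sorted requests into sessions using the referrer field."""
    sessions: List[List[Request]] = []
    for request in requests:
        stamp, _url, referrer = request
        if sessions:
            current = sessions[-1]
            if referrer is None:
                # Undefined referrer: keep only if the last request was very recent.
                gap = to_seconds(stamp) - to_seconds(current[-1][0])
                belongs = gap < MAX_GAP_SECONDS
            else:
                # Defined referrer: it must be a page already visited in this session.
                belongs = referrer in {visited[1] for visited in current}
            if belongs:
                current.append(request)
                continue
        sessions.append([request])
    return sessions


# Requests of IP 192.168.100.102 from Fig. 8.
fig8_user_102: List[Request] = [
    ("0:12", "A", None),
    ("0:15", "B", "A"),
    ("0:20", "C", "B"),
    ("0:25", "D", "C"),
    ("0:58", "D", "C"),
]
sessions = navigation_oriented_sessions(fig8_user_102)
```

Because the request at 0:58 has referrer C, which was already visited, it stays in the same session despite the 33-minute gap, which matches the single session of Fig. 9.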

Applying the time oriented heuristic to the same requests generates two sessions, because the difference between the first and the last request is > 30 minutes, as shown in Fig. 10 and Fig. 11.

TIME  IP               URL  REFF  AGENT
0:12  192.168.100.102  A    -     IE6;Xp
0:15  192.168.100.102  B    A     IE6;Xp
0:20  192.168.100.102  C    B     IE6;Xp
0:25  192.168.100.102  D    C     IE6;Xp

Fig. 10. Session 1

TIME  IP               URL  REFF  AGENT
0:58  192.168.100.102  D    C     IE6;Xp

Fig. 11. Session 2

D. Path Completion
Because of proxy servers, and because cached versions of pages are shown when the client uses the Back button, the identified sessions can miss many pages [8]. A path completion step is therefore carried out to identify the missing pages. Path completion depends mostly on the URL and REFF fields in the server log file [7] [11] [12]. A graph model represents some relation defined on Web pages (or the Web), and each tree of the graph represents a Web site. Each node in a tree represents a Web page (HTML document); edges between trees represent the links between Web sites, while edges between nodes inside the same tree represent links between documents at a Web site. This model is also used to find missing references during path completion. A missing reference arises when a user backtracks: the page is served from the client-side cache, so the request is not recorded in the server log file.

VI. CONCLUSION

The rapid growth of the Web has resulted in a mammoth amount of Web data that is now freely offered for user access. Web mining is the application of data mining techniques to extract knowledge from Web data. In this paper, a data preprocessing system for Web usage mining has been analyzed for log data. The log data undergoes various steps: data cleaning, user identification, session identification, and path completion. In data cleaning, all unnecessary and erroneous records are removed. Users are then identified using the unique IP + Agent mechanism. User sessions are identified using either the time oriented heuristic or the navigation oriented heuristic. Finally, all missing references are identified for path completion and for understanding the navigational behaviour of the user.

REFERENCES

[1] Han, J., and Kamber, M. Data Mining: Concepts and Techniques, 2nd Edition, Elsevier.
[2] Ujwala Manoj Patil, J.B. Patil. Web Data Mining Trends & Technique. In International Conference on Advances in Computing, Communications and Informatics (ICACCI-2012), pp. 961-965, 2012.
[3] Srivastava, J., Cooley, R., Deshpande, M., and Tan, P. 2000. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. In Proceedings of: ACM SIGKDD, Vol. 1, Issue 2, pp. 12-23, Jan 2000.
[4] Nina, S. P., Rahaman, M., Bhuiyan, K., and Khandakar, E. 2009. Pattern Discovery of Web Usage Mining. In Proceedings of IEEE International Conference on Computer Technology and Development, Vol. 1, pp. 499-503, 2009.
[5] Hussain, T., Asghar, S., and Masood, N. 2010. Web Usage Mining: A Survey on Preprocessing of Web Log File. In Proceedings of: International Conference on Information and Emerging Technologies (ICIET), pp. 1-6, June 2010.
[6] Jaideep Srivastava, Prasanna Desikan, Vipin Kumar. Web Mining: Accomplishments & Future Directions.
[7] Marathe Dagadu Mitharam. Preprocessing in Web Usage Mining. In Proceedings of: International Journal of Scientific & Engineering Research, Volume 3, Issue 2, pp. 1-7, February 2012.
[8] C.P. Sumathi, R. Padmaja Valli, T. Santhanam. An Overview of Preprocessing of Web Log File for Web Usage Mining. In Proceedings of: Journal of Theoretical and Applied Information Technology, Vol. 34, No. 1, pp. 88-95, 15th December 2011.
[9] J. Vellingiri, S. Chenthur Pandian. A Novel Technique for Web Log Mining with Better Data Cleaning and Transaction Identification. In Proceedings of: Journal of Computer Science, pp. 683-689, 2011.
[10] V. Chitraa, Dr. Antony Selvadoss Davarnani. A Novel Technique for Sessions Identification in Web Usage Mining Preprocessing. In Proceedings of: International Journal of Computer Applications (0975-8887), Volume 34, No. 9, pp. 23-27, November 2011.
[11] Thanakorn Parmutha, Siriporn Chimphlee, Chom Kimpan, Parinya Sanguansat. Data Preprocessing on Web Server Log Files for Mining Users Access Patterns. In Proceedings of: International Journal of Research and Review in Wireless Communication (IJRRWC), Vol. 2, No. 2, pp. 92-98, June 2012.
[12] V. Chitraa, Dr. Antony Selvadoss Davarnani. An Efficient Path Completion Technique for Web Log Mining. In Proceedings of: 2010 IEEE International Conference on Computational Intelligence and Computing Research.
