Tiwari et al., International Journal of Advanced Research in Computer Science and Software Engineering 3(10), October - 2013, pp. 629-634

classification of Web Pages. Research activities in this area also involve methods from other disciplines such as Information Retrieval (IR) and Natural Language Processing (NLP). While a considerable amount of work exists on extracting information from images, in the areas of image processing and computer vision, the application of these techniques to Web content mining has not been very rapid.

B. Web Structure Mining
The structure of the Web graph consists of Web pages as nodes and hyperlinks as edges connecting two related pages. Web Structure Mining can be defined as the process of extracting structural information from the Web. This type of mining is further divided into two types on the basis of the structural data used.

1) Hyperlinks: A hyperlink is a structural unit that is used to connect Web pages. A hyperlink that connects a Web page to a different location within the same page is called an Intra-Document Hyperlink, whereas a hyperlink that connects two different pages is called an Inter-Document Hyperlink. There has been a large amount of work on hyperlink analysis [6].

2) Document Structure: The content of a Web page can be organized in a tree-structured format on the basis of the various HTML and XML tags within the page. Here, mining is used to extract the Document Object Model (DOM) structure from documents [6].

C. Web Usage Mining
Web Usage Mining is the application of data mining techniques to extract interesting usage patterns and knowledge from Web usage data, in order to understand and serve the needs of Web-based applications [6]. Usage data consist of the identity, origin, and browsing behaviour of Web users at a Web site. Web usage mining can be further divided on the basis of the usage data as follows:

1) Web Server Data: This data is collected in user logs at the Web server.
The user log data at a Web server include IP addresses, page references, and access times of the users.
Fig. 1 Web Mining Taxonomy

2) Application Server Data: There are various commercial application servers, e.g. WebLogic [BEA], BroadVision [BV], StoryServer [VIGN], etc. A key feature is their ability to capture various kinds of business events and record them in application server logs.

3) Application Level Data: Finally, new kinds of events can always be defined in an application, and logging can be turned on for them, generating histories of these specially defined events.

The usage data can also be split into three kinds on the basis of where it is collected: on the server side, the client side, and the proxy side. The key issue is that the server side gives an aggregate picture of the usage of a service by all users, while the client side gives a complete picture of the usage of all services by a particular client, with the proxy side being somewhere in the middle.

IV. WEB USAGE MINING PROCESS
The Web usage mining process consists of the following three inter-dependent stages: data collection and pre-processing, pattern discovery, and pattern analysis. In the preprocessing stage, the clickstream data is cleaned and divided into a set of user transactions that represent the behaviour of each user during different sessions. In the pattern discovery stage, statistical, database, and machine learning operations are executed to obtain hidden patterns revealing the usual behaviour of users, as well as summary statistics on Web resources, sessions, and users [6]. In the final stage of the process, the extracted patterns and statistics are further analysed and filtered, which results in aggregate user models that are used as input to applications such as recommendation engines, visualization tools, and Web analytics and report generation tools. The overall process is depicted in Fig. 2.

2013, IJARCSSE All Rights Reserved
Fig. 2. Web Usage Mining process

V. PREPROCESSING IN WEB USAGE MINING
Preprocessing of Web data is important in Web usage mining because of the characteristics of clickstream data and its relationship with other related data gathered from different sources. Data preparation is the most time-consuming step in the Web usage mining process, and it often requires special algorithms and heuristics not usually used in other domains. This process plays a key role in the successful extraction of useful patterns from the data. Preprocessing involves integrating data from different sources and then converting the integrated data into a form appropriate for input into specific data mining operations. Collectively, this process is referred to as data preparation. For the successful application of data mining methods to Web usage data, the preprocessing task must be correctly implemented. Fig. 2 provides a summary of the primary tasks and elements in usage data pre-processing.

A. Data Fusion & Cleaning
In some cases, multiple servers are used to reduce the load on a particular server. Data fusion is defined as the process of merging log files from various Web and application servers [7]. The data fusion process is depicted in Fig. 3. Data cleaning includes removing irrelevant and erroneous references to embedded objects [8] [9]. Data cleaning is applied when some log entries provide no useful information for analysis or data mining tasks [7].
[Figure: three "Server Logs" inputs; remaining figure content not recoverable]
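The fusion and cleaning steps described above can be illustrated with a minimal Python sketch. The record layout (`LogEntry`), the list of embedded-object suffixes, and the rule that a status code of 400 or above marks an erroneous request are all assumptions made for illustration; the paper does not prescribe them.

```python
from dataclasses import dataclass
from heapq import merge


@dataclass(frozen=True)
class LogEntry:
    time: int    # request time in minutes (hypothetical unit)
    ip: str
    url: str
    status: int  # HTTP status code


# Assumed suffixes of embedded objects; a real site would tune this list.
EMBEDDED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


def fuse(*server_logs):
    """Data fusion: merge per-server logs, each already sorted by time."""
    return list(merge(*server_logs, key=lambda entry: entry.time))


def clean(entries):
    """Data cleaning: drop embedded objects and erroneous requests."""
    return [
        entry for entry in entries
        if not entry.url.lower().endswith(EMBEDDED_SUFFIXES)
        and entry.status < 400  # assumption: >= 400 means erroneous
    ]


log_a = [LogEntry(1, "10.0.0.1", "/index.html", 200),
         LogEntry(5, "10.0.0.1", "/logo.gif", 200)]
log_b = [LogEntry(3, "10.0.0.2", "/missing.html", 404),
         LogEntry(4, "10.0.0.2", "/about.html", 200)]

fused = fuse(log_a, log_b)
cleaned = clean(fused)
```

Merging sorted streams keeps the fused log in time order, which the later sessionization step relies on.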
Fig. 4. User Identification using IP + Agent

C. Sessionization
Sessionization is the process of segmenting the activity record of each user into sessions, each representing a single visit to the site [7].

1) Time Oriented Heuristics: Time oriented heuristics are based on the time stamps (date and time of request) in the server log file [7]. The following two rules are used to identify a user session:
i) The difference between the first request and the last request must be <= 30 minutes.
ii) The difference between each request and the next request must be <= 10 minutes.
Time oriented sessions are judged using these two rules. In Fig. 5 below, the first request of User 2 is at time 0:12 and the last request is at 0:35; the difference between the two (first and last) is <= 30 minutes and the difference between consecutive requests is <= 10 minutes, so these requests form one session.

TIME  IP               URL  REFF  AGENT
0:12  192.168.100.102  A    -     IE6;Xp
0:15  192.168.100.102  B    A     IE6;Xp
0:20  192.168.100.102  C    B     IE6;Xp
0:25  192.168.100.102  D    C     IE6;Xp
0:35  192.168.100.102  D    C     IE6;Xp
0:45  192.168.100.102  E    D     IE6;Xp
0:49  192.168.100.102  F    C     IE6;Xp
0:55  192.168.100.102  G    F     IE6;Xp
Fig. 5. Log file

On the basis of the time oriented heuristic, two sessions are found in the Fig. 5 log file, shown in Fig. 6 and Fig. 7. The request at 0:45 starts a new session because it falls 33 minutes after the first request at 0:12, which violates rule i).

TIME  IP               URL  REFF  AGENT
0:12  192.168.100.102  A    -     IE6;Xp
0:15  192.168.100.102  B    A     IE6;Xp
0:20  192.168.100.102  C    B     IE6;Xp
0:25  192.168.100.102  D    C     IE6;Xp
0:35  192.168.100.102  D    C     IE6;Xp

Fig. 6. Session 1

Session 2 is as follows:

TIME  IP               URL  REFF  AGENT
0:45  192.168.100.102  E    D     IE6;Xp
0:49  192.168.100.102  F    C     IE6;Xp
0:55  192.168.100.102  G    F     IE6;Xp
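As a rough illustration of the time oriented heuristic, the following Python sketch applies the two rules (total session span <= 30 minutes, gap between consecutive requests <= 10 minutes) to the time stamps of the Fig. 5 log file. The function names and the assumption that the time stamps are in hours:minutes and already sorted for a single user are ours, not the paper's.

```python
def to_minutes(hhmm):
    """Convert an 'h:mm' time stamp into minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def time_sessions(times, max_session=30, max_gap=10):
    """Split one user's sorted request times (minutes) into sessions."""
    sessions = []
    for t in times:
        current = sessions[-1] if sessions else None
        if (current is not None
                and t - current[-1] <= max_gap        # rule ii
                and t - current[0] <= max_session):   # rule i
            current.append(t)
        else:
            sessions.append([t])  # start a new session
    return sessions


fig5_times = [to_minutes(t) for t in
              ["0:12", "0:15", "0:20", "0:25", "0:35", "0:45", "0:49", "0:55"]]
sessions = time_sessions(fig5_times)
```

With the Fig. 5 time stamps this yields two sessions, the first covering 0:12 to 0:35 and the second covering 0:45 to 0:55, in line with the two sessions reported in the text.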
2) Navigation Oriented Heuristics: Navigation oriented heuristics use the navigation information captured in the referrer fields of the server logs [7]. They assign a request to the current session if its page was reached from a previous page in the same session. The exception is a request whose referrer is undefined: it still belongs to the current session if the time elapsed since the last request is below 10 seconds.

TIME  IP               URL  REFF  AGENT
0:04  192.168.100.101  A    -     IE5; Win2k
0:10  192.168.100.101  B    A     IE5; Win2k
0:12  192.168.100.102  A    -     IE6;Xp
0:15  192.168.100.102  B    A     IE6;Xp
0:20  192.168.100.102  C    B     IE6;Xp
0:25  192.168.100.102  D    C     IE6;Xp
0:48  192.168.100.101  C    B     IE5; Win2k
0:52  192.168.100.101  D    C     IE5; Win2k
0:58  192.168.100.102  D    C     IE6;Xp

Fig. 8 Login status for 101 and 102 IP
[Fig. 8 annotation, appearing twice: "IP 102 login"]
On the basis of the navigation oriented heuristic, the requests from IP 192.168.100.102 in the Fig. 8 log file form one session, shown in Fig. 9.
URL  REFF
A    -
B    A
C    B
D    C
D    C

Fig. 9. Session 1
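The sketch below illustrates, under stated assumptions, how the Fig. 8 log could be processed: requests are first grouped per user by the IP + Agent pair, and each user's requests are then sessionized with the navigation oriented rule. The dictionary layout, the helper names, the reading of the time stamps as hours:minutes, and the use of `None` for an undefined referrer (shown as "-" in the figure) are illustrative choices rather than the authors' implementation.

```python
from collections import defaultdict


def row(hhmm, host, url, ref, agent):
    """Build one hypothetical log record; time is stored in seconds."""
    hours, minutes = hhmm.split(":")
    return {"time": (int(hours) * 60 + int(minutes)) * 60,
            "ip": "192.168.100." + host, "url": url,
            "ref": ref, "agent": agent}


def identify_users(log):
    """User identification: one user per distinct (IP, agent) pair."""
    users = defaultdict(list)
    for entry in log:
        users[(entry["ip"], entry["agent"])].append(entry)
    return dict(users)


def navigation_sessions(requests, max_gap_seconds=10):
    """Navigation oriented sessionization for one user's sorted requests."""
    sessions = []
    for req in requests:
        current = sessions[-1] if sessions else None
        if current is not None:
            visited = {r["url"] for r in current}
            gap = req["time"] - current[-1]["time"]
            linked = req["ref"] in visited  # reached from a session page
            quick = req["ref"] is None and gap < max_gap_seconds
            if linked or quick:
                current.append(req)
                continue
        sessions.append([req])  # otherwise start a new session
    return sessions


fig8_log = [
    row("0:04", "101", "A", None, "IE5; Win2k"),
    row("0:10", "101", "B", "A", "IE5; Win2k"),
    row("0:12", "102", "A", None, "IE6;Xp"),
    row("0:15", "102", "B", "A", "IE6;Xp"),
    row("0:20", "102", "C", "B", "IE6;Xp"),
    row("0:25", "102", "D", "C", "IE6;Xp"),
    row("0:48", "101", "C", "B", "IE5; Win2k"),
    row("0:52", "101", "D", "C", "IE5; Win2k"),
    row("0:58", "102", "D", "C", "IE6;Xp"),
]

users = identify_users(fig8_log)
sessions_102 = navigation_sessions(users[("192.168.100.102", "IE6;Xp")])
urls_102 = [[r["url"] for r in s] for s in sessions_102]
```

Because the last request from IP 192.168.100.102 (page D at 0:58) has referrer C, which was already visited, it stays in the same session even though a long time has passed, which is where this heuristic differs from the time oriented one.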
Using the time oriented heuristic on the same requests generates two sessions instead, because the difference between the first and last request is > 30 minutes, as shown in Fig. 10 and Fig. 11.

TIME  IP               URL  REFF  AGENT
0:12  192.168.100.102  A    -     IE6;Xp
0:15  192.168.100.102  B    A     IE6;Xp
0:20  192.168.100.102  C    B     IE6;Xp
0:25  192.168.100.102  D    C     IE6;Xp
Fig. 11. Session 2

D. Path completion
Because of proxy servers, and because pages revisited with the Back button are served from cached copies on the client, the identified sessions have many missing pages [8]. A path completion step is therefore carried out to identify the missing pages. Path completion depends mostly on the URL and REFF fields in the server log file [7] [11] [12]. A graph model represents some relation defined on Web pages (or the Web), and each tree of the graph represents a Web site. Each node in a tree represents a Web page (HTML document); edges between trees represent the links between Web sites, while edges between nodes inside the same tree represent links between documents at a Web site. This model is also used to find missing references during path completion. A missing reference arises when the user backtracks: the request is not recorded in the server log file because the page is served from the client-side cache.

VI. CONCLUSION
The rapid growth of the Web has resulted in a mammoth amount of Web data that is now freely offered for user access. Web mining is the application of data mining techniques to extract knowledge from Web data. In this paper, a data preprocessing system for Web usage mining has been analyzed for log data. The preprocessing goes through several steps: data cleaning, user identification, session identification, and path completion. In data cleaning, all unnecessary and erroneous records are removed. Then users are identified using the unique IP + Agent mechanism. User sessions are identified using either the time oriented heuristic or the navigation oriented heuristic. Finally, all missing references are identified for path completion and for understanding the navigational behaviour of the user.

REFERENCES
[1] Han, J., and Kamber, M. Data Mining: Concepts and Techniques, 2nd Edition, Elsevier.
[2] Ujwala Manoj Patil, J.B. Patil. Web Data Mining Trends & Technique. In International Conference on Advances in Computing, Communications and Informatics (ICACCI-2012), pp. 961-965, 2012.
[3] Srivastava, J., Cooley, R., Deshpande, M., and Tan, P. 2000. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. In Proceedings of: ACM SIGKDD, Vol. 1, Issue 2, pp. 12-23, Jan 2000.
[4] Nina, S. P., Rahaman, M., Bhuiyan, K., and Khandakar, E. 2009. Pattern Discovery of Web Usage Mining. In Proceedings of IEEE International Conference on Computer Technology and Development, Vol. 1, pp. 499-503, 2009.
[5] Hussain, T., Asghar, S., and Masood, N. 2010. Web Usage Mining: A Survey on Preprocessing of Web Log File. In Proceedings of: International Conference on Information and Emerging Technologies (ICIET), pp. 1-6, June 2010.
[6] Jaideep Srivastava, Prasanna Desikan, Vipin Kumar. Web Mining Accomplishments & Future Directions.
[7] Marathe Dagadu Mitharam. Preprocessing in Web Usage Mining. In Proceedings of International Journal of Scientific & Engineering Research, Volume 3, Issue 2, pp. 1-7, February 2012.
[8] C.P. Sumathi, R. Padmaja Valli, T. Santhanam. An Overview of Preprocessing of Web Log File for Web Usage Mining. In Proceedings of: Journal of Theoretical and Applied Information Technology, Vol. 34, No. 1, pp. 88-95, 15th December 2011.
[9] J. Vellingiri, S. Chenthur Pandian. A Novel Technique for Web Log Mining with Better Data Cleaning and Transaction Identification. In Proceedings of: Journal of Computer Science, pp. 683-689, 2011.
[10] V. Chitraa, Dr. Antony Selvadoss Davarnani. A Novel Technique for Sessions Identification in Web Usage Mining Preprocessing. In Proceedings of: International Journal of Computer Applications (0975 8887), Volume 34, No. 9, pp. 23-27, November 2011.
[11] Thanakorn Parmutha, Siriporn Chimphlee, Chom Kimpan, Parinya Sanguansat. Data Preprocessing on Web Server Log Files for Mining Users Access Patterns. In: Proceedings of International Journal of Research and Review in Wireless Communication (IJRRWC), Vol. 2, No. 2, pp. 92-98, June 2012.
[12] V. Chitraa, Dr. Antony Selvadoss Davarnani. An Efficient Path Completion Technique for Web Log Mining. In: Proceedings of 2010 IEEE International Conference on Computational Intelligence and Computing Research.