
CHAPTER 1

INTRODUCTION
1.1 OVERVIEW
In recent years the database research field has concentrated on XML
(eXtensible Markup Language) as a flexible hierarchical model suitable to
represent huge amounts of data with no absolute and fixed schema, and a
possibly irregular and incomplete structure. There are two main approaches to
XML document access: keyword-based search and query-answering. The first
one comes from the tradition of information retrieval, where most searches are
performed on the textual content of the document; this means that no advantage
is derived from the semantics conveyed by the document structure.
As for query-answering, since query languages for semistructured data
rely on the document structure to convey its semantics, in order for query
formulation to be effective users need to know this structure in advance, which
is often not the case. In fact, it is not mandatory for an XML document to have a
defined schema: 50% of the documents on the web do not possess one. When
users specify queries without knowing the document structure, they may fail to
retrieve information which was there, but under a different structure. This
limitation is a crucial problem which did not emerge in the context of relational
database management systems.
Frequent, dramatic outcomes of this situation are either the information
overload problem, where too much data are included in the answer because the
set of keywords specified for the search captures too many meanings, or the
information deprivation problem, where either the use of inappropriate
keywords, or the wrong formulation of the query, prevent the user from
receiving the correct answer.
This paper addresses the need to get the gist of a document, both in terms of content and structure, before querying it. Discovering recurrent patterns inside XML documents provides high-quality knowledge about the document content.
Frequent patterns are in fact intensional information about the data
contained in the document itself, that is, they specify the document in terms of a
set of properties rather than by means of data. As opposed to the detailed and
precise information conveyed by the data, this information is partial and often
approximate, but synthetic, and concerns both the document structure and its
content.
In particular, the idea of mining association rules to provide summarized
representations of XML documents has been investigated in many proposals
either by using languages (e.g. XQuery) and techniques developed in the XML
context, or by implementing graph- or tree-based algorithms.
In this paper I introduce a proposal for mining and storing TARs (Tree-based Association Rules) as a means to represent intensional knowledge in native XML. Intuitively, a TAR represents intensional knowledge in the form SB → SH, where SB is the body tree and SH the head tree of the rule, and SB is a subtree of SH. The rule SB → SH states that, if the tree SB appears in an XML document D, it is likely that the wider tree SH also appears in D. Graphically, we render the nodes of the body of a rule as black circles, and the nodes of the head as empty circles.
TARs can be queried to obtain fast, although approximate, answers. This
is particularly useful not only when quick answers are needed but also when the
original documents are unavailable. In fact, once extracted, TARs can be stored
in a (smaller) document and be accessed independently of the dataset they were
extracted from. Summarizing, TARs are extracted for two main purposes: to get a concise idea of both the structure and the content of an XML document, and to use them for intensional query answering, that is, allowing the user to query the extracted TARs rather than the original document. By querying such summaries, investigators obtain initial knowledge about specific

entities in the vast dataset, and are able to devise more specific queries for
deeper investigation. An important side-effect of using such a technique is that
only the most promising specific queries are issued towards the integrated data,
dramatically reducing time and cost.

1.2 PROBLEM DEFINITION
For query-answering, since query languages for semistructured data rely
on the document structure to convey its semantics, in order for query
formulation to be effective users need to know this structure in advance, which is often not the case. In fact, it is not mandatory for an XML document to have a defined
schema: 50% of the documents on the web do not possess one. When users
specify queries without knowing the document structure, they may fail to
retrieve information which was there, but under a different structure. This
limitation is a crucial problem which did not emerge in the context of relational
database management systems.
Frequent, dramatic outcomes of this situation are either the information
overload problem, where too much data are included in the answer because the
set of keywords specified for the search captures too many meanings, or the
information deprivation problem, where either the use of inappropriate
keywords, or the wrong formulation of the query, prevent the user from
receiving the correct answer.

1.3 OBJECTIVE
This project provides a method for deriving intensional knowledge from XML documents in the form of TARs, and then storing these TARs as an alternative, synthetic dataset to be queried to provide quick and summarized answers.

The procedure is characterized by the following key aspects:
a) It works directly on the XML documents, without transforming the data into any intermediate format.
b) It looks for general association rules, without the need to impose what should be contained in the antecedent and consequent of the rule.
c) It stores association rules in XML format.
d) It translates queries on the original dataset into queries on the TARs set.
The aim of our proposal is to provide a way to use intensional knowledge as a substitute for the original document during querying, not to improve the execution time of queries over the original XML dataset.























CHAPTER 2
LITERATURE SURVEY

A literature survey is an important step in the software development process. Before developing the tool it is necessary to determine the time factor, economy and company strength. Once these things are satisfied, the next steps are to determine which operating system and language can be used for developing the tool. Once the programmers start building the tool, they need a lot of external support. This support can be obtained from senior programmers, from books or from websites.
2.1 Answering XML queries by means of data summaries
E. Baralis, P. Garza, E. Quintarelli, and L. Tanca, published in 2007.
The idea of using association rules as summarized representations of XML documents was also introduced in this work, which is based on the extraction of rules on both the structure (schema patterns) and the content (instance patterns) of XML datasets. The limitations of this approach are: i) the root of the rule is established a-priori, and ii) the patterns, used to describe general properties of the schema applying to all instances, are not mined but derived as an abstraction of similar instance patterns, and are therefore less precise and reliable.

2.2 Relational computation for mining association rules from XML data
H. C. Liu and J. Zeleznikow, published in 2005.
This approach uses a fixpoint operator which works only on the relational format. In this technique, the XML document is first preprocessed to transform it into an object-relational database.
2.3 A new method for mining association rules from a collection of XML documents
J. Paik, H. Y. Youn, and U. M. Kim, published in 2005.
This is an approach to extract association rules, i.e. to mine all frequent rules, without any a-priori knowledge of the XML dataset. The paper introduced HoPS, an algorithm for extracting association rules from a set of XML documents. Such rules are called XML association rules and are implications of the form X → Y, where X and Y are fragments of an XML document. The two trees X and Y have to be disjoint; moreover, both X and Y are embedded subtrees of the XML documents, which means that they do not always represent the actual structure of the data. Another limitation of the proposal is that it does not consider the possibility of mining general association rules within a single XML dataset; achieving this feature is one of our goals.

2.4 Extracting association rules from XML documents using XQuery
J. W. W. Wan and G. Dobbie, published in 2003.
They use XQuery to extract association rules from simple XML documents, proposing a set of functions, written in XQuery, which implement the Apriori algorithm. This approach performs well on simple XML documents but it is very difficult to apply to complex XML documents with an irregular structure. By contrast, we have used the PathJoin algorithm to find frequent subtrees in XML documents; this algorithm has exponential complexity, which is a disadvantage.

2.5 Discovering interesting information in XML data with association rules
D. Braga, A. Campi, S. Ceri, M. Klemettinen, and P. Lanzi, published in 2003.
This work extends XQuery with data mining and knowledge discovery capabilities by introducing XMINE RULE, an operator for mining association rules from native XML documents. They formalize the syntax and semantics of the operator and propose some examples of complex association rules.



CHAPTER 3
SYSTEM ANALYSIS

3.1 EXISTING SYSTEM
Extracting information from semi-structured documents is a very hard task, and it is going to become more and more critical as the amount of digital information available on the Internet grows. Indeed, documents are often so large that the dataset returned as the answer to a query may be too big to convey interpretable knowledge. The existing version of the TAR extraction algorithm, introduced in earlier work, was based on PathJoin and CMTreeMiner to mine frequent subtrees from XML documents.


3.2 PROPOSED SYSTEM
This work describes an approach based on Tree-based Association Rules (TARs): mined rules which provide approximate, intensional information on both the structure and the contents of XML documents, and which can be stored in XML format as well. TARs can be queried to obtain fast, although approximate, answers.
This is particularly useful not only when quick answers are needed but
also when the original documents are unavailable. In fact, once extracted, TARs
can be stored in a (smaller) document and be accessed independently of the
dataset they were extracted from. Summarizing, TARs are extracted for two main purposes: 1) to get a concise idea of both the structure and the content of an XML document, and 2) to use them for intensional query answering.




The intensional information embodied in TARs provides a valid support in
several cases:

(i) When a user faces a dataset for the first time, s/he does not know its features, and frequent patterns provide a way to quickly understand what is contained in the dataset.

(ii) Besides intrinsically unstructured documents, there is a significant portion
of XML documents which have some structure, but only implicitly, that is, their
structure has not been declared via a DTD or an XML-Schema. Since most
work on XML query languages has focused on documents having a known
structure, querying the above-mentioned documents is quite difficult because
users have to guess the structure to specify the query conditions correctly. TARs
represent a dataguide that helps users to be more effective in query formulation.

(iii) It supports query optimization design, first of all because recurrent structures can be used for physical query optimization, to support the construction of indexes and the design of efficient access methods for frequent queries, and also because frequent patterns allow the discovery of hidden integrity constraints, which can be used for semantic optimization.
(iv) For privacy reasons, a document answer might expose a controlled set of TARs instead of the original document, as a summarized view that masks sensitive details.








CHAPTER 4
PROJECT DESCRIPTION
4.1 ARCHITECTURE

Figure 4.1
The purpose of this framework is to perform data mining on XML and obtain intensional knowledge. The intensional knowledge is also in the form of XML: rules with support and confidence values. In other words, the result of data mining is TARs (Tree-based Association Rules). The framework provides data mining support for XML query answering.
When an XML file is given as input, a DOM parser checks it for well-formedness and validity. If the given XML document is valid, it is parsed and loaded into a DOM object which can be navigated easily. The parsed XML file is given to the data mining subsystem, which is responsible for subtree generation and TAR extraction. The generated TARs are used by the Query Processor subsystem. This module takes an XML query from the end user and makes use of the mined knowledge to answer the query quickly.
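As an illustrative sketch of this parsing step (Python's xml.dom.minidom stands in here for the project's .NET DOM parser, and the sample document is ours), a well-formed XML file is loaded into a DOM object and navigated:

```python
from xml.dom.minidom import parseString

# parseString raises an ExpatError if the document is not well-formed
doc = parseString("<library><book><title>XML Mining</title></book></library>")

root = doc.documentElement
# navigate the DOM object: collect the text of every <title> under a <book>
titles = [b.getElementsByTagName("title")[0].firstChild.data
          for b in root.getElementsByTagName("book")]
```

A document that fails this check raises an exception instead of being handed to the mining subsystem.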

4.2 TREE-BASED ASSOCIATION RULES
Association rules describe the co-occurrence of data items in a large amount of collected data and are represented as implications of the form X → Y, where X and Y are two arbitrary sets of data items such that X ∩ Y = ∅. The quality of an association rule is measured by means of support and confidence. Support corresponds to the frequency of the set X ∪ Y in the dataset, while confidence corresponds to the conditional probability of finding Y, having found X, and is given by supp(X ∪ Y)/supp(X). In this work we extend the notion of association rule, introduced in the context of relational databases, to adapt it to the hierarchical nature of XML documents.
Following the Infoset conventions, we represent an XML document as a tree ⟨N, E, r, ℓ, c⟩ where N is the set of nodes, r ∈ N is the root of the tree, E is the set of edges, ℓ : N → L is the label function which returns the tag of nodes, and c : N → C ∪ {⊥} is the content function which returns the content of nodes. We consider the element-only Infoset content model [28], where XML nonterminal tags include only other elements and/or attributes, while the text is confined to terminal elements. We are interested in finding relationships among subtrees of XML documents.
Thus, since both textual content of leaf elements and values of attributes
convey content, we do not distinguish between them. As a consequence, for
the sake of readability, we do not report the edge label and the node type label
in the figures. Attributes and elements are characterized by empty circles,

whereas the textual content of elements, or the value of attributes, is reported
under the outgoing edge of the element or attribute it refers to.
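A minimal sketch of this tree model follows; the encoding of N, E, r and of the label and content functions as Python values is our own choice for illustration:

```python
import xml.etree.ElementTree as ET

def to_tree(elem):
    """Build (N, E, r, label, content) from an ElementTree element.
    Node ids are assigned in document order; content is None (playing
    the role of the empty content) for non-leaf elements, following
    the element-only content model."""
    nodes, edges, label, content = [], [], {}, {}

    def visit(e):
        nid = len(nodes)
        nodes.append(nid)
        label[nid] = e.tag
        text = (e.text or "").strip()
        content[nid] = text if (text and len(e) == 0) else None
        for child in e:
            edges.append((nid, visit(child)))
        return nid

    root_id = visit(elem)
    return nodes, edges, root_id, label, content

N, E, r, lab, c = to_tree(ET.fromstring("<A><B>1</B><C/></A>"))
```

Here the label function is total, while the content function returns a value only for the leaf element B, mirroring the convention that text is confined to terminal elements.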

4.3 FUNDAMENTAL CONCEPTS
Given two trees T = ⟨NT, ET, rT, ℓT, cT⟩ and S = ⟨NS, ES, rS, ℓS, cS⟩, S is an induced subtree of T if and only if there exists a mapping θ : NS → NT such that for each node ni ∈ NS, ℓT(nj) = ℓS(ni) and cT(nj) = cS(ni), where θ(ni) = nj, and for each edge e = (n1, n2) ∈ ES, (θ(n1), θ(n2)) ∈ ET. Without loss of generality, we do not consider namespaces, ordering labels, the referencing formalism through ID-IDREF attributes, URIs, links and entity nodes, because they are not relevant to the present work.
Moreover, S is a rooted subtree of T if and only if S is an induced subtree of T and rS = rT. Given a tree T = ⟨NT, ET, rT, ℓT, cT⟩, a subtree t = ⟨Nt, Et, rt, ℓt, ct⟩ of T, and a user-fixed support threshold smin: (i) t is frequent if its support is greater than or equal to smin; (ii) t is maximal if it is frequent and none of its proper supertrees is frequent; (iii) t is closed if none of its proper supertrees has support greater than that of t.
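The rooted-subtree relation can be illustrated with a small recursive check. This is our own sketch: trees are encoded as (label, children) pairs, node contents are ignored for brevity, and an injective, label-preserving mapping of children is searched with backtracking:

```python
def is_rooted_subtree(s, t):
    """True if tree s is a rooted (induced) subtree of tree t: the roots
    match and each child of s maps to a distinct child of t, recursively."""
    if s[0] != t[0]:
        return False

    def match(s_children, t_children):
        if not s_children:
            return True
        first, rest = s_children[0], s_children[1:]
        for i, tc in enumerate(t_children):
            # try mapping `first` onto t-child `tc`, then match the rest
            if is_rooted_subtree(first, tc) and \
               match(rest, t_children[:i] + t_children[i + 1:]):
                return True
        return False

    return match(s[1], t[1])

T = ("A", [("B", [("C", [])]), ("D", [])])
S = ("A", [("B", [])])
```

Here S is a rooted subtree of T, while a tree rooted at B is not, since a rooted subtree must share T's root.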
A Tree-based Association Rule (TAR) is a tuple of the form Tr = ⟨SB, SH, sTr, cTr⟩, where SB = ⟨NB, EB, rB, ℓB, cB⟩ and SH = ⟨NH, EH, rH, ℓH, cH⟩ are trees and sTr and cTr are real numbers in the interval [0,1] representing the support and confidence of the rule, respectively. A TAR describes the co-occurrence of the two trees SB and SH in an XML document. For the sake of readability we shall often use the short notation SB → SH; SB is called the body or antecedent of Tr, while SH is the head or consequent of the rule.
Furthermore, SB is a subtree of SH with an additional property on the node labels: the set of tags of SB is contained in the set of tags of SH with the addition of the empty label ⊥, that is, ℓSB(NSB) ⊆ ℓSH(NSH) ∪ {⊥}. The empty label is introduced because the body of a rule may contain nodes with unspecified tags, that is, blank nodes.

A rooted TAR (RTAR) is a TAR such that SB is a rooted subtree of SH; an extended TAR (ETAR) is a TAR such that SB is an induced subtree of SH.
Let count(S, D) denote the number of occurrences of a subtree S in the tree D and cardinality(D) denote the number of nodes of D. We formally define the support of a TAR SB → SH as count(SH, D)/cardinality(D) and its confidence as count(SH, D)/count(SB, D). Notice that TARs, in addition to associations between data values, also provide information about the structure of frequent portions of XML documents.
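Under these definitions, the two measures reduce to simple ratios once the occurrence counts are available; a minimal sketch follows (the counting itself is assumed to be done by the subtree miner):

```python
def tar_measures(count_head, count_body, doc_cardinality):
    """Support and confidence of a TAR SB -> SH as defined above:
    support    = count(SH, D) / cardinality(D)
    confidence = count(SH, D) / count(SB, D)"""
    return count_head / doc_cardinality, count_head / count_body

# e.g. SH occurs 6 times, SB occurs 8 times, and D has 100 nodes
supp, conf = tar_measures(6, 8, 100)
```

Since SB is a subtree of SH, count(SB, D) is never smaller than count(SH, D), so confidence always lies in [0,1].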
Thus they are more expressive than classical association rules, which only provide frequent correlations of flat values. It is worth pointing out that TARs are different from XML association rules as defined in [24] because, given a rule X → Y where both X and Y are subtrees of an XML document, that paper requires X and Y to be disjoint; on the contrary, TARs require X to be an induced subtree of Y. Given an XML document, we extract two types of TARs:
A TAR is a structure TAR (sTAR) if and only if, for each node n contained in SH, cH(n) = ⊥; that is, no data value is present in sTARs, i.e. they provide information only on the structure of the document.
A TAR, SB → SH, is an instance TAR (iTAR) if and only if SH contains at least one node n such that cH(n) ≠ ⊥; that is, iTARs provide information both on the structure and on the data values contained in a document. According to the definitions above we have: structure-Rooted-TARs (sRTARs), structure-Extended-TARs (sETARs), instance-Rooted-TARs (iRTARs) and instance-Extended-TARs (iETARs).
Since TARs provide an approximate view of both the content and the structure
of an XML document, (1) sTARs can be used as an approximate DataGuide of
the original document, to help users formulate queries; (2) iTARs can be used to
provide intensional, approximate answers to user queries.
Consider a sample XML document and some sTARs mined from it. Rules (1) and (3) are rooted sTARs, and rule (2) is an extended sTAR. Rule (1) states that, if there is a node labeled A in the document, with an 86% probability that node has a child labeled B. Rule (2) states that, if there is a node labeled B, with a 75% probability its parent is labeled A. Finally, rule (3) states that, if a node A is the grandparent of a node C, with a 75% probability the child of A and parent of C is labeled B.
By observing sTARs users can guess the structure of an XML document, and thus use this approximate schema to formulate a query when no DTD or schema is available: like DataGuides, sTARs represent a concise structural summary of XML documents.

4.4 ALGORITHM
TAR mining is a process composed of two steps: 1) mining frequent subtrees, that is, subtrees with a support above a user-defined threshold, from the XML document; 2) computing interesting rules, that is, rules with a confidence above a user-defined threshold, from the frequent subtrees.
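The two steps can be sketched as a generic driver; the frequent-subtree miner and the rule generator are treated as black boxes, and the callables and the rule representation below are placeholders of ours, not the project's actual interfaces:

```python
def mine_tars(document, minsupp, minconf, mine_frequent_subtrees, compute_rules):
    """Step 1: mine frequent subtrees from the document.
    Step 2: keep only the rules whose confidence reaches minconf."""
    tars = []
    for subtree in mine_frequent_subtrees(document, minsupp):
        tars.extend(r for r in compute_rules(subtree) if r["conf"] >= minconf)
    return tars

# toy stand-ins to exercise the driver
fake_miner = lambda doc, smin: ["subtree-1"]
fake_rules = lambda t: [{"rule": t, "conf": 0.9}, {"rule": t, "conf": 0.2}]
result = mine_tars("D", 0.1, 0.5, fake_miner, fake_rules)
```

Any frequent-subtree algorithm (e.g. CMTreeMiner) can be plugged in as the first callable without changing the driver.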


Algorithm 1 presents our extension to a generic frequent-subtree mining algorithm in order to compute interesting TARs. The inputs of Algorithm 1 are the XML document D, the threshold for the support of the frequent subtrees, minsupp, and the threshold for the confidence of the rules, minconf. Algorithm 1 finds frequent subtrees and then hands each of them over to a function that computes all the possible rules.
Depending on the number of frequent subtrees and their cardinality, the amount of rules generated by a naive Compute-Rules function may be very high. Given a subtree with n nodes, we could generate 2^n − 2 rules, making the algorithm exponential. This explosion occurs in the relational context too; thus, based on similar considerations, it is possible to state the following property, which allows us to propose the optimized version of Compute-Rules shown in Function 2.




If the confidence of a rule SB → SH is below the established threshold minconf, then the confidence of every other rule SBi → SHi, such that its body SBi is an induced subtree of the body SB, is no greater than minconf. Consider a frequent subtree and three possible TARs mined from the tree; all three rules have the same support k, and their confidence is to be determined.
Let the support of the body tree of rule (1) be s. Since the body trees of rules (2) and (3) are subtrees of the body tree of rule (1), their support is at least s, and possibly higher. This means that the confidences of rules (2) and (3) are equal to, or lower than, the confidence of rule (1).
In Function 2, TARs are mined by generating first the rules with the highest number of nodes in the body tree. Consider two rules Tr1 and Tr2 whose body trees contain one and three nodes respectively; suppose both rules have confidence below the fixed threshold. If the algorithm considers rule Tr2 first, all rules whose bodies are induced subtrees of Tr2's body will be discarded when Tr2 is eliminated.
Therefore, it is more convenient to generate rule Tr2 first and, in general, to start the mining process from the rules with a larger body. Using this solution we can lower the complexity of the algorithm, though not enough to make it perform better than exponentially. However, notice that the process of deriving TARs from XML documents is only done periodically.
Since intensional knowledge represents frequent information, it is desirable to update it only after a large number of updates have been made on the original document. Therefore, in the case of stable documents the algorithm has to be applied a few times or only once.
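The largest-body-first strategy with pruning can be sketched as follows. This is our simplified rendition of Function 2: rule bodies are reduced to sets of node labels and subtree containment to set inclusion, assumptions made only for brevity:

```python
from itertools import combinations

def compute_rules(subtree_nodes, support, body_support, minconf):
    """Generate rule bodies largest-first; once a body fails the confidence
    threshold, every body contained in it is pruned without computing its
    confidence. Bodies are frozensets of labels; body_support maps each
    candidate body to its support."""
    n = sorted(subtree_nodes)
    discarded, rules = [], []
    # proper non-empty subsets of the subtree's nodes, largest first
    for size in range(len(n) - 1, 0, -1):
        for body in map(frozenset, combinations(n, size)):
            if any(body <= bad for bad in discarded):
                continue                      # pruned by the property above
            conf = support / body_support[body]
            if conf >= minconf:
                rules.append((body, conf))
            else:
                discarded.append(body)
    return rules

bs = {frozenset(x): s for x, s in
      [(("A",), 10), (("B",), 8), (("C",), 9),
       (("A", "B"), 6), (("A", "C"), 5), (("B", "C"), 5)]}
rules = compute_rules({"A", "B", "C"}, 5, bs, 0.9)
```

In this toy run the body {A, B} fails the threshold, so its sub-bodies {A} and {B} are skipped without any confidence computation, which is exactly what the property above licenses.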
Once the mining process has finished and frequent TARs have been extracted, they are stored in XML format. This decision was taken to allow the use of the same language, such as XQuery, for querying both the original dataset and the mined rules. Each rule is saved inside a <rule> element which contains three attributes for the ID, support and confidence of the rule.
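A hedged sketch of this storage format follows; only the <rule> element with its ID, support and confidence attributes is stated above, so the enclosing <TARs> element and the omission of the serialized rule trees are our simplifications:

```python
import xml.etree.ElementTree as ET

def rules_to_xml(rules):
    """Serialize mined TARs, one <rule> element per rule, so that the
    result can be queried with the same language as the original data."""
    root = ET.Element("TARs")
    for i, (supp, conf) in enumerate(rules, start=1):
        ET.SubElement(root, "rule", {"id": str(i),
                                     "support": str(supp),
                                     "confidence": str(conf)})
    return ET.tostring(root, encoding="unicode")

xml_out = rules_to_xml([(0.06, 0.75), (0.1, 0.9)])
```

The resulting document can then be loaded and filtered with XPath/XQuery predicates on the support and confidence attributes.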








CHAPTER 5
SYSTEM REQUIREMENTS

5.1 SOFTWARE REQUIREMENTS:
Language : ASP.NET, C#.NET
Technologies : Microsoft.NET Framework
IDE : Visual Studio 2008
Operating System : Microsoft Windows XP SP2 or Later Version
Backend : XML

5.2 HARDWARE REQUIREMENTS:
Processor : Intel Pentium or more
RAM : 512 MB (Minimum)
Processor Speed : 3.06 GHz
Hard Disk Drive : 250 GB
Floppy Disk Drive : Sony
CD-ROM Drive : Sony
Monitor : 17 inches
Keyboard : TVS Gold
Mouse : Logitech









CHAPTER 6
MODULE DESCRIPTION
Implementation is the stage of the project when the theoretical design is turned into a working system. Thus it can be considered the most critical stage in achieving a successful new system and in giving the user confidence that the new system will work and be effective.
The implementation stage involves careful planning, investigation of the
existing system and its constraints on implementation, designing of methods to
achieve changeover and evaluation of changeover methods.

1. Admin
2. User
3. Xml Query Answering

6.1 Admin:
The admin maintains the total information about the whole application, and maintains the data in XML format only. Only the admin has the authority to create user names and passwords for new users. This is mainly for security, so that no one can misuse the system.






Figure 6.1
We can retrieve the data as a whole file or retrieve some specific data as needed. The system also maintains the table of registered users for further use. The admin has a retrieve module which is used to retrieve the whole XML document. If needed, a specific root node can be retrieved from the whole document.


Figure 6.2
6.2 User:
Users search with queries and get the reply in XML format. Users are provided with a unique id and password, supplied by the administrator for security purposes. A user can log in with this id and password to accomplish two tasks. In the first, data is created using Create Data, and then the subnode and all child nodes are filled with data by the user. This stores the data in XML format.

Figure 6.3

It is also designed in such a way that data with the same root node is appended to one and the same XML file. In the second task, the stored XML data is retrieved using the root node. The user is asked for a root node, which selects the specific data from the entire XML document.

Figure 6.4
6.3 XML Query Answering:
In this project the user searches for information in a semi-structured document and gets the reply in XML format only. This module is designed with the aim of providing fast and efficient retrieval of data in XML format. The data are stored in XML in a tree-structured format so that retrieval is fast and efficient.








Figure 6.5




CHAPTER 7
SYSTEM DESIGN

7.1 ARCHITECTURAL DESIGN




Figure 7.1
The purpose of this framework is to perform data mining on XML and obtain intensional knowledge. The intensional knowledge is also in the form of XML: rules with support and confidence values. In other words, the result of data mining is TARs (Tree-based Association Rules).


7.2 UML DIAGRAM:
7.2.1 DATAFLOW DIAGRAMS:
A data flow diagram is a graphical tool used to describe and analyze the movement of data through a system. These are known as logical data flow diagrams.

ADMIN









Figure 7.2
In this figure, the admin and user log in using an id and password to access the application.

Figure 7.3


This data flow diagram shows how the admin uses the semi-structured document from which TAR rules are extracted and XML data is efficiently retrieved.

USER MODULE







Figure 7.4
This figure shows how the user searches and retrieves data, and how the searched data is displayed.






Figure 7.5


This figure shows the user entering an XML query; the system searches the data and answers the query.



7.2.2 USE CASE DIAGRAM

ADMIN:







Figure 7.6

This figure represents the users, their operations and how they are related. Here the admin has the right to enter data and upload it.









Figure 7.7



This figure represents user operations such as login, registration, search and XML query answering. Here users can create and retrieve data.


7.2.3 CLASS DIAGRAM










Figure 7.8

The class diagram shows the class members and class variables. In this figure there are three classes: user, admin and XML query answering.








7.2.4 ER-DIAGRAM


Figure 7.9
This diagram depicts the relationship between the user, the admin and the database. The user and admin can retrieve and view data.











7.2.5 SEQUENCE DIAGRAMS:
A sequence diagram is a kind of interaction diagram in UML that shows how
processes operate one with another and in what order. A sequence diagram
shows, as parallel vertical lines ("lifelines"), different processes or objects that
live simultaneously, and, as horizontal arrows, the messages exchanged between
them, in the order in which they occur.


Figure 7.10




This diagram depicts the sequential flow between admin, user and database. The admin sends XML data to the database, and the user can search and retrieve data.


7.2.6 COLLABORATION DIAGRAM















Figure 7.11

This diagram shows the operations in collaboration. Here numbering is used to identify the correct sequence.










CHAPTER 8
TESTING

8.1 SOFTWARE TESTING

Testing is the process of executing a program with the explicit intention of finding errors, that is, making the program fail. It verifies the functionality and correctness of software by running it.

A good test case is one that has a high probability of finding an as-yet undiscovered error. A successful test is one that uncovers an as-yet undiscovered error. Software testing is usually performed for one of two reasons:
Defect detection
Reliability estimation


8.1.1 Black Box Testing:

Black box testing applies to software systems or modules and tests functionality in terms of inputs and outputs at interfaces. The test reveals whether the software function is fully operational with reference to the requirements specification.



8.1.2 White Box Testing:

White box testing is based on knowing the internal workings, i.e., testing whether all internal operations are performed according to the program structures and data structures, and whether all internal components have been adequately exercised.

8.2 Software Testing Strategies:
A strategy for software testing will begin in the following order:
1. Unit testing
2. Integration testing
3. Validation testing
4. System testing

UNIT TESTING
Unit testing concentrates on each unit of the software as implemented in source code and is white-box oriented. Using the component-level design description as a guide, important control paths are tested to uncover errors within the boundary of the module. Unit testing can be conducted in parallel for multiple components.

INTEGRATION TESTING:
Here the focus is on the design and construction of the software architecture. Integration testing is a systematic technique for constructing the program structure while at the same time conducting tests to uncover errors associated with interfacing. The objective is to take unit-tested components and build a program structure that has been dictated by the design.




VALIDATION TESTING:
In validation testing, the requirements established as part of software requirements analysis are validated against the software that has been constructed, i.e., validation succeeds when the software functions in a manner that can be reasonably expected by the customer.
























CHAPTER 9
CONCLUSION

The main goals we have achieved in this work are:

1) to mine all frequent association rules without imposing any a-priori restriction on the structure and the content of the rules;

2) to store the mined information in XML format;

3) to use the extracted knowledge to gain information about the original datasets.

We have developed a C++ prototype that has been used to test the effectiveness of our proposal. We have not discussed the updatability of both the document storing the TARs and their index.

FUTURE ENHANCEMENT

In future work, I will study how to incrementally update mined TARs when the original XML datasets change, and how to further optimize the mining algorithm. Moreover, since for the moment I deal with a (substantial) fragment of XQuery, I would like to find the exact fragment of XQuery which lends itself to translation into intensional queries.







CHAPTER 10
APPENDIX
10.1 SNAP SHOTS



Figure 10.1

This is the home page of my project. It has three main modules: admin, user and retrieve XML data. The Admin button redirects us to the administrator page. The User button redirects us to the user login page.









Figure 10.2

This figure shows the sign-in to the admin page. It redirects us to the admin page, where we can register new users, retrieve the total information in XML format, or retrieve only some specific details.











Figure 10.3

On this page, the user can log in with their unique id and password. A new user can register by clicking New User Registration. This page redirects us to the create data page.










Figure 10.4

This page is used to create data by the users. First of all, a root node is created
and then root element and further node elements are stored by the user. It is also
designed in such a way that new data can be appended to the existing one.












Figure 10.5

This page is used to search the document using the key element. The stored
XML data is retrieved by its root node: the user is asked for the root node,
which selects the specific data from the entire XML document.















Figure 10.6

This page is used to find and display the paths of the document searched by
root element. Using these paths, data can be retrieved easily and efficiently.











Figure 10.7

This page is used by the user to retrieve information stored in XML format.
It can retrieve the full XML document, or only some specific information as
needed. The Retrieve button is clicked to invoke the function.








Figure 10.8

On clicking the Retrieve button, the needed information is displayed in the
small drop-down list box.












Figure 10.9

This page shows how the data is stored in the XML file. The data entered by
users is automatically stored as tags in tree format.










10.2 SAMPLE CODE

USER LOGIN
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Data.SqlClient;

public partial class Newuser : System.Web.UI.Page
{
SqlConnection conn = new SqlConnection(@"Data
Source=.\SQLEXPRESS;AttachDbFilename=E:\data_mining\xml query
answering\App_Data\ASPNETDB.MDF;Integrated Security=True;User
Instance=True");
protected void Button1_Click(object sender, EventArgs e)
{
try
{
conn.Open();
// Use parameters instead of string concatenation to avoid SQL injection.
SqlCommand cmd = new SqlCommand("insert into login values(@name, @pwd)", conn);
cmd.Parameters.AddWithValue("@name", TextBox1.Text);
cmd.Parameters.AddWithValue("@pwd", TextBox2.Text);
cmd.ExecuteNonQuery();
conn.Close();
Label6.Visible = true;
Label6.Text = "Successfully Created";
Button2.Visible = true;
}
catch (Exception ex)
{
Label6.Text = "Username Already Exists !!!";
}
}
}
CREATE DATA
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Data.SqlClient;
using System.Xml;
public partial class CreateData : System.Web.UI.Page
{
string path1, path2, path3, path4, path5;

SqlConnection cnn = new SqlConnection(@"Data
Source=.\SQLEXPRESS;AttachDbFilename=E:\data_mining\xml query
answering\App_Data\ASPNETDB.MDF;Integrated Security=True;User
Instance=True");
protected void Button1_Click(object sender, EventArgs e)
{

string xmlFile = Server.MapPath("~/Files/" + TextBox1.Text +".xml");
XmlDocument xd = new XmlDocument();
xd.Load(xmlFile);
XmlNode rootelement = xd.CreateNode(XmlNodeType.Element,
TextBox1.Text, null);
XmlNodeList list = xd.GetElementsByTagName("root");
list[0].AppendChild(rootelement);
xd.Save(xmlFile);
string xmlFile1 = Server.MapPath("~/XMLFile.xml");
XmlDocument xd1 = new XmlDocument();
xd1.Load(xmlFile1);
XmlNode rootelement1 = xd1.CreateNode(XmlNodeType.Element,
TextBox1.Text, null);
XmlNodeList list1 = xd1.GetElementsByTagName("root");
list1[0].AppendChild(rootelement1);
xd1.Save(xmlFile1);
Label4.Visible = true;
Label4.Text = "node created successfully";
}
protected void Button3_Click1(object sender, EventArgs e)
{


Session["gold"] = TextBox2.Text;
ViewState["count"] = Convert.ToInt32(ViewState["count"]) + 1;
Response.Write(Convert.ToInt32(ViewState["count"]));
string xmlFile = Server.MapPath("~/Files/" + TextBox1.Text + ".xml");
XmlDocument xd = new XmlDocument();
xd.Load(xmlFile);
XmlNode rootelement = xd.CreateNode(XmlNodeType.Element,
TextBox1.Text, null);
XmlNodeList list = xd.GetElementsByTagName("root");
list[0].AppendChild(rootelement);
xd.Save(xmlFile);
//XmlDocument doc = new XmlDocument();
//string xmlFile =
System.Web.HttpContext.Current.Server.MapPath("~/XMLFile.xml");
//doc.Load(xmlFile);
XmlNode node1 = xd.CreateNode(XmlNodeType.Element,
TextBox3.Text, null);
XmlElement ele1 = xd.CreateElement(TextBox4.Text);
ele1.InnerText = TextBox9.Text;
path1 = TextBox2.Text + '.' + TextBox3.Text + '.' + TextBox4.Text;
XmlElement ele2 = xd.CreateElement(TextBox5.Text);
ele2.InnerText = TextBox10.Text;
path2 = TextBox2.Text + '.' + TextBox3.Text + '.' + TextBox5.Text;
XmlElement ele3 = xd.CreateElement(TextBox6.Text);
ele3.InnerText = TextBox11.Text;
path3 = TextBox2.Text + '.' + TextBox3.Text + '.' + TextBox6.Text;
XmlElement ele4 = xd.CreateElement(TextBox7.Text);
ele4.InnerText = TextBox12.Text;
path4 = TextBox2.Text + '.' + TextBox3.Text + '.' + TextBox7.Text;

XmlElement ele5 = xd.CreateElement(TextBox8.Text);
ele5.InnerText = TextBox13.Text;
path5 = TextBox2.Text + '.' + TextBox3.Text + '.' + TextBox8.Text;
node1.AppendChild(ele1);
node1.AppendChild(ele2);
node1.AppendChild(ele3);
node1.AppendChild(ele4);
node1.AppendChild(ele5);
// XmlNodeList list = xd.GetElementsByTagName(TextBox2.Text);
list[0].AppendChild(node1);
xd.Save(xmlFile);

xmlcreate(); // helper method defined elsewhere in this page
// Insert the five path expressions one at a time with a parameterized
// statement, to avoid SQL injection through the textbox values.
string sql = "insert into paths(pathname,pathexp) values(@name, @path)";
SqlCommand cmd1 = new SqlCommand(sql, cnn);
cnn.Open();
foreach (string p in new string[] { path1, path2, path3, path4, path5 })
{
cmd1.Parameters.Clear();
cmd1.Parameters.AddWithValue("@name", TextBox1.Text);
cmd1.Parameters.AddWithValue("@path", p);
cmd1.ExecuteNonQuery();
}
Label12.Visible = true;
Label12.Text = "created data successfully in XMLFile.xml";
cnn.Close();

}
}


RETRIEVE DATA
using System;
using System.Collections;
using System.Configuration;
using System.Data;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.HtmlControls;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Xml;
using System.Data.SqlClient;

public partial class UserRetiveData : System.Web.UI.Page
{
SqlConnection con = new SqlConnection(@"Data
Source=.\SQLEXPRESS;AttachDbFilename=E:\data_mining\xml query
answering\App_Data\ASPNETDB.MDF;Integrated Security=True;User
Instance=True");
protected void btnreteive_Click(object sender, EventArgs e)
{
ListBox1.Items.Clear();
con.Open();
// Parameterized query avoids SQL injection through the root-name textbox.
SqlCommand cmd = new SqlCommand("select pathexp from paths where pathname=@name", con);
cmd.Parameters.AddWithValue("@name", txtrname.Text);
SqlDataReader dr = cmd.ExecuteReader();
if (dr.HasRows == true)
{
gv.DataSource = dr;
gv.DataBind();
}
else
{
Response.Write("<script language='javascript'>alert('Incorrect Root
element ')</script>");
gv.DataSource = dr;
gv.DataBind();
}

dr.Close();



XmlTextReader reader = new XmlTextReader(Server.MapPath("~/Files/"
+ txtrname.Text +".xml"));

reader.WhitespaceHandling = WhitespaceHandling.None;
XmlDocument xmlDoc = new XmlDocument();
//Load the file into the XmlDocument

xmlDoc.Load(reader);
//Close off the connection to the file.

reader.Close();
//Add an item representing the document to the listbox

ListBox1.Items.Add("XML Document");

//Find the root node and add it together with its children

XmlNode xnod = xmlDoc.DocumentElement;
AddWithChildren(xnod, 1);

}
}
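The listing above ends by calling AddWithChildren, which is not reproduced in
this appendix. A minimal sketch of such a helper, assuming it appends each
node's name to ListBox1 indented by its depth and then recurses into the
element children, could be:

private void AddWithChildren(XmlNode xnod, int level)
{
    // Indent the entry according to its depth in the tree.
    ListBox1.Items.Add(new string(' ', level * 2) + xnod.Name);
    // Recurse into element children so the whole subtree is listed.
    foreach (XmlNode child in xnod.ChildNodes)
    {
        if (child.NodeType == XmlNodeType.Element)
        {
            AddWithChildren(child, level + 1);
        }
    }
}

The actual helper in the project may differ; this version only illustrates the
recursive traversal suggested by the call AddWithChildren(xnod, 1).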
















REFERENCES

(1) R. Agrawal and R. Srikant. Fast algorithms for mining association rules in
large databases. In Proc. of the 20th Int. Conf. on Very Large Data Bases,
Morgan Kaufmann Publishers Inc., 1994.

(2) T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa.
Efficient substructure discovery from large semi-structured data. In Proc. of the
SIAM Int. Conf. on Data Mining, 2002.

(3) T. Asai, H. Arimura, T. Uno, and S. Nakano. Discovering frequent
substructures in large unordered trees. In Technical Report DOI-TR 216,
Department of Informatics, Kyushu University., 2003.

(4) E. Baralis, P. Garza, E. Quintarelli, and L. Tanca. Answering xml queries
by means of data summaries. ACM Transactions on Information Systems,
25(3):10, 2007.

(5) D. Barbosa, L. Mignet, and P. Veltri. Studying the xml web: Gathering
statistics from an xml sample. World Wide Web, 8(4):413-438, 2005.

SITES REFERRED
http://www.i.kyushuu.ac
http://www.ieee.org
http://www.computer.org/publications/dlib
