I. INTRODUCTION
A web crawler (also known as a robot or a spider) is a system for the bulk downloading of web pages. Web crawlers are used for a variety of purposes. Most prominently, they are one of the main components of web search engines: systems that assemble a corpus of web pages, index them, and allow users to issue queries against the index and find the web pages that match the queries. A related use is web archiving (a service provided, e.g., by the Internet Archive), where large sets of web pages are periodically collected and archived for posterity. A third use is web data mining, where web pages are analyzed for statistical properties, or where data analytics is performed on them (e.g., by Attributor [7], a company that monitors the web for copyright and trademark infringements). Finally, web monitoring services (e.g., GigaAlert) allow their clients to submit standing queries, or triggers; they continuously crawl the web and notify clients of pages that match those queries.
In other words, a web crawler is a system that traverses the internet, collecting data and storing it in a database for further arrangement and analysis. The process of web crawling involves gathering pages from the web and arranging them in such a way that the search engine can retrieve them efficiently and easily. The critical objective is to do so quickly and efficiently, without much interference with the functioning of the remote server.
A web crawler begins with a URL or a list of URLs, called seeds, and visits the URL at the top of the list. On each web page it fetches, it looks for hyperlinks to other web pages and adds them to the existing list of URLs to visit. Because the web is not a centrally managed repository of information, this link-following process is how a crawler discovers content.
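To make this loop concrete, the sketch below implements the basic seed-and-frontier process in Python using only the standard library. It is an illustration written for this survey rather than any production design, and names such as crawl and LinkExtractor are my own; politeness, error handling, and URL normalization are deliberately omitted.

    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collects the href targets of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=100):
        frontier = list(seeds)      # URLs waiting to be visited
        seen = set(seeds)           # URLs already enqueued
        pages = {}                  # URL -> downloaded HTML
        while frontier and len(pages) < max_pages:
            url = frontier.pop(0)   # visit the URL at the head of the list
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except (OSError, ValueError):
                continue            # skip unreachable or malformed URLs
            pages[url] = html
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:   # enqueue newly discovered links
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return pages

The frontier here is a plain FIFO list, which produces a breadth-first traversal; the challenges discussed in the next section are largely about what this naive loop leaves out.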
II. CHALLENGES
The basic web crawling algorithm is simple: Given a set of seed Uniform Resource Locators (URLs), a
crawler downloads all the web pages addressed by the URLs, extracts the hyperlinks contained in the
pages, and iteratively downloads the web pages addressed by these hyperlinks. Despite the apparent
simplicity of this basic algorithm, web crawling has many inherent challenges:
Scale: The web is very large and continually evolving. Crawlers that seek broad coverage and good
freshness must achieve extremely high throughput, which poses many difficult engineering problems.
Modern search engine companies employ thousands of computers and dozens of high-speed network
links.
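As a rough illustration of the throughput side of this challenge, the sketch below overlaps many slow network requests with a pool of worker threads. It is illustrative code only (the function names are mine), and a real crawler distributes this loop across many machines rather than threads in one process.

    import concurrent.futures
    import urllib.request

    def fetch(url):
        """Download one page; return (url, body or None on failure)."""
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return url, resp.read()
        except OSError:
            return url, None

    def fetch_all(urls, workers=64):
        # Overlapping many slow requests is where crawl throughput comes
        # from; the pool size here is an arbitrary illustrative choice.
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
            return dict(pool.map(fetch, urls))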
Content selection tradeoffs: Even the highest-throughput crawlers do not purport to crawl the whole
web, or keep up with all the changes. Instead, crawling is performed selectively and in a carefully
controlled order. The goals are to acquire high-value content quickly, ensure eventual coverage of all
reasonable content, and bypass low-quality, irrelevant, redundant, and malicious content. The crawler
must balance competing objectives such as coverage and freshness, while obeying constraints such as
per-site rate limitations. A balance must also be struck between exploration of potentially useful content,
and exploitation of content already known to be useful.
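A common way to impose such a controlled order is to replace the FIFO frontier with a priority queue keyed by each URL's estimated value. The sketch below assumes a toy scoring heuristic (shallower URLs first); real crawlers combine signals such as link-based ranking, site quality, and expected change rate.

    import heapq

    class PriorityFrontier:
        """Frontier that always yields the highest-priority URL next."""
        def __init__(self):
            self._heap = []
            self._count = 0   # tie-breaker so heapq never compares URLs

        def push(self, url, score):
            # heapq is a min-heap, so negate the score to pop best-first.
            heapq.heappush(self._heap, (-score, self._count, url))
            self._count += 1

        def pop(self):
            _, _, url = heapq.heappop(self._heap)
            return url

        def __len__(self):
            return len(self._heap)

    def score(url):
        # Toy stand-in for an importance estimate: prefer shallow URLs.
        return 1.0 / (url.count("/") + 1)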
Social obligations: Crawlers should be good citizens of the web, i.e., not impose too much of a burden
on the web sites they crawl. In fact, without the right safety mechanisms a high-throughput crawler can
inadvertently carry out a denial-of-service attack.
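Two standard safety mechanisms are honoring robots.txt and enforcing a minimum delay between requests to the same host. The sketch below, assuming a fixed one-second delay rather than any site-specific policy, combines both using Python's standard urllib.robotparser.

    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    MIN_DELAY = 1.0    # assumed fixed courtesy delay per host, in seconds
    _last_hit = {}     # host -> time of the last request to it
    _robots = {}       # host -> cached robots.txt parser

    def polite_to_fetch(url, agent="*"):
        """Return True once robots.txt allows the URL and the per-host
        rate limit has been respected (sleeping if necessary)."""
        host = urlparse(url).netloc
        if host not in _robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("http://%s/robots.txt" % host)
            try:
                rp.read()
            except OSError:
                pass   # unreadable robots.txt: fall back to rate limiting only
            _robots[host] = rp
        if not _robots[host].can_fetch(agent, url):
            return False
        wait = _last_hit.get(host, 0.0) + MIN_DELAY - time.time()
        if wait > 0:
            time.sleep(wait)
        _last_hit[host] = time.time()
        return True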
Adversaries: Some content providers seek to inject useless or misleading content into the corpus
assembled by the crawler. Such behavior is often motivated by financial incentives, for example
misdirecting traffic to commercial web sites.
Web crawlers are a central part of search engines, and details of their algorithms and architecture are kept as business secrets. When crawler designs are published, there is often a significant lack of detail that prevents others from reproducing the work. There are also emerging concerns about "search engine spamming", which prevents major search engines from publishing their ranking algorithms.
V. CONCLUSION
In this survey paper, I have surveyed different kinds of general searching techniques and meta search engine strategies, and on that basis I propose an efficient way of retrieving the most relevant data from the hidden web. The proposal combines multiple search engines with a two-stage crawler for harvesting the most relevant sites. By applying page ranking to the collected sites and by focusing on a topic, the advanced crawler achieves more accurate results. The two-stage crawler performs site locating and in-site exploration on the sites collected by the meta crawler.
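As a purely illustrative outline of how such a pipeline might be organized, the sketch below composes the two stages behind a meta search front end; every function here is a hypothetical placeholder with toy behavior, not the implemented system.

    def meta_search(query, engines):
        """Stage 0: merge candidate sites returned by several engines."""
        sites = set()
        for engine in engines:
            sites.update(engine(query))
        return sites

    def rank_sites(sites, topic):
        """Stage 1 (site locating): order sites by topical relevance.
        The score is a toy stand-in: does the URL mention the topic?"""
        return sorted(sites, key=lambda s: topic in s, reverse=True)

    def explore_site(site):
        """Stage 2 (in-site exploration): crawl within one site.
        This stub just returns the site itself."""
        return [site]

    def harvest(query, engines):
        results = []
        for site in rank_sites(meta_search(query, engines), topic=query):
            results.extend(explore_site(site))
        return results

    # Toy usage with a fake "engine" returning a fixed site list.
    fake_engine = lambda q: ["http://books.example.com", "http://data.example.org"]
    print(harvest("books", [fake_engine]))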
VI. REFERENCES

[1] Feng Zhao, J. Z. (2015). SmartCrawler: A Two-Stage Crawler for Efficiently Harvesting Deep-Web Interfaces. IEEE Transactions on Services Computing, 2015.

[2] K. Srinivas, P. V. S. Srinivas, A. Goverdhan (2011). Web Service Architecture for Meta Search Engine. International Journal of Advanced Computer Science and Applications.

[3] Bing Liu (2011). Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Second Edition. Springer-Verlag Berlin Heidelberg.

[4] Hai-Tao Zheng, Bo-Yeong Kang, Hong-Gee Kim (2008). An ontology-based approach to learnable focused crawling. Information Sciences.

[5] A. Rungsawang, N. Angkawattanawit (2005). Learnable topic-specific web crawler. Journal of Network and Computer Applications.

[6] Ahmed Patel, Nikita Schmidt (2011). Application of structured document parsing to focused web crawling. Computer Standards & Interfaces.

[7] Attributor. http://www.attributor.com/.

[8] Sotiris Batsakis, Euripides G. M. Petrakis, Evangelos Milios (2009). Improving the performance of focused web crawlers. Data & Knowledge Engineering.

[9] Michael K. Bergman (2001). The Deep Web: Surfacing Hidden Value. BrightPlanet.

[10] Kevin Chen-Chuan Chang, Bin He, and Zhen Zhang (2005). Toward large-scale integration: Building a MetaQuerier over databases on the web. In CIDR, pp. 44-55.

[11] Gaikwad Dhananjay M., Tanpure Navnath B., Gulaskar Sangram S., Bakale Avinash D. Efficient Deep-Web Harvesting Using Advanced Crawler. International Journal on Recent and Innovation Trends in Computing and Communication, Volume 3, Issue 9, pp. 5540-5542. ISSN: 2321-8169.

[12] Mangesh Manke, Kamlesh Kumar Singh, Vinay Tak, Amit Kharade. Crawdy: Integrated Crawling System for Deep Web Crawling. International Journal of Advanced Research in Computer and Communication Engineering, Vol. 4, Issue 9, September 2015. ISSN (Online): 2278-1021, ISSN (Print): 2319-5940.

[13] E. Adar, J. Teevan, S. T. Dumais, and J. L. Elsas (2009). The web changes everything: Understanding the dynamics of web content. In Proceedings of the 2nd International Conference on Web Search and Data Mining.

[16] J. Cho and A. Ntoulas (2002). Effective change detection using sampling. In Proceedings of the 28th International Conference on Very Large Data Bases.

[17] J. Cho and U. Schonfeld (2007). RankMass crawler: A crawler with high personalized PageRank coverage guarantee. In Proceedings of the 33rd International Conference on Very Large Data Bases.

[18] E. G. Coffman, Z. Liu, and R. R. Weber (1998). Optimal robot scheduling for web search engines. Journal of Scheduling, vol. 1, no. 1.

[19] A. Dasgupta, A. Ghosh, R. Kumar, C. Olston, S. Pandey, and A. Tomkins (2007). The discoverability of the web. In Proceedings of the 16th International World Wide Web Conference.

[21] A. Dasgupta, R. Kumar, and A. Sasturkar (2008). De-duping URLs via rewrite rules. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.