Spider
Thoughts, Considerations, and Problems
Who am I?
Others?
Design Considerations
aka 'Spider Do's and Don'ts'
• What do I want to spider?
  o Do I want specific pages?
    Following on from that, do I want specific page extensions?
  o Do I want to submit forms?
    Do I want to submit valid data?
  o Do I want to reach authenticated portions of the website?
  o Do I want to support SSL?
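Several of these scope decisions can be captured in a single filter function applied to every candidate link. A minimal sketch follows; the `in_scope` name, the extension allowlist, and the `allow_ssl` flag are illustrative assumptions, not from the slides:

```python
import os
from urllib.parse import urlparse

# Hypothetical policy: the allowed extensions here are illustrative only.
ALLOWED_EXTENSIONS = {'', '.html', '.htm', '.php', '.asp'}

def in_scope(url, base_host, allow_ssl=True):
    """Decide whether a candidate link should be spidered at all."""
    parts = urlparse(url)
    if parts.scheme == 'https' and not allow_ssl:   # "do I want to support SSL?"
        return False
    if parts.scheme not in ('http', 'https'):
        return False
    if parts.netloc != base_host:                   # stay on the target site
        return False
    ext = os.path.splitext(parts.path)[1].lower()   # "specific page extensions?"
    return ext in ALLOWED_EXTENSIONS

print(in_scope('https://example.com/index.html', 'example.com'))  # True
print(in_scope('https://example.com/logo.png', 'example.com'))    # False
```

Centralizing the policy in one function keeps the crawl loop itself simple: every link either passes the filter or is dropped.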
‘Do’s and Don’ts’ #2
• (X)HTML in the wild is loosely structured
  o Whitespace is insignificant, and many markup requirements are optional in practice.
    Forms can have a name but don't require one; they can have an action but don't need one.
    Image (img) tags can close with />, omit the /, or even be followed by a </img> tag.
  o How do you deal with all these differences when parsing HTML?
    Build your own parser, or integrate a freely available one?
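Rather than hand-rolling a parser for every markup quirk, Python's standard library offers a lenient option in `html.parser.HTMLParser`, which tolerates unquoted attributes, unclosed tags, and self-closing variants. A minimal sketch (the `LinkParser` class and its attribute list are assumptions for illustration, not the talk's implementation):

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect href/src/action values no matter how sloppy the markup is."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ('href', 'src', 'action') and value:
                self.links.append(value)

    # Self-closing tags like <img ... /> arrive via handle_startendtag,
    # which HTMLParser routes through handle_starttag by default,
    # so no extra code is needed for that case.

parser = LinkParser()
parser.feed('<a href="/a">x</a><img src=pic.gif><form action="/post">')
print(parser.links)   # ['/a', 'pic.gif', '/post']
```

Note that the unquoted `src=pic.gif` and the unclosed `<form>` tag are both handled without any special-casing, which is exactly the robustness the bullet points above call for.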
Hurdles #2
import re
import sys
import urllib.parse
import urllib.request

RECURSION_LEVEL = 3   # maximum depth of links to follow
Simple Spider Sample Continued
if __name__ == '__main__':
    end_results = []
    recursion_count = 0
    try:
        page_array = [sys.argv[1]]
    except IndexError:
        print('Please provide a valid URL.')
        sys.exit(1)
    while recursion_count < RECURSION_LEVEL:
        results = []
        for current_page in page_array:
            page_data = getPage(current_page)
            link_list = getLinks(current_page, page_data)
            for item in link_list:
                # keep only links that contain the current page's URL
                if item.find(current_page) != -1:
                    results.append(item)
        end_results.extend(results)   # remember everything found so far
        page_array = results          # follow the new links on the next pass
        recursion_count += 1
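The main loop depends on `getPage()` and `getLinks()` helpers that aren't shown in this excerpt. A sketch of what such helpers might look like, assuming `urllib` for fetching and a simple regex for link extraction; the regex approach and error handling here are assumptions, not necessarily the original implementation:

```python
import re
import urllib.error
import urllib.request
from urllib.parse import urljoin

def getPage(url):
    """Fetch a page and return its body as text; '' on any error."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode('utf-8', errors='replace')
    except (urllib.error.URLError, ValueError):
        return ''

# Crude href matcher; a real spider would use an HTML parser instead.
HREF_RE = re.compile(r'href\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)

def getLinks(base_url, page_data):
    """Extract href values and resolve relative links against base_url."""
    return [urljoin(base_url, href) for href in HREF_RE.findall(page_data)]

print(getLinks('http://example.com/a/', '<a href="b.html">x</a>'))
# ['http://example.com/a/b.html']
```

Resolving relative links with `urljoin` matters here: the main loop's `item.find(current_page)` check only works if every extracted link is an absolute URL.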
Q&A