
Building Your Own Web Spider

Thoughts, Considerations and Problems
Who am I

• Graduate: Computer Systems Technology
  – Fanshawe College – London, ON
• Occupation: Security Research Engineer
  – nCircle Network Security – Toronto, ON
  o Current Primary Focus – Web Security Research
  o Past Focus – OS X Security, Reverse Engineering
• Blogger: ComputerDefense.org
Why Discuss This?

• Spiders are becoming more common, and everyone is making use of them.
• Web Spidering is the backbone for many Web Application Security Scanners.
• It’s actually a pretty cool topic!
What Will We Talk About?

• Why Build a Spider?
• Current Products
• Design Considerations
• Hurdles
• Sample Spider Code
Why Build a Spider?

• Create the base for a larger web-based product.
• Monitor Websites for Changes.
• Mirror a Website.
• Download specific types of files.
• Create a Dynamic Search Engine.
Current Products

• The most well-known – wget
• Others include:
  o Softbyte Labs: Black Widow
  o Burp Suite: Burp Spider
  o JSpider
  o Robots for every major search engine

Others?
Design Considerations
aka ‘Spider Do’s and Don’ts’
• What do I want to spider?
  o Do I want specific pages?
     Following on that, do I want specific page extensions?
  o Do I want to submit forms?
     Do I want to submit valid data?
  o Do I want to reach authenticated portions of the website?
  o Do I want to support SSL?
(One way to record these choices is sketched below.)
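
A minimal sketch of how these include decisions might be recorded, written in the same Python 2 style as the sample code at the end of the deck; the option names and the page_wanted() helper are illustrative, not part of any existing tool.

# Every name here is hypothetical -- adjust to taste.
SPIDER_OPTIONS = {
    'allowed_extensions': ['.html', '.htm', '.php', '.asp'],  # page types we care about
    'submit_forms': False,          # do we POST forms at all?
    'submit_valid_data': False,     # if we do, fill in plausible values?
    'follow_authenticated': True,   # try to reach login-protected areas?
    'support_ssl': True,            # fetch https:// URLs as well as http://
}

def page_wanted(url, options=SPIDER_OPTIONS):
    # Keep a URL only if its extension is one we chose to spider.
    path = url.split('?')[0].lower()
    last_segment = path.split('/')[-1]
    if '.' not in last_segment:
        return True   # no extension (e.g. a directory) -- treat it as a page
    return any(path.endswith(ext) for ext in options['allowed_extensions'])
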
‘Do’s and Don’ts’ #2

• What don’t I want to spider?
  o Do I NOT want to spider external links?
  o Do I NOT want to download files over X bytes?
  o Do I NOT want to follow links on error pages?

What other don’ts can you think of? (The first two above are sketched below.)
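
A sketch of the first two exclusions -- skipping external hosts and skipping oversized downloads -- using Python 2’s urlparse and urllib as in the sample code; the helper names and the 1 MB cap are illustrative.

import urllib
import urlparse

MAX_BYTES = 1024 * 1024   # arbitrary cap for this sketch

def is_external(start_url, candidate_url):
    # Compare hostnames; assumes candidate_url is already absolute
    # (e.g. after urlparse.urljoin).
    return urlparse.urlparse(candidate_url)[1] != urlparse.urlparse(start_url)[1]

def too_large(url):
    # Trust the Content-Length header when the server sends one;
    # servers that omit it would need a streaming check instead.
    length = urllib.urlopen(url).info().getheader('Content-Length')
    return length is not None and int(length) > MAX_BYTES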


Do’s and Don’ts #3 & #4

• Not all web servers are created equal.
  o How do you prepare yourself for the idiosyncrasies of non-compliant web servers? (see the sketch below)
  o Do you want to handle non-compliant web servers at all?
• How far do I spider (maximum recursion depth)?
  o There is no single right answer; it depends on the individual and the task.
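
One way to prepare for misbehaving servers is to wrap every fetch defensively. A sketch using Python 2.6+’s urllib2 (the sample code later in the deck uses plain urllib); getPageSafe() is an illustrative name, not part of the sample.

import urllib2

def getPageSafe(url, timeout=10):
    # Fetch a page while tolerating broken or non-compliant servers.
    try:
        response = urllib2.urlopen(url, timeout=timeout)
    except (urllib2.HTTPError, urllib2.URLError, IOError), e:
        print 'Skipping %s: %s' % (url, e)
        return None
    content_type = response.info().getheader('Content-Type') or ''
    if 'html' not in content_type.lower():
        return None   # don't try to parse images, PDFs, etc. as HTML
    return response.read()
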
Do’s and Don’ts #5 & #6

• What are the user-definable options?
  o Part of this falls back to our included options.
     Do we allow the user to specify authentication?
  o Do we allow the user to provide default credentials (e.g. valid data is required for a blog comment)?
  o Does the user define the recursion level and the other tasks laid out in your includes/excludes? (see the sketch below)
     Which raises the question: who are you designing this for?
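
A sketch of surfacing these choices as command-line options with optparse, which ships with Python 2 alongside the modules used in the sample code; every option name and default here is illustrative.

from optparse import OptionParser

parser = OptionParser(usage='usage: %prog [options] start_url')
parser.add_option('-r', '--recursion', type='int', default=3,
                  help='maximum recursion depth [default: %default]')
parser.add_option('-u', '--user', help='username for HTTP authentication')
parser.add_option('-p', '--password', help='password for HTTP authentication')
parser.add_option('-x', '--no-external', action='store_true', default=False,
                  help='do not follow links to other hosts')
(options, args) = parser.parse_args()
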
Hurdles

• (X)HTML is unstructured.
  o Whitespace is insignificant, and many markup rules are optional.
     Forms can have a name but don’t require one; they can have an action but don’t need one.
     Image (img) tags can close with />, omit the /, or even be followed by a </img> tag.
  o How do you deal with all these differences when parsing HTML?
     Build your own parser or integrate a freely available one? (see the sketch below)
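
If you integrate rather than build, Python 2’s bundled HTMLParser module already copes with mixed case, unquoted attributes, and optional closing slashes (third-party parsers such as BeautifulSoup are more forgiving still). A minimal sketch:

from HTMLParser import HTMLParser

class AnchorCollector(HTMLParser):
    # Collect href values from <a> tags, however the markup is formatted.
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

collector = AnchorCollector()
collector.feed('<A HREF=index.html>home</a><img src="x.gif"></img>')
print collector.links   # prints: ['index.html']
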
Hurdles #2

• Client-Side Technologies are a pain.
  o Is it sufficient today for a spider to simply parse (X)HTML? Nope.
  o We have to consider client-side technologies: links are built via JavaScript now, and entire sites are developed in Flash.
     Do we ignore these client-side technologies?
     Do we include a JavaScript engine? (a lighter middle ground is sketched below)
     Do we decompile the Flash and parse the ActionScript?
  o What do these choices do to our development time? Is it justifiable?
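
One lighter middle ground, short of embedding a JavaScript engine: scrape string literals that look like URLs out of <script> blocks. This is a rough heuristic sketch (the patterns and the js_link_guesses() name are illustrative), not a substitute for actually executing the script.

import re

SCRIPT_RE = re.compile(r'<script[^>]*>(.*?)</script>', re.I | re.S)
URL_STRING_RE = re.compile(
    r'[\'"]([^\'"]+\.(?:html?|php|aspx?|js)(?:\?[^\'"]*)?)[\'"]', re.I)

def js_link_guesses(page_data):
    # Treat quoted strings ending in a page-like extension as candidate links.
    guesses = []
    for script_body in SCRIPT_RE.findall(page_data):
        guesses.extend(URL_STRING_RE.findall(script_body))
    return guesses

print js_link_guesses('<script>window.location = "admin/login.php";</script>')
# prints: ['admin/login.php']
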
Hurdles #3

• What do I consider to be a link?
  o Spidering is about finding and following links, but what counts as a link?
     Does it require an anchor?
     If so, do I follow URI anchors or any anchor, including one that points to a local name?
     Does a frame src attribute count?
     Does an iframe count?
  o Do you parse each of these separately or treat them as a single group? (see the sketch below)
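
One way to treat anchors, frames, and iframes as a single group of links, extending the regex approach used in the sample code; the patterns and the extract_links() name are illustrative.

import re

# (pattern, label) pairs for the places a link can hide.
LINK_SOURCES = [
    (r'<\s*a\b[^>]*\bhref\s*=\s*["\']?([^"\'\s>]+)', 'anchor'),
    (r'<\s*frame\b[^>]*\bsrc\s*=\s*["\']?([^"\'\s>]+)', 'frame'),
    (r'<\s*iframe\b[^>]*\bsrc\s*=\s*["\']?([^"\'\s>]+)', 'iframe'),
]

def extract_links(page_data):
    found = []
    for pattern, kind in LINK_SOURCES:
        for url in re.findall(pattern, page_data, re.I):
            if not url.startswith('#'):   # skip purely local fragment anchors
                found.append((kind, url))
    return found

print extract_links('<a href="/about">x</a><iframe src=/ads.html></iframe>')
# prints: [('anchor', '/about'), ('iframe', '/ads.html')]
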
Simple Spider Sample
#!/usr/bin/python

import urllib
import urlparse
import sys
import re

RECURSION_LEVEL = 3
Simple Spider Sample Continued

def getLinks ( start_page, page_data ) :
    # Pull anchor hrefs out of the raw page and resolve each one
    # against the page it was found on.
    url_list = []
    anchor_href_regex = '<\s*a\s*href\s*=\s*[\x27\x22]?([a-zA-Z0-9:/\\\\._-]*)[\x27\x22]?\s*'
    urls = re.findall(anchor_href_regex, page_data)
    for url in urls :
        url_list.append(urlparse.urljoin( start_page, url ))
    return url_list

def getPage ( url ) :
    # Fetch the raw page body (no error handling in this simple sample).
    page_data = urllib.urlopen(url).read()
    return page_data
Simple Spider Sample Continued (2)

if __name__ == '__main__' :
    end_results = []
    recursion_count = 0
    try: page_array = [sys.argv[1]]
    except IndexError:
        print 'Please provide a valid url.'
        sys.exit()
    while recursion_count < RECURSION_LEVEL:
        results = []
        for current_page in page_array:
            page_data = getPage( current_page )
            link_list = getLinks( current_page, page_data )
            for item in link_list:
                # Keep only links that stay under the page we are spidering.
                if item.find( current_page ) != -1:
                    results.append( item )
        # Feed this level's links into the next pass and go one level deeper.
        end_results.extend( results )
        page_array = results
        recursion_count += 1
    for link in end_results:
        print link
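
Assuming the sample is saved as spider.py (a hypothetical filename), it would be run as: python spider.py http://www.example.com. With RECURSION_LEVEL = 3 it prints every same-site link found within three levels of the start page.
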
Q&A

Questions, Comments, Concerns?

Bring them up now or email me:

ht@computerdefense.org
treguly@ncircle.com
