Spider
Thoughts, Considerations, and Problems
Who am I?
Others?
Design Considerations
aka 'Spider Do's and Don'ts'
• What do I want to spider?
  o Do I want specific pages?
    Following on from that, do I want specific page extensions?
  o Do I want to submit forms?
    Do I want to submit valid data?
  o Do I want to reach authenticated portions of the website?
  o Do I want to support SSL?
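Several of these scope decisions can be captured in a single filter function applied to every candidate link. A minimal sketch follows; the `in_scope` name, the extension allowlist, and the `allow_ssl` flag are illustrative assumptions, not from the slides:

```python
import os
from urllib.parse import urlparse

# Hypothetical policy: the allowed extensions here are illustrative only.
ALLOWED_EXTENSIONS = {'', '.html', '.htm', '.php', '.asp'}

def in_scope(url, base_host, allow_ssl=True):
    """Decide whether a candidate link should be spidered at all."""
    parts = urlparse(url)
    if parts.scheme == 'https' and not allow_ssl:   # "do I want to support SSL?"
        return False
    if parts.scheme not in ('http', 'https'):
        return False
    if parts.netloc != base_host:                   # stay on the target site
        return False
    ext = os.path.splitext(parts.path)[1].lower()   # "specific page extensions?"
    return ext in ALLOWED_EXTENSIONS

print(in_scope('https://example.com/index.html', 'example.com'))  # True
print(in_scope('https://example.com/logo.png', 'example.com'))    # False
```

Centralizing the policy in one function keeps the crawl loop itself simple: every link either passes the filter or is dropped.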
‘Do’s and Don’ts’ #2
• (X)HTML in the wild is loosely structured
  o Whitespace is insignificant, and many markup requirements are optional in practice.
    Forms can have a name but don't require one; they can have an action but don't need one.
    Image (img) tags can close with />, omit the /, or even be followed by a </img> tag.
  o How do you deal with all these differences when parsing HTML?
    Build your own parser, or integrate a freely available one?
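Rather than hand-rolling a parser for every markup quirk, Python's standard library offers a lenient option in `html.parser.HTMLParser`, which tolerates unquoted attributes, unclosed tags, and self-closing variants. A minimal sketch (the `LinkParser` class and its attribute list are assumptions for illustration, not the talk's implementation):

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect href/src/action values no matter how sloppy the markup is."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ('href', 'src', 'action') and value:
                self.links.append(value)

    # Self-closing tags like <img ... /> arrive via handle_startendtag,
    # which HTMLParser routes through handle_starttag by default,
    # so no extra code is needed for that case.

parser = LinkParser()
parser.feed('<a href="/a">x</a><img src=pic.gif><form action="/post">')
print(parser.links)   # ['/a', 'pic.gif', '/post']
```

Note that the unquoted `src=pic.gif` and the unclosed `<form>` tag are both handled without any special-casing, which is exactly the robustness the bullet points above call for.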
Hurdles #2
import re
import sys
import urllib.parse
import urllib.request

RECURSION_LEVEL = 3   # maximum depth of links to follow
Simple Spider Sample Continued
if __name__ == '__main__':
    end_results = []
    recursion_count = 0
    try:
        page_array = [sys.argv[1]]
    except IndexError:
        print('Please provide a valid URL.')
        sys.exit(1)
    while recursion_count < RECURSION_LEVEL:
        results = []
        for current_page in page_array:
            page_data = getPage(current_page)
            link_list = getLinks(current_page, page_data)
            for item in link_list:
                # keep only links that contain the current page's URL
                if item.find(current_page) != -1:
                    results.append(item)
        end_results.extend(results)   # remember everything found so far
        page_array = results          # follow the new links on the next pass
        recursion_count += 1
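The main loop depends on `getPage()` and `getLinks()` helpers that aren't shown in this excerpt. A sketch of what such helpers might look like, assuming `urllib` for fetching and a simple regex for link extraction; the regex approach and error handling here are assumptions, not necessarily the original implementation:

```python
import re
import urllib.error
import urllib.request
from urllib.parse import urljoin

def getPage(url):
    """Fetch a page and return its body as text; '' on any error."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode('utf-8', errors='replace')
    except (urllib.error.URLError, ValueError):
        return ''

# Crude href matcher; a real spider would use an HTML parser instead.
HREF_RE = re.compile(r'href\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)

def getLinks(base_url, page_data):
    """Extract href values and resolve relative links against base_url."""
    return [urljoin(base_url, href) for href in HREF_RE.findall(page_data)]

print(getLinks('http://example.com/a/', '<a href="b.html">x</a>'))
# ['http://example.com/a/b.html']
```

Resolving relative links with `urljoin` matters here: the main loop's `item.find(current_page)` check only works if every extracted link is an absolute URL.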
Q&A