Class: Arachni::Spider

- Inherits: Object
- Includes: UI::Output
- Defined in: lib/spider.rb
Overview
Spider class
Crawls the URL in opts[:url] and grabs the HTML code and headers.
@author: Anastasios “Zapotek” Laskos
<tasos.laskos@gmail.com> <zapotek@segfault.gr>
@version: 0.1-pre
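For orientation, here is a minimal usage sketch. It assumes lib/ is on the load path and that opts is an Arachni::Options instance populated elsewhere (e.g. by the framework's CLI parser) with at least its url attribute set:

require 'spider'

# 'opts' is assumed to be a pre-populated Arachni::Options instance
spider  = Arachni::Spider.new( opts )

# crawl and collect the sitemap (an Array of unique URLs)
sitemap = spider.run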
Instance Attribute Summary

- (Array<Proc>) on_every_page_blocks (readonly)
  Code blocks to be executed on each page.
- (Options) opts (readonly)
- (Array) sitemap (readonly)
  Sitemap, an array of links.
Instance Method Summary

- (Spider) initialize(opts) (constructor)
  Instantiates the Spider class with user options.
- (self) on_every_page(&block)
  Hook for further analysis of pages, statistics etc.
- (Array) run(&block)
  Runs the Spider and passes the url, HTML and headers Hash of each page to the given block.
Methods included from UI::Output
#debug!, #debug?, #only_positives!, #only_positives?, #print_debug, #print_debug_backtrace, #print_debug_pp, #print_error, #print_info, #print_line, #print_ok, #print_status, #print_verbose, #verbose!, #verbose?
Constructor Details
- (Spider) initialize(opts)

Instantiates the Spider class with user options.
# File 'lib/spider.rb', line 63

def initialize( opts )
    @opts = opts

    @anemone_opts = {
        :threads             => 3,
        :discard_page_bodies => false,
        :delay               => 0,
        :obey_robots_txt     => false,
        :depth_limit         => false,
        :link_count_limit    => false,
        :redirect_limit      => false,
        :storage             => nil,
        :cookies             => nil,
        :accept_cookies      => true,
        :proxy_addr          => nil,
        :proxy_port          => nil,
        :proxy_user          => nil,
        :proxy_pass          => nil
    }

    hash_opts = @opts.to_h
    @anemone_opts.each_pair {
        |k, v|
        @anemone_opts[k] = hash_opts[k.to_s] if hash_opts[k.to_s]
    }

    @anemone_opts = @anemone_opts.merge( hash_opts )

    @sitemap = []
    @on_every_page_blocks = []

    # if we have no 'include' patterns create one that will match
    # everything, like '.*'
    @opts.include = @opts.include.empty? ? [ Regexp.new( '.*' ) ] : @opts.include
end
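To make the option-merging step concrete, here is a standalone sketch of the same logic with made-up values. Any user option whose stringified name matches an Anemone default overrides that default; the final merge then also copies the remaining string-keyed user options in verbatim:

# Standalone sketch of the merging logic above; all values are illustrative.
anemone_opts = { :threads => 3, :redirect_limit => false }
hash_opts    = { 'threads' => 10, 'delay' => 1 }

anemone_opts.each_pair {
    |k, v|
    # the user-supplied 'threads' value overrides the :threads default
    anemone_opts[k] = hash_opts[k.to_s] if hash_opts[k.to_s]
}

anemone_opts = anemone_opts.merge( hash_opts )
# => { :threads => 10, :redirect_limit => false, 'threads' => 10, 'delay' => 1 }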
Instance Attribute Details
- (Array<Proc>) on_every_page_blocks (readonly)

Code blocks to be executed on each page.
# File 'lib/spider.rb', line 55

def on_every_page_blocks
    @on_every_page_blocks
end
- (Array) sitemap (readonly)

Sitemap, an array of links.
# File 'lib/spider.rb', line 48

def sitemap
    @sitemap
end
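Since the sitemap is a plain Array of URL strings, it can be inspected directly once a crawl has finished; a short sketch, reusing the spider instance from the overview example:

spider.run

# print every URL discovered during the crawl
spider.sitemap.each { |url| puts url }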
Instance Method Details
- (self) on_every_page(&block)
Hook for further analysis of pages, statistics etc.
# File 'lib/spider.rb', line 172

def on_every_page( &block )
    @on_every_page_blocks.push( block )
    self
end
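Because on_every_page returns self, hook registrations can be chained; each registered block is later invoked by #run with the full Anemone page object. A sketch, again reusing the spider instance from above (header_log is an illustrative name):

header_log = []

spider.on_every_page {
    |page|
    puts "[#{page.code}] #{page.url}"
}.on_every_page {
    |page|
    # e.g. collect response headers for later analysis
    header_log << page.headers
}

spider.run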
- (Array) run(&block)
Runs the Spider and passes the url, HTML and headers Hash of each crawled page to the given block.
# File 'lib/spider.rb', line 107

def run( &block )
    i = 1

    # start the crawl
    Anemone.crawl( @opts.url, @anemone_opts ) {
        |anemone|

        # apply 'exclude' patterns
        anemone.skip_links_like( @opts.exclude ) if @opts.exclude

        # apply 'include' patterns and grab matching pages
        # as they are discovered
        anemone.on_pages_like( @opts.include ) {
            |page|

            url = page.url.to_s

            # something went kaboom, tell the user and skip the page
            if page.error
                print_error( "[Error: " + (page.error.to_s) + "] " + url )
                print_debug_backtrace( page.error )
                next
            end

            # push the url in the sitemap
            @sitemap.push( url )

            print_line
            print_status( "[HTTP: #{page.code}] " + url )

            # call the block...if we have one
            if block
                block.call( url, page.body, page.headers )
            end

            # run blocks specified later
            @on_every_page_blocks.each {
                |block|
                block.call( page )
            }

            # we don't need the HTML doc anymore
            page.discard_doc!( )

            # make sure we obey the link count limit and
            # return if we have exceeded it.
            if( @opts.link_count_limit && @opts.link_count_limit <= i )
                return @sitemap.uniq
            end

            i += 1
        }
    }

    return @sitemap.uniq
end
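A sketch of the block form, reusing the spider instance from above; run yields the URL, raw HTML body and response headers Hash for every page that matched the 'include' patterns, and returns the unique sitemap:

sitemap = spider.run {
    |url, html, headers|
    puts "#{url} (#{html.size} bytes)"
}

puts "Crawled #{sitemap.size} unique URLs."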