Class: Arachni::Spider

Inherits:

Object


Includes:

UI::Output

Defined in:

lib/spider.rb

Overview

Spider class

Crawls the site starting at the URL in the options (read as opts.url in the code) and grabs the HTML body and response headers of each page.

@author: Anastasios “Zapotek” Laskos <tasos.laskos@gmail.com> <zapotek@segfault.gr>

@version: 0.1-pre

Instance Attribute Summary

- (Array<Proc>) on_every_page_blocks readonly
Code blocks to be executed on each page, in the order they were registered.
- (Options) opts readonly
- (Array) sitemap readonly
Sitemap, array of links.

Instance Method Summary

- (Spider) initialize(opts) constructor
Constructor
Instantiates Spider class with user options.
- (self) on_every_page(&block)
Hook for further analysis of pages, statistics etc.
- (Array) run(&block)
Runs the Spider and passes the url, html and headers Hash of each page to the block.

Methods included from UI::Output

#debug!, #debug?, #only_positives!, #only_positives?, #print_debug, #print_debug_backtrace, #print_debug_pp, #print_error, #print_info, #print_line, #print_ok, #print_status, #print_verbose, #verbose!, #verbose?

Constructor Details

- (Spider) initialize(opts)

Constructor
Instantiates Spider class with user options.

Parameters:

(Options) opts

# File 'lib/spider.rb', line 63

def initialize( opts )
    @opts = opts
    @anemone_opts = {
        :threads              =>  3,
        :discard_page_bodies  =>  false,
        :delay                =>  0,
        :obey_robots_txt      =>  false,
        :depth_limit          =>  false,
        :link_count_limit     =>  false,
        :redirect_limit       =>  false,
        :storage              =>  nil,
        :cookies              =>  nil,
        :accept_cookies       =>  true,
        :proxy_addr           =>  nil,
        :proxy_port           =>  nil,
        :proxy_user           =>  nil,
        :proxy_pass           =>  nil
    }

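    # override each default with the user option of the same name
    # (looked up by its String key), but only if that option is truthy
    hash_opts = @opts.to_h
    @anemone_opts.each_pair {
        |k, v|
        @anemone_opts[k] = hash_opts[k.to_s] if hash_opts[k.to_s]
    }

    # carry over all user options under their original keys as well
    @anemone_opts = @anemone_opts.merge( hash_opts )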
    
    @sitemap = []
    @on_every_page_blocks = []

    # if we have no 'include' patterns create one that will match
    # everything, like '.*'
    @opts.include =
        @opts.include.empty? ? [ Regexp.new( '.*' ) ] : @opts.include
end
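
The option handling in the constructor can be tried in isolation with plain hashes. This is only a minimal sketch: the `user_opts` values are made up, and it assumes that `opts.to_h` returns String keys, as the `k.to_s` lookup suggests.

```ruby
# Defaults keyed by Symbol, a shortened copy of the list above.
defaults = { :threads => 3, :accept_cookies => true, :depth_limit => false }

# Hypothetical user options, keyed by String (assumption, see lead-in).
user_opts = { 'threads' => 5, 'accept_cookies' => false, 'depth_limit' => 2 }

merged = defaults.dup
defaults.each_key do |key|
  value = user_opts[key.to_s]
  # Same guard as the constructor: nil and false never replace a default.
  merged[key] = value if value
end

# The final merge keeps the user options under their String keys too.
merged = merged.merge(user_opts)

p merged
```

Note that under these assumptions a user value of `false` (here `'accept_cookies'`) does not replace a `true` default under the Symbol key; it only survives under its String key.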

Instance Attribute Details

- (Array<Proc>) on_every_page_blocks (readonly)

Code blocks to be executed on each page

Returns:

(Array<Proc>)




# File 'lib/spider.rb', line 55

def on_every_page_blocks
  @on_every_page_blocks
end

- (Options) opts (readonly)

Returns:

(Options)




# File 'lib/spider.rb', line 41

def opts
  @opts
end

- (Array) sitemap (readonly)

Sitemap, array of links

Returns:

(Array)




# File 'lib/spider.rb', line 48

def sitemap
  @sitemap
end

Instance Method Details

- (self) on_every_page(&block)

Hook for further analysis of pages, statistics etc.

Parameters:

(Proc) block —
code to be executed for every page

Returns:

(self)

# File 'lib/spider.rb', line 172

def on_every_page( &block )
    @on_every_page_blocks.push( block )
    self
end
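
Because on_every_page returns self, hooks can be chained. The sketch below shows the pattern with a hypothetical stand-in class (`PageHooks`, not part of Arachni) and a plain Hash in place of a real page object; the `visit` helper is also invented for the example.

```ruby
# PageHooks is a hypothetical stand-in that copies only the hook logic.
class PageHooks
  attr_reader :on_every_page_blocks

  def initialize
    @on_every_page_blocks = []
  end

  # Same body as Spider#on_every_page: store the block, return self.
  def on_every_page(&block)
    @on_every_page_blocks.push(block)
    self
  end

  # Hypothetical helper: call every registered hook with one page.
  def visit(page)
    @on_every_page_blocks.each { |hook| hook.call(page) }
  end
end

seen  = []
sizes = []

hooks = PageHooks.new
hooks.on_every_page { |page| seen << page[:url] }
     .on_every_page { |page| sizes << page[:body].size }

# A Hash stands in for the page object here.
hooks.visit({ :url => 'http://example.com/', :body => '<html></html>' })
```

Hooks run in the order they were registered, once per page.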

- (Array) run(&block)

Runs the Spider and passes the url, html and headers Hash of each page to the block.

Parameters:

(Proc) block —
a block expecting url, html, headers

Returns:

(Array) —
array of links, a sitemap

# File 'lib/spider.rb', line 107

def run( &block )

    i = 1
    # start the crawl
    Anemone.crawl( @opts.url, @anemone_opts ) {
        |anemone|
        
        # apply 'exclude' patterns
        anemone.skip_links_like( @opts.exclude ) if @opts.exclude
        
        # apply 'include' patterns and grab matching pages
        # as they are discovered
        anemone.on_pages_like( @opts.include ) {
            |page|

            url = page.url.to_s
            
            # something went kaboom, tell the user and skip the page
            if page.error
                print_error( "[Error: " + (page.error.to_s) + "] " + url )
                print_debug_backtrace( page.error )
                next
            end

            # push the url in the sitemap
            @sitemap.push( url )

            print_line
            print_status( "[HTTP: #{page.code}] " + url )
            
            # call the block...if we have one
            if block
                block.call( url, page.body, page.headers )
            end

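            # run the blocks registered through on_every_page;
            # the parameter is named 'hook' so that it does not clash
            # with the 'block' argument of this method
            @on_every_page_blocks.each {
                |hook|
                hook.call( page )
            }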

            # we don't need the HTML doc anymore
            page.discard_doc!( )

            # make sure we obey the link count limit and
            # return if we have exceeded it.
            if( @opts.link_count_limit &&
                @opts.link_count_limit <= i )
                return @sitemap.uniq
            end

            i+=1
        }
    }

    return @sitemap.uniq
end
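
The link count limit relies on `return` inside a block leaving the enclosing method. The following sketch reproduces only that bookkeeping over a fixed list of URLs; `crawl_with_limit` is a hypothetical helper and no crawling library is involved.

```ruby
# Hypothetical helper that mimics the counter and sitemap logic of #run.
def crawl_with_limit(urls, link_count_limit = nil)
  sitemap = []
  count   = 1

  urls.each do |url|
    sitemap.push(url)
    yield url if block_given?

    # 'return' inside the block exits crawl_with_limit itself,
    # which is how #run stops early once the limit is reached.
    return sitemap.uniq if link_count_limit && link_count_limit <= count

    count += 1
  end

  sitemap.uniq
end

urls = %w[
  http://example.com/
  http://example.com/a
  http://example.com/a
  http://example.com/b
]

visited = []
limited = crawl_with_limit(urls, 2) { |url| visited << url }
full    = crawl_with_limit(urls)
```

In this sketch the counter advances once per processed page, so duplicates count towards the limit even though the returned sitemap is deduplicated.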