Gavin Spearhead

Devastating Chaos

Perhaps you know this European directive: http://en.wikipedia.org/wiki/Telecommunications_data_retention. It basically says that all your web requests need to be logged for half a year (or longer in some countries). Naturally, it is all for your own good. Oh, and because of terrorists, of course. Naturally, it is completely useless.

Anyway, there is nothing wrong with making it all a little more difficult for them spooks. Perhaps you have heard of TrackMeNot, perhaps not: it is a Firefox extension that runs random searches on Google and other search engines. But it only works while Firefox is running. So I decided to write a tool that can simply run in the background at all times, and it is written in PHP.

And this is Crawler. It works slightly differently from TrackMeNot: it downloads a web page and then follows all the URLs on that page (basically like a web crawler would), but in random order. It can bootstrap itself by fetching a random query from Google, or you can bootstrap it by giving one or more URLs on the command line. On exit it saves all stored URLs to urls.txt, which it loads on the next startup so it can continue from there. You can send it a HUP signal to print the list of URLs it currently has queued, or a USR2 signal to clear the list and bootstrap again.
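
To make the idea concrete, here is a minimal sketch of the random-order queue, written in Python rather than the tool's PHP. It is not the code from crawl.php: the names LinkCollector, extract_links and Frontier are made up for this illustration, and the one-URL-per-line layout of urls.txt is an assumption. Signal handling and the Google bootstrap are left out.

```python
import random
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return the absolute http(s) links found in html, in page order."""
    parser = LinkCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


class Frontier:
    """Queue of URLs still to fetch; hands them out in random order."""

    def __init__(self, rng=None):
        self.urls = []
        self.rng = rng or random.Random()

    def add(self, urls):
        for url in urls:
            if url not in self.urls:  # skip URLs that are already queued
                self.urls.append(url)

    def pop_random(self):
        # Remove and return one queued URL, chosen uniformly at random.
        index = self.rng.randrange(len(self.urls))
        return self.urls.pop(index)

    def save(self, path):
        # Assumed file layout: one URL per line.
        Path(path).write_text("\n".join(self.urls), encoding="utf-8")

    def load(self, path):
        file = Path(path)
        if file.exists():
            lines = file.read_text(encoding="utf-8").splitlines()
            self.add(line for line in lines if line)
```

A crawl loop would call pop_random(), fetch that page, feed the result to extract_links() and add() the links back to the queue, calling save("urls.txt") on exit.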

The performance impact on the system is minimal: it downloads only 8 KB per page and tries only one URL per second. Of course, these are all configurable settings.
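
The two limits can be sketched like this, again in Python and not taken from crawl.php. The names read_capped and Pacer are made up for this illustration, and the defaults of 8192 bytes and 1.0 second simply mirror the figures mentioned above.

```python
import time


def read_capped(stream, limit=8192):
    """Read at most `limit` bytes from a file-like object, then stop."""
    chunks = []
    remaining = limit
    while remaining > 0:
        chunk = stream.read(min(1024, remaining))
        if not chunk:  # end of the response body
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class Pacer:
    """Keeps successive calls to wait() at least `interval` seconds apart."""

    def __init__(self, interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock  # injectable so the pacing can be tested
        self.sleep = sleep
        self.last = None

    def wait(self):
        now = self.clock()
        if self.last is not None:
            delay = self.interval - (now - self.last)
            if delay > 0:
                self.sleep(delay)  # too early: wait out the remainder
                now = self.clock()
        self.last = now
```

The response object returned by urllib.request.urlopen is file-like, so it can be passed to read_capped() directly; calling wait() before each fetch keeps the loop to roughly one URL per second.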

It can be started from the command line with this command:

php crawl.php

You need to have PHP installed and runnable from the command line. It should work on basically any PHP installation, but including the curl extension is strongly advised. All settings are in the crawl.php file.

The latest changes I made are:

  • It can broadcast all crawled websites to a UDP port so you can follow what it is doing
  • It can send a predefined, random referrer or use the URL itself as the referrer

Some of its other features include:

  • Tracking whether a host has an IPv6 address
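
For illustration, the UDP reporting and the referrer choice could look roughly like the Python sketch below. None of it comes from crawl.php: the function names, the port 9999, the loopback default and the REFERRERS list are all placeholders picked for this example.

```python
import random
import socket

# Hypothetical values; the tool's real referrer list is not shown in this post.
REFERRERS = ["http://www.example.com/", "http://www.example.org/search"]


def announce(url, host="127.0.0.1", port=9999):
    """Send the URL as a single UDP datagram so a listener can follow along."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(url.encode("utf-8"), (host, port))


def pick_referrer(url, use_self=False, rng=random):
    """Return the URL itself, or a random entry from the predefined list."""
    if use_self:
        return url
    return rng.choice(REFERRERS)
```

Any UDP listener bound to the chosen port will receive one datagram per crawled URL. Sending to a real broadcast address instead of a single host would additionally require setting the SO_BROADCAST option on the socket.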

Download this tool here: Crawler

posted on Saturday, February 05, 2011 4:50 PM
