Perhaps you know this European directive:
http://en.wikipedia.org/wiki/Telecommunications_data_retention. It basically says that all your web requests need to be logged for half a year (or longer in some nations). Naturally, it is all for your own good. Oh, and for terrorists of course. Naturally, it is completely useless.
Anyway, there is nothing against making it all a little more difficult for them spooks. Perhaps you have heard of TrackMeNot, or perhaps not. It is a Firefox extension that does random searches on Google and other search engines. But you need to have Firefox running for it to work. So I decided to write a tool that can simply run in the background at all times, and it is written in PHP.
And this is Crawler. It works slightly differently from TrackMeNot. It downloads a web page and follows all URLs on that page (basically like a web crawler would), but in random order. It can bootstrap itself by fetching a random query from Google, or you can bootstrap it by giving one or more URLs on the command line. At exit it saves all queued URLs to urls.txt, which it loads on the next startup so it can continue from there. You can send it a HUP signal to print the list of URLs it currently has queued, or a USR2 signal to clear that list and bootstrap again.
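The random-order crawl described above can be sketched roughly as follows. This is a minimal illustration and not the code from crawl.php: the function names (extract_links, take_random, save_queue, load_queue) are made up for this example, and link extraction is reduced to a simple regular expression that only picks up absolute http(s) links.

```php
<?php
// Hypothetical sketch; none of these names are taken from crawl.php.

/** Pull absolute http(s) links out of an HTML string, without duplicates. */
function extract_links(string $html): array
{
    $links = [];
    if (preg_match_all('/href\s*=\s*["\'](https?:\/\/[^"\'#\s]+)/i', $html, $m)) {
        $links = array_values(array_unique($m[1]));
    }
    return $links;
}

/** Remove and return a random URL from the queue, or null if it is empty. */
function take_random(array &$queue): ?string
{
    if ($queue === []) {
        return null;
    }
    $key = array_rand($queue);   // random key, so the crawl order is shuffled
    $url = $queue[$key];
    unset($queue[$key]);
    return $url;
}

/** Persist the queue, one URL per line (the post uses urls.txt for this). */
function save_queue(array $queue, string $file): void
{
    file_put_contents($file, implode("\n", $queue) . ($queue ? "\n" : ''));
}

/** Load a previously saved queue; a missing file gives an empty queue. */
function load_queue(string $file): array
{
    if (!is_file($file)) {
        return [];
    }
    return file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];
}
```

A main loop would call take_random(), fetch that page, and append the result of extract_links() to the queue. Reacting to HUP and USR2 could be done with pcntl_signal() where the pcntl extension is available, though that is an assumption about how one might do it, not a description of the tool.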
Performance impact on the system is minimal. It only downloads 8k per page and tries only one URL per second. Of course these are all configurable settings.
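To give an idea of what such throttling might look like, here is a small sketch. It assumes PHP's plain stream wrappers instead of curl so that it runs on any installation, and the constant and function names are invented for the example; the real settings live in the tool's own file and may be named differently.

```php
<?php
// Hypothetical settings mirroring the defaults mentioned above.
const MAX_BYTES = 8192;     // read at most 8k per page
const DELAY_SECONDS = 1;    // try one URL per second

/** Read at most $maxBytes from a URL or local path; null on failure. */
function fetch_head(string $url, int $maxBytes = MAX_BYTES): ?string
{
    $handle = @fopen($url, 'rb');
    if ($handle === false) {
        return null;
    }
    // Stop reading once the byte limit is reached, then close the stream.
    $data = stream_get_contents($handle, $maxBytes);
    fclose($handle);
    return $data === false ? null : $data;
}

/** Fetch each URL in turn, pausing between requests; returns bytes read per URL. */
function crawl_slowly(array $urls, int $delay = DELAY_SECONDS): array
{
    $sizes = [];
    foreach ($urls as $url) {
        $body = fetch_head($url);
        $sizes[$url] = $body === null ? 0 : strlen($body);
        sleep($delay);
    }
    return $sizes;
}
```

Fetching http URLs this way needs allow_url_fopen to be enabled. With curl one would cap the download differently, which is not shown here.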
It can be started from the command line by running this command:
php crawl.php
You need to have PHP installed and runnable from the command line. It should basically work on any PHP installation, but including curl is strongly advised. All settings are in the crawl.php file.
The latest changes I made are:
- can broadcast all crawled websites to a UDP port so you can follow what it is doing
- can send a predefined, random referrer or use the URL itself as a referrer
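Both of these changes can be pictured with a short sketch. The helper names, the referrer list and the port are placeholders chosen for this example, and the datagram is sent to a single host rather than to a broadcast address, so treat it as an illustration of the idea and not as the tool's actual code.

```php
<?php
// Hypothetical helpers; names, referrer list and port are placeholders.

/** Choose a Referer value: the URL itself, or a random predefined one. */
function pick_referrer(string $url, array $predefined, bool $useSelf): string
{
    if ($useSelf || $predefined === []) {
        return $url;
    }
    return $predefined[array_rand($predefined)];
}

/** Send one crawled URL as a UDP datagram; returns false on failure. */
function announce(string $url, string $host, int $port): bool
{
    $socket = @stream_socket_client("udp://$host:$port", $errno, $errstr, 1);
    if ($socket === false) {
        return false;
    }
    $written = fwrite($socket, $url . "\n");
    fclose($socket);
    return $written !== false && $written > 0;
}
```

Any UDP listener on the chosen port, for instance netcat, would then show each URL as it is crawled.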
Some of its other features include:
- Tracking whether a host has an IPv6 address
Download this tool here
Crawler