Version 6 of the crawler also comes with a new and improved crawl-filter file.  It is very powerful when used properly, and I'll tell you why.  Lazarus has a feature called "Wild" mode, which means it will not stay on the seedlist sites; it will crawl every link it finds.  This can be good or bad: good because the crawler will have a never-ending queue of links to crawl, bad because some of the sites it wanders onto could be junk.  However, you can put restrictions on the crawler in wild mode by editing the crawl-filter to control which types of sites the crawler is allowed to crawl.


For example, say you are only interested in crawling .edu and .gov websites, because any domain the crawler finds linked on those sites will have a backlink from a .edu or .gov site.  To do that, add *.edu* and *.gov* to the crawl-filter under the [include] section, like this:


[include]

*.edu*
*.gov*


This tells the crawler that it is okay to crawl any website that includes ".edu" or ".gov" in the URL.  Then you can add a .edu or .gov site to your seedlist, set it to wild crawl, and it will keep finding other .edu and .gov websites for as long as there are links to them on the pages it crawls.
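
To get a feel for how a pattern like *.edu* behaves, here is a minimal Python sketch.  It assumes the patterns work like ordinary shell-style wildcards, where * matches any run of characters; Lazarus's own matching rules may differ.  The function name is_included and the sample URLs are made up for illustration.

```python
from fnmatch import fnmatchcase

# Patterns copied from the [include] example above.
INCLUDE_PATTERNS = ["*.edu*", "*.gov*"]

def is_included(url, patterns=INCLUDE_PATTERNS):
    """Return True if the URL matches at least one include pattern.

    Assumption: '*' matches any run of characters, as in shell wildcards.
    This is an illustration, not Lazarus's actual matching code.
    """
    url = url.lower()
    return any(fnmatchcase(url, pattern) for pattern in patterns)

print(is_included("http://www.example.edu/research"))   # True
print(is_included("http://www.example.com/blog"))       # False
print(is_included("http://example.com/files.edu.zip"))  # True
```

Note the last case: under this assumption the pattern matches ".edu" anywhere in the URL, not only at the end of the domain name, so a few non-.edu sites could slip through.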


You can also use the crawl-filter to tell the crawler which URLs not to crawl by adding strings under the [exclude] section, like this:


[exclude]

*facebook.com*
*youtube.com*
*twitter.com*


That exclude section tells the crawler to never crawl URLs that contain those strings.
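
Here is a rough Python sketch of how the two sections could work together.  It rests on several assumptions that are not confirmed for Lazarus: that both sections sit in the same file, that * is a shell-style wildcard, that an [exclude] match always beats an [include] match, and that an empty [include] section allows everything.  The names parse_filter and should_crawl are hypothetical.

```python
from fnmatch import fnmatchcase

# Sample filter text built from the two examples in this post.
FILTER_TEXT = """
[include]
*.edu*
*.gov*

[exclude]
*facebook.com*
*youtube.com*
*twitter.com*
"""

def parse_filter(text):
    """Split filter text into {'include': [...], 'exclude': [...]}."""
    sections = {"include": [], "exclude": []}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines carry no meaning in this sketch
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].lower()
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line.lower())
    return sections

def should_crawl(url, sections):
    """Decide whether a URL passes the filter (hypothetical rules)."""
    url = url.lower()
    # Assumption: an exclude match always wins over an include match.
    if any(fnmatchcase(url, p) for p in sections["exclude"]):
        return False
    # Assumption: an empty [include] list means "allow everything".
    if not sections["include"]:
        return True
    return any(fnmatchcase(url, p) for p in sections["include"])

rules = parse_filter(FILTER_TEXT)
print(should_crawl("http://www.example.edu/", rules))              # True
print(should_crawl("http://www.facebook.com/example.edu", rules))  # False
print(should_crawl("http://www.example.com/", rules))              # False
```

The second URL contains ".edu" but is still rejected, because in this sketch the exclude list is checked first.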