Barkrowler is our experimental and very recent version of the BUbiNG crawler (it is basically BUbiNG with our pull requests applied and a configuration suited to distribution on EC2).
It is supposed to respect robots.txt and to apply a politeness setting per host and per IP.
However, I have received several reports that in some cases the politeness setting is not enforced.
We are currently investigating this and hope to stop this problematic behaviour very soon.
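To make clear what "politeness per host and per IP" means, here is a minimal sketch in Python. It is an illustration only and not BUbiNG's or Barkrowler's actual code: the class name, method names, delay values and example hosts are all hypothetical. The idea is that a fetch has to wait until both the host and the IP address behind it are free again.

```python
from dataclasses import dataclass, field


@dataclass
class PolitenessGate:
    """Tracks the earliest time each host and each IP may be fetched again."""
    host_delay: float  # seconds between two requests to the same host (assumed value)
    ip_delay: float    # seconds between two requests to the same IP (assumed value)
    _next_host: dict = field(default_factory=dict)
    _next_ip: dict = field(default_factory=dict)

    def earliest_fetch(self, host: str, ip: str, now: float) -> float:
        # A fetch must wait for BOTH the host and the IP to be free.
        return max(now, self._next_host.get(host, 0.0), self._next_ip.get(ip, 0.0))

    def record_fetch(self, host: str, ip: str, when: float) -> None:
        # After a fetch, push back the next allowed time for the host and its IP.
        self._next_host[host] = when + self.host_delay
        self._next_ip[ip] = when + self.ip_delay


# Hypothetical example: 10 s between hits on a host, 2 s between hits on an IP.
gate = PolitenessGate(host_delay=10.0, ip_delay=2.0)
gate.record_fetch("a.example", "192.0.2.1", when=100.0)

same_host = gate.earliest_fetch("a.example", "192.0.2.1", now=101.0)  # waits for the host delay
shared_ip = gate.earliest_fetch("b.example", "192.0.2.1", now=101.0)  # waits for the IP delay
unrelated = gate.earliest_fetch("c.example", "192.0.2.2", now=101.0)  # may go immediately
```

The per-IP limit matters because many small sites share one server: without it, a crawler could respect every per-host delay and still overload the machine behind them.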
1.) Who are we?
Exensa is a very small French company specialized in large-scale text data analysis. We have worked on social networks, legal documentation, and e-commerce.
To give you an idea, we have a small demo of a Wikipedia page similarity service:
2.) What are we after on your sites?
We crawl the web at large, so we have no particular target, except perhaps certain languages for experimental purposes. We want to identify the semantic and thematic orientation of web sites and pages.
3.) What will we do with the data we retrieve?
For now, our goal is to provide a "same site" search engine that is better than the alternatives, especially for the long tail (current alternatives let you find only the first 10 to 20 similar sites).
There is no beta online yet (that is why we need to perform a crawl), but we hope to have one very soon.
4.) Why should you allow us to take your property? How does it benefit you, the site owner?
People looking for information sources, customers, or providers, or trying to identify competitors or opportunities for cooperation, may find our tool useful.
So even though we won't bring you as much traffic as Google, Bing, or similar web search engines, the traffic we do provide should be of very high value (and otherwise we won't bother you for long...).