Are there any mechanisms to control what the Internet Archive archives on a site? I know to disallow all pages I could add:
```
User-agent: ia_archiver
Disallow: /
```
Can I tell the bot that I want them to crawl my site once a month, or once a year?
I have pages that don’t get archived correctly because some of their assets aren’t picked up. Is there a way to tell the Internet Archive bot which assets it needs when it grabs the site?
Note: This answer is increasingly out-of-date.
The largest contributor to the Internet Archive’s web collection has been Alexa Internet. Material that Alexa crawls for its own purposes has been donated to the IA a few months later. Adding the disallow rule mentioned in the question does not affect those crawls, but the Wayback Machine will honor it retroactively by denying access; the material will still be in the archive. If you really want to keep your material out of the Internet Archive, you should exclude Alexa’s robot.
There may be ways to affect Alexa’s crawls, but I’m not familiar with them.
Since the IA developed its own crawler (Heritrix), it has started doing its own crawls, but those tend to be targeted (election crawls for the Library of Congress, national crawls for France, Australia, etc.). It does not engage in the kind of sustained, world-scale crawls that Google and Alexa conduct. The IA’s largest crawl was a special project to crawl 2 billion pages.
Because these crawls run on schedules driven by project-specific factors, you cannot affect how often, or whether, they visit your site.
The only way to directly affect how and when IA crawls your site is to use their Archive-It service. That service allows you to specify custom crawls. The resultant data will (eventually) be incorporated into IA’s web collection. This is however a paid subscription service.
Most search engines support the “Crawl-delay” directive, but I don’t know if IA does. You could try it though:
```
User-agent: ia_archiver
Crawl-delay: 3600
```
This would require a delay of at least 3600 seconds (i.e. one hour) between requests, which caps the crawler at roughly 720 requests per month.
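As a sanity check on how crawlers that honor robots.txt interpret these two directives, here is a small sketch using Python’s standard-library `urllib.robotparser`. The file contents and the `/private/` path are hypothetical; whether the IA’s own crawler respects `Crawl-delay` remains unconfirmed, as noted above.

```python
from urllib import robotparser

# Hypothetical robots.txt combining the two directives discussed above.
robots_txt = """\
User-agent: ia_archiver
Disallow: /private/
Crawl-delay: 3600
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Paths under /private/ are off-limits to ia_archiver; everything else is allowed.
print(rp.can_fetch("ia_archiver", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("ia_archiver", "https://example.com/public.html"))        # True

# The parsed minimum delay between requests, in seconds.
print(rp.crawl_delay("ia_archiver"))  # 3600
```

A compliant crawler reads the delay the same way: it is a lower bound on the time between requests, not a schedule you can dictate.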
I don’t think the second question (telling the bot which assets to fetch) is possible – the IA bot grabs assets as and when it sees fit. It may have a file-size limit to avoid using too much storage.