Help with Googlebot eating my bandwidth!

learning_brain

New Member
Messages
206
Reaction score
1
Points
0
I'm not sure where to put this query.

I've been checking my logs because I easily get through 20GB of Bandwidth per month - which takes me up to my limit prematurely and then cuts me off about 7 days before the end of the month (not X10hosting).

In the stats, Googlebot ate 16.10GB just last month before service was terminated 8 days before month end - yes, that's right, 16.10GB, not MB! In more detail, I find it consumes about 700,000 KB a day!

[Attached screenshot: googlebot.jpg]


It hit pages 251,496 times in June alone.

OUCH!!!

OK - it's a biggish site, currently 342,000-odd pages and images, dynamically listed in the sitemap indexes.

Under Google Webmaster Tools, I've now set the crawl rate to 200 seconds between requests, but is there any other advice anyone can give (other than disallowing bots in the robots file)?

Thoughts would be appreciated. I get pretty hacked off when I lose service for days on end.

Thanks

Rich
 

cybrax

Community Advocate
Community Support
Messages
764
Reaction score
27
Points
0
Well... if the site is making a regular income, now might be the time to upgrade the hosting.

On the other hand, I would consider swapping in a medium-size image for the pages that Googlebot crawls rather than leaving the high-resolution one in place. Not sure how the folks at Mountain View feel about that, so it's probably better to ask them first.

Without the stats it's hard to say: are these unique crawls, or is the bot repeatedly crawling the same images over and over? Dynamically re-writing the robots file may provide a solution if that is the case.
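
As a rough illustration of the dynamic robots idea (the file names, threshold and directory below are all made up, so treat it as a sketch only), you could hand robots.txt requests to a script with mod_rewrite:

Code:
## .htaccess -- route robots.txt to a PHP script (hypothetical robots.php)
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteRule ^robots\.txt$ robots.php [L]
</IfModule>

and have the script decide what to block, for example shutting the bots out of the image pages once the month's bandwidth passes a limit you choose:

Code:
<?php
// robots.php -- emit a robots.txt dynamically (illustrative sketch only).
// The counter file, threshold and directory name are hypothetical.
header('Content-Type: text/plain');

echo "User-agent: *\n";

// If this month's logged bandwidth has passed the chosen threshold,
// stop crawlers hitting the image pages for the rest of the month.
$usedBytes  = (int) @file_get_contents('bandwidth_used.txt'); // kept up to date elsewhere
$limitBytes = 15 * 1024 * 1024 * 1024; // e.g. 15GB

if ($usedBytes > $limitBytes) {
    echo "Disallow: /images/\n";
}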
 

essellar

Community Advocate
Community Support
Messages
3,295
Reaction score
227
Points
63
Well, the number of page hits isn't bad at all for indexing a site that size; the problem is the amount of data per page. I can't see how ~250K pages translates to >16GB unless Google is indexing your high-res images. If the high-res image view pages are part of your site map, I can't see how to claw back the bandwidth without blocking access to your images (non-thumbs) directory with robots.txt. On the other hand, if what you're offering for indexing is just the search results pages, then you should be able to use rel="nofollow" in the links on the thumbnails -- the thumbs will still be indexed (probably multiple times each, since they're likely to turn up on multiple results pages), but the high-res images won't.
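
To put that in concrete terms -- the directory and file names below are only examples, not your actual structure:

Code:
# robots.txt -- keep crawlers out of the full-size images
# (the /images/full/ path is just an example)
User-agent: *
Disallow: /images/full/

and on the results pages, something along the lines of:

Code:
<!-- thumbnail link to the hi-res view, marked rel="nofollow" -->
<a href="/images/full/1234.jpg" rel="nofollow"><img src="/thumbs/1234.jpg" alt="example image"></a>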
 

descalzo

Grim Squeaker
Community Support
Messages
9,373
Reaction score
326
Points
83
If you have your images in several directories, try excluding the bots from one directory per month.
 

lllllllbob61

New Member
Messages
9
Reaction score
0
Points
0
Oh wow... that's a lot of bandwidth. I will try to help ya. :-D

Try enabling file caching with your htaccess file. That should help, and it will improve page load time too.

Add these lines to your htaccess file and save. You can also change the number of days to a number of months.
(This htaccess code is for an Apache server, like Stoli here at x10. ;-)
---------------------------------------------
Code:
## EXPIRES CACHING ##
<IfModule mod_expires.c>
ExpiresActive On
ExpiresByType image/jpg "access 7 days"
ExpiresByType image/jpeg "access 7 days"
ExpiresByType image/gif "access 7 days"
ExpiresByType image/png "access 7 days"
ExpiresByType text/css "access 7 days"
ExpiresByType application/pdf "access 7 days"
ExpiresByType text/x-javascript "access 7 days"
ExpiresByType application/x-shockwave-flash "access 7 days"
ExpiresByType image/x-icon "access 7 days"
ExpiresDefault "access 7 days"
</IfModule>
<IfModule mod_headers.c>
<FilesMatch "\.(js|css|xml|gz)$">
Header append Vary Accept-Encoding
</FilesMatch>
</IfModule>
<IfModule mod_deflate.c>
#The following line is enough for .js and .css
AddOutputFilter DEFLATE js css
#The following line also enables compression by file content type, for the following list of Content-Type:s
AddOutputFilterByType DEFLATE text/html text/plain text/xml application/xml
#The following lines are to avoid bugs with some browsers
BrowserMatch ^Mozilla/4 gzip-only-text/html
BrowserMatch ^Mozilla/4\.0[678] no-gzip
BrowserMatch \bMSIE !no-gzip !gzip-only-text/html
</IfModule>
## EXPIRES CACHING ##
Check your pages' performance with Google's Page Speed Online: http://pagespeed.googlelabs.com/
See what your performance score is before and after adding that code.
(Note: Google Page Speed Online may take a bit to acknowledge some of your htaccess updates.)
There is also a browser plugin available for Firefox and Chrome for instant checks.

Also, in a sitemap.xml you can set <changefreq>monthly</changefreq> and <priority>0.5</priority> for each URL. But in your case it would be way too many URLs, and I would assume a sitemap generator would really eat up your bandwidth.
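
A single entry would look something like this (example.com is just a placeholder):

Code:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/some-page.html</loc>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>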

Let me know if that helps. :)
 

Darkmere

New Member
Messages
358
Reaction score
2
Points
0
I do not think you can control the number of times the spiders visit your site, can you? But this might work as well: tell the spiders not to send in a download request. I don't remember the command right off the top of my head, but if it is not there, they send a request back to Google to download and index your entire web site every time the spiders visit.
 

learning_brain

New Member
Messages
206
Reaction score
1
Points
0
Well... if the site is making a regular income, now might be the time to upgrade the hosting.

Unfortunately, it's making just enough to cover current hosting :(

On the other hand, I would consider swapping in a medium-size image for the pages that Googlebot crawls rather than leaving the high-resolution one in place. Not sure how the folks at Mountain View feel about that, so it's probably better to ask them first.

All of the high-res images are (ahem) hotlinked. How is that counting toward bandwidth, I hear you ask... I'm not sure either, but it is. That said, I'm also struggling for server space, with about a third gone already on thumbs alone. Adding a medium-res version would simply overload my storage limit... :(

Without the stats it's hard to say: are these unique crawls, or is the bot repeatedly crawling the same images over and over? Dynamically re-writing the robots file may provide a solution if that is the case.

Pages are being repeatedly crawled. I'm not sure how a robots file re-write will help? Advice would be good here. I believe the changefreq in the sitemap is ignored by the major engines?

Well, the number of page hits isn't bad at all for indexing a site that size; the problem is the amount of data per page. I can't see how ~250K pages translates to >16GB unless Google is indexing your high-res images. If the high-res image view pages are part of your site map, I can't see how to claw back the bandwidth without blocking access to your images (non-thumbs) directory with robots.txt. On the other hand, if what you're offering for indexing is just the search results pages, then you should be able to use rel="nofollow" in the links on the thumbnails -- the thumbs will still be indexed (probably multiple times each, since they're likely to turn up on multiple results pages), but the high-res images won't.

Yes, the high-res images are part of the site - mainly to try to improve SEO. Blocking the view_image.php page (either by blocking that one file in robots.txt or via your suggested nofollow) would do nicely, but it will significantly reduce the site's SEO visibility.

If you have your images in several directories, try excluding the bots from one directory per month.

Nope - sorry - thumbs are in one dir, and the high-res images are all hotlinked (and yes, I know that's naughty, but I don't have the space to support hundreds of thousands of high-res images).

Oh wow... that's a lot of bandwidth. I will try to help ya. :-D

Try enabling file caching with your htaccess file. That should help, and it will improve page load time too.

Add these lines to your htaccess file and save. You can also change the number of days to a number of months.
(This htaccess code is for an Apache server, like Stoli here at x10. ;-)
[quoted htaccess caching code snipped - see lllllllbob61's post above]
Check your pages' performance with Google's Page Speed Online: http://pagespeed.googlelabs.com/
See what your performance score is before and after adding that code.
(Note: Google Page Speed Online may take a bit to acknowledge some of your htaccess updates.)
There is also a browser plugin available for Firefox and Chrome for instant checks.

Also, in a sitemap.xml you can set <changefreq>monthly</changefreq> and <priority>0.5</priority> for each URL. But in your case it would be way too many URLs, and I would assume a sitemap generator would really eat up your bandwidth.

Let me know if that helps. :)

This is great and is certainly something I should do to improve load times. I believe, however, that Googlebot does not use the cache, as it is trying to discover changed content... not sure on this one. I've got the FF plugin for site performance.
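
One thing I might experiment with is answering Googlebot's If-Modified-Since header with a 304 for pages that haven't actually changed, so it doesn't re-download the full page each visit. Something roughly like this at the top of a page script -- an untested sketch, and the last-modified value would have to come from my own database rather than being hard-coded:

Code:
<?php
// Untested sketch: answer conditional requests with 304 Not Modified so the
// bot doesn't re-download pages that haven't changed since its last visit.
// $lastModified is hard-coded here for illustration only.
$lastModified = strtotime('2011-06-01 00:00:00');

header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) &&
    strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $lastModified) {
    header('HTTP/1.1 304 Not Modified');
    exit;
}

// ...otherwise build and output the page as normal.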

As for the sitemap, I have a static sitemap-index and several sitemap.php files, each of which is dynamically written depending on the images found. So yes, I could do this, but as I said earlier, I think the major engines ignore the changefreq tag. (I could be wrong.)

I do not think you can control the number of times the spiders visit your site, can you? But this might work as well: tell the spiders not to send in a download request. I don't remember the command right off the top of my head, but if it is not there, they send a request back to Google to download and index your entire web site every time the spiders visit.

You can control the hit frequency for Google from Webmaster Tools. I've now set mine at 200 seconds between each request, and I'll see what that does in reality. I don't know about the other engines, but the next-highest consumer of bandwidth is tiny in comparison.

And yeah - re-indexing the entire site each visit would be catastrophic!!! :D


I think, in summary, I'll have to block the crawlers from the high-res image pages, but this then gives me another problem (doesn't it always?). How then do I optimise SEO for the thumbnail results pages to best advantage? I can't really use the usual URL, page title, h1 tags etc., so I need to come up with another solution. In addition, if only the thumbs are being indexed, it's going to severely affect my CTR, as the high-res thing is what the site is about.

...hmmmm

*Scratches head*

Rich
 

essellar

Community Advocate
Community Support
Messages
3,295
Reaction score
227
Points
63
Hmm -- if the hi-res images are hotlinked (as opposed to piped through your server or stored locally), that should be a bandwidth freebie at your end, so you can take that out of consideration. That means that the "base problem" is that there are a lot of access paths to the images. You can try extending the expiry on the thumbs, as suggested, but you'll probably find that it's not going to save you as much as you'd hoped -- it depends how Google's spider manages cache. Using a CDN (like Cloudflare) might work a lot better, since only the first request for a thumb would actually hit your server. The real difficulty is, then, that you've covered the search space well: if your search criteria were specific enough to make indexing trivial (say, one keyword per image), then your site would be the next best thing to useless.

If you are piping through (with, say, curl), then you're taking a double bandwidth hit with every request (fulfilling Google's request and making your own for the hi-res image at the original source). You can cut that in half by creating a local copy, but then your storage skyrockets and you're still out-of-pocket a whole bunch to fix it.
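
If you do try the longer-expiry route on the thumbs, a per-directory .htaccess dropped into the thumbnails folder would be enough -- the one-month figure below is just an example:

Code:
## .htaccess inside the thumbnails directory -- longer expiry just for thumbs
<IfModule mod_expires.c>
ExpiresActive On
ExpiresByType image/jpeg "access plus 1 month"
ExpiresByType image/png "access plus 1 month"
ExpiresByType image/gif "access plus 1 month"
</IfModule>
<IfModule mod_headers.c>
# Explicit Cache-Control as well (30 days in seconds)
Header set Cache-Control "max-age=2592000, public"
</IfModule>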
 

learning_brain

New Member
Messages
206
Reaction score
1
Points
0
Yep - I thought that hotlinked images were freebies too, so you must be right. However, I have them as embedded images rather than links straight to the .jpg, so perhaps that's why they're being counted? Perhaps I should just do a simple JS lightbox-type effect - but then I lose the page-ranking benefit.

I am also at a loss as to how Google manages the cache, but if it's anything like mine, it has to load the entire content to check whether it is current. The only other way would be if Google were to store every page on first crawl and then only look for changes... which I can't see happening.

Double hitting is a problem with AdSense pages. Every time a user hits one page, it also gets hit by Google Media Partners - another blow!

The crawler/spider is the only function that uses cURL, but it does run pretty much constantly, so that too is eating into bandwidth - but surely that wouldn't be linked to the Googlebot drain?

The alteration of the Google request frequency doesn't seem to have helped much either, so no joy there...

I think for the moment I'm just going to block Google from the view_image page until I can come up with a more permanent solution. It will take a while for the indexed pages to drop off anyway, so I have some time to think. It will also tell me whether that's where the major hit is coming from.
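
For anyone following along, the robots.txt addition is just a couple of lines (assuming view_image.php sits in the web root):

Code:
# Keep Googlebot away from the full-size image view page
User-agent: Googlebot
Disallow: /view_image.php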

Rich
 