Filtering Dynamic URLs from a URL Scrape

learning_brain

New Member

My image crawler is now working, but...

The URL crawl picks up every link, which is fine on static pages, but on dynamic pages it's a problem, since exactly the same page content can appear under different URLs.

For example:

http://www.mysite.com/index.php?mai...=12181&zenid=35fb33a00db84d0d133da01967c1c616

is likely to be the same as

http://www.mysite.com/index.php?mai...=12181&zenid=35fb33b54db84d0d133da01967c1c616

Now this is tricky, because some of the query-string data is important on dynamically generated sites, but a lot of it is irrelevant: session IDs and the like.
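To make that concrete, here's a rough sketch in Python (the URLs are made-up stand-ins, since my real ones above are truncated):

from urllib.parse import urlparse, parse_qsl

# Hypothetical stand-ins for the truncated URLs above.
a = "http://www.mysite.com/index.php?products_id=12181&zenid=35fb33a00db8"
b = "http://www.mysite.com/index.php?products_id=12181&zenid=35fb33b54db8"

qa = dict(parse_qsl(urlparse(a).query))
qb = dict(parse_qsl(urlparse(b).query))
print({k for k in qa if qa[k] != qb[k]})  # {'zenid'} -- only the session id differs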

So how do I filter out this garbage???

Rhy
 

lemon-tree

x10 Minion
Community Support

Just to let you know, crawlers and scrapers are specifically prohibited by the x10 Terms of Service and can result in a suspension.
Script Hosting: Space provided by x10Hosting is to be used to create a functional website, we do not allow bots, content scrapers, or any other script that runs continuously on your account. Any scripts that are executed via cron or manually must be directly related to your website.
 

essellar

Community Advocate
Community Support

There's no way to programmatically determine which part of a link is semantically significant in terms of where it points (apart from obvious URL parts, as in something like "&session_id=xxxxxxxxxxxxxxxxxxxxx"). You would need to follow the links and compare the results, normally by hashing the returned data and comparing the hashes of pages with similar URLs. Note, though, that different access paths to the same data may result in different HTML presentations (different templates for the same data), which would generate different hashes. And remember that REST-ish URLs (something I tend to use whenever the platform allows it) may mean the same data can be reached through very different URLs, depending on how the user discovered the resource, so merely looking at the URL is insufficient.
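A bare-bones sketch of that hash-and-compare approach in Python (naive on purpose; for the template reasons above you'd really want to normalise the HTML before hashing):

import hashlib
from urllib.request import urlopen

seen = {}  # content hash -> first URL that produced it

def looks_like_duplicate(url):
    # Hash the raw response body; identical hashes suggest the same page.
    # Naive: differing templates for the same data will defeat this.
    body = urlopen(url).read()
    digest = hashlib.sha256(body).hexdigest()
    if digest in seen:
        return True
    seen[digest] = url
    return False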
 

descalzo

Grim Squeaker
Community Support

Why do you think Google etc. want you to have a site map on PHP-driven sites?
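Fetching one is only a few lines. A rough sketch in Python, assuming the conventional /sitemap.xml location (not every site has one):

from urllib.request import urlopen
from xml.etree import ElementTree

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(root):
    # e.g. root = "http://www.mysite.com"; assumes the conventional location.
    xml = urlopen(root.rstrip("/") + "/sitemap.xml").read()
    return [loc.text for loc in ElementTree.fromstring(xml).iter(SITEMAP_NS + "loc")]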
 

KryptosV2

New Member

Use a Regex.Replace method. The regular expression can contain literal text to search for, and for the variable numbers you can use .*? (lazily match any characters until the next part of the expression matches).

To remove, say, products_id you would use something like:
Regex.Replace(theFile, "products_id=.*?(&|$)", "");
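A rough Python equivalent, in case it helps (the URL is made up for illustration):

import re

url = "http://www.mysite.com/index.php?products_id=12181&zenid=35fb33a0"
# Lazily match from "products_id=" up to the next "&" (or end of string).
cleaned = re.sub(r"products_id=.*?(&|$)", "", url)
print(cleaned)  # http://www.mysite.com/index.php?zenid=35fb33a0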
 

descalzo

Grim Squeaker
Community Support

KryptosV2 said:
Use a Regex.Replace method. The regular expression can contain literal text to search for, and for the variable numbers you can use .*? (lazily match any characters until the next part of the expression matches).

To remove, say, products_id you would use something like:
Regex.Replace(theFile, "products_id=.*?(&|$)", "");

His problem is that he does not know the format of the query string ahead of time. He is working on a form of 'spider' or 'crawler' that will visit various sites automatically.
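One partial workaround is to normalise every URL against a blocklist of session-parameter names that are common across platforms. A sketch in Python (the parameter list is an assumption and is nowhere near exhaustive):

from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Assumed list of session/tracking parameters seen on common platforms.
SESSION_PARAMS = {"zenid", "phpsessid", "sid", "sessionid", "session_id", "jsessionid"}

def canonicalize(url):
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in SESSION_PARAMS]
    query.sort()  # parameter order shouldn't create "new" URLs
    return urlunparse(parts._replace(query=urlencode(query)))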
 

learning_brain

New Member

descalzo said:
His problem is that he does not know the format of the query string ahead of time. He is working on a form of 'spider' or 'crawler' that will visit various sites automatically.

Absolutely!

Thanks everyone for confirming what I already feared.

I did think about comparing content against existing pages, but I would think that's a huge drain on resources.

descalzo - site maps! Why didn't I think of that? Since I can obtain the root address, it's likely I can also find the site map file (if one exists) on dynamic sites. I'll check for that first, and only fall back to adding URLs as they're scraped if no site map is found. That's still likely to give me problems, though, so I'll have to purge the URL queue every so often on sites whose URLs are multiplying uncontrollably.
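Roughly what I have in mind, sketched in Python (sitemap_urls and canonicalize as in the sketches above; crawl_links is a hypothetical fallback):

def seed_queue(root):
    # Prefer the sitemap when one exists; otherwise fall back to link crawling.
    try:
        urls = sitemap_urls(root)
    except Exception:
        urls = crawl_links(root)  # hypothetical fallback crawler
    # Store canonical forms so session ids can't multiply the queue.
    return {canonicalize(u) for u in urls}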
 