Filtering Dynamic URLs from a URL Scrape

learning_brain

New Member

My image crawler is now working, but...

The URL crawl picks up every link, which is fine on static pages, but on dynamic pages it's a problem, since exactly the same page content can appear under different URLs.

For example:

http://www.mysite.com/index.php?mai...=12181&zenid=35fb33a00db84d0d133da01967c1c616

is likely to be the same as

http://www.mysite.com/index.php?mai...=12181&zenid=35fb33b54db84d0d133da01967c1c616

Now this is tricky, because some of the query-string data is important on dynamically generated sites, but a lot of it is irrelevant: session IDs and the like.
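To make that concrete, here's a rough sketch in Python (the URLs are made-up stand-ins, since my real ones above are truncated):

from urllib.parse import urlparse, parse_qsl

# Hypothetical stand-ins for the truncated URLs above.
a = "http://www.mysite.com/index.php?products_id=12181&zenid=35fb33a00db8"
b = "http://www.mysite.com/index.php?products_id=12181&zenid=35fb33b54db8"

qa = dict(parse_qsl(urlparse(a).query))
qb = dict(parse_qsl(urlparse(b).query))
print({k for k in qa if qa[k] != qb[k]})  # {'zenid'} -- only the session id differs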

So how do I filter out this garbage???

Rhy
 

lemon-tree

x10 Minion
Community Support

Just to let you know, crawlers and scrapers are specifically prohibited by the x10 Terms of Service and can result in a suspension.
Script Hosting: Space provided by x10Hosting is to be used to create a functional website, we do not allow bots, content scrapers, or any other script that runs continuously on your account. Any scripts that are executed via cron or manually must be directly related to your website.
 

essellar

Community Advocate
Community Support

There's no way to programmatically determine which part of a link is semantically significant in terms of where it points (apart from obvious URL parts, as in something like "&session_id=xxxxxxxxxxxxxxxxxxxxx"). You would need to follow the links and compare the results, normally by hashing the returned data and comparing the hashes of pages with similar URLs. Note, though, that different access paths to the same data may result in different HTML presentations (different templates for the same data), which would generate different hashes. And remember that REST-ish URLs (something I tend to use whenever the platform allows it) may mean the same data can be reached through very different URLs, depending on how the user discovered the resource, so merely looking at the URL is insufficient.
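A bare-bones sketch of that hash-and-compare approach in Python (naive on purpose; for the template reasons above you'd really want to normalise the HTML before hashing):

import hashlib
from urllib.request import urlopen

seen = {}  # content hash -> first URL that produced it

def looks_like_duplicate(url):
    # Hash the raw response body; identical hashes suggest the same page.
    # Naive: differing templates for the same data will defeat this.
    body = urlopen(url).read()
    digest = hashlib.sha256(body).hexdigest()
    if digest in seen:
        return True
    seen[digest] = url
    return False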
 

descalzo

Grim Squeaker
Community Support

Why do you think Google etc. want you to have a site map on PHP-driven sites?
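Fetching one is only a few lines. A rough sketch in Python, assuming the conventional /sitemap.xml location (not every site has one):

from urllib.request import urlopen
from xml.etree import ElementTree

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(root):
    # e.g. root = "http://www.mysite.com"; assumes the conventional location.
    xml = urlopen(root.rstrip("/") + "/sitemap.xml").read()
    return [loc.text for loc in ElementTree.fromstring(xml).iter(SITEMAP_NS + "loc")]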
 

KryptosV2

New Member

Use a Regex.Replace method. The regular expression can contain literal text to search for, and for the variable numbers you can use .*? (lazily match any characters until the next part of the expression matches).

To remove, say, products_id you would use something like:
Regex.Replace(theFile, "products_id=.*?(&|$)", "");
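A rough Python equivalent, in case it helps (the URL is made up for illustration):

import re

url = "http://www.mysite.com/index.php?products_id=12181&zenid=35fb33a0"
# Lazily match from "products_id=" up to the next "&" (or end of string).
cleaned = re.sub(r"products_id=.*?(&|$)", "", url)
print(cleaned)  # http://www.mysite.com/index.php?zenid=35fb33a0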
 

descalzo

Grim Squeaker
Community Support

KryptosV2 said:
Use a Regex.Replace method. The regular expression can contain literal text to search for, and for the variable numbers you can use .*? (lazily match any characters until the next part of the expression matches).

To remove, say, products_id you would use something like:
Regex.Replace(theFile, "products_id=.*?(&|$)", "");

His problem is that he does not know the format of the query string ahead of time. He is working on a form of 'spider' or 'crawler' that will visit various sites automatically.
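One partial workaround is to normalise every URL against a blocklist of session-parameter names that are common across platforms. A sketch in Python (the parameter list is an assumption and is nowhere near exhaustive):

from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Assumed list of session/tracking parameters seen on common platforms.
SESSION_PARAMS = {"zenid", "phpsessid", "sid", "sessionid", "session_id", "jsessionid"}

def canonicalize(url):
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in SESSION_PARAMS]
    query.sort()  # parameter order shouldn't create "new" URLs
    return urlunparse(parts._replace(query=urlencode(query)))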
 

learning_brain

New Member

descalzo said:
His problem is that he does not know the format of the query string ahead of time. He is working on a form of 'spider' or 'crawler' that will visit various sites automatically.

Absolutely!

Thanks everyone for confirming what I already feared.

I did think about comparing content against existing pages, but I would think that's a huge drain on resources.

descalzo - site maps! Why didn't I think of that? Since I can obtain the root address, it's likely I can also find the site map file (if one exists) on dynamic sites. I'll check for that first, and only fall back to adding URLs as they're scraped if no site map is found. That's still likely to give me problems, though, so I'll have to purge the URL queue every so often on sites whose URLs are multiplying uncontrollably.
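Roughly what I have in mind, sketched in Python (sitemap_urls and canonicalize as in the sketches above; crawl_links is a hypothetical fallback):

def seed_queue(root):
    # Prefer the sitemap when one exists; otherwise fall back to link crawling.
    try:
        urls = sitemap_urls(root)
    except Exception:
        urls = crawl_links(root)  # hypothetical fallback crawler
    # Store canonical forms so session ids can't multiply the queue.
    return {canonicalize(u) for u in urls}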
 