learning_brain
My image crawler is now working, but...
The URL crawl picks up every link. That is fine on static pages, but on dynamic pages it is a problem, because exactly the same page content can appear under different URLs.
e.g.
http://www.mysite.com/index.php?mai...=12181&zenid=35fb33a00db84d0d133da01967c1c616
is likely to be the same page as
http://www.mysite.com/index.php?mai...=12181&zenid=35fb33b54db84d0d133da01967c1c616
This is tricky because some of the query string data (the part after the ?) is important on a dynamically generated site, but a lot of it is irrelevant, such as session IDs.
So how do I filter out this garbage?
Rhy
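
To make the question concrete, here is a rough sketch in Python of the kind of filtering I have in mind: strip known session parameters from the query string and sort what remains, so that two URLs for the same page compare equal. The function name canonicalize, the SESSION_PARAMS list, and the example.com URLs are all made up for illustration. Only zenid comes from the URLs above; which other parameters are safe to drop is an assumption that would need checking per site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed names of parameters that only carry session state.
# "zenid" is from the URLs in this post; the rest are guesses.
SESSION_PARAMS = {"zenid", "phpsessid", "sid", "sessionid"}


def canonicalize(url, ignored=SESSION_PARAMS):
    """Return url with ignored query parameters removed and the rest sorted."""
    parts = urlsplit(url)
    # Keep only the parameters that are not in the ignore list.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored
    ]
    # Sort so that parameter order does not create "different" URLs.
    kept.sort()
    # Lowercase the host and drop the fragment, which never reaches the server.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


# Hypothetical URLs that differ only in the session parameter.
a = canonicalize("http://www.example.com/index.php?page=item&id=12181&zenid=aaa")
b = canonicalize("http://www.example.com/index.php?page=item&id=12181&zenid=bbb")
seen = {a, b}  # the crawler's "already visited" set now holds one entry
```

The crawler would then store and compare the canonical form instead of the raw link. The weak point is the ignore list: it has to be maintained by hand, and a parameter that is harmless on one site may select content on another.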