cURL: how to scrape multiple pages?

fordvb

Hello everyone,
I want to use cURL to scrape content from another website and store it in a database, as www.filestube.com does.
I know how to fetch a page's HTML and use preg_match to pull out links, for example. But I want to follow every link I find with preg_match, load that page, and extract specific things from it.
 

Mr. DOS

preg_match is probably the wrong way to go about collecting URLs, seeing as it stops after the first match instead of returning every match in the string (preg_match_all does that). I don't have the time to look for the exact code to do what you need right now, but here are the steps your code needs to take:
1. Load the external page into a string.
2. Loop through the string looking for instances of href=". For each instance, get a substring from its position plus its length to the position of the next ". Add each substring to an array.
3. Once you've got the array populated, you'll need to loop through it (tip: foreach) and then process each item as necessary.

The big step is #2. I hope I've described it clearly enough. There's a bunch of built-in PHP functions to help.
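
Step 2 above could be sketched roughly as follows. This is only an illustration: the function name extractHrefs is made up for this example, and it assumes every link uses a lowercase, double-quoted href="..." attribute, which real pages do not always do.

```php
<?php
// Hypothetical helper (not from the thread): collect every double-quoted
// href value found in $html, using only strpos() and substr().
function extractHrefs(string $html): array
{
    $needle = 'href="';
    $links  = [];
    $offset = 0;

    // strpos() returns false once there are no further occurrences.
    while (($start = strpos($html, $needle, $offset)) !== false) {
        $start += strlen($needle);           // position of the first URL character
        $end = strpos($html, '"', $start);   // position of the closing quote
        if ($end === false) {
            break;                           // unterminated attribute, stop scanning
        }
        $links[] = substr($html, $start, $end - $start);
        $offset = $end + 1;                  // continue after this attribute
    }

    return $links;
}
```

Single-quoted or uppercase attributes are missed by this sketch; swapping strpos for stripos would at least handle HREF="...".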

--- Mr. DOS
 

descalzo

Do you use cURL to get the base page (the one with the links)?
Scrape the links, put them in an array
Loop through the array, using cURL to grab the secondary pages.
Scrape each of those pages, put the info into the DB.

How much code do you have already?
Which part of the above is the part you cannot do?
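
For reference, the four steps above might look something like the sketch below. The names fetchPage and scrapeLinkedPages are invented for this example, the page title stands in for "the info" to extract, only absolute http(s) links are followed, and the database insert is left as a comment because the thread does not describe the schema.

```php
<?php
// Hypothetical helper: fetch one URL with cURL; returns the body, or null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow HTTP redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);           // give up after 15 seconds
    $body = curl_exec($ch);

    return is_string($body) ? $body : null;
}

// Hypothetical pipeline: scrape the base page, then visit every link found on it.
// $fetch can be replaced (for example in tests) by any callable taking a URL.
function scrapeLinkedPages(string $baseUrl, ?callable $fetch = null): array
{
    $fetch = $fetch ?? 'fetchPage';

    $html = $fetch($baseUrl);
    if ($html === null) {
        return [];
    }

    // Steps 1-2: collect the absolute links into an array.
    preg_match_all('/href="(https?:\/\/[^"]+)"/i', $html, $matches);

    $results = [];
    // Step 3: loop through the array, fetching each secondary page.
    foreach (array_unique($matches[1]) as $link) {
        $page = $fetch($link);
        if ($page === null) {
            continue;  // skip a failed page instead of aborting the whole run
        }
        // Step 4: scrape the page; here only the <title> is taken.
        $found = preg_match('/<title[^>]*>(.*?)<\/title>/is', $page, $title);
        $results[$link] = $found ? trim($title[1]) : null;
        // An INSERT into the database would go here.
    }

    return $results;
}
```

Calling scrapeLinkedPages('http://www.filestube.com/') would return an array mapping each fetched link to its title, or null where no title was found.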
 

fordvb

The problem is not the first step. I can get all the links. I have used foreach and cURL to get the content, but I can't get the content of every link: it stops on the second link and shows an error message.
 

Mr. DOS

Well that's different then! What code do you have already, and what's the error?

--- Mr. DOS
 

descalzo

fordvb said: "... and show an error message."

And the error message was...?

Did you call

PHP:
curl_close($curlHandle);

at the bottom of the loop?

Were you using the curl_multi_xxx family of functions?
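
To make the questions above concrete, here is a rough sketch of a loop body that opens a fresh handle for each URL, records cURL's own error message on failure, and releases the handle afterwards. The function name fetchWithError is invented for this example. Since PHP 8.0 curl_close() has no effect (handles are freed automatically), so the call is only made on older versions.

```php
<?php
// Hypothetical helper: fetch one URL and report what went wrong, if anything.
// Returns ['body' => string|null, 'error' => string].
function fetchWithError(string $url): array
{
    $ch = curl_init();                               // fresh handle for this request
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);

    $body  = curl_exec($ch);
    // curl_error() must be read before the handle is closed.
    $error = ($body === false) ? curl_error($ch) : '';

    if (PHP_VERSION_ID < 80000) {
        curl_close($ch);                             // needed only before PHP 8.0
    }

    return ['body' => ($body === false) ? null : $body, 'error' => $error];
}
```

Inside the foreach, something like `$r = fetchWithError($link); if ($r['body'] === null) { echo "$link failed: {$r['error']}\n"; continue; }` lets the loop report the failing link and carry on with the next one.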
 

JenniC

This is just an alternate approach.

To get the contents of all web pages linked from www.filestube.com, you can try the following code.


Code:
# Script urlcontents.txt
# Get the URLs referred in the referring URL.
var str URLList, nextURL
script ss_URLs.txt URL("http://www.filestube.com") > $URLList
echo -e "DEBUG: Found the following URLs."
echo -e $URLList
 
# Go thru URL List one by one and print contents using cat.
while ($URLList <> "")
do
 
    # Get the next referred URL.
    lex "1" $URLList > $nextURL
 
    # Show this URL's contents.
    cat $nextURL
done


The script is written in biterscripting. Save it as C:/Scripts/urlcontents.txt and run it with the following command.

Code:
script "C:/Scripts/urlcontents.txt"

The documentation for the SS_URLs script and the cat command is at http://www.biterscripting.com/SS_URLs.html and http://www.biterscripting.com/helppages/repro.html . I just tested this, and my output shows the following (I am posting only the beginning portion).

DEBUG: Found the following URLs.
http://static.filestube.com/files/styles/ft_home.2517.css
http://static.filestube.com/files/images/favicon.ico
http://groups.filestube.com
http://video.filestube.com
http://games.filestube.com
http://lyrics.filestube.com
http://software.filestube.com
http://filestube.com/account/login.html
http://www.filestube.com/javascript:showShortcutLinks()
http://www.filestube.com/javascript:showShortcutLinks()
http://www.filestube.com/account/login.html
http://www.filestube.com/account/register_choice.html
http://www.filestube.com/lists/downloads.html
http://www.filestube.com/lists/mp3.html
http://www.filestube.com/account/history.html
http://www.filestube.com/mp3more.html
http://www.filestube.com/advanced_search.html
http://www.filestube.com/payments.html
http://www.filestube.com/c/craig+robinson
http://www.filestube.com/c/crooked+houses
http://www.filestube.com/k/kenny+britt
http://www.filestube.com/d/discover+the+networks
http://www.filestube.com/f/find+my+family
http://www.filestube.com/j/jean+val+jean
http://www.filestube.com/r/recursion
http://www.filestube.com/t/troglodyte
http://www.filestube.com/w/warsaw+community+schools
http://www.filestube.com/j/joe+mauer
http://www.filestube.com/d/dreamhost
http://www.filestube.com/j/jean+valjean
http://www.filestube.com/trends/day.html
http://www.filestube.com/s/sex+3gp
http://www.filestube.com/h/hentai
http://www.filestube.com/l/lady+sonia
http://www.filestube.com/f/filetube
http://www.filestube.com/m/mixed+wrestling
http://www.filestube.com/n/nudist
http://www.filestube.com/a/abby+winters
http://www.filestube.com/t/them+crooked+vultures
http://www.filestube.com/l/lady+gaga+bad+romance
http://www.filestube.com/s/sean+cody
http://www.filestube.com/n/naughty+america
http://www.filestube.com/e/esperanza+gomez
http://www.filestube.com/trends/month.html
http://www.filestube.com/e61458644ccd4c0d03e9
http://www.filestube.com/8e5850c552a0137603e9
http://www.filestube.com/a3642a10e552a9eb03e9
http://www.filestube.com/60acaa5e533d650e03e9
http://www.filestube.com/b7a46520fd97749703e9
http://www.filestube.com/1ad4fd36c1d0e88f03e9
http://www.filestube.com/303558c1b0bc64aa03e9
http://www.filestube.com/a49839140f1228d903e9
http://www.filestube.com/e2ce6f0c8d93ecd103e9
http://www.filestube.com/a93cddbb80c17ae403e9
http://www.filestube.com/59885124ebbd9e3403e9
http://www.filestube.com/1e34da2a788fd59303e9/details.html
http://www.filestube.com/8894491af7ac96f603e9/details.html
http://www.filestube.com/525afcb0541b127803e9/details.html
http://www.filestube.com/c4b323e66040910e03e9/details.html
http://www.filestube.com/7e179dcadf38d1ca03e9/details.html
http://www.filestube.com/e8dc3eb94f26342403e9/details.html
http://www.filestube.com/4b6bdcb64dc65c2f03e9/details.html
http://www.filestube.com/dde7fc76e9e3786803e9/details.html
http://www.filestube.com/67b28d3304126c6a03e9/details.html
http://www.filestube.com/f1f7b0daa0b7739303e9/details.html
http://www.filestube.com/6097111fddb92e9a03e9/details.html
http://www.filestube.com/58872f60eceb11c903e9
http://www.filestube.com/a591bded5dd676ba03ea/details.html
http://www.filestube.com/2cb163c706e798db03e9/details.html
http://www.filestube.com/ba3244005b8c31a303e9/details.html
http://www.filestube.com/00e477651400dc4003e9/details.html
http://www.filestube.com/ce53f8d72761974b03e9/details.html
http://www.filestube.com/949ca1051f01d2df03e9
http://www.filestube.com/c9d5a79247a35c1903e9/details.html
http://www.filestube.com/2013ac13ede8e7c403ea
http://www.filestube.com/334b63eec25858c203ea/details.html
http://www.filestube.com/d5f664674593452003e9
http://www.filestube.com/e23eaee0831dd70003e9/details.html
http://www.filestube.com/6c9a336c67c110cc03ea/details.html
http://www.filestube.com/fada9cc21031be4d03ea/details.html
http://www.filestube.com/d174bdc19c41da1d03ea/details.html
http://www.filestube.com/6bf1fe87d22ebd2c03ea/details.html
http://www.filestube.com/fd93b7f9e9479a2a03ea/details.html
http://www.filestube.com/e48f51b45e4f333e03ea/details.html
http://www.filestube.com/47027b30f96aeebd03ea/details.html
http://www.filestube.com/170c57685c778f7103e9
http://www.filestube.com/d5d6c0814256c3e403e9
http://www.filestube.com/743f523e9be56eba03e9/details.html
http://www.filestube.com/02e5ab72415f90e003e9
http://www.filestube.com/3b1a7c2542e70c8703e9
http://www.filestube.com/5ff8dca055da511d03e9
http://www.filestube.com/0109c6e162677ffc03e9/details.html
http://www.filestube.com/850f24810c7bc90f03e9
http://www.filestube.com/0d921be4a849589e03e9
http://www.filestube.com/3db74b31dc1eb5a703e9
http://www.filestube.com/9e23df226a8bf9ad03e9
http://www.filestube.com/7063cf544566e70d03e9
http://www.filestube.com/b518526bf706827903e9
http://www.filestube.com/3286239d5917d13f03ea
http://www.filestube.com/1954e2c5393effdd03e9
http://www.filestube.com/about.html
http://blog.filestube.com
http://www.filestube.com/privacy.html
http://www.filestube.com/submit.html
http://www.filestube.com/terms.html
http://www.filestube.com/dmca.html
http://www.filestube.com/contact.html
http://www.filestube.com/api.html
http://www.hollywire.com
html,body{margin:0;padding:0}body{font:12px Tahoma,Helvetica,sans-serif;background:#fff}h2{font-size:18px;margin-top:10px}a{color:#1549c1}a:hover{text-decoration:none}img{border:0px}.logo{clear:both;margin:0px
auto;padding:20px
0;display:block;width:195px}.content{text-align:center}.content
ul{margin:10px
auto;width:500px;list-style:square;text-align:left}.content
li{padding:3px
.
.
.
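
The list above contains duplicates, javascript: pseudo-links, and static assets such as the stylesheet whose contents were printed. Back in PHP, a list like this could be cleaned before fetching with something like the sketch below. The function name filterCrawlableUrls and the list of skipped extensions are assumptions made for this example.

```php
<?php
// Hypothetical helper: keep only unique http(s) page URLs worth fetching.
function filterCrawlableUrls(array $urls): array
{
    // Assumed list of asset extensions to skip; adjust as needed.
    $skipExtensions = ['css', 'ico', 'js', 'png', 'gif', 'jpg'];
    $kept = [];

    foreach ($urls as $url) {
        $url   = trim($url);
        $parts = parse_url($url);

        // Drop anything that is not an absolute http or https URL.
        if ($parts === false || !in_array($parts['scheme'] ?? '', ['http', 'https'], true)) {
            continue;
        }
        // Drop javascript: pseudo-links that were resolved against the site root.
        if (stripos($url, 'javascript:') !== false) {
            continue;
        }
        // Drop static assets based on the path's file extension.
        $extension = strtolower(pathinfo($parts['path'] ?? '', PATHINFO_EXTENSION));
        if (in_array($extension, $skipExtensions, true)) {
            continue;
        }

        $kept[$url] = true;  // array keys remove duplicates while keeping order
    }

    return array_keys($kept);
}
```

The result keeps the order of first appearance, so the remaining URLs can be passed straight to the fetching loop.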
 