curl_close() is similar in many ways to mysql_close(): curl_close() is for when you no longer need to fetch resources using a curl session, just as mysql_close() is for when you're done with a MySQL connection. The counterpart to curl_close() is curl_init() (like mysql_close()/mysql_connect()), and the number of calls to curl_close() must be no greater than the number of calls to curl_init(). You can reuse a curl session (as you can reuse MySQL connections) by simply calling curl_setopt() to set a new URL (along with any other curl options), followed by curl_exec(). Also, when a curl session is garbage collected, it gets closed. As a consequence, curl_close() doesn't necessarily gain you much. If your script has stages where it doesn't need to fetch anything, closing the curl session early may help; if you only have one curl session or need it throughout the script, there isn't much benefit.
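A sketch of that reuse pattern (the file:// URLs to two temp files are stand-ins for real http:// pages):

```php
<?php
// Stand-ins for two remote pages.
$pageA = tempnam(sys_get_temp_dir(), 'crawl');
$pageB = tempnam(sys_get_temp_dir(), 'crawl');
file_put_contents($pageA, 'first page');
file_put_contents($pageB, 'second page');

// Open one curl session and reuse it for several fetches.
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body rather than printing it

curl_setopt($ch, CURLOPT_URL, 'file://' . $pageA);
$first = curl_exec($ch);

// Reuse the same session: just point it at a new URL and exec again.
curl_setopt($ch, CURLOPT_URL, 'file://' . $pageB);
$second = curl_exec($ch);

curl_close($ch); // only needed once, when you're done fetching
echo $first, "\n", $second, "\n";
```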
The World Wide Web forms a graph: resources (anything with a URL) are nodes, and links are edges. Resources that can't contain anchors (such as images) are leaves. You want to traverse a portion of this graph, which leads to two classic algorithms: breadth-first search (BFS) and depth-first search (DFS). In the former, you process all nodes at a given distance from the starting node before processing nodes further out; in the latter, you fully process one branch before moving on to the next. The two are very similar. Here's an outline that covers both:
- Put the root node in the list N
- While there's a node left in list N:
  - remove the next node and store it as the current node
  - (preorder) process the current node
  - add each child of the current node to the list N
  - (postorder) process the current node
The main difference between a BFS and a DFS is the data structure used to hold the list of nodes to process: a BFS uses a queue (first in, first out) and a DFS uses a stack (last in, first out). PHP doesn't have dedicated queue and stack types in its core (SPL provides SplQueue and SplStack), but plain arrays work fine: array_push() paired with array_shift() gives a queue, and array_push() paired with array_pop() gives a stack. You can also implement DFS recursively, in which case the node list is the call stack. This gives you a fresh copy of every local variable for each node, which may or may not be desirable; those per-call copies make the call stack the main source of memory usage in a recursive DFS. To reduce memory usage, avoid the call stack and reuse whichever resources you can (such as the DOMDocument and curl session).
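A minimal sketch of both traversals over a made-up array-based tree, where the only difference is array_shift() (queue) versus array_pop() (stack):

```php
<?php
// Tree as adjacency list: node => children. Purely illustrative data.
$tree = [
    'root' => ['a', 'b'],
    'a'    => ['a1', 'a2'],
    'b'    => ['b1'],
    'a1'   => [], 'a2' => [], 'b1' => [],
];

function traverse(array $tree, string $root, bool $bfs): array {
    $order = [];
    $nodes = [$root];                      // the list N from the outline
    while (count($nodes) > 0) {
        // Queue (shift) gives BFS; stack (pop) gives DFS.
        $current = $bfs ? array_shift($nodes) : array_pop($nodes);
        $order[] = $current;               // (preorder) process current node
        foreach ($tree[$current] as $child) {
            array_push($nodes, $child);
        }
    }
    return $order;
}

echo implode(' ', traverse($tree, 'root', true)), "\n";   // BFS: root a b a1 a2 b1
echo implode(' ', traverse($tree, 'root', false)), "\n";  // DFS: root b b1 a a2 a1
```

Note that the stack version visits each node's children in reverse order; that's still a depth-first traversal.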
Another axis along which traversal algorithms differ is the point at which additional node processing (i.e., anything beyond adding its children to the list) is performed: before, during, or after adding the children, also called pre-, in-, and post-order traversal. Pre- and post-order are marked in the outline above. In your case, extracting image URLs and adding them to the DB counts as additional processing. You can reuse the curl session and DOMDocument with either pre- or post-order traversal.
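That additional-processing step, extracting image URLs from a fetched page, might look like this (the inline HTML string stands in for a body returned by curl_exec(), and the echo stands in for the DB insert):

```php
<?php
// Stand-in for a page body fetched via curl_exec().
$html = '<html><body><img src="/a.png"><p><img src="/b.jpg"></p></body></html>';

$doc = new DOMDocument();
// @ suppresses warnings about the imperfect markup found on real pages.
@$doc->loadHTML($html);

$urls = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    $urls[] = $img->getAttribute('src');
}
// Here you'd INSERT the URLs into your DB; echoing stands in for that.
echo implode("\n", $urls), "\n";
```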
The linked articles have more specific information about tree traversal.
If you can rely on allow_url_fopen being enabled, you don't need curl at all: simply pass the URL to DOMDocument::load().
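For instance (a local temp-file path stands in for a remote URL here; with allow_url_fopen on, an http:// URL works the same way — though note that DOMDocument::load() expects XML, so for real-world HTML pages you'd use DOMDocument::loadHTMLFile() instead):

```php
<?php
// A local XML file standing in for a remote resource.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, '<?xml version="1.0"?><page><img src="/a.png"/></page>');

$doc = new DOMDocument();
$doc->load($path); // with allow_url_fopen, this could be an http:// URL

echo $doc->getElementsByTagName('img')->item(0)->getAttribute('src'), "\n";
```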
One other issue is that BFS and DFS are designed to work on trees, which are connected graphs without cycles. Since the web most decidedly has cycles, you'll have to break them somehow; recording visited URLs in a set does the job. Here's the pseudocoded algorithm, updated to handle cycles:
Code:
set Seen to {}
add root to Seen and Nodes
while size(Nodes) > 0:
    remove next element of Nodes and store it as current
    [additional (preorder) processing of current]
    for each child of current:
        if child is not in Seen:
            add child to Seen and Nodes
    [additional (postorder) processing of current]
In PHP, you can use an associative array as a set of URLs. Mapping set operations to array operations:
- $item is in $Set := isset($Set[$item])
- add $item to $Set := $Set[$item] = true
- remove $item from $Set := unset($Set[$item])
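Putting the pseudocode and the array-as-set mapping together, a sketch over a made-up link graph that contains a cycle (b links back to root):

```php
<?php
// Link graph with a cycle. Illustrative data only.
$links = [
    'root' => ['a', 'b'],
    'a'    => ['b'],
    'b'    => ['root', 'c'],
    'c'    => [],
];

$seen    = ['root' => true];  // associative array used as the Seen set
$nodes   = ['root'];
$visited = [];

while (count($nodes) > 0) {
    $current = array_shift($nodes);       // queue -> BFS
    $visited[] = $current;                // [additional processing]
    foreach ($links[$current] as $child) {
        if (!isset($seen[$child])) {      // "$child is in $Seen" test
            $seen[$child] = true;         // add to Seen...
            array_push($nodes, $child);   // ...and to Nodes
        }
    }
}
echo implode(' ', $visited), "\n";        // each node visited exactly once
```

Without the Seen check, the root -> b -> root cycle would keep the loop running forever.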