img src preg_match_all regex problem

learning_brain

I have searched around everywhere for the right way of doing this.....

I have an image search engine and as part of it, I have a page that can extract and store img sources.

The problem is that the regex I'm using is not always reliable.

1) issues with relative paths (the result doesn't include the complete path)
2) there is also an issue with links such as "LBPC-Style/site_icons/profile.png".

PHP:
<?php
//define url to search
$url = $_POST['url'];
//get contents
$contents = file_get_contents($url);

//set matching pattern for img tag source
$pattern = '/src=[\"\']?([^\"\']?.*(png|jpg|gif))[\"\']?/i';

//match all img tag source
preg_match_all($pattern, $contents, $images);


//count number of items in array
$imageCount = count($images[1]);

//loop through each item
for ($i=0; $i<$imageCount; $i++){

    echo "<br/>".$images[1][$i];
    
    echo '<img src="'.$images[1][$i].'" width="100"/>';
    
    $insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
                       GetSQLValueString($images[1][$i], "text"),
                       GetSQLValueString($images[1][$i], "text"));

      $Result1 = mysql_query($insertSQL, $freewebhost) or die(mysql_error());
}
?>

Image tags are always a problem because the src attribute can appear anywhere in the tag and may hold a relative path.

i.e. <img src="http://www.mysite.com/image.png"/> would be ideal but...

<img alt="description" src="image.png" width="100"/> is not.... see what I mean?

Is there a better way of doing this?
 

descalzo

1. Use regexps to get the domain and the directory paths from $url. ie if $url is 'http://www.example.com/stuff/page.html' you want
$domain = 'http://www.example.com' and
$directory = 'http://www.example.com/stuff/'

2. Loop through your link matches and test:
a. Against '/^https?:\/\//' ... if it matches, it is already in the format you want.
b. Then against '/^\//' ... if it matches, the link is of the format '/dirpath/moon.jpg' ... concatenate with $domain to get your full url
c. The rest should be of the format 'sun.gif' or 'dirpath/img/relative.png'. Concatenating those with $directory should work (not tested).
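
The two steps above might be sketched as follows. This is only an illustration: toAbsoluteURL() is a hypothetical name, the page URL is assumed to be an absolute http(s) URL, and neither dot segments (./ and ../) nor protocol-relative links (//host/img.png) are handled.

```php
<?php
// Hypothetical helper illustrating the steps above; not a complete resolver.
function toAbsoluteURL($link, $pageURL) {
    // Step 1: split the page URL into scheme+host and everything up to '?' or '#'.
    if (!preg_match('%^(https?://[^/?#]+)([^?#]*)%i', $pageURL, $m)) {
        return false; // assumption: the page URL must be an absolute http(s) URL
    }
    $domain = $m[1];
    // Drop the trailing file name, keeping the path up to the last '/'.
    $dirPath = preg_replace('%[^/]*$%', '', $m[2]);
    if ($dirPath === '') {
        $dirPath = '/'; // the page URL had no path at all
    }
    $directory = $domain . $dirPath;

    // Step 2a: the link is already a full URL.
    if (preg_match('%^https?://%i', $link)) {
        return $link;
    }
    // Step 2b: the link starts with '/', so it is relative to the site root.
    if (preg_match('%^/%', $link)) {
        return $domain . $link;
    }
    // Step 2c: anything else is relative to the page's directory.
    return $directory . $link;
}

$page = 'http://www.example.com/stuff/page.html';
echo toAbsoluteURL('sun.gif', $page), "\n";
echo toAbsoluteURL('/dirpath/moon.jpg', $page), "\n";
```

Anchoring the second test with ^ matters here: an unanchored slash test would also match 'dirpath/img/relative.png' and send it down the wrong branch.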

I would guess that PHP should have a library that would parse an HTML page and pull out normalized links, but I am not sure. That would be the cleanest way.
 

misson

descalzo said:
I would guess that PHP should have a library that would parse an HTML page and pull out normalized links, but I am not sure.
The closest I can think of is realpath(), but that only works on local files (the comments for realpath() have some functions that work on URLs).

SimpleXML offers an alternate way of getting the source URLs, but won't resolve relative references.

PHP:
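// SimpleXML only loads well-formed XML (e.g. XHTML); ordinary HTML will usually fail to parse
$doc = new SimpleXMLElement($url, 0, true);
$imageSrcs = $doc->xpath("//img/@src");
foreach ($imageSrcs as $src) {
    // each result is the src attribute itself; cast it to get the URL string
    process((string) $src);
}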

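Since image elements aren't the only ones with a src attribute, a better regexp might be: /<img\b[^>]*\ssrc=(['"]?)([^'">\s]+)\1/i. The second group holds the URL. Excluding quotes and whitespace from it stops the match from running on into the attributes that follow.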
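
For pages that aren't well-formed XML, PHP's DOM extension can parse ordinary HTML. A minimal sketch, assuming the DOM extension is available. extractImageSources() is a hypothetical name, and like the SimpleXML version it returns the src values as written, without resolving relative references.

```php
<?php
// Hypothetical helper: collect the src attribute of every <img> in an HTML string.
function extractImageSources($html) {
    $sources = array();
    if (trim($html) === '') {
        return $sources; // nothing to parse
    }
    // Real-world markup is often invalid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Only <img> elements are visited, so <script src> and <iframe src> are ignored.
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        if ($src !== '') {
            $sources[] = $src;
        }
    }
    return $sources;
}

$html = '<p><img alt="description" src="image.png" width="100"/>'
      . '<img src=\'http://www.mysite.com/image.png\'><script src="app.js"></script></p>';
print_r(extractImageSources($html));
```

Because the parser works on elements and attributes, the position of src within the tag and the quoting style no longer matter.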
 

learning_brain

Thanks both of you.

I took descalzo's advice with the extraction of the url path (concatenating url host/dir.file) and have now got the following;

PHP:
<?php
//define url to search
$url = $_POST['url'];

//split url
preg_match('/((http|https|ftp):\/\/)?((.*?)\/)?((.*)\/)?(.*)?/',$url, $urlParts);

//concatenate host and directory path
$urlPath = $urlParts[3].$urlParts[5];

//get contents
$contents = file_get_contents($url);

//set matching pattern for img tag source
$pattern = '/src=[\"\']?([^\"\']?.*(png|jpg|gif))[\"\']?/i';
//match all img tag source
preg_match_all($pattern, $contents, $images);

//count number of items in array
$imageCount = count($images[1]);

//loop through each item
for ($i=0; $i<$imageCount; $i++){

	$testPattern1 = '/^https?:\/\//';
	$testPattern2 = '/\//';

	if ($images[1][$i] = preg_match(??????????????)){
	
		$imageURL = $images[1][$i];

		echo "<br/>".$imageURL;
		echo '<img src="'.$imageURL.'" width="100"/>';
		
		$insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
						   GetSQLValueString($imageURL, "text"),
						   GetSQLValueString($imageURL, "text"));
		
		$Result1 = mysql_query($insertSQL, $freewebhost) or die(mysql_error());
		
	} elseif ($images[1][$i] = preg_match(??????????????)){
		
		$imageURL = $urlPath.$images[1][$i]
		
		echo "<br/>".$imageURL;
		echo '<img src="'.$imageURL.'" width="100"/>';
		
		$insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
						   GetSQLValueString($imageURL, "text"),
						   GetSQLValueString($imageURL, "text"));
		
		$Result1 = mysql_query($insertSQL, $freewebhost) or die(mysql_error());
	}
}
?>

The $urlPath I got more by good luck than good management, however it works.

But as you can see, I'm having difficulty getting my head round the "if" construct...
 

misson

PHP:
//loop through each item
for ($i=0; $i<$imageCount; $i++){

	$testPattern1 = '/^https?:\/\//';
	$testPattern2 = '/\//';

Don't set variables that are invariant inside a loop; you're just wasting cycles.

PHP:
	if ($images[1][$i] = preg_match(??????????????)){
	
		$imageURL = $images[1][$i];

		echo "<br/>".$imageURL;
		echo '<img src="'.$imageURL.'" width="100"/>';
		
		$insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
						   GetSQLValueString($imageURL, "text"),
						   GetSQLValueString($imageURL, "text"));
		
		$Result1 = mysql_query($insertSQL, $freewebhost) or die(mysql_error());
		
	} elseif ($images[1][$i] = preg_match(??????????????)){
		
		$imageURL = $urlPath.$images[1][$i]
		
		echo "<br/>".$imageURL;
		echo '<img src="'.$imageURL.'" width="100"/>';
		
		$insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
						   GetSQLValueString($imageURL, "text"),
						   GetSQLValueString($imageURL, "text"));
		
		$Result1 = mysql_query($insertSQL, $freewebhost) or die(mysql_error());
	}
}
?>
You're repeating far too much code here. Convert relative URLs into absolute URLs. After that, the rest of the code is the same.

PHP:
?><ol><?php
foreach ($images[1] as $imageURL) {
    $imageURL = normalize($imageURL, $baseURL);
    ?><li><?php 
        echo $imageURL; 
        ?><img src="<?php echo $imageURL; ?>" alt="<?php ... ?>"/><?php
        if (! ImageIndex::add($imageURL, ...)) { // or SearchDB::addImage(...) or what-have-you
            // couldn't add image URL to database.
        }
    ?></li><?php
}
?></ol><?php

Don't use "or die", and don't output database error messages to non-admin users.

learning_brain said:
But as you can see, I'm having difficulty getting my head round the "if" construct...
One thing you should do is write the URL conversion as a function or a class rather than doing everything inline. It will help you focus on the specific task of normalizing absolute and relative URLs.

To test for an absolute URL, try %^((?:https?)://[^/]+)?(/?)(.*)% (you don't need to check for a scheme of "ftp"). If the first and second groups are empty, you've got a relative path. If only the first group is empty, you've got an absolute path. If the first group isn't empty, you've got an absolute URL.
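
As a rough illustration of reading those groups, here is a small sketch. classifyURL() and its return labels are made up for the example. It checks the first group before the second, so an absolute URL with no path (such as http://www.testsite.com) is still reported as absolute.

```php
<?php
// Hypothetical helper: label a link by inspecting the groups of the pattern above.
function classifyURL($url) {
    // Group 3 always takes part in the match, so groups 1 and 2 are always
    // present in $parts (as empty strings when they matched nothing).
    preg_match('%^((?:https?)://[^/]+)?(/?)(.*)%', $url, $parts);
    if ($parts[1] !== '') {
        return 'absolute URL';   // scheme and host are present
    }
    if ($parts[2] !== '') {
        return 'absolute path';  // no host, but a leading '/'
    }
    return 'relative path';      // neither host nor leading '/'
}

foreach (array('http://www.testsite.com/gb/images/image.jpg',
               '/gb/images/image.jpg',
               '../images/image.jpg',
               'image.jpg') as $link) {
    echo $link, ' => ', classifyURL($link), "\n";
}
```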
 

learning_brain

@misson - doing most of your suggestions now.

OK, my url host/directory isn't working for all urls - only the one I tested.

Which capture groups (#1, #2, #3 etc) hold what depends on the number of directories/subdirectories, so in this case, if the url is only the root, I don't get what I need.

2ndly, my abs/rel path test ain't working too well... :(

PHP:
<?php
//define url to search
	$url = $_POST['url'];

//split url
	preg_match('/((http|https|ftp):\/\/)?((.*?)\/)?((.*)\/)?(.*)?/',$url, $urlParts);

//concatenate host and directory path
	$urlHostDir = $urlParts[1].$urlParts[3].$urlParts[5];

	echo "<br>Host and Directory: ".$urlHostDir."<br/>";

//get contents
	$contents = file_get_contents($url);

//define regexp for img tag source
	$pattern = '/src=[\"\']?([^\"\']?.*(png|jpg|gif))[\"\']?/i';
//match all img tag source
	preg_match_all($pattern, $contents, $images);

//count number of items in array
	$imageCount = count($images[1]);

//loop through each item
	for ($i=0; $i<$imageCount; $i++){
	
//check if absolute or relative path

		echo "<br/>Testing: ".$images[1][$i];
	
		if(preg_match('/^http|https?:\/\//', $images[1][$i]))
		{
			echo "<br/>path is absolute";
			$completeImageURL = $images[1][$i];
		}
		else
		{
			echo "<br/>path is relative";
			// concatenate host, directory and relative path
			$completeImageURL = $urlHostDir.$images[1][$i];	
		}
//echo path and image
		echo "<br/>Path to save: ".$completeImageURL;
		echo '<br/><img src="'.$completeImageURL.'" width="100"/>';
		
//insert into DB
		//$insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
		//					   GetSQLValueString($completeImageURL, "text"),
		//					   GetSQLValueString($completeImageURL, "text"));
			
		//$Result1 = mysql_query($insertSQL, $freewebhost) or trigger_error('Query failed');
	}
?>

Actually, after further testing, this is riddled with problems.

Some paths are absolute with http://www.testsite.com/gb/images/image.jpg... fine

but what about nasty relative ones...

/gb/images/image.jpg
../images/image.jpg
./images/image.jpg

or even just image.jpg

all depending on where the file sits and how it has been coded. This obviously changes how many directory layers I need to concatenate.

Nasty................
 

misson

Take a second look at the comments for realpath() for sample functions that resolve relative URLs. It's not that nasty. RFC 3986 § 5 even gives an algorithm. Here are two more examples:

PHP:
function resolveURL($url, $base) {
    preg_match('%^([^:/]+://[^/]+)([^?]*/)%', $base, $baseParts);
    // $baseParts[1] is the scheme & host; $baseParts[2] is a '/' terminated absolute path
    preg_match('%^((?:https?://[^/]+)?)(/?)(.*)%', $url, $urlParts);
    if (empty($urlParts[1])) {
        $urlParts[1] = $baseParts[1];
    }
    if (empty($urlParts[2])) {
        $urlParts[2] = $baseParts[2];
    }
    array_shift($urlParts);
    return implode('', $urlParts);
}
// or, based on parse_url()
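function resolveURL($url, $base) {
    $url = parse_url($url);
    if (! is_array($base)) {
        $base = parse_url($base);
    }
    // copy any component (scheme, host, ...) missing from $url
    foreach ($base as $name => $part) {
        if (!isset($url[$name])) {
            $url[$name] = $part;
        }
    }
    if (!isset($url['path'])) {
        $url['path'] = '/';
    } elseif (substr($url['path'], 0, 1) != '/') {
        // relative path: prepend the directory part of the base path
        $baseDir = isset($base['path']) ? preg_replace('%[^/]*$%', '', $base['path']) : '/';
        $url['path'] = $baseDir . $url['path'];
    }
    return "$url[scheme]://$url[host]$url[path]";
}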
Note that the first ignores query strings in the base URL and treats query strings in the URL to resolve as part of the path. The second will copy over any query string that's in the base URL.

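If you want to remove dot segments, replace every occurrence of %/(?:\.|[^/]+/+\.\.)(/|$)% in the path segment with '$1' (the trailing slash, if there is one) after you've added the missing URL components. Repeat until nothing more matches, since collapsing one '..' can expose another.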
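
Nested parent references such as /a/b/../../c need more than one regexp pass, so an alternative is to walk the path one segment at a time. The sketch below swaps the regexp for a segment stack, loosely following remove_dot_segments in RFC 3986 § 5.2.4. removeDotSegments() is a hypothetical name, and the sketch assumes the path is already absolute (it starts with '/').

```php
<?php
// Hypothetical helper: strip '.' and '..' segments from an absolute path.
function removeDotSegments($path) {
    $segments = explode('/', $path);   // '/a/b' becomes array('', 'a', 'b')
    $last = count($segments) - 1;
    $out = array();
    foreach ($segments as $i => $segment) {
        if ($segment === '.' || $segment === '..') {
            // '..' drops the previous segment, but never the leading empty
            // one that stands for the root '/'.
            if ($segment === '..' && count($out) > 1) {
                array_pop($out);
            }
            // A trailing '.' or '..' leaves a directory, so keep the final '/'.
            if ($i === $last) {
                $out[] = '';
            }
        } else {
            $out[] = $segment;
        }
    }
    return implode('/', $out);
}

echo removeDotSegments('/gb/page/../images/image.jpg'), "\n";
echo removeDotSegments('/gb/./images/image.jpg'), "\n";
```

A '..' that would climb above the root is simply dropped, which matches how the RFC treats it.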

If the HTTP extension is installed, it turns out all you need is http_build_url():
PHP:
function resolveURL($url, $base) {
    return http_build_url($base, $url, HTTP_URL_JOIN_PATH);
}

Make sure you check for the <base> tag when setting the URL base.
PHP:
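// $contents is the page's HTML, e.g. as fetched with file_get_contents($url)
if (preg_match('%<base\s+href=[\'"]?([^\'">]*)%i', $contents, $matches)) {
    // in case the base tag doesn't have an absolute URL, resolve it
    $base = resolveURL($matches[1], $url);
} else {
    $base = $url;
}
// remove trailing non-directory component, if any. Not strictly necessary.
// Note: this assumes $base has a path; a bare 'http://host' would be cut down to 'http://'
$base = preg_replace('%[^/]*$%', '', $base);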
 

misson

Be sure you follow the advice in my sig. Specifically,
misson said:
Any posted code is intended as illustrative example, rather than a solution to your problem to be copied without alteration. Study it to learn how to write your own solution.
In particular, the sample resolveURL() functions don't handle query strings properly. An implementation that's closer to RFC 3986's algorithm is:

PHP:
function sortBy(array $toSort, array $order) {
    $order = array_intersect_key($order, $toSort);
    return array_merge($order, $toSort);
}

function build_url($parts) {
    foreach (array('host' => '://', 'query' => '?', 'fragment' => '#') as $part => $pre) {
        if (isset($parts[$part])) {
            $parts[$part] = $pre . $parts[$part];
        }
    }
    return implode('', $parts);
}

function resolveURL($url, $base) { 
    $urlParts = parse_url($url); 
    if (! is_array($base)) { 
        $base = parse_url($base);
    }
    $base['path'] = preg_replace('%[^/]+$%', '', $base['path']);
    foreach ($base as $name => $part) { 
        if (!isset($urlParts[$name])) { 
            $urlParts[$name] = $base[$name]; 
        } else {
            break;
        }
    }
    if ($urlParts['path'][0] != '/') {
        $urlParts['path'] = $base['path'] . $urlParts['path'];
    }
    $urlParts['path'] = preg_replace('%/(?:\.|[^/]+/+\.\.)(/|$)%', '$1', $urlParts['path']);
    return build_url(sortBy($urlParts, $base));
}
 