img src preg_match_all regex problem

Discussion in 'Scripts, 3rd Party Apps, and Programming' started by learning_brain, May 22, 2010.

  1. learning_brain

    I have searched around everywhere for the right way of doing this.....

    I have an image search engine and as part of it, I have a page that can extract and store img sources.

    The problem is that the regex I'm using is not always reliable.

    1) issues with relative paths (doesn't include complete path)
    2) there is also an issue with links such as "LBPC-Style/site_icons/profile.png".

    PHP:
    <?php
    //define url to search
    $url = $_POST['url'];
    //get contents
    $contents = file_get_contents($url);

    //set matching pattern for img tag source
    $pattern = '/src=[\"\']?([^\"\']?.*(png|jpg|gif))[\"\']?/i';

    //match all img tag source
    preg_match_all($pattern, $contents, $images);

    //count number of items in array
    $imageCount = count($images[1]);

    //loop through each item
    for ($i = 0; $i < $imageCount; $i++) {

        echo "<br/>" . $images[1][$i];

        echo '<img src="' . $images[1][$i] . '" width="100"/>';

        $insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
                             GetSQLValueString($images[1][$i], "text"),
                             GetSQLValueString($images[1][$i], "text"));

        $Result1 = mysql_query($insertSQL, $freewebhost) or die(mysql_error());
    }
    ?>
    Images are always a problem because of how the tag can be constructed.

    i.e. <img src="http://www.mysite.com/image.png"/> would be ideal but...

    <img alt="description" src="image.png" width="100"/> is not.... see what I mean?

    Is there a better way of doing this?
     
  2. descalzo

    1. Use regexps to get the domain and the directory paths from $url. ie if $url is 'http://www.example.com/stuff/page.html' you want
    $domain = 'http://www.example.com' and
    $directory = 'http://www.example.com/stuff/'

    2. Loop through your link matches and test:
    a. Against '/^https?:\/\//' ... if it matches, it is the format you want.
    b. Then against '/^\//' ... if it matches, the link is of the format '/dirpath/moon.jpg' ... concatenate with $domain to get your full url
    c. The rest should be of the format 'sun.gif' or 'dirpath/img/relative.png'. Concatenated with $directory, those should work (not tested).
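
The steps above can be sketched roughly as follows. The function name `toAbsolute` and the sample URLs are made up for illustration, and the leading-slash test is anchored with `^` so that it only catches links that start at the site root. This sketch does not deal with `../` or `./` segments.

```php
<?php
// Hypothetical helper illustrating cases (a), (b) and (c); a sketch, not a full resolver.
function toAbsolute(string $link, string $pageUrl): string
{
    // Split the page URL into "scheme://host" and the directory part of its path.
    if (!preg_match('%^(https?://[^/?#]+)(/(?:[^?#]*/)?)?%', $pageUrl, $m)) {
        return $link; // not an http(s) page URL; leave the link untouched
    }
    $domain    = $m[1];
    $directory = $domain . (isset($m[2]) && $m[2] !== '' ? $m[2] : '/');

    if (preg_match('%^https?://%i', $link)) {
        return $link;               // (a) already a full URL
    }
    if (preg_match('%^/%', $link)) {
        return $domain . $link;     // (b) path from the site root
    }
    return $directory . $link;      // (c) path relative to the page's directory
}

$page = 'http://www.example.com/stuff/page.html';
echo toAbsolute('/dirpath/moon.jpg', $page), "\n"; // http://www.example.com/dirpath/moon.jpg
echo toAbsolute('sun.gif', $page), "\n";           // http://www.example.com/stuff/sun.gif
```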

    I would guess that PHP should have a library that would parse an HTML page and pull out normalized links, but I am not sure. That would be the cleanest way.
     
    Last edited: May 22, 2010
  3. misson

    The closest I can think of is realpath(), but that only works on local files (the comments for realpath() have some functions that work on URLs).

    SimpleXML offers an alternate way of getting the source URLs, but won't resolve relative references.

    PHP:
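    // the third argument says $url is a URL to load rather than XML data;
    // the page must be well-formed XML (e.g. XHTML) for this to parse
    $doc = new SimpleXMLElement($url, 0, true);
    $imageSrcs = $doc->xpath("//img/@src");
    foreach ($imageSrcs as $src) {
        // each result is the attribute node itself; cast it to get the URL string
        process((string) $src);
    }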
    Since image elements aren't the only ones with a src attribute, a better regexp might be: /<img[^>]*src=(['"]?)([^'">]*)\1/i (the value group excludes quotes so that it stops at the closing quote).
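
For pages that are not well-formed XML, PHP's DOM extension can load ordinary HTML. Below is a minimal sketch, assuming the DOM extension is available; the function name `extractImageSources` and the sample markup are invented for illustration. Like the SimpleXML version, it returns the `src` values exactly as written, so relative references still need resolving afterwards.

```php
<?php
// Sketch: collect the src attribute of every <img> element in an HTML string.
function extractImageSources(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        if ($src !== '') {
            $sources[] = $src; // only <img> elements are visited, so <script src> is ignored
        }
    }
    return $sources;
}

$html = '<p><img alt="description" src="image.png" width="100"/>'
      . "<img src='/gb/images/photo.jpg'><script src=\"app.js\"></script></p>";
print_r(extractImageSources($html));
```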
     
    Last edited: May 23, 2010
  4. learning_brain

    Thanks both of you.

    I took descalzo's advice with the extraction of the url path (concatenating url host/dir.file) and have now got the following;

    PHP:
    <?php
    //define url to search
    $url = $_POST['url'];

    //split url
    preg_match('/((http|https|ftp):\/\/)?((.*?)\/)?((.*)\/)?(.*)?/', $url, $urlParts);

    //concatenate host and directory path
    $urlPath = $urlParts[3] . $urlParts[5];

    //get contents
    $contents = file_get_contents($url);

    //set matching pattern for img tag source
    $pattern = '/src=[\"\']?([^\"\']?.*(png|jpg|gif))[\"\']?/i';
    //match all img tag source
    preg_match_all($pattern, $contents, $images);

    //count number of items in array
    $imageCount = count($images[1]);

    //loop through each item
    for ($i = 0; $i < $imageCount; $i++) {

        $testPattern1 = '/^https?:\/\//';
        $testPattern2 = '/\//';

        if ($images[1][$i] = preg_match(??????????????)) {

            $imageURL = $images[1][$i];

            echo "<br/>" . $imageURL;
            echo '<img src="' . $imageURL . '" width="100"/>';

            $insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
                                 GetSQLValueString($imageURL, "text"),
                                 GetSQLValueString($imageURL, "text"));

            $Result1 = mysql_query($insertSQL, $freewebhost) or die(mysql_error());

        } elseif ($images[1][$i] = preg_match(??????????????)) {

            $imageURL = $urlPath . $images[1][$i];

            echo "<br/>" . $imageURL;
            echo '<img src="' . $imageURL . '" width="100"/>';

            $insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
                                 GetSQLValueString($imageURL, "text"),
                                 GetSQLValueString($imageURL, "text"));

            $Result1 = mysql_query($insertSQL, $freewebhost) or die(mysql_error());
        }
    }
    ?>
    The $urlPath I got more by good luck than good management, however it works.

    But as you can see, I'm having difficulty getting my head round the "if" construct...
     
    Last edited: May 23, 2010
  5. misson

    Don't set variables that are invariant inside a loop; you're just wasting cycles.

    You're repeating far too much code here. Convert relative URLs into absolute URLs. After that, the rest of the code is the same.

    PHP:
    ?><ol><?php
    foreach ($images[1] as $imageURL) {
        $imageURL = normalize($imageURL, $baseURL);
        ?><li><?php
            echo $imageURL;
            ?><img src="<?php echo $imageURL; ?>" alt="<?php ... ?>"/><?php
            if (! ImageIndex::add($imageURL, ...)) { // or SearchDB::addImage(...) or what-have-you
                // couldn't add image URL to database.
            }
        ?></li><?php
    }
    ?></ol><?php
    Don't use or die, and don't output database error messages to non-admin users.

    One thing you should do is write the URL conversion as a function or a class rather than doing everything inline. It will help you focus on the specific task of normalizing absolute and relative URLs.

    To test for an absolute URL, try %^((?:https?)://[^/]+)?(/?)(.*)% (you don't need to check for a scheme of "ftp"). If the first and second groups are empty, you've got a relative path. If only the first group is empty, you've got an absolute path. If the first group is not empty, you've got an absolute URL.
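
A small sketch of that test, using the same pattern. The function name `classifyURL` and the returned labels are hypothetical; only the regular expression comes from the paragraph above.

```php
<?php
// Sketch: classify an image reference by which capture groups come back empty.
function classifyURL(string $url): string
{
    preg_match('%^((?:https?)://[^/]+)?(/?)(.*)%', $url, $parts);
    if (!empty($parts[1])) {
        return 'absolute URL';   // scheme and host are present
    }
    if (!empty($parts[2])) {
        return 'absolute path';  // starts with "/", so it is relative to the site root
    }
    return 'relative path';      // relative to the current page's directory
}

foreach (['http://www.testsite.com/gb/images/image.jpg', '/gb/images/image.jpg', '../images/image.jpg', 'image.jpg'] as $src) {
    echo $src, ' => ', classifyURL($src), "\n";
}
```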
     
    Last edited: May 23, 2010
  6. learning_brain

    @misson - doing most of your suggestions now.

    OK, my url host/directory isn't working for all urls - only the one I tested.

    The capture groups (#1, #2, #3 etc.) depend on the number of directories/subdirectories, so if the url is only the root, I don't get what I need.

    Secondly, my abs/rel path test isn't working too well... :(

    PHP:
    <?php
    //define url to search
    $url = $_POST['url'];

    //split url
    preg_match('/((http|https|ftp):\/\/)?((.*?)\/)?((.*)\/)?(.*)?/', $url, $urlParts);

    //concatenate host and directory path
    $urlHostDir = $urlParts[1] . $urlParts[3] . $urlParts[5];

    echo "<br>Host and Directory: " . $urlHostDir . "<br/>";

    //get contents
    $contents = file_get_contents($url);

    //define regexp for img tag source
    $pattern = '/src=[\"\']?([^\"\']?.*(png|jpg|gif))[\"\']?/i';
    //match all img tag source
    preg_match_all($pattern, $contents, $images);

    //count number of items in array
    $imageCount = count($images[1]);

    //loop through each item
    for ($i = 0; $i < $imageCount; $i++) {

        //check if absolute or relative path
        echo "<br/>Testing: " . $images[1][$i];

        if (preg_match('/^http|https?:\/\//', $images[1][$i]))
        {
            echo "<br/>path is absolute";
            $completeImageURL = $images[1][$i];
        }
        else
        {
            echo "<br/>path is relative";
            // concatenate host, directory and relative path
            $completeImageURL = $urlHostDir . $images[1][$i];
        }

        //echo path and image
        echo "<br/>Path to save: " . $completeImageURL;
        echo '<br/><img src="' . $completeImageURL . '" width="100"/>';

        //insert into DB
        //$insertSQL = sprintf("INSERT INTO IMAGES (URL, KEYWORDS) VALUES (%s, %s)",
        //                       GetSQLValueString($completeImageURL, "text"),
        //                       GetSQLValueString($completeImageURL, "text"));

        //$Result1 = mysql_query($insertSQL, $freewebhost) or trigger_error('Query failed');
    }
    ?>
    Actually, after further testing, this is riddled with problems.

    Some paths are absolute with http://www.testsite.com/gb/images/image.jpg... fine

    but what about nasty relative ones...

    /gb/images/image.jpg
    ../images/image.jpg
    ./images/image.jpg

    or even just image.jpg

    all depending on where the file sits and how it has been coded. This obviously changes how many directory layers I need to concatenate.

    Nasty................
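
One likely reason the absolute/relative test in the code above misfires is operator precedence. In `/^http|https?:\/\//` the `|` splits the whole pattern in two, so it reads as "starts with http" OR "contains http:// or https:// anywhere". A short sketch comparing it with an anchored pattern follows; the sample file names are invented for illustration.

```php
<?php
$loose  = '/^http|https?:\/\//'; // pattern from the post above
$strict = '%^https?://%i';       // the scheme must appear at the very start

$samples = [
    'http://www.testsite.com/gb/images/image.jpg',
    'http_logo.png',                 // a relative file name that happens to start with "http"
    '/gb/images/image.jpg',
    'image.jpg',
];
foreach ($samples as $src) {
    // preg_match() returns 1 on a match and 0 otherwise
    printf("%-45s loose=%d strict=%d\n", $src, preg_match($loose, $src), preg_match($strict, $src));
}
```

With the loose pattern, `http_logo.png` is reported as absolute even though it is a relative file name; the anchored pattern treats it as relative.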
     
    Last edited: May 23, 2010
  7. misson

    Take a second look at the comments for realpath() for sample functions that resolve relative URLs. It's not that nasty. RFC 3986 § 5 even gives an algorithm. Here are two more examples:

    PHP:
    function resolveURL($url, $base) {
        preg_match('%^([^:/]+://[^/]+)([^?]*/)%', $base, $baseParts);
        // $baseParts[1] is the scheme & host; $baseParts[2] is a '/' terminated absolute path
        preg_match('%^((?:https?://[^/]+)?)(/?)(.*)%', $url, $urlParts);
        if (empty($urlParts[1])) {
            $urlParts[1] = $baseParts[1];
        }
        if (empty($urlParts[2])) {
            $urlParts[2] = $baseParts[2];
        }
        array_shift($urlParts);
        return implode('', $urlParts);
    }
    // or, based on parse_url()
    function resolveURL($url, $base) {
        $url = parse_url($url);
        if (! is_array($base)) {
            $base = parse_url($base);
        }
        foreach ($base as $name => $part) {
            if (!isset($url[$name])) {
                $url[$name] = $base[$name];
            }
        }
        return "$url[scheme]://$url[host]$url[path]";
    }
    Note that the first ignores query strings in the base URL and treats query strings in the URL to resolve as part of the path. The second will copy over any query string that's in the base URL.

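    If you want to remove dot segments, replace any occurrence matching %/(\.|[^/]+/+\.\.)(/|$)% in the path segment with its second group (the trailing "/" or nothing) after you've added the missing URL components. Repeat until nothing changes if several ".." segments can occur in a row.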
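
A single pass of that pattern can leave segments behind when several `..` occur in a row (for example `/a/b/../../c`), so one alternative is to walk the path segment by segment, roughly along the lines of RFC 3986 § 5.2.4. This is only a sketch: `removeDotSegments` is a name chosen here, and unlike the RFC algorithm it also collapses empty segments (`//`).

```php
<?php
// Sketch: drop "." segments and let each ".." cancel the segment before it.
function removeDotSegments(string $path): string
{
    $absolute = ($path !== '' && $path[0] === '/');
    $segments = explode('/', $path);
    $last     = end($segments);   // remembered to decide about a trailing slash
    $output   = [];

    foreach ($segments as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;             // nothing to keep
        }
        if ($segment === '..') {
            array_pop($output);   // step up one level (no-op when already at the top)
            continue;
        }
        $output[] = $segment;
    }

    $result = ($absolute ? '/' : '') . implode('/', $output);
    // A path that ended in "/", "." or ".." names a directory, so keep the trailing slash.
    if (in_array($last, ['', '.', '..'], true) && $result !== '' && substr($result, -1) !== '/') {
        $result .= '/';
    }
    return $result;
}

echo removeDotSegments('/gb/pages/../images/./image.jpg'), "\n"; // /gb/images/image.jpg
echo removeDotSegments('/a/b/../../c'), "\n";                    // /c
```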

    If the HTTP extension is installed, it turns out all you need is http_build_url():
    PHP:
    function resolveURL($url, $base) {
        return http_build_url($base, $url, HTTP_URL_JOIN_PATH);
    }
    Make sure you check for the <base> tag when setting the URL base.
    PHP:
    // look for the <base> tag in the downloaded page ($contents), not in the URL itself
    if (preg_match('%<base\s+href=[\'"]?([^\'">]*)%i', $contents, $matches)) {
        // in case the base tag doesn't have an absolute URL, resolve it
        $base = resolveURL($matches[1], $url);
    } else {
        $base = $url;
    }
    // remove trailing non-directory component, if any. Not strictly necessary
    $base = preg_replace('%[^/]*$%', '', $base);
     
  8. learning_brain

    Thanks misson - that's cracked it - works like a dream.
     
  9. misson

    Be sure you follow the advice in my sig. In particular, the sample resolveURL() functions don't handle query strings properly. An implementation that's closer to RFC 3986's algorithm is:

    PHP:
    function sortBy(array $toSort, array $order) {
        $order = array_intersect_key($order, $toSort);
        return array_merge($order, $toSort);
    }

    function build_url($parts) {
        foreach (array('host' => '://', 'query' => '?', 'fragment' => '#') as $part => $pre) {
            if (isset($parts[$part])) {
                $parts[$part] = $pre . $parts[$part];
            }
        }
        return implode('', $parts);
    }

    function resolveURL($url, $base) {
        $urlParts = parse_url($url);
        if (! is_array($base)) {
            $base = parse_url($base);
        }
        $base['path'] = preg_replace('%[^/]+$%', '', $base['path']);
        foreach ($base as $name => $part) {
            if (!isset($urlParts[$name])) {
                $urlParts[$name] = $base[$name];
            } else {
                break;
            }
        }
        if ($urlParts['path'][0] != '/') {
            $urlParts['path'] = $base['path'] . $urlParts['path'];
        }
        $urlParts['path'] = preg_replace('%/(?:\.|[^/]+/+\.\.)(/|$)%', '$1', $urlParts['path']);
        return build_url(sortBy($urlParts, $base));
    }
     
