Replacing all img src in loaded html

learning_brain

New Member
Messages
206
Reaction score
1
Points
0
I don't know if I'm going to get any sense here but here goes...

I have a class that analyses an image (test file here; refresh for another image) and censors anything that has too many flesh tones by returning a pixellated version.

I am trying to set up a webpage wrapper that loads the html and replaces every image src with the output from the class.

Currently, I'm loading the page using cURL into a variable $html.

What comes next I can't get my head round. After lots of reading up, there are mentions of DOM documents, XPath, str_replace, foreach, preg_replace, etc., but I can't work out how to apply them to this situation.

The closest topic I can find is here but again, I can't get my head round using the existing image url to call a class and return a replacement image url (or the same one if it passes).

Any explanations in simple English would be much appreciated.

Thank you

Rich
 

misson

Community Paragon
Community Support
Messages
2,572
Reaction score
72
Points
48
Take a closer look at the "Extracting data from HTML" document you linked to. It does almost what you want; it iterates over all elements in a document with a certain tag, performing an operation on each. Where it differs is in processing all anchor elements ("//a") rather than images ("//img"), and echoing the "href" attribute rather than setting the "src" attribute. Lastly, you want to output the original document as HTML after processing. If those hints aren't enough, let me know and I'll post a modified version of Kore Nordmann's code.

DOM is simply an OOP interface for documents; a DOM document is a document that supports the DOM interface. XPath expressions are like CSS selectors but with a different syntax (one that slightly resembles filesystem paths). If you use Firefox, there are a number of XPath add-ons (such as FirePath with Firebug) that let you play with XPath expressions.
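
The approach described above can be sketched roughly as follows. This is a minimal illustration, not Kore Nordmann's code: the sample markup and the censoredUrlFor() helper are hypothetical stand-ins for the fetched page and the image analysis class.

```php
<?php
// Hypothetical stand-in for the image analysis class: returns a replacement URL.
function censoredUrlFor(string $src): string
{
    return 'censor.php?img=' . urlencode($src);
}

// Made-up sample markup standing in for the page fetched with cURL.
$source = '<html><body><p><img src="a.jpg" alt="A"><img src="b.png" alt="B"></p></body></html>';

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($source);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select every <img> element and overwrite its src attribute.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//img') as $img) {
    $img->setAttribute('src', censoredUrlFor($img->getAttribute('src')));
}

// Serialise the modified document back to HTML.
echo $doc->saveHTML();
```

The only differences from the anchor example are the '//img' expression, the call to setAttribute, and the final saveHTML.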
 

learning_brain

As always - you come up with the goods, but scraping content was not my problem.

My issue was replacing the src attribute... but you did give me some hints, which I found very useful.

I am having a secondary problem though.

Current code...

PHP:
$target_url = $_GET['url'];

$oldSetting = libxml_use_internal_errors(true);
libxml_clear_errors();
$html = new DOMDocument();
$html->loadHTMLFile($target_url);
$xpath = new DOMXPath($html);
$imgtags = $xpath->query('//img');
foreach ($imgtags as $imgtag) {
    $absoluteImgSrc = url_to_absolute($target_url, $imgtag->getAttribute('src'));
    $analysedImage = new ImageAnalysis();
    $analysedImage->doAnalysis($absoluteImgSrc);
    $imgtag->setAttribute('src', $analysedImage->outputURL);
}
libxml_clear_errors();
libxml_use_internal_errors($oldSetting);

$root = $html->createElement('html');
$root = $html->appendChild($root);

$head = $html->createElement('head');
$head = $root->appendChild($head);

$title = $html->createElement('title');
$title = $head->appendChild($title);

$text = $html->createTextNode('This is the title');
$text = $title->appendChild($text);

echo $html->saveHTML();

Within the foreach, url_to_absolute() resolves each image's src against the page URL, giving an absolute URL that can be used from any location.

Now I know my class works... on my own server... as you can see.

However, it doesn't seem to work using another site and the same process....

Not sure about the necessity of creating elements and appending child nodes? I'm guessing this is to do with XML?

Need to think this through....

Rich
 

misson

$html already has an <html> element, and should already have <head> and <title> elements. After the loop and resetting libxml's error setting, all you should need is the echo $html->saveHTML();. The rest gets you malformed HTML.
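
To see why the extra createElement calls are unnecessary, here is a small check that counts the wrapper elements present straight after loading. It is only a sketch: the sample markup is made up, and the exact structure libxml builds can vary with the input.

```php
<?php
$doc = new DOMDocument();

// Deliberately incomplete markup: no <html>, <head> or <body> tags.
$doc->loadHTML('<title>Example</title><p>Hello</p>');

// Count the wrapper elements that exist after parsing.
$htmlCount = $doc->getElementsByTagName('html')->length;
$bodyCount = $doc->getElementsByTagName('body')->length;

printf("html elements: %d, body elements: %d\n", $htmlCount, $bodyCount);

// Nothing else needs to be created before output.
echo $doc->saveHTML();
```

Since the parser supplies the document skeleton itself, appending another html element by hand leaves the document with two of them.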

When you say "it doesn't work", how exactly doesn't it work? Give sample input & output, and say how the output differed from what you expected.
 

learning_brain

Thanks Misson

I found the bug - it was to do with relative paths rather than absolute which I have now fixed.

The development page is at http://www.qualityimagesearch.com/cbic_wrapper.php

Just put in an html address (full url) and it will process the images accordingly. I have also changed all a hrefs so that you can browse within the wrapper.

My only issue at the moment is that some css files are not being linked... possibly another url issue.

The image censoring seems relatively accurate at present, but improvements have to be made.

Also - it would be good to have the form re-iterated at the top of the final html so that another address can be entered, but if I simply echo it before the saveHTML, it will come before the html->body etc, which isn't ideal.

Maybe there's a way to append the <body></body> content....

Progress is good so far and reasonably fast too.

Rich
 

misson

learning_brain said:
Also - it would be good to have the form re-iterated at the top of the final html so that another address can be entered, but if I simply echo it before the saveHTML, it will come before the html->body etc, which isn't ideal.

Maybe there's a way to append the <body></body> content....

Build the form in the usual way (with DOMDocument::createElement calls) and insert or append the form element to the body element.

As an alternative to creating the form programmatically, you can create a DOMDocumentFragment then have it parse a string containing the form. You then add the fragment to the body element as you would the programmatically created element (with DOMNode::insertBefore or DOMNode::appendChild). Note that, as with all nodes, the document fragment must be owned by the document, otherwise you'll never be able to add the nodes. The simplest way to do this is to use DOMDocument::createDocumentFragment.

PHP:
$formSource = <<<ETX
<form method="get">
    <input name="u" />
    ...
</form>
ETX;

$form = $doc->createDocumentFragment();
$form->appendXML($formSource);
$body->appendChild($form);

You can also create a document fragment in the usual OO way (with new), but you must add the fragment to the document using DOMDocument::importNode before you parse the HTML string (orphaned DOM nodes are read-only). Since importNode creates a copy of the node rather than altering the original, you must also use the returned document fragment rather than the original, which will still be orphaned and read-only. You might as well just use createDocumentFragment.

PHP:
...
$form = new DOMDocumentFragment();
$form = $doc->importNode($form);
$form->appendXML($formSource);
...

If PHP's DOMDocument supported adoptNode, then creating a fragment with new would be viable, but sadly DOMDocument implements DOM level 2, not 3.
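
Putting the createDocumentFragment route together, a self-contained sketch might look like this. The sample page and form markup are made up for illustration; note that appendXML uses the XML parser, so the form string has to be well-formed XML (closed tags, quoted attributes).

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><head><title>Page</title></head><body><p>Original content</p></body></html>');

// Must be well-formed XML, since appendXML() does not use the HTML parser.
$formSource = '<form method="get"><input name="u" /><input type="submit" value="Go" /></form>';

// item(0) returns null if the document has no <body>.
$body = $doc->getElementsByTagName('body')->item(0);

// A fragment created by the document is owned by it from the start.
$form = $doc->createDocumentFragment();

// appendXML() returns false if the string could not be parsed.
if ($body !== null && $form->appendXML($formSource)) {
    // Appending a fragment moves its children into the body.
    $body->appendChild($form);
}

echo $doc->saveHTML();
```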
 

learning_brain

Hmmm - got an unexpected $end at the end of the script....

So I tried playing with the <<<ETX wrapper.

eg..

PHP:
$formSource=<<<STX
		<form id="form" name="form" method="get" action="">
			<label>
			<input name="url" type="text" id="url" size="100" />
			</label>
			<label>
			<input type="submit" name="button" id="button" value="Go" />
			</label>
		</form>
ETX;

PHP:
$formSource='<<<STX
		<form id="form" name="form" method="get" action="">
			<label>
			<input name="url" type="text" id="url" size="100" />
			</label>
			<label>
			<input type="submit" name="button" id="button" value="Go" />
			</label>
		</form>
ETX';

PHP:
$formSource='<<<STX
		<form id="form" name="form" method="get" action="">
			<label>
			<input name="url" type="text" id="url" size="100" />
			</label>
			<label>
			<input type="submit" name="button" id="button" value="Go" />
			</label>
		</form>
ETX>>>';

PHP:
$formSource='<<<ETX
		<form id="form" name="form" method="get" action="">
			<label>
			<input name="url" type="text" id="url" size="100" />
			</label>
			<label>
			<input type="submit" name="button" id="button" value="Go" />
			</label>
		</form>
ETX';

The furthest I got was "Call to a member function appendChild() on a non-object" - so I'm guessing it wasn't parsing the string correctly... can't find many pages explaining STX..ETX... ASCII characters and syntax.

Am I missing something??

Rich
 

misson

learning_brain said:
Hmmm - got an unexpected $end at the end of the script....
[...]
PHP:
$formSource=<<<STX
   [...]
ETX;

The start and end identifier must be exactly the same. Only identifier characters and (optionally) a trailing ";" can be present in the line with the closing identifier. You have "STX" as an opening identifier, so the string is never closed (as you can see by the red in this colorized source). I sometimes use ETX as a delimiter since the ETX control character means "end of text", but "STX" and "ETX" have no significance in PHP. You could use "EOS" (short for "end of string"), or "String_end" or "Whoops_Mrs_Miggens_youre_sitting_in_my_artichokes".

learning_brain said:
PHP:
$formSource='<<<STX
   [...]
ETX';

This is simply a single-quoted string, as are the rest of the samples, as the color again reveals. The "<<<STX" and "ETX'" are part of that string.

learning_brain said:
The furthest I got was "Call to a member function appendChild() on a non-object" - so I'm guessing it wasn't parsing the string correctly...

This means the thing you're calling the method on (the $obj in $obj->method()) isn't an object, not that an argument to the method isn't an object. Whatever you're doing to retrieve the DOM node (the body element?) is failing. Try:
PHP:
$body = $xpath->query('/html/body')->item(0);


learning_brain said:
can't find many pages explaining STX..ETX... ASCII characters and syntax.

You don't need info about the ETX character; you need to read over the PHP documentation on heredoc syntax for strings.
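
For reference, a minimal heredoc with matching identifiers looks like the following. "EOS" is an arbitrary name chosen for this sketch. Before PHP 7.3 the closing identifier has to start in the first column of its line; from PHP 7.3 onward it may be indented, but keeping it at the left margin works in every version.

```php
<?php
// The opening identifier (after <<<) and the closing identifier must be identical.
$formSource = <<<EOS
<form method="get">
    <input name="url" type="text" />
</form>
EOS;

// The string holds only the lines between the two identifiers.
echo $formSource, "\n";
```

Wrapping the whole thing in single quotes, as in the later attempts, turns the "<<<" line and the identifier into ordinary string content.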
 

learning_brain

Thanks Misson

The $body = $xpath->query('/html/body')->item(0); worked a treat... although I don't understand why lol.

Your code for the ETX was the one I tried first (which was giving me the unexpected $end). The others were just trials.

After a lot of reading up, I found that it was having problems with the tabbing, so I removed them and...works a treat! (although it adds the code to the end of the body, so I'm having difficulty getting a css relative position at the top of the page without overlaying existing headers..) Just need to style it up and sort out the <link...> tags and I should have a working solution!

Because of the increased loading time before it returns the html, I was trying to get a progress bar working, but I guess that's for another day and another thread...

Thanks again for all your help.

Rich
 

misson

learning_brain said:
The $body = $xpath->query('/html/body')->item(0); worked a treat... although I don't understand why lol.

Think about what DOMXPath::query returns, what /html/body selects, and then what DOMNodeList::item returns.
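
A short sketch that makes each step visible (the sample markup is made up for illustration): query returns a list of matching nodes, and item(0) picks the first node out of that list, or null when nothing matched.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><head><title>Page</title></head><body><p>Hi</p></body></html>');
$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList, even when only one node matches.
$matches = $xpath->query('/html/body');

// item(0) returns the first node in the list, or null if the list is empty.
$body = $matches->item(0);

echo get_class($matches), "\n";
echo $body === null ? 'no body found' : get_class($body), "\n";

// An expression that matches nothing gives an empty list, so item(0) is null.
$missing = $xpath->query('/html/nosuchelement')->item(0);
var_dump($missing);
```

Calling appendChild on a null result is what produces the "member function on a non-object" error.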

learning_brain said:
Your code for the ETX was the one I tried first (which was giving me the unexpected $end).

Not quite. Here's what I wrote:
PHP:
$formSource = <<<ETX
<form method="get">
    <input name="u" />
    ...
</form>
ETX;
My first line differs from yours: I use "ETX", you use "STX", which makes all the difference.

It also appears that vBulletin is adding spaces after the open and close identifiers in the rendered post (but not the source, as you can see if you quote the original message). Those extra spaces will cause a "Parse error: syntax error, unexpected T_SL" error.

learning_brain said:
After a lot of reading up, I found that it was having problems with the tabbing, so I removed them and...works a treat!

The only place whitespace should matter is on the lines with the identifiers, where there should be none. Within the heredoc string itself, whitespace won't affect parsing.

learning_brain said:
(although it adds the code to the end of the body so I'm having difficulty getting a css relative position at the top of the page without overlaying over existing headers..) Just need to style it up and sort out the <link...> tags and I should have a working solution!

You could use insertBefore and DOMNode::$firstChild rather than appendChild to make the form the first child of <body>, or use fixed positioning. If you do the latter, add extra space to the top of the body (by e.g. adding padding) so the top of the page isn't eclipsed by the form.
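
A minimal sketch of the insertBefore option, using made-up sample markup. Note that the property is spelled firstChild; PHP property names are case-sensitive.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><head><title>Page</title></head><body><h1>Existing header</h1></body></html>');
$body = $doc->getElementsByTagName('body')->item(0);

// Build the form as a fragment owned by the document.
$form = $doc->createDocumentFragment();
$form->appendXML('<form method="get"><input name="url" /></form>');

// insertBefore(new, reference) places the new node ahead of the reference node.
// If the reference is null (an empty body), it behaves like appendChild.
$body->insertBefore($form, $body->firstChild);

echo $doc->saveHTML();
```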
 

learning_brain

misson said:
Think about what DOMXPath::query returns and what /html/body selects, then what the DOMNodeList::item returns.

...Thinking and reading....

misson said:
Not quite. Here's what I wrote:
[...]
My first line differs from yours: I use "ETX", you use "STX", which makes all the difference.

Your version was the first one I tried. The examples were only variations on a theme. Because of my tabbing, the identifiers were also tabbed which gave me the problem. I have removed these and all is well. I did understand you.

misson said:
It also appears that vBulletin is adding spaces after the open and close identifiers in the rendered post (but not the source, as you can see if you quote the original message). Those extra spaces will cause a "Parse error: syntax error, unexpected T_SL" error.

Sigh... yes this seems to be happening a lot with other sites returning errors and/or unexpected results too...

I'll do some looking around to see if there are any cleanup scripts.


misson said:
You could use insertBefore and DOMNode::$firstChild rather than appendChild to make the form the first child of <body>, or use fixed positioning. If you do the latter, add extra space to the top of the body (by e.g. adding padding) so the top of the page isn't eclipsed by the form.

Yep - more thinking here I'm afraid. TBH, the DOM is relatively new to me, so I'm just going to go through some tutorials.

All sorts of issues with testing but... one step at a time :)

Rich
 

misson

If you like a post, use the "like" link. If you have nothing useful to add, please don't post. You're just lowering the signal-to-noise ratio.
 