Client-side data processing

cybrax

Community Advocate
Community Support
Using your web site hosting server to retrieve and mash up data from other web sites is rapidly becoming an "old school" technique (and of course some web hosts are making it nigh on impossible).

So there is a growing trend nowadays to move this CPU-intensive work away from the central server and onto the client's PC, which is typically doing very little. Android apps are a good example of this in action on mobile devices.

Sadly, asking a visitor to download and run an .exe file before they can view your website is unlikely to ever be popular, even with the non-tech-savvy, at least for now anyhow.

That leaves us with only ActiveX (XMLHTTP requests), which only works with IE, though I vaguely remember a plugin being available for Firefox. Not as versatile as cURL, but it may be useful for some.

The trick of course is NOT to use the ActiveX component for every single visitor (unless the output mashup requires this), but rather to have just one visitor an hour/day/week upload the scraped data to a server-side cache where all can see it regardless of browser type or security settings.

OK, school's out, now go play.

HTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head></head>
<body>
<div id="stats">STANDBY - NB: this only works in IE; accept the ActiveX request.</div>

<script>

var req;

function processStateChange(){
  var statusDiv = document.getElementById("stats");
  if (req.readyState == 0){ statusDiv.innerHTML = "WAKING UP"; }
  if (req.readyState == 1){ statusDiv.innerHTML = "GETTING THERE"; }
  if (req.readyState == 2){ statusDiv.innerHTML = "GOT IT!"; }
  if (req.readyState == 3){ statusDiv.innerHTML = "TINKERING"; }
  if (req.readyState == 4){
    statusDiv.innerHTML = "ALL DONE";

    var data = req.responseText;  // put the fetched HTML into a JavaScript variable

    // do something with the string here

    document.write(data);         // output the data (note: this replaces the current page)
  }
}

req = new ActiveXObject("Msxml2.XMLHTTP");
if (req) {
    req.onreadystatechange = processStateChange;
    req.open("GET", "http://www.google.co.uk/search?q=car", true);
    req.send();
}

</script>

</body>
</html>
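To round off the caching idea above, here is a minimal sketch of handing the scraped string back to the server once the client has processed it. The cache.php endpoint and the cachedata parameter are assumptions for illustration, not part of the example above:

Code:
// minimal sketch, assuming a hypothetical cache.php script that stores whatever it receives
function uploadToCache(data){
  var post = new ActiveXObject("Msxml2.XMLHTTP");      // same object used for the fetch above
  post.open("POST", "cache.php", true);
  post.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
  post.send("cachedata=" + encodeURIComponent(data));  // the server-side script writes this to its cache
}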
 

misson

Community Paragon
Community Support
Modern browsers have an XMLHttpRequest object, including IE since version 7, which supports the same API. Fallback to Msxml2.XMLHTTP is only necessary on IE 5 through 6.
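A minimal sketch of that feature detection, assuming the rest of the request code stays the same:

Code:
var req;
if (window.XMLHttpRequest) {
    // native object: IE 7+, Firefox, Chrome, Safari, Opera
    req = new XMLHttpRequest();
} else if (window.ActiveXObject) {
    // legacy fallback for IE 5 and 6
    req = new ActiveXObject("Msxml2.XMLHTTP");
}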
 

cybrax

Community Advocate
Community Support
Fallback to Msxml2.XMLHTTP is only necessary on IE 5 through 6.

Unfortunately IE6 is far from dead even now; as of last year (2009) it was still commanding a sizeable percentage of visitors' browsers, so it is always good practice to provide 'legacy support'.

Updated browser stats from W3Counter:

1. Internet Explorer 8 - 27.01%
2. Firefox 3.6 - 24.22%
3. Internet Explorer 7 - 9.34%
4. Chrome 6 - 6.91%
5. Internet Explorer 6 - 5.24%
6. Safari 5 - 4.46%
7. Chrome 7 - 4.38%
8. Firefox 3.5 - 3.41%
9. Firefox 3 - 1.86%
10. Safari 4 - 0.92%
 

descalzo

Grim Squeaker
Community Support
I really don't get your point.

Are you saying that you want to run an XMLHttp request from a web page that will download content from another site to a user's machine, have that machine process the results, and then use another XMLHttp request to upload the processed data to your server?
 

lemon-tree

x10 Minion
Community Support
Using your web site hosting server to retrieve and mash up data from other web sites is rapidly becoming an "old school" technique (and of course some web hosts are making it nigh on impossible).
I don't see why you think it is 'old school'. PHP can process data with very high efficiency; not as great as you'd get with a fully compiled program, but comparatively pretty good. JavaScript, on the other hand, is not really designed to do any data-intensive work on the client side, and doing so may cause your user's browser to lock up or crash. Whilst there are new techniques that avoid this problem (Web Workers, as sketched below), it still doesn't change the fact that you are trying to do something in JavaScript that could be handled with considerably greater efficiency on the server. Either way, farming a data task out to a user is just asking for trouble.
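As a rough illustration of the Web Workers technique mentioned above, a minimal sketch; the worker.js file name and the processing step are placeholders, not a recommendation:

Code:
// main page: hand the heavy work to a background thread so the UI stays responsive
var worker = new Worker("worker.js");          // worker.js is a hypothetical file
worker.onmessage = function (e) {
    document.getElementById("stats").innerHTML = e.data;  // processed result comes back here
};
worker.postMessage(rawHtml);                   // rawHtml = string fetched earlier

// worker.js: runs off the main thread
onmessage = function (e) {
    var processed = e.data.replace(/<[^>]+>/g, "");  // stand-in for the real CPU-intensive work
    postMessage(processed);
};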
Whilst the new block on port 80 here might seem frustrating, it is mainly there to boost the server's speed by blocking proxies.

Also, this should not work:
req.open("GET", "http://www.google.co.uk/search?q=car", true);
Any browser that does allow it to work is breaking the JavaScript rule for cross-site access: all HTTP requests may only go to the same domain and port that the original page was loaded from.
 

cybrax

Community Advocate
Community Support
That's pretty much the idea: get a visitor to perform the 'gathering' and then save the information server side, where it can be re-used as part of other web page(s). Of course, passing the processed information back to a server would probably be simpler to achieve with a hidden form field, but you have the general gist.
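As a minimal sketch of that hidden-form-field hand-off (save_cache.php and the field name are assumptions for illustration):

HTML:
<form id="cacheForm" action="save_cache.php" method="post">
  <input type="hidden" name="scraped" id="scraped" value="" />
</form>
<script>
// after the client-side processing, drop the string into the hidden field and submit
document.getElementById("scraped").value = data;      // data = processed string from earlier
document.getElementById("cacheForm").submit();
</script>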
 

descalzo

Grim Squeaker
Community Support
How do you get around the cross-domain problem?

Not to mention an irate user who finds out what you've done.
 

lemon-tree

x10 Minion
Community Support
How do you get around the cross-domain problem?
Ideally you shouldn't be able to on the client side, but it would appear that IE is letting it through. I believe jQuery also has support to get around it, but it uses a small piece of flash to do so.
 

misson

Community Paragon
Community Support
Fallback to Msxml2.XMLHTTP is only necessary on IE 5 through 6.

Unfortunately IE6 is far from dead even now; as of last year (2009) it was still commanding a sizeable percentage of visitors' browsers, so it is always good practice to provide 'legacy support'.
I don't see how that's relevant to my comment. I was saying to try to use Msxml2.XMLHTTP only if XMLHttpRequest doesn't exist.
 

cybrax

Community Advocate
Community Support
Easy there lads, the object of the exercise is how to utilise the resources (internet connection and processing power) of the web site visitor's own PC.

Not beat each other to death with the sacred manuals.

Lemon Tree... I did come across one article about using jQuery, but it still relies on PHP + cURL to do the initial 'grabbing' of data. Though this does very much look like the way to go.

http://www.redbonzai.com/blog/web-development/using-jquery-and-php-to-scrape-web-page-content/
 

lemon-tree

x10 Minion
Community Support
That article is suggesting a method for the complete opposite of what you are trying to achieve:

• The article is using PHP to fetch the feed and sending it straight to Javascript. The PHP here is little more than a dedicated proxy and relies on port 80 being open on the server.

• What you are trying to do is use Javascript to fetch the feed and send it straight to PHP for processing. In this case the Javascript is the makeshift proxy. However, the scope of this 'proxy' is very limited, as it can only fetch from the same domain that the PHP is on anyway, which rather defeats the point of it. If you were to try to integrate the two, you would end up with the server scraping the data (or not, if port 80 is blocked), sending it to JS, which then sends it straight back: a pointless round trip that leaves you back in the situation you started in.

[Attached screenshot of the cross-site request error thrown by the browser]


As a better explanation of why this error occurs and why PHP is therefore the better choice for scraping, I put this together:
[Attached diagram: Javascript on example.com may only request example.com, while PHP may request any domain]


What this is saying is that Javascript cannot use an XMLHTTP request to open any file outside of the example.com domain; for example, if your script on the domain example.com tries to request x10hosting.com, Javascript will throw an error, as this counts as a cross-site request.
PHP, on the other hand, can request anywhere and has no such limitation (unless port 80 is blocked, as it is here), which is why PHP is the better option for retrieving external data.
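For illustration, a minimal sketch of the server-proxy pattern this describes; proxy.php is a hypothetical same-origin script that would do the external fetch (with cURL or similar) and echo the result back:

Code:
// same-origin request: JS only ever talks to its own server,
// which performs the external fetch on its behalf
var req = window.XMLHttpRequest ? new XMLHttpRequest()
                                : new ActiveXObject("Msxml2.XMLHTTP");
req.onreadystatechange = function () {
    if (req.readyState == 4) {
        document.getElementById("stats").innerHTML = req.responseText;
    }
};
req.open("GET", "proxy.php?url=" + encodeURIComponent("http://www.example.com/"), true);
req.send();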

jQuery does have a way to bypass this hard-coded block in Javascript, but it relies on farming the request out to a Flash object to do the actual requesting and receiving.

I hope this is a better explanation as to why Javascript shouldn't be used to scrape data: it simply cannot do it without using something external to the script (PHP proxy, Flash, possibly an iFrame). If, by some weird chance, you are managing to make raw XMLHTTP requests to external servers, try seeing if it works in another browser; I can pretty much guarantee it won't.
 

cybrax

Community Advocate
Community Support
Hmm, Flash object... thanks LT. Will have to blow the dust off that particular set of scrolls; never been a big fan of Flash/ActionScript, but I do like a challenge.
 

lemon-tree

x10 Minion
Community Support
I wasn't really recommending the use of Flash and I still stand by PHP being the best method, but whatever works best for you I suppose.
 

cybrax

Community Advocate
Community Support
Well, it keeps me out of mischief; could you imagine the horror of me playing around with Android apps instead, lol.
 