Getting Non-Syndicated Info from Another Site

garrensilverwing · Apr 30, 2010

I want to grab information from another site that is not in an RSS feed. An example of the info I would like to grab can be seen at http://www.uschess.org/msa/MbrDtlMain.php?13923823. The information changes depending on which USCF profile I need to pull up, which is easy to select with the USCF ID. The info will always look like this:

Code:

Regular Rating
</td>

<td>
<b><nobr>
**Variable**&nbsp;&nbsp;

The information has several formats. It can be 3-4 digits long and possibly have a short string at the end.
I know there is probably a way to access the page, grab the information, and store it in a variable using
regular expressions, but I'm pretty new to regex, so I was hoping someone could provide an answer.
The code would probably look something like this, though I'm afraid the real thing would be more complicated:

Code:

Open USCF Webpage;
if(webpage doesn't exist) $rating = "unknown";
else
  {
    $rating = grabbed information;
    echo $rating;
  }

lemon-tree · Apr 30, 2010

I am wary of the problems that may arise from using data that isn't meant to be extracted for custom use. Would it not be better to go to the website's owner and either request data access or at least get proper permission to use this method? They will very likely deny you direct data access, but they may be OK with you extracting the number from the page.
Either email them or use their forums.

garrensilverwing · Apr 30, 2010

That's a good idea, I'll try that first. Seeing as it is public information, though, I don't think extracting it from their website in the manner mentioned above will be a problem.

lemon-tree · Apr 30, 2010

Bear in mind that just because something is published publicly on the web does not automatically mean it is OK to take it. It's very likely that they'll be just fine with letting you use the data, but it is always worth checking.

misson · Apr 30, 2010

If you get permission to use the data but can only access it from the web pages, you can use the regexp:

Code:

/Regular Rating\s*(?:<[^>]*>\s*)*([^<]+)/s

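which captures the first text after "Regular Rating" (the ([^<]+)) and skips intervening tags and whitespace (the (?:<[^>]*>\s*)*). This should keep working through minor changes to the page structure, but if "Regular Rating" occurs more than once on the page, the pattern can match at each occurrence, so check that you are getting the one you want. The captured text also includes any trailing whitespace and entities such as &nbsp;, so trim it before use.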
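As a rough sketch of how that regexp could be wired up in PHP: the function name extractRegularRating and the sample markup (including the rating value 1500) are made up for illustration, and the "unknown" fallback is borrowed from the pseudo-code earlier in the thread.

```php
<?php
// Hypothetical helper: pull the text that follows "Regular Rating" out of
// a chunk of HTML. Returns "unknown" when the label is not found.
function extractRegularRating(string $html): string
{
    $pattern = '/Regular Rating\s*(?:<[^>]*>\s*)*([^<]+)/s';

    // preg_match stops at the first occurrence of the label.
    if (preg_match($pattern, $html, $matches) !== 1) {
        return 'unknown';
    }

    // The capture still contains &nbsp; entities and newlines, so swap the
    // entities for plain spaces and trim the result.
    $rating = trim(str_replace('&nbsp;', ' ', $matches[1]));

    return $rating === '' ? 'unknown' : $rating;
}

// Made-up sample shaped like the snippet posted above.
$sample = "<td>\nRegular Rating\n</td>\n\n<td>\n<b><nobr>\n1500&nbsp;&nbsp;\n</nobr></b></td>";
echo extractRegularRating($sample), "\n"; // prints 1500
```

To run it against a live page, the HTML could come from file_get_contents() on the profile URL, assuming allow_url_fopen is enabled on your host. That call returns false on failure, so check for that first and fall back to "unknown" before calling the helper.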

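Alternatively, you can use a parsed version of the page (such as obtained with DOMDocument::loadHTML or, if the page is well-formed XML, simplexml_load_string) and access the info with the XPath:

Code:

//td[normalize-space(.)="Regular Rating"]/following-sibling::*//text()

The normalize-space() call is there because the cell's text has whitespace around it in the snippet you posted, so an exact comparison against "Regular Rating" would not match.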
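A minimal sketch of the DOM route, assuming the PHP DOM extension is available and that the label and the value sit in adjacent td cells as in the posted snippet. The function name extractRegularRatingDom and the sample markup are hypothetical. It compares the label with normalize-space(.) so whitespace around it does not matter, narrows to the first following td, and reads that cell's textContent so whitespace-only text nodes don't have to be filtered one by one.

```php
<?php
// Hypothetical helper: find the <td> labelled "Regular Rating" and return
// the text of the cell right after it, or "unknown" if it is not there.
function extractRegularRatingDom(string $html): string
{
    if ($html === '') {
        return 'unknown';
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return 'unknown';
    }

    $xpath = new DOMXPath($doc);
    $cells = $xpath->query(
        '//td[normalize-space(.)="Regular Rating"]/following-sibling::td[1]'
    );
    if ($cells === false || $cells->length === 0) {
        return 'unknown';
    }

    // &nbsp; arrives as U+00A0 after parsing; turn it into a plain space.
    $text = str_replace("\u{00A0}", ' ', $cells->item(0)->textContent);
    $text = trim($text);

    return $text === '' ? 'unknown' : $text;
}

// Made-up sample shaped like the snippet posted above.
$sample = "<table><tr><td>\nRegular Rating\n</td>\n\n<td>\n<b><nobr>\n1500&nbsp;&nbsp;\n</nobr></b></td></tr></table>";
echo extractRegularRatingDom($sample), "\n";
```

Compared with the regexp, this depends more on the table layout staying the same, but it is less likely to pick up a stray "Regular Rating" that appears outside a table cell.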

If the data changes infrequently, make sure you cache it to reduce the load on the other server.
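One simple way to do the caching is a file per USCF ID with a time-to-live. This is only a sketch: cachedRating, the file naming scheme, and the one-day default are all assumptions, and $fetch stands in for whatever function actually downloads and parses the page.

```php
<?php
// Hypothetical file cache: reuse a stored rating while it is younger than
// $ttl seconds, otherwise call $fetch and store what it returns.
function cachedRating(
    string $uscfId,
    callable $fetch,
    int $ttl = 86400,
    ?string $cacheDir = null
): string {
    $cacheDir = $cacheDir ?? sys_get_temp_dir();
    $file = $cacheDir . '/uscf_rating_' . md5($uscfId) . '.txt';

    // Make sure we see the file's current modification time.
    clearstatcache(true, $file);
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        $cached = file_get_contents($file);
        if ($cached !== false) {
            return $cached;
        }
    }

    // Cache miss or stale entry: hit the remote site once.
    $rating = (string) $fetch($uscfId);

    // Don't store failures, so the next request gets another try.
    if ($rating !== 'unknown') {
        file_put_contents($file, $rating, LOCK_EX);
    }

    return $rating;
}
```

With this in place, repeated page views within the TTL read the local file instead of requesting the profile page again. How long the TTL should be depends on how often ratings are actually updated, which is worth asking about when you request permission.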
