Hi there!
I'm using Perl/Tkx to grab webpages (LWP), then parse the HTML (HTML::TreeBuilder), then, once I find the right nodes, put the result of $element->as_text() into a text widget (new_tk__text).
The problem I'm having is that some characters (those with umlauts, or accents, for instance) are incorrectly displayed in the Tkx widget, and sometimes incorrectly in my terminal window, too.
I've tried to fix this by calling utf8::upgrade($content), but this doesn't seem to have any effect, even though the page says that it's encoded in UTF-8.
Could anybody shed more light on this? I've looked through a bunch of Unicode resources, but so far nothing I've tried in Perl (encode, decode, use encoding 'utf8', utf8::upgrade) seems to have even the slightest effect on my strings, judging by the unvarying output of print. I really don't know whether this is a Unicode/encoding/decoding issue. I've also found that some character glyphs are sometimes missing in Tkx, but I don't know whether that is the cause, or how to rectify it if it is.
My code:
#!/usr/bin/perl
use strict;
use warnings;
use LWP;
use Tkx;
use HTML::TreeBuilder;
use Encode;
my $url = "http://dict.leo.org/ende?lp=ende&lang=de&searchLoc=0&cmpType=relaxed&sectHdr=on&spellToler=&search=for";
my $lwp = LWP::UserAgent->new();
my $response = $lwp->get($url);
my $content = ${$response}{_content};
#utf8::upgrade($content);#doesn't seem to do anything
print "encoding: ".$response->content_charset()."\n";
my $root = HTML::TreeBuilder->new();
$root->parse($content);
$root->eof();
my @nodes = $root->look_down(_tag => 'div', 'id', 'singleword');
my $mw = Tkx::widget->new(".");
my $text = $mw->new_tk__text(-width => 100, -height => 30, -wrap => "word");
my $scroll = $mw->new_ttk__scrollbar(-orient => 'vertical', -command => [$text, 'yview']);
$text->configure(-yscrollcommand => [$scroll, 'set']);
my $display_text = $nodes[0]->as_text();
my $encoded_text = Encode::encode("iso-8859-1", $display_text);
print "results: $encoded_text\n\noriginal: $display_text\n";
$text->insert("end", $encoded_text);
$text->g_grid(-column => 0);
$scroll->g_grid(-column => 1, -row => 0, -sticky => 'ns');
&Tkx::MainLoop();
#####################
Mac OS X 10.6.2
Perl 5.10.1
Tkx 1.08
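
To narrow down what I'm seeing, here is a minimal sketch of the bytes-versus-characters distinction that I suspect is involved. It is not my real script: the hard-coded byte string stands in for the page content, and I'm assuming the page really is UTF-8. It shows that utf8::upgrade leaves the string's length (and meaning) unchanged, while Encode::decode turns the raw bytes into characters. If that is the issue, then presumably the fetched content has to be decoded once before parsing (for example via $response->decoded_content) and not re-encoded before it goes into the widget, but that part is my assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for fetched content: "für" as raw UTF-8 bytes.
my $bytes = "f\xC3\xBCr";

# utf8::upgrade only changes Perl's internal representation;
# the string still holds the same four "characters" (one per byte).
my $upgraded = $bytes;
utf8::upgrade($upgraded);
my $upgraded_len = length $upgraded;   # 4

# decode interprets the bytes as UTF-8 and yields a character string.
my $chars    = decode('UTF-8', $bytes);
my $byte_len = length $bytes;          # 4 bytes
my $char_len = length $chars;          # 3 characters

# Encode once at the output boundary (assuming a UTF-8 terminal).
binmode(STDOUT, ':encoding(UTF-8)');
print "bytes: $byte_len, upgraded: $upgraded_len, characters: $char_len, text: $chars\n";
```

If the decoded length is what I should be seeing from as_text(), that would explain why none of my attempts changed the printed output.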