Perl/Tkx incorrectly displayed characters

bvforth

Hi there!

I'm using Perl/Tkx to grab webpages (LWP), then parse the HTML (HTML::TreeBuilder), then, once I find the right nodes, put the result of $element->as_text() into a text widget (new_tk__text).

The problem I'm having is that some characters (those with umlauts, or accents, for instance) are incorrectly displayed in the Tkx widget, and sometimes incorrectly in my terminal window, too.
I've tried to fix this by calling utf8::upgrade($content), but this doesn't seem to have any effect, even though the page says that it's encoded in UTF-8.

Could anybody shed more light on this? I've looked through a bunch of Unicode resources, but so far nothing I've tried in Perl (encode, decode, use encoding 'utf8', utf8::upgrade) seems to have even the slightest effect on my strings (based on the unvarying output of print). I really don't know if this is a Unicode/encoding/decoding issue. Also, I've found that sometimes there are character glyphs missing from Tkx, but I don't know whether this is the problem, or how to rectify that if it is.


my code:



#!/usr/bin/perl

use strict;
use warnings;
use LWP;
use Tkx;
use HTML::TreeBuilder;
use Encode;

my $url = "http://dict.leo.org/ende?lp=ende&lang=de&searchLoc=0&cmpType=relaxed&sectHdr=on&spellToler=&search=for";

my $lwp = LWP::UserAgent->new();
my $response = $lwp->get($url);
my $content = ${$response}{_content};
#utf8::upgrade($content);#doesn't seem to do anything

print "encoding: ".$response->content_charset()."\n";
my $root = HTML::TreeBuilder->new();
$root->parse($content);
$root->eof();
my @nodes = $root->look_down(_tag => 'div', 'id', 'singleword');

my $mw = Tkx::widget->new(".");
my $text = $mw->new_tk__text(-width => 100, -height => 30, -wrap => "word");
my $scroll = $mw->new_ttk__scrollbar(-orient => 'vertical', -command => [$text, 'yview']);
$text->configure(-yscrollcommand => [$scroll, 'set']);

my $display_text = $nodes[0]->as_text();
my $encoded_text = Encode::encode("iso-8859-1", $display_text);
print "results: $encoded_text\n\noriginal: $display_text\n";

$text->insert("end", $encoded_text);
$text->g_grid(-column => 0);
$scroll->g_grid(-column => 1, -row => 0, -sticky => 'ns');

&Tkx::MainLoop();



#####################


Mac OS X 10.6.2
Perl 5.10.1
Tkx 1.08
 

misson

The incoming data is a string of raw bytes. It needs to be decoded into Perl characters before you can use it.

Code:
my $response = $lwp->get($url);
my $content = Encode::decode($response->content_charset(), $response->content); # content() returns the raw bytes without reaching into the object's internals
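
As a possible alternative, HTTP::Response also has a decoded_content() method that applies the charset from the Content-Type header for you. Here is a minimal sketch that runs without network access. The response object is built by hand, and the sample bytes ("für" in UTF-8) are an assumption standing in for the fetched page.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical stand-in for a fetched page: the UTF-8 bytes of "f", u-umlaut, "r".
my $bytes = "f\xC3\xBCr";

# Build a response by hand so the example needs no network access.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $bytes,
);

# content() returns the raw bytes; decoded_content() returns Perl characters.
my $raw     = $response->content;
my $decoded = $response->decoded_content;

printf "raw length: %d, decoded length: %d\n", length($raw), length($decoded);
```

If the two lengths differ, the string really was decoded: the two bytes of the umlaut have become a single character.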

You don't need to encode the text before displaying it.

Code:
$text->insert("end", $display_text); # rather than $encoded_text
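
To see why utf8::upgrade had no visible effect while decode does, here is a small sketch using only the core Encode module. The sample bytes ("Tür" in UTF-8) are an assumption, and the binmode line assumes a terminal that expects UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for "T", u-umlaut, "r" as they might arrive from the network (assumed sample).
my $bytes = "T\xC3\xBCr";

# Decoding interprets the bytes as UTF-8 and yields characters.
my $chars = decode('UTF-8', $bytes);

# utf8::upgrade only changes Perl's internal storage of the string,
# not its meaning, so the undecoded string keeps the same length.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# Encode only at output boundaries, here via an I/O layer on STDOUT.
binmode(STDOUT, ':encoding(UTF-8)');
printf "bytes: %d, chars: %d, upgraded: %d\n",
    length($bytes), length($chars), length($upgraded);
print "$chars\n";
```

The decoded string is the one to hand to the text widget. The encoding layer on STDOUT covers the terminal side of the problem.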

Nicely done describing the problem, by the way.
 