Forum: >>> Magnum BBS <<<

Decoding bytes to text strings in Python 2

From Rayner Lucas@21:1/5 to All on Fri Jun 21 16:49:08 2024

I'm curious about something I've encountered while updating a very old
Tk app (originally written in Python 1, but I've ported it to Python 2
as a first step towards getting it running on modern systems). The app
downloads emails from a POP server and displays them. At the moment, the
code is completely unaware of character encodings (which is something I
plan to fix), and I have found that I don't understand what Python is
doing when no character encoding is specified.

To demonstrate, I have written this short example program that displays
a variety of UTF-8 characters to check whether they are decoded
properly:

---- Example Code ----
import Tkinter as tk

window = tk.Tk()

mytext = """
\xc3\xa9 LATIN SMALL LETTER E WITH ACUTE
\xc5\x99 LATIN SMALL LETTER R WITH CARON
\xc4\xb1 LATIN SMALL LETTER DOTLESS I
\xef\xac\x84 LATIN SMALL LIGATURE FFL
\xe2\x84\x9a DOUBLE-STRUCK CAPITAL Q
\xc2\xbd VULGAR FRACTION ONE HALF
\xe2\x82\xac EURO SIGN
\xc2\xa5 YEN SIGN
\xd0\x96 CYRILLIC CAPITAL LETTER ZHE
\xea\xb8\x80 HANGUL SYLLABLE GEUL
\xe0\xa4\x93 DEVANAGARI LETTER O
\xe5\xad\x97 CJK UNIFIED IDEOGRAPH-5B57
\xe2\x99\xa9 QUARTER NOTE
\xf0\x9f\x90\x8d SNAKE
\xf0\x9f\x92\x96 SPARKLING HEART
"""

mytext = mytext.decode(encoding="utf-8")
greeting = tk.Label(text=mytext)
greeting.pack()

window.mainloop()
---- End Example Code ----

This works exactly as expected, with all the characters displaying
correctly.

However, if I comment out the line 'mytext = mytext.decode
(encoding="utf-8")', the program still displays *almost* everything
correctly. All of the characters appear correctly apart from the two
four-byte emoji characters at the end, which instead display as four
characters. For example, the "SNAKE" character actually displays as:
U+00F0 LATIN SMALL LETTER ETH
U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
U+FF90 HALFWIDTH KATAKANA LETTER MI
U+FF8D HALFWIDTH KATAKANA LETTER HE

What's Python 2 doing here? sys.getdefaultencoding() returns 'ascii',
but it's clearly not attempting to display the bytes as ASCII (or
cp1252, or ISO-8859-1). How is it deciding on some sort of
almost-but-not-quite UTF-8 decoding?
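
For anyone wanting to reproduce the comparison, here is a minimal sketch
that decodes the SNAKE bytes with each of those codecs and compares the
result with the code points seen on screen. The `observed` list is copied
from the description above; `try_decode` is just a helper name made up
for this illustration. It should run unchanged on Python 2.7 or Python 3.

```python
# The four UTF-8 bytes of U+1F40D SNAKE, as in the example program.
snake_bytes = b"\xf0\x9f\x90\x8d"

# Code points reported on screen when the bytes were passed to Tk undecoded.
observed = [0x00F0, 0xFF9F, 0xFF90, 0xFF8D]


def try_decode(data, codec):
    """Return the decoded code points, or None if the codec rejects the bytes."""
    try:
        text = data.decode(codec)
    except UnicodeDecodeError:
        return None
    return [ord(ch) for ch in text]


for codec in ("ascii", "cp1252", "iso-8859-1"):
    points = try_decode(snake_bytes, codec)
    if points is None:
        print("%-10s cannot decode these bytes" % codec)
    else:
        print("%-10s %s" % (codec, " ".join("U+%04X" % p for p in points)))
    print("%-10s matches observed output: %s" % ("", points == observed))
```

None of the three codecs reproduces the observed sequence: ASCII and
cp1252 reject the bytes outright, and ISO-8859-1 gives U+009F, U+0090 and
U+008D where the display shows U+FF9F, U+FF90 and U+FF8D.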

I am using Python 2.7.18 on a Windows 10 system. If there's any other
relevant information I should provide please let me know.

Many thanks,
Rayner

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Ram@21:1/5 to Rayner Lucas on Fri Jun 21 17:43:13 2024

Rayner Lucas <[email protected]AMPLEASE> wrote or quoted:
> What's Python 2 doing here? sys.getdefaultencoding() returns 'ascii',
> but it's clearly not attempting to display the bytes as ASCII (or
> cp1252, or ISO-8859-1). How is it deciding on some sort of
> almost-but-not-quite UTF-8 decoding?

I didn't really do a super thorough deep dive on this,
but I'm just giving the initial impression without
actually being familiar with Tkinter under Python 2,
so I might be wrong!

The Text widget typically expects text in Tcl encoding,
which is usually UTF-8.

This is independent of the result returned by
sys.getdefaultencoding()!

If a UTF-8 string is inserted directly as a bytes object,
its code points will be displayed correctly by the Text
widget as long as they are in the BMP (Basic Multilingual
Plane), as you already found out yourself.
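
The BMP boundary can be checked without Tk at all. The sketch below is an
illustration only: `is_bmp` is a hypothetical helper, and the sample list
is a subset of the characters from the original example. It relies on the
fact that a BMP character occupies exactly one 16-bit UTF-16 code unit,
while anything above U+FFFF needs a surrogate pair.

```python
# (name, UTF-8 bytes) pairs taken from the example program in this thread.
samples = [
    ("LATIN SMALL LETTER E WITH ACUTE", b"\xc3\xa9"),
    ("EURO SIGN", b"\xe2\x82\xac"),
    ("HANGUL SYLLABLE GEUL", b"\xea\xb8\x80"),
    ("SNAKE", b"\xf0\x9f\x90\x8d"),
    ("SPARKLING HEART", b"\xf0\x9f\x92\x96"),
]


def is_bmp(utf8_bytes):
    """Return True if the single character encoded in utf8_bytes is in the BMP."""
    # One UTF-16 code unit is 2 bytes; a surrogate pair is 4 bytes.
    utf16 = utf8_bytes.decode("utf-8").encode("utf-16-le")
    return len(utf16) == 2


for name, raw in samples:
    print("%-32s %d UTF-8 bytes  BMP: %s" % (name, len(raw), is_bmp(raw)))
```

Only the two four-byte sequences (SNAKE and SPARKLING HEART) fall outside
the BMP, which matches the two characters that were displayed wrongly.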


From Rayner Lucas@21:1/5 to All on Sat Jun 22 13:13:28 2024

In article <[email protected]>, [email protected] says...

> If you switch to a Linux system, it should work correctly, and you'll
> be able to migrate the rest of the way onto Python 3. Once you achieve
> that, you'll be able to operate on Windows or Linux equivalently,
> since Python 3 solved this problem. At least, I *think* it will; my
> current system has a Python 2 installed, but doesn't have tkinter
> (because I never bothered to install it), and it's no longer available
> from the upstream Debian repos, so I only tested it in the console.
> But the decoding certainly worked.

Thank you for the idea of trying it on a Linux system. I did so, and my
example code generated the error:

_tkinter.TclError: character U+1f40d is above the range (U+0000-U+FFFF)
allowed by Tcl

So it looks like the problem is ultimately due to a limitation of
Tcl/Tk. I'm still not sure why it doesn't give an error on Windows and
instead either works (when UTF-8 encoding is specified) or converts the
out-of-range characters to ones it can display (when the encoding isn't
specified). But now I know what the root of the problem is, I can deal
with it appropriately (and my curiosity is at least partly satisfied).
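
One possible way of dealing with it, sketched here under the assumption
of Python 3 (where a str holds full code points) and a Tcl/Tk build that
rejects anything above U+FFFF: substitute a placeholder for out-of-range
characters before handing the text to Tk. The name `to_bmp` and the
choice of U+FFFD as the placeholder are illustrative, not taken from the
app.

```python
import re

# Matches any single code point above U+FFFF (assumes Python 3 str).
_ASTRAL = re.compile("[\U00010000-\U0010FFFF]")


def to_bmp(text, replacement="\ufffd"):
    """Replace every character outside the BMP with a placeholder."""
    return _ASTRAL.sub(replacement, text)


# U+1F40D SNAKE is replaced; BMP characters pass through untouched.
print(ascii(to_bmp("snake: \U0001F40D, euro: \u20ac")))
```

The cleaned string can then be used as the text= argument of the Label in
the example program.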

This has given me a much better understanding of what I need to do in
order to migrate to Python 3 and add proper support for non-ASCII
characters, so I'm very grateful for your help!

Thanks,
Rayner


From Rayner Lucas@21:1/5 to All on Sat Jun 22 13:26:00 2024

In article <[email protected]>, [email protected]- berlin.de says...

> I didn't really do a super thorough deep dive on this,
> but I'm just giving the initial impression without
> actually being familiar with Tkinter under Python 2,
> so I might be wrong!
>
> The Text widget typically expects text in Tcl encoding,
> which is usually UTF-8.
>
> This is independent of the result returned by
> sys.getdefaultencoding()!
>
> If a UTF-8 string is inserted directly as a bytes object,
> its code points will be displayed correctly by the Text
> widget as long as they are in the BMP (Basic Multilingual
> Plane), as you already found out yourself.

Many thanks, you've helped me greatly in understanding what's happening.
When I tried running my example code on a different system (Python
2.7.18 on Linux, with Tcl/Tk 8.5), I got the error:

_tkinter.TclError: character U+1f40d is above the range (U+0000-U+FFFF)
allowed by Tcl

So, as your reply suggests, the problem is ultimately a limitation of
Tcl/Tk itself. Perhaps I should have spent more time studying the docs
for that instead of puzzling over the details of character encodings in
Python! I'm not sure why it doesn't give the same error on Windows, but
at least now I know where the root of the issue is.

I am now much better informed about how to migrate the code I'm working
on, so I am very grateful for your help.

Thanks,
Rayner


From MRAB@21:1/5 to Chris Angelico via Python-list on Mon Jun 24 01:14:22 2024

On 2024-06-24 00:30, Chris Angelico via Python-list wrote:

> On Mon, 24 Jun 2024 at 08:20, Rayner Lucas via Python-list <[email protected]> wrote:
>> In article <[email protected]>,
>> [email protected] says...
>>> If you switch to a Linux system, it should work correctly, and you'll
>>> be able to migrate the rest of the way onto Python 3. Once you achieve
>>> that, you'll be able to operate on Windows or Linux equivalently,
>>> since Python 3 solved this problem. At least, I *think* it will; my
>>> current system has a Python 2 installed, but doesn't have tkinter
>>> (because I never bothered to install it), and it's no longer available
>>> from the upstream Debian repos, so I only tested it in the console.
>>> But the decoding certainly worked.
>>
>> Thank you for the idea of trying it on a Linux system. I did so, and my
>> example code generated the error:
>>
>> _tkinter.TclError: character U+1f40d is above the range (U+0000-U+FFFF)
>> allowed by Tcl
>>
>> So it looks like the problem is ultimately due to a limitation of
>> Tcl/Tk.
>
> Yep, that seems to be the case. Not sure if that's still true on a
> more recent Python, but it does look like you won't get astral
> characters in tkinter on the one you're using.

[snip]
Tkinter in recent versions of Python can handle astral characters, at
least back to Python 3.8, the oldest I have on my Windows PC.

