• IDLE "Codepage" Switching?

    From Stephen Tucker@21:1/5 to All on Tue Jan 17 12:47:29 2023
    I have four questions.

    1. Can anybody explain the behaviour in IDLE (Python version 2.7.10)
    reported below? (It seems that the way it renders a given sequence of bytes depends on the sequence.)

    2. Does the IDLE in Python 3.x behave the same way?

    3. If it does, is this as it should behave?

    4. If it is, then why is it as it should behave?
    ==============================
    mylongstr = ""
    for thisCP in range (157, 169):
        mylongstr += chr (thisCP) + " "


    print mylongstr
    ン ゙ ゚ ᅠ ᄀ ᄁ ᆪ ᄂ ᆬ ᆭ ᄃ ᄄ
    mylongstr = ""
    for thisCP in range (158, 169):
        mylongstr += chr (thisCP) + " "


    print mylongstr
    ž Ÿ ¡ ¢ £ ¤ ¥ ¦ § ¨
    mylongstr = ""
    for thisCP in range (157, 169):
        mylongstr += chr (thisCP) + " "


    print mylongstr
    ン ゙ ゚ ᅠ ᄀ ᄁ ᆪ ᄂ ᆬ ᆭ ᄃ ᄄ
    ==============================

    Stephen Tucker.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Ram@21:1/5 to Stephen Tucker on Tue Jan 17 13:22:38 2023
    Stephen Tucker <[email protected]> writes:
    > 1. Can anybody explain the behaviour in IDLE (Python version 2.7.10)
    > reported below? (It seems that the way it renders a given sequence of
    > bytes depends on the sequence.)

    When I went to school, we already learned Python 3, so I
    have never learned Python 2. I can only offer a wild guess.

    I am using an operating system called "Microsoft® Windows".
    This - as a default - uses a character encoding called
    "cp1252". Maybe even the shell of IDLE of Python version
    2.7.10 on Microsoft® Windows uses this encoding.

    From a table of cp1252, I take it that 0x9D (decimal 157,
    the value your first loop starts with) is "undefined" in
    cp1252.

    So, when you write char 157 to some cp1252-based device,
    you kind of provoke "undefined behavior" of the device.
    It might use char 157 for internal purposes, and char 157
    might change the state of the output device into some
    special mode, which then will alter the appearance of
    subsequent characters.

  • From rbowman@21:1/5 to Stephen Tucker on Wed Jan 18 01:46:49 2023
    On Tue, 17 Jan 2023 12:47:29 +0000, Stephen Tucker wrote:

    > 2. Does the IDLE in Python 3.x behave the same way?

    fwiw

    Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
    Type "help", "copyright", "credits" or "license()" for more information.
    str = ""
    for c in range(157, 169):
        str += chr(c) + ""


    print(str)
    žŸ ¡¢£¤¥¦§¨
    str = ""
    for c in range(140, 169):
        str += chr(c) + " "


    print(str)
    Œ  Ž   ‘ ’ “ ” • – — ˜ ™ š › œ  ž Ÿ   ¡ ¢ £ ¤ ¥ ¦ § ¨


    I don't know how this will appear since Pan is showing the icon for a
    character not in its set. However, even with more undefined characters
    the printable ones do not change. I get the same output running Python3
    from the terminal so it's not an IDLE thing.

  • From Thomas Passin@21:1/5 to rbowman on Tue Jan 17 22:58:53 2023
    I'm not sure what explanation is being asked for here. Let's take
    Python3, so we can be sure that the strings are in unicode. The font
    being used by the console isn't mentioned, but there's no reason it
    should have glyphs for any random unicode character. In my case, I see
    the same missing and printable characters as in the previous post
    (above). The font is Source Code Pro Medium.

    Changing the console's code page won't magically provide the missing glyphs.

    I wrote these characters to a file using utf-8 encoding and opened it in
    an editor that recognized the content as utf-8 (EditPlus). It displayed
    the same characters but had fewer leading spaces (i.e., missing glyphs),
    and did not show any default "missing-character" glyphs. The editor is
    using the Cousine font.

    The second factor that could be in play is what the default character
    encoding is, which is set by Windows and could be different in different
    places (locales). I don't recall just now how Python3 handles this.
    Since Python2 strings are not unicode unless specified, and Python2
    probably handles the locale/default encoding differently from Python3,
    it would not be a surprise if the two give different results.
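
    For reference, a minimal Python 3 sketch of how to inspect the encodings in play on a given machine. The helper name encoding_report is made up for illustration; the calls themselves are standard library. The values depend on the platform and locale, so no output is shown.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python 3 reports for this process (hypothetical helper)."""
    return {
        # Encoding the locale suggests for text data
        # (often an ANSI code page such as cp1252 on Windows).
        "preferred": locale.getpreferredencoding(False),
        # Encoding of the stream print() writes to; None if stdout has no such attribute.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Default for str.encode() / bytes.decode(); 'utf-8' on Python 3.
        "default": sys.getdefaultencoding(),
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(name, "=", value)
```

    Running this both in IDLE and in a plain console is one way to see whether the two environments disagree about the output encoding.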

    If you print such a Python2 string, the glyphs for non-ASCII bytes
    (ord(ch) > 127) come from the Windows code page table, and they will
    differ from what Python3 displays.
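
    The code page dependence can be illustrated with a small Python 3 sketch that decodes one byte under several encodings. The function name decode_with and the choice of cp1252, cp437 and latin-1 are assumptions made for the example; they are not taken from the thread.

```python
def decode_with(byte_value, encodings=("cp1252", "cp437", "latin-1")):
    """Decode a single byte under several code pages (illustrative helper)."""
    raw = bytes([byte_value])
    # errors="replace" turns a byte the code page leaves undefined into U+FFFD
    # instead of raising UnicodeDecodeError.
    return {name: raw.decode(name, errors="replace") for name in encodings}


if __name__ == "__main__":
    for value in (0x9D, 0xA1):
        print(hex(value), decode_with(value))
```

    Byte 0xA1 comes out as an inverted exclamation mark under cp1252 and latin-1 but as a different letter under cp437, which is the kind of difference a Python2 byte string is exposed to when it is printed.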

    Python3 uses the Windows Unicode API functions and isn't subject to the
    same limitations as Python2, which had to go through the Windows code
    page apparatus and didn't use the Unicode API (see PEP 528,
    https://peps.python.org/pep-0528/).

    IDLE sets up its own window itself, and probably uses a different font
    from the default Windows console, so there could be some differences
    there too, especially as to whether missing glyphs show a visible symbol
    or not.

    Code Page 65001 was often claimed to be for utf-8. It's not really
    correct in general, but it's OK for many utf-8 characters. But in
    Python2, the codecs module does not know about code page 65001 - unless
    you apply a simple patch - so if you try to set the console to cp65001,
    you cannot get anything printed. You get an exception raised instead.
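
    Whether a given interpreter knows an encoding name can be checked with codecs.lookup, as in this hedged Python 3 sketch. The helper codec_name is hypothetical. What it returns for "cp65001" depends on the Python version and platform, so that result is deliberately not stated here.

```python
import codecs


def codec_name(label):
    """Return the canonical codec name for label, or None if it is unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        # An interpreter that does not know the label raises LookupError,
        # which is the kind of failure described above for Python2.
        return None


if __name__ == "__main__":
    for label in ("utf8", "cp1252", "cp65001"):
        print(label, "->", codec_name(label))
```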

    Yes, it's all confusing, and especially with Python2.

  • From moi@21:1/5 to All on Wed Jan 18 00:20:23 2023
    On Tuesday, 17 January 2023 at 13:48:08 UTC+1, Stephen Tucker wrote:

    - You have very interesting questions.
    - This is also true for the "codecs.open()" thread.
    - You are a talented observer.
    - I have an opinion.

  • From Peter J. Holzer@21:1/5 to Thomas Passin on Wed Jan 18 10:41:05 2023
    On 2023-01-17 22:58:53 -0500, Thomas Passin wrote:
    > I'm not sure what explanation is being asked for here. Let's take
    > Python3, so we can be sure that the strings are in unicode. The font
    > being used by the console isn't mentioned, but there's no reason it
    > should have glyphs for any random unicode character.

    Also note that the characters between 128 (U+0080) and 159 (U+009F)
    inclusive aren't printable characters. They are control characters.
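
    This can be checked from Python 3 with the unicodedata module. A minimal sketch follows; the helper name categories is an assumption made for the example.

```python
import unicodedata


def categories(start, stop):
    """Map each code point in range(start, stop) to its Unicode general category."""
    return {cp: unicodedata.category(chr(cp)) for cp in range(start, stop)}


if __name__ == "__main__":
    # U+0080..U+009F are the C1 controls: every one has category "Cc".
    print(set(categories(0x80, 0xA0).values()))
    # From U+00A0 onwards the characters are printable (or spaces).
    for cp in (0xA0, 0xA1):
        print("U+%04X" % cp, unicodedata.category(chr(cp)), unicodedata.name(chr(cp)))
```

    Category "Cc" means a control character, so a font has no glyph to show for these code points and each display is free to render them as nothing or as a placeholder.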

    hp

    --
    _ | Peter J. Holzer | Story must make more sense than reality.
    |_|_) | |
    | | | [email protected] | -- Charles Stross, "Creative writing
    __/ | http://www.hjp.at/ | challenge!"

  • From Stephen Tucker@21:1/5 to [email protected] on Wed Jan 18 10:43:01 2023
    Thanks for these responses.

    I was encouraged to read that I'm not the only one to find this all
    confusing.

    I have investigated a little further.

    1. I produced the following IDLE log:

    mylongstr = ""
    for thisCP in range (1, 256):
        mylongstr += chr (thisCP) + " " + str (ord (chr (thisCP))) + ", "


    print mylongstr
    1, 2, 3, 4, 5, 6, 7, 8, 9,
    10, 11, 12,
    13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
    31, 32, ! 33, " 34, # 35, $ 36, % 37, & 38, ' 39, ( 40, ) 41, * 42, + 43,
    , 44, - 45, . 46, / 47, 0 48, 1 49, 2 50, 3 51, 4 52, 5 53, 6 54, 7 55, 8
    56, 9 57, : 58, ; 59, < 60, = 61, > 62, ? 63, @ 64, A 65, B 66, C 67, D 68,
    E 69, F 70, G 71, H 72, I 73, J 74, K 75, L 76, M 77, N 78, O 79, P 80, Q
    81, R 82, S 83, T 84, U 85, V 86, W 87, X 88, Y 89, Z 90, [ 91, \ 92, ] 93,
    ^ 94, _ 95, ` 96, a 97, b 98, c 99, d 100, e 101, f 102, g 103, h 104, i
    105, j 106, k 107, l 108, m 109, n 110, o 111, p 112, q 113, r 114, s 115,
    t 116, u 117, v 118, w 119, x 120, y 121, z 122, { 123, | 124, } 125, ~
    126, 127, タ 128, チ 129, ツ 130, テ 131, ト 132, ナ 133, ニ 134, ヌ 135, ネ 136, ノ
    137, ハ 138, ヒ 139, フ 140, ヘ 141, ホ 142, マ 143, ミ 144, ム 145, メ 146, モ 147,
    ヤ 148, ユ 149, ヨ 150, ラ 151, リ 152, ル 153, レ 154, ロ 155, ワ 156, ン 157, ゙
    158, ゚ 159, ᅠ 160, ᄀ 161, ᄁ 162, ᆪ 163, ᄂ 164, ᆬ 165, ᆭ 166, ᄃ 167, ᄄ 168,
    ᄅ 169, ᆰ 170, ᆱ 171, ᆲ 172, ᆳ 173, ᆴ 174, ᆵ 175, ᄚ 176, ᄆ 177, ᄇ 178, ᄈ
    179, ᄡ 180, ᄉ 181, ᄊ 182, ᄋ 183, ᄌ 184, ᄍ 185, ᄎ 186, ᄏ 187, ᄐ 188, ᄑ 189,
    ᄒ 190, ﾿ 191, À 192, Á 193, Â 194, Ã 195, Ä 196, Å 197, Æ 198, Ç 199, È
    200, É 201, Ê 202, Ë 203, Ì 204, Í 205, Î 206, Ï 207, Ð 208, Ñ 209, Ò 210,
    Ó 211, Ô 212, Õ 213, Ö 214, × 215, Ø 216, Ù 217, Ú 218, Û 219, Ü 220, Ý
    221, Þ 222, ß 223, à 224, á 225, â 226, ã 227, ä 228, å 229, æ 230, ç 231,
    è 232, é 233, ê 234, ë 235, ì 236, í 237, î 238, ï 239, ð 240, ñ 241, ò
    242, ó 243, ô 244, õ 245, ö 246, ÷ 247, ø 248, ù 249, ú 250, û 251, ü 252,
    ý 253, þ 254, ÿ 255,


    2. I copied and pasted the IDLE log into a text file and ran a program on
    it that told me about every byte in the log.

    3. I discovered the following:

    Bytes 001 to 127 (01 to 7F hex) inclusive were printed as-is;

    Bytes 128 to 191 (80 to BF) inclusive were output as UTF-8-encoded
    characters whose codepoints were FF00 hex more than the byte values (hence
    the strange glyphs);

    Bytes 192 to 255 (C0 to FF) inclusive were output as UTF-8-encoded
    characters - without any offset being added to their codepoints in the meantime!
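
    The analysis program itself was not posted. A minimal Python 3 sketch of that kind of check is given below, assuming the pasted log has already been split into (displayed character, original byte value) pairs. The function name analyse and the three sample pairs are made up for illustration and follow the description above.

```python
def analyse(pairs):
    """For each (displayed character, original byte value) pair, report the
    code point, its UTF-8 bytes, and how far the code point is from the byte."""
    rows = []
    for ch, byte_value in pairs:
        rows.append({
            "byte": byte_value,
            "code_point": "U+%04X" % ord(ch),
            "utf8": ch.encode("utf-8").hex(),
            "offset": ord(ch) - byte_value,
        })
    return rows


if __name__ == "__main__":
    # Assumed samples: one ASCII byte, one byte in 0x80-0xBF, one in 0xC0-0xFF.
    sample = [("A", 65), ("\uff9d", 157), ("\u00c0", 192)]
    for row in analyse(sample):
        print(row)
```

    An offset of 0xFF00 (65280) marks the shifted range, and an offset of 0 marks bytes that kept their value as a code point.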

    I thought you might just be interested in this - there does seem to be some method in IDLE's mind, at least.

    Stephen Tucker.








  • From Thomas Passin@21:1/5 to Stephen Tucker on Wed Jan 18 11:05:24 2023
    On 1/18/2023 5:43 AM, Stephen Tucker wrote:
    > Bytes 001 to 127 (01 to 7F hex) inclusive were printed as-is;
    >
    > Bytes 128 to 191 (80 to BF) inclusive were output as UTF-8-encoded
    > characters whose codepoints were FF00 hex more than the byte values
    > (hence the strange glyphs);
    >
    > Bytes 192 to 255 (C0 to FF) inclusive were output as UTF-8-encoded
    > characters - without any offset being added to their codepoints in the
    > meantime!
    >
    > I thought you might just be interested in this - there does seem to be
    > some method in IDLE's mind, at least.

    This has nothing to do with IDLE. The UTF-8 encoding of those code
    points uses two bytes instead of one. See

    https://stackoverflow.com/questions/8732025/why-degree-symbol-differs-from-utf-8-from-unicode
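
    The byte counts are easy to confirm in Python 3. The sketch below uses a hypothetical helper, utf8_hex, to show the UTF-8 bytes for a code point: ASCII needs one byte, U+0080 to U+07FF need two, and U+0800 to U+FFFF need three.

```python
def utf8_hex(code_point):
    """Return the UTF-8 encoding of one code point as space-separated hex bytes."""
    encoded = chr(code_point).encode("utf-8")
    return " ".join("%02x" % b for b in encoded)


if __name__ == "__main__":
    # 0x7F: last one-byte code point; 0xB0: degree sign; 0xFF9D: a halfwidth form.
    for cp in (0x7F, 0xB0, 0xFF9D):
        print("U+%04X" % cp, "->", utf8_hex(cp))
```

    The degree sign U+00B0 becomes the two bytes c2 b0, while a code point in the U+FF80 to U+FFBF range takes three bytes in a UTF-8 file.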




  • From Peter J. Holzer@21:1/5 to Thomas Passin on Wed Jan 18 17:57:46 2023
    On 2023-01-18 11:05:24 -0500, Thomas Passin wrote:
    On 1/18/2023 5:43 AM, Stephen Tucker wrote:
    >> Bytes 001 to 127 (01 to 7F hex) inclusive were printed as-is;

    Which might mean that they are also UTF-8-encoded (there is no
    difference between UTF-8-encoding and ASCII-encoding for these
    characters).


    >> Bytes 128 to 191 (80 to BF) inclusive were output as UTF-8-encoded
    >> characters whose codepoints were FF00 hex more than the byte values
    >> (hence the strange glyphs);
    >>
    >> Bytes 192 to 255 (C0 to FF) inclusive were output as UTF-8-encoded
    >> characters - without any offset being added to their codepoints in the
    >> meantime!
    >>
    >> I thought you might just be interested in this - there does seem to be
    >> some method in IDLE's mind, at least.
    >
    > This has nothing to do with IDLE. The UTF-8 encoding of those code
    > points uses two bytes instead of one. See

    That's not the peculiar thing. The peculiar thing is that characters
    U+0080 to U+00BF are recoded to U+FF80 to U+FFBF (but U+00C0 to U+00FF
    are printed normally).

    I have no idea what's happening here. I can only urge Stephen to use
    Python 3.x instead of Python 2.7. Python2 has been deprecated for years
    and reached its official end of life 3 years ago. There really
    shouldn't be any reason to use Python 2.7 any more except
    reverse-engineering old applications in order to port them to Python 3.

    In particular, the type "str" is very different in Python2 and Python3.
    In Python2 it is a sequence of bytes (similar to the Python3 type
    "bytes") and in Python3 it is a sequence of (Unicode) characters
    (similar to the Python2 type "unicode").
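
    The distinction shows up directly in Python 3, as in the short sketch below. The helper name summarize and the sample character are chosen only for illustration.

```python
def summarize(obj):
    """Return (type name, length, first element) for a str or bytes object."""
    return type(obj).__name__, len(obj), obj[0]


if __name__ == "__main__":
    text = "\u00e9"               # str: one Unicode character (e with acute accent)
    data = text.encode("utf-8")   # bytes: the same character as two UTF-8 bytes
    print(summarize(text))        # the first element is a 1-character str
    print(summarize(data))        # the first element is an int in range(256)
    # Going from bytes back to str always needs an explicit encoding.
    print(data.decode("utf-8") == text)
```

    Indexing a str yields characters and indexing bytes yields integers, which is why a loop over chr(157) and friends means something different in Python2 and Python3.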

    hp

  • From Eryk Sun@21:1/5 to Stephen Tucker on Wed Jan 18 14:41:01 2023
    On 1/17/23, Stephen Tucker <[email protected]> wrote:

    > 1. Can anybody explain the behaviour in IDLE (Python version 2.7.10)
    > reported below? (It seems that the way it renders a given sequence of
    > bytes depends on the sequence.)

    In 2.x, IDLE tries to decode a byte string via unicode() before
    writing to the Tk text widget. However, if the locale encoding (e.g.
    the process ANSI code page) fails to decode one or more characters,
    IDLE lets Tk figure out how to decode the byte string.

    Python 2.7 has an older version of Tk that has peculiar behavior on
    Windows when bytes in the range 0x80-0xBF are written to a text box.
    Bytes in this range get translated to native wide characters (16-bit
    characters) in the halfwidth/fullwidth Unicode block, i.e. translated
    to Unicode U+FF80 - U+FFBF.

    If IDLE decodes using code page 1252, then the ordinals 0x81, 0x8d,
    0x8f, 0x90 and 0x9d can't be decoded. IDLE thus passes the undecoded
    byte string to Tk. The example you provided that demonstrates the
    behavior contains ordinal 0x9d (157).
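
    The list of undecodable ordinals can be reproduced with Python's own cp1252 codec. This is a Python 3 sketch with a hypothetical helper name, undecodable_bytes; it exercises the codec only and does not involve IDLE or Tk.

```python
def undecodable_bytes(encoding):
    """Return every byte value that the given single-byte codec cannot decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


if __name__ == "__main__":
    print([hex(v) for v in undecodable_bytes("cp1252")])
    print(undecodable_bytes("latin-1"))  # latin-1 defines all 256 byte values
```

    A byte string that contains any of these values makes a strict cp1252 decode fail as a whole, which is the condition described above for handing the raw bytes to Tk.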

    I get similar behavior for the other undefined ordinal values in code
    page 1252. For example, using IDLE 2.7.18 on Windows:

    >>> print '\x81\xa1'
    チᄀ
    >>> print 'a\xa1'
    a¡

    In the first case, ordinal 0x81 causes decoding to fail in IDLE, so
    the byte string is passed as is to Tk, which maps it to
    '\uff81\uffa1'. In the second case, OTOH, "\xa1" is decoded by IDLE as
    "¡".

    > 2. Does the IDLE in Python 3.x behave the same way?

    No, in 3.x only Unicode str() objects are written to the Tk text
    widget. Moreover, the text widget doesn't have the same behavior in
    newer versions. It ignores bytes in the control-block range 0x80-0x9F,
    and it decodes bytes in the range 0xA0-0xBF normally.
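
    To tie the 2.x explanation together, here is a small Python 3 model of the behaviour described for question 1: try the locale encoding first, and if that fails, shift bytes 0x80-0xBF into U+FF80-U+FFBF. This is only a model and not IDLE or Tk code. The name idle2_display is hypothetical, cp1252 is assumed as the locale encoding, and passing bytes 0xC0-0xFF through unchanged as code points is an assumption taken from the log earlier in the thread.

```python
def idle2_display(data, encoding="cp1252"):
    """Model (not real IDLE/Tk code) of how IDLE 2.x on Windows showed a byte string."""
    try:
        # Step 1: decode with the locale encoding.
        return data.decode(encoding)
    except UnicodeDecodeError:
        # Step 2: on failure the raw bytes go to Tk, which (per the description
        # above) maps 0x80-0xBF to U+FF80-U+FFBF; other bytes are assumed to
        # keep their value as a code point.
        return "".join(
            chr(0xFF00 + b) if 0x80 <= b <= 0xBF else chr(b)
            for b in data
        )


if __name__ == "__main__":
    with_157 = bytes(range(157, 169))
    without_157 = bytes(range(158, 169))
    print(["U+%04X" % ord(c) for c in idle2_display(with_157)])
    print(["U+%04X" % ord(c) for c in idle2_display(without_157)])
```

    The single byte 157 is enough to flip the whole string into the fallback path, which matches the original report that the rendering depends on the sequence.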
