• Dealing with encodings

    From Luc@21:1/5 to All on Fri Feb 24 15:24:09 2023
    I have a basic text editor that, well, edits text files.

    I've been using it for a long time without ever giving
    a thought about encodings. I just open, edit and save.
    I never knew or cared about what encodings were involved.

    I want to change that.

    I know how to tell Tcl to write with a certain encoding.
    But I never implemented that and I've been thinking that
    I should probably keep the existing encoding in most cases,
    and for that I have to be able to tell what encoding is
    there already.

    I believe Tcl cannot do that. I've been researching and
    it seems that we need external software to do that, namely
    'file' and 'enca', neither of which is super reliable.

    What experience do you have with that? Can you share any
    suggestions or recommendations?

    --
    Luc


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Luc on Sat Feb 25 06:10:20 2023
    Luc wrote:
    > I know how to tell Tcl to write with a certain encoding. But I never
    > implemented that and I've been thinking that I should probably keep
    > the existing encoding in most cases, and for that I have to be able
    > to tell what encoding is there already.
    >
    > I believe Tcl cannot do that. I've been researching and it seems
    > that we need external software to do that, namely 'file' and 'enca',
    > neither of which is super reliable.

    Actually, absent side-channel information, it is impossible to tell
    with 100% certainty what 'encoding' a given file has been encoded with.

    The best you can do is verify that a given file does not contain any
    illegal sequences for the expected encoding. These kinds of
    heuristics will get you 95% of the way there, but it will always be
    possible for something to slip through.
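
    As a rough illustration of the "no illegal sequences" check, here is
    a minimal sketch for UTF-8 in plain Tcl (8.5 or later assumed). The
    proc name isValidUtf8 is made up for this example. It walks the raw
    bytes itself, so it does not rely on how any particular Tcl version's
    [encoding convertfrom] treats bad input. A file that passes is merely
    consistent with UTF-8, not proven to be UTF-8.

```tcl
# Return 1 if the byte string $data is well-formed UTF-8, 0 otherwise.
# (Hypothetical helper name, written for illustration.)
proc isValidUtf8 {data} {
    set bytes {}
    binary scan $data cu* bytes   ;# list of unsigned byte values
    set need 0                    ;# continuation bytes still expected
    set lo 0x80                   ;# allowed range for the next
    set hi 0xBF                   ;# continuation byte
    foreach b $bytes {
        if {$need > 0} {
            if {$b < $lo || $b > $hi} { return 0 }
            incr need -1
            set lo 0x80
            set hi 0xBF
        } elseif {$b <= 0x7F} {
            continue                          ;# plain ASCII
        } elseif {$b >= 0xC2 && $b <= 0xDF} {
            set need 1                        ;# 2-byte sequence
        } elseif {$b >= 0xE0 && $b <= 0xEF} {
            set need 2                        ;# 3-byte sequence
            if {$b == 0xE0} { set lo 0xA0 }   ;# reject overlong forms
            if {$b == 0xED} { set hi 0x9F }   ;# reject surrogates
        } elseif {$b >= 0xF0 && $b <= 0xF4} {
            set need 3                        ;# 4-byte sequence
            if {$b == 0xF0} { set lo 0x90 }   ;# reject overlong forms
            if {$b == 0xF4} { set hi 0x8F }   ;# reject > U+10FFFF
        } else {
            return 0                          ;# 0x80-0xC1, 0xF5-0xFF
        }
    }
    # A sequence cut off at end of data is also invalid.
    return [expr {$need == 0}]
}

# "café" encoded as UTF-8 passes; the same word in ISO 8859-1 does not.
puts [isValidUtf8 [encoding convertto utf-8 "caf\u00e9"]]   ;# 1
puts [isValidUtf8 [binary format H* 636166e9]]              ;# 0
```

    To check a file, read it in binary mode first (open $path rb) so the
    proc sees the bytes exactly as stored. Note that pure ASCII files
    pass this test too, and they are equally valid in most legacy 8-bit
    encodings, which is one way things "slip through".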

    > What experience do you have with that? Can you share any suggestions
    > or recommendations?

    For reading, if you assume UTF-8, you'll be right more often than wrong
    for anything modern. The older the "text file" you plan to edit, the
    greater the probability that UTF-8 is the wrong choice. And there
    will always be a few where you just have to make a guess and see if
    the result looks right.

    For writing, just create everything as UTF-8 unless you have a *very*
    good reason to do otherwise.
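
    In Tcl terms that advice might look like the sketch below. The proc
    names readTextFile and writeTextFile are invented for this example,
    and the default of utf-8 is the assumption being illustrated. The
    encoding is set explicitly on the channel rather than left to the
    system default.

```tcl
# Read a whole text file, decoding it with $enc (UTF-8 by default).
# (Hypothetical helper names, written for illustration.)
proc readTextFile {path {enc utf-8}} {
    set fh [open $path r]
    fconfigure $fh -encoding $enc
    set text [read $fh]
    close $fh
    return $text
}

# Write $text to a file, encoding it with $enc (UTF-8 by default).
proc writeTextFile {path text {enc utf-8}} {
    set fh [open $path w]
    fconfigure $fh -encoding $enc
    puts -nonewline $fh $text
    close $fh
}
```

    If the guess on reading looks wrong, the same file can be re-read
    with another name from [encoding names], for example
    readTextFile $path iso8859-1. How invalid bytes are handled during a
    read differs between Tcl versions, so a read that succeeds is not by
    itself proof that the guess was right.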
