• tar::create with encoding?

    From Alexandru@21:1/5 to All on Sun Oct 16 09:06:23 2022
    Hi,

    It seems, that there is no encoding option for the command tar:encoding.

    I use this command to create an archive of multiple files:

    set fd [open $zipfile wb]
    zlib push gzip $fd -level 9
    tar::create $fd $paths -chan
    close $fd

    Now I realize that all files with umlauts in the path/name are wrongly encoded when the archive is unpacked with the Windows program 7z.

    What could be the solution to this issue?

    Many thanks
    Alexandru

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Alexandru on Sun Oct 16 17:08:28 2022
    Alexandru <[email protected]> wrote:
    Hi,

    It seems, that there is no encoding option for the command
    tar:encoding.

    There is also no command tar::encoding.

    Tar (the archive format) is so old that it does not have an 'encoding'.
    It just stores bytes, and upper level code has to decide what to do
    with the bytes.

    I use this command to create an archive of multiple files:

    set fd [open $zipfile wb]
    zlib push gzip $fd -level 9
    tar::create $fd $paths -chan
    close $fd

    Now I realize, all file with Umlaute in the path/name are wrongly
    encoded when unpacking the archive with the Windows program 7z.

    The issue here could be Tcllib tar, or it could be 7z. Right now you
    don't know, and Tar (the format) has no way to communicate a flag that
    says "filenames herein are UTF8 (or any other encoding)".

    What could be the solution to this issue?

    Several:

    1) (easiest, but may not be practical) -- don't use umlauts (or other non-ASCII characters) in filenames.

    2) If you look through the source of Tcllib's tar, you will find that
    it inserts the filenames into the tar header block using binary format
    "a" (which simply inserts the codepoint value modulo 256, and that will
    only be correct for an 8-bit fixed length encoding). Which likely
    means the breakage happens during tar::create.
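A minimal sketch of the difference described in point 2, assuming Tcl 8.6 and a hypothetical file name "Tür.txt" (not taken from the thread): `binary format a*` keeps only the low 8 bits of each character, while `encoding convertto` produces the byte sequence of an explicitly chosen encoding.

```tcl
# Hypothetical file name containing U+00FC (u-umlaut), written as an escape
# so the result does not depend on the encoding of this script file.
set name "T\u00fcr.txt"

# What "binary format a" stores: one byte per character (low 8 bits only).
binary scan [binary format a* $name] H* rawHex

# What an explicit conversion to UTF-8 stores: the umlaut becomes two bytes.
binary scan [encoding convertto utf-8 $name] H* utf8Hex

puts "binary format a*: $rawHex"   ;# 54fc722e747874
puts "utf-8:            $utf8Hex"  ;# 54c3bc722e747874
```

For this particular character the single byte fc happens to coincide with ISO 8859-1 and cp1252; characters above U+00FF would lose information.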

    If you look further up the call chain, you find that directories are
    resolved to lists of filenames via glob, and the proc which writes each
    tar component is fed a filename to work with.

    So, you could use tcllib's find to pre-acquire the filenames you want to
    pack into the Tar file, pre-encode them into the appropriate encoding
    using 'encoding convertto', and output the tar file by calling
    'formatHeader' with the 'encoded' name, and fcopying the file contents yourself.

    3) You could patch tcllib's tar to encode filenames to an encoding
    (including allowing specification of that encoding type via an option
    to tar::create). And then contribute the patches back to Tcllib so
    everyone benefits.

  • From Alexandru@21:1/5 to Rich on Mon Oct 17 03:58:23 2022
    Thanks Rich. After looking at tar.tcl, I see that "-encoding binary" is used for the output channel (which must be the archive file) and also for encoding data inside the file, e.g. for composing the header.
    I think that if I start playing around with the code, I might even make it work for my case, but most probably it won't work for other cases. This is due to my limited understanding of the whole encoding stuff.
    But I think changing the source is the best way.
    What I don't quite understand is why pre-encoding the paths does not work:

    tar::create $fd [encoding convertto $enc $paths] -chan

    I tried enc=utf-8 and the Windows compatible encoding enc=cp1252 but both didn't work.
    Shouldn't this be enough?

  • From Alexandru@21:1/5 to Rich on Mon Oct 17 05:14:20 2022
    Rich wrote on Monday, 17 October 2022 at 14:00:42 UTC+2:
    Alexandru <[email protected]> wrote:
    Rich wrote on Sunday, 16 October 2022 at 19:08:32 UTC+2:
    Alexandru <[email protected]> wrote:
    Now I realize, all file with Umlaute in the path/name are wrongly
    encoded when unpacking the archive with the Windows program 7z.
    What I don't quite understand is why pre-encoding the paths does not work:

    tar::create $fd [encoding convertto $enc $paths] -chan

    I tried enc=utf-8 and the Windows compatible encoding enc=cp1252 but
    both didn't work. Shouldn't this be enough?
    If you look through the source, the tar module uses the paths you
    supply to also open each file and copy its contents to the output tar
    file. If you pre-encode the strings, then those opens likely will not
    find the correct file (because the name used to open will have been
    changed by the encoding process).

    The patch to tar, presuming it would work, would be to perform encoding convertto on the path/name inside the writeheader proc that outputs the paths/names into the tar header. That way the open gets back the
    string it needs to open the correct file, but non-ascii characters get encoded just before being output into the header.

    Thanks. I changed this next paragraph in the source by adding "encoding convertto cp1252"

    set header [binary format a100A8A8A8A12A12A8a1a100A6a2a32a32a8a8a155a12 \
    [encoding convertto cp1252 $name] $A(mode)\x00 $ouid\x00 $ogid\x00\
    $osize\x00 $omtime\x00 {} $type \
    $A(linkname) ustar\x00 00 $A(uname) $A(gname)\
    $A(devmajor) $A(devminor) $prefix {}]

    Also tried with utf-8. The result is a valid archive, but when I open it with Windows 7z, the names in the archive show different special characters, not the umlauts I actually have in the original file names.

  • From Rich@21:1/5 to Alexandru on Mon Oct 17 12:00:37 2022
    Alexandru <[email protected]> wrote:
    Rich wrote on Sunday, 16 October 2022 at 19:08:32 UTC+2:
    Alexandru <[email protected]> wrote:
    Now I realize, all file with Umlaute in the path/name are wrongly
    encoded when unpacking the archive with the Windows program 7z.
    What I don't quite understand is why pre-encoding the paths does not work:

    tar::create $fd [encoding convertto $enc $paths] -chan

    I tried enc=utf-8 and the Windows compatible encoding enc=cp1252 but
    both didn't work. Shouldn't this be enough?

    If you look through the source, the tar module uses the paths you
    supply to also open each file and copy its contents to the output tar
    file. If you pre-encode the strings, then those opens likely will not
    find the correct file (because the name used to open will have been
    changed by the encoding process).

    The patch to tar, presuming it would work, would be to perform encoding convertto on the path/name inside the writeheader proc that outputs the paths/names into the tar header. That way the open gets back the
    string it needs to open the correct file, but non-ascii characters get
    encoded just before being output into the header.
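A small sketch of why a pre-encoded path no longer names the same file, assuming Tcl 8.6 and a hypothetical file name "Tür.txt": after `encoding convertto`, the value is a byte string that Tcl treats as a different sequence of characters, so it compares unequal to the original name.

```tcl
set path "T\u00fcr.txt"                        ;# hypothetical name as Tcl sees it
set encoded [encoding convertto utf-8 $path]   ;# byte string, one char per byte

puts [string length $path]          ;# 7
puts [string length $encoded]       ;# 8
puts [string equal $path $encoded]  ;# 0: a different name as far as open is concerned
```

This is why the conversion has to happen only at the point where the name is written into the header, not on the paths handed to tar::create.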

  • From Rich@21:1/5 to Alexandru on Mon Oct 17 13:35:52 2022
    'encoding names' will give you all the possibilities your Tcl supports.
    Whether one of them works is unknown, and is dependent upon what 7z
    expects to see in the names inside the tar file (which is the big
    unknown here, what does 7z expect, you need to insert what it expects,
    but without knowing that fact, you are left with trying all to see if
    any work).
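Rather than trying archives blindly, it can help to see which bytes each candidate encoding produces for a single umlaut. The sketch below assumes Tcl 8.6 with its standard encoding files; the list of candidates is only an illustrative selection, and which of them 7z expects remains the open question in this thread.

```tcl
set ch "\u00fc"   ;# u-umlaut
foreach enc {cp1252 iso8859-1 cp437 cp850 utf-8} {
    binary scan [encoding convertto $enc $ch] H* hex
    set bytes($enc) $hex
    puts [format "%-10s %s" $enc $hex]
}
# cp1252 and iso8859-1 give fc, cp437 and cp850 give 81, utf-8 gives c3bc
```

Comparing these values with the bytes found in an archive that 7z displays correctly narrows down the encoding quickly.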

  • From Rich@21:1/5 to Alexandru on Mon Oct 17 13:54:20 2022
    Alexandru <[email protected]> wrote:
    Also tried with utf-8. The result is a valid archive but the names
    in the archive, when I open it with Windows 7z shows different
    special chars, not the Umlaute I actually have in the original file
    names.

    Try one more test (if you can).

    Create a tar file using 7z (if possible) and see if:

    1) the umlauts are encoded correctly (if the answer is no here, then
    this may not be possible with 7z)

    2) if the answer is yes to #1, then open up the tar file in a hex
    display/hex editor and try to work out which encoding 7z likely
    used for the filename (and the umlauts).
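If no hex editor is at hand, the name field can be pulled out with Tcl itself. This is a sketch under the assumption that the archive is an uncompressed ustar-style tar whose first 512-byte header starts with the 100-byte name field (the same layout as the a100 field in tar.tcl's binary format call); `tarNameHex` and the fabricated header are hypothetical helpers for illustration.

```tcl
# Return the raw bytes of the name field (first 100 bytes of a 512-byte
# tar header) as a hex string, with the NUL padding removed.
proc tarNameHex {header} {
    binary scan $header a100 name
    set name [string trimright $name \x00]
    binary scan $name H* hex
    return $hex
}

# Fabricated header for illustration: a UTF-8 encoded name, NUL padded.
set header [binary format a100a412 [encoding convertto utf-8 "T\u00fcr.txt"] ""]
puts [tarNameHex $header]   ;# 54c3bc722e747874
```

For a real archive, the header would come from `set f [open data.tar rb]; set header [read $f 512]; close $f`, with data.tar standing in for the actual file.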

  • From Alexandru@21:1/5 to Rich on Mon Oct 17 06:59:27 2022
    Creating with Windows 7z is no problem.
    I opened the archive with SublimeText as Hexadecimal file and I only see binary stuff:

    377a bcaf 271c 0004 cf4b c197 6006 1b00
    0000 0000 2400 0000 0000 0000 5fca de53
    e3ac bcc0 235d 0006 82cf 6346 7fed db19
    67c2 b2aa c224 0a02 1e57 167f 3a28 63ef
    864b 7da8 71ed dc72 9494 456e b474 a34e
    7646 3e62 b0bc fb35 b31f 98ec 0cde 30ab

    How should that work? Which editor do you recommend (for Windows)?
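One observation on the dump above: it begins with the bytes 37 7a bc af 27 1c, which is the signature of the 7z container format, whereas a gzip stream starts with 1f 8b and a ustar header carries the string "ustar" at offset 257. A hedged sketch for checking this from Tcl follows; `containerType` is a hypothetical helper and only covers these three cases.

```tcl
# Classify a file by its leading bytes. $data should hold at least the
# first 512 bytes of the file, read from a channel opened in binary mode.
proc containerType {data} {
    set sig ""
    binary scan $data H12 sig
    if {$sig eq "377abcaf271c"} { return 7z }
    if {[string range $sig 0 3] eq "1f8b"} { return gzip }
    if {[string range $data 257 261] eq "ustar"} { return tar }
    return unknown
}

# The first bytes of the dump quoted above.
puts [containerType [binary format H* 377abcaf271c0004]]   ;# 7z
```

If the result is 7z, the file names sit inside a compressed 7z container and will not be visible as plain bytes; the archive would have to be created in tar format instead for the comparison to work.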

  • From Alexandru@21:1/5 to Alexandru on Mon Oct 17 08:46:28 2022
    OK, I added the HexViewer package to SublimeText and now I get the usual hex-editor view. Let's see where this leads...

  • From Alexandru@21:1/5 to Alexandru on Mon Oct 17 08:52:35 2022
    Beats me, I only see gibberish. How the hell can I tell from the hex view what the damn encoding is?
    I just ran "file --mime-encoding" in the Git console and the return value is binary. No surprise here.

  • From Robert Heller@21:1/5 to Alexandru on Mon Oct 17 16:08:06 2022
    Copy the 7z created tar file to a Linux machine and see what 'tar tvf' gives for file names.

    Is 7z compressing the tar file? If so, you need to uncompress it before looking for the filename encoding.



    --
    Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
    Deepwoods Software -- Custom Software Services
    http://www.deepsoft.com/ -- Linux Administration Services
    [email protected] -- Webhosting Services

  • From Christian Gollwitzer@21:1/5 to All on Mon Oct 17 19:18:43 2022
    A tar archive is usually first tarred, then gzipped. You need to undo
    the gzip first to see the tar (the metadata is also compressed, unlike a
    ZIP file).

    On Linux you would do e.g.

    zcat compressedfile > uncompressedfile

    Or, if the file has the usual ending (e.g. data.tar.gz) then this should
    do the trick:

    gunzip data.tar.gz

    which then yields data.tar
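On a machine without gunzip, the same step can be done from Tcl 8.6 with the built-in zlib command. This is a minimal sketch; `gunzipFile` is a hypothetical helper name, and both channels are opened in binary mode so that no byte is translated along the way.

```tcl
# Decompress a gzip file to a plain file, like "gunzip -c src > dst".
proc gunzipFile {src dst} {
    set in  [open $src rb]
    set out [open $dst wb]
    zlib push gunzip $in      ;# reads from $in now yield decompressed bytes
    fcopy $in $out
    close $in
    close $out
}
```

Calling `gunzipFile data.tar.gz data.tar` would then leave an uncompressed data.tar that can be inspected in a hex viewer.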

    Christian


  • From Rich@21:1/5 to Alexandru on Mon Oct 17 18:22:45 2022
    First you have to find the bytes that are the filename/path within the
    header, then compare what bytes are present with what the displayed
    names are when viewed in windows/7z. What differs bytes wise vs. the
    view will then begin to suggest an encoding.
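The comparison described here can also be automated: decode the raw name bytes under each candidate encoding and keep the ones that reproduce the name as displayed. The sketch assumes Tcl 8.6; the bytes, the expected name and the candidate list are fabricated for illustration.

```tcl
# Bytes as they might appear in the name field (fabricated for illustration).
set raw [binary format H* 54c3bc722e747874]
set expected "T\u00fcr.txt"   ;# the name as shown on disk

set matches {}
foreach enc {cp1252 cp437 cp850 utf-8} {
    if {[encoding convertfrom $enc $raw] eq $expected} {
        lappend matches $enc
    }
}
puts $matches   ;# utf-8
```

More than one encoding can match for a given name, so a name with several different non-ASCII characters gives a more reliable answer.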

  • From Rich@21:1/5 to Alexandru on Mon Oct 17 18:21:13 2022
    Alexandru <[email protected]> wrote:
    How should that work? Which editor do you recommend (for Windows)?

    No idea, as I avoid windows like the plague it is.

    Under linux I'd use xxd or hexdump, which can both be asked to give a
    hex dump plus ascii equivalents in the same view.

    If you can post the first 512 bytes of the tar file (that's the tar
    header) as hex like the above, I can use the hex to convert it back to
    binary, then dump with xxd or hexdump.
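Where neither xxd nor hexdump is available, a similar hex-plus-ASCII view can be produced in Tcl. This is only a rough sketch of that output style, not a reimplementation of either tool; `hexdump` is a hypothetical proc and the sample data is fabricated.

```tcl
# Format $data as rows of 16 bytes: offset, hex, and printable ASCII.
proc hexdump {data} {
    set out {}
    for {set i 0} {$i < [string length $data]} {incr i 16} {
        set row [string range $data $i [expr {$i + 15}]]
        binary scan $row H* hex
        # Replace everything outside printable ASCII with a dot.
        set ascii [regsub -all {[^\x20-\x7e]} $row .]
        lappend out [format "%08x  %-32s  %s" $i $hex $ascii]
    }
    return [join $out \n]
}

# Fabricated sample: a UTF-8 encoded name padded to 32 bytes with NULs.
puts [hexdump [binary format a32 [encoding convertto utf-8 "T\u00fcr.txt"]]]
```

Applied to the first 512 bytes of an uncompressed tar file, the name field appears in the first rows of the output.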

  • From Robert Heller@21:1/5 to [email protected] on Mon Oct 17 19:29:12 2022
    At Mon, 17 Oct 2022 19:18:43 +0200 Christian Gollwitzer <[email protected]> wrote:


    A tarred archive is usually first tarred, then gzipped. YOu need to undo
    the gzip first to see the tar (the metadata is also compressed, unlike a
    ZIP file)

    More correctly, tarballs are not compressed as part of being tarballs; rather, the tarball itself is compressed as a whole. This is unlike a ZIP file, which compresses each entry separately while the overall ZIP file itself is not compressed (e.g. the ZIP directory is not compressed, even though some or all of the contents are).


    On Linux you would do e.g.

    zcat compressedfile > uncompressedfile

    Or, if the file has the usual ending (e.g. data.tar.gz) then this should
    do the trick:

    gunzip data.tar.gz

    (Modern) Linux tar itself recognizes the endings and will decompress on the fly as needed:

    tar tvf data.tar.gz
    (or:
    tar tvf data.tar.bz2
    etc.)

    No need to separately decompress the tarfile, if you are going to use the tar command on it. This might actually be a good first step for the OP. Seeing
    what tar displays for the file name might yield some enlightenment.

    It should be noted that tar originated under UNIX and predates the use of non-ASCII characters in file names.



  • From Christian Gollwitzer@21:1/5 to All on Tue Oct 18 07:48:38 2022
    On 17.10.22 at 21:29, Robert Heller wrote:
    No need to separately decompress the tarfile, if you are going to use the tar command on it. This might actually be a good first step for the OP. Seeing what tar displays for the file name might yield some enlightenment.

    It should be noted that tar originated under UNIX and predates the use of non-ASCII characters in file names.

    But then you again see an interpretation of the file names through tar
    and the terminal. Alexandru wants to see the raw data in order to
    replicate tar in Tcl code. I would guess that tar simply stores the file
    name as a byte stream, since on Linux the file systems do not have an
    encoding, as opposed to Windows: the file names you see depend on how
    you set the locale on Linux, whereas they are converted to UTF-16 on NTFS
    file systems on Windows.

    7z on Windows might again have a different idea of the tar format. Not
    to mention that there are multiple tar formats out there; e.g. see here:

    https://www.gnu.org/software/tar/manual/html_section/Formats.html



    Christian

  • From Alexandru@21:1/5 to Christian Gollwitzer on Wed Oct 19 01:40:46 2022
    This whole encoding stuff is crazy to follow.
    I was thinking, maybe I can get a workaround if I use Tcl to unpack the archive.
    Since I could not find a manual for the tar package, I had to read the source code and other forums online to get close to a solution, which is not working right now.
    So here is the code to unpack a 7z archive, which contains a tar archive:

    set f [open $zipfile]
    zlib push gunzip $f
    set result [tar::untar $f -chan]
    close $f

    I get this type of error:
    couldn't open "<filename> 100644 0 0 226212 1432372742" : filename is invalid on this platform

    It seems that untar cannot correctly handle multiple file names in the header. The first file name in the archive is handled correctly. The second one is somehow cut in the middle, and the data that follows is attached to the file name instead.

    The method used to create the 7z file is

    set fd [open $zipfile wb]
    zlib push gzip $fd -level 9
    tar::create $fd $paths -chan
    close $fd

    where $paths is a list of full file paths.

    Is this a bug? Or am I using the wrong method to unpack?
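One detail differs between the two snippets above: the archive is written through a channel opened with wb, but read back through one opened without the b flag. Whether that explains the error is not established in this thread. The sketch below only shows a symmetric gzip round trip in core Tcl 8.6, without the tar package, with both channels in binary mode; `gzipRoundTrip` is a hypothetical helper.

```tcl
# Write $data gzip-compressed to $path, then read it back the same way.
proc gzipRoundTrip {path data} {
    set out [open $path wb]
    zlib push gzip $out -level 9
    puts -nonewline $out $data
    close $out

    set in [open $path rb]     ;# binary mode, mirroring the writer
    zlib push gunzip $in
    set back [read $in]
    close $in
    return $back
}
```

If the bytes survive this round trip but tar::untar still fails on the same file, the problem is more likely in how the tar data itself is read than in the gzip layer.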
