Forum: >>> Magnum BBS <<<

tar::create with encoding?

From Alexandru@21:1/5 to All on Sun Oct 16 09:06:23 2022

Hi,

It seems, that there is no encoding option for the command tar:encoding.

I use this command to create an archive of multiple files:

set fd [open $zipfile wb]
zlib push gzip $fd -level 9
tar::create $fd $paths -chan
close $fd

Now I realize that all files with umlauts in the path/name are wrongly encoded when unpacking the archive with the Windows program 7z.

What could be the solution to this issue?

Many thanks
Alexandru

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to Alexandru on Sun Oct 16 17:08:28 2022

Alexandru <[email protected]> wrote:

> Hi,
>
> It seems, that there is no encoding option for the command
> tar:encoding.

There is also no command tar::encoding.

Tar (the archive format) is so old that it does not have an 'encoding'.
It just stores bytes, and upper level code has to decide what to do
with the bytes.

> I use this command to create an archive of multiple files:
>
> set fd [open $zipfile wb]
> zlib push gzip $fd -level 9
> tar::create $fd $paths -chan
> close $fd
>
> Now I realize, all file with Umlaute in the path/name are wrongly
> encoded when unpacking the archive with the Windows program 7z.

The issue here could be Tcllib tar, or it could be 7z. Right now you
don't know, and Tar (the format) has no way to communicate a flag that
says "filenames herein are UTF8 (or any other encoding)".

> What could be the solution to this issue?

Several:

1) (easiest, but may not be practical) -- don't use umlauts (or other non-ASCII characters) in filenames.

2) If you look through the source of Tcllib's tar, you will find that
it inserts the filenames into the tar header block using binary format
"a" (which simply inserts the codepoint value modulo 256, and that will
only be correct for an 8-bit fixed length encoding). Which likely
means the breakage happens during tar::create.

If you look further up the call chain, you find that directories are
resolved to lists of filenames via glob, and the proc which writes each
tar component is fed a filename to work with.

So, you could use tcllib's find to pre-acquire the filenames you want to
pack into the Tar file, pre-encode them into the appropriate encoding
using 'encoding convertto', and output the tar file by calling
'formatHeader' with the 'encoded' name, and fcopying the file contents
yourself.

3) You could patch tcllib's tar to encode filenames to an encoding
(including allowing specification of that encoding type via an option
to tar::create). And then contribute the patches back to Tcllib so
everyone benefits.
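
To make point 2 concrete, here is a minimal sketch using only core Tcl commands (it does not call tcllib's tar at all). It compares the bytes that binary format "a" produces for a name containing an umlaut with the bytes that 'encoding convertto' produces. The sample name is an assumption chosen for illustration; characters above U+00FF are deliberately left out because their handling by binary format differs between Tcl versions.

```tcl
# Sample name "B<a-umlaut>r.txt", written with an escape so the script
# does not depend on the encoding of its own source file.
set name "B\u00e4r.txt"

# binary format "a" writes one byte per character, so U+00E4 becomes
# the single byte 0xE4 (the same byte Latin-1/cp1252 would use).
binary scan [binary format a* $name] H* asIs

# Explicit conversions for comparison.
binary scan [encoding convertto utf-8  $name] H* utf8
binary scan [encoding convertto cp1252 $name] H* cp1252

puts "binary format a: $asIs"
puts "utf-8:           $utf8"
puts "cp1252:          $cp1252"
```

If a reader expects UTF-8 names, the single 0xE4 byte written by the unpatched header code would not decode as an umlaut, which would match the symptom described above.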

From Alexandru@21:1/5 to Rich on Mon Oct 17 03:58:23 2022

Thanks Rich. After looking at tar.tcl, I see that "-encoding binary" is used for the output channel (which must be the archive file) and also for encoding data inside the file, e.g. for composing the header.
I think if I start playing around with the code, I might even make it work for my case, but most probably it won't work for other cases. This is due to my limited understanding of the whole encoding stuff.
But I think changing the source is the best way.
What I don't quite understand is why pre-encoding the paths does not work:

tar::create $fd [encoding convertto $enc $paths] -chan

I tried enc=utf-8 and the Windows-compatible encoding enc=cp1252, but neither worked.
Shouldn't this be enough?

From Alexandru@21:1/5 to Rich on Mon Oct 17 05:14:20 2022

Rich wrote:

> If you look through the source, the tar module uses the paths you
> supply to also open each file and copy its contents to the output tar
> file. If you pre-encode the strings, then those opens likely will not
> find the correct file (because the name used to open will have been
> changed by the encoding process).
>
> The patch to tar, presuming it would work, would be to perform encoding
> convertto on the path/name inside the writeheader proc that outputs the
> paths/names into the tar header. That way the open gets back the
> string it needs to open the correct file, but non-ascii characters get
> encoded just before being output into the header.

Thanks. I changed the following statement in the source by adding "encoding convertto cp1252":

set header [binary format a100A8A8A8A12A12A8a1a100A6a2a32a32a8a8a155a12 \
[encoding convertto cp1252 $name] $A(mode)\x00 $ouid\x00 $ogid\x00\
$osize\x00 $omtime\x00 {} $type \
$A(linkname) ustar\x00 00 $A(uname) $A(gname)\
$A(devmajor) $A(devminor) $prefix {}]

I also tried utf-8. The result is a valid archive, but when I open it with 7z on Windows the names show different special characters, not the umlauts I actually have in the original file names.

From Rich@21:1/5 to Alexandru on Mon Oct 17 12:00:37 2022

Alexandru <[email protected]> wrote:

> What I don't quite understand is why pre-encoding the paths does not work:
>
> tar::create $fd [encoding convertto $enc $paths] -chan
>
> I tried enc=utf-8 and the Windows compatible encoding enc=cp1252 but
> both didn't work. Shouldn't this be enough?

If you look through the source, the tar module uses the paths you
supply to also open each file and copy its contents to the output tar
file. If you pre-encode the strings, then those opens likely will not
find the correct file (because the name used to open will have been
changed by the encoding process).

The patch to tar, presuming it would work, would be to perform encoding
convertto on the path/name inside the writeheader proc that outputs the
paths/names into the tar header. That way the open gets back the
string it needs to open the correct file, but non-ascii characters get
encoded just before being output into the header.
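
A small sketch of that distinction, in core Tcl only. The proc name nameField is hypothetical (it is not part of tcllib), and the sample name is an assumption. It shows that the converted value is a different string from the original path, so it is no use for open, while it is exactly what should land in the 100-byte name field.

```tcl
set name "B\u00e4r.txt"

# The converted value is a byte sequence, not the original path:
# 7 characters become 8 bytes, and the two strings no longer compare equal.
set encoded [encoding convertto utf-8 $name]
puts "characters: [string length $name], bytes: [string length $encoded]"
puts "same string: [string equal $name $encoded]"

# Hypothetical helper: convert only at the point where the name field of
# the header is built. Names longer than 100 bytes would need the ustar
# prefix field as well, which this sketch ignores.
proc nameField {name {enc utf-8}} {
    return [binary format a100 [encoding convertto $enc $name]]
}

set field [nameField $name]
puts "field length: [string length $field]"
```

The untouched $name stays available for open and fcopy; only the header sees the converted bytes.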

From Rich@21:1/5 to Alexandru on Mon Oct 17 13:35:52 2022

Alexandru <[email protected]> wrote:

> Also tried with utf-8. The result is a valid archive but the names
> in the archive, when I open it with Windows 7z shows different
> special chars, not the Umlaute I actually have in the original file
> names.

'encoding names' will give you all the possibilities your Tcl supports.
Whether one of them works is unknown, and is dependent upon what 7z
expects to see in the names inside the tar file (which is the big
unknown here, what does 7z expect, you need to insert what it expects,
but without knowing that fact, you are left with trying all to see if
any work).
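
Rather than rebuilding the archive once per encoding, the candidate byte sequences can be printed first and compared against a hex dump of an archive that 7z displays correctly. A minimal sketch in core Tcl; the candidate list and the sample name are assumptions, and any entry of 'encoding names' could be added.

```tcl
set name "B\u00e4r.txt"

# A few encodings commonly met on Windows; extend as needed.
set candidates {utf-8 cp1252 cp850 cp437 iso8859-1}

set table {}
foreach enc $candidates {
    # Skip anything this Tcl installation does not provide.
    if {$enc ni [encoding names]} { continue }
    binary scan [encoding convertto $enc $name] H* hex
    dict set table $enc $hex
    puts [format "%-10s %s" $enc $hex]
}
```

Whichever line matches the bytes found in the reference archive is the encoding to try in the patched header code.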

From Rich@21:1/5 to Alexandru on Mon Oct 17 13:54:20 2022

Alexandru <[email protected]> wrote:

> Also tried with utf-8. The result is a valid archive but the names
> in the archive, when I open it with Windows 7z shows different
> special chars, not the Umlaute I actually have in the original file
> names.

Try one more test (if you can).

Create a tar file using 7z (if possible) and see if:

1) the Umlaute is encoded correctly (if the answer is no here, then
this may not be possible with 7z)

2) if the answer is yes to #1, then open up the tar file in a hex
display/hex editor and try to work out what the encoding used by 7z
for the filename (and Umlaute's) likely was.

From Alexandru@21:1/5 to Rich on Mon Oct 17 06:59:27 2022

Creating with Windows 7z is no problem.
I opened the archive with SublimeText as Hexadecimal file and I only see binary stuff:

377a bcaf 271c 0004 cf4b c197 6006 1b00
0000 0000 2400 0000 0000 0000 5fca de53
e3ac bcc0 235d 0006 82cf 6346 7fed db19
67c2 b2aa c224 0a02 1e57 167f 3a28 63ef
864b 7da8 71ed dc72 9494 456e b474 a34e
7646 3e62 b0bc fb35 b31f 98ec 0cde 30ab

How should that work? Which editor do you recommend (for Windows)?
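
A note on the dump above: its first six bytes, 37 7a bc af 27 1c, are the signature of the 7z container format, so this file looks like a .7z archive rather than a plain tar file, and no readable tar header can be expected in it. The sketch below checks the leading bytes of a file's contents in core Tcl. The proc name sniffContainer is hypothetical, and only three formats are recognised.

```tcl
# Guess the container type from the first bytes of a file's contents.
proc sniffContainer {bytes} {
    # Need at least 6 bytes to compare signatures.
    if {[binary scan $bytes H12 head] != 1} { return unknown }
    if {$head eq "377abcaf271c"}      { return 7z }
    if {[string match "1f8b*" $head]} { return gzip }
    # ustar-style tar headers carry the magic "ustar" at offset 257.
    if {[string range $bytes 257 261] eq "ustar"} { return tar }
    return unknown
}

# The first eight bytes of the dump shown above.
puts [sniffContainer [binary format H* 377abcaf271c0004]]   ;# prints 7z
```

For a real file, the input would be the first 262 bytes read from a channel opened with "rb".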

From Alexandru@21:1/5 to Alexandru on Mon Oct 17 08:46:28 2022

OK, I added the HexViewer package to SublimeText and now I get the usual hex-editor view. Let's see where this leads...

From Alexandru@21:1/5 to Alexandru on Mon Oct 17 08:52:35 2022

Beats me, I only see gibberish. How the hell can I tell from the hex view what the damn encoding is?
I just ran "file --mime-encoding" in the Git console and the return value is binary. No surprise here.

From Robert Heller@21:1/5 to Alexandru on Mon Oct 17 16:08:06 2022

Copy the 7z created tar file to a Linux machine and see what 'tar tvf' gives for file names.

Is 7z compressing the tar file? If so, you need to uncompress it before looking for filename encoding.

--
Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
[email protected] -- Webhosting Services

From Christian Gollwitzer@21:1/5 to All on Mon Oct 17 19:18:43 2022

A tarred archive is usually first tarred, then gzipped. You need to undo
the gzip first to see the tar (the metadata is also compressed, unlike in
a ZIP file).

On Linux you would do e.g.

zcat compressedfile > uncompressedfile

Or, if the file has the usual ending (e.g. data.tar.gz) then this should
do the trick:

gunzip data.tar.gz

which then yields data.tar
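
The same layering can be seen from Tcl itself with the core zlib command. A minimal sketch under stated assumptions: the 512-byte block is a stand-in with only the "ustar" magic filled in at offset 257, not a real tar header.

```tcl
# A stand-in for a 512-byte tar header: zeros plus the "ustar" magic.
set tarLike [binary format a257a5a250 "" ustar ""]

# After gzip the data starts with the gzip signature, not with tar fields.
set gz [zlib gzip $tarLike -level 9]
binary scan $gz H4 magic
puts "compressed data starts with: $magic"

# Only after undoing the gzip layer is the header readable again.
set plain [zlib gunzip $gz]
puts "offset of ustar after gunzip: [string first ustar $plain]"
```

So any inspection of file names has to happen on the decompressed bytes.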

Christian

From Rich@21:1/5 to Alexandru on Mon Oct 17 18:22:45 2022

Alexandru <[email protected]> wrote:

> Beats me, I only see gibberish. How the hell can I tell from the
> hex view what the damn encoding is? I just ran "file
> --mime-encoding" in the Git console and the return value is binary.
> No surprise here.

First you have to find the bytes that are the filename/path within the
header, then compare what bytes are present with what the displayed
names are when viewed in windows/7z. What differs bytes wise vs. the
view will then begin to suggest an encoding.
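
For reference, in the ustar layout the name occupies the first 100 bytes of each 512-byte header block, padded with NUL bytes. A minimal sketch that pulls that field out and shows it as hex, in core Tcl; the proc name tarNameHex is hypothetical, and the header here is built by hand for illustration instead of being read from an archive.

```tcl
# Return the name field of a tar header block as a hex string.
proc tarNameHex {header} {
    if {[binary scan $header a100 field] != 1} {
        error "need at least 100 bytes of tar header"
    }
    # Drop the NUL padding, then hex-encode what is left.
    binary scan [string trimright $field \x00] H* hex
    return $hex
}

# Hand-built header whose name field holds "B<a-umlaut>r.txt" as UTF-8.
set header [binary format a100a412 [encoding convertto utf-8 "B\u00e4r.txt"] ""]
puts [tarNameHex $header]

# With a real, uncompressed tar file the block would come from:
#   set fh [open data.tar rb]; set header [read $fh 512]; close $fh
```

Here c3 a4 in the output marks the umlaut as UTF-8; a single e4 would point to Latin-1 or cp1252, and 84 to an OEM code page such as cp437 or cp850.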

From Rich@21:1/5 to Alexandru on Mon Oct 17 18:21:13 2022

Alexandru <[email protected]> wrote:

> How should that work? Which editor do you recommend (for Windows)?

No idea, as I avoid Windows like the plague it is.

Under Linux I'd use xxd or hexdump, which can both be asked to give a
hex dump plus ASCII equivalents in the same view.

If you can post the first 512 bytes of the tar file (that's the tar
header) as hex like the above, I can use the hex to convert it back to
binary, then dump with xxd or hexdump.

From Robert Heller@21:1/5 to [email protected] on Mon Oct 17 19:29:12 2022

At Mon, 17 Oct 2022 19:18:43 +0200 Christian Gollwitzer <[email protected]> wrote:

> A tarred archive is usually first tarred, then gzipped. You need to undo
> the gzip first to see the tar (the metadata is also compressed, unlike a
> ZIP file)

More precisely, a tarball is not compressed as part of being a tarball; the tarball as a whole is compressed afterwards. A ZIP file, by contrast, compresses each entry separately while the ZIP file itself is not compressed (e.g. the ZIP directory is not compressed, even though some or all of the contents are).

> On Linux you would do e.g.
>
> zcat compressedfile > uncompressedfile
>
> Or, if the file has the usual ending (e.g. data.tar.gz) then this should
> do the trick:
>
> gunzip data.tar.gz

(Modern) Linux tar itself recognizes the endings and will decompress on the fly as needed:

tar tvf data.tar.gz
(or:
tar tvf data.tar.bz2
etc.)

No need to separately decompress the tarfile if you are going to use the tar command on it. This might actually be a good first step for the OP. Seeing what tar displays for the file name might yield some enlightenment.

It should be noted that tar originated under UNIX and predates the use of non-ASCII characters in file names.

> which then yields data.tar
>
> Christian

--
Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
[email protected] -- Webhosting Services

From Christian Gollwitzer@21:1/5 to All on Tue Oct 18 07:48:38 2022

On 17.10.22 at 21:29, Robert Heller wrote:

> No need to separately decompress the tarfile, if you are going to use the tar command on it. This might actually be a good first step for the OP. Seeing what tar displays for the file name might yield some enlightenment.
>
> It should be noted that tar originated under UNIX and predates the use of non-ASCII characters in file names.

But then you see again an interpretation of the file names through tar
and the terminal. Alexandru wants to see the raw data in order to
replicate tar in Tcl code. I would guess that tar simply stores the file
name as a bytestream, since on Linux the file systems do not have an
encoding as opposed to Windows - the file names you see depend on how
you set LOCALE on Linux, whereas they are converted to UTF16 on NTFS
file systems on Windows.

7z on Windows might again have a different idea of the tar format. Not
to mention that there are multiple tar formats out there; e.g. see here:

https://www.gnu.org/software/tar/manual/html_section/Formats.html

Christian

From Alexandru@21:1/5 to Christian Gollwitzer on Wed Oct 19 01:40:46 2022

This whole encoding stuff is crazy to keep up with.
I was thinking maybe I can get a workaround if I use Tcl to unpack the archive.
Since I could not find a manual for the tar package, I had to read the source code and other forums online to get close to a solution, which is not working right now.
So here is the code to unpack a 7z archive, which contains a tar archive:

set f [open $zipfile]
zlib push gunzip $f
set result [tar::untar $f -chan]
close $f

I get this error:
couldn't open "<filename> 100644 0 0 226212 1432372742" : filename is invalid on this platform

It seems that untar cannot correctly handle multiple file names in the header. The first file name in the archive is handled correctly. The second one is somehow cut off in the middle, and the data that follows is appended to the file name instead.

The method used to create the 7z file is

set fd [open $zipfile wb]
zlib push gzip $fd -level 9
tar::create $fd $paths -chan
close $fd

where $paths is a list of full file paths.

Is this a bug? Or am I using the wrong method to unpack?
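
One difference between the two snippets may be worth ruling out: the archive is written through a channel opened with "wb", but read back through a channel opened without "b", i.e. in text mode. Whether that causes the error above is an assumption, not something established in this thread. The sketch below uses core Tcl only and does not call tar::untar; it shows a gzip round trip with both channels in binary mode, using a payload that contains CR/LF and Ctrl-Z bytes of the kind text-mode translation can alter.

```tcl
# 512 bytes containing CR LF and Ctrl-Z, padded with NULs like a tar block.
set payload [binary format a512 "name\r\nwith\x1a bytes"]

# Reserve a unique scratch file name, then reopen it explicitly.
close [file tempfile path]

# Write: binary channel with a gzip transform stacked on top.
set out [open $path wb]
zlib push gzip $out -level 9
puts -nonewline $out $payload
close $out

# Read: binary channel ("rb") with the matching gunzip transform.
set in [open $path rb]
zlib push gunzip $in
set back [read $in]
close $in
file delete $path

puts "round trip intact: [expr {$back eq $payload}]"
```

If the unpacking code opens the archive with "rb" and the error persists, the cause lies elsewhere.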
