Hi,
It seems, that there is no encoding option for the command
tar:encoding.
I use this command to create an archive of multiple files:
set fd [open $zipfile wb]
zlib push gzip $fd -level 9
tar::create $fd $paths -chan
close $fd
Now I realize that all files with umlauts in the path/name are wrongly
encoded when unpacking the archive with the Windows program 7z.
What could be the solution to this issue?
Alexandru <[email protected]> wrote:
> Hi,
> It seems, that there is no encoding option for the command
> tar:encoding.

There is also no command tar::encoding.

Tar (the archive format) is so old that it does not have an 'encoding'.
It just stores bytes, and upper level code has to decide what to do
with the bytes.

> I use this command to create an archive of multiple files:
> set fd [open $zipfile wb]
> zlib push gzip $fd -level 9
> tar::create $fd $paths -chan
> close $fd
> Now I realize that all files with umlauts in the path/name are wrongly
> encoded when unpacking the archive with the Windows program 7z.

The issue here could be Tcllib tar, or it could be 7z. Right now you
don't know, and Tar (the format) has no way to communicate a flag that
says "filenames herein are UTF8 (or any other encoding)".

> What could be the solution to this issue?

Several:
1) (easiest, but may not be practical) -- don't use umlauts (or other non-ASCII characters) in filenames.
2) If you look through the source of Tcllib's tar, you will find that
it inserts the filenames into the tar header block using binary format
"a" (which simply inserts the codepoint value modulo 256, and that will
only be correct for an 8-bit fixed length encoding). Which likely
means the breakage happens during tar::create.
If you look further up the call chain, you find that directories are
resolved to lists of filenames via glob, and the proc which writes each
tar component is fed a filename to work with.
So, you could use Tcllib's find to pre-acquire the filenames you want to
pack into the Tar file, pre-encode them into the appropriate encoding
using 'encoding convertto', and output the tar file by calling
'formatHeader' with the 'encoded' name, and fcopying the file contents yourself.
3) You could patch tcllib's tar to encode filenames to an encoding
(including allowing specification of that encoding type via an option
to tar::create). And then contribute the patches back to Tcllib so
everyone benefits.
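
The byte-level effect described in option 2 can be seen in a small Tcl sketch. It assumes Tcl 8.6 and a name containing only characters up to U+00FF. `hexOf` is a hypothetical helper for illustration, not part of Tcllib.

```tcl
# Hypothetical helper: return the hex form of a byte string.
proc hexOf {bytes} {
    binary scan $bytes H* hex
    return $hex
}

set name "\u00e4.txt"   ;# "ä.txt"

# binary format "a" stores the low 8 bits of each character.
puts [hexOf [binary format a* $name]]            ;# e42e747874

# An explicit conversion makes the stored bytes a deliberate choice.
puts [hexOf [encoding convertto utf-8 $name]]    ;# c3a42e747874
puts [hexOf [encoding convertto cp1252 $name]]   ;# e42e747874
```

For a name like this one, the plain "a" format and cp1252 produce the same bytes, so whether the name displays correctly depends on which encoding the unpacking tool assumes.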
Rich wrote on Sunday, 16 October 2022 at 19:08:32 UTC+2:
> Alexandru <[email protected]> wrote:
> > Now I realize that all files with umlauts in the path/name are wrongly
> > encoded when unpacking the archive with the Windows program 7z.

What I don't quite understand is why pre-encoding the paths does not work:
tar::create $fd [encoding convertto $enc $paths] -chan
I tried enc=utf-8 and the Windows-compatible encoding enc=cp1252, but
neither worked. Shouldn't this be enough?

Alexandru <[email protected]> wrote:
> What I don't quite understand is why pre-encoding the paths does not work:
> tar::create $fd [encoding convertto $enc $paths] -chan
> I tried enc=utf-8 and the Windows-compatible encoding enc=cp1252, but
> neither worked. Shouldn't this be enough?

If you look through the source, the tar module uses the paths you
supply to also open each file and copy its contents to the output tar
file. If you pre-encode the strings, then those opens likely will not
find the correct file (because the name used to open will have been
changed by the encoding process).

The patch to tar, presuming it would work, would be to perform encoding
convertto on the path/name inside the writeheader proc that outputs the
paths/names into the tar header. That way the open gets back the
string it needs to open the correct file, but non-ascii characters get
encoded just before being output into the header.
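
Why a pre-encoded name no longer matches the file on disk can be illustrated with a short Tcl sketch. It assumes Tcl 8.6, and the file name is a made-up example.

```tcl
set name "Gr\u00fc\u00dfe.txt"                 ;# "Grüße.txt", 9 characters
set encoded [encoding convertto utf-8 $name]

# The converted value is a sequence of bytes, one "character" per byte,
# so it is no longer the same string as the original name.
puts [string length $name]           ;# 9
puts [string length $encoded]        ;# 11
puts [string equal $name $encoded]   ;# 0
```

Any command that later receives the converted value as a path would be asked for a differently named file, which is consistent with the converted list not working as input to tar::create.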
Rich wrote on Monday, 17 October 2022 at 14:00:42 UTC+2:
> The patch to tar, presuming it would work, would be to perform encoding
> convertto on the path/name inside the writeheader proc that outputs the
> paths/names into the tar header. That way the open gets back the
> string it needs to open the correct file, but non-ascii characters get
> encoded just before being output into the header.

Thanks. I changed the following statement in the source by adding "encoding convertto cp1252":
set header [binary format a100A8A8A8A12A12A8a1a100A6a2a32a32a8a8a155a12 \
[encoding convertto cp1252 $name] $A(mode)\x00 $ouid\x00 $ogid\x00\
$osize\x00 $omtime\x00 {} $type \
$A(linkname) ustar\x00 00 $A(uname) $A(gname)\
$A(devmajor) $A(devminor) $prefix {}]
Also tried with utf-8. The result is a valid archive, but when I open it
with Windows 7z, the names in the archive show different special
characters, not the umlauts I actually have in the original file
names.
Alexandru <[email protected]> wrote:
> Also tried with utf-8. The result is a valid archive, but when I open it
> with Windows 7z, the names in the archive show different special
> characters, not the umlauts I actually have in the original file
> names.

Try one more test (if you can).

Create a tar file using 7z (if possible) and see if:
1) the umlaut is encoded correctly (if the answer is no here, then
this may not be possible with 7z)
2) if the answer is yes to #1, then open up the tar file in a hex
display/hex editor and try to work out what encoding 7z likely used
for the filename (and umlauts).
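
One way to approach step 2 without a hex editor is to read the name field straight from the first header block. This is a minimal sketch, assuming an uncompressed ustar-style archive whose first 512-byte block is a file header with the name in its first 100 bytes. `tarNameHex` is a hypothetical helper and `archive.tar` is a placeholder file name.

```tcl
# Hypothetical helper: hex dump of the name field of one 512-byte tar header.
proc tarNameHex {block} {
    # The name occupies the first 100 bytes, padded with NUL bytes.
    binary scan $block a100 field
    binary scan [string trimright $field \x00] H* hex
    return $hex
}

# Usage with a real file (placeholder name):
#   set fd [open archive.tar rb]
#   set block [read $fd 512]
#   close $fd
#   puts [tarNameHex $block]

# Bytes to compare against for the character "ä" (U+00E4):
foreach enc {utf-8 cp1252 cp850} {
    binary scan [encoding convertto $enc "\u00e4"] H* hex
    puts "${enc}: $hex"    ;# utf-8: c3a4, cp1252: e4, cp850: 84
}
```

Matching the bytes found in the archive against such a list narrows down which encoding the tool that wrote the archive most likely used.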
Rich wrote on Monday, 17 October 2022 at 15:54:24 UTC+2:
> Create a tar file using 7z (if possible) and see if:
> 1) the umlaut is encoded correctly (if the answer is no here, then
> this may not be possible with 7z)
> 2) if the answer is yes to #1, then open up the tar file in a hex
> display/hex editor and try to work out what encoding 7z likely used
> for the filename (and umlauts).

Creating with Windows 7z is no problem.
I opened the archive with SublimeText as a hexadecimal file and I only see binary stuff:
377a bcaf 271c 0004 cf4b c197 6006 1b00
0000 0000 2400 0000 0000 0000 5fca de53
e3ac bcc0 235d 0006 82cf 6346 7fed db19
67c2 b2aa c224 0a02 1e57 167f 3a28 63ef
864b 7da8 71ed dc72 9494 456e b474 a34e
7646 3e62 b0bc fb35 b31f 98ec 0cde 30ab
How should that work? Which editor do you recommend (for Windows)?
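
The dump above begins with the bytes 37 7a bc af 27 1c, which is the signature of the 7z container format, so this particular file appears to be a .7z archive rather than a tar file. A rough Tcl sketch for checking the leading bytes of a file follows. `sniffArchive` is a hypothetical helper, and it only knows three well-known signatures.

```tcl
# Hypothetical helper: guess the container type from the first bytes of a file.
proc sniffArchive {bytes} {
    binary scan [string range $bytes 0 5] H* head
    if {$head eq "377abcaf271c"} { return 7z }
    if {[string range $head 0 3] eq "1f8b"} { return gzip }
    # ustar-style tar headers carry the magic "ustar" at offset 257.
    if {[string range $bytes 257 261] eq "ustar"} { return tar }
    return unknown
}

# Usage (placeholder file name):
#   set fd [open archive.bin rb]
#   set bytes [read $fd 512]
#   close $fd
#   puts [sniffArchive $bytes]
```

Only a file reported as tar (or as gzip wrapping a tar) would contain the plain 100-byte name fields that the hex inspection is meant to reveal.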
Alexandru wrote on Monday, 17 October 2022 at 15:59:29 UTC+2:
> How should that work? Which editor do you recommend (for Windows)?

OK, I added the HexViewer package to SublimeText and now I can see the typical hex view known from hex editors. Let's see where this leads...
Alexandru wrote on Monday, 17 October 2022 at 17:46:31 UTC+2:
> OK, I added the HexViewer package to SublimeText and now I can see the
> typical hex view known from hex editors. Let's see where this leads...

Beats me, I only see gibberish. How the hell can I tell from the hex view what the damn encoding is?
I just ran "file --mime-encoding" in the Git console and the return value is binary. No surprise there.
On 17.10.22 at 17:52, Alexandru wrote:
> Beats me, I only see gibberish. How the hell can I tell from the hex view what the damn encoding is?
> I just ran "file --mime-encoding" in the Git console and the return value is binary. No surprise there.

A tar archive is usually first tarred, then gzipped. You need to undo
the gzip first to see the tar (the metadata is also compressed, unlike in a
ZIP file).
On Linux you would do e.g.
zcat compressedfile > uncompressedfile
Or, if the file has the usual ending (e.g. data.tar.gz), then this should
do the trick:
gunzip data.tar.gz
which then yields data.tar
Christian
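
Since the archive in this thread was written from Tcl with `zlib push gzip`, the decompression step can also be done from Tcl. This is a minimal sketch, assuming Tcl 8.6 with its built-in zlib command and a gzip-compressed tar file as input. `firstTarBlock` is a hypothetical helper name.

```tcl
# Hypothetical helper: return the first 512-byte tar block of a .tar.gz file.
proc firstTarBlock {path} {
    set fd [open $path rb]
    zlib push gunzip $fd        ;# decompress transparently while reading
    set block [read $fd 512]
    close $fd
    return $block
}
```

The returned block is the uncompressed first header, so its first 100 bytes hold the stored file name exactly as it was written.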
At Mon, 17 Oct 2022 19:18:43 +0200 Christian Gollwitzer <[email protected]> wrote:
> On Linux you would do e.g.
> zcat compressedfile > uncompressedfile
> Or, if the file has the usual ending (e.g. data.tar.gz), then this should
> do the trick:
> gunzip data.tar.gz

(Modern) Linux tar itself recognizes the endings and will decompress on the fly as needed:
tar tvf data.tar.gz
(or:
tar tvf data.tar.bz2
etc.)
No need to separately decompress the tarfile if you are going to use the tar command on it. This might actually be a good first step for the OP. Seeing what tar displays for the file name might yield some enlightenment.
It should be noted that tar originated under UNIX and predates the use of non-ASCII characters in file names.
On 17.10.22 at 21:29, Robert Heller wrote:
> (Modern) Linux tar itself recognizes the endings and will decompress
> on the fly as needed:
> tar tvf data.tar.gz
> No need to separately decompress the tarfile if you are going to use the tar
> command on it. This might actually be a good first step for the OP. Seeing
> what tar displays for the file name might yield some enlightenment.
> It should be noted that tar originated under UNIX and predates the use of
> non-ASCII characters in file names.

But then you see again an interpretation of the file names through tar
and the terminal. Alexandru wants to see the raw data in order to
replicate tar in Tcl code. I would guess that tar simply stores the file
name as a bytestream, since on Linux the file systems do not have an
encoding as opposed to Windows - the file names you see depend on how
you set LOCALE on Linux, whereas they are converted to UTF16 on NTFS
file systems on Windows.
7z on Windows might again have a different idea of the tar format. Not
to mention that there are multiple tar formats out there; e.g. see here:
https://www.gnu.org/software/tar/manual/html_section/Formats.html
Christian