Forum: >>> Magnum BBS <<<

untar file by file in a loop

From Alexandru@21:1/5 to All on Tue Nov 1 00:39:05 2022

I have a procedure that unpacks files given by a list of file paths from an archive like this:

proc ::meshparts::AssemblyArchiveUnpack {zipfile {paths {}} {targetpaths {}}} {
    set f [open $zipfile rb]
    fconfigure $f -encoding binary -translation lf -eofchar {}
    zlib push gunzip $f
    if {[llength $paths]==0} {
        set result [tar::untar $f -chan]
    } else {
        foreach path $paths targetpath $targetpaths {
            set dir [file dirname $targetpath]
            set code [catch {file mkdir $dir} err]
            if {$code} {
                ::meshparts::message "*** [mc {%1$s} $err]" -errorlog 0
                continue
            }
            set result [tar::untar $f -file $path -dir $dir -chan]
            seek $f 0
        }
    }
    close $f
    return 1
}

The main part is the foreach:

foreach path $paths targetpath $targetpaths {
    set dir [file dirname $targetpath]
    set code [catch {file mkdir $dir} err]
    if {$code} {
        ::meshparts::message "*** [mc {%1$s} $err]" -errorlog 0
        continue
    }
    set result [tar::untar $f -file $path -dir $dir -chan]
    seek $f 0
}

It can be further reduced to:

foreach path $paths targetpath $targetpaths {
    set dir [file dirname $targetpath]
    set result [tar::untar $f -file $path -dir $dir -chan]
    seek $f 0
}

The problem is that it only works for the first file in the list.
The second file is not unpacked, and if a third file is given I get the error:

*** START OF ERROR MESSAGE ***
can't read "name": no such variable
can't read "name": no such variable
while executing
"set $x"
(procedure "readHeader" line 5)
invoked from within
"readHeader [read $fh 512]"
(procedure "tar::untar" line 24)
invoked from within
"tar::untar $f -file $path -dir $dir -chan"

To me, it looks like the untar procedure has a bug.
I added the "seek $f 0" command while trying to make it work.
No success so far.
I think that, while the read channel stays open, the untar procedure reads until the end of the file, so the next untar command does not find the needed file.
But then the "seek $f 0" should actually solve the problem.
But it doesn't.

Here is the untar procedure, maybe some trained eyes can see the issue better than me.

proc ::tar::untar {tar args} {
    set nooverwrite 0
    set data 0
    set nomtime 0
    set noperms 0
    set chan 0
    parseOpts {dir 1 file 1 glob 1 nooverwrite 0 nomtime 0 noperms 0 chan 0} $args
    if {![info exists dir]} {set dir [pwd]}
    set pattern *
    if {[info exists file]} {
        set pattern [string map {* \\* ? \\? \\ \\\\ \[ \\\[ \] \\\]} $file]
    } elseif {[info exists glob]} {
        set pattern $glob
    }

    set ret {}
    if {$chan} {
        set fh $tar
    } else {
        set fh [::open $tar]
        fconfigure $fh -encoding binary -translation lf -eofchar {}
    }
    while {![eof $fh]} {
        array set header [readHeader [read $fh 512]]
        HandleLongLink $fh header
        if {$header(name) == ""} break
        if {$header(prefix) != ""} {append header(prefix) /}
        set name [string trimleft $header(prefix)$header(name) /]
        if {![string match $pattern $name] || ($nooverwrite && [file exists $name])} {
            seekorskip $fh [expr {$header(size) + [pad $header(size)]}] current
            continue
        }

        if {$dir!=""} {
            if {[::tar::isabsolute $name]} {
                set name [file join $dir [file tail $name]]
            } else {
                set name [file join $dir $name]
            }
        }
        if {![file isdirectory [file dirname $name]]} {
            file mkdir [file dirname $name]
            lappend ret [file dirname $name] {}
        }
        if {[string match {[0346]} $header(type)]} {
            if {[catch {::open $name w+} new]} {
                # sometimes if we dont have write permission we can still delete
                catch {file delete -force $name}
                set new [::open $name w+]
            }
            fconfigure $new -encoding binary -translation lf -eofchar {}
            fcopy $fh $new -size $header(size)
            close $new
            lappend ret $name $header(size)
        } elseif {$header(type) == 5} {
            file mkdir $name
            lappend ret $name {}
        } elseif {[string match {[12]} $header(type)] && $::tcl_platform(platform) == "unix"} {
            catch {file delete $name}
            if {![catch {file link [string map {1 -hard 2 -symbolic} $header(type)] $name $header(linkname)}]} {
                lappend ret $name {}
            }
        }
        seekorskip $fh [pad $header(size)] current
        if {![file exists $name]} continue

        if {$::tcl_platform(platform) == "unix"} {
            if {!$noperms} {
                catch {file attributes $name -permissions 0[string range $header(mode) 2 end]}
            }
            catch {file attributes $name -owner $header(uid) -group $header(gid)}
            catch {file attributes $name -owner $header(uname) -group $header(gname)}
        }
        if {!$nomtime} {
            file mtime $name $header(mtime)
        }
    }
    if {!$chan} {
        close $fh
    }
    return $ret
}

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Rich@21:1/5 to Alexandru on Tue Nov 1 14:57:22 2022

Alexandru wrote:

> I have a procedure that unpacks files given by a list of file paths from an archive like this:
>
> proc ::meshparts::AssemblyArchiveUnpack {zipfile {paths {}} {targetpaths {}}} {

This name will cause confusion for you in the future. A zip file is not a tar
file, and a tar file is not a zip file (zip and tar are two very
different formats). Naming the variable 'zipfile'
implies a "zip", not a "tar", at first glance.

>     set f [open $zipfile rb]
>     fconfigure $f -encoding binary -translation lf -eofchar {}
>     zlib push gunzip $f
>     if {[llength $paths]==0} {
>         set result [tar::untar $f -chan]
>     } else {
>         foreach path $paths targetpath $targetpaths {
>             set dir [file dirname $targetpath]
>             set code [catch {file mkdir $dir} err]
>             if {$code} {
>                 ::meshparts::message "*** [mc {%1$s} $err]" -errorlog 0
>                 continue
>             }
>             set result [tar::untar $f -file $path -dir $dir -chan]
>             seek $f 0
>         }
>     }
>     close $f
>     return 1
> }

If your tar file is indeed gzipped, implied by this:

zlib push gunzip $f

then simply doing this:

seek $f 0

will not work, because just seeking to the beginning does not reset the
gunzip state to the same as it was at initial file opening. Which is
most likely why things are failing for you.

Try closing and reopening the file inside the loop. If that works,
then this was the cause.
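
A minimal sketch of that test, assuming Tcl 8.6 and the Tcllib tar package. The names openGunzipped and unpackEach are made up for illustration; only tar::untar and its -file, -dir and -chan options come from the code quoted in this thread. Each iteration gets a fresh channel, and with it a fresh gunzip state:

```tcl
package require Tcl 8.6

# Hypothetical helper: open a gzipped file with a brand-new gunzip transform.
proc openGunzipped {path} {
    set f [open $path rb]
    zlib push gunzip $f
    return $f
}

# Hypothetical replacement for the foreach loop: reopen the archive for every
# requested member instead of calling "seek $f 0" on the gunzip channel.
proc unpackEach {archive paths targetpaths} {
    package require tar   ;# Tcllib, loaded only when this proc is used
    foreach path $paths targetpath $targetpaths {
        set dir [file dirname $targetpath]
        file mkdir $dir
        set f [openGunzipped $archive]
        try {
            tar::untar $f -file $path -dir $dir -chan
        } finally {
            close $f      ;# always release the channel, even on error
        }
    }
}
```

This still rescans the archive from the start for every file, so it is only meant to confirm the diagnosis, not to be fast.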

> To me, it looks like the untar procedure has a bug.

Looks to me like you are creating the problem by trying to seek around
inside gzipped data. You also have to be able to reset the gunzip
uncompress state to the identical state it was in for the file offset to
make that work.

If you can't formulate a glob pattern for the set of files you want to
extract, then you'll have to do one of four things:

1) unpack the entire tar file into a temporary location, then move out
the files of interest and delete the unwanted files

2) close and reopen the file inside the loop around tar::untar. But you
are still left with scanning all of the preceding tar data up to the
file of interest, which means you are quite close to O(N^2)
complexity here

3) Create your own 'untar' by making calls into the tar module
internals to read file headers, decide if the header is for a file of
interest, and extract the file if so. This, however, does mean you are
calling procs that are not documented as part of the visible API of the
tar module, so should the internals change, your code would break until
you adapted. This method, however, does give you the most efficient
extract, because only a single pass over the tar file is needed.

4) Extend the tar module's untar proc to take an additional parameter
that is a list of filenames to match tar entries against and extract
each when found, and consider contributing the changes back to Tcllib.
This has the identical benefits of #3, with the added benefit that if
accepted, your change becomes part of the documented API so less likely
to change "out from under you" in the future.
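
To illustrate the single-pass idea behind options 3 and 4, here is a rough sketch that does not call the tar module at all. It is not Tcllib code: extractWanted is a made-up name, and it assumes plain regular files with names under 100 characters (no prefix field, no GNU long names, no checksum verification). It reads each 512-byte header once, keeps the wanted members, and stops as soon as all of them have been found:

```tcl
package require Tcl 8.6

# Hypothetical helper, not part of Tcllib: walk the tar stream once and
# return a dict mapping each wanted member name to its contents.
proc extractWanted {chan wanted} {
    set found [dict create]
    while {[dict size $found] < [llength $wanted]} {
        set header [read $chan 512]
        if {[string length $header] < 512} break
        # name: bytes 0-99, size: bytes 124-135 (octal text), NUL padded
        set name [string trimright [string range $header 0 99] \0]
        if {$name eq ""} break                  ;# end-of-archive marker
        set size 0
        scan [string trim [string range $header 124 135] " \0"] %o size
        # Member data is padded to a multiple of 512 bytes. It is read rather
        # than skipped with seek, so this also works on a gunzip channel.
        set padded [expr {($size + 511) / 512 * 512}]
        set data [read $chan $padded]
        if {$name in $wanted} {
            dict set found $name [string range $data 0 [expr {$size - 1}]]
        }
    }
    return $found
}
```

With the channel from the original procedure this would be called as "extractWanted $f $paths" right after "zlib push gunzip $f", and the returned contents written to the target paths.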


From Schelte@21:1/5 to Alexandru on Tue Nov 1 16:59:09 2022

On 01/11/2022 16:35, Alexandru wrote:

> Option 2 is of course a "no go".

Instead of closing/reopening, you can also pop the gunzip channel transformation, seek to the beginning, and then push the transformation
again. But I doubt that will make a big difference in performance.
Parsing the file multiple times is what makes it slow. Closing/opening
the file is probably negligible in comparison.
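
A small sketch of the pop/seek/push sequence, assuming Tcl 8.6, where "chan pop" removes the topmost transformation. The proc name rewindGunzip is made up for illustration:

```tcl
package require Tcl 8.6

# Hypothetical helper: restart decompression on an already open channel.
proc rewindGunzip {f} {
    chan pop $f          ;# drop the gunzip transform together with its state
    seek $f 0            ;# the seek now acts on the raw compressed file
    zlib push gunzip $f  ;# fresh decompressor, starting again at byte 0
}
```

In the original loop this would take the place of the bare "seek $f 0". It saves the close/open, but each tar::untar call still rereads the archive from the start.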

Schelte.


From Alexandru@21:1/5 to Rich on Tue Nov 1 08:35:56 2022


Thanks Rich,

I must admit, I still don't understand how "read" can work on the channel but "seek" cannot.
I'll just follow your advice and see if I can add a -files option to the "untar" procedure and propose a change on GitHub (your option 4).

Option 2 is of course a "no go". I can already see that the time needed to open the archive and find one file is huge. Doing this for multiple files would be a deal breaker.

BTW: I know tar and zip are different formats. I have this habit of calling all types of archives a zip file.

Regards
Alexandru


From Rich@21:1/5 to Alexandru on Tue Nov 1 16:47:02 2022

Alexandru wrote:

> Thanks Rich,
>
> I must admit, I still don't understand how "read" can work on the
> channel but "seek" cannot.

The seek works. You move the file pointer back and start reading from
a different offset.

But, your file is a gzip file. The gzip compressed format needs to be
read from the front, because to unpack byte X, you need the gzip
compression state that was created by unpacking bytes 0 through X-1.

If you are at offset Y, you have the gzip compression state created
from 0 through Y-1. If you now seek to X, you'll get the wrong result
from trying to decompress X using the gzip state of 0 through Y-1.
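
The dependence on starting at the front can be seen without any file at all. This is only an illustration using the one-shot zlib commands from Tcl 8.6, not the channel transform from the original code, and the sample data is made up:

```tcl
package require Tcl 8.6

# Made-up sample data standing in for the contents of a tar file.
set plain [string repeat "tar member data\n" 100]
set gz [zlib gzip $plain]

# Decompressing from the first byte works.
set fromStart [zlib gunzip $gz]

# Starting past the gzip header fails: the decompressor has neither the
# header nor the state built up from the earlier bytes.
set failed [catch {zlib gunzip [string range $gz 10 end]} msg]

puts "from start ok: [expr {$fromStart eq $plain}]"
puts "mid-stream attempt failed: $failed ($msg)"
```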

> I'll just follow your advice and see if I can add a -files option to
> the "untar" procedure and propose a change on GitHub (your option 4).
>
> Option 2 is of course a "no go". I can already see that the time
> needed to open the archive and find one file is huge. Doing this
> for multiple files would be a deal breaker.

Tar is not zip. The expanded acronym gives a clue: (T)ape (Ar)chive.
It was created (originally) to package files onto magnetic tape. As
tape does not have "random seek ability", tar contains no features to
allow random access within the tar file. You have to either read it
from the start in a linear manner, or pre-index once up front (by
reading it from front to back in a linear manner) and then use your
index to randomly grab files out.

Zip files include index data as part of the format, so one can directly
access a single file in a zip without having to read the whole file
from the front in order to do so.
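
As a rough sketch of the "pre-index once up front" approach: the procs below are made-up names, not Tcllib API. They assume an UNCOMPRESSED tar file opened in binary mode (so seek is safe), plain regular files, and names under 100 characters. One linear pass records where each member's data starts; after that, members can be fetched in any order:

```tcl
package require Tcl 8.6

# Hypothetical helper: build a dict of name -> {dataOffset size}.
proc indexTar {chan} {
    set index [dict create]
    seek $chan 0
    while {1} {
        set header [read $chan 512]
        if {[string length $header] < 512} break
        set name [string trimright [string range $header 0 99] \0]
        if {$name eq ""} break                  ;# end-of-archive marker
        set size 0
        scan [string trim [string range $header 124 135] " \0"] %o size
        dict set index $name [list [tell $chan] $size]
        # Skip the member data, which is padded to a multiple of 512 bytes.
        seek $chan [expr {($size + 511) / 512 * 512}] current
    }
    return $index
}

# Hypothetical helper: fetch one member through the index, with no rescan.
proc fetchMember {chan index name} {
    lassign [dict get $index $name] offset size
    seek $chan $offset
    return [read $chan $size]
}
```

For a gzipped tar this only helps after decompressing to a temporary file first, since the offsets refer to the uncompressed data.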

> BTW: I know tar and zip are different formats. I have this habit of
> calling all types of archives a zip file.

Which is fine, but it confuses others who call a tar file a tar file
and a zip file a zip file because they are two very different file
formats.


From Robert Heller@21:1/5 to Alexandru on Tue Nov 1 16:44:05 2022

At Tue, 1 Nov 2022 08:35:56 -0700 (PDT) Alexandru wrote:

> I must admit, I still don't understand how "read" can work on the channel but "seek" cannot.
> I'll just follow your advice and see if I can add a -files option to the "untar" procedure and propose a change on GitHub (your option 4).

When you "read" a compressed tar file, you are not actually reading the tar file itself, but the output of a pipeline from gunzip (or something like gunzip). You can't seek on a pipeline -- I don't know if this is an actual pipe device or a 'faked' pipe using VFS hackery and it does not matter which, the effect is the same.

> Option 2 is of course a "no go". I can already see that the time needed to open the archive and find one file is huge. Doing this for multiple files would be a deal breaker.
>
> BTW: I know tar and zip are different formats. I have this habit of calling all types of archives a zip file.

Confusing tar and zip is probably what is getting you into lots of trouble, especially when the file in question is a gzipped tar file.

Some important things to understand about tar and zip files:

Tar was originally designed for *tapes* (yes, those reels of plastic film coated with iron oxide). Hardly anybody uses tapes anymore. Tar files don't have compressed elements; the whole tar file gets compressed as a single blob. Tar files are meant to be read and written sequentially and not randomly accessed.

*Zip* files contain an *uncompressed* table of contents, and each member element is separately compressed (or not). Zip files were specifically designed to be randomly accessed -- one can seek to the end, read the TOC, and then seek to specific files in the Zip archive and extract (and uncompress) them, in any order you like.


--
Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
[email protected] -- Webhosting Services


From Rich@21:1/5 to Schelte on Tue Nov 1 16:48:32 2022

Schelte wrote:

> On 01/11/2022 16:35, Alexandru wrote:
>
>> Option 2 is of course a "no go".
>
> Instead of closing/reopening, you can also pop the gunzip channel transformation, seek to the beginning, and then push the transformation again.

Ah, that would reset the gzip state as well. I forgot about that
option.

> But I doubt that will make a big difference in performance. Parsing
> the file multiple times is what makes it slow. Closing/opening the
> file is probably negligible in comparison.

Agreed.


From Alexandru@21:1/5 to Rich on Tue Nov 1 11:44:13 2022

Thanks all for the help.
I added the -files and -dirs options to the untar procedure and committed the changes:
https://github.com/Meshparts/tcllib/blob/master/modules/tar/tar.tcl

