I have a procedure that unpacks files given by a list of file paths from an archive like this:
proc ::meshparts::AssemblyArchiveUnpack {zipfile {paths {}} {targetpaths {}}} {
    set f [open $zipfile rb]
    fconfigure $f -encoding binary -translation lf -eofchar {}
    zlib push gunzip $f
    if {[llength $paths]==0} {
        set result [tar::untar $f -chan]
    } else {
        foreach path $paths targetpath $targetpaths {
            set dir [file dirname $targetpath]
            set code [catch {file mkdir $dir} err]
            if {$code} {
                ::meshparts::message "*** [mc {%1$s} $err]" -errorlog 0
                continue
            }
            set result [tar::untar $f -file $path -dir $dir -chan]
            seek $f 0
        }
    }
    close $f
    return 1
}
For me, it looks like the untar procedure has a bug.
Alexandru wrote:
> I have a procedure that unpacks files given by a list of file paths from an archive like this:
> proc ::meshparts::AssemblyArchiveUnpack {zipfile {paths {}} {targetpaths {}}} {

You are creating confusion above for yourself in the future. A zip file is not a tar
file, and a tar file is not a zip file (zip and tar are two very
different formats). Naming the variable 'zipfile'
implies a "zip", not a "tar", at first glance.

>     set f [open $zipfile rb]
>     fconfigure $f -encoding binary -translation lf -eofchar {}
>     zlib push gunzip $f
>     if {[llength $paths]==0} {
>         set result [tar::untar $f -chan]
>     } else {
>         foreach path $paths targetpath $targetpaths {
>             set dir [file dirname $targetpath]
>             set code [catch {file mkdir $dir} err]
>             if {$code} {
>                 ::meshparts::message "*** [mc {%1$s} $err]" -errorlog 0
>                 continue
>             }
>             set result [tar::untar $f -file $path -dir $dir -chan]
>             seek $f 0
>         }
>     }
>     close $f
>     return 1
> }

If your tar file is indeed gzipped, implied by this:

    zlib push gunzip $f

then simply doing this:

    seek $f 0

will not work, because just seeking to the beginning does not reset the
gunzip state to what it was when the file was first opened. That is
most likely why things are failing for you.

Try closing and reopening the file inside the loop. If that works,
then this was the cause.
> For me, It looks like the untar procedure has a bug.

Looks to me like you are creating the problem by trying to seek around
inside gzipped data. To make that work, you would also have to reset the
gunzip decompression state to exactly the state it had at that file
offset.
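
The close-and-reopen test suggested above could look roughly like the following. This is only a sketch: the proc name unpackByReopening is hypothetical, and it assumes Tcl 8.6 and that Tcllib's tar::untar accepts the gunzip-stacked channel with -chan, -file and -dir in the same way the original post uses them.

```tcl
package require Tcl 8.6
package require tar   ;# Tcllib tar module, as used in the original post

# Hypothetical diagnostic variant of the extraction loop: the archive is
# reopened for every requested path, so each tar::untar call starts with
# a fresh gunzip state instead of relying on "seek $f 0".
proc unpackByReopening {archive paths targetpaths} {
    foreach path $paths targetpath $targetpaths {
        set dir [file dirname $targetpath]
        file mkdir $dir
        set f [open $archive rb]
        zlib push gunzip $f
        try {
            # Same call as in the original loop, on a freshly opened channel.
            tar::untar $f -file $path -dir $dir -chan
        } finally {
            close $f
        }
    }
}
```

If this extracts every file correctly, the stale gunzip state was the cause. It still rescans the archive from the start for each path, so it is meant to confirm the diagnosis rather than to be fast.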
If you can't formulate a glob pattern for the set of files you want to
extract, then you'll have to do one of four things:
1) Unpack the entire tar file into a temporary location, then move out
the files of interest and delete the unwanted files.
2) Close and reopen the file inside the loop around tar::untar. But you
are still left with scanning all of the preceding tar data up to the
file of interest, which means you are quite close to O(N^2)
complexity here.
3) Create your own 'untar' by making calls into the tar module
internals to read file headers, decide if the header is for a file of
interest, and extract the file if so. This, however, does mean you are
calling procs that are not documented as part of the visible API of the
tar module, so should the internals change, your code would break until
you adapted. This method does give you the most efficient
extract, because only a single pass over the tar file is needed.
4) Extend the tar module's untar proc to take an additional parameter
that is a list of filenames to match tar entries against and extract
each when found, and consider contributing the changes back to Tcllib.
This has the identical benefits of #3, with the added benefit that, if
accepted, your change becomes part of the documented API and so is less
likely to change "out from under you" in the future.
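
For illustration, option 1 could be sketched like this. The proc name unpackViaTempDir and the scratch directory naming are hypothetical, and the sketch assumes Tcl 8.6 plus a Tcllib tar::untar that accepts the gunzip-stacked channel via -chan, as the original post does. Unlike the original proc, it moves each file to exactly the given target path.

```tcl
package require Tcl 8.6
package require tar   ;# Tcllib tar module, as used in the original post

# Hypothetical helper for option 1: unpack the whole archive once into a
# scratch directory, move the files of interest, then delete the rest.
proc unpackViaTempDir {archive paths targetpaths} {
    # Scratch directory next to the archive (naming scheme is an assumption).
    set tmp [file join [file dirname $archive] untar-tmp-[pid]]
    file mkdir $tmp
    try {
        set f [open $archive rb]
        zlib push gunzip $f
        try {
            tar::untar $f -dir $tmp -chan   ;# single pass over the archive
        } finally {
            close $f
        }
        foreach path $paths targetpath $targetpaths {
            file mkdir [file dirname $targetpath]
            file rename -force [file join $tmp $path] $targetpath
        }
    } finally {
        # Removes the scratch directory together with all unwanted files.
        file delete -force $tmp
    }
}
```

The archive is read only once, at the price of temporarily writing every entry to disk, which may or may not be acceptable for large archives.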
Rich wrote on Tuesday, 1 November 2022 at 15:57:26 UTC+1:
Thanks Rich,
I must admit, I still don't understand how "read" can work on the channel but "seek" cannot.
I'll just follow your advice and see if I can add a -files option to the "untar" procedure and propose a change on GitHub (your option 4).
Option 2 is of course a "no go". I can already see that the time needed to open the archive and find one file is huge. Doing this for multiple files would be a deal breaker.
BTW: I know tar and zip are different formats. I have this habit of calling all types of archives a zip file.
Regards
Alexandru
On 01/11/2022 16:35, Alexandru wrote:
> Option 2 is of course a "no go".

Instead of closing/reopening, you can also pop the gunzip channel transformation, seek to the beginning, and then push the transformation again.
But I doubt that will make a big difference in performance. Parsing
the file multiple times is what makes it slow. Closing/opening the
file is probably negligible in comparison.
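
A minimal sketch of that pop, seek, push sequence, using only core Tcl 8.6 commands (chan pop, seek, zlib push). The helper name rewindGunzip and the demo data are hypothetical; a small gzip file stands in for the real archive.

```tcl
package require Tcl 8.6

# Hypothetical helper: rewind a gzip-compressed file channel by rebuilding
# the gunzip transformation instead of closing and reopening the file.
proc rewindGunzip {chan} {
    chan pop $chan          ;# drop the gunzip transformation and its state
    seek $chan 0            ;# the seek now acts on the plain file channel
    zlib push gunzip $chan  ;# start decompressing from the beginning again
}

# Create a small gzip file to demonstrate with.
set out [file tempfile path]
fconfigure $out -translation binary
puts -nonewline $out [zlib gzip "hello tar world"]
close $out

set f [open $path rb]
zlib push gunzip $f
set first [read $f]     ;# first pass over the decompressed data
rewindGunzip $f
set second [read $f]    ;# second pass after the rewind
close $f
file delete $path
```

In the extraction loop, a call like rewindGunzip $f would take the place of the bare "seek $f 0". Each tar::untar call still parses the archive from the start, so the performance concern stays the same.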
Schelte wrote:
> On 01/11/2022 16:35, Alexandru wrote:
>> Option 2 is of course a "no go".
>
> Instead of closing/reopening, you can also pop the gunzip channel transformation, seek to the beginning, and then push the transformation again.

Ah, that would reset the gzip state as well. I forgot about that
option.

> But I doubt that will make a big difference in performance. Parsing
> the file multiple times is what makes it slow. Closing/opening the
> file is probably negligible in comparison.

Agreed.