• untar file by file in a loop

    From Alexandru@21:1/5 to All on Tue Nov 1 00:39:05 2022
    I have a procedure that unpacks files given by a list of file paths from an archive like this:

    proc ::meshparts::AssemblyArchiveUnpack {zipfile {paths {}} {targetpaths {}}} {
        set f [open $zipfile rb]
        fconfigure $f -encoding binary -translation lf -eofchar {}
        zlib push gunzip $f
        if {[llength $paths]==0} {
            set result [tar::untar $f -chan]
        } else {
            foreach path $paths targetpath $targetpaths {
                set dir [file dirname $targetpath]
                set code [catch {file mkdir $dir} err]
                if {$code} {
                    ::meshparts::message "*** [mc {%1$s} $err]" -errorlog 0
                    continue
                }
                set result [tar::untar $f -file $path -dir $dir -chan]
                seek $f 0
            }
        }
        close $f
        return 1
    }

    The main part is the foreach:

    foreach path $paths targetpath $targetpaths {
        set dir [file dirname $targetpath]
        set code [catch {file mkdir $dir} err]
        if {$code} {
            ::meshparts::message "*** [mc {%1$s} $err]" -errorlog 0
            continue
        }
        set result [tar::untar $f -file $path -dir $dir -chan]
        seek $f 0
    }

    It can be further reduced to:

    foreach path $paths targetpath $targetpaths {
        set dir [file dirname $targetpath]
        set result [tar::untar $f -file $path -dir $dir -chan]
        seek $f 0
    }

    The problem is that it only works for the first file in the list.
    The second file is not unpacked, and if a third file is given I get the error:

    *** START OF ERROR MESSAGE ***
    can't read "name": no such variable
    can't read "name": no such variable
    while executing
    "set $x"
    (procedure "readHeader" line 5)
    invoked from within
    "readHeader [read $fh 512]"
    (procedure "tar::untar" line 24)
    invoked from within
    "tar::untar $f -file $path -dir $dir -chan"

    To me, it looks like the untar procedure has a bug.
    I added the "seek $f 0" command while trying to make it work, without success so far.
    I think that, while the read channel stays open, the untar procedure reads until the end of the file, so the next untar command does not find the needed file.
    But then "seek $f 0" should solve the problem, and it doesn't.

    Here is the untar procedure; maybe some trained eyes can see the issue better than I can.

    proc ::tar::untar {tar args} {
        set nooverwrite 0
        set data 0
        set nomtime 0
        set noperms 0
        set chan 0
        parseOpts {dir 1 file 1 glob 1 nooverwrite 0 nomtime 0 noperms 0 chan 0} $args
        if {![info exists dir]} {set dir [pwd]}
        set pattern *
        if {[info exists file]} {
            set pattern [string map {* \\* ? \\? \\ \\\\ \[ \\\[ \] \\\]} $file]
        } elseif {[info exists glob]} {
            set pattern $glob
        }

        set ret {}
        if {$chan} {
            set fh $tar
        } else {
            set fh [::open $tar]
            fconfigure $fh -encoding binary -translation lf -eofchar {}
        }
        while {![eof $fh]} {
            array set header [readHeader [read $fh 512]]
            HandleLongLink $fh header
            if {$header(name) == ""} break
            if {$header(prefix) != ""} {append header(prefix) /}
            set name [string trimleft $header(prefix)$header(name) /]
            if {![string match $pattern $name] || ($nooverwrite && [file exists $name])} {
                seekorskip $fh [expr {$header(size) + [pad $header(size)]}] current
                continue
            }

            if {$dir!=""} {
                if {[::tar::isabsolute $name]} {
                    set name [file join $dir [file tail $name]]
                } else {
                    set name [file join $dir $name]
                }
            }
            if {![file isdirectory [file dirname $name]]} {
                file mkdir [file dirname $name]
                lappend ret [file dirname $name] {}
            }
            if {[string match {[0346]} $header(type)]} {
                if {[catch {::open $name w+} new]} {
                    # sometimes if we dont have write permission we can still delete
                    catch {file delete -force $name}
                    set new [::open $name w+]
                }
                fconfigure $new -encoding binary -translation lf -eofchar {}
                fcopy $fh $new -size $header(size)
                close $new
                lappend ret $name $header(size)
            } elseif {$header(type) == 5} {
                file mkdir $name
                lappend ret $name {}
            } elseif {[string match {[12]} $header(type)] && $::tcl_platform(platform) == "unix"} {
                catch {file delete $name}
                if {![catch {file link [string map {1 -hard 2 -symbolic} $header(type)] $name $header(linkname)}]} {
                    lappend ret $name {}
                }
            }
            seekorskip $fh [pad $header(size)] current
            if {![file exists $name]} continue

            if {$::tcl_platform(platform) == "unix"} {
                if {!$noperms} {
                    catch {file attributes $name -permissions 0[string range $header(mode) 2 end]}
                }
                catch {file attributes $name -owner $header(uid) -group $header(gid)}
                catch {file attributes $name -owner $header(uname) -group $header(gname)}
            }
            if {!$nomtime} {
                file mtime $name $header(mtime)
            }
        }
        if {!$chan} {
            close $fh
        }
        return $ret
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich@21:1/5 to Alexandru on Tue Nov 1 14:57:22 2022
    Alexandru <[email protected]> wrote:
    I have a procedure that unpacks files given by a list of file paths from an archive like this:

    proc ::meshparts::AssemblyArchiveUnpack {zipfile {paths {}} {targetpaths {}}} {

    The name above invites confusion for yourself in the future. A zip file
    is not a tar file, and a tar file is not a zip file (zip and tar are two
    very different formats). Naming the variable 'zipfile' implies a "zip",
    not a "tar", at first glance.

        set f [open $zipfile rb]
        fconfigure $f -encoding binary -translation lf -eofchar {}
        zlib push gunzip $f
        if {[llength $paths]==0} {
            set result [tar::untar $f -chan]
        } else {
            foreach path $paths targetpath $targetpaths {
                set dir [file dirname $targetpath]
                set code [catch {file mkdir $dir} err]
                if {$code} {
                    ::meshparts::message "*** [mc {%1$s} $err]" -errorlog 0
                    continue
                }
                set result [tar::untar $f -file $path -dir $dir -chan]
                seek $f 0
            }
        }
        close $f
        return 1
    }

    If your tar file is indeed gzipped, as implied by this:
        zlib push gunzip $f
    then simply doing this:
        seek $f 0
    will not work, because just seeking to the beginning does not reset the
    gunzip state to what it was when the file was first opened, which is
    most likely why things are failing for you.

    Try closing and reopening the file inside the loop. If that works,
    then this was the cause.
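
    A minimal sketch of that test, assuming Tcl 8.6 (for the built-in
    zlib command and try/finally) and Tcllib's tar package. The names
    openTarGz and unpackByReopening are hypothetical and only serve the
    illustration. Each requested path gets a freshly opened channel, so
    every tar::untar call starts decompressing from the first byte:

```tcl
package require tar   ;# Tcllib

# Hypothetical helper: open a gzipped tar archive as a fresh channel
# with a new gunzip transformation on top.
proc openTarGz {archive} {
    set f [open $archive rb]
    zlib push gunzip $f
    return $f
}

# Hypothetical helper: extract each requested path using its own
# channel, so no decompression state is shared between iterations.
proc unpackByReopening {archive paths targetpaths} {
    foreach path $paths targetpath $targetpaths {
        set dir [file dirname $targetpath]
        file mkdir $dir
        set f [openTarGz $archive]
        try {
            tar::untar $f -file $path -dir $dir -chan
        } finally {
            close $f
        }
    }
}
```

    This rescans the archive once per requested file, so it is only
    meant to confirm the diagnosis, not to be fast.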


    To me, it looks like the untar procedure has a bug.

    Looks to me like you are creating the problem by trying to seek around
    inside gzipped data. To make that work, you would also have to restore
    the gunzip decompression state to exactly what it was at that file
    offset.

    If you can't formulate a glob pattern for the set of files you want to
    extract, then you'll have to do one of four things:

    1) unpack the entire tar file into a temporary location, then move out
    the files of interest and delete the unwanted files

    2) close and reopen the file inside the loop around tar::untar. But you
    are still left with scanning all of the preceding tar data up to the
    file of interest, which means you are quite close to O(N^2)
    complexity here

    3) Create your own 'untar' by making calls into the tar module
    internals to read file headers, decide if the header is for a file of
    interest, and extract the file if so. This, however, does mean you are
    calling procs that are not documented as part of the visible api to the
    tar module, so should the internals change, your code would break until
    you adapted. This method, however, does give you the most efficient
    extract, because only a single pass over the tar file is needed.

    4) Extend the tar module's untar proc to take an additional parameter
    that is a list of filenames to match tar entries against and extract
    each when found, and consider contributing the changes back to Tcllib.
    This has the identical benefits of #3, with the added benefit that if
    accepted, your change becomes part of the documented API so less likely
    to change "out from under you" in the future.
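
    The single-pass idea behind options 3 and 4 can be sketched without
    touching the tar module's internals. This is an illustration under
    stated assumptions, not the Tcllib implementation: parseTarHeader,
    skipBytes and extractMany are hypothetical names, only the name, size
    and type fields of a plain ustar header are read, and the prefix
    field, GNU long names, links, permissions and mtimes are ignored.

```tcl
# Read member name, size and type flag from a 512-byte tar header.
# Field offsets: name 0..99, size 124..135 (octal text), type flag 156.
proc parseTarHeader {block} {
    binary scan $block a100x24a12x20a1 rawName rawSize type
    set name [lindex [split $rawName \0] 0]
    if {$name eq ""} {
        return [list "" 0 ""]          ;# all-zero block: end of archive
    }
    scan [string trim [lindex [split $rawSize \0] 0]] %o size
    if {$type eq "\0"} {set type 0}    ;# old-style regular file marker
    return [list $name $size $type]
}

# Discard n bytes by reading, since a gunzip channel cannot be seeked.
proc skipBytes {chan n} {
    while {$n > 0} {
        set got [string length [read $chan [expr {min($n, 65536)}]]]
        if {$got == 0} break
        incr n -$got
    }
}

# wanted: dict mapping archive member name -> target file path.
# Reads the archive once and stops as soon as every member is found.
proc extractMany {chan wanted} {
    set extracted {}
    while {[dict size $wanted] > 0} {
        set block [read $chan 512]
        if {[string length $block] < 512} break
        lassign [parseTarHeader $block] name size type
        if {$name eq ""} break
        set padding [expr {(512 - $size % 512) % 512}]
        if {[dict exists $wanted $name] && $type eq "0"} {
            set target [dict get $wanted $name]
            file mkdir [file dirname $target]
            set out [open $target wb]
            if {$size > 0} {fcopy $chan $out -size $size}
            close $out
            skipBytes $chan $padding
            dict unset wanted $name
            lappend extracted $name
        } else {
            skipBytes $chan [expr {$size + $padding}]
        }
    }
    return $extracted
}
```

    With a channel opened in binary mode and "zlib push gunzip" applied,
    a call such as "extractMany $f [dict create a.txt /tmp/out/a.txt]"
    (paths made up for the example) returns the list of member names it
    actually wrote, in archive order.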

  • From Schelte@21:1/5 to Alexandru on Tue Nov 1 16:59:09 2022
    On 01/11/2022 16:35, Alexandru wrote:
    Option 2 is of course a "no go".
    Instead of closing/reopening, you can also pop the gunzip channel transformation, seek to the beginning, and then push the transformation
    again. But I doubt that will make a big difference in performance.
    Parsing the file multiple times is what makes it slow. Closing/opening
    the file is probably negligible in comparison.
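
    A minimal sketch of that pop/seek/push sequence, assuming Tcl 8.6
    and a channel opened on a regular (seekable) file; rewindGunzip is a
    hypothetical name:

```tcl
# Rewind a file channel that has a gunzip transformation stacked on it.
proc rewindGunzip {f} {
    chan pop $f          ;# remove the gunzip transformation and its state
    seek $f 0            ;# the plain file channel underneath is seekable
    zlib push gunzip $f  ;# start decompressing from the beginning again
}
```

    In the loop from the original post, "rewindGunzip $f" would take the
    place of "seek $f 0". Every call still decompresses and scans the
    archive from the start.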


    Schelte.

  • From Alexandru@21:1/5 to Rich on Tue Nov 1 08:35:56 2022
    Rich wrote on Tuesday, November 1, 2022 at 15:57:26 UTC+1:

    Thanks Rich,

    I must admit, I still don't understand how "read" can work on the channel but "seek" cannot.
    I'll just follow your advice and see if I can add a -files option to the "untar" procedure and propose a change on GitHub (your option 4).

    Option 2 is of course a "no go". I can already see that the time needed to open the archive and find one file is huge. Doing this for multiple files would be a deal breaker.

    BTW: I know tar and zip are different formats. I have this habit of calling all types of archives a zip file.

    Regards
    Alexandru

  • From Rich@21:1/5 to Alexandru on Tue Nov 1 16:47:02 2022
    Alexandru <[email protected]> wrote:
    Thanks Rich,

    I must admit, I still don't understand how "read" can work on the
    channel but "seek" cannot.

    The seek works. You move the file pointer back and start reading from
    a different offset.

    But, your file is a gzip file. The gzip compressed format needs to be
    read from the front, because to unpack byte X, you need the gzip
    compression state that was created by unpacking bytes 0 through X-1.

    If you are at offset Y, you have the gzip compression state created
    from 0 through Y-1. If you now seek to X, you'll get the wrong result
    from trying to decompress X using the gzip state of 0 through Y-1.

    I'll just follow your advice and see if I can add a -files option to
    the "untar" procedure and propose a change on github (your option 4).

    Option 2 is of course a "no go". I can already see that the time
    needed to open the archive and find one file is huge. Doing this
    for multiple files would be a deal breaker.

    Tar is not zip. The expanded acronym gives a clue: (T)ape (Ar)chive.
    It was created (originally) to package files onto magnetic tape. As
    tape does not have "random seek ability", tar contains no features to
    allow random access within the tar file. You have to either read it
    from the start in a linear manner, or pre-index it once up front (by
    reading it from front to back in a linear manner) and then use your
    index to randomly grab files out.
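
    A rough sketch of such a pre-index, assuming an uncompressed tar
    file (so that seek works) with plain ustar headers; buildTarIndex and
    readMember are hypothetical names, and the prefix field and GNU long
    names are ignored:

```tcl
# Scan the tar file once and record, for each member name, the offset
# of its data and its size in bytes.
proc buildTarIndex {tarfile} {
    set f [open $tarfile rb]
    set index [dict create]
    try {
        while {1} {
            set block [read $f 512]
            if {[string length $block] < 512} break
            binary scan $block a100x24a12 rawName rawSize
            set name [lindex [split $rawName \0] 0]
            if {$name eq ""} break     ;# all-zero block: end of archive
            scan [string trim [lindex [split $rawSize \0] 0]] %o size
            dict set index $name [list [tell $f] $size]
            # Jump over the data, which is padded to a 512-byte boundary.
            seek $f [expr {($size + 511) / 512 * 512}] current
        }
    } finally {
        close $f
    }
    return $index
}

# Fetch one member's contents directly, using the index.
proc readMember {tarfile index name} {
    lassign [dict get $index $name] offset size
    set f [open $tarfile rb]
    try {
        seek $f $offset
        return [read $f $size]
    } finally {
        close $f
    }
}
```

    This does not carry over to a gzipped tar, because the offsets refer
    to the uncompressed data.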

    Zip files include index data as part of the format, so one can directly
    access a single file in a zip without having to read the whole file
    from the front in order to do so.

    BTW: I know tar and zip are different formats. I have this habit of
    calling all types of archives a zip file.

    Which is fine, but it confuses others who call a tar file a tar file
    and a zip file a zip file because they are two very different file
    formats.

  • From Robert Heller@21:1/5 to Alexandru on Tue Nov 1 16:44:05 2022
    At Tue, 1 Nov 2022 08:35:56 -0700 (PDT) Alexandru <[email protected]> wrote:

    I must admit, I still don't understand how "read" can work on the channel but "seek" cannot.
    I'll just follow your advice and see if I can add a -files option to the "untar" procedure and propose a change on GitHub (your option 4).


    When you "read" a compressed tar file, you are not actually reading the tar file itself, but the output of a pipeline from gunzip (or something like gunzip). You can't seek on a pipeline -- I don't know if this is an actual pipe device or a 'faked' pipe using VFS hackery and it does not matter which, the effect is the same.

    Option 2 is of course a "no go". I can already see that the time needed to open the archive and find one file is huge. Doing this for multiple files would be a deal breaker.

    BTW: I know tar and zip are different formats. I have this habit of calling all types of archives a zip file.

    Confusing tar and zip is probably what is getting you into lots of trouble, especially if you are confusing a gzipped tar file with a zip file.

    Some important things to understand about tar and zip files:

    Tar was originally designed for *tapes* (yes, those reels of plastic film coated with iron oxide). Nobody uses tapes anymore. Tar files don't have compressed elements; the whole tar file gets compressed as a single blob. Tar files are meant to be read and written sequentially and not randomly accessed.


    *Zip* files contain an *uncompressed* table of contents, and each member element is separately compressed (or not). Zip files were specifically designed to be randomly accessed -- one can seek to the end and read the TOC and then seek to specific files in the Zip archive and extract (and uncompress) them, in any order you like.

    --
    Robert Heller -- Cell: 413-658-7953 GV: 978-633-5364
    Deepwoods Software -- Custom Software Services
    http://www.deepsoft.com/ -- Linux Administration Services
    [email protected] -- Webhosting Services

  • From Rich@21:1/5 to Schelte on Tue Nov 1 16:48:32 2022
    Schelte <[email protected]> wrote:
    On 01/11/2022 16:35, Alexandru wrote:
    Option 2 is of course a "no go".
    Instead of closing/reopening, you can also pop the gunzip channel transformation, seek to the beginning, and then push the transformation again.

    Ah, that would reset the gzip state as well. I forgot about that
    option.

    But I doubt that will make a big difference in performance. Parsing
    the file multiple times is what makes it slow. Closing/opening the
    file is probably negligible in comparison.

    Agreed.

  • From Alexandru@21:1/5 to Rich on Tue Nov 1 11:44:13 2022
    Rich wrote on Tuesday, November 1, 2022 at 17:48:36 UTC+1:

    Thanks all for the help.
    I added the -files and -dirs options to the untar procedure and committed the changes:
    https://github.com/Meshparts/tcllib/blob/master/modules/tar/tar.tcl
