• prepending a counter for number of lines that match the first field

    From Lloyd Houghton@21:1/5 to All on Fri Apr 28 23:17:24 2023
    Hi, I had a script for this purpose from about 30 years ago, which was the last time I needed it. It doesn't seem to work and I'm very rusty, so I wonder if someone could offer a solution.

    I have a file where each line has two fields. The first field is sometimes identical between one line and the next. I need to prepend a new field on every line to say how many lines (including the current one) share the same first field. We can assume the file is sorted. For example, if the file is:

    abc 647389
    abc 12354
    abd 7563
    cdf 152384
    cdf 8761523
    cdf 1253
    ghj 78654
    klm 12634
    pqr 9864

    then when I run the script, the output should be:

    2 abc 647389
    2 abc 12354
    1 abd 7563
    3 cdf 152384
    3 cdf 8761523
    3 cdf 1253
    1 ghj 78654
    1 klm 12634
    1 pqr 9864

    The script that I used to do this (as best as I guess from looking in the directory with my data) looks like this:

    sort -o tempid tempid
    awk 'NR>1 && $1 != key { for (i=0; ++i<n) print n, line[i]; n=0 }
    { key=$1; line[++n]=$0 }
    END { for (i=0; ++i<n) print n, line[i] }' tempid >tempid2

    I can't say that I understand the loop specification format, or even the overall behaviour (someone must have helped me), but this script was in the directory and appears to be related to the task...

    Could anyone help me to fix this?

    Many many thanks.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Lloyd Houghton on Sat Apr 29 10:44:41 2023
    On 29.04.2023 08:17, Lloyd Houghton wrote:
    The script that I used to do this (as best as I guess from looking in the directory with my data) looks like this:

    sort -o tempid tempid
    awk 'NR>1 && $1 != key { for (i=0; ++i<n) print n, line[i]; n=0 }
    { key=$1; line[++n]=$0 }
    END { for (i=0; ++i<n) print n, line[i] }' tempid >tempid2


    This script has obvious syntactical errors.

    I can't say that I understand the loop specification format, or even the overall behaviour (someone must have helped me), but this script was in the directory and appears to be related to the task...

    You need information in the lines that you can only determine by later
    lines, so you need to (temporarily) store the contents of the lines as
    you seem to have tried.
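
    For readers who want to see the buffering approach working, here is a minimal sketch of the quoted script with its loops repaired. It assumes the only problems are the `for` headers: awk requires all three clauses (`init; condition; increment`), and the condition `++i<n` would also stop one line short of the group. The temporary directory and the sample data copied from the original post are illustrative assumptions, not part of the thread's script.

```shell
# Work in a scratch directory (an assumption for this sketch).
dir=$(mktemp -d)

# Sample input from the original post, already sorted on field 1.
printf '%s\n' 'abc 647389' 'abc 12354' 'abd 7563' 'cdf 152384' \
    'cdf 8761523' 'cdf 1253' 'ghj 78654' 'klm 12634' 'pqr 9864' >"$dir/tempid"

# Buffer the lines of the current group. When the key changes, and again
# at end of input, print every buffered line prefixed by the group size n.
awk 'NR>1 && $1 != key { for (i=1; i<=n; i++) print n, line[i]; n=0 }
     { key=$1; line[++n]=$0 }
     END { for (i=1; i<=n; i++) print n, line[i] }' "$dir/tempid" >"$dir/tempid2"

cat "$dir/tempid2"
```

    Unlike a two-pass approach, this version reads the input only once, but it does rely on equal keys being adjacent, which is why the original ran `sort` first.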


    Could anyone help me to fix this?

    No, because there's a much simpler and more obvious solution: two-pass processing across your (sorted) data.

    awk '
    NR==FNR { n[$1]++ ; next }
    { print n[$1], $0 }
    ' tempid tempid >tempid2
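
    A short annotated sketch of how the two-pass idiom above works may help. `NR` counts records across all input files while `FNR` restarts at 1 for each file, so `NR==FNR` holds only while awk reads the first file argument (provided that file is not empty). The scratch directory and the deliberately shuffled sample lines below are assumptions made for illustration.

```shell
# Scratch directory and shuffled sample data (assumptions for this sketch).
dir=$(mktemp -d)
printf '%s\n' 'cdf 152384' 'abc 647389' 'cdf 8761523' 'abc 12354' \
    'abd 7563' >"$dir/tempid"

# The same file is named twice, so awk reads it twice.
awk '
NR == FNR { n[$1]++; next }   # first read: count lines per key, print nothing
{ print n[$1], $0 }           # second read: prefix each line with its count
' "$dir/tempid" "$dir/tempid" >"$dir/tempid2"

cat "$dir/tempid2"
```

    Because the counts are stored per key in the array `n`, this sketch gives the total number of lines sharing a first field even when those lines are not adjacent. The input has to be a regular file that can be read twice, so it will not work on a pipe.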


    Janis


  • From Lloyd Houghton@21:1/5 to Janis Papanagnou on Sat Apr 29 15:00:39 2023
    Thank you very much Janis, this has solved my problem.

    I remember your name from helping me in this same forum many years ago with a shell script. For a hobby, I end up needing such scripts no more than 2 or 3 times a decade, and I'm grateful to people like you who help others with problems that must seem tediously obvious to you.

    regards - Lloyd

  • From Janis Papanagnou@21:1/5 to Lloyd Houghton on Sun Apr 30 00:31:56 2023
    Thanks for your feedback. Glad my suggestion helped. (It's not tedious,
    don't worry.)

    Janis
