On 29.04.2023 08:17, Lloyd Houghton wrote:
Hi, I had a script for this purpose from about 30 years ago, which was the last time I needed it. It doesn't seem to work and I'm very rusty, so I wonder if someone could offer a solution.
I have a file where each line has two fields. The first field is sometimes identical between one line and the next. I need to prepend a new field on every line to say how many lines (including the current one) share the same first field. We can assume the file is sorted. For example, if the file is:
abc 647389
abc 12354
abd 7563
cdf 152384
cdf 8761523
cdf 1253
ghj 78654
klm 12634
pqr 9864
then when I run the script, the output should be:
2 abc 647389
2 abc 12354
1 abd 7563
3 cdf 152384
3 cdf 8761523
3 cdf 1253
1 ghj 78654
1 klm 12634
1 pqr 9864
The script that I used to do this (as best I can guess from looking in the directory with my data) looks like this:
sort -o tempid tempid
awk 'NR>1 && $1 != key { for (i=0; ++i<n) print n, line[i]; n=0 }
{ key=$1; line[++n]=$0 }
END { for (i=0; ++i<n) print n, line[i] }' tempid >tempid2
This script has obvious syntactical errors.
I can't say that I understand the loop specification format, or even the overall behaviour (someone must have helped me), but this script was in the directory and appears to be related to the task...
You need information in the lines that you can only determine from later
lines, so you need to (temporarily) store the contents of the lines, as
you seem to have tried.
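For reference, a minimal sketch of how that buffering approach might be repaired, assuming the intent was to hold each group of lines until the key changes and then print them with the group size. The three-part `for` loop and the helper name `emit_group` are assumptions made for illustration, not taken from the original script; the sample data is the one from the question.

```shell
# Sample input from the question, written to the scratch file used in the thread.
printf '%s\n' 'abc 647389' 'abc 12354' 'abd 7563' 'cdf 152384' \
    'cdf 8761523' 'cdf 1253' 'ghj 78654' 'klm 12634' 'pqr 9864' > tempid

awk '
# Print every buffered line of the current group, prefixed with the group size,
# then reset the buffer counter. emit_group is a hypothetical helper name.
function emit_group(   i) {
    for (i = 1; i <= n; i++) print n, line[i]
    n = 0
}
NR > 1 && $1 != key { emit_group() }   # key changed: flush the previous group
{ key = $1; line[++n] = $0 }           # buffer the current line
END { emit_group() }                   # flush the last group
' tempid > tempid2
```

This only gives the right counts if equal keys are adjacent, which is why the original ran `sort` first.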
Could anyone help me to fix this?
No, because there's a much simpler and more obvious solution: two-pass processing across your (sorted) data.
awk '
NR==FNR { n[$1]++ ; next }
{ print n[$1], $0 }
' tempid tempid >tempid2
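How this works: awk reads `tempid` twice. During the first read `NR==FNR` holds (the overall record number equals the per-file record number), so the first rule only counts lines per key in `n` and skips to the next line. During the second read the condition is false, so each line is printed with the total for its key. A small sketch on the sample data from the question, with the same command annotated (the `printf` setup line is only there to make the example self-contained):

```shell
# Sample input from the question, written to the scratch file used in the thread.
printf '%s\n' 'abc 647389' 'abc 12354' 'abd 7563' 'cdf 152384' \
    'cdf 8761523' 'cdf 1253' 'ghj 78654' 'klm 12634' 'pqr 9864' > tempid

awk '
NR == FNR { n[$1]++; next }   # first read of tempid: count lines per key
{ print n[$1], $0 }           # second read: prefix each line with its count
' tempid tempid > tempid2

cat tempid2
```

Because the file is named twice, this form needs a real file and cannot read from a pipe. The counts themselves do not depend on the input being sorted.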
Janis
Many many thanks.