• Re: [OT?] Attempting to extract tabular data from PDF -- appropriate for

    From Richard Owlett@21:1/5 to Daniele Forsi on Sat Jul 19 21:30:01 2025
    On 7/19/25 11:43 AM, Daniele Forsi wrote:
    Hello Richard,

    the PDF format is not suitable for structured data,

    You err slightly. {smile ;}
    The _creators_ of PDF had no explicit interest in "structured data".
    Their creation was a tool to create machine readable data, which once
    printed, would match what came from the existing printing industry.

    what do you want to do with it?

    To quote myself ;}
    > Table A4.14 ... has information meeting my immediate personal need.
    > My goal is to document a typical generic weekly grocery list for
    > the "standard" 2000 calorie/day diet as a spreadsheet and/or
    > database.


    [SNIP]
    If you want something that you can modify, use "pdftotext" which is
    available in Debian in the "poppler-utils" package
    This will work for you:
    pdftotext -layout -f 116 -l 116 /tmp/TFP2021.pdf
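
    For anyone who would rather script this step than type it, here is a
    minimal Python sketch that builds the same command and captures the
    text. The helper names pdftotext_cmd and extract_page_text are
    hypothetical; only the -layout, -f and -l options and the "-" output
    argument (write to stdout) come from pdftotext itself. It assumes
    pdftotext from poppler-utils is on the PATH.

```python
import subprocess


def pdftotext_cmd(pdf_path, first_page, last_page, out_path="-"):
    """Build the pdftotext argument list (hypothetical helper).

    "-" as the output name asks pdftotext to write to stdout.
    """
    return [
        "pdftotext",
        "-layout",               # keep the physical column layout
        "-f", str(first_page),   # first page to convert
        "-l", str(last_page),    # last page to convert
        pdf_path,
        out_path,
    ]


def extract_page_text(pdf_path, first_page, last_page):
    """Run pdftotext and return the extracted text as one string.

    Raises CalledProcessError if pdftotext exits with an error, and
    FileNotFoundError if pdftotext is not installed.
    """
    result = subprocess.run(
        pdftotext_cmd(pdf_path, first_page, last_page),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

    Calling extract_page_text("/tmp/TFP2021.pdf", 116, 116) should return
    the same text the command above writes to /tmp/TFP2021.txt.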

    Thank you.
    I had tried "poppler-utils" on another edition of "TFP2021.pdf".
    The result was a *MESS* unsuitable as input to a scriptable editor
    such as Kate.
    [An aside. Back in the 70's, when working for DEC as an Engineering
    Tech, I was surrounded by TECO fanatics. It caused me to appreciate
    powerful text editors. That prompted my interest in Kate.]

    Using Pluma I started a trial edit of /tmp/TFP2021.txt created by
    pdftotext. I think creating a Kate macro is feasible. I'm a Kate newbie
    and it will at least be an educational experience.



    Now let's talk radio!
    I wanted to convert the band plan https://www.iaru-r1.org/wp-content/uploads/2021/03/UHF-Bandplan.pdf

    I tried different ways:
    first I did a copy and paste in Libreoffice Writer, I got all the
    contents, but the columns were gone, as expected
    then I did a copy and paste in Libreoffice Calc, but there isn't an
    easy way to get the columns
    finally I ran: pdftotext -layout -f 1 -l 2 UHF-Bandplan.pdf
    and also in this case pdftotext is doing a better job than a simple
    copy and paste, but the output can't easily be read by software, so I
    wonder if a machine-readable list of frequencies is already available somewhere


    I believe you are overly pessimistic.
    When I ran: pdftotext -layout -f 1 -l 2 UHF-Bandplan.pdf the result was
    similar enough to TFP2021.txt that I believe Kate may be suitable.
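
    One way to test how machine-readable the -layout output is, sketched
    in Python rather than as a Kate macro: treat any run of two or more
    spaces as a column gap and write the cells out as CSV. That gap rule
    is an assumption about the layout, not something pdftotext promises,
    and the SAMPLE text below is invented for illustration; it is not
    taken from UHF-Bandplan.pdf or TFP2021.pdf.

```python
import csv
import io
import re

# Assumption: columns are separated by two or more spaces, while single
# spaces occur only inside a cell.
COLUMN_GAP = re.compile(r" {2,}")


def split_layout_line(line):
    """Split one line of 'pdftotext -layout' output into cells."""
    return COLUMN_GAP.split(line.strip())


def layout_to_rows(text):
    """Turn layout text into a list of rows, skipping blank lines."""
    return [
        split_layout_line(line)
        for line in text.splitlines()
        if line.strip()
    ]


def rows_to_csv(rows):
    """Serialize rows as CSV text, quoting cells where needed."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Made-up sample in the general shape of a band plan table.
SAMPLE = (
    "Frequency (MHz)     Bandwidth     Usage\n"
    "\n"
    "430.000 - 431.000   20 kHz        Example segment A\n"
    "431.000 - 432.000   12 kHz        Example segment B\n"
)

if __name__ == "__main__":
    print(rows_to_csv(layout_to_rows(SAMPLE)), end="")
```

    Rows whose cells wrap onto a second line, or tables with empty cells,
    will not split cleanly this way and would still need hand editing.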

    I'll use editing TFP2021.txt as a learning experience and
    UHF-Bandplan.txt as a feasibility experiment for the task with Kate.

    I should have preliminary results in a week or so. To match some goals of the
    project that prompted my investigation of Kate, the output will be HTML.
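
    As a rough idea of what the HTML step could look like outside Kate,
    here is a small Python sketch that turns a list of rows into an HTML
    table. The function name rows_to_html_table and the EXAMPLE_ROWS data
    are hypothetical; the first row is assumed to be the header.

```python
from html import escape


def rows_to_html_table(rows, header=True):
    """Render rows (lists of strings) as a plain HTML table.

    If header is true, the first row is emitted with <th> cells.
    Cell text is escaped so characters like & and < stay valid HTML.
    """
    lines = ["<table>"]
    for i, row in enumerate(rows):
        tag = "th" if header and i == 0 else "td"
        cells = "".join(f"<{tag}>{escape(cell)}</{tag}>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Invented rows, only to show the shape of the output.
EXAMPLE_ROWS = [["Item", "Quantity"], ["Beans & rice", "2 lb"]]

if __name__ == "__main__":
    print(rows_to_html_table(EXAMPLE_ROWS))
```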

    Later.
