drguthals 27 minutes ago

PDF -> Useful Information is what Tensorlake does (https://tensorlake.ai)

Because PDFs are so dominant, and yet each one carries information in more than just text (tables, images, formulas, hand-writing, even strike-throughs), we (as devs) need tools that understand the contents, not just "read" them.

Full disclosure...I work there

90s_dev a day ago

Have any of you ever thought to yourself, this is new and interesting, and then vaguely remembered that you spent months or years becoming an expert at it earlier in life but entirely forgot it? And in fact large chunks of the very interesting things you've done just completely flew out of your mind long ago, to the point where you feel absolutely new at life, like you've accomplished relatively nothing, until something like this jars you out of that forgetfulness?

I definitely vaguely remember doing some incredibly cool things with PDFs and OCR about 6 or 7 years ago. Some project comes to mind... google tells me it was "tesseract" and that sounds familiar.

  • bazzargh 20 hours ago

    Back in... 2006ish? I got annoyed with being unable to copy text from multicolumn scientific papers on my iRex (an early ereader that was somewhat hackable), so I dug a bit into why that was. Under the hood, the PDF reader used poppler, so I modified poppler to infer reading order in multicolumn documents using algorithms that Tesseract's author (Thomas Breuel) had published for OCR.

    It was a bit of a heuristic hack; it was 20 years ago, but as I recall poppler's ancient API didn't really represent text runs in a way you'd want for an accessibility API. A version of the multicolumn select made it in, but it was a pain to persuade poppler's maintainer that subsequent suggestions to improve performance were OK, because they used slightly different heuristics and so produced different text selections in some circumstances. There was no 'right' answer, so wanting the results to match didn't make sense.

    And that's how kpdf got multicolumn select, of a sort.

    Using Tesseract directly for this has probably made more sense for some years now.

    • steeeeeve 18 hours ago

      I too went down that rabbithole. Haha. Anything around that time to get an edge in a fantasy football league. I found a bunch of historical NFL stats pdfs and it took forever to make usable data out of them.

  • hallman76 14 hours ago

    We will never get back the collective man-decades of time that has been burned by this format. When will the madness stop?

    • GuB-42 3 hours ago

      PDF is effectively digital paper, and it works really well for this. When I made PDFs 20 years ago, I knew they would always look the same on every device, including on paper, and they did, and they still do. In addition, a document is a single file, reasonably compact, looks good at any resolution, and is generally searchable. Even if not ideal, it can also support scans of paper documents in a way that can be sent to a printer on the other side of the planet, and you will get the same result as if you had used a copier.

      Data extraction is hard, but that's not what it is designed for; it is for people to read, like paper documents.

      Far from being "mad", it is remarkably stable. It has some crazy features, and it is not designed for data extraction (but doesn't actively prevent it!). But look at the alternative. Word documents? Html? Svg? One of the zillion XML-based document formats? Markdown? Is any one of these suitable for writing, say, a scientific paper (with maths, tables, graphics...) in a way that is readable by a human on a computer or in print and will still be in decades and that is easier to process by a machine than a PDF?

    • theamk 13 hours ago

      When we get an alternative that can:

      (1) be stored in a single file

      (2) Allow tables, images and anything else that can be shown on a piece of paper

      (3) Won't have animation, fold-out text, or anything that cannot be shown on a piece of paper

      (4) won't require Javascript or access to external sites

      that means never. We got lucky that we at least got PDF before "web designers" made (3) impossible and marketers made (4) impossible.

      • majora2007 3 hours ago

        Why can't this be done with epub? Single file, all files are packed within the zip, no javascript needed but can be included. Allows for markup and forms, just like pdf.

      • protocolture 13 hours ago

        Behold a Bitmap.

        But for real, that's a pretty easy set of hurdles. Really, the barrier is the psychological fallacy that PDF's are immutable.

        • theamk 12 hours ago

          Should have added "looks good on screen and on paper", "stores text compactly" and "multiple pages supported" :) And yes, that's a pretty easy set of hurdles. I wish we'd standardized on DjVu instead.

          Re "PDF's are immutable." - that's not a psychological fallacy, that's a primary advantage of PDFs. If I wanted mutable format, I'd take an odt (or rtf or a doc). "Output only" format allows one to use the very latest version of editor app, while having the result working even in ancient readers, something very desirable in many contexts.

          • harshreality 6 hours ago

            What's immutable, without tools to decompress and possibly perform further de-obfuscation of text streams, is the typical way publishing software encodes text into streams inside PDFs.

            It remains possible to have a pdf with text that is easily mutable with any text editor.

            Even if text inside a pdf is annoyingly encoded, you can always just replace the appropriate object/text streams... if you can identify the right one(s). You can extract and edit and re-insert, or simply replace, embedded images as well.

            I don't think "this format promotes, as the norm, so much obfuscation of basic text objects that it becomes impractical to edit them in situ without wholesale replacement" is the win you think it is.

            "Looks good on paper" has to do with the rendering engine (largely high-DPI and good font handling/spacing/kerning), not PDF as a content layout/presentation format. A high-quality software rasterizer (for postscript or PDF, often embedded in the printer)—not the PDF file format—has been the magic sauce.

            Today, some large portion of end-user interaction with PDFs is via rendering into a web browser DOM via javascript. Text in PDFs is rendered as text in the browser. Perhaps nothing else demonstrates more clearly that the "PDF is superior" argument is invalid.

            • me-vs-cat 20 minutes ago

              > you can always just replace the appropriate object/text streams

              Or right-click and select Edit. Works in several PDF editors, on both text and image content.

          • me-vs-cat 26 minutes ago

            PDFs are not immutable.

          • protocolture 12 hours ago

            Word can edit pretty much any PDF these days; the issue is that it will often garble the attempt.

          • imtringued 7 hours ago

            PDFs are not really immutable. I use Okular all the time to write my "notes" (it's just text that you can place anywhere) on top of a PDF form and then print out a new completely filled out PDF. The only thing I do by hand is sign the physical paper.

      • numpad0 9 hours ago

        (-1) be vector format that never gets pixelated

        (0) that reproduce everywhere on any OS perfectly

        (0.5) that supports (everything) any typographical engineer has ever wanted, past and future

        Bitmap formats are out from clause -1, Office file formats disqualify from clause 0, Markdown doesn't satisfy clause 0.5. Otherwise a Word .doc format covers most of clauses 1-4.

        • Timwi 8 hours ago

          > (0) that reproduce everywhere on any OS perfectly

          Can somebody explain why this isn't the case for HTML? I'm frequently in a situation where a website that mimics printed pages fails to render the same between Firefox and Chrome. I wish to understand the primary culprit here. I thought all of the CSS units were completely defined?

          • scajanus 4 hours ago

            I think this is the result of 1) it being a moving target and 2) HTML and CSS being a de facto standard rather than de jure, where the (differing) implementations define at least part of the spec.

            You also can't really embed fonts in an HTML file; you rely on linking instead, and those links can rot. Apparently there has been some work towards it (base64 encoded), but support may vary. And you'd need to embed the whole font; I don't think you can do character subsets easily.

          • djxfade 6 hours ago

            Probably due to different font rendering in the OS.

      • ninalanyon 6 hours ago

        A subset of HTML and CSS surely does that to a large degree. Data URLs solve the single-file problem.

      • gpvos 11 hours ago

        Postscript fits that bill better.

      • coderatlarge 13 hours ago
        • drfuchs 8 hours ago

          DVI isn’t suitable as you’d still have to intuit where the paragraph- and even word-breaks are; what’s body text vs. headers/footers, sidebars, captions, etc; never mind what math expression a particular jumble of characters and rules came from.

        • theamk 12 hours ago

          > For a DVI file to be printed or even properly previewed, the fonts it references must be already installed.

          If you want alternatives, I'd choose DjVu. But it's too late now, everyone is converged on PDFs, and the alternatives are not good enough to warrant the switch.

    • xvilka 12 hours ago

      The solution has always been in plain sight: just make an XML-based format. Nobody liked it, except OpenDocument and eventually Microsoft. Though those formats serve a different purpose, a new, similar one could be created with picture-perfect features.

  • pimlottc 19 hours ago

    This is life. So many times I’ve finished a project and thought to myself: “Now I am an expert at doing this. Yet I probably won’t ever do this again.” Because the next thing will be in a completely different subject area and I’ll start again from the basics.

  • anon373839 18 hours ago

    Tesseract was the best open-source OCR for a long time. But I’d argue that docTR is better now, as it’s more accurate out of the box and GPU accelerated. It implements a variety of different text detection and recognition model architectures that you can combine in a modular pipeline. And you can train or fine-tune in PyTorch or TensorFlow to get even better performance on your domain.
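
    For anyone wanting to try it, the basic docTR usage is roughly the following (a minimal sketch with the default pretrained pipeline; the file name is a placeholder):

      # Sketch: OCR a PDF with docTR's default detection + recognition models.
      from doctr.io import DocumentFile
      from doctr.models import ocr_predictor

      model = ocr_predictor(pretrained=True)     # detection + recognition pipeline
      doc = DocumentFile.from_pdf("scan.pdf")    # rasterizes each page
      result = model(doc)
      print(result.render())                     # plain-text rendering of the result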

  • selcuka 11 hours ago

    A while ago someone asked me a C++ question and I said "sorry, never worked seriously with C++". Then I remembered that I wrote the client code for a private instant messenger in Borland C++ approximately 20 years ago, that was used by thousands of people.

    So yeah, it happens.

    • 90s_dev an hour ago

      Exactly. Every time I look at Rust, it looks foreign and unfamiliar, until I remember that I wrote Rust for a client for maybe a year straight many years ago. Then it starts to come back to me.

  • korkybuchek 21 hours ago

    Not that I'm privy to your mind, but it probably was tesseract (and this is my exact experience too...although for me it was about 12 years ago).

  • didericis 19 hours ago

    I built an auto HQ solver with tesseract when HQ was blowing up over thanksgiving (HQ was the gameshow by the vine people with live hosts). I would take a screenshot of the app during a question, share it/send it to a little local api, do a google query for the question, see how many times each answer on the first page appeared in the results, then rank the answers by probability.

    Didn't work well/was a very naive way to search for answers (which is prob good/idk what kind of trouble I'd have gotten in if it let me or anyone else who used it win all the time), but it was fun to build.

  • downboots a day ago

    No different than a fire ant whose leaf got knocked over by the wind and it moved on to the next.

    • 90s_dev 21 hours ago

      Well I sure do feel different than a fire ant.

      • downboots 20 hours ago

        anttention is all we have

        • 90s_dev 20 hours ago

          Not true, I also have a nice cigar waiting for the rain to go away.

          • 90s_dev 18 hours ago

            Hmm, it's gone now. Well I used to have one anyway.

  • TZubiri 16 hours ago

    It's still fresh with me, 7 or 8 years ago in my 20s, perhaps you are a bit older? Otherwise wouldn't hurt to do a checkup with a physician

    • 90s_dev 15 hours ago

      No I'm just good at forgetting things, I've been practicing since I was a kid.

      • LordGrignard 14 hours ago

        OH. MY. GOD. THATS SO COOL, IM STEALING THAT

svat a day ago

One thing I wish someone would write is something like the browser's developer tools ("inspect elements") for PDF — it would be great to be able to "view source" a PDF's content streams (the BT … ET operators that enclose text, each Tj operator for setting down text in the currently chosen font, etc), to see how every “pixel” of the PDF is being specified/generated. I know this goes against the current trend / state-of-the-art of using vision models to basically “see” the PDF like a human and “read” the text, but it would be really nice to be able to actually understand what a PDF file contains.

There are a few tools that allow inspecting a PDF's contents (https://news.ycombinator.com/item?id=41379101) but they stop at the level of the PDF's objects, so entire content streams are single objects. For example, to use one of the PDFs mentioned in this post, the file https://bfi.uchicago.edu/wp-content/uploads/2022/06/BFI_WP_2... has, corresponding to page number 6 (PDF page 8), a content stream that starts like (some newlines added by me):

    0 g 0 G
    0 g 0 G
    BT
    /F19 10.9091 Tf 88.936 709.041 Td
    [(Subsequen)28(t)-374(to)-373(the)-373(p)-28(erio)-28(d)-373(analyzed)-373(in)-374(our)-373(study)83(,)-383(Bridge's)-373(paren)27(t)-373(compan)28(y)-373(Ne)-1(wGlob)-27(e)-374(reduced)]TJ
    -16.936 -21.922 Td
    [(the)-438(n)28(um)28(b)-28(er)-437(of)-438(priv)56(ate)-438(sc)28(ho)-28(ols)-438(op)-27(erated)-438(b)28(y)-438(Bridge)-437(from)-438(405)-437(to)-438(112,)-464(and)-437(launc)28(hed)-438(a)-437(new)-438(mo)-28(del)]TJ
    0 -21.923 Td
and it would be really cool to be able to see the above “source” and the rendered PDF side-by-side, hover over one to see the corresponding region of the other, etc, the way we can do for a HTML page.
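
Short of that live inspector, one partial workaround: libraries like pdfplumber expose every character with its page coordinates, which at least lets you relate the Tj/TJ output to positions on the page (a rough sketch, with the file name and page index as placeholders):

    # Sketch: dump each character with its coordinates and font via pdfplumber.
    import pdfplumber

    with pdfplumber.open("paper.pdf") as pdf:
        page = pdf.pages[7]                      # e.g. PDF page 8 as above
        for ch in page.chars[:40]:
            print(f'{ch["text"]!r} x0={ch["x0"]:.1f} top={ch["top"]:.1f} '
                  f'font={ch["fontname"]} size={ch["size"]:.1f}')
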
  • kccqzy a day ago

    When you use PDF.js from Mozilla to render a PDF file in DOM, I think you might actually get something pretty close. For example I suppose each Tj becomes a <span> and each TJ becomes a collection of <span>s. (I'm fairly certain it doesn't use <canvas>.) And I suppose it must be very faithful to the original document to make it work.

    • chaps 21 hours ago

      Indeed! I've used it to parse documents I've received through FOIA -- sometimes it's just easier to write beautifulsoup code compared to having to deal with PDF's oddities.

  • drguthals 30 minutes ago

    "I know this goes against the current trend / state-of-the-art of using vision models to basically “see” the PDF like a human and “read” the text, but it would be really nice to be able to actually understand what a PDF file contains."

    Some combination of this is what we're building at Tensorlake (full disclosure: I work there), where you can "see" the PDF like a human and "understand" the contents, not JUST "read" the text, because the contents of PDFs are usually spread across tables, images, text, formulas, and hand-writing.

    Being able to "understand what a PDF file contains" is important (I think) for that understanding part. So we parse the PDF and run multiple models to extract Markdown chunks/JSON, so that you can ingest the actual data into other applications (AI agents, LLMs, or frankly whatever you want).

    https://tensorlake.ai

  • whenc a day ago

    Try with cpdf (disclaimer, wrote it):

      cpdf -output-json -output-json-parse-content-streams in.pdf -o out.json
    
    Then you can play around with the JSON, and turn it back to PDF with

      cpdf -j out.json -o out.pdf
    
    No live back-and-forth though.
    • svat a day ago

      The live back-and-forth is the main point of what I'm asking for — I tried your cpdf (thanks for the mention; will add it to my list) and it too doesn't help; all it does is, somewhere 9000-odd lines into the JSON file, turn the part of the content stream corresponding to what I mentioned in the earlier comment into:

              [
                [ { "F": 0.0 }, "g" ],
                [ { "F": 0.0 }, "G" ],
                [ { "F": 0.0 }, "g" ],
                [ { "F": 0.0 }, "G" ],
                [ "BT" ],
                [ "/F19", { "F": 10.9091 }, "Tf" ],
                [ { "F": 88.93600000000001 }, { "F": 709.0410000000001 }, "Td" ],
                [
                  [
                    "Subsequen",
                    { "F": 28.0 },
                    "t",
                    { "F": -374.0 },
                    "to",
                    { "F": -373.0 },
                    "the",
                    { "F": -373.0 },
                    "p",
                    { "F": -28.0 },
                    "erio",
                    { "F": -28.0 },
                    "d",
                    { "F": -373.0 },
                    "analyzed",
                    { "F": -373.0 },
                    "in",
                    { "F": -374.0 },
                    "our",
                    { "F": -373.0 },
                    "study",
                    { "F": 83.0 },
                    ",",
                    { "F": -383.0 },
                    "Bridge's",
                    { "F": -373.0 },
                    "paren",
                    { "F": 27.0 },
                    "t",
                    { "F": -373.0 },
                    "compan",
                    { "F": 28.0 },
                    "y",
                    { "F": -373.0 },
                    "Ne",
                    { "F": -1.0 },
                    "wGlob",
                    { "F": -27.0 },
                    "e",
                    { "F": -374.0 },
                    "reduced"
                  ],
                  "TJ"
                ],
                [ { "F": -16.936 }, { "F": -21.922 }, "Td" ],
      
      This is just a more verbose restatement of what's in the PDF file; the real questions I'm asking are:

      - How can a user get to this part, from viewing the PDF file? (Note that the PDF page objects are not necessarily a flat list; they are often nested at different levels of “kids”.)

      - How can a user understand these instructions, and “see” how they correspond to what is visually displayed on the PDF file?

    • IIAOPSW 20 hours ago

      This might actually be something very valuable to me.

      I have a bunch of documents right now that are annual statutory and financial disclosures of a large institute, and they are just barely differently organized from each year to the next to make it too tedious to cross compare them manually. I've been looking around for a tool that could break out the content and let me reorder it so that the same section is on the same page for every report.

      This might be it.

  • hnick 16 hours ago

    I assume you mean open source or free, but just noting Acrobat Pro was almost there when I last used it years ago. The problem was you had it in reverse, browsing the content tree not inspecting the page, but it did highlight the object on the page. Not down to the command though, just the object/stream.

  • dleeftink a day ago

    Have a look at this notebook[0], not exactly what you're looking for but does provide a 'live' inspector of the various drawing operations contained in a PDF.

    [0]: https://observablehq.com/@player1537/pdf-utilities

    • svat a day ago

      Thanks, but I was not able to figure out how to get any use out of the notebook above. In what sense is it a 'live' inspector? All it seems to do is to just decompose the PDF into separate “ops” and “args” arrays (neither of which is meaningful without the other), but it does not seem “live” in any sense — how can one find the ops (and args) corresponding to a region of the PDF page, or vice-versa?

      • dleeftink a day ago

        You can load up your own PDF and select a page up front, after which it will display the opcodes for this page. Operations are not structurally grouped, but decomposed into three aligned arrays which can be grouped to your liking based on opcode or used as coordinates for intersection queries (e.g. combining the ops and args arrays).

        The 'liveness' here is that you can derive multiple downstream cells (e.g. filters, groupings, drawing instructions) from the initial parsed PDF, which will update as you swap out the PDF file.

herodotus 13 hours ago

This is mostly what I worked on for many years at Apple, with reasonable success. The main secret was to accept that everything was geometry, and use cluster analysis to try to distinguish between word gaps and letter gaps. On many PDF documents it works really well, but there are so many different kinds of PDF documents that there are always cases where the results are not that great. If I were to do it today, I would stick with geometry, avoid OCR completely, but use machine learning. One big advantage of machine learning is that I could use existing tools to generate PDFs from known text, so that the training phase could be completely automatic. (Here is Bertrand Serlet announcing the feature at WWDC in 2009: https://youtu.be/FTfChHwGFf0?si=wNCfI9wZj1aj9rY7&t=308)
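
To make the cluster-analysis idea concrete, here is a toy sketch of the kind of thing that distinction boils down to: gather the horizontal gaps between consecutive glyphs on a line and split them into two clusters (letter gaps vs. word gaps) with a 1-D 2-means. This is only an illustration, assuming you already have glyph bounding boxes; it is not the actual Apple implementation:

    # Toy sketch: classify inter-glyph gaps as letter gaps vs. word gaps.
    # `boxes` is a list of (x0, x1) glyph extents for one text line, left to right.
    def split_words(boxes, iters=20):
        gaps = [b[0] - a[1] for a, b in zip(boxes, boxes[1:])]
        if not gaps:
            return [boxes]
        lo, hi = min(gaps), max(gaps)            # initial cluster centers
        for _ in range(iters):                   # 1-D 2-means on gap widths
            letters = [g for g in gaps if abs(g - lo) <= abs(g - hi)]
            words = [g for g in gaps if abs(g - lo) > abs(g - hi)]
            lo = sum(letters) / len(letters) if letters else lo
            hi = sum(words) / len(words) if words else hi
        threshold = (lo + hi) / 2
        groups, current = [], [boxes[0]]
        for gap, box in zip(gaps, boxes[1:]):
            if gap > threshold:                  # wide gap: start a new word
                groups.append(current)
                current = []
            current.append(box)
        groups.append(current)
        return groups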

kbyatnal a day ago

"PDF to Text" is a bit simplified IMO. There's actually a few class of problems within this category:

1. reliable OCR from documents (to index for search, feed into a vector DB, etc)

2. structured data extraction (pull out targeted values)

3. end-to-end document pipelines (e.g. automate mortgage applications)

Marginalia needs to solve problem #1 (OCR), which is luckily getting commoditized by the day thanks to models like Gemini Flash. I've now seen multiple companies replace their OCR pipelines with Flash for a fraction of the cost of previous solutions, it's really quite remarkable.
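
For a sense of what that replacement looks like in practice, a minimal sketch with Google's Python SDK (the model name, prompt, and file are placeholders, and real pipelines obviously do more than this):

    # Sketch: send a PDF to Gemini Flash and ask for a Markdown transcription.
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")   # placeholder model name
    uploaded = genai.upload_file("scan.pdf")            # Files API accepts PDFs
    resp = model.generate_content(
        [uploaded, "Transcribe this document to Markdown, preserving tables."])
    print(resp.text)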

Problems #2 and #3 are much more tricky. There's still a large gap for businesses in going from raw OCR outputs -> document pipelines deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.

You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort. The future is definitely moving in this direction though.

Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.ai)

  • miki123211 21 hours ago

    There's also #4, reliable OCR and semantics extraction that works across many diverse classes of documents, which is relevant for accessibility.

    This is hard because:

    1. Unlike a business workflow which often only deals with a few specific kinds of documents, you never know what the user is going to get. You're making an abstract PDF reader, not an app that can process court documents in bankruptcy cases in Delaware.

    2. You don't just need the text (like in traditional OCR), you need to recognize tables, page headers and footers, footnotes, headings, mathematics etc.

    3. Because this is for human consumption, you want to minimize errors as much as possible, which means not using OCR when not needed, and relying on the underlying text embedded within the PDF while still extracting semantics. This means you essentially need two different paths, when the PDF only consists of images and when there are content streams you can get some information from.

    3.1. But the content streams may contain different text from what's actually on the page, e.g. white-on-white text to hide information the user isn't supposed to see, or diacritics emulation with commands that manually draw acute accents instead of using proper unicode diacritics (LaTeX works that way).

    4. You're likely running as a local app on the user's (possibly very underpowered) device, and likely don't have an associated server and subscription, so you can't use any cloud AI models.

    5. You need to support forms. Since the user is using accessibility software, presumably they can't print and use a pen, so you need to handle the ones meant for printing too, not just the nice, spec-compatible ones.

    This is very much an open problem and is not even remotely close to being solved. People have been taking stabs at it for years, but all current solutions suck in some way, and there's no single one that solves all 5 points correctly.

  • noosphr 20 hours ago

    >replace their OCR pipelines with Flash for a fraction of the cost of previous solutions, it's really quite remarkable.

    As someone who had to build custom tools because VLMs are so unreliable: anyone who uses VLMs on unprocessed images is in for more pain than the providers who let LLMs interact directly with consumers without guard rails.

    They are very good at image labeling. They are ok at very simple documents, e.g. single column text, centered single level of headings, one image or table per page, etc. (which is what all the MVP demos show). They need another trillion parameters to become bad at complex documents with tables and images.

    Right now they hallucinate so badly that you simply _can't_ use them for something as simple as a table with a heading at the top, data in the middle and a summary at the bottom.

    • th0ma5 17 hours ago

      I wish I could upvote you more. The compounding errors of these document solutions preclude what people assume must be possible.

  • varunneal a day ago

    I've been hacking away at trying to process PDFs into Markdown, having encountered similar obstacles to OP regarding header detection (and many other issues). OCR is fantastic these days, but maintaining the global structure of the document is much trickier. Consistent HTML still seems out of reach for large documents. I'm having half-decent results with Markdown using multiple passes of an LLM to extract document structure and feeding it in contextually for page-by-page extraction.

    • dstryr 17 hours ago

      Give this project a try. I've been using it with promising results.

      https://github.com/matthsena/AlcheMark

      • aorth 11 hours ago

        I tried with one PDF and was surprised to see it connect to some cloud service:

          2025-05-14 07:58:49,373 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
          2025-05-14 07:58:50,446 - urllib3.connectionpool - DEBUG - https://openaipublic.blob.core.windows.net:443 "GET /encodings/o200k_base.tiktoken HTTP/1.1" 200 361 3922
        
        The project's README doesn't mention that anywhere...
        • degamad 10 hours ago

          The project's README mentions that it uses tiktoken[0], which is a separate project created by OpenAI.

          tiktoken downloads token models the first time you use them, but it does not mention that. It does cache the models, so you shouldn't see more of those connections, if I'm understanding the code correctly.

          [0] <https://github.com/openai/tiktoken>
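
          For reference, the download happens the first time the encoding is loaded; pointing tiktoken at a local cache directory (it appears to honor the TIKTOKEN_CACHE_DIR environment variable) keeps later runs offline:

            # First call fetches o200k_base from openaipublic.blob.core.windows.net;
            # later calls should hit the on-disk cache.
            import os
            os.environ["TIKTOKEN_CACHE_DIR"] = "/tmp/tiktoken-cache"  # optional

            import tiktoken
            enc = tiktoken.get_encoding("o200k_base")
            print(enc.encode("hello world"))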

dwheeler a day ago

The better solution is to embed, in the PDF, the editable source document. This is easily done by LibreOffice. Embedding it takes very little space in general (because it compresses well), and then you have MUCH better information on what the text is and its meaning. It works just fine with existing PDF readers.

  • lelandfe a day ago

    The better solution to a search engine extracting text from existing PDFs is to provide advice on how to author PDFs?

    What's the timeline for this solution to pay off?

    • chaps 21 hours ago

      Microsoft is one of the bigger contributors to this. Like -- why does excel have a feature to export to PDF, but not a feature to do the opposite? That export functionality really feels like it was given to a summer intern who finished it in two weeks and never had to deal with it ever again.

      • bartread 13 hours ago

        It does have a feature to do the opposite. You can, in theory, extract tabular data from PDFs with Excel (note: only on the Windows version; this function isn’t available in macOS Excel).

        In practice I’ve found it to be extremely unreliable, and I suspect this may be because the optional metadata that semantically defines a table as a table is missing from the errant PDF. It’ll still look like a table when rendered, but there’s nothing that defines it as such. It’s just a bunch of graphical and text elements that, when rendered, happen to look like a table.

      • mattigames 19 hours ago

        Because then we would have two formats, "PDFs generated by Excel" and "real PDFs", with the same extension, and that would be its own can of worms for Microsoft and for everyone else.

  • layer8 a day ago

    That’s true, but it also opens up the vulnerability of the source document being arbitrarily different from the rendered PDF content.

  • kerkeslager a day ago

    That's true, but it's dependent on the creator of the PDF having aligned incentives with the consumer of the PDF.

    In the e-Discovery field, it's commonplace for those providing evidence to dump it into a PDF purely so that it's harder for the opposing side's lawyers to consume. If both sides have lots of money this isn't a barrier, but for example public defenders don't have funds to hire someone (me!) to process the PDFs into a readable format, so realistically they end up taking much longer to process the data, which takes a psychological toll on the defendant. And that's if they process the data at all.

    The solution is to make it illegal to do this: wiretap data, for example, should be provided in a standardized machine-readable format. There's no ethical reason for simple technical friction to be affecting the outcomes of criminal proceedings.

    • lurk2 16 hours ago

      > The solution is to make it illegal to do this: wiretap data, for example, should be provided in a standardized machine-readable format. There's no ethical reason for simple technical friction to be affecting the outcomes of criminal proceedings.

      I can’t speak to wiretaps specifically, but when it comes to the legal field, this is usually already how it operates. GDPR, for example, makes specific provisions that user data must be provided in an accessible, machine-readable format. Most jurisdictions also aren’t going to look kindly on physical document dumping and will require that documents be provided in a machine-readable format. PDF is the legal industry standard for all outbound files. The consistency of its formatting makes up for the difficulties involved with machine-readability.

      There’s not a huge incentive to find an alternative because most firms will just charge a markup on the time a clerk spends reading through and transcribing those PDFs. If cost is a concern, though, most jurisdictions will require the party in possession of the original documents to provide them in a machine-readable format (e.g. providing bank records as Excel spreadsheets rather than as PDFs).

      • kerkeslager 14 hours ago

        I'm not sure I understand what you're saying? PDF isn't a machine-readable format for most kinds of data and keeping inherent court costs down is always a concern because it keeps the courts fair to the poor.

    • giovannibonetti a day ago

      I wonder if AI will solve that

      • GaggiX a day ago

        There are specialized models, but even generic ones like Gemini 2.0 Flash are really good and cheap, you can use them and embed the OCR inside the PDF to index to the original content.

        • kerkeslager 21 hours ago

          This fundamentally misunderstands the problem. Effective OCR predates the popularity of ChatGPT and e-Discovery folks were already using it--AI in the modern sense adds nothing to this. Indexing the resulting text was also already possible--again AI adds nothing. The problem is that the resultant text lacks structure: being able to sort/filter wiretap data by date/location, for example, isn't inherently possible because you've obtained text or indexed it. AI accuracy simply isn't high enough to solve this problem without specialized training--off the shelf models simply won't work accurately enough even if you can get around the legal problems of feeding potentially-sensitive information into a model. AI models trained on a large enough domain-specific dataset might work, but the existing off-the-shelf models certainly are not accurate enough. And there are a lot of subdomains--wiretap data, cell phone GPS data, credit card data, email metadata, etc., which would each require model training.

          Fundamentally, the solution to this problem is to not create it in the first place. There's no reason for there to be a structured data -> PDF -> AI -> structured data pipeline when we can just force people providing evidence to provide the structured data.

  • yxhuvud a day ago

    Sure, and if you have access to the source document the pdf was generated from, then that is a good thing to do.

    But generally speaking, you don't have that control.

  • carabiner a day ago

    I bet 90% of the problem space is legacy PDFs. My company has thousands of these. Some are crappy scans. Some have Adobe's OCR embedded, but most have none at all.

1vuio0pswjnm7 a day ago

Below is a PDF. It is a .txt file. I can save it with a .pdf extension and open it in a PDF viewer. I can make changes in a text editor. For example, by editing this text file, I can change the text displayed on the screen when the PDF is opened, the font, font size, line spacing, the maximum characters per line, number of lines per page, the paper width and height, as well as portrait versus landscape mode.

   %PDF-1.4
   1 0 obj
   <<
   /CreationDate (D:2025)
   /Producer 
   >>
   endobj
   2 0 obj
   <<
   /Type /Catalog
   /Pages 3 0 R
   >>
   endobj
   4 0 obj
   <<
   /Type /Font
   /Subtype /Type1
   /Name /F1
   /BaseFont /Times-Roman
   >>
   endobj
   5 0 obj
   <<
     /Font << /F1 4 0 R >>
     /ProcSet [ /PDF /Text ]
   >>
   endobj
   6 0 obj
   <<
   /Type /Page
   /Parent 3 0 R
   /Resources 5 0 R
   /Contents 7 0 R
   >>
   endobj
   7 0 obj
   <<
   /Length 8 0 R
   >>
   stream
   BT
   /F1 50 Tf
   1 0 0 1 50 752 Tm
   54 TL
   (PDF is)' 
   ((a) a text format)'
   ((b) a graphics format)'
   ((c) (a) and (b).)'
   ()'
   ET
   endstream
   endobj
   8 0 obj
   53
   endobj
   3 0 obj
   <<
   /Type /Pages
   /Count 1
   /MediaBox [ 0 0 612 792 ]
   /Kids [ 6 0 R ]
   >>
   endobj
   xref
   0 9
   0000000000 65535 f 
   0000000009 00000 n 
   0000000113 00000 n 
   0000000514 00000 n 
   0000000162 00000 n 
   0000000240 00000 n 
   0000000311 00000 n 
   0000000391 00000 n 
   0000000496 00000 n 
   trailer
   <<
   /Size 9
   /Root 2 0 R
   /Info 1 0 R
   >>
   startxref
   599
   %%EOF
  • swsieber a day ago

    It can also have embedded binary streams. It was not made for text. It was made for layout and graphics. You give nice examples, but each of those lines could have been broken up into one call per character, or per word, even out of order.

    • hnick 16 hours ago

      It can also use fonts which map glyphs via characters which do not represent the final visual item e.g. "PDF" could be "1#F" and you only really know what it looks like by rendering then viewing/OCR.

      A nice file won't, but sometimes the best work is in not dealing with nice things.

      • 90s_dev 13 hours ago

        See this is why we can't have nice things.

  • 1vuio0pswjnm7 a day ago

    "PDF" is an acronym for for "Portable Document Format"

    "2.3.2 Portability

    A PDF file is a 7-bit ASCII file, which means PDF files use only the printable subset of the ASCII character set to describe documents even those with images and special characters. As a result, PDF files are extremely portable across diverse hardware and operating system environments."

    https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...

    • normie3000 14 hours ago

      > PDF files use only the printable subset of the ASCII character set to describe documents even those with images and special characters

      Great, so PDF source code is easily printable?

      • gpvos 10 hours ago

        Except most are compressed or contain binary streams. You can transform any PDF into an equivalent ASCII PDF though, e.g. using qpdf.

  • jimjimjim 12 hours ago

    This is the "Hello World" of PDFs.

    Most pdfs these days have all of the objs compressed with deflate.

    And then, because that didn't make it difficult enough to follow, a lot of PDFs have most of the objects grouped up inside object-stream objects, which then get compressed. So you can't just have a text editor search for "6 0 obj" when you are tracking down the target of a "6 0 R" reference.

trevor-e 15 hours ago

Having built some toy parsers for PDF files in the past, it was a huge wtf moment for me when I realized how the format works. With that in mind, it's even more puzzling how often it's used in text-heavy cases.

I always think about the invoicing use-case: digital systems should be able to easily extract data from the file while it is still formatted visually for humans. It seems like the tech world would be much better off if we migrated to a better format.

  • Uvix 13 hours ago

    XML+XSLT was almost this, but unfortunately browsers no longer support it for local XML files, only ones on a remote server.

bartread a day ago

Yeah, getting text - even structured text - out of PDFs is no picnic. Scraping a table out of an HTML document is often straightforward even on sites that use the "everything's a <div>" (anti-)pattern, and especially on sites that use more semantically useful elements, like <table>.

Not so PDFs.

I'm far from an expert on the format, so maybe there is some semantic support in there, but I've seen plenty of PDFs where tables are simply a loose assemblage of graphical and text elements that, only when rendered, are easily discernible as a table because they're positioned in such a way that they render as a table.

I've actually had decent luck extracting tabular data from PDFs by converting the PDFs to HTML using the Poppler PDF utils, then finding the expected table header, and then using the x-coordinate of the HTML elements for each value within the table to work out columns and extract values for each row.

It's kind of grotty, but it seems reliable for what I need. Certainly much more so than going via formatted plaintext, which has issues with inconsistent spacing and the insertion of newlines into the middle of rows.
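
For the record, here is a rough sketch of the same x-coordinate idea using pdfplumber rather than the Poppler HTML route (the column edges, file name, and header handling are all assumptions, not the production code described above):

    # Sketch: bucket words into table columns by x-coordinate using pdfplumber.
    from bisect import bisect_right
    import pdfplumber

    with pdfplumber.open("report.pdf") as pdf:
        page = pdf.pages[0]
        # Suppose the table header told us the left edge of each column:
        col_edges = [72.0, 180.0, 310.0, 450.0]
        rows = {}
        for w in page.extract_words():
            col = max(0, bisect_right(col_edges, w["x0"]) - 1)
            row_key = round(w["top"])            # words on roughly the same baseline
            rows.setdefault(row_key, {}).setdefault(col, []).append(w["text"])
        for y in sorted(rows):
            print([" ".join(rows[y].get(c, [])) for c in range(len(col_edges))])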

  • spacecaps 3 hours ago

    I was irritated that I couldn't extract data from PDFs in a similar way to web pages + BeautifulSoup, so I built a library that (kind of) does just that[0]. It does a bunch of other nonsense, but the main goal is a more "human" way of interacting, e.g. `page.find('text:bold:contains("Summary")').below().extract_text()`.

    And since every PDF is its own bespoke nightmare, I'm also trying to build up a collection of awful-to-extract-data-from examples to serve as the foundation for a how-to library[1].

    [0] https://jsoma.github.io/natural-pdf/

    [1] https://badpdfs.com/

  • hermitcrab 21 hours ago

    I am hoping at some point to be able to extract tabular data from PDFs for my data wrangling software. If anyone knows of a library that can extract tables from PDFs, can be integrated into a C++ app, and is free or less than a few hundred $, please let me know!

    • ______ 17 hours ago

      pdfplumber is great for table extraction, but it is Python.
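
      For anyone reading along, the relevant call is roughly this (a minimal sketch; the file name is a placeholder):

        import pdfplumber

        with pdfplumber.open("statement.pdf") as pdf:
            for table in pdf.pages[0].extract_tables():
                for row in table:
                    print(row)   # list of cell strings (None for empty cells)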

      • hermitcrab 3 hours ago

        Thanks, but I prefer to keep everything C++ for simplicity and speed.

  • yxhuvud a day ago

    My favorite is (official, governmental) documents that have one set of text that is rendered, and a totally different set of text that you get if you extract the text the normal way.

  • j45 a day ago

    PDFs inherently are a markup / xml format, the standard is available to learn from.

    It's possible to create the same PDF in many, many, many ways.

    Some might lean towards exporting a layout containing text and graphics from a graphics suite.

    Others might lean towards exporting text and graphics from a word processor, which is words first.

    The lens of how the creating app deals with information is often something that has input on how the PDF is output.

    If you're looking for an off the shelf utility that is surprisingly decent at pulling structured data from PDFs, tools like cisdem have already solved enough of it for local users. Lots of tools like this out there, many do promise structured data support but it needs to match what you're up to.

    • layer8 a day ago

      > PDFs inherently are a markup / xml format

      This is false. PDFs are an object graph containing imperative-style drawing instructions (among many other things). There’s a way to add structural information on top (akin to an HTML document structure), but that’s completely optional and only serves as auxiliary metadata, it’s not at the core of the PDF format.

      • bartread 18 hours ago

        > but that’s completely optional and only serves as auxiliary metadata, it’s not at the core of the PDF format.

        This is what I kind of suspected but, as I said in my original comment, I'm not an expert and for the PDFs I'm reading I didn't need to delve further because that metadata simply isn't in there (although, boy do I wish it was) so I needed to use a different approach. As soon as I realised what I had was purely presentation I knew it was going to be a bit grim.

      • davidthewatson a day ago

        Thanks for your comment.

        Indeed. Therein lies the rub.

        Why?

        Because despite the fact that I've spent several years of my latent career crawling, parsing, and outputting PDF data, I see now that pointing my LLM stack at a directory of *.pdf just makes the invisible encoding of the object graph visible. It's a skeptical science.

        The key transclusion may be to move from imperative to declarative tools or conditional to probabilistic tools, as many areas have in the last couple decades.

        I've been following John Sterling's ocaml work for a while on related topics and the ideas floating around have been a good influence on me in forests and their forester which I found resonant given my own experience:

        https://www.jonmsterling.com/index/index.xml

        https://github.com/jonsterling/forest

        I was gonna email john and ask whether it's still being worked on as I hope so, but I brought it up this morning as a way out of the noise that imperative programming PDF has been for a decade or more where turtles all the way down to the low-level root cause libraries mean that the high level imperative languages often display the exact same bugs despite significant differences as to what's being intended in the small on top of the stack vs the large on the bottom of the stack. It would help if "fitness for a particular purpose" decisions were thoughtful as to publishing and distribution but as the CFO likes to say, "Dave, that ship has already sailed." Sigh.

        ¯\_(ツ)_/¯

      • j45 18 hours ago

        I appreciate the clarification. Should have been more precise with my terminology.

        That being said, I think I'm talking about the forest of PDFs.

        When I said PDFs have a "markup-like structure," I was talking from my experience manually writing PDFs from scratch using Adobe's spec.

        PDFs definitely have a structured, hierarchical format with nested elements that looks a lot like markup languages conceptually.

        The objects have a structure comparable to DOM-like structures - there's clear parent-child relationships just like in markup languages. Working with tags like "<<" and ">>" feels similar to markup tags when hand coding them.

        This is an article that highlights what I have seen (much cleaner PDF code): "The Structure of a PDF File" (https://medium.com/@jberkenbilt/the-structure-of-a-pdf-file-...) which says:

        "There are several types of objects. If you are familiar with JSON, YAML, or the object model in any reasonably modern programming language, this will seem very familiar to you... A PDF object may have one of the following types: String, Number, Boolean, Null, Name, Array, Dictionary..."

        This structure with dictionaries in "<<" and ">>" and arrays in brackets really gave me markup vibes when coding to the spec (https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...).

        While PDFs are an object graph with drawing instructions like you said, the structure itself looks a lot like markup formats.

        Might be just a difference in choosing to focus on the forest vs the trees.

        That hierarchical structure is why different PDF creation methods can make such varied document structures, which is exactly why text extraction is so tricky.

        Learning to hand code PDFs in many ways, lets you learn to read and unravel them a little differently, maybe even a bit easier.

        • layer8 17 hours ago

          Markup is only indirectly related to hierarchical structure. “Markup” means that there is text that is being “marked up” with additional attributes (styling, structure information, metadata, …). This is how HTML and XML work, and also languages like TeX, Troff, and Markdown. For example, in the text “this is some text”, you can mark up the word “some” as being emphasized, as in “this is <em>some</em> text”.

          The general principle is that the base content is plain text, which is augmented with markup information, which may or may not have hierarchical aspects. You can simply strip away the markup again and recover just the text. That’s not at all how PDF works, however.

          You cite a comparison to JSON and YAML. Those are not markup languages (despite what YAML originally was an abbreviation for, see [0]). (HTML also isn’t DOM.)

          [0] https://stackoverflow.com/a/18928199

    • jimjimjim 12 hours ago

      uh. There is very little XML and the spec is a thousand pages long.

snehanairdoc 6 hours ago

This was a great read. You've done an excellent job breaking down what makes PDFs so uniquely annoying to work with. People often underestimate how much of the “document-ness” (like headings, paragraphs, tables) is just visual, with no underlying semantic structure.

We ran into many of the same challenges while working on Docsumo, where we process business documents like invoices, bank statements, and scanned PDFs. In real-world use cases, things get even messier: inconsistent templates, rotated scans, overlapping text, or documents generated by ancient software with no tagging at all.

One thing we’ve found helpful (in addition to heuristics like font size/weight and spacing) is combining layout parsing with ML models trained to infer semantic roles (like "header", "table cell", "footer", etc.). It’s far from perfect, but it helps bridge the gap between how the document looks and what it means.

Really appreciate posts like this. PDF wrangling is a dark art more people should talk about.

gerdesj 16 hours ago

PDF is a display format. It is optimised for eyeballs and printers. There has been some feature creep. It is a rubbish machine data transfer mechanism but really good for humans and say storing a page of A4 (letter for the US).

So, you start off with the premise that a .pdf stores text and you want that text. Well that's nice: grow some eyes!

Otherwise, you are going to have to get to grips with some really complicated stuff. For starters, is the text ... text or is it an image? Your eyes don't care and will just work (especially when you pop your specs back on) but your parser is probably seg faulting madly. It just gets worse.

PDF is for humans to read. Emulate a human to read a PDF.

  • amai 5 hours ago

    That's it. I could not have said it better.

ted_dunning a day ago

One of my favorite documents for highlighting the challenges described here is the PDF for this article:

https://academic.oup.com/auk/article/126/4/717/5148354

The first page is classic with two columns of text, centered headings, a text inclusion that sits between the columns and changes the line lengths and indentations for the columns. Then we get the fun of page headers that change between odd and even pages and section header conventions that vary drastically.

Oh... to make things even better, paragraphs don't get extra spacing and don't always have an indented first line.

Some of everything.

  • JKCalhoun a day ago

    The API in CoreGraphics (MacOS) for PDF, at a basic level, simply presented the text, per page, in the order in which it was encoded in the dictionaries. And 95% of the time this was pretty good — and when working with PDFKit and Preview on the Mac, we got by with it for years.

    If you stepped back you could imagine the app that originally had captured/produced the PDF — perhaps a word processor — it was likely rendering the text into the PDF context in some reasonable order from its own text buffer(s). So even for two columns, you rather expect, and often found, that the text flowed correctly from the left column to the right. The text was therefore already in the correct order within the PDF document.

    Now, footers, headers on the page — that would be anyone's guess as to what order the PDF-producing app dumped those into the PDF context.

hilbert42 3 hours ago

"It doesn’t have text in the way you might think of it, but more of a mapping of glyphs to coordinates on “paper”."

I've often had trouble extracting text from PDFs, it's time consuming and messy, so a quick question.

The PDF format works pretty well for what it does but it's now pretty ancient, so does anyone know if there's any newer format on the horizon that could be a next-generation replacement that would make it much easier to extract its data and export it to another format (say, docx, odt, etc.)?

gibsonf1 a day ago

We[1] Create "Units of Thought" from PDF's and then work with those for further discovery where a "Unit of Thought" is any paragraph, title, note heading - something that stands on its own semantically. We then create a hierarchy of objects from that pdf in the database for search and conceptual search - all at scale.

[1] https://graphmetrix.com/trinpod-server https://trinapp.com

  • IIAOPSW 19 hours ago

    I'm tempted to try it. My use case right now is a set of documents which are annual financial and statutory disclosures of a large institution. Every year they are formatted / organized slightly differently which makes it enormously tedious to manually find and compare the same basic section from one year to another, but they are consistent enough to recognize analogous sections from different years due to often reusing verbatim quotes or highly specific key words each time.

    What I really want to do is take all these docs and just reorder all the content such that I can look at page n (or section whatever) scrolling down and compare it between different years by scrolling horizontally. Ideally with changes from one year to the next highlighted.

    Can your product do this?

    • gibsonf1 14 hours ago

      Probably without too much difficulty. If you have a sample to confirm, that would be great. frederick @ graphmetrix . com

lewtun 8 hours ago

> The absolute best way of doing this is these days is likely through a vision based machine learning model, but that is an approach that is very far away from scaling to processing hundreds of gigabytes of PDF files off a single server with no GPU.

SmolDocling is pretty fast and the ONNX weights can be scaled to many CPUs: https://huggingface.co/ds4sd/SmolDocling-256M-preview

Not sure what time scale the author had in mind for processing GBs of PDFs, but the future might be closer than “very far away”

bob1029 a day ago

When accommodating the general case, solving PDF-to-text is approximately equivalent to solving JPEG-to-text.

The only PDF parsing scenario I would consider putting my name on is scraping AcroForm field values from standardized documents.

  • kapitalx a day ago

    This is approximately the approach we're also taking at https://doctly.ai. Add to that a "multiple experts" approach for analyzing the image (for our 'ultra' version), and we get really good results. And we're making it better constantly.

xnx a day ago

Weird that there's no mention of LLMs in this article even though the article is very recent. LLMs haven't solved every OCR/document data extraction problem, but they've dramatically improved the situation.

  • marginalia_nu a day ago

    Author here: LLMs are definitely the new gold standard for smaller bodies of shorter documents.

    The article is in the context of an internet search engine, the corpus to be converted is of order 1 TB. Running that amount of data through an LLM would be extremely expensive, given the relatively marginal improvement in outcome.

    • noosphr 20 hours ago

      A PDF corpus with a size of 1tb can mean anything from 10,000 really poorly scanned documents to 1,000,000,000 nicely generated latex pdfs. What matters is the number of documents, and the number of pages per document.

      For the first I can run a segmentation model + traditional OCR in a day or two for the cost of warming my office in winter. For the second you'd need a few hundred dollars and a cloud server.

      Feel free to reach out. I'd be happy to have a chat and do some pro-bono work for someone building a open source tool chain and index for the rest of us.

    • mediaman a day ago

      Corpus size doesn't mean much in the context of a PDF, given how variable that can be per page.

      I've found Google's Flash to cut my OCR costs by about 95+% compared to traditional commercial offerings that support structured data extraction, and I still get tables, headers, etc from each page. Still not perfect, but per page costs were less than one tenth of a cent per page, and 100 gb collections of PDFs ran to a few hundreds of dollars.

  • simonw a day ago

    I've had great results against PDFs from recent vision models. Gemini, OpenAI and Claude can all accept PDFs directly now and treat them as image input.

    For longer PDFs I've found that breaking them up into images per page and treating each page separately works well - feeding a thousand page PDF to even a long context model like Gemini 2.5 Pro or Flash still isn't reliable enough that I trust it.
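
    A minimal sketch of the page-splitting step (using PyMuPDF to rasterize; the DPI is a guess and the model call itself is omitted):

      # Render each PDF page to its own PNG before sending it to a vision model.
      import fitz  # PyMuPDF

      doc = fitz.open("longdoc.pdf")
      for i, page in enumerate(doc):
          pix = page.get_pixmap(dpi=150)        # 150 dpi is usually enough for text
          pix.save(f"page-{i:04d}.png")
          # ...send the image plus your prompt to the model, one page at a time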

    As always though, the big challenge of using vision LLMs for OCR (or audio transcription) tasks is the risk of accidental instruction following - even more so if there's a risk of deliberately malicious instructions in the documents you are processing.

  • j45 a day ago

    LLMs are definitely helping approach some problems that couldn't be approached to date.

wrs a day ago

Since these are statistical classification problems, it seems like it would be worth trying some old-school machine learning (not an LLM, just an NN) to see how it compares with these manual heuristics.
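
As a sketch of what that might look like: treat each text span's layout features (font size, boldness, indentation, line length) as inputs to a small classifier, e.g. with scikit-learn. The features and labels below are made up purely for illustration:

    # Toy heading-vs-body classifier on simple layout features.
    from sklearn.neural_network import MLPClassifier

    # features: [font_size, is_bold, x_indent, line_length_in_chars]
    X = [[18.0, 1, 72, 24], [10.5, 0, 72, 90], [14.0, 1, 72, 40],
         [10.5, 0, 90, 85], [20.0, 1, 200, 12], [10.5, 0, 72, 88]]
    y = ["heading", "body", "heading", "body", "heading", "body"]

    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
    clf.fit(X, y)
    print(clf.predict([[16.0, 1, 72, 30]]))      # -> likely "heading"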

  • marginalia_nu a day ago

    I imagine that would work pretty well given an adequate and representative body of annotated sample data. Though that is also not easy to come by.

    • wrs 12 hours ago

      But if you believe in your manual heuristics enough to ship them, you must already have a body of tests that you're happy with, right?

      Also seems like this is a case where generating synthetic data would be a big help. You don't have to use only real-world documents for training, just examples of the sorts of things real-world documents have in them. Make a vast corpus of semi-random documents in semi-random fonts and settings, printed from Word, Pandoc, LaTeX, etc.

    • ted_dunning a day ago

      Actually, it is easy to come up with reasonably decent heuristics that can auto-tag a corpus. From that you can look for anomalies and adjust your tagging system.

      The problem of getting a representative body is (surprisingly) much harder than the annotation. I know. I spent quite some time years ago doing this.

incanus77 17 hours ago

I did some contract work some years back with a company that had a desktop product (for Mac) which applied some smarts while printing to strip out extraneous things on pages (such as ads on webpages) and to avoid the case where only a line or two gets printed on a page, wasting paper. It initially got into things at the PostScript layer, which unsurprisingly was horrifying, but eventually worked on PDFs. This required finding and interpreting various textual parts of the passed documents and was a pretty big technical challenge.

While I'm not convinced it was viable as a business, it feels like something platform/OS companies could focus on to have a measurable environmental and cost impact.

rekoros 17 hours ago

I've been using Azure's "Document Intelligence" thingy (the prebuilt "read" model) to extract text from PDFs with pretty good results [1]. Their terminology is so bad that it's easy to dismiss the whole thing as another Microsoft pile, but it actually, like, for real, works.

[1] https://learn.microsoft.com/en-us/azure/ai-services/document...
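
For reference, a hedged sketch of calling the prebuilt "read" model through the azure-ai-formrecognizer client library (an older SDK for the same Document Intelligence service; package and method names vary a bit between SDK versions):

```python
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<key>"),
)

with open("statement.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-read", document=f)

result = poller.result()
print(result.content)                      # plain text in reading order
for page in result.pages:
    print(page.page_number, len(page.lines), "lines")
```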

Sharlin 20 hours ago

Some of the unsung heroes of the modern age are the programmers who, through what must have involved a lot of weeping and gnashing of teeth, have managed to implement the find, select, and copy operations in PDF readers.

noosphr 20 hours ago

I've worked on this in my day job: extracting _all_ relevant information from financial-services PDFs for a BERT-based search engine.

The only way to solve that is with a segmentation model followed by a regular OCR model, plus whatever other specialized models you need to extract other types of data. VLMs aren't ready for prime time and won't be for a decade or more.

What worked was using DocLayNet-trained YOLO models to find the areas of the document that were text, images, tables or formulas: https://github.com/DS4SD/DocLayNet If you don't care about anything but text, you can feed the results into tesseract directly (but for the love of god, read the manual). Congratulations, you're done.

Here are some pre-trained models that work OK out of the box: https://github.com/ppaanngggg/yolo-doclaynet I found that we needed to increase the horizontal resolution from ~700px to ~2100px for financial data segmentation.
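
A condensed sketch of that segmentation-then-OCR pipeline, assuming one of the pre-trained checkpoints from the repo above; the weights filename and the exact DocLayNet class names exposed by the model are assumptions to check against your checkpoint:

```python
from PIL import Image
from ultralytics import YOLO
import pytesseract

model = YOLO("yolov8n-doclaynet.pt")          # hypothetical checkpoint name
page = Image.open("page_001.png")             # page pre-rendered at high DPI

blocks = []
for box in model(page)[0].boxes:
    label = model.names[int(box.cls)]
    if label not in ("Text", "Section-header", "Title"):
        continue                              # ignore tables/pictures/formulas here
    x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
    crop = page.crop((x1, y1, x2, y2))
    blocks.append((y1, pytesseract.image_to_string(crop)))

blocks.sort()                                 # crude top-to-bottom reading order
print("\n".join(text for _, text in blocks))
```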

VLMs, on the other hand, still choke on long text and hallucinate unpredictably. Worse, they can't understand nested data. Give _any_ current model nothing harder than three nested rectangles with text under each and it will not extract the text correctly. Given that nested rectangles describe every table, no VLM can currently extract data from anything but the most straightforward of tables. But it will happily lie to you that it did - after all, a mining company should own a dozen bulldozers, right? And if they each cost $35.000 it must be an amazing deal they got, right?

  • cess11 8 hours ago

    That looks like a pretty good starting point, thanks. I've been dabbling in vision models but need a much higher degree of accuracy than they seem able to provide, opting instead for more traditional techniques and handling errors manually.

    • noosphr 7 hours ago

      For non-table documents, a fine-tuned YOLOv8 + tesseract with _good_ image pre-processing has basically a zero percent error rate on monolingual texts. I say basically because the training data had worse labels than what the multi-model system produced, in the cases I double-checked manually.

      But no one reads the manual on tesseract and everyone ends up feeding it garbage, with predictable results.
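
      "Reading the manual" mostly boils down to handing tesseract a clean, high-resolution, binarized image and an explicit page-segmentation mode instead of a raw scan. A rough sketch with OpenCV + pytesseract; the scale factor, blur, and PSM choice are assumptions to tune per corpus:

      ```python
      import cv2
      import pytesseract

      img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

      # Upscale so the text lands near a ~300 DPI equivalent.
      img = cv2.resize(img, None, fx=2.0, fy=2.0, interpolation=cv2.INTER_CUBIC)

      # Denoise and binarize (Otsu) before OCR.
      img = cv2.GaussianBlur(img, (3, 3), 0)
      _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

      # --oem 1: LSTM engine; --psm 6: assume a single uniform block of text.
      print(pytesseract.image_to_string(img, lang="eng", config="--oem 1 --psm 6"))
      ```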

      Tables are an open research problem.

      We started training a custom version of this model: https://arxiv.org/pdf/2309.14962 but there wasn't a business case, since the BERT search model dealt well enough with the word soup that came out of EasyOCR. If you're interested, drop me a line. I'd love to get a model like that trained, since it's very low-hanging fruit that no one has done right.

      • cess11 5 hours ago

        Thanks, that's interesting research, I'll look into it.

EmilStenstrom a day ago

I think using Gemma3 in vision mode could be a good fit for converting PDFs to text. It's downloadable and runnable on a local computer, with decent memory requirements depending on which size you pick. Has anyone tried it?
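
An untested sketch of what that could look like through the ollama Python client plus pdf2image; the model tag and whether your local build accepts image input are assumptions to verify:

```python
import ollama                                  # pip install ollama
from pdf2image import convert_from_path       # pip install pdf2image (needs Poppler)

pages = convert_from_path("paper.pdf", dpi=200)
for i, page in enumerate(pages):
    path = f"page_{i}.png"
    page.save(path)
    reply = ollama.chat(
        model="gemma3",                        # assumed model tag with vision support
        messages=[{
            "role": "user",
            "content": "Transcribe this page to Markdown. Output only the text.",
            "images": [path],
        }],
    )
    print(reply["message"]["content"])
```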

rad_gruchalski a day ago

So many of these problems have been solved by mozilla pdf.js together with its viewer implementation: https://mozilla.github.io/pdf.js/.

  • egnehots a day ago

    I don't think so: pdf.js is able to render PDF content.

    Which is different from extracting "text". Text in a PDF can be encoded in many ways: as an actual image, as shapes (segments, quadratic Bézier curves...), or in an XML format (really easy to process).

    PDF viewers render text the way a printer would, processing commands that ultimately put pixels on the screen.

    But paragraphs, text layout, columns and tables are often lost in the process. Even though you can see them: so close, yet so far. That is why AI is quite strong at this task.

    • rad_gruchalski a day ago

      You are wrong. Pdf.js can extract text and has all the facilities required to render and extract formatting. The latest version can also edit PDF files. It's basically the same engine as the Firefox PDF viewer, which also has a document outline, search, linking, print preview, scaling, a scripting sandbox… it does not simply „render” a file.

      Regarding tables, this here https://www.npmjs.com/package/pdf-table-extractor does a very good job at table interpretation and works on top of pdf.js.

      I also didn’t say what works better or worse, neither do I go into PDF being good or bad.

      I simply said that a ton of these problems have already been covered by pdf.js.

    • lionkor a day ago

      Correct me if I'm wrong, but pdf.js actually has a lot of methods to manipulate PDFs, no?

      • rad_gruchalski 20 hours ago

        Yes, pdf.js can do that: https://github.com/mozilla/pdf.js/blob/master/web/viewer.htm....

        The purpose of my original comment was simply to say: there's an existing implementation, so if you're building a PDF viewer/editor and need inspiration, have a look. One of the reasons Mozilla does this is to provide a reference implementation. I'm not sure why people are upset with this. Though I could have explained it better.

  • zzleeper a day ago

    Any sense on how PDF.js compares against other tools such as pdfminer?

    • favorited 20 hours ago

      I did some very broad testing of several PDF text extraction tools recently, and PDF.js was one of the slowest.

      My use-case was specifically testing their performance as command-line tools, so that will skew the results to an extent. For example, PDFBox was very slow because you're paying the JVM startup cost with each invocation.

      Poppler's pdftotext utility and pdfminer.six were generally the fastest. Both produced serviceable plain-text versions of the PDFs, with minor differences in where they placed paragraph breaks.

      I also wrote a small program which extracted text using Chrome's PDFium, which also performed well, but building that project can be a nightmare unless you're Google. IBM's Docling project, which uses ML models, produced by far the best formatting, preserving much of the document's original structure – but it was, of course, enormously slower and more energy-hungry.

      Disclaimer: I was testing specific PDF files that are representative of the kind of documents my software produces.
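
      If you want to run a similar comparison on your own documents, a minimal harness for the two fast options above might look like this (assumes pdftotext on PATH and pdfminer.six installed):

      ```python
      import subprocess
      import time

      from pdfminer.high_level import extract_text   # pip install pdfminer.six

      PDF = "sample.pdf"

      t0 = time.perf_counter()
      poppler_text = subprocess.run(
          ["pdftotext", PDF, "-"], capture_output=True, text=True, check=True
      ).stdout
      t1 = time.perf_counter()
      pdfminer_text = extract_text(PDF)
      t2 = time.perf_counter()

      print(f"pdftotext:    {t1 - t0:.2f}s, {len(poppler_text)} chars")
      print(f"pdfminer.six: {t2 - t1:.2f}s, {len(pdfminer_text)} chars")
      ```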

    • rad_gruchalski a day ago

      I don’t know. I use pdf.js for everything PDF.

  • iAMkenough a day ago

    A good PDF reader makes the problems easier to deal with, but does not solve the underlying issue.

    The PDF itself is still flawed, even if pdf.js interprets it perfectly, which is still a problem for non-pdf.js viewers and tasks where "viewing" isn't the primary goal.

    • rad_gruchalski 21 hours ago

      Yeah. What I’m saying: pdf.js seems to have some of these solved. All I’m suggesting is have a look at it. I get it that for some PDF is a broken format.

remram 16 hours ago

I built a simple OSS tool for qualitative data analysis, which needs to turn uploaded documents into text (stripped HTML). PDFs have been a huge problem from day one.

I have investigated many tools, but two-column layouts and footers etc often still mess up the content.

It's hard to convince my (often non-technical) users that this is a difficult problem.

  • bartread 7 hours ago

    Try Poppler’s pdftohtml command line tool. For me that seems to do a good job of spitting out multi-column text in the right order. Then you have the much easier task of extracting the text from the HTML.

    Also, if it does come out in the wrong order for any pages, you can analyse element coordinates to figure out which column each chunk of text belongs in (see the sketch after this comment).

    (Note that you may have to deal with sub-columns if tables are present in any columns. I’ve never had this in my data but you may also find blocks that span across more than one column, either in whole or in part.)

    They also have a pdftotext tool that may do the job for you if you disable its layout option. If you run it with the layout option enabled you’ll find it generates multi-column text in the output, as it tries to closely match the layout of the input PDF.

    I think the pdftohtml tool is probably the way to go just because the extra metadata on each element is probably going to be helpful in determining how to treat that element, and it’s obviously relatively straightforward to strip out the HTML tags to extract plain text.
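
    A rough sketch of the coordinate trick: pdftohtml's XML mode emits one <text> element per chunk with top/left attributes, so you can bucket chunks into columns by their left coordinate and read each column top to bottom. The 300pt split point is an assumption to tune per layout, and this ignores headers, footers and tables:

    ```python
    import subprocess
    import xml.etree.ElementTree as ET

    xml_out = subprocess.run(
        ["pdftohtml", "-xml", "-stdout", "-i", "paper.pdf"],
        capture_output=True, text=True, check=True,
    ).stdout

    for page in ET.fromstring(xml_out).iter("page"):
        left_col, right_col = [], []
        for t in page.iter("text"):
            chunk = (int(t.get("top")), "".join(t.itertext()))
            (left_col if int(t.get("left")) < 300 else right_col).append(chunk)
        for _, text in sorted(left_col) + sorted(right_col):
            print(text)
    ```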

PeterStuer a day ago

I guess I'm lucky that the PDFs I need to process mostly have rather dull, unadventurous layouts. So far I've had great success using docling.

nicodjimenez a day ago

Check out mathpix.com. We handle complex tables, complex math, diagrams, rotated tables, and much more, extremely accurately.

Disclaimer: I'm the founder.

elpalek 20 hours ago

I recently tested (non-English) PDF OCR with Gemini 2.5 Pro. First, I asked it directly to extract text from the PDF. Result: a random text blob, not usable.

Second, I converted the PDF into pages of JPGs. Gemini performed exceptionally well: near-perfect text extraction, with the formatting intact, in Markdown.

Maybe there's an internal difference between how the model processes PDFs versus JPGs.

  • jagged-chisel 20 hours ago

    Model isn’t rendering the PDF probably, just looking in the file for text.

coolcase 18 hours ago

Tried extracting data from a newspaper. It is really hard. What is a headline, and which headline belongs to which paragraphs? Harder than you think! And chucking it as-is into OpenAI was no good at all. Manually dealing with coordinates from OCR was better, but not perfect.

andrethegiant a day ago

Cloudflare’s ai.toMarkdown() function available in Workers AI can handle PDFs pretty easily. Judging from speed alone, it seems they’re parsing the actual content rather than shoving into OCR/LLM.

Shameless plug: I use this under the hood when you prefix any PDF URL with https://pure.md/ to convert to raw text.

  • burkaman a day ago

    If you're looking for test cases, this is the first thing I tried and the result is very bad: https://pure.md/https://docs.house.gov/meetings/IF/IF00/2025...

    • marginalia_nu a day ago

      That PDF actually has some weird corner cases.

      First, it's all the same font size everywhere, and it's got bolded "headings" with spaces in them that are not bolded. I had to fix my own handling to get it to process well.

      This is the search engine's view of the document as of those fixes: https://www.marginalia.nu/junk/congress.html

      Still far from perfect...

      • mdaniel a day ago

        > That PDF actually has some weird corner cases.

        Heh, in my experience with PDFs that's a tautology

    • andrethegiant a day ago

      Apart from lacking newlines, how is the result bad? It extracts the text for easy piping into an LLM.

      • burkaman a day ago

        - Most of the titles have incorrectly split words, for example "P ART 2—R EPEAL OF EPA R ULE R ELATING TO M ULTI -P OLLUTANT E MISSION S TANDARDS". I know LLMs are resilient against typos and mistakes like this, but it still seems not ideal.

        - The header is parsed in a way that I suspect would mislead an LLM: "BRETT GUTHRIE, KENTUCKY FRANK PALLONE, JR., NEW JERSEY CHAIRMAN RANKING MEMBER ONE HUNDRED NINETEENTH CONGRESS". Guthrie is the chairman and Pallone is the ranking member, but that isn't implied in the text. In this particular case an LLM might already know that from other sources, but in more obscure contexts it will just have to rely on the parsed text.

        - It isn't converted into Markdown at all, the structure is completely lost. If you only care about text then I guess that's fine, and in this case an LLM might do an ok job at identifying some of the headers, but in the context of this discussion I think ai.toMarkdown() did a bad job of converting to Markdown and a just ok job of converting to text.

        I would have considered this a fairly easy test case, so it would make me hesitant to trust that function for general use if I were trying to solve the challenges described in the submitted article (Identifying headings, Joining consecutive headings, Identifying Paragraphs).

        I see that you are trying to minimize tokens for LLM input, so I realize your goals are probably not the same as what I'm talking about.

        Edit: Another test case, it seems to crash on any Arxiv PDF. Example: https://pure.md/https://arxiv.org/pdf/2411.12104.

        • andrethegiant a day ago

          > it seems to crash on any Arxiv PDF

          Fixed, thanks for reporting :-)

  • _boffin_ a day ago

    You’re aware that PDFs are containers that can hold various formats, which can be interlaced in different ways, such as on top, throughout, or in unexpected and unspecified ways that aren’t “parsable,” right?

    I would wager that they’re using OCR/LLM in their pipeline.

    • andrethegiant a day ago

      Could be. But their pricing for the conversion is free, which leads me to believe LLMs are not involved.

  • cpursley a day ago

    How does their function do on complex data tables, charts and that sort of stuff?

  • bambax a day ago

    It doesn't seem to handle multi-column PDFs well?

bickfordb 21 hours ago

Maybe it's time for new document formats and browsers that neatly separate the content, presentation and UI layers? PDF and HTML are 20+ years old, and it's often difficult to extract information from either, let alone author a browser for them.

  • rrr_oh_man 21 hours ago

    Yes, but I'm sure they're out there somewhere

    (https://xkcd.com/927/)

    • TRiG_Ireland 17 hours ago

      Open XML Paper Specification is an XML-based format intended to compete with PDF. Unlike PDF, it is purely static: no scripting.

      Also unlike PDF, I've never seen it actually used in the wild.

viking2917 16 hours ago

Coincidentally, I posted this over on Show HN today: OCR Workbench, AI OCR and editing tools for OCRing old / hard documents. https://news.ycombinator.com/item?id=43976450. Tesseract works fine for modern text documents, but it fails badly on older docs (e.g. colonial American, etc.)

fracus 16 hours ago

Why hasn't the PDF standard been replaced, or revised to require the text to be present in machine-readable metadata form? Seems like a no-brainer.

  • msephton 12 hours ago

    It's already supported, but it's an optional feature, so it's up to the app/developer/author.

devrandoom a day ago

I currently use ocrmypdf for my private library. Then Recoll to index and search. Is there a better solution I'm missing?
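
For anyone following the same route, the OCR step is roughly this (a sketch via ocrmypdf's Python API; the keyword arguments mirror the CLI flags and may differ slightly between versions):

```python
import ocrmypdf

ocrmypdf.ocr(
    "scanned-book.pdf",
    "scanned-book.ocr.pdf",
    language="eng",
    deskew=True,       # straighten skewed scans before OCR
    skip_text=True,    # leave pages that already have a text layer alone
)
```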

anonu a day ago

They should have called it NDF - the Non-Portable Document Format.

  • 0xml 12 hours ago

    Maybe Printable Document Format.

j45 a day ago

Part of what makes a problem challenging is recognizing whether it's genuinely new, or just new to us.

We get to learn a lot when something is new to us. At the same time, the previously untouchable parts of PDF-to-text conversion are largely being solved with the help of LLMs.

I built a tool to extract information from PDFs a long time ago, and the breakthrough was having no ego or attachment to any one way of doing it.

Different solutions and approaches offered different depth and quality of results, and organizing them to work together, alongside anything I built myself, provided what was needed: one place where more things work than not.

TZubiri 16 hours ago

As someone who has worked on this full-time (at S&P, parsing financial disclosures):

The solution is OCR. Don't fuck with the internal file format. PDF is designed to print/display stuff, not to be parsed by machines.

keybored a day ago

For people who want others to actually read their documents[1]: have the PDF point to a more digital-friendly format, an alt document, along the lines of:

Looks like you’ve found my PDF. You might want this version instead:

PDFs are often subpar. Just see the first example: standard LaTeX serif section title. I mean, PDFs often aren't even well-typeset for what they are (dead-tree simulations).

[1] No sarcasm or truism. Some may just want to submit a paper to whatever publisher and go through their whole laundry list of what a paper ought to be. Wide dissemination is not the point.