Ask HN: How to Structure Gnarly PDFs

I'm trying to compile a time series of publicly listed stocks stretching back to 2005. I'm doing this by parsing the semi-annual reports (N-CSR filings) from a mutual fund complex that includes a large index fund (VTI). The reports are HTML, with very different formats over the years, and each one renders to around 500 PDF pages.

I initially tried passing the full PDF to the well-known parsing platforms, without much luck. I then manually located the holdings tables I'm interested in (50 of the 500 pages in each PDF) and tried the same platforms on just those pages, again without much luck.

Any advice from the community?

1 comments

I might be missing something, but parsing the HTML, even with the different formats, should be much simpler than the PDF form.

In 20 years I would guess they used no more than 20 formats, which is doable even if writing XPath (perhaps CSS selectors would suffice) by hand.
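To make that concrete, here is a rough sketch of what "one small rule per format" could look like. It uses only Python's standard `html.parser` instead of real XPath or CSS selectors, so it is a simplified stand-in for what you would do with a proper selector library. The year ranges and the tag pairs in `FORMATS` are made up for illustration; I have not checked them against the actual filings, and nested rows are not handled.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect cell text grouped by row, for a given (row_tag, cell_tag) pair."""

    def __init__(self, row_tag, cell_tag):
        super().__init__()
        self.row_tag, self.cell_tag = row_tag, cell_tag
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == self.row_tag:
            self._row = []
        elif tag == self.cell_tag and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == self.cell_tag and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == self.row_tag and self._row is not None:
            if any(self._row):  # skip rows whose cells are all empty
                self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical mapping from filing year to the tags that mark rows and cells.
FORMATS = [
    (range(2005, 2012), ("div", "span")),
    (range(2012, 2026), ("tr", "td")),
]


def parse_filing(html, year):
    """Pick the layout registered for the filing year and return its rows."""
    for years, (row_tag, cell_tag) in FORMATS:
        if year in years:
            collector = RowCollector(row_tag, cell_tag)
            collector.feed(html)
            return collector.rows
    raise ValueError(f"no format registered for {year}")
```

Each new format you run into then becomes one more entry in `FORMATS` (or one more small parser function) rather than a rewrite.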

Do you mean that the mutual fund complex includes many funds and you get as many different formats for the same time period?

Thanks for the response!

For sure, I could write heuristics for parsing each format. I was kind of hoping that ML algorithms had advanced to the point where they could handle messy tables in documents. (By the way, if they have, that could be big for the companies with good structuring models. Financial data is unbelievably expensive, and a lot of it is publicly available but badly organized, so structuring companies could conceivably eat those markets as just one application of their tools, starting with cheap stuff for hobbyists/students who can't afford the commercial solutions.)

The complex includes 20 or so funds, so each file includes a "hot spot" with data that I'd like to extract. Within a filing, the holdings tables all look the same, but the format of the document changes from year to year. Unfortunately, the tables aren't really formatted as tables in the HTML, so I thought rendering to PDF and passing it off to an LLM might be the best thing to do. I posted links to a few examples below.

https://www.sec.gov/Archives/edgar/data/36405/00011046592508...

https://www.sec.gov/Archives/edgar/data/36405/00009324710500...
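
For reference, the per-format heuristic I have in mind would look something like the sketch below. It assumes (and this is only an assumption, the real filings vary) that once the tags are stripped, each holding ends up on its own line as a name, a share count, and a market value, separated by runs of spaces or dot leaders. The `HOLDING` pattern, `BLOCK_TAGS`, and the function names are all my own placeholders.

```python
import re
from html.parser import HTMLParser

# Tags treated as line breaks when flattening the HTML to text (assumed list).
BLOCK_TAGS = {"p", "div", "br", "tr", "li", "pre"}

# Assumed line shape: name, then 2+ spaces or a dot leader, then shares, then value.
HOLDING = re.compile(
    r"^\s*(?P<name>.+?)(?:\s{2,}|\s*\.{2,}\s*)"
    r"(?P<shares>\d[\d,]*)\s+(?P<value>\d[\d,]*)\s*$"
)


class TextExtractor(HTMLParser):
    """Flatten HTML to text, starting a new line at block-level tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def to_int(text):
    """Turn a string like '1,000' into the integer 1000."""
    return int(text.replace(",", ""))


def extract_holdings(html):
    """Return (name, shares, value) tuples for lines that match HOLDING."""
    extractor = TextExtractor()
    extractor.feed(html)
    holdings = []
    for line in "".join(extractor.parts).splitlines():
        match = HOLDING.match(line)
        if match:
            holdings.append(
                (match["name"].strip(), to_int(match["shares"]), to_int(match["value"]))
            )
    return holdings
```

Section headings and subtotal lines fall through because they don't have both numeric columns. The catch is that the pattern needs re-tuning for every layout change, which is exactly the part I was hoping a model could take over.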