Semistructured data has been the bane of my life, both as a person and as a CTO. Without hundreds of man-hours spent labelling, transforming, connecting, flattening and ingesting, it’s near-useless, but it’s everywhere. It’s almost all data.
What do I mean by semistructured? Anything that’s structured enough to want to be treated respectfully, searched reliably and ingested without mistakes, but isn’t structured enough to make any of those things easy.
This is almost all data I encounter. Large corporate Excel sheets that run entire departments? Semistructured. PDFs with separate, important tables? Semistructured. Receipts from thousands of vendors? Semistructured.
JSON/CSV output from unwilling (legacy) products that don’t really want you to export their data? Semistructured.
Half of all engineering hours at Greywing went into getting this data into a properly structured form: well-typed, well-named, normalized, foreign-key-connected tables ingested into a modern database.
40% of the time I've spent coding for myself has gone to the same thing: getting my own information back out of bank statements, health analytics, git logs, and so on.
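To make "properly structured" concrete, here is a minimal sketch of that kind of work in Python with the standard-library `sqlite3` module. Everything in it is an assumption for illustration, not Greywing's actual pipeline: the raw receipt records, the field names, the `clean` and `ingest` helpers, and the two-table schema are all hypothetical. It takes records with inconsistent keys and formatting, maps them onto typed and consistently named fields, and stores them in normalized tables linked by a foreign key.

```python
import sqlite3
from datetime import date
from decimal import Decimal

# Hypothetical raw records: same information, inconsistent keys and formats.
RAW_RECEIPTS = [
    {"Vendor": "Acme Marine ", "total": "1,204.50", "Date": "2023-03-01"},
    {"vendor_name": "acme marine", "amt": "99.00", "date": "2023-03-04"},
    {"Vendor": "Harbor Fuel", "total": "310", "Date": "2023-03-02"},
]


def clean(raw):
    """Map one raw record onto well-named, well-typed fields."""
    vendor = (raw.get("Vendor") or raw.get("vendor_name")).strip().title()
    amount = Decimal((raw.get("total") or raw.get("amt")).replace(",", ""))
    issued = date.fromisoformat(raw.get("Date") or raw.get("date"))
    return vendor, amount, issued


def ingest(conn, raw_records):
    """Create normalized tables and load the cleaned records into them."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS vendors (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS receipts (
            id           INTEGER PRIMARY KEY,
            vendor_id    INTEGER NOT NULL REFERENCES vendors(id),
            amount_cents INTEGER NOT NULL,
            issued_on    TEXT NOT NULL
        );
    """)
    for raw in raw_records:
        vendor, amount, issued = clean(raw)
        # Each vendor is stored once; receipts point at it by id.
        conn.execute("INSERT OR IGNORE INTO vendors (name) VALUES (?)", (vendor,))
        (vendor_id,) = conn.execute(
            "SELECT id FROM vendors WHERE name = ?", (vendor,)
        ).fetchone()
        conn.execute(
            "INSERT INTO receipts (vendor_id, amount_cents, issued_on) "
            "VALUES (?, ?, ?)",
            (vendor_id, int(amount * 100), issued.isoformat()),
        )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    ingest(conn, RAW_RECEIPTS)
    query = """
        SELECT v.name, SUM(r.amount_cents)
        FROM receipts r JOIN vendors v ON v.id = r.vendor_id
        GROUP BY v.name ORDER BY v.name
    """
    for name, total_cents in conn.execute(query):
        print(name, total_cents)
```

Even in this toy version, most of the code is the `clean` step, and it only handles the two key spellings and one date format I happened to invent. Real data has hundreds of these variations, which is where the hours go.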
Structured data is wonderful. Most of the problems of AI retrieval systems go away with well-structured data. Over the last 20 years we have gotten better and better at storing, indexing, retrieving and moving structured information around.
Here are just some of the benefits in the AI space: