Semistructured data has been the bane of my life, both as a person and as a CTO. Without hundreds of man-hours spent labelling, transforming, connecting, flattening and ingesting, it’s near-useless, but it’s everywhere. It’s almost all data.

What do I mean by semistructured? Anything that’s structured enough to want to be treated respectfully, searched reliably and ingested without mistakes, but isn’t structured enough to make any of those things easy.

This is almost all data I encounter. Large corporate Excel sheets that run entire departments? Semistructured. PDFs with separate, important tables? Semistructured. Receipts from thousands of vendors? Semistructured.

JSON/CSV output from unwilling (legacy) products that don’t really want you to export their data? Semistructured.

50% of all engineering hours at Greywing went into getting this data into a properly structured form: well-typed, well-named, normalized, foreign-key-connected tables ingested into a modern database.

40% of the time I've spent coding for myself has gone into the same thing: getting my own information back out of bank statements, health analytics, git logs, and so on.
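
To make that work concrete, here is a minimal sketch of the kind of cleanup I mean, using Python and SQLite. Everything in it is made up for illustration: the sample receipts, the key names, the two date formats and the table layout are assumptions, not anything from a real system. It takes records that carry the same information under inconsistent keys and formats, and turns them into typed, normalized rows connected by a foreign key.

```python
import sqlite3
from datetime import datetime
from decimal import Decimal

# Hypothetical raw exports: same information, inconsistent keys and formats.
RAW_RECEIPTS = [
    {"Vendor": "Acme Corp ", "total": "1,200.50", "date": "2023-01-15"},
    {"vendor_name": "acme corp", "Total": "$300", "Date": "15/02/2023"},
    {"Vendor": "Globex", "total": "75.00", "date": "2023-03-01"},
]


def pick(record, *names):
    """Return the value stored under the first matching key, ignoring case."""
    lowered = {key.lower(): value for key, value in record.items()}
    for name in names:
        if name in lowered:
            return lowered[name]
    raise KeyError(names)


def parse_amount_cents(text):
    """Turn strings like '$1,200.50' into integer cents."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return int(Decimal(cleaned) * 100)


def parse_date(text):
    """Accept the two (assumed) date formats and return an ISO 8601 string."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def load(raw_records):
    """Build an in-memory database with vendors and receipts tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE vendors (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE receipts (
            id INTEGER PRIMARY KEY,
            vendor_id INTEGER NOT NULL REFERENCES vendors(id),
            amount_cents INTEGER NOT NULL,
            issued_on TEXT NOT NULL
        );
    """)
    for record in raw_records:
        # Normalize the vendor name so spelling variants share one row.
        name = pick(record, "vendor", "vendor_name").strip().title()
        conn.execute("INSERT OR IGNORE INTO vendors (name) VALUES (?)", (name,))
        (vendor_id,) = conn.execute(
            "SELECT id FROM vendors WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO receipts (vendor_id, amount_cents, issued_on) "
            "VALUES (?, ?, ?)",
            (
                vendor_id,
                parse_amount_cents(pick(record, "total")),
                parse_date(pick(record, "date")),
            ),
        )
    conn.commit()
    return conn
```

Three records already need a key-matching helper, an amount parser and a date parser. Multiply that by thousands of vendors and formats and it becomes clear where the hours go.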

Why?

Structured data is wonderful. Most of the problems of AI retrieval systems go away with well-structured data. Over the last 20 years we have become better and better at storing, indexing, retrieving and moving structured information around.

Here are just some of the benefits in the AI space:

It’s scalable: You can store terabytes or petabytes of information without a problem.
It’s fast: Structured search can easily outperform vector search, both in speed and in the complexity of the queries it can handle.
No more hallucinations: Structured transformations and queries move the AI from a middleman that has to regurgitate information to a ‘manager’ that only directs the work. The answers come from the data itself, which removes the opportunity to hallucinate them.
Inspectability: Queries and transformations can be read and checked, unlike vector spaces that either work or don’t.
Privacy: Compliance becomes easier when you can hide the underlying data entirely from the AI.
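
A rough sketch of the ‘manager’ pattern behind the last three points, again in Python with SQLite. The table, the sample rows and the function names are all hypothetical, and `hypothetical_model_write_sql` is a stand-in that returns a canned query where a real system would call a language model. The point it illustrates: the model is shown only the schema, the query it writes is plain text anyone can read, and the numbers in the answer are computed by the database.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database with made-up receipt data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE receipts (
            id INTEGER PRIMARY KEY,
            vendor TEXT NOT NULL,
            amount_cents INTEGER NOT NULL
        );
        INSERT INTO receipts (vendor, amount_cents) VALUES
            ('Acme Corp', 120050),
            ('Acme Corp', 30000),
            ('Globex', 7500);
    """)
    return conn


def schema_for_prompt(conn):
    """Return the table definitions only; no row data is included."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return "\n".join(sql for (sql,) in rows)


def hypothetical_model_write_sql(question, schema):
    """Stand-in for an LLM call: returns a canned query for illustration."""
    return (
        "SELECT vendor, SUM(amount_cents) FROM receipts "
        "GROUP BY vendor ORDER BY vendor"
    )


def run_inspectable_query(conn, sql):
    """Run model-written SQL after a crude read-only check.

    The prefix check is only illustrative, not a security boundary; a real
    deployment would use a read-only connection or restricted permissions.
    """
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(sql).fetchall()


conn = build_demo_db()
schema = schema_for_prompt(conn)
sql = hypothetical_model_write_sql("How much did we spend per vendor?", schema)
print(sql)                               # the query can be read and reviewed
print(run_inspectable_query(conn, sql))  # the totals come from the database
```

The model's only output here is the query string, so a wrong answer shows up as a wrong query that can be read and fixed.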