Lewin von Saldern, Chunkify
December 1, 2025

Since the launch of GPT-3.5 in November 2022, AI in general, and large language models in particular, have experienced an unprecedented boom. Billions of dollars have been invested to develop AI applications across industries and functions. From chatbots to autonomous agents, the potential seems enormous.

But there is also a very down-to-earth aspect to all of this. To develop AI applications that truly add value, developers need high-quality data. The phrase “garbage in, garbage out” has become a constant reminder. What is often overlooked is the simple fact that most knowledge in organizations is stored in unstructured formats. Some estimates suggest that around 80% of company knowledge lives in documents like PDFs, Word files, slide decks, and emails.

One of the most common methodologies to leverage unstructured data in AI is called Retrieval-Augmented Generation, or RAG. In a RAG system, a database (often a vector database) is connected to an LLM. The data lives in this database; the retrieval layer finds the most relevant pieces of content for a query, and the LLM interprets the question and generates a coherent answer on top of them. To set up such a RAG system based on unstructured documents, like PDFs, the data must first be pre-processed. This step is referred to as parsing and chunking. The content is extracted from the original file and split into meaningful pieces. Each of these pieces (“chunks”) is embedded, meaning it is converted into a vector, an array of numbers, and stored in a vector database. When a user asks a question in the chatbot, the query itself is embedded in the same way. A similarity search then identifies the pieces of content closest to the query. The results are presented to the LLM, which uses them as context to generate an answer.
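The flow above can be sketched in a few lines of Python. This is a toy illustration, not a production setup: a real pipeline would use a learned embedding model and a dedicated vector database, whereas here a simple bag-of-words embedding and an in-memory list stand in for both.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words term counts. A real RAG pipeline would
    # call a learned embedding model here instead.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Stand-in for a vector database: each chunk is stored with its embedding.
chunks = [
    "Economy class passengers may bring one suitcase.",
    "Business class passengers may bring two suitcases.",
    "Pets must travel in an approved carrier.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, top_k=2):
    # Embed the query the same way, then rank chunks by similarity.
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# The retrieved chunks would then be handed to the LLM as context.
context = retrieve("How many suitcases can I bring?")
```

The query and the documents live in the same vector space, which is why the same `embed` function must be used for both.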

Illustration 1: Set-up and operation of a RAG system (source: chunkify.io)

Today, many no-code tools claim to make this entire setup extremely easy. You can upload documents in any format you like or even connect your entire cloud drive. All your documents are then automatically parsed and chunked, allowing you to almost instantly interact with the content or spin up agents that take actions based on the unstructured data you provide.

The problem: many of these tools fall short of expectations and do not create real business value. One reason is that they are often simple, black-box solutions that don’t let you see what’s under the hood and how the RAG system is really set up. You will not be able to see how your content was parsed and chunked, or which parameters were used along the RAG pipeline. When setting up a RAG pipeline under real-world constraints, every step of the way can be engineered and re-engineered, as Illustration 2 shows, and it often takes developers months to get it right.


Illustration 2: Levers for increasing RAG accuracy (source: chunkify.io)

And while debates about re-ranking techniques, prompting strategies, or setting top-k correctly are fueling online discussions, the most underestimated aspect is the pre-processing of unstructured data: the “extraction & chunking” step. When content is extracted and split into pieces, it becomes crucial to understand the document’s structure so that information is not taken out of context.

Take a simple example: an airline wants to set up a customer-service chatbot that supports passengers between ticket purchase and boarding. A passenger wonders whether she can bring two suitcases on her flight. She tries to find the answer online, ends up at the airline’s chatbot, and asks: “Can I bring two suitcases?” The chatbot replies: “Yes, you can bring two suitcases.”

Unfortunately, that answer is wrong. Only business-class passengers may bring two suitcases. The correct rule was documented, but the information that this rule applied only to business class appeared in a headline that was not correctly identified during the parsing step. The system simply treated the headline as normal text or ignored it entirely, so the restriction was lost. The result: the passenger arrives at the airport with two suitcases and needs to pay an additional 100 euros. She will be unhappy, and the airline may lose her as a customer.
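This failure mode can be made concrete. The sketch below contrasts naive line-based chunking with heading-aware chunking; the document format and function names are invented for illustration, but the principle holds for any parser: structural context, like the headline, has to be carried into each chunk.

```python
# A hypothetical source document in which headings start with "#".
document = """\
# Baggage allowance: Economy class
Passengers may bring one suitcase.
# Baggage allowance: Business class
Passengers may bring two suitcases.
"""

def naive_chunks(text):
    # Treats every non-empty body line as a chunk and drops the headings,
    # losing the class restriction entirely.
    return [line.strip() for line in text.splitlines()
            if line.strip() and not line.startswith("#")]

def heading_aware_chunks(text):
    # Carries the most recent heading into every chunk so the
    # restriction survives chunking.
    chunks, heading = [], ""
    for line in text.splitlines():
        if line.startswith("#"):
            heading = line.lstrip("# ").strip()
        elif line.strip():
            chunks.append(f"{heading} - {line.strip()}" if heading else line.strip())
    return chunks
```

With naive chunking, the system retrieves two contradictory rules about suitcases and has no way to tell which one applies; with heading-aware chunking, each rule stays tied to its travel class.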

Illustration 3: What can happen when unstructured data is pushed into a RAG pipeline without proper processing (source: chunkify.io).

The key learning here is clear: when building RAG systems based on unstructured data, the minimum requirement is to properly structure this data during pre-processing. Only then does the system have a fair chance to interpret the content correctly. This requires time, effort, and high-quality tooling. And even then, parsing and chunking are only the first steps. You still need to configure retrieval, embeddings, context windows, fallbacks, guardrails, and much more. But any effort spent tuning the RAG pipeline before the data has been cleaned and correctly structured is time wasted.

And once applications become more complex, simple RAG might not be sufficient at all. Once the data is structured, you may need to connect information across documents, build deterministic decision trees that control which content is used for which type of query, or, put differently, develop more complex knowledge graphs that allow for reasoning and rule-based answers. Only with such structures can you launch AI applications that create real business value instead of unreliable prototypes.
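As a minimal sketch of what a deterministic, rule-based layer can look like in practice (the schema and rule contents are invented for illustration): instead of relying on similarity search alone, structured data allows an exact lookup with a controlled fallback.

```python
# Hypothetical structured rules extracted from airline documentation.
baggage_rules = {
    ("economy", "suitcases"): "Economy class passengers may bring one suitcase.",
    ("business", "suitcases"): "Business class passengers may bring two suitcases.",
}

def answer_baggage_question(travel_class, topic):
    # Deterministic lookup: there is never ambiguity about which rule
    # applies, and unknown cases are routed to a fallback instead of
    # a guessed answer.
    return baggage_rules.get(
        (travel_class, topic),
        "No matching rule found; escalating to a human agent.",
    )
```

The value is not the lookup itself but the guarantee: for a covered query, the answer is traceable to exactly one rule in the source data.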

Whether you rely on RAG or move toward a knowledge-graph approach, all reliable systems share one foundational requirement: they need high-quality structured data.

This is exactly why technical documentation is in such a privileged and interesting position at the moment. The field has decades of experience working with structured content and understands text-based information at a level of depth that many AI newcomers still need to acquire. Technical documentation teams know how to create consistent schemas, well-structured source documents, and clean information architectures. This expertise is not just helpful, it is becoming a central prerequisite for building powerful AI applications that go beyond demos and actually work in production.

The AI wave has made many things possible. But without structure, even the most advanced models reach their limits. In other words: structured content is more relevant than ever, and its value is now becoming visible far beyond the technical documentation community.