Laura Katajisto, Etteplan
July 1, 2025
As AI becomes integrated into technical communication processes, tools, and workflows, new questions arise about evaluating the content that writers co-create with AI. This article aims to provide some insight and options for thinking about how to measure the results of AI-assisted content creation.
From enterprise requirements to tech comms needs
Let’s begin by discussing the evaluation of AI for enterprise use. Both the use of AI and the specific AI “brand” must be approved by parties such as upper management, IT and cybersecurity, and the legal department. By the time technical communicators start using AI, the AI itself should already have been evaluated against the enterprise requirements and found to meet them.
Each organization has specific requirements that any tool, including AI, must comply with. These requirements are largely unique to each organization, but as an example, let’s look at a few criteria and what they could mean in an enterprise context.
Figure 1: Examples of enterprise needs
| Criteria | What could this mean in an enterprise context? |
| --- | --- |
| Privacy & Security | Is the AI private and secure within your company’s IT infrastructure? |
| Legal | Do the AI and its developer comply with legal regulations on AI, privacy, and consent? |
| Risk & Safety | Are there any risks to company data, customer data, or users? Does the AI generate unsafe or malicious content? |
| Ethics | Do the AI and its developer comply with ethical guidelines? |
| Performance & Robustness | Can the AI handle your prompts, your content, and your locations (latency), and is it fast and stable enough? |
| Cost | What does using the AI cost? What are the pricing model and token cost? |
| Support & Roadmap | How well are you supported, and is there a development roadmap you can follow? |
| Accuracy | Does the AI generate accurate and relevant responses? |
| Consistency | Does the AI generate the same response consistently? |
| Quality | Does the AI generate responses that are true? |
| Bias & Diversity | Is the AI’s data representative and unbiased, or does it produce results that promote harmful stereotypes? |
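In practice, this stage works as a go/no-go gate: the AI should meet every enterprise requirement before it is handed over to writers. As a minimal sketch (the criteria names mirror Figure 1, and the verdicts are invented for the example), such a gate could look like this:

```python
# Minimal sketch of an enterprise evaluation gate; the verdicts are
# illustrative placeholders, not real evaluation results.
evaluation = {
    "Privacy & Security": True,
    "Legal": True,
    "Risk & Safety": True,
    "Ethics": True,
    "Performance & Robustness": True,
    "Cost": False,  # e.g., token pricing exceeds the budgeted ceiling
}

failed = [criterion for criterion, met in evaluation.items() if not met]
if failed:
    print("Not yet approved; unmet criteria:", ", ".join(failed))
else:
    print("Approved for technical communication use")
```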
Some of these questions are bigger than others, and naturally, some matter more to some evaluators than to others. Regardless of how an organization weighs its requirements, it should always conduct a thorough AI evaluation first. Once due diligence has been carried out and an AI solution has been handed to the technical communicators, we get to the technical communication needs. These are more content- and subject-matter-related, but this, too, depends on your case.
Figure 2: Examples of technical communication needs
As part of your needs analysis, you should also look at the content you plan to co-create with AI. Clear content categories help you tailor your metrics. As an example, I’m boldly categorizing my content into “fluffy” and “factual” content. Fluffy content (e.g., e-mails, marketing copy) is persuasive, emotionally resonant, and less technically detailed. Factual content (e.g., user guides, installation manuals) is structured, precise, and leaves little room for error or creativity. To get the best results, you should apply different, or differently weighted, evaluation criteria to these two categories of content.
Figure 3: Fluffy and factual content
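To make “differently weighted evaluation criteria” concrete, here is a minimal sketch; the criteria, weights, and 0–5 scores are illustrative assumptions, not a prescribed rubric:

```python
# Minimal sketch of differently weighted criteria for "fluffy" vs. "factual"
# content. Criteria, weights, and scores are illustrative assumptions.
WEIGHTS = {
    "fluffy":  {"accuracy": 0.2, "tone": 0.4, "clarity": 0.3, "consistency": 0.1},
    "factual": {"accuracy": 0.5, "tone": 0.1, "clarity": 0.2, "consistency": 0.2},
}

def weighted_score(category: str, scores: dict) -> float:
    """Combine per-criterion scores (0-5) into one weighted score."""
    weights = WEIGHTS[category]
    return round(sum(weights[c] * scores.get(c, 0.0) for c in weights), 2)

# The same draft scores differently depending on the content category.
draft = {"accuracy": 5, "tone": 2, "clarity": 4, "consistency": 4}
print("Scored as fluffy content: ", weighted_score("fluffy", draft))   # 3.4
print("Scored as factual content:", weighted_score("factual", draft))  # 4.3
```

The exact numbers are not the point; the point is that the same draft can be perfectly good marketing copy and a mediocre user guide at the same time.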
Metrics that matter
After you have identified your needs, you must design your metrics. For example, if you need AI to summarize source material into a draft DITA task topic, you should assess its ability to accurately extract key points and organize them coherently into a valid DITA task topic. A good starting point is the way you currently evaluate the output of your human authors. You can also review AI regulations, standards, and quality guidelines to help you build your metrics.
Figure 4: Questions to start with
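Parts of such an evaluation can even be automated. For the DITA task example, a first structural check of the generated topic could look like the sketch below; it only verifies well-formedness and a few expected elements, and is not a replacement for full DITA validation:

```python
# Minimal sketch of a structural check for an AI-drafted DITA task topic.
# It checks well-formedness and a few expected elements only; full DITA
# validation against the DTDs/schemas would go much further.
import xml.etree.ElementTree as ET

REQUIRED = ["title", "taskbody", "taskbody/steps", "taskbody/steps/step/cmd"]

def check_dita_task(xml_text: str) -> list:
    """Return a list of structural problems found in the generated topic."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"Not well-formed XML: {err}"]
    problems = []
    if root.tag != "task":
        problems.append(f"Root element is <{root.tag}>, expected <task>")
    problems += [f"Missing element: {path}" for path in REQUIRED
                 if root.find(path) is None]
    return problems

draft = """<task id="t_replace_filter">
  <title>Replacing the filter</title>
  <taskbody><steps><step><cmd>Switch off the device.</cmd></step></steps></taskbody>
</task>"""
print(check_dita_task(draft) or "No structural problems found")
```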
To create a comprehensive set of metrics, you should look at both “soft” and “hard” metrics. Soft metrics are subjective, non-numerical, and qualitative – experiences, interpretations, opinions, and perceptions. Hard metrics are objective and quantitative – how long something takes, how many times something needs to be done. There is, of course, a grey area in between, because soft metrics can also be counted. However, it is numbers, not feelings, that you will most likely be asked to provide.
A list of soft metrics and the related questions could look like this:
| Metric | Questions to ask |
| --- | --- |
| Accuracy and correctness | Is the generated text technically correct in its domain? Is it free of factual errors and errors of common knowledge? Does it follow the product specification (or other documentation you have specified as source/context)? |
| Clarity and usability | Does the generated text follow technical documentation best practices? Is it clear and readable for the target audience? Does it follow a logical structure? Is it free of linguistic errors? |
| Compliance with standards | Does the generated text comply with documentation standards (e.g., IEC/IEEE 82079-1), safety disclaimers, and/or any other regulations you must follow? |
| Comprehensiveness | Is the generated text complete? Does it cover all relevant topics, or is something omitted? Does it include examples or other special content types where needed? |
| Consistency | Is the generated text uniform in language, style, and structure across the document (or other information set you are looking at)? Does the generated text use terminology consistently? |
| Hallucination | Does the generated text contain hallucinations, such as invented facts, features, or references? |
| Ideation | Does the generated text contain new ideas? Are the ideas usable? |
| Relevance | Is the generated text relevant for the target audience? Is it relevant for the purpose of the document (or other information set you are working with)? |
| Reliability | Does the same prompt produce the same (or equally good) text every time? |
| Tone | Does the generated text use audience-appropriate language, tone, and depth of explanation? |
| Usefulness | Does the generated text help users solve their problems effectively? |
| User experience | Is the generated text easy and effective to use? Are the correctness and comprehensiveness of the generated text good (enough) for the user? |
| User preference | Would the user rather use the AI-assisted version of the content than the previous version? |
| Writer experience | Is it easy and efficient to create the text? Is the work now more productive? Is the text of better quality? Do you feel better supported when working independently? Can you now focus better or be more creative where possible? Are you satisfied with your collaboration with the AI assistant? |
| XML (DITA) correctness | Is the generated XML technically valid? Are the elements and attributes used correctly? |
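Although these metrics are qualitative, the yes/no judgements from a review round can still be rolled up into numbers – the grey area mentioned earlier. A minimal sketch, using a few metric names from the table and invented review data:

```python
# Minimal sketch of counting yes/no soft-metric judgements across a review
# round. Review data is invented; True means the topic met the bar (for
# "Hallucination", that no hallucinations were found).
from collections import Counter

reviews = [
    {"Accuracy and correctness": True,  "Hallucination": True,  "Tone": True},
    {"Accuracy and correctness": True,  "Hallucination": False, "Tone": True},
    {"Accuracy and correctness": False, "Hallucination": True,  "Tone": True},
]

passes = Counter()
for review in reviews:
    passes.update(metric for metric, met in review.items() if met)

for metric in reviews[0]:
    print(f"{metric}: {passes[metric]}/{len(reviews)} topics passed")
```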
A list of hard metrics and the related questions could look like this:
| Metric | Questions to ask |
| --- | --- |
| Cost | What impact does the use of the AI assistant have on the cost of creating documentation? |
| Efficiency gains | What is the time to completion (i.e., faster designing, writing, illustrating)? How long does it take to resolve documentation issues? |
| Errors | How many errors did the generated text include? |
| Findability improvements | How long does it take to find information? |
| Productivity gain | How many updates to the documentation are there? How many iterations were needed? |
| Quality improvements | How much has your quality grade changed? |
| Reduction of support costs | How effective was the AI-assisted content in solving user problems? How many support requests did you receive related to unclear instructions? |
| User satisfaction | What is the NPS score, or what are the user ratings? |
Whereas the answer to many of the soft metrics could simply be yes or no, for the hard metrics you’re looking at numbers going up or down. These could be savings in time or money, counts of errors, counts of updates or iterations, or new vs. old quality grades or NPS scores. The key thing here is that you must establish a meaningful baseline before you start measuring. If you don’t know what your current quality score is, how much time you spend on a topic (or a manual), or what your current readability score is, and how these have evolved, how can you show the change?
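Once the baseline exists, the arithmetic itself is simple. A minimal sketch with invented baseline and AI-assisted figures:

```python
# Minimal sketch of comparing hard metrics against a baseline.
# Both sets of figures are invented for the example.
baseline = {"hours_per_topic": 4.0, "errors_per_topic": 3.0, "nps": 32}
with_ai  = {"hours_per_topic": 2.5, "errors_per_topic": 2.0, "nps": 41}

for metric, before in baseline.items():
    after = with_ai[metric]
    change = (after - before) / before * 100
    print(f"{metric}: {before} -> {after} ({change:+.0f}%)")
```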
You should also note that there is a scale within the soft metrics themselves. Consider the metric of “writer experience” and compare it to “hallucination”: some soft metrics are more important than others, as some may be showstoppers while others just make you feel bad.
In conclusion
Ultimately, successful measurement of the content that comes out of an AI-assisted workflow can be summarized in three points:
- You cannot measure properly without having initial baseline metrics. If you don’t have any figures showing where you are without AI, it’s like you’re trying to build your house from the second floor without having built the ground floor.
- Define your use case and objectives clearly at the beginning. AI tools are powerful, but technical communication has complex use cases, and turning a need into an AI function can take time and a lot of discussion and analysis. This definition impacts your metrics because you will then know what you really need to track and measure.
- Select metrics that give you real, actionable data for evaluation and improvements. Leave out all metrics that are just “pseudomeasures” which give you a pretty number for management presentations but don’t really mean anything.
There are several different metrics that you can use for measuring the results of AI-assisted content creation. From all the possible options, you need to select the ones that are suitable for your case – remember that each case is unique, and we’re still far away from a standardized set of metrics and criteria.