Data Conversion Laboratory Inc.
June 15, 2023

Content Structure: The Building Blocks of Innovation

Foundations for Success

All organizations have deep, growing archives of content. From historical documents and photographs to new research articles and product documentation, this content comes in many different forms, yet it can all be mined to create valuable resources for today’s digital-first consumers.

However, content can only unlock new opportunities for organizations if it has a foundation of rich structure and metadata. Simply having information in a digital format, such as a Word file or an image-based PDF, is not enough. For content to be easily discovered and used via modern platforms, it must be converted into a multidimensional XML format from which machines can extract pertinent information.
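
To make the contrast with flat digital files concrete, here is a minimal sketch of machine extraction from a structured record. The element names (article, front, title, author, date, keyword) are invented for illustration and are not a DCL or industry schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; element and attribute names are illustrative only.
SAMPLE = """
<article id="a1">
  <front>
    <title>Content Structure</title>
    <author>Jane Doe</author>
    <date>2023-06-15</date>
    <keyword>XML</keyword>
    <keyword>metadata</keyword>
  </front>
  <body><p>Structured content can be mined by machines.</p></body>
</article>
"""


def extract_metadata(xml_text: str) -> dict:
    """Pull a few descriptive fields out of the structured record."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.get("id"),
        "title": root.findtext("front/title"),
        "author": root.findtext("front/author"),
        "date": root.findtext("front/date"),
        "keywords": [k.text for k in root.findall("front/keyword")],
    }


metadata = extract_metadata(SAMPLE)
print(metadata)
```

The same lookup against a Word file or an image-based PDF would require guessing where the title or author sits on the page; with tagged content, each field is addressed by name.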

Thankfully, artificial intelligence technologies, including natural language processing and machine learning, have lowered previous barriers to constructing rich XML. Content-transformation projects require less time and fewer human resources than ever. In partnership with a technology solutions provider, content-producing organizations can undertake projects that were once unfeasible, in turn making their wealth of resources more accessible and monetizable through digital product offerings.

From Flat to Multidimensional

Before embarking on a data conversion and content structure project, organizations must take stock of the information they have buried within their files and determine what is worth enriching and making digitally available. That information is likely stored in a variety of formats, including hard copies (e.g., papers and microfilms) and digital copies (e.g., PDFs and word processing files).

Oftentimes, organizations have digitized information, but that information is locked in image-based PDFs and not structured to facilitate search on an updated platform. To unlock new dimensions of discoverability, the information within those images needs to be identified and tagged in a way that is conducive to search and utility (not to mention making it accessible for the visually impaired).

The Power of Intelligent Content

With rich data structure comes a world of new content-powered possibilities. For modern publishers and information providers, structured data is an essential building block for innovative new business models. To harness the power of intelligent content, organizations must go beyond simple data conversions. Not all XML is created equal, and extracted metadata must be of high quality to support continued business evolution.

With proper due diligence and tagging, organizations set themselves up for success not only in their current endeavors but also in future endeavors yet to be identified. No matter what new technologies bring, however, one thing is certain: if your content isn’t tagged properly, people won’t find your information. Making information more findable is critical for the success of organizations moving forward.

The DCL Difference

DCL differs from many service providers because it uses a combination of onshore staff plus technology to create well-structured data for its customers. The company’s deep knowledge of markup languages enables it to use leading technologies, such as NLP and ML, to generate accurate markup in an automated, cost-effective way.

DCL converts and transforms content from any format into structured content. The figure below details some of the inputs and outputs we’ve worked with over the years.

 

Over the years, DCL has developed a number of software tools to support its content structure and content conversion services:

Harmonizer

Harmonizer is a software application that analyzes document collections using natural language processing (NLP) to identify redundant and nearly redundant content in the collection. Harmonizer is a powerful tool that organizations use when planning a content reuse strategy or moving to a new platform.

Creating leaner content collections improves the consistency of an organization’s brand and communication while ensuring all business units operate with unified source content. Harmonizer reports allow users to navigate massive content collections to eliminate errors and outdated terminology.

Our customers find that Harmonizer reveals unintended differences in text phrasing, spelling, and punctuation, often surfacing errors that have been deeply embedded in documentation for many years.
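
As a rough illustration of what "nearly redundant" can mean in practice, the sketch below compares passages by overlapping word sequences (shingles) and Jaccard similarity. This is a generic textbook technique chosen for illustration; it is not a description of how Harmonizer works internally, and the function names, sample passages, and 0.5 threshold are all assumptions.

```python
import re
from itertools import combinations


def shingles(text: str, size: int = 3) -> set:
    """Return the set of lowercase word n-grams found in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(passages: dict, threshold: float = 0.5) -> list:
    """Return (id, id, score) for every pair scoring at or above threshold."""
    sets = {pid: shingles(text) for pid, text in passages.items()}
    pairs = []
    for (p, sa), (q, sb) in combinations(sets.items(), 2):
        score = jaccard(sa, sb)
        if score >= threshold:
            pairs.append((p, q, round(score, 2)))
    return pairs


# Illustrative passages; the first two differ only in their final words.
passages = {
    "manual-1": "Press the power button to turn on the device before use.",
    "manual-2": "Press the power button to turn on the device before using it.",
    "manual-3": "Store the device in a cool, dry place.",
}
matches = near_duplicates(passages)
print(matches)
```

Pairs flagged this way are candidates for review: a person still decides which wording becomes the single source and whether the difference was intentional.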

Benefits

  • Capture reuse potential and metrics for ROI calculation
  • Harmonize content (reduce “near duplicate” content) to provide consistent information throughout a document set
  • Create leaner content collections to streamline management, translation, and localization
  • Increase efficiency in updating information in the future

 

Data Harvester

DCL Data Harvester provides automated website scraping configured to your business needs with customized XML feeds back to your organization. Organizations need to harvest and structure data and content posted and maintained on public websites. Websites are often the version of record for policy, procedure, legal, and regulatory content. Many businesses benefit from daily robotic scans of updated website content with structured XML feeds back into internal systems.

Sites are global and multilingual and contain information in multiple formats, such as HTML, PDF, XML, RTF, and DOCX. This calls for a deeper solution in which data is downloaded, normalized, structured, and converted into a common XML format with defined metadata, and in which related content is linked. It is critical that website crawling does not resemble an attack on the system, which would trigger distributed denial-of-service (DDoS) alarms.

DCL has developed methods and bots to facilitate high-volume data retrieval from hundreds of websites, in a variety of source formats (HTML, RTF, DOCX, TXT, XML, etc.), in both European and Asian languages. We produce a unified data stream that is converted to XML for ingestion into derivative databases, data analytics platforms, and other downstream systems.
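
Two ideas from this description can be sketched in a few lines: spacing out requests to each host so a crawl does not resemble an attack, and wrapping only new or changed content in a common XML envelope. The class and function names, the feed layout, and the two-second delay are assumptions made for illustration; they do not describe Data Harvester's actual design, and no network requests are made here.

```python
import hashlib
import time
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay_s=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay_s = min_delay_s
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time of the most recent request

    def wait(self, url: str) -> float:
        """Pause if needed before contacting the URL's host; return seconds waited."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last.get(host)
        delay = 0.0 if last is None else max(0.0, self.min_delay_s - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last[host] = now + delay
        return delay


def to_feed(records: list, seen_hashes: dict) -> ET.Element:
    """Wrap new or modified records in a common XML envelope (hypothetical layout)."""
    feed = ET.Element("feed")
    for rec in records:
        digest = hashlib.sha256(rec["text"].encode("utf-8")).hexdigest()
        if seen_hashes.get(rec["url"]) == digest:
            continue  # unchanged since the last scan
        seen_hashes[rec["url"]] = digest
        item = ET.SubElement(
            feed, "item", {"source-format": rec["format"], "lang": rec["lang"]}
        )
        ET.SubElement(item, "url").text = rec["url"]
        ET.SubElement(item, "content").text = rec["text"]
    return feed


# Already-extracted text from two source formats (illustrative data).
records = [
    {"url": "https://example.org/policy", "format": "html", "lang": "en", "text": "Policy v1"},
    {"url": "https://example.org/rules", "format": "pdf", "lang": "fr", "text": "Règlement"},
]
seen = {}
first_scan = to_feed(records, seen)   # both records are new
second_scan = to_feed(records, seen)  # nothing has changed
print(ET.tostring(first_scan, encoding="unicode"))
```

In a real crawl, `HostThrottle.wait(url)` would be called before each download; a production harvester would also need to honor each site's robots.txt and terms of use.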

Benefits

  • Daily robotic scans of websites important to your business
  • Harvest new and modified content from a variety of sources: PDF, HTML, XML, RTF, Word
  • Analyze, cleanse, and harmonize data
  • Provide cross-reference linking
  • Convert to XML schema for delivery

 

DCL Reformer

Many organizations have extensive content buried in image-based PDFs (and even paper!) that cannot be digitized using standard OCR tools due to complex tables, charts, figures, foreign characters, chemical formulae, etc. DCL Reformer is an automated solution that transforms static content into structured formats, improving the content’s utility for downstream systems.

DCL uses computer vision techniques to detect and remove poor OCR-quality content, retaining text for high-accuracy OCR processing and conversion to unstructured text. Complex algorithms, NLP engines, and other techniques are then applied to analyze the unstructured text from documents with wide variations in format and quality, and accurately structure the data.
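
The triage step described above can be pictured as a simple routing decision. The sketch below assumes an upstream detector has already labeled each page region with a kind and a quality score; the Region fields, labels, and 0.8 cutoff are hypothetical and are not Reformer's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Region:
    page: int
    kind: str       # hypothetical labels: "text", "table", "figure"
    quality: float  # hypothetical 0.0-1.0 score from an image-quality detector


def split_regions(regions, min_quality=0.8):
    """Separate regions suitable for OCR from those needing other handling."""
    for_ocr, set_aside = [], []
    for region in regions:
        if region.kind == "text" and region.quality >= min_quality:
            for_ocr.append(region)
        else:
            set_aside.append(region)
    return for_ocr, set_aside


# Illustrative page layout.
regions = [
    Region(page=1, kind="text", quality=0.95),
    Region(page=1, kind="table", quality=0.90),
    Region(page=2, kind="text", quality=0.40),
]
for_ocr, set_aside = split_regions(regions)
print(len(for_ocr), "region(s) sent to OCR;", len(set_aside), "set aside")
```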

Benefits

  • Create content structure where it did not exist previously
  • Transform legacy documents into digital assets
  • Generate structured content to deliver to other systems in your organization
  • Experiment and innovate new product creation with digitized and structured content

 

Production Control System

An important part of all DCL solutions is the DCL Production Control System (PCS). PCS provides workflow management and scheduling/reporting capabilities with comprehensive monitoring mechanisms that track timeliness and quality levels, including generating alerts when requirements are at risk of not being met.

Benefits

  • Registers each document as it is received
  • Timestamps and tracks location and status of each document
  • Routes documents for exception processing and tracks successful completion
  • Provides client and internal reporting capabilities, both automated and user-initiated; reports are available 24/7 over a secure network connection
  • Delivers automated emails and notifications
  • Provides early warning and scheduling for pickups and delivery of documents
  • Reconciles metrics at every step in the process to ensure that all reporting is accurate
  • Load-balances tasks through the various processors and appropriate workflow steps
  • Maintains extensive workflow, quality, and performance statistics
  • Manages compliance with agreed turnaround times
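
Several of these capabilities (registering, timestamping, status tracking, and early warning on turnaround times) reduce to a small amount of bookkeeping. The sketch below is a minimal illustration under assumed names and an assumed four-hour warning window; it is not the PCS data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class TrackedDocument:
    doc_id: str
    received: datetime
    due: datetime
    status: str = "received"
    history: list = field(default_factory=list)

    def move_to(self, status: str, when: datetime) -> None:
        """Record a timestamped status change."""
        self.history.append((when, self.status, status))
        self.status = status


def at_risk(docs, now, warn_window=timedelta(hours=4)):
    """Return ids of undelivered documents that are overdue or due within warn_window."""
    return [
        d.doc_id
        for d in docs
        if d.status != "delivered" and d.due - now <= warn_window
    ]


# Illustrative batch with an assumed 24-hour turnaround.
turnaround = timedelta(hours=24)
morning = datetime(2023, 6, 15, 9, 0)
evening = datetime(2023, 6, 15, 20, 0)
docs = [
    TrackedDocument("doc-a", morning, morning + turnaround),
    TrackedDocument("doc-b", morning, morning + turnaround),
    TrackedDocument("doc-c", evening, evening + turnaround),
]
docs[1].move_to("delivered", datetime(2023, 6, 15, 18, 0))

alerts = at_risk(docs, now=datetime(2023, 6, 16, 6, 0))
print(alerts)
```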

 

Contact Us to Learn More

With expertise across many industries, DCL uses advanced technology and US-based project management teams to solve complex conversion challenges and is a recognized leader in XML, DITA, SPL, and S1000D conversions. If you have complex data and content challenges, we can help.

For more information visit dataconversionlaboratory.com.