What is metadata and why does it matter to publishers?
“Metadata liberates us, liberates knowledge.” — David Weinberger, American author, technologist, and speaker.
Metadata is generally defined as “data about data.” In the context of journal publishing, metadata is the data about a journal article. Article metadata, which provides useful information about the article, can include items like journal title, article title, ISSN, keywords, abstract, author name(s), publication dates, content type and format, author ORCIDs, and more. Book metadata includes the data items that a reader can use to identify a book, such as the title, subtitle, price, date of publication, and ISBN (International Standard Book Number).
The National Information Standards Organization (NISO) defines three types of metadata:
- Descriptive metadata for “finding or understanding a resource”
- Administrative metadata for decoding, rendering, and long-term file management
- Structural metadata to describe “relationships of parts of resources to one another”
With over 4 million citable scientific documents published in 2020 alone, journal publishing is growing rapidly. Finding any given article among the millions being published each year is much like looking for a needle in a very, very large haystack. This is where metadata comes in. The purpose of metadata is to help categorize articles so that search engines are able to find them easily. This in turn improves the chances of an article reaching the intended audience.
The power of metadata in boosting article discoverability is well acknowledged. The need to scale and maintain metadata sustainably across the publishing supply chain is now also gaining recognition. This was discussed during the NISO Plus 2022 session “Going with the publishing (work)flow: moving metadata from the point of peer review,” where metadata was described as “a publisher’s best marketing tool for discoverability.”
But for metadata to fulfill its intended purpose, it needs to be machine-readable. Metadata is machine-readable when it is well organized and can be easily “read,” recognized, and understood by a computer. Search engines and scholarly indexers rely on this metadata to put relevant articles within readers’ reach.
Challenges around metadata
During the NISO Plus 2022 session “The ‘nested triangle’ of metadata supply for OA books,” the roadblocks to metadata supply were discussed in the context of open access (OA) books. Some of these challenges, which apply equally to journal publishers, are:
- Missing or inaccurate metadata
- Problems with converting metadata between different formats
- Merging metadata from multiple sources, particularly author information
- Issues with metadata reuse restrictions
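The first and third challenges in this list lend themselves to simple automated checks. The following is a minimal sketch in Python, assuming that metadata records are held as plain dictionaries. The field names, the REQUIRED_FIELDS list, and both sample records are illustrative assumptions, not part of any standard or real system.

```python
# Hypothetical required fields; a real list would come from the publisher's
# or indexer's own requirements.
REQUIRED_FIELDS = ["journal_title", "article_title", "issn", "authors", "publication_date"]


def find_missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in required if not record.get(field)]


def merge_records(primary, secondary):
    """Merge two records, preferring non-empty values from the primary source."""
    merged = dict(secondary)
    for key, value in primary.items():
        if value:  # keep the secondary value when the primary one is empty
            merged[key] = value
    return merged


# Illustrative records from two hypothetical sources.
submission_system = {
    "article_title": "An Example Article",
    "authors": ["Josiah Carberry"],
    "issn": "",
}
production_system = {
    "journal_title": "Journal of Examples",
    "issn": "1234-5679",
    "authors": [],
}

merged = merge_records(submission_system, production_system)
print(find_missing_fields(submission_system))  # ['journal_title', 'issn', 'publication_date']
print(find_missing_fields(merged))             # ['publication_date']
```

Merging fills the gaps that either source has on its own, but the publication date is still flagged because neither record supplies it. In practice, merging author information is harder than this sketch suggests, since names from different sources rarely match exactly.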
These challenges across the publishing supply chain weaken the positive impact that metadata can have on article searchability and discoverability. A sustainable solution to these challenges calls for a concerted effort from stakeholders across the publishing lifecycle—including authors, publishers, funders, and technology providers—to create the infrastructure that fosters rich, open metadata.
Rich metadata with JATS and BITS XML – What publishers can do
eXtensible Markup Language, or XML, is a file format and markup language that establishes a set of rules to structure documents in a format that is readable by both humans and machines. XML is now widely used to encode journal articles with high-quality, machine-readable metadata.
The Journal Article Tag Suite, or JATS, is an ANSI-approved XML format developed by NISO to describe scientific literature published online. This format involves a set of XML elements and attributes that describe the content of journal articles, including text and images, as well as other content types such as letters and editorials.
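To make this concrete, here is a minimal sketch in Python of how a program might read metadata from a JATS-style document using the standard library's xml.etree.ElementTree module. The XML fragment is heavily simplified and has not been validated against the JATS schema. The journal, article, ISSN, and author values are placeholders, and the function name extract_metadata is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative JATS-style fragment; all values are placeholders.
JATS_SAMPLE = """\
<article article-type="research-article">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Examples</journal-title>
      </journal-title-group>
      <issn pub-type="epub">1234-5679</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An Example Article</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-1825-0097</contrib-id>
          <name><surname>Carberry</surname><given-names>Josiah</given-names></name>
        </contrib>
      </contrib-group>
      <pub-date date-type="pub"><year>2022</year></pub-date>
      <kwd-group>
        <kwd>metadata</kwd>
        <kwd>discoverability</kwd>
      </kwd-group>
    </article-meta>
  </front>
</article>
"""


def extract_metadata(xml_text):
    """Collect a few descriptive metadata fields from a JATS-style document."""
    root = ET.fromstring(xml_text)
    authors = []
    for contrib in root.iter("contrib"):
        if contrib.get("contrib-type") != "author":
            continue  # skip editors and other contributor roles
        authors.append({
            "surname": contrib.findtext("name/surname"),
            "given_names": contrib.findtext("name/given-names"),
            "orcid": contrib.findtext("contrib-id"),
        })
    return {
        "journal_title": root.findtext(".//journal-title"),
        "issn": root.findtext(".//issn"),
        "article_title": root.findtext(".//article-title"),
        "year": root.findtext(".//pub-date/year"),
        "keywords": [kwd.text for kwd in root.iter("kwd")],
        "authors": authors,
    }


metadata = extract_metadata(JATS_SAMPLE)
print(metadata["article_title"])
print(metadata["keywords"])
```

Because every item sits in a named element, the program does not have to guess which string is the title and which is the author. This is what makes tagged metadata usable by search engines and indexers.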
Many journal publishers are revamping their production workflows with tools that use AI/ML techniques and an XML-first approach to convert manuscript submissions to JATS XML. This ensures that published articles carry rich, machine-readable metadata that boosts article discoverability and, ultimately, readership and reach.
While JATS is being widely adopted by journals, the XML document model for STM books tells a different story.
The Book Interchange Tag Suite (BITS), which is based on JATS, is “a named collection of XML elements and attributes for describing the structural and semantic content of books and book components, as well as a packaging element for interchange of book parts.”
The compatibility between BITS and JATS means that publishers that publish both books and journals can do so using the same system. Despite the utility of BITS, the adoption of XML technologies is much lower among book publishers than among journal publishers. Considering the longer publishing cycles and the lower frequency of publication, the pace of change is, naturally, different in book publishing.
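One practical effect of this compatibility is that code written for one format can often be reused for the other. The sketch below, in Python, assumes two heavily simplified fragments that have not been validated against either schema and that contain placeholder values. The helper author_names is hypothetical. It works on both fragments because both tag contributors with the same contrib-group, contrib, and name elements.

```python
import xml.etree.ElementTree as ET

# Simplified BITS-style fragment; all values are placeholders.
BITS_SAMPLE = """\
<book>
  <book-meta>
    <book-title-group>
      <book-title>An Example Book</book-title>
    </book-title-group>
    <contrib-group>
      <contrib contrib-type="author">
        <name><surname>Carberry</surname><given-names>Josiah</given-names></name>
      </contrib>
    </contrib-group>
    <isbn>978-3-16-148410-0</isbn>
  </book-meta>
</book>
"""

# Simplified JATS-style fragment; all values are placeholders.
JATS_SAMPLE = """\
<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>An Example Article</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Carberry</surname><given-names>Josiah</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>
"""


def author_names(xml_text):
    """Return author names as 'Given Surname' from a JATS- or BITS-style document."""
    root = ET.fromstring(xml_text)
    names = []
    for contrib in root.iter("contrib"):
        if contrib.get("contrib-type") == "author":
            given = contrib.findtext("name/given-names")
            surname = contrib.findtext("name/surname")
            names.append(f"{given} {surname}")
    return names


print(author_names(JATS_SAMPLE))  # ['Josiah Carberry']
print(author_names(BITS_SAMPLE))  # ['Josiah Carberry']
```

Only the format-specific parts, such as the book title and ISBN in BITS or the journal title and ISSN in JATS, would need separate handling.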
Managing metadata can be a time-consuming and intensive process for a publisher, particularly if the process is not supported by some level of automation. It is nonetheless a worthwhile endeavor that provides long-term benefits for a publication and its content. The key to growth and success in this digitally driven era is to leverage the power of metadata to surface, discover, curate, and collect scholarship.