╔════════════════════════════════════════════════════════════════════╗
║ ISSUE: #02                                        DATE: 2026-01-06 ║
╟────────────────────────────────────────────────────────────────────╢
║                                                                    ║
║               ▀ █▄ █ ▀█▀ █▀▀ █▀█ █▀ █▀█ ▄▀█ █▀▀ █▀▀                ║
║               █ █ ▀█  █  ██▄ █▀▄ ▄█ █▀▀ █▀█ █▄▄ ██▄                ║
║                                                                    ║
║ AUTHOR: Irons, Sam                    PUBLISHER: Interspace Studio ║
║ TYPE: Newsletter                                   LANGUAGE: en-US ║
║ SUBJECTS: LLM Generation · Metadata Quality · SEO Best Practices   ║
║                                                                    ║
║ DESCRIPTION: Comprehensive guide exploring how content             ║
║ professionals can leverage LLMs for metadata generation, covering  ║
║ SEO best practices, Dublin Core mapping, and research-validated    ║
║ quality metrics including completeness, accuracy, and conformance  ║
║ to expectations.                                                   ║
╟────────────────────────────────────────────────────────────────────╢
║ Interspace can make mistakes. Consider checking important info.    ║
╚════════════════════════════════════════════════════════════════════╝

╔════════════════════════════════════════════════════════════════════╗
║ I N T R O D U C T I O N                                            ║
╚════════════════════════════════════════════════════════════════════╝

!! PLEASE FORWARD THIS TO WHOEVER YOU THINK MAY BE INTERESTED !!

Interspace is a content newsletter written by Sam Irons, founder of
Interspace Studio in Sydney, Australia. Interspace covers content
strategy, UX writing, technical writing, and content practices.

Interspace is also a community. You've probably received this from a
co-worker (if I didn't send it to you directly). Communities of
practice are essential to keeping disciplines resilient,
values-driven, and creative. If something I've written sparks a
discussion, then we're tending to and growing that community.
Welcome.

Happy new year! In this issue, I explore metadata. Can we use LLMs
(large language models) to generate useful metadata? This newsletter
covers three topics:

1. Search engine optimization basics and prompts
2. A strategy for modern site builders to transform Dublin Core
   metadata into JSON-LD and HTML markup
3. How to measure metadata quality and spot poor metadata using LLMs

This newsletter is for nerds. Word nerds, tech nerds, AI nerds. It's
packed with ideas, techniques, prompts, and experiments. I've
included metadata transformation code for Next.js and Astro. If you
work with structured content and site builders, this issue is for
you.

Subscribe to future issues, or view back issues:

<< http://interspacestudio.com.au/newsletter >>

Thank you for reading,
Lots of love,

Sam Irons
irons.sam@interspacestudio.com.au

!! PLEASE FORWARD THIS TO WHOEVER YOU THINK MAY BE INTERESTED !!

╔════════════════════════════════════════════════════════════════════╗
║ C O N T E N T S                                                    ║
╚════════════════════════════════════════════════════════════════════╝

1. Starting slow with SEO
2. Metadata in a vibecoded world
3. Measuring metadata quality
4. Discussion
5. This month's reading
6. Thanks!

╔════════════════════════════════════════════════════════════════════╗
║ S T A R T I N G   S L O W   W I T H   S E O                   [01] ║
╚════════════════════════════════════════════════════════════════════╝

Most content professionals working on websites or applications
probably get their first dose of metadata when they start thinking
about search engine optimization (SEO). It's always been a dark art,
constantly changing, with Google firmly in control. Google and other
search engines crawl pages and compare metadata against the content
of the page to validate it before ranking.
Throughout the constantly changing landscape of The Search
Algorithm, well-written and well-structured content - made for human
consumption - continues to top the search engine results page. LLMs,
trained on human-written content, also seem to prefer well-structured
and well-written content when searching and drawing from the web.

Naturally, one of the first problem spaces that excited content
professionals when ChatGPT began making waves was the generation of
metadata. Does LLM-generated metadata stack up to the expert curation
of content professionals and domain experts?

Researchers from Syracuse University and the University of Washington
ran a test with 26 educators, students, and other education
professionals. They were each given 15 lesson plans and their
associated metadata blocks, complete with Dublin Core and its
educational extension elements. Half were shown the lesson plan
first, then the metadata; the other half, vice versa.

You can read their paper in full here:

<< https://dl.acm.org/doi/epdf/10.1145/564376.564464 >>

They found that participants' satisfaction scores (whether the
metadata matched the content or not) varied only minimally between
the human-generated and machine-generated samples. And that study
predates LLMs entirely - the machine-generated metadata came from an
early NLP extraction tool.

LLMs are fantastic at summarizing. Given a page, any model off the
shelf can write you a good summary. New models, like GPT-5 and
Claude 4.5, are even more sophisticated. Their natural language
processing already considers industry best practices when they
"reason". Ask an LLM to write page titles and meta descriptions for
content, and it will return (most of the time) content that fits
within known character limits, using active calls to action and
benefit statements directed at specific audiences.

Still, here are a few tips and tricks to add to the countless pages
of advice already online:

* Give the LLM the role of an SEO expert.
* Tell it to follow industry best practices.
* Ask it to check its work.
* Use few-shot prompting: give examples of good and bad metadata.
* Ask it to optimize output for your target audience's questions.

The first three are pretty standard prompting practice these days,
based mostly on observation and anecdotal quality assessments. The
last two come from research-minded capitalism. Fidelity Investments,
Bangalore, published a paper describing their approach to building a
"smart data catalog" to improve metadata generation for data tables.

You can read their paper in full here:

<< https://arxiv.org/abs/2503.09003 >>

They found:

* Fine-tuning the dataset helps significantly. This means cleaning
  data and weighting it. For example, they eliminated audit columns.
  They weighted highly-ranked columns based on user popularity.
* Few-shot prompting helps. Providing examples of good and bad
  metadata improved quality.

The authors also found that making business glossaries and style
guides available to the LLM helped the models.

Put it all together and here's a little prompt for you to generate
SEO page titles and meta descriptions:

\\ INSTRUCTIONS
\\ You are an SEO expert. Generate a page title and meta
\\ description for the page content below.
\\
\\ First, analyze the content. Identify the target audience.
\\ Find the most important information for that audience.
\\
\\ Then, generate a page title and meta description. Follow all
\\ industry best practices.
\\
\\ RULES
\\ - Keep titles under 60 characters. Aim for 46 or less.
\\ - Write meta descriptions between 150-160 characters.
\\ - Use active voice in meta descriptions.
\\ - Include a call to action in meta descriptions.
\\
\\ PAGE TITLE EXAMPLES
\\ Poor: Our content management software
\\ Better (promises a benefit): Save 10 hours a week with our
\\ content management system
\\
\\ Poor: Guide to SEO/LMO
\\ Better (injects news): The 2025 Ultimate Guide to SEO and
\\ LMO
\\
\\ Poor: Writing Tips
\\ Better (includes numbers): 13 Easy Hacks for Better Business
\\ Writing
\\
\\ META DESCRIPTION EXAMPLES
\\ Poor (just a list of keywords): Sewing supplies, yarn,
\\ colored pencils, sewing machines, threads, bobbins, needles
\\ Better (specific and detailed): Get everything you need to
\\ sew your next garment. Open Monday-Friday 8-5pm, located in
\\ the Fashion District.
\\
\\ Poor (generic): Local news in Whoville, delivered to your
\\ doorstep. Find out what happened today.
\\ Better (specific and detailed): Upsetting the small town of
\\ Whoville, a local elderly man steals everyone's presents the
\\ night before an important event. Stay tuned for live updates
\\ on the matter.
\\
\\ Poor (too short): Mechanical pencil
\\ Better (specific and detailed): Self-sharpening mechanical
\\ pencil that autocorrects your penmanship. Includes 2B
\\ auto-replenishing lead. Available in both Vintage Pink and
\\ Schoolbus Yellow. Order 50+ pencils, get free shipping.
\\
\\ PAGE CONTENT
\\ {content}
\\
\\ GLOSSARY
\\ {glossary}
\\
\\ STYLE GUIDE
\\ {style guide}

These examples come from Google's documentation and my experience.
Tailor yours to what converts and what causes problems in your
content.

This prompt can automate part of your workflow. Write the content.
Focus on the page. Run the prompt during publishing. Better yet,
create an LLM agent. Set it to run automatically. Use your glossary
and style guide as knowledge sources.

╔════════════════════════════════════════════════════════════════════╗
║ M E T A D A T A   I N   A   V I B E C O D E D   W O R L D     [02] ║
╚════════════════════════════════════════════════════════════════════╝

I've noticed a shift in the last year or so, with the rise of
vibecoding. Most of these tools recommend and build sites using
Next.js, Astro, Gatsby, or similar static site generators (SSGs).
These systems use markdown to store content. Simple frontmatter
describes the markdown content. When the site builds, it transforms
frontmatter into structured JSON-LD and HTML markup.

If you haven't defined a metadata strategy and you use a modern SSG,
I recommend two approaches:

* Use Dublin Core standards for internal cataloging. Dublin Core is
  a metadata standard designed for "cross-disciplinary resource
  discovery."
* Map Dublin Core elements to Schema.org and OpenGraph when you
  build the site.

Dublin Core works well in markdown frontmatter. It stays
human-readable. Schema.org and OpenGraph markup help with SEO and
social media sharing.

Here are the basic frontmatter fields and mappings. They suit any
textual content. You can extend them later:

Dublin Core    Schema.org          OpenGraph
-----------    ----------          ---------
title          title               og:title
subject        keywords            -
description    description         og:description
type           type                og:type
coverage       temporalCoverage    -
creator        author              article:author
publisher      publisher           -
contributor    contributor         -
date           datePublished       article:published_time
identifier     url                 og:url
language       inLanguage          og:locale
image          -                   og:image
image-alt      -                   og:image:alt

Your developers can transform these into proper JSON-LD and HTML
markup.
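To make that concrete, here's a minimal, framework-agnostic sketch
in TypeScript. The DublinCoreFrontmatter shape and both function
names are my own illustration, not a published spec - adapt them to
however your frontmatter is actually parsed.

```
// Hypothetical shape of the Dublin Core frontmatter described above.
interface DublinCoreFrontmatter {
  title: string;
  subject: string;
  description: string;
  type: string;          // "Article", "TechArticle", "HowTo", ...
  coverage?: string;
  creator: string;
  publisher: string;
  contributor?: string;
  date: string;          // YYYY-MM-DD
  identifier: string;    // canonical URL of the published page
  language: string;      // e.g. "en-AU"
  image?: string;
  imageAlt?: string;
}

// Dublin Core -> Schema.org JSON-LD, following the table above.
// (Schema.org uses "headline" where Dublin Core uses "title".)
export function toJsonLd(fm: DublinCoreFrontmatter) {
  return {
    '@context': 'https://schema.org',
    '@type': fm.type,
    headline: fm.title,
    keywords: fm.subject,
    description: fm.description,
    temporalCoverage: fm.coverage,
    author: { '@type': 'Person', name: fm.creator },
    publisher: { '@type': 'Organization', name: fm.publisher },
    contributor: fm.contributor
      ? { '@type': 'Person', name: fm.contributor }
      : undefined,
    datePublished: fm.date,
    url: fm.identifier,
    inLanguage: fm.language,
  };
}

// Dublin Core -> OpenGraph <meta> properties for the <head>.
export function toOpenGraph(fm: DublinCoreFrontmatter): Record<string, string> {
  const og: Record<string, string> = {
    'og:title': fm.title,
    'og:description': fm.description,
    'og:type': 'article',
    'og:url': fm.identifier,
    'og:locale': fm.language.replace('-', '_'), // OpenGraph uses en_AU
    'article:author': fm.creator,
    'article:published_time': fm.date,
  };
  if (fm.image) og['og:image'] = fm.image;
  if (fm.imageAlt) og['og:image:alt'] = fm.imageAlt;
  return og;
}
```

Render toJsonLd() into a <script type="application/ld+json"> tag and
emit toOpenGraph() as <meta property="..."> tags in your document
head.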
As a New Year's gift, I asked Claude to generate templates for
Next.js and Astro. You can find them here:

<< https://github.com/ironssamuel/ssg-metadata-templates >>

With this in mind, we can improve our prompt. Instead of freeform
text, we generate structured metadata instances:

\\ INSTRUCTIONS
\\ You are an SEO expert. Generate metadata for the page content
\\ below.
\\
\\ First, analyze the content. Identify the target audience. Find
\\ the most important information for that audience.
\\
\\ Then, generate metadata as valid frontmatter. Follow this
\\ format:
\\
\\ ```
\\ title: "The name of the resource. 60 characters or less."
\\ subject: "Keywords or phrases describing the content."
\\ description: "Description of the content. 150-160 characters.
\\ Use active voice. Include a call to action."
\\ type: Article/TechArticle/HowTo
\\ coverage: "The spatial or temporal characteristics of the
\\ content."
\\ creator: // The person or organization who created the
\\ content.
\\ publisher: // The entity that made the resource available,
\\ such as a publishing house, university, or company.
\\ contributor: // A person or organization who made significant
\\ contributions but secondary to the creator (editor,
\\ transcriber, illustrator).
\\ date: YYYY-MM-DD - Creation or availability date.
\\ identifier: // String or number that uniquely identifies the
\\ resource. Examples: URLs, URNs, ISBNs.
\\ language: "Language code of the content."
\\ image: // public/images/path/to/open-graph/image
\\ image-alt: // Alternative description of Open Graph image
\\ ```
\\
\\ OUTPUT RULES
\\ - Return only the metadata frontmatter in plain text.
\\ - Complete all fields.
\\ - Wrap title, subject, coverage, creator, publisher,
\\ contributor, and image-alt in quotation marks.
\\ - Format date as YYYY-MM-DD with no quotation marks.
\\ - Format identifier as a JavaScript comment. Example:
\\ "identifier: // path-to-file/name"
\\ - Format image as a JavaScript comment. Example: "image: //
\\ public/docs/og-image.png"
\\ - Format image-alt as a JavaScript comment. Example:
\\ "image-alt: // 'Description of Open Graph image.'"
\\
\\ PAGE CONTENT
\\ {content}
\\
\\ GLOSSARY
\\ {glossary}
\\
\\ STYLE GUIDE
\\ {style guide}

I've commented out some fields that should be scrutinized closely,
like the identifier. If you can complete these or any other metadata
fields deterministically, do it. For example, the date field can be
completed from the file's last-updated date or the deploy time. The
creator field can be completed from the user data of the person
saving or committing the file. The identifier can be completed from
the file name or path. And so on - there's a small sketch of this at
the end of the section.

Again, this can be automated in your publishing workflow, which can
deliver serious efficiency gains at scale.

Can you trust an LLM to generate useful metadata at scale? To answer
that, we need to dive a little deeper into what makes for
high-quality metadata.
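First, though, here's that deterministic-completion sketch. It's a
rough Node/TypeScript helper, and it assumes a few things your setup
may not share: a git repository, markdown files under a content/
folder, and a simple path-to-URL convention. Adjust to taste.

```
import { statSync } from 'node:fs';
import { execSync } from 'node:child_process';

// Fill the fields we can derive deterministically, so the LLM only
// generates the fields that genuinely need judgement.
export function deterministicFields(filePath: string, siteUrl: string) {
  // date: the file's last-modified time (or swap in your deploy
  // timestamp if that better matches "availability")
  const date = statSync(filePath).mtime.toISOString().slice(0, 10);

  // creator: author of the last git commit that touched this file
  const creator = execSync(`git log -1 --format=%an -- "${filePath}"`)
    .toString()
    .trim();

  // identifier: canonical URL derived from the file path
  // (assumes content/foo/bar.md publishes at {siteUrl}/foo/bar)
  const slug = filePath.replace(/^content\//, '').replace(/\.md$/, '');
  const identifier = `${siteUrl}/${slug}`;

  return { date, creator, identifier };
}

// Usage (hypothetical paths):
// deterministicFields('content/metadata-guide.md', 'https://example.com');
```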
╔════════════════════════════════════════════════════════════════════╗
║ M E A S U R I N G   M E T A D A T A   Q U A L I T Y           [03] ║
╚════════════════════════════════════════════════════════════════════╝

Information scientists have studied auto-tagging and metadata
generation for years. Library science leads this work. In 2004,
Bruce and Hillman created a framework for Cornell Law School. They
suggested these metrics for evaluating metadata quality:

* Completeness - Metadata describes content as fully as possible.
* Accuracy - Metadata is as correct as possible.
* Conformance to expectations - Metadata fulfills user requirements
  for tasks like finding, identifying, and selecting resources.
* Logical consistency and coherence - Metadata follows domain
  standards for language and structure.
* Accessibility - Metadata is retrievable and understandable. I
  think "findability" describes this better.
* Timeliness - Metadata is current.
* Provenance - The source of the metadata is known and credible.

Read their full paper:

<< https://ecommons.cornell.edu/server/api/core/bitstreams/
   2b3e14fd-82a9-49ce-a8c4-9fd096010a08/content >>

These are great starting points. You can imagine giving users
metadata samples from a repository, then asking them to rate the
metadata on these dimensions.

In 2009, Ochoa (NYU) and Duval (KU Leuven) went further. They sought
to measure metadata quality programmatically. They created metrics
for evaluating each of Bruce and Hillman's characteristics.

Read their full paper:

<< https://www.researchgate.net/publication/220387581_
   Automatic_evaluation_of_metadata_quality_in_digital_
   libraries >>

Here's a brief summary:

* Completeness: A basic measure counts completed fields. A better
  measure weights important fields more heavily.
* Accuracy: A basic measure checks whether fields contain correct
  information (numbers in number fields) and sound data (no broken
  links). A better measure counts words shared between the metadata
  and the resource.
* Conformance to expectations: This measures how unique the metadata
  is compared to others in the set.
* Consistency: This checks whether metadata follows standards in
  structure (like Dublin Core) and language.
* Coherence: This checks whether all metadata fields describe the
  resource similarly.
* Findability/accessibility: A basic measure looks at explicit links
  (like "relates to" or "is a version of"). A better measure looks
  at implicit links by traversing a data graph.
* Timeliness: A basic measure checks currency (last updated date). A
  better measure compares currency to average quality over time.
* Provenance: This measures perceived trust.

The researchers created metrics to measure all these aspects
programmatically, using proxies to estimate some. But here's the
surprising finding: most of the metrics didn't correlate with how
humans rated quality.

They ran three studies to validate their findings. In the first
study, they tested whether their metrics matched human ratings. 22
researchers evaluated 20 metadata instances (10 manual, 10
auto-generated). They graded the metadata on a 7-point scale for
each parameter.

  "In general, the quality metrics do not correlate with their
  expected quality parameters as human [sic] rate them."

But one metric stood out. It influenced all quality measures AND
matched how humans rated other aspects.

  "If all the parameters are averaged, the final result could be
  mostly estimated (80%) by the Qtinfo metric in combination with
  the origin of the metadata."

Qtinfo is a conformance-to-expectations metric. It measures how well
metadata fulfills user requirements for finding, identifying,
selecting, and obtaining a resource. The researchers suggest that
usefulness depends on unique information in the metadata. Users
differentiate resources more easily when metadata instances aren't
similar.

They defined how to measure this programmatically. It would take
another full newsletter to explain the equations. In broad strokes,
the researchers measure the importance of a word in a document as
proportional to its frequency within that document, and inversely
proportional to how many documents in the corpus contain the word.
Or, more plainly: if a word appears frequently in one document but
rarely in others, it is more useful for finding, identifying, and
selecting that content.
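That's the classic TF-IDF weighting. Here's a toy sketch of the idea
in TypeScript - my own illustration of the textbook formula, not the
paper's exact Qtinfo definition:

```
// Toy TF-IDF: how distinctive is each term of one metadata record
// relative to the rest of the set?
export function tfidf(doc: string, corpus: string[]): Map<string, number> {
  const tokenize = (text: string) =>
    text.toLowerCase().match(/[a-z]+/g) ?? [];

  const terms = tokenize(doc);
  const scores = new Map<string, number>();
  if (terms.length === 0) return scores;

  for (const term of new Set(terms)) {
    // Term frequency: what share of this record the term makes up.
    const tf = terms.filter((t) => t === term).length / terms.length;

    // Inverse document frequency: discount terms that appear in
    // most of the other records.
    const docsWithTerm = corpus.filter((d) =>
      tokenize(d).includes(term)
    ).length;
    const idf = Math.log(corpus.length / (1 + docsWithTerm));

    scores.set(term, tf * idf);
  }
  return scores;
}

// A metadata record whose best-scoring terms are all near zero reads
// much like every other record in the set - a candidate for review.
```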
For content professionals, this means one thing: you can't evaluate
metadata in isolation. You must evaluate it against the complete
content set.

To create meaningful metadata, an LLM needs to evaluate all the
metadata instances in the set. That way it generates unique metadata
that differentiates resources rather than just summarizing documents.
Agentic design makes this easy. You can explicitly provide knowledge
sources to the LLM.

Ochoa and Duval's second study measured manual versus auto-generated
metadata using their metrics. Unsurprisingly:

  "In general, the metrics found that manual metadata set has higher
  quality than the automatic metadata set."

This happened before the LLM revolution. They used SAmgI, a simple
text analysis algorithm. One major difference: completeness. Human
experts filled more fields than the bot could. LLMs have changed
this significantly. They can help close the gap.

Even then, SAmgI generated more accurate metadata than humans. It
used text directly from the resource. Humans use synonyms. LLMs do
too.

What is interesting to content professionals is how all of this
scales. The researchers ran a third study looking at using
programmatic quality metrics as a "filter" to identify poor-quality
metadata entries. A few metrics proved extremely useful in this
regard:

* Completeness
* Weighted completeness
* Conformance to expectations

When you design evaluation systems, use these metrics to define
scorecards for measuring metadata across a set. Three deterministic
metrics that can be calculated programmatically feels lightweight to
me. It's certainly faster than regular sampling and surveying with
human panels.

╔════════════════════════════════════════════════════════════════════╗
║ D I S C U S S I O N                                           [04] ║
╚════════════════════════════════════════════════════════════════════╝

This investigation taught me a lot:

* LLMs generate metadata instances well when summarizing a page.
  They can generate structured metadata that follows industry
  standards like Dublin Core.
* Don't create metadata in a vacuum. Compare instances across the
  set. Generate unique, distinct metadata entries instead of just
  summarizing pages individually. Provide LLMs with glossaries and
  style guides for even better results.
* Markdown wins again. This makes me smile. Simple frontmatter
  transforms easily into structured markup during site generation.
  We can keep the author experience clean and templated. See
  examples in Astro and Next.js:
  << https://github.com/ironssamuel/ssg-metadata-templates >>
* Complete metadata fields automatically when you can.
* You can quantify and monitor metadata quality. Use quantitative
  frameworks to evaluate metadata across a set. Identify poorly
  formed entries.

I give LLMs a solid 5/6 on these approaches. Enterprises gain
massive efficiency by using LLM-generated metadata. They can spot
poorly formed metadata with quality checks and scripts. Freelancers
and contractors speed up workflows too. Use templated approaches to
static site generation.

Most importantly, better metadata helps users. It helps them find,
identify, and select the correct record, whether they search or
browse.
Isn't that what it's all about?

╔════════════════════════════════════════════════════════════════════╗
║ T H I S   M O N T H ' S   R E A D I N G                       [05] ║
╚════════════════════════════════════════════════════════════════════╝

Here's your monthly oracle reading from the Design Oracle!

YOUR JANUARY PRACTICE

This month invites you to build with structure and surrender. Map
the path forward with conviction. Break complexity into manageable
steps. But hold your plans lightly. The territory reveals itself as
you walk it. The best course adapts to discoveries along the way.
Planning serves your work. It shouldn't constrain it.

Progress comes through repetition with purpose. Each iteration
brings wisdom you can only learn through doing. Question each
version with care: Are you making meaningful improvements or just
making changes? Let scrutiny guide your refinement.

Trust the cyclical nature of growth. Excellence emerges through
layers. Each builds on what came before.

The cards suggest January is a month of disciplined flexibility.
Create the structure you need to move forward. Then refine through
cycles of making and questioning. The path to excellence is rarely
linear. Trust the spiral.

Get your own oracle deck. Drive personal insights. Get motivated
with design rituals. The Design Oracle is free in the public domain:

<< http://design-oracle.github.com/ >>

╔════════════════════════════════════════════════════════════════════╗
║ T H A N K S !                                                      ║
╚════════════════════════════════════════════════════════════════════╝

If you've read this far, thank you so much! I know these newsletters
are long, but hopefully they've given you something to think about
and discuss with your community of practice.

Subscribe to new issues or read back issues:

<< http://interspacestudio.com.au/newsletter >>

Check out my services and rates there too. I help businesses succeed
with their content. I consult, contract, coach, and speak. Reach out
to see what I can do for your business.

<< irons.sam@interspacestudio.com.au >>

Until next time!

!! PLEASE FORWARD THIS TO WHOEVER YOU THINK MAY BE INTERESTED !!