WordPress Importers: Defining a Schema

from above of ethnic scientist exploring details of aircraft using magnifier

While schemata are usually implemented using language-specific tools (eg, XML uses XML Schema, JSON uses JSON Schema), they largely use the same concepts when talking about data. This is rather helpful, we don’t need to make a decision on data formats before we can start thinking about how the data should be arranged.

Note: Since these concepts apply equally to all data formats, I’m using “WXR” in this post as shorthand for “the structured data section of whichever file format we ultimately use”, rather than specifically referring to the existing WXR format. 🙂

Why is a Schema Important?

It’s fair to ask why, if the WordPress Importers have survived this entire time without a formal schema, why would we need one now?

There are two major reasons why we haven’t needed one in the past:

  • WXR has remained largely unchanged in the last 10 years: there have been small additions or tweaks, but nothing significant. There’s been no need to keep track of changes.
  • WXR is currently very simple, with just a handful of basic elements. In a recent experiment, I was able to implement a JavaScript-based WXR generator in just a few days, entirely by referencing the Core implementation.

These reasons are also why it would help to implement a schema for the future:

  • As work on WXR proceeds, there will likely need to be substantial changes to what data is included: adding new fields, modifying existing fields, and removing redundant fields. Tracking these changes helps ensure any WXR implementations can stay in sync.
  • These changes will result in a more complex schema: relying on the source to re-implement it will become increasingly difficult and error-prone. Following Gutenberg’s lead, it’s likely that we’d want to provide official libraries in both PHP and JavaScript: keeping them in sync is best done from a source schema, rather than having one implementation copy the other.

Taking the time to plan out a schema now gives us a solid base to work from, and it allows for future changes to happen in a reliable fashion.

WXR for all of WordPress

With a well defined schema, we can start to expand what data will be included in a WXR file.

Media

Interestingly, many of the challenges around media files are less to do with WXR, and more to do with importer capabilities. The biggest headache is retrieving the actual files, which the importer currently handles by trying to retrieve the file from the remote server, as defined in the wp:attachment_url node. In context, this behaviour is understandable: 10+ years ago, personal internet connections were too slow to be moving media around, it was better to have the servers talk to each other. It’s a useful mechanism that we should keep as a fallback, but the more reliable solution is to include the media file with the export.

Plugins and Themes

There are two parts to plugins and themes: the code, and the content. Modern WordPress sites require plugins to function, and most are customised to suit their particular theme.

For exporting the code, I wonder if a tiered solution could be applied:

  • Anything from WordPress.org would just need their slug, since they can be re-downloaded during import. Particularly as WordPress continues to move towards an auto-updated future, modified versions of plugins and themes are explicitly not supported.
  • Third party plugins and themes would be given a filter to use, where they can provide a download URL that can be included in the export file.
  • Third party plugins/themes that don’t provide a download URL would either need to be skipped, or zipped up and included in the export file.

For exporting the content, WXR already includes custom post types, but doesn’t include custom settings, or custom tables. The former should be included automatically, and the latter would likely be handled by an appropriate action for the plugin to hook into.

Settings

There are a currently handful of special settings that are exported, but (as I just noted, particularly with plugins and themes being exported) this would likely need to be expanded to included most items in wp_options.

Users

Currently, the bare minimum information about users who’ve authored a post is included in the export. This would need to be expanded to include more user information, as well as users who aren’t post authors.

WXR for parts of WordPress

The modern use case for importers isn’t just to handle a full site, but to handle keeping sites in sync. For example, most news organisations will have a staging site (or even several layers of staging!) which is synchronised to production.

While it’s well outside the scope of this project to directly handle every one of these use cases, we should be able to provide the framework for organisations to build reliable platforms on. Exports should be repeatable, objects in the export should have unique identifiers, and the importer should be able to handle any subset of WXR.

WXR Beyond WordPress

Up until this point, we’ve really been talking about WordPress→WordPress migrations, but I think WXR is a useful format beyond that. Instead of just containing direct exports of the data from particular plugins, we could also allow it to contain “types” of data. This turns WXR into an intermediary language, exports can be created from any source, and imported into WordPress.

Let’s consider an example. Say we create a tool that can export a Shopify, Wix, or GoDaddy site to WXR, how would we represent an online store in the WXR file? We don’t want to export in the format that any particular plugin would use, since a WordPress Core tool shouldn’t be advantaging one plugin over others.

Instead, it would be better if we could format the data in a platform-agnostic way, which plugins could then implement support for. As luck would have it, Schema.org provides exactly the kind of data structure we could use here. It’s been actively maintained for nearly nine years, it supports a wide variety of data types, and is intentionally platform-agnostic.

Gazing into my crystal ball for a moment, I can certainly imagine a future where plugins could implement and declare support for importing certain data types. When handling such an import (assuming one of those plugins wasn’t already installed), the WordPress Importer could offer them as options during the import process. This kind of seamless integration allows WordPress to show that it offers the same kind of fully-featured site building experience that modern CMS services do.

Of course, reality is never quite as simple as crystal balls and magic wands make them out to be. We have to contend with services that provide incomplete or fragmented exports, and there are even services that deliberately don’t provide exports at all. In the next post, I’ll be writing about why we should address this problem, and how we might be able to go about it.

This post is part of a series, talking about the WordPress Importers, their history, where they are now, and where they could go in the future.