
WXR to CSV Converter

The WXR to CSV Converter is a Python utility that transforms WordPress eXtended RSS (WXR) export files into CSV format. This tool simplifies data migration, analysis, and archival processes for WordPress content by converting XML-based exports into a universally accessible tabular format.

Purpose and Use Cases

Data Migration

WordPress WXR exports are XML-based and can be challenging to work with programmatically. Converting to CSV enables:

  • Easier data import into new platforms
  • Database migration preparation
  • Content management system transitions
  • Bulk data processing

Content Analysis

CSV format facilitates analysis through:

  • Spreadsheet applications (Excel, Google Sheets)
  • Data analysis tools (Python pandas, R)
  • Business intelligence platforms
  • Custom analysis scripts

Archival and Backup

CSV provides a durable archival format:

  • Human-readable without special tools
  • Platform-independent
  • Easily versioned in source control
  • Long-term data preservation

Content Audit

Organizations can use CSV exports to:

  • Review content inventory
  • Identify outdated posts
  • Analyze authorship patterns
  • Plan content strategy

Features

Comprehensive Data Extraction

The converter extracts all major WordPress content fields:

Post Metadata

  • Post ID (WordPress internal identifier)
  • Title
  • Post type (post, page, custom post types)
  • Publication status (publish, draft, private, pending)
  • Post date and modification date
  • Creator (author username)
  • URL and slug

Content Fields

  • Full post content (HTML preserved)
  • Post excerpt
  • Description

Taxonomy

  • Categories (semicolon-separated)
  • Tags (semicolon-separated)

Settings

  • Comment status (open, closed)
  • Ping status (open, closed)
  • Sticky post flag
  • Password protection

Custom Fields

  • All custom field data
  • Serialized as JSON for complex values

Hierarchical Data

  • Parent post ID (for pages and hierarchical types)
  • Menu order

HTML Content Handling

WordPress content often contains HTML markup. The converter:

  • Preserves HTML formatting in content fields
  • Maintains paragraph structure
  • Retains links and formatting
  • Handles special characters properly

Character Encoding

Proper handling of international content:

  • UTF-8 encoding by default
  • Preserves Unicode characters
  • Handles special symbols
  • International language support

Flexible Post Type Filtering

Control which content types to export:

  • Export all post types
  • Filter specific types (posts only, pages only)
  • Include custom post types
  • Exclude specific types

Command-Line Interface

Simple, scriptable command-line tool:

  • Standard Unix command patterns
  • Pipeline compatible
  • Automation friendly
  • Clear error messages

Zero External Dependencies

Built entirely on Python's standard library:

  • No pip install required; a Python interpreter is the only prerequisite
  • Minimal installation footprint
  • Reduces dependency conflicts
  • Easy deployment

CSV Output Structure

Column Definitions

The generated CSV includes these columns:

Identification

post_id

WordPress internal post identifier. Unique numeric value used to track posts across the WordPress database. Useful for maintaining relationships during migration.

post_type

Content type classification:

  • post: Blog posts
  • page: Static pages
  • attachment: Media files
  • Custom post types (defined by themes/plugins)

Content

title

The post or page title as it appears to users.

content

Full post content including HTML markup. May contain:

  • Paragraphs and text formatting
  • Images and media embeds
  • Shortcodes
  • HTML blocks

excerpt

Short summary or preview text. May be manually written or automatically generated by WordPress.

description

Contents of the item's RSS description element. WordPress exports often leave this empty.

Publication

status

Current publication state:

  • publish: Publicly visible
  • draft: Work in progress
  • pending: Awaiting review
  • private: Visible only to authorized users
  • trash: Marked for deletion

post_date

Date and time when the content was first published. Format: YYYY-MM-DD HH:MM:SS

post_date_gmt

Publication date in GMT timezone.

post_modified

Date and time of last modification.

post_modified_gmt

Modification date in GMT timezone.

pub_date

RSS publication date (may differ from post_date).
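
The date columns can be turned into Python datetime objects with the standard library. The following is a minimal sketch, assuming the converter writes post_date and its siblings in the YYYY-MM-DD HH:MM:SS format shown above and passes pub_date through in the RFC 822 style that RSS uses. The helper names are illustrative and not part of the converter. WordPress stores 0000-00-00 00:00:00 as the GMT date of content that has never been published, so the sketch maps that value to None.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

# Assumed layout of post_date, post_date_gmt, post_modified, post_modified_gmt
WP_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_wp_date(value):
    """Parse a post_date-style cell; return None for empty or zero dates."""
    if not value or value.startswith("0000-00-00"):
        return None
    return datetime.strptime(value, WP_DATE_FORMAT)


def parse_pub_date(value):
    """Parse an RFC 822 pub_date cell; return None if it cannot be parsed."""
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None


print(parse_wp_date("2024-03-15 09:30:00"))
print(parse_pub_date("Fri, 15 Mar 2024 09:30:00 +0000"))
```

The values returned by parse_wp_date carry no timezone, so it is safest to compare the _gmt columns with each other rather than mixing them with local-time columns.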

Authorship

creator

WordPress username of the content creator. Useful for:

  • Attribution
  • Author-based filtering
  • Workload analysis

URL and Routing

link

Full URL where the content is accessible.

post_name

URL slug (the last part of the permalink). Used for:

  • SEO-friendly URLs
  • Content identification
  • URL structure

Taxonomy

categories

All assigned categories, semicolon-separated. Example:

Technology;Web Development;WordPress

tags

All assigned tags, semicolon-separated. Example:

PHP;MySQL;CMS;blogging
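
When reading the CSV back, the semicolon-separated cells can be split into lists. This is a small sketch; split_terms is a hypothetical helper, not part of the converter, and it assumes that term names themselves contain no semicolons.

```python
def split_terms(cell):
    """Split a semicolon-separated taxonomy cell into a list of term names."""
    if not cell:
        return []
    # Strip stray whitespace and drop empty fragments such as a trailing ';'
    return [term.strip() for term in cell.split(";") if term.strip()]


print(split_terms("Technology;Web Development;WordPress"))
print(split_terms("PHP;MySQL;CMS;blogging"))
```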

Settings

comment_status

Whether comments are allowed:

  • open: Comments enabled
  • closed: Comments disabled

ping_status

Whether pingbacks/trackbacks are allowed:

  • open: Pings enabled
  • closed: Pings disabled

is_sticky

Boolean flag indicating if post is pinned to the top of blog archives.

post_password

Password required to view content (if password-protected).

Hierarchy

post_parent

ID of parent post (for hierarchical content like pages). Zero indicates top-level content.

menu_order

Manual sort order for pages and custom post types.

Extended Data

custom_fields

All custom field data serialized as JSON. WordPress plugins and themes often store additional data here.
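
To work with this column downstream, the JSON can be decoded per row. The sketch below assumes each cell holds a JSON object keyed by custom field name; the exact shape the converter emits is not specified here, so treat that as an assumption. parse_custom_fields is an illustrative helper.

```python
import json


def parse_custom_fields(cell):
    """Decode a custom_fields cell; return {} for empty or invalid JSON."""
    if not cell:
        return {}
    try:
        value = json.loads(cell)
    except json.JSONDecodeError:
        return {}
    # Assumption: the cell is a JSON object; anything else is ignored
    return value if isinstance(value, dict) else {}


fields = parse_custom_fields('{"_edit_last": "1", "subtitle": "Hello"}')
print(fields.get("subtitle"))
```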

Usage

Basic Conversion

Convert an entire WordPress export:

python wxr_to_csv.py wordpress_export.xml

This creates wordpress_export.csv in the same directory.

Specify Output File

Control the output filename:

python wxr_to_csv.py export.xml -o output.csv

Filter Post Types

Export only specific content types:

# Only blog posts
python wxr_to_csv.py export.xml -t post

# Only pages
python wxr_to_csv.py export.xml -t page

# Posts and pages
python wxr_to_csv.py export.xml -t post page

# Include custom post type
python wxr_to_csv.py export.xml -t post page product
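
For readers wrapping the tool in their own scripts, the option shapes used above can be reproduced with argparse. This is only a sketch of how such a command line could be declared, not the converter's actual source; the long option names --output and --types and the helper default_output_path are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser():
    """Declare a CLI with the same shape as the commands shown above."""
    parser = argparse.ArgumentParser(prog="wxr_to_csv.py")
    parser.add_argument("input_file", help="path to the WXR export (.xml)")
    parser.add_argument("-o", "--output", help="CSV path to write")
    # nargs="+" lets -t take one or more post types: -t post page product
    parser.add_argument("-t", "--types", nargs="+", metavar="TYPE",
                        help="post types to include (default: all)")
    return parser


def default_output_path(input_file):
    """Same directory and base name as the input, with a .csv suffix."""
    return Path(input_file).with_suffix(".csv")


args = build_parser().parse_args(["export.xml", "-t", "post", "page"])
output = args.output or default_output_path(args.input_file)
print(args.types, output)
```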

Autorun Script

The included autorun script provides a guided experience:

python autorun.py

This interactive script:

  • Prompts for input file
  • Suggests output filename
  • Offers post type selection
  • Shows progress
  • Confirms completion

Python API

Programmatic Usage

Import and use in your own Python scripts:

from wxr_to_csv import WXRToCSVConverter

converter = WXRToCSVConverter()
converter.convert_to_csv(
    input_file='export.xml',
    output_file='output.csv',
    post_types=['post', 'page']
)

Custom Processing

Extend the converter for custom needs:

class CustomConverter(WXRToCSVConverter):
    def process_item(self, item):
        # Run the standard extraction first
        processed = super().process_item(item)
        # Add a custom column; extract_custom_data is a method you define on this subclass
        processed['custom_field'] = self.extract_custom_data(item)
        return processed

Getting WordPress Export Files

Export Process

  1. Log into WordPress admin dashboard
  2. Navigate to Tools → Export
  3. Select content to export:
    • All content: Everything
    • Posts: Blog posts only
    • Pages: Static pages only
    • Media: Attachments
  4. Click Download Export File
  5. Save the .xml file

Export Options

WordPress provides several export scopes:

All Content

Exports everything:

  • Posts and pages
  • Comments
  • Custom fields
  • Categories and tags
  • Custom post types
  • Navigation menus

Selective Export

Choose specific content:

  • Date ranges
  • Author filtering
  • Status filtering
  • Category filtering

Technical Details

XML Parsing

The converter uses Python's xml.etree.ElementTree to:

  • Parse WXR XML structure
  • Navigate WordPress-specific namespaces
  • Extract nested data
  • Handle malformed XML gracefully
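
To illustrate the namespace handling, here is a minimal sketch of reading WXR items with ElementTree. It is not the converter's own code. The namespace URIs are the ones used by WXR 1.2 exports; older exports declare a different version in the wp URI, so treat the mapping as an assumption to check against your file. SAMPLE and extract_items are illustrative names.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map for the lookups below (WXR 1.2 assumed)
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

SAMPLE = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <title>Hello world</title>
      <dc:creator>admin</dc:creator>
      <content:encoded><![CDATA[<p>Welcome.</p>]]></content:encoded>
      <wp:post_id>1</wp:post_id>
      <wp:post_type>post</wp:post_type>
      <wp:status>publish</wp:status>
    </item>
  </channel>
</rss>"""


def extract_items(xml_text):
    """Return one dict per <item>, reading namespaced children via NS."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iterfind("./channel/item"):
        rows.append({
            "post_id": item.findtext("wp:post_id", default="", namespaces=NS),
            "post_type": item.findtext("wp:post_type", default="", namespaces=NS),
            "status": item.findtext("wp:status", default="", namespaces=NS),
            "title": item.findtext("title", default=""),
            "creator": item.findtext("dc:creator", default="", namespaces=NS),
            "content": item.findtext("content:encoded", default="", namespaces=NS),
        })
    return rows


for row in extract_items(SAMPLE):
    print(row["post_id"], row["title"], row["creator"])
```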

CSV Generation

Uses Python's csv module with:

  • Proper quote handling
  • Unicode support
  • Configurable delimiters
  • Excel compatibility
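
The quoting behaviour matters because post content routinely contains commas, double quotes, and line breaks. The sketch below shows the csv module handling such values; rows_to_csv is a hypothetical helper written for this example, not the converter's API.

```python
import csv
import io


def rows_to_csv(rows, fieldnames, delimiter=","):
    """Serialize dict rows to CSV text, quoting fields only where needed."""
    buffer = io.StringIO(newline="")  # keep the writer's own line endings
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        delimiter=delimiter,
        quoting=csv.QUOTE_MINIMAL,
        extrasaction="ignore",  # silently skip keys that are not columns
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [{"title": 'Say "hi"', "content": "<p>a, b</p>\n<p>c</p>"}]
text = rows_to_csv(rows, ["title", "content"])

# Reading the text back recovers the original values, line break included
restored = list(csv.DictReader(io.StringIO(text, newline="")))
print(restored[0]["content"] == rows[0]["content"])
```

When writing to a real file, open it with newline="" so the csv module controls line endings. Choosing encoding="utf-8-sig" adds the byte order mark that Excel uses to detect UTF-8.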

Memory Management

For large WordPress sites:

  • Streaming XML parsing
  • Incremental CSV writing
  • Memory-efficient processing
  • Progress indicators
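
One common way to stream a large XML file with the standard library is ElementTree's iterparse, clearing each item once it has been processed. The following sketch illustrates that technique under the assumption that items are plain, un-namespaced item elements, as in RSS; it is not necessarily how the converter is implemented.

```python
import io
import xml.etree.ElementTree as ET


def iter_items(source):
    """Yield each <item> element from a file path or binary file object.

    The element is cleared after the caller has finished with it, so read
    everything you need from an item before advancing the iterator.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            yield elem
            elem.clear()  # release the item's children and text


sample = (b"<rss><channel>"
          b"<item><title>First</title></item>"
          b"<item><title>Second</title></item>"
          b"</channel></rss>")

for item in iter_items(io.BytesIO(sample)):
    print(item.findtext("title"))
```

Each row can be written to the CSV inside the loop, which keeps memory use roughly proportional to the largest single post rather than to the whole export.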

Error Handling

Robust error management:

  • File not found errors
  • XML parsing errors
  • Encoding issues
  • Write permission errors
  • Missing required fields
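
In a wrapper script, the first few failure modes can be told apart by exception type. This is a minimal sketch; safe_parse and the exact message wording are assumptions for illustration, not the converter's real output.

```python
import io
import xml.etree.ElementTree as ET


def safe_parse(source):
    """Return (tree, None) on success or (None, message) on a known failure."""
    try:
        return ET.parse(source), None
    except FileNotFoundError:
        return None, f"Error: File not found: {source}"
    except PermissionError:
        return None, f"Error: Permission denied: {source}"
    except ET.ParseError as exc:
        # Truncated downloads and non-XML files end up here
        return None, f"Error: Unable to parse XML file ({exc})"


# A truncated document triggers the parse-error branch
tree, error = safe_parse(io.BytesIO(b"<rss><channel>"))
print(error)
```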

Troubleshooting

Common Issues

Error Parsing WXR File

Symptoms

Error: Unable to parse XML file

Causes

  • Corrupted export file
  • Incomplete download
  • Not a valid WXR file
  • Unsupported WordPress version

Solutions

  • Re-export from WordPress
  • Verify file integrity
  • Check file size matches export
  • Try a different browser for download

No Posts Found

Symptoms

Warning: No posts found in export

Causes

  • Wrong post types specified
  • Empty WordPress site
  • Export filtered to exclude content

Solutions

  • Check post type names
  • Try without -t filter
  • Verify WordPress site has content
  • Re-export without filters

Encoding Issues

Symptoms

  • Strange characters in output
  • Corrupted international text
  • Box symbols replacing characters

Causes

  • Incorrect encoding detection
  • Non-UTF-8 WordPress database
  • Terminal encoding mismatch

Solutions

  • Ensure WordPress database uses UTF-8
  • Save CSV with UTF-8 encoding
  • Check terminal locale settings
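
Garbled characters that appear only in Excel are often caused by a UTF-8 file that lacks a byte order mark. The sketch below adds one to CSV bytes that are already valid UTF-8; the helper names are illustrative, and whether the converter writes a BOM itself is not stated here.

```python
import codecs


def has_utf8_bom(data):
    """True if the bytes already start with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def with_utf8_bom(data):
    """Prefix UTF-8 CSV bytes with a BOM unless one is already present."""
    return data if has_utf8_bom(data) else codecs.BOM_UTF8 + data


csv_bytes = "title\r\nCafé\r\n".encode("utf-8")
print(has_utf8_bom(csv_bytes), has_utf8_bom(with_utf8_bom(csv_bytes)))
```

To apply it to a file, read the bytes with open(path, "rb"), pass them through with_utf8_bom, and write the result to a new file.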

Large File Handling

For WordPress sites with thousands of posts:

Memory Issues

If the converter runs out of memory:

# Process in chunks
converter.set_chunk_size(1000)
converter.convert_to_csv('large_export.xml', 'output.csv')

Performance

Speed up processing:

  • Export specific date ranges
  • Filter by post type
  • Split large exports
  • Use SSD storage

Data Processing Examples

Excel Import

Open in Microsoft Excel:

  1. Open Excel
  2. Go to Data → From Text/CSV
  3. Select the CSV file
  4. Choose UTF-8 encoding
  5. Verify data preview
  6. Click Load

Google Sheets Import

  1. Open Google Sheets
  2. File → Import
  3. Upload CSV file
  4. Choose Replace current sheet or Insert new sheet
  5. Select Comma separator
  6. Click Import data

Python Pandas Analysis

import pandas as pd

# Read CSV
df = pd.read_csv('wordpress_export.csv')

# Posts per author
author_counts = df['creator'].value_counts()

# Posts per month ('M' means month-end; pandas 2.2 and later prefer the alias 'ME')
df['post_date'] = pd.to_datetime(df['post_date'])
monthly_posts = df.resample('M', on='post_date').size()

# Find posts without categories
uncategorized = df[df['categories'].isna()]

SQL Import

Import into database:

-- MySQL example
LOAD DATA INFILE 'wordpress_export.csv'
INTO TABLE posts
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n' -- use '\r\n' if the CSV has CRLF line endings (the default for Python's csv module)
IGNORE 1 ROWS;

Advanced Features

Custom Column Mapping

Customize which fields to export:

converter = WXRToCSVConverter()
converter.set_column_mapping({
    'post_id': 'ID',
    'title': 'Title',
    'content': 'Body',
    'post_date': 'Date'
})

Field Transformation

Apply transformations during export:

import re

def clean_html(html):
    # Remove HTML tags (a simple regex, not a full HTML parser)
    return re.sub(r'<[^>]+>', '', html)

converter.add_transform('content', clean_html)

Filtering

Exclude specific content:

def filter_drafts(item):
    return item['status'] != 'draft'

converter.add_filter(filter_drafts)

Best Practices

Pre-Conversion

  • Backup your WordPress site before exporting
  • Clean up draft posts and spam if not needed
  • Verify export completed successfully
  • Check file size is reasonable

During Conversion

  • Test with small exports first
  • Verify post type names are correct
  • Check terminal output for errors
  • Monitor progress on large files

Post-Conversion

  • Verify row count matches expectations
  • Spot-check content accuracy
  • Test opening in target application
  • Keep original WXR file as backup
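
Counting rows with a line counter such as wc -l overstates the total, because content cells can span several lines. A small sketch that counts records with the csv module instead; count_csv_rows is a hypothetical helper for this check.

```python
import csv
import io


def count_csv_rows(handle):
    """Count data rows (excluding the header) in an open CSV text stream."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row if there is one
    return sum(1 for _ in reader)


sample = 'post_id,content\r\n1,"<p>a</p>\n<p>b</p>"\r\n2,plain\r\n'
print(count_csv_rows(io.StringIO(sample, newline="")))
```

For a real file, open it with open(path, newline="", encoding="utf-8"). Very long content cells can exceed the csv module's default field size limit of 131072 characters, which csv.field_size_limit can raise.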

Security Considerations

Sensitive Data

WordPress exports may contain:

  • User email addresses
  • Private posts
  • Password-protected content
  • Custom fields with credentials
  • Personal information

Recommendations

  • Review exported data before sharing
  • Remove sensitive columns if needed
  • Encrypt CSV files for storage
  • Control access to export files
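
Sensitive columns can be removed from an existing CSV before it is shared. The sketch below uses only the standard library; drop_columns is an illustrative helper, and the column names are taken from the column definitions in this guide.

```python
import csv
import io


def drop_columns(src, dst, columns):
    """Copy CSV rows from src to dst, omitting the named columns."""
    reader = csv.DictReader(src)
    kept = [name for name in (reader.fieldnames or []) if name not in columns]
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # keys outside `kept` are ignored


source = io.StringIO(
    "post_id,title,post_password,custom_fields\r\n"
    '1,Hello,secret,"{""api_key"": ""abc""}"\r\n',
    newline="",
)
cleaned = io.StringIO(newline="")
drop_columns(source, cleaned, {"post_password", "custom_fields"})
print(cleaned.getvalue())
```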

Data Sanitization

Clean data before migration:

# Reduce email-style usernames to the part before the @
converter.add_transform('creator', lambda x: x.split('@')[0])

# Strip private posts
converter.add_filter(lambda x: x['status'] != 'private')

Performance Optimization

Benchmarks

Typical performance:

  • Small site (100 posts): < 1 second
  • Medium site (1000 posts): 2-5 seconds
  • Large site (10000 posts): 20-60 seconds

Optimization Tips

  • Use SSD storage
  • Filter unnecessary post types
  • Process during off-peak hours
  • Close other applications

Integration Examples

Automated Backups

#!/bin/bash
# Automated WordPress export and conversion
set -euo pipefail

STAMP=$(date +%Y%m%d)

# Export WordPress with WP-CLI, fixing the filename so the next step can find it
wp export --dir=/tmp/ --filename_format=wordpress.xml

# Convert to CSV
python wxr_to_csv.py /tmp/wordpress.xml -o "/backups/${STAMP}.csv"

# Upload to cloud storage
aws s3 cp "/backups/${STAMP}.csv" s3://my-bucket/backups/

CI/CD Pipeline

# GitHub Actions example
name: Export WordPress Content
on:
  schedule:
    - cron: '0 0 * * 0' # Weekly

jobs:
  export:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'
      - name: Convert Export
        run: python wxr_to_csv.py export.xml
      - name: Upload Artifact
        uses: actions/upload-artifact@v4
        with:
          name: wordpress-csv
          path: export.csv

License

The WXR to CSV Converter is released under the MIT License, allowing free use, modification, and distribution.

Contributing

Contributions welcome:

  • Bug fixes
  • Feature enhancements
  • Documentation improvements
  • Test coverage

Support

  • GitHub Issues: Bug reports and feature requests
  • Documentation: This comprehensive guide
  • Community: WordPress forums and developer communities