WXR to CSV Converter
The WXR to CSV Converter is a Python utility that transforms WordPress eXtended RSS (WXR) export files into CSV format. This tool simplifies data migration, analysis, and archival processes for WordPress content by converting XML-based exports into a universally accessible tabular format.
Purpose and Use Cases
Data Migration
WordPress WXR exports are XML-based and can be challenging to work with programmatically. Converting to CSV enables:
- Easier data import into new platforms
- Database migration preparation
- Content management system transitions
- Bulk data processing
Content Analysis
CSV format facilitates analysis through:
- Spreadsheet applications (Excel, Google Sheets)
- Data analysis tools (Python pandas, R)
- Business intelligence platforms
- Custom analysis scripts
Archival and Backup
CSV provides a durable archival format:
- Human-readable without special tools
- Platform-independent
- Easily versioned in source control
- Long-term data preservation
Content Audit
Organizations can use CSV exports to:
- Review content inventory
- Identify outdated posts
- Analyze authorship patterns
- Plan content strategy
Features
Comprehensive Data Extraction
The converter extracts all major WordPress content fields:
Post Metadata
- Post ID (WordPress internal identifier)
- Title
- Post type (post, page, custom post types)
- Publication status (publish, draft, private, pending)
- Post date and modification date
- Creator (author username)
- URL and slug
Content Fields
- Full post content (HTML preserved)
- Post excerpt
- Description
Taxonomy
- Categories (semicolon-separated)
- Tags (semicolon-separated)
Settings
- Comment status (open, closed)
- Ping status (open, closed)
- Sticky post flag
- Password protection
Custom Fields
- All custom field data
- Serialized as JSON for complex values
Hierarchical Data
- Parent post ID (for pages and hierarchical types)
- Menu order
HTML Content Handling
WordPress content often contains HTML markup. The converter:
- Preserves HTML formatting in content fields
- Maintains paragraph structure
- Retains links and formatting
- Handles special characters properly
Character Encoding
Proper handling of international content:
- UTF-8 encoding by default
- Preserves Unicode characters
- Handles special symbols
- International language support
Flexible Post Type Filtering
Control which content types to export:
- Export all post types
- Filter specific types (posts only, pages only)
- Include custom post types
- Exclude specific types
Command-Line Interface
Simple, scriptable command-line tool:
- Standard Unix command patterns
- Pipeline compatible
- Automation friendly
- Clear error messages
Zero External Dependencies
Built entirely on Python's standard library:
- No pip install required; only Python itself is needed
- Minimal installation footprint
- Reduces dependency conflicts
- Easy deployment
CSV Output Structure
Column Definitions
The generated CSV includes these columns:
Identification
post_id
WordPress internal post identifier. Unique numeric value used to track posts across the WordPress database. Useful for maintaining relationships during migration.
post_type
Content type classification:
- post: Blog posts
- page: Static pages
- attachment: Media files
- Custom post types (defined by themes/plugins)
Content
title
The post or page title as it appears to users.
content
Full post content including HTML markup. May contain:
- Paragraphs and text formatting
- Images and media embeds
- Shortcodes
- HTML blocks
excerpt
Short summary or preview text. May be manually written or automatically generated by WordPress.
description
Contents of the item's RSS description element. WordPress exports usually leave this empty.
Publication
status
Current publication state:
- publish: Publicly visible
- draft: Work in progress
- pending: Awaiting review
- private: Visible only to authorized users
- trash: Marked for deletion
post_date
Date and time when the content was first published. Format: YYYY-MM-DD HH:MM:SS
post_date_gmt
Publication date in GMT timezone.
post_modified
Date and time of last modification.
post_modified_gmt
Modification date in GMT timezone.
pub_date
RSS publication date (may differ from post_date).
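The date columns above can be turned into Python datetime objects with the standard library alone. The sketch below assumes that post_date and post_date_gmt use the YYYY-MM-DD HH:MM:SS format described above, and that pub_date is passed through in the RFC 822 style that WordPress writes into its RSS pubDate element; the function names are illustrative and not part of the converter.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_post_date(value):
    """Parse 'YYYY-MM-DD HH:MM:SS' into a naive datetime (site-local time)."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


def parse_post_date_gmt(value):
    """Parse the *_gmt columns and attach the UTC timezone explicitly."""
    return parse_post_date(value).replace(tzinfo=timezone.utc)


def parse_pub_date(value):
    """Parse an RFC 822 style date such as 'Mon, 15 Jan 2024 15:30:00 +0000'."""
    return parsedate_to_datetime(value)


local = parse_post_date("2024-01-15 10:30:00")
gmt = parse_post_date_gmt("2024-01-15 15:30:00")
rss = parse_pub_date("Mon, 15 Jan 2024 15:30:00 +0000")
print(local.isoformat(), gmt.isoformat(), rss.isoformat())
```

Comparing the GMT value with the parsed pub_date is one quick way to spot rows where the two dates differ.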
Authorship
creator
WordPress username of the content creator. Useful for:
- Attribution
- Author-based filtering
- Workload analysis
URL and Routing
link
Full URL where the content is accessible.
post_name
URL slug (the last part of the permalink). Used for:
- SEO-friendly URLs
- Content identification
- URL structure
Taxonomy
categories
All assigned categories, semicolon-separated. Example:
Technology;Web Development;WordPress
tags
All assigned tags, semicolon-separated. Example:
PHP;MySQL;CMS;blogging
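Because both taxonomy columns pack several terms into one cell, downstream scripts usually need to split them again. A minimal sketch, assuming terms never contain a semicolon themselves (a term that does would be split incorrectly); split_terms is an illustrative helper, not part of the converter:

```python
from collections import Counter


def split_terms(cell):
    """Split a semicolon-separated cell into a list of non-empty terms."""
    if not cell:
        return []
    return [term.strip() for term in cell.split(";") if term.strip()]


# Two hand-written rows standing in for rows read from the CSV.
rows = [
    {"title": "A", "categories": "Technology;Web Development;WordPress", "tags": "PHP;MySQL"},
    {"title": "B", "categories": "Technology", "tags": ""},
]

# Count how often each category is used across all rows.
category_counts = Counter(
    term for row in rows for term in split_terms(row["categories"])
)
print(category_counts.most_common())
```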
Settings
comment_status
Whether comments are allowed:
- open: Comments enabled
- closed: Comments disabled
ping_status
Whether pingbacks/trackbacks are allowed:
- open: Pings enabled
- closed: Pings disabled
is_sticky
Boolean flag indicating if post is pinned to the top of blog archives.
post_password
Password required to view content (if password-protected).
Hierarchy
post_parent
ID of parent post (for hierarchical content like pages). Zero indicates top-level content.
menu_order
Manual sort order for pages and custom post types.
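The two hierarchy columns are enough to rebuild a page tree after export. The sketch below assumes rows have already been read from the CSV as dictionaries of strings and that post_id, post_parent and menu_order hold integers; the helper names are illustrative. Rows whose parent is missing from the export are not reachable from the top level and would need separate handling.

```python
from collections import defaultdict


def build_children_index(rows):
    """Group rows by post_parent and order siblings by menu_order."""
    children = defaultdict(list)
    for row in rows:
        parent_id = int(row.get("post_parent") or 0)
        children[parent_id].append(row)
    for siblings in children.values():
        siblings.sort(key=lambda r: int(r.get("menu_order") or 0))
    return children


def walk(children, parent_id=0, depth=0):
    """Yield (depth, title) pairs in tree order, starting at top-level content."""
    for row in children.get(parent_id, []):
        yield depth, row["title"]
        yield from walk(children, int(row["post_id"]), depth + 1)


rows = [
    {"post_id": "2", "title": "About", "post_parent": "0", "menu_order": "1"},
    {"post_id": "5", "title": "Team", "post_parent": "2", "menu_order": "0"},
    {"post_id": "3", "title": "Home", "post_parent": "0", "menu_order": "0"},
]

for depth, title in walk(build_children_index(rows)):
    print("  " * depth + title)
```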
Extended Data
custom_fields
All custom field data serialized as JSON. WordPress plugins and themes often store additional data here.
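Reading the custom_fields column back requires one JSON decode per cell. The sketch below assumes each non-empty cell holds a JSON object mapping meta keys to values; the exact shape the converter emits is not specified here, so treat the key names (_thumbnail_id, subtitle) as made-up sample data.

```python
import csv
import io
import json


def load_custom_fields(cell):
    """Decode one custom_fields cell; empty or invalid cells become {}."""
    if not cell:
        return {}
    try:
        value = json.loads(cell)
    except json.JSONDecodeError:
        return {}
    return value if isinstance(value, dict) else {}


# Build a tiny in-memory CSV so the example runs without a real export.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["post_id", "custom_fields"])
writer.writerow(["1", json.dumps({"_thumbnail_id": "42", "subtitle": "Hello"})])
writer.writerow(["2", ""])
buffer.seek(0)

for row in csv.DictReader(buffer):
    fields = load_custom_fields(row["custom_fields"])
    print(row["post_id"], fields.get("_thumbnail_id", "(none)"))
```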
Usage
Basic Conversion
Convert an entire WordPress export:
python wxr_to_csv.py wordpress_export.xml
This creates wordpress_export.csv in the same directory.
Specify Output File
Control the output filename:
python wxr_to_csv.py export.xml -o output.csv
Filter Post Types
Export only specific content types:
# Only blog posts
python wxr_to_csv.py export.xml -t post
# Only pages
python wxr_to_csv.py export.xml -t page
# Posts and pages
python wxr_to_csv.py export.xml -t post page
# Include custom post type
python wxr_to_csv.py export.xml -t post page product
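For readers wrapping the tool in their own scripts, the command shapes above map naturally onto argparse. This is only a sketch of how such an interface could be declared, not the converter's actual source: the long option names (--output, --types) and the helper functions are assumptions, while -o, -t and the default output name follow the examples above.

```python
import argparse
from pathlib import Path


def build_parser():
    """Declare a CLI with one input file, optional -o and optional -t."""
    parser = argparse.ArgumentParser(prog="wxr_to_csv.py")
    parser.add_argument("input_file", help="WXR export (.xml)")
    parser.add_argument("-o", "--output", help="CSV path (default: input name with .csv)")
    parser.add_argument("-t", "--types", nargs="+", help="post types to keep")
    return parser


def resolve_output(args):
    """Fall back to the input filename with a .csv suffix when -o is absent."""
    if args.output:
        return Path(args.output)
    return Path(args.input_file).with_suffix(".csv")


args = build_parser().parse_args(["export.xml", "-t", "post", "page"])
print(resolve_output(args), args.types)
```

Using nargs="+" is what lets several post types follow a single -t flag.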
Autorun Script
The included autorun script provides a guided experience:
python autorun.py
This interactive script:
- Prompts for input file
- Suggests output filename
- Offers post type selection
- Shows progress
- Confirms completion
Python API
Programmatic Usage
Import and use in your own Python scripts:
from wxr_to_csv import WXRToCSVConverter

converter = WXRToCSVConverter()
converter.convert_to_csv(
    input_file='export.xml',
    output_file='output.csv',
    post_types=['post', 'page']
)
Custom Processing
Extend the converter for custom needs:
from wxr_to_csv import WXRToCSVConverter

class CustomConverter(WXRToCSVConverter):
    def process_item(self, item):
        # Run the standard extraction first
        processed = super().process_item(item)
        # Add a custom column; extract_custom_data is a method you define on this subclass
        processed['custom_field'] = self.extract_custom_data(item)
        return processed
Getting WordPress Export Files
Export Process
- Log into WordPress admin dashboard
- Navigate to Tools → Export
- Select content to export:
  - All content: Everything
  - Posts: Blog posts only
  - Pages: Static pages only
  - Media: Attachments
- Click Download Export File
- Save the .xml file
Export Options
WordPress provides several export scopes:
All Content
Exports everything:
- Posts and pages
- Comments
- Custom fields
- Categories and tags
- Custom post types
- Navigation menus
Selective Export
Choose specific content:
- Date ranges
- Author filtering
- Status filtering
- Category filtering
Technical Details
XML Parsing
The converter uses Python's xml.etree.ElementTree to:
- Parse WXR XML structure
- Navigate WordPress-specific namespaces
- Extract nested data
- Report malformed XML with a clear error message
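The parsing steps listed above can be sketched with ElementTree and a namespace map. This is an illustration, not the converter's source: the namespace URIs are the ones commonly found in WXR 1.2 exports (older exports use a different version number in the wp URI), and the sample document and function names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes as they typically appear in a WXR 1.2 file (assumed here).
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

SAMPLE = """<rss version="2.0"
  xmlns:wp="http://wordpress.org/export/1.2/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
  <item>
    <title>Hello world</title>
    <dc:creator><![CDATA[admin]]></dc:creator>
    <content:encoded><![CDATA[<p>Welcome.</p>]]></content:encoded>
    <wp:post_id>1</wp:post_id>
    <wp:post_type><![CDATA[post]]></wp:post_type>
    <category domain="category" nicename="news"><![CDATA[News]]></category>
    <category domain="post_tag" nicename="intro"><![CDATA[Intro]]></category>
  </item>
</channel>
</rss>"""


def text(item, path):
    """Return the text of a child element, or '' if it is missing or empty."""
    node = item.find(path, NS)
    return (node.text or "") if node is not None else ""


def parse_items(xml_text):
    """Yield one flat dictionary per <item> element."""
    root = ET.fromstring(xml_text)
    for item in root.iterfind("channel/item"):
        terms = item.findall("category")
        yield {
            "post_id": text(item, "wp:post_id"),
            "post_type": text(item, "wp:post_type"),
            "title": text(item, "title"),
            "creator": text(item, "dc:creator"),
            "content": text(item, "content:encoded"),
            "categories": ";".join(t.text or "" for t in terms if t.get("domain") == "category"),
            "tags": ";".join(t.text or "" for t in terms if t.get("domain") == "post_tag"),
        }


for row in parse_items(SAMPLE):
    print(row)
```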
CSV Generation
Uses Python's csv module with:
- Proper quote handling
- Unicode support
- Configurable delimiters
- Excel compatibility
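To illustrate the quoting behavior listed above, the following sketch writes a row whose content contains a comma, double quotes and a line break, then reads it back unchanged. It uses csv.DictWriter with the default Excel dialect; the column subset and the write_rows helper are chosen for the example and are not the converter's own code.

```python
import csv
import io

FIELDNAMES = ["post_id", "title", "content"]


def write_rows(rows, stream, fieldnames=FIELDNAMES):
    """Write dictionaries as CSV, quoting only the fields that need it."""
    writer = csv.DictWriter(
        stream,
        fieldnames=fieldnames,
        quoting=csv.QUOTE_MINIMAL,
        extrasaction="ignore",  # silently skip keys that are not columns
    )
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {
        "post_id": "1",
        "title": "Hello, world",
        "content": '<p>She said "hi".</p>\n<p>Second paragraph.</p>',
    }
]

# newline="" leaves line-ending handling to the csv module.
buffer = io.StringIO(newline="")
write_rows(rows, buffer)
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))
print(round_tripped == rows)
```

When writing to disk instead, opening the file with newline="" and encoding="utf-8-sig" adds a byte order mark, which helps Excel detect UTF-8.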
Memory Management
For large WordPress sites:
- Streaming XML parsing
- Incremental CSV writing
- Memory-efficient processing
- Progress indicators
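The combination of streaming parsing and incremental writing can be sketched with ElementTree's iterparse. This is a simplified illustration under the assumption of the WXR 1.2 namespace URI, with a three-column output; it is not the converter's actual implementation, and the function names are made up.

```python
import csv
import io
import xml.etree.ElementTree as ET

WP = "{http://wordpress.org/export/1.2/}"  # assumed WXR 1.2 namespace


def stream_items(source):
    """Yield one row per <item> as soon as its closing tag has been read."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            yield {
                "post_id": elem.findtext(f"{WP}post_id", default=""),
                "title": elem.findtext("title", default=""),
                "post_type": elem.findtext(f"{WP}post_type", default=""),
            }
            # Release the item's children; only an empty shell stays in the tree.
            elem.clear()


def stream_to_csv(source, out, fieldnames=("post_id", "title", "post_type")):
    """Write rows one at a time and return how many were written."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    count = 0
    for row in stream_items(source):
        writer.writerow(row)
        count += 1
    return count


SAMPLE = (
    b'<rss xmlns:wp="http://wordpress.org/export/1.2/"><channel>'
    b"<item><title>A</title><wp:post_id>1</wp:post_id><wp:post_type>post</wp:post_type></item>"
    b"<item><title>B</title><wp:post_id>2</wp:post_id><wp:post_type>page</wp:post_type></item>"
    b"</channel></rss>"
)

out = io.StringIO(newline="")
print(stream_to_csv(io.BytesIO(SAMPLE), out), "rows written")
```

iterparse also accepts a filename, so the same functions work on an export on disk.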
Error Handling
Robust error management:
- File not found errors
- XML parsing errors
- Encoding issues
- Write permission errors
- Missing required fields
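A common way to cover the cases above is to translate low-level exceptions into one error type that carries a user-facing message. The sketch below shows that pattern; ConversionError, parse_wxr and open_output are hypothetical names introduced for illustration, and the messages are examples rather than the converter's exact output.

```python
import io
import xml.etree.ElementTree as ET


class ConversionError(Exception):
    """Raised with a user-facing message when conversion cannot proceed."""


def parse_wxr(source):
    """Parse a WXR file (path or file object) or raise ConversionError."""
    try:
        tree = ET.parse(source)
    except FileNotFoundError:
        raise ConversionError(f"Input file not found: {source}") from None
    except ET.ParseError as exc:
        raise ConversionError(f"Unable to parse XML file: {exc}") from exc
    if tree.getroot().tag != "rss":
        raise ConversionError("Not a valid WXR file: root element is not <rss>")
    return tree


def open_output(path):
    """Open the CSV for writing or raise ConversionError on permission problems."""
    try:
        return open(path, "w", newline="", encoding="utf-8")
    except PermissionError:
        raise ConversionError(f"No write permission for: {path}") from None


samples = [
    ("truncated", b"<rss><channel>"),
    ("wrong root", b"<html/>"),
    ("ok", b"<rss><channel/></rss>"),
]
for label, data in samples:
    try:
        parse_wxr(io.BytesIO(data))
        print(label, "-> parsed")
    except ConversionError as err:
        print(label, "->", err)
```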
Troubleshooting
Common Issues
Error Parsing WXR File
Symptoms
Error: Unable to parse XML file
Causes
- Corrupted export file
- Incomplete download
- Not a valid WXR file
- Unsupported WordPress version
Solutions
- Re-export from WordPress
- Verify file integrity
- Check file size matches export
- Try a different browser for download
No Posts Found
Symptoms
Warning: No posts found in export
Causes
- Wrong post types specified
- Empty WordPress site
- Export filtered to exclude content
Solutions
- Check post type names
- Try without the -t filter
- Verify WordPress site has content
- Re-export without filters
Encoding Issues
Symptoms
- Strange characters in output
- Corrupted international text
- Box symbols replacing characters
Causes
- Incorrect encoding detection
- Non-UTF-8 WordPress database
- Terminal encoding mismatch
Solutions
- Ensure WordPress database uses UTF-8
- Save CSV with UTF-8 encoding
- Check terminal locale settings
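Before changing database or terminal settings, it can help to confirm whether the file itself is valid UTF-8. The following sketch reports the byte offset of the first invalid sequence; the function names are illustrative and not part of the converter.

```python
def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def check_file(path):
    """Run the same check against a file on disk."""
    with open(path, "rb") as handle:
        return first_invalid_utf8_offset(handle.read())


good = "café".encode("utf-8")
bad = "café".encode("latin-1")  # simulates text saved in a legacy encoding
print(first_invalid_utf8_offset(good), first_invalid_utf8_offset(bad))
```

If the file passes this check but still looks wrong in a viewer, the problem is more likely in how the viewer or terminal decodes it.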
Large File Handling
For WordPress sites with thousands of posts:
Memory Issues
If the converter runs out of memory:
# Process in chunks of 1000 items
from wxr_to_csv import WXRToCSVConverter
converter = WXRToCSVConverter()
converter.set_chunk_size(1000)
converter.convert_to_csv('large_export.xml', 'output.csv')
Performance
Speed up processing:
- Export specific date ranges
- Filter by post type
- Split large exports
- Use SSD storage
Data Processing Examples
Excel Import
Open in Microsoft Excel:
- Open Excel
- Go to Data → From Text/CSV
- Select the CSV file
- Choose UTF-8 encoding
- Verify data preview
- Click Load
Google Sheets Import
- Open Google Sheets
- File → Import
- Upload CSV file
- Choose Replace current sheet or Insert new sheet
- Select Comma separator
- Click Import data
Python Pandas Analysis
import pandas as pd
# Read CSV
df = pd.read_csv('wordpress_export.csv')
# Posts per author
author_counts = df['creator'].value_counts()
# Posts per month
df['post_date'] = pd.to_datetime(df['post_date'])
monthly_posts = df.groupby(df['post_date'].dt.to_period('M')).size()
# Find posts without categories
uncategorized = df[df['categories'].isna()]
SQL Import
Import into database:
-- MySQL example
LOAD DATA INFILE 'wordpress_export.csv'
INTO TABLE posts
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
Advanced Features
Custom Column Mapping
Customize which fields to export:
converter = WXRToCSVConverter()
converter.set_column_mapping({
    'post_id': 'ID',
    'title': 'Title',
    'content': 'Body',
    'post_date': 'Date'
})
Field Transformation
Apply transformations during export:
import re

def clean_html(html):
    # Remove HTML tags
    return re.sub(r'<[^>]+>', '', html)

converter.add_transform('content', clean_html)
Filtering
Exclude specific content:
def filter_drafts(item):
    return item['status'] != 'draft'

converter.add_filter(filter_drafts)
Best Practices
Pre-Conversion
- Backup your WordPress site before exporting
- Clean up draft posts and spam if not needed
- Verify export completed successfully
- Check file size is reasonable
During Conversion
- Test with small exports first
- Verify post type names are correct
- Check terminal output for errors
- Monitor progress on large files
Post-Conversion
- Verify row count matches expectations
- Spot-check content accuracy
- Test opening in target application
- Keep original WXR file as backup
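The row-count check can be automated by counting item elements in the WXR file and records in the CSV. This sketch assumes the WXR 1.2 namespace URI and a post_type column in the CSV; the helper names are made up. csv.DictReader is used rather than counting lines, because content fields may contain line breaks.

```python
import csv
import io
import xml.etree.ElementTree as ET

WP = "{http://wordpress.org/export/1.2/}"  # assumed WXR 1.2 namespace


def count_wxr_items(source, post_types=None):
    """Count <item> elements, optionally restricted to some post types."""
    count = 0
    for _event, elem in ET.iterparse(source):
        if elem.tag == "item":
            post_type = elem.findtext(f"{WP}post_type", default="")
            if post_types is None or post_type in post_types:
                count += 1
            elem.clear()
    return count


def count_csv_rows(stream):
    """Count CSV records (not physical lines), excluding the header."""
    return sum(1 for _ in csv.DictReader(stream))


WXR = (
    b'<rss xmlns:wp="http://wordpress.org/export/1.2/"><channel>'
    b"<item><wp:post_type>post</wp:post_type></item>"
    b"<item><wp:post_type>page</wp:post_type></item>"
    b"</channel></rss>"
)
CSV_TEXT = 'post_type,content\r\npost,"line one\nline two"\r\npage,x\r\n'

items = count_wxr_items(io.BytesIO(WXR))
records = count_csv_rows(io.StringIO(CSV_TEXT, newline=""))
print("match" if items == records else f"mismatch: {items} items, {records} rows")
```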
Security Considerations
Sensitive Data
WordPress exports may contain:
- User email addresses
- Private posts
- Password-protected content
- Custom fields with credentials
- Personal information
Recommendations
- Review exported data before sharing
- Remove sensitive columns if needed
- Encrypt CSV files for storage
- Control access to export files
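Removing sensitive columns can be done after conversion with the csv module alone. A minimal sketch, assuming the column names documented earlier (post_password, custom_fields); drop_columns is an illustrative helper, not part of the converter.

```python
import csv
import io


def drop_columns(src, dst, sensitive):
    """Copy a CSV from src to dst, leaving out the named columns."""
    reader = csv.DictReader(src)
    kept = [name for name in (reader.fieldnames or []) if name not in sensitive]
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return kept


source = io.StringIO(
    "post_id,title,post_password,custom_fields\r\n"
    '1,Hello,secret,"{""api_key"": ""abc""}"\r\n',
    newline="",
)
cleaned = io.StringIO(newline="")
kept_columns = drop_columns(source, cleaned, {"post_password", "custom_fields"})
print(kept_columns)
print(cleaned.getvalue())
```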
Data Sanitization
Clean data before migration:
# Remove email addresses
converter.add_transform('creator', lambda x: x.split('@')[0])
# Strip private posts
converter.add_filter(lambda x: x['status'] != 'private')
Performance Optimization
Benchmarks
Typical performance:
- Small site (100 posts): < 1 second
- Medium site (1000 posts): 2-5 seconds
- Large site (10000 posts): 20-60 seconds
Optimization Tips
- Use SSD storage
- Filter unnecessary post types
- Process during off-peak hours
- Close other applications
Integration Examples
Automated Backups
#!/bin/bash
# Automated WordPress export and conversion
set -euo pipefail
STAMP=$(date +%Y%m%d)
# Export WordPress to a single file with a fixed name
wp export --dir=/tmp/ --filename_format=wordpress.xml --max_file_size=-1
# Convert to CSV
python wxr_to_csv.py /tmp/wordpress.xml -o "/backups/${STAMP}.csv"
# Upload to cloud storage
aws s3 cp "/backups/${STAMP}.csv" s3://my-bucket/backups/
CI/CD Pipeline
# GitHub Actions example
name: Export WordPress Content
on:
  schedule:
    - cron: '0 0 * * 0' # Weekly
jobs:
  export:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'
      - name: Convert Export
        run: python wxr_to_csv.py export.xml
      - name: Upload Artifact
        uses: actions/upload-artifact@v4
        with:
          name: wordpress-csv
          path: export.csv
License
The WXR to CSV Converter is released under the MIT License, allowing free use, modification, and distribution.
Contributing
Contributions welcome:
- Bug fixes
- Feature enhancements
- Documentation improvements
- Test coverage
Support
- GitHub Issues: Bug reports and feature requests
- Documentation: This comprehensive guide
- Community: WordPress forums and developer communities