Plan to restore replication and resolve WordPress database inconsistencies related to 4/9 incident
Thursday, April 25th, 2013
On Tuesday, April 9th, an incident with WordPress services caused a temporary service interruption and reduced capacity. Service was restored with a secondary, replicated database, but further analysis has determined that activities related to the outage and the restoration process introduced a data inconsistency between the primary and secondary databases. No data has been lost, but the databases are out of sync. This impacts current service in two ways:
- Content and settings that were added or edited on websites during this incident may be missing from your website or may have reverted to their previous state.
- Database replication services cannot be restored until the inconsistencies are cleared. Until replication is restored, our production failover option is the most recent daily backup.
After extensive analysis, IS&T teams are working through a 3-step plan to resolve these issues and return service to normal operations as quickly as possible.
Step 1 – Preparation (completed)
On Friday morning (4/26) beginning at 2am, application developers will deploy changes to the WordPress code to facilitate the work described in steps 2 and 3 below. These changes will not impact service for WordPress websites. Successful deployment and testing of these changes will allow us to proceed to step 2.
UPDATE: Step 1 (Preparation) work was completed successfully 4/26/13 at 2:46am. Step 2 will proceed as planned on 4/28/13.
Step 2 – Restore Database Replication (completed)
On Wednesday morning (5/1) beginning shortly after midnight, all WordPress-based websites at BU will be put into “maintenance mode.” This will temporarily suspend functions that write to the database — primarily edits, comments, and forms. We expect all websites to continue to function in read-only mode during this time. Sites will be fully browsable, but editors will not be able to log in and edit content, visitors will be unable to leave comments, and most forms will be unavailable. We will automatically post in-place notifications where the comment and submission forms would normally appear, urging visitors to return later to complete the form. We expect to finish this work before 5am. Successful completion of step 2 will restore normal replication processes between the primary and secondary databases, and will allow us to proceed to step 3. Pages/posts that are scheduled to be published during this timeframe will be published after this work is completed.
UPDATE: Developers encountered a problem with Step 2 on 4/28, so this has been rescheduled for Wednesday morning, May 1st, starting shortly after midnight. All WordPress-based websites at BU will be put into “maintenance mode” and we expect this work to be finished by 5am.
UPDATE II: Maintenance mode was lifted at approximately 3:30am on 5/1. Database replication was restored at 5:20am on 5/1.
Step 3 – Reconcile Data Inconsistencies
After replication is restored, we will run monitored scripts on a site-by-site basis to reconcile data inconsistencies that were introduced during the 4/9 incident and subsequent service restoration activities. These will be tested before the work begins, and we do not anticipate any service disruptions during this work.
UPDATE: During the week of May 20th, the IS&T web team will begin to contact administrators of sites where we have found data discrepancies. You will receive detailed information about the data discrepancies and will have the option to 1) manually check and repair any issues on your own; OR 2) allow IS&T to continue with the monitored scripts that will reconcile the data problems.
We’ve received a couple of reports of missing pages or reverted edits that were made during this incident, and there are other inconsistencies that have gone unnoticed. However, the duration of this incident is known, and our analysis shows the majority of the database inconsistencies are relatively benign. We will continue to update both the steps above and the list below as we work to resolve these issues.
What is impacted by the data inconsistencies?
The original incident occurred between approximately 7:02am and 12:17pm on 4/9. This period covers the service interruption and the switchover to the standby database. On 4/13, service was switched back to the primary database, at which point data inconsistencies began to be discovered and replication could not proceed.
- Web pages/posts that were edited during the original incident on 4/9 may have reverted back to the previous state. These edits will be restored during the work outlined in step 3.
- Web pages/posts that were added during the original incident and subsequently edited between 4/9 and 4/13 may have reverted back to the previous state. The edits will be restored.
- Web pages/posts that were created during the incident may now be missing. These will be restored.
- Web pages/posts that were moved from Published status to the Trash may have reverted back to Published status. These will be returned to Trash.
- Web pages/posts that were moved from Draft status to the Trash may have reverted back to Draft status. These will be returned to Trash.
- Site settings that were edited/changed may have reverted back to the previous state. The edited settings will be restored.
- Comments that were submitted during the incident may be missing. These will be restored.
- Gravity Form entries that were submitted during the incident may be missing. These will be restored. Note that email notification of form submissions continued to operate normally during the incident.
- Form view counts, form submission counts, and form conversion percentages may be slightly inaccurate for the period of the 4/9 incident. The impact to these metrics is negligible, and these numbers will not be reconciled.
- Page revision history during the 4/9 incident may no longer be available.
- Form entries that were deleted by site admins or editors during the period of the 4/9 incident may be restored. This may require admins/editors to delete the form entry again after this work is completed.
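For site admins who want to check their own content against the incident window, the following is a minimal sketch of how change timestamps could be tested against the 7:02am to 12:17pm window on 4/9. It is not the script IS&T is running. The record list and the function name `changed_during_incident` are hypothetical, and times are treated as local time with no timezone handling. Note that it covers only the original window; as listed above, pages added during the incident and edited through 4/13 may also be affected.

```python
from datetime import datetime

# Window taken from the notice: approximately 7:02am to 12:17pm on 4/9/2013.
# Assumption: naive local times, no timezone conversion.
INCIDENT_START = datetime(2013, 4, 9, 7, 2)
INCIDENT_END = datetime(2013, 4, 9, 12, 17)


def changed_during_incident(modified_at: datetime) -> bool:
    """Return True if a change timestamp falls inside the incident window."""
    return INCIDENT_START <= modified_at <= INCIDENT_END


# Hypothetical records: (post ID, last-modified time).
records = [
    (101, datetime(2013, 4, 9, 6, 55)),   # before the incident
    (102, datetime(2013, 4, 9, 9, 30)),   # during the incident
    (103, datetime(2013, 4, 9, 12, 17)),  # at the end of the window
    (104, datetime(2013, 4, 10, 8, 0)),   # the following day
]

affected_ids = [post_id for post_id, ts in records if changed_during_incident(ts)]
print(affected_ids)
```

With these sample records, only posts 102 and 103 fall inside the window.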
The WordPress database is extremely active, supporting more than 800 websites and thousands of blogs. Not only does the database store site settings and page content, but it also stores temporary content that makes the overall application more efficient and helps to minimize page load time for visitors. IS&T teams have been meticulously assessing the inconsistencies on a site-by-site basis, filtering out issues with temporarily-stored data in order to identify and resolve those issues where public-facing content and site settings may have been impacted.
Reconciling data discrepancies
Actions
The data inconsistencies cover three scenarios. Site admins will receive a report listing the types of data discrepancies and the action needed to resolve the problems.
- Insertion – this data was added during the 4/9 incident and is currently missing from your site. If you opt in to our plan, IS&T will insert (restore) this data for you.
- Replace – this data was edited during the 4/9 incident, but the edits reverted back to their previous state. If you opt in to our plan, IS&T will replace this data with the most recent version of your edits.
- Delete – this data was deleted from your site during the 4/9 incident and has since reverted back to live content. If you opt in to our plan, IS&T will delete this data.
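To illustrate how the three actions relate to one another, here is a minimal sketch that compares the state written during the incident with what a site currently shows and labels each record as an insert, replace, or delete. This is an illustration only, not IS&T's reconciliation script. The `Snapshot` structure, the `classify` function, and the sample data are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical structure: maps a record ID to its content.
# None marks a record that was deleted.
Snapshot = Dict[int, Optional[str]]


def classify(incident: Snapshot, live: Snapshot) -> Dict[int, str]:
    """Label each record changed during the incident with the action needed.

    `incident` holds the state written during the incident;
    `live` holds what the site currently shows.
    """
    actions: Dict[int, str] = {}
    for record_id, intended in incident.items():
        current = live.get(record_id)
        if intended is None:
            # Deleted during the incident but live again: delete it.
            if current is not None:
                actions[record_id] = "delete"
        elif current is None:
            # Added during the incident but missing now: insert it.
            actions[record_id] = "insert"
        elif current != intended:
            # Edited during the incident but reverted: replace it.
            actions[record_id] = "replace"
    return actions


# Hypothetical sample data.
incident_state: Snapshot = {1: "new page", 2: "edited text", 3: None, 4: "unchanged"}
live_state: Snapshot = {2: "old text", 3: "trashed page", 4: "unchanged"}

report = classify(incident_state, live_state)
print(report)
```

Records that already match, such as record 4 in the sample, need no action and are left out of the report.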
Post Types
The report received by site admins will also list the post types with data discrepancies. These are:
- Page – a standard web page
- Post – a standard post (usually a news or blog post)
- Profile – a faculty/staff profile for those websites that use the BU Profiles plugin
- Attachment – a media item (usually an image or document) that has been uploaded to your Media Library and is associated with a specific page or post. (Note: the media item is still on your site and links to it continue to work, but you may not be able to see it in your WP Media Library.)
Form Entries
Gravity Form Submissions – entries submitted via Gravity Forms on your website. Email notification of form entries was not impacted during this incident — see below.
Most departmental workflows use the email notifications sent out by WordPress when someone submits a form. Email notifications were not impacted by the 4/9 incident, so site admins have received these normally, but form entries submitted during the incident are not currently in the database. If your department’s workflow relies on the form entries database rather than the email notifications, please inform the web team of this when you reply to our message. We only plan to resolve form entry discrepancies for those sites that indicate their workflow relies on the form entries being present in the database.
Post Meta Data
The report will also list pages/posts with discrepancies in “post meta” data — information about, or related to, the page or post (things like content banner settings, navigation labels, etc.). Specific data discrepancies will not be listed on the report. These will be resolved if you opt in to the reconciliation plan.
Taxonomies
IS&T cannot resolve discrepancies with taxonomies. It is technically very complex to pinpoint any discrepancies with taxonomies (categories and tags associated with a page or post). Site admins are encouraged to check any pages/posts that were edited during this incident to ensure the correct categories/tags are being used.
If you have questions or have encountered a problem with your WordPress-based website that you believe is related to this incident, please report it and describe the issue in detail so IS&T teams can investigate.