HowTo: Keep MediaWiki from Modifying Your Extension Output

From Jimbojw.com

Update: Since this article was originally published, I've found a much better way to keep extension output from being garbled.


If you've ever written a MediaWiki tag extension, you've likely run into this problem. Due to the Parser's behavior regarding extension tags, some parsing still occurs after the extension has run - potentially mangling otherwise perfect output. The obvious question is:

"How can I avoid modification of my extension's HTML output?"

This article will demonstrate one solution.

The Problem

Suppose you want to insert a <form> into a MediaWiki article, triggered by an extension tag. You might be tempted to try something like this:

$wgExtensionFunctions[] = "wfExampleExtension";
 
function wfExampleExtension() {
    global $wgParser;
    # register the extension with the WikiText parser
    $wgParser->setHook( "example", "renderExample" );
}
 
# The callback function for converting the input text to HTML output
function renderExample( $input, $argv, &$parser ) {
    $output = "Here is the form: \n";
    $output .= "<form action='/path/to/form/processor/doSomething.php' method='post'>\n";
    $output .= "Enter your name: <input type='text' name='yourName' />\n\n";
    $output .= "<input type='submit' value='Submit' />\n";
    $output .= "</form>\n";
    return $output;
}
Note: The above snippet was based on an example found in Extending wiki markup (meta.wikimedia.org)

In your article, you dutifully call your extension like so (expecting to see your <form> exactly as specified):

<example />

After saving the page and viewing the HTML source, you realize that although some of your markup made it to the rendered page, the form tags are missing! They have been stripped by the Parser because certain tags (including <form>) are not permissible in MediaWiki markup.

You would also find that the newline characters ("\n" in the snippet) have been quietly converted into <p> tags. This is because the Parser's block-level (paragraph) processing runs after the extension tag output has been inserted.

The Solution

The question of how to prevent this has been asked before, and the answer has historically been:

This will probably require moving some code around in the parser. The current extension code assumes extensions will produce inline material and they are inserted before the final block-level rendering stages.

This is no longer the case. Without changing anything in the MediaWiki core code, it is possible to slip your extension's output past the Parser unscathed. To do this, we will hook OutputPageBeforeHTML - which executes after all parsing has completed, just before page display.

Functions that hook OutputPageBeforeHTML are passed two parameters:

  • $out - a reference to the OutputPage object processing this request.
  • $text - the fully rendered text of the page, ready for display.

Since our hook function will accept $text by reference, we can modify it at will. This is how we'll sneak our extension output past the parser.
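To make the by-reference behavior concrete, here is a minimal standalone sketch. It does not use MediaWiki at all: the $hooks array and the loop are stand-ins for MediaWiki's hook dispatch, and appendMarker() is a hypothetical handler invented for this illustration.

```php
<?php
# Hypothetical handler: appends a marker to the rendered page text.
# Both parameters are declared by reference, as in the real hook signature.
function appendMarker( &$out, &$text ) {
    $text .= "\n<!-- processed -->";
    return true;   # returning true lets any remaining handlers run
}

# Stand-in for MediaWiki's hook registry and dispatch (assumption: simplified)
$hooks = array( 'OutputPageBeforeHTML' => array( 'appendMarker' ) );
$out   = null;                      # placeholder for the OutputPage object
$text  = "<p>Rendered page</p>";    # placeholder for the rendered page text

foreach ( $hooks['OutputPageBeforeHTML'] as $handler ) {
    $handler( $out, $text );        # $text is modified in place
}

echo $text, "\n";
```

Because $text is taken by reference, whatever the handler assigns to it is what the caller ends up displaying.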

Consider this enhanced version of the previous example:

$wgExtensionFunctions[] = "wfExampleExtension";
 
function wfExampleExtension() {
    global $wgParser;
    # register the extension with the WikiText parser
    $wgParser->setHook( "example", "renderExample" );
}
 
# The callback function for converting the input text to HTML output
function renderExample( $input, $argv, &$parser ) {
    $output = "Here is the form: \n";
    $output .= "<form action='/path/to/form/processor/doSomething.php' method='post'>\n";
    $output .= "Enter your name: <input type='text' name='yourName' />\n\n";
    $output .= "<input type='submit' value='Submit' />\n";
    $output .= "</form>\n";
 
    # Using base64 within an XML comment to sneak past the parser
    return '<!-- ENCODED_CONTENT '.base64_encode($output).' -->';
}
 
# Wrapping this for safety, just in case another extension beat us to it!
if (!function_exists('processEncodedOutput')) {
 
    # Happens just before page display, after all processing
    $wgHooks['OutputPageBeforeHTML'][] = 'processEncodedOutput';
 
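    # Search and replace the XML comments with the hidden, encoded content
    function processEncodedOutput( &$out, &$text ) {
        # base64 output may contain '/', so the character class must allow it;
        # preg_replace_callback() replaces the /e modifier, removed in PHP 7
        $text = preg_replace_callback(
            '/<!-- ENCODED_CONTENT ([0-9a-zA-Z+\/]+=*) -->/',
            function ( $matches ) {
                return base64_decode( $matches[1] );
            },
            $text
        );
        return true;
    }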
 
}

Let's look at the important pieces individually. First, consider this line from renderExample():

return '<!-- ENCODED_CONTENT '.base64_encode($output).' -->';

Rather than returning the generated $output directly, we're hiding it (base64 encoded) inside an XML comment.

Using base64 ensures that only alphanumeric characters plus '+', '/', and '=' are used. This way, the Parser will largely ignore the output since it looks just like regular text: it has no line breaks, HTML entities, or wiki markup characters.
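As a quick sanity check, the following standalone sketch encodes a shortened version of the form markup and confirms that the result is drawn only from the base64 alphabet (A-Z, a-z, 0-9, '+', '/', with '=' as padding). The helper isBase64Alphabet() is a hypothetical name used only for this illustration.

```php
<?php
# Hypothetical helper, for illustration only: returns true when a string
# consists solely of base64 alphabet characters plus optional '=' padding.
function isBase64Alphabet( $encoded ) {
    return preg_match( '/^[0-9a-zA-Z+\/]*=*\z/', $encoded ) === 1;
}

$output  = "Here is the form: \n";
$output .= "<form action='/path/to/form/processor/doSomething.php' method='post'>\n";
$output .= "</form>\n";

$encoded = base64_encode( $output );

# No angle brackets, quotes, or newlines survive in the encoded form
var_dump( isBase64Alphabet( $encoded ) );            # bool(true)

# Decoding restores the original markup byte for byte
var_dump( base64_decode( $encoded ) === $output );   # bool(true)
```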

Although the Parser strips XML comments from ordinary wiki text, it does so before the tag extension step. A comment returned by an extension is therefore never seen by that stripping pass, and it flies under the Parser's radar without modification.

Next, consider the processEncodedOutput() function:

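# Search and replace the XML comments with the hidden, encoded content
function processEncodedOutput( &$out, &$text ) {
    # base64 output may contain '/', so the character class must allow it;
    # preg_replace_callback() replaces the /e modifier, removed in PHP 7
    $text = preg_replace_callback(
        '/<!-- ENCODED_CONTENT ([0-9a-zA-Z+\/]+=*) -->/',
        function ( $matches ) {
            return base64_decode( $matches[1] );
        },
        $text
    );
    return true;
}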

Since we attached this to the 'OutputPageBeforeHTML' hook, processEncodedOutput() will execute just prior to page display. It scans through the rendered page looking for our XML comments. When found, these comments are replaced by the original, unencoded content.
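To try the round trip without a running wiki, here is a minimal standalone sketch. The helpers encodeForParser() and decodeEncodedContent() are hypothetical names, the '/submit' path is made up, and the $page string merely stands in for the HTML that MediaWiki would hand to the hook. It uses preg_replace_callback(), which works on current PHP versions.

```php
<?php
# Hypothetical helper: wrap raw HTML in the base64 comment marker
function encodeForParser( $html ) {
    return '<!-- ENCODED_CONTENT ' . base64_encode( $html ) . ' -->';
}

# Hypothetical helper: restore every encoded comment found in $text
function decodeEncodedContent( $text ) {
    return preg_replace_callback(
        '/<!-- ENCODED_CONTENT ([0-9a-zA-Z+\/]+=*) -->/',
        function ( $matches ) {
            return base64_decode( $matches[1] );
        },
        $text
    );
}

# Made-up form markup for the demonstration
$form = "<form action='/submit' method='post'><input type='submit' /></form>";

# Stand-in for the rendered page text passed to the hook
$page = "<p>Intro</p>\n" . encodeForParser( $form ) . "\n<p>Outro</p>";

$restored = decodeEncodedContent( $page );
echo $restored, "\n";
```

The surrounding paragraphs are left untouched; only the marker comment is swapped for the decoded markup.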

Caveats and Gotchas

Please note that OutputPageBeforeHTML only executes on page view, not on "Show Preview", so you won't see the final output until after the page is saved. However, you will see the "ENCODED_CONTENT" comments if you view the HTML source of the preview page.

Also, you may have trouble using this method if your MediaWiki install employs the Squid cache, as explained in Mediazilla's bug 7050.

The bug report explains:

As described in the 1.6 release notes, the OutputPageBeforeHTML hook was designed "to postprocess article HTML on page view (comes after parser cache, if used)." It therefore seemed ideal for the design of extensions that produce rapidly-changing output, since producing it at this late stage would not affect the page cache. However, according to Brion Vibber, this hook is not triggered when loading Squid-cached content...

It would therefore seem that any extension which hooks OutputPageBeforeHTML, as described in this article, runs the risk of putting additional burden on the server, since processEncodedOutput() is executed for every page and on every view.

We all know that "premature optimization is the root of all evil", but it's something to keep an eye out for, should you install an extension that uses this particular Parser bypass method.
