Raw HTML Output from a Parser Extension

From Jimbojw.com

In a previous article, I described how to get unadulterated output from a MediaWiki tag extension using the OutputPageBeforeHTML hook. It turns out there are much better ways to achieve the same effect.

This article describes one such technique, which I call 'hide-and-replace': extension output is hidden in plain sight, only to be revealed later in the parsing process.

Note: If you're the kind of person who just wants to skip right to the code, be my guest! Check out the RawHTML extension for a real example of the techniques outlined in the remainder of this article. DO NOT install RawHTML on a publicly editable wiki (see the RawHTML Disclaimer for more info).

Introducing the Example extension

The MediaWiki Parser is a complex and powerful beast. It runs through many stages of parsing to produce the final HTML version of wikitext. One of those passes includes rendering custom extension tags.

Consider this example extension code:

$wgExtensionFunctions[] = "wfExampleExtension";
 
function wfExampleExtension() {
    global $wgParser;
    # register the extension with the WikiText parser
    $wgParser->setHook( "example", "renderExample" );
}
 
# The callback function for converting the input text to HTML output
function renderExample( $input, $argv, &$parser ) {
 
    # Determine the desired output
    $output = "Here is some text, isn't that great!";
 
    return $output;
}

A call to this extension in a wiki article would look simply like this:

<example />

As expected, the rendered page would contain:

<p>Here is some text, isn't that great!
</p>

Problem overview

The problem (which isn't illustrated by the example above) arises when the output is more complex. The parser works in passes, and whatever a tag extension returns is still subject to the passes that run afterwards. Two of those later steps are troublesome:

  • Whitespace and list processing
  • Tidy

Whitespace and list processing refers to a group of tasks ranging from turning back-to-back newlines ("\n") into paragraph breaks (</p><p>) all the way to inserting list markup (UL, OL and LI tags) where leading asterisks and pound signs are found.

Tidy is a utility which takes possibly mangled HTML input and modifies it to be XHTML compliant.

Whitespace and list processing

The whitespace is troublesome because it affects embedded JavaScript. Suppose our example extension contained this:

# The callback function for converting the input text to HTML output
function renderExample( $input, $argv, &$parser ) {
 
    # Create a script block
    $output =
        "<script type='text/javascript'>\n".
        "  alert('hi there');\n".
        "</script>";
 
    return $output;
}

The expected output would be this:

<script type='text/javascript'>
  alert('hi there');
</script>

In reality we get something akin to this:

<script type='text/javascript'>
<pre>  alert('hi there');
</pre>
</script>

This is because leading whitespace in wiki articles is interpreted as preformatted text, and this is enforced after extension tag processing has occurred.

As you can guess, those <pre> tags will cause JavaScript execution errors.

Tidy

Tidy is also a problem for embedded SCRIPT tags in extension output since it likes to convert HTML/XML reserved characters into entities but doesn't respect CDATA declarations.

Consider the following:

# The callback function for converting the input text to HTML output
function renderExample( $input, $argv, &$parser ) {
 
    # Create a script block
    $output =
        "<script type='text/javascript'>/*<![CDATA[*/ \n".
        "if (a && b) alert('both a and b are true!');\n".
        "/*]]>*/</script>";
 
    return $output;
}

We would expect to see this output:

<script type='text/javascript'>/*<![CDATA[*/
if (a && b) alert('both a and b are true!');
/*]]>*/</script>

What we really get is more like this:

<script type='text/javascript'>/*<![CDATA[*/
if (a &amp;&amp; b) alert('both a and b are true!');
/*]]>*/</script>

Notice that the '&' symbols have been replaced by '&amp;'. Clearly this will not execute properly and JavaScript errors ensue.
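The effect is easy to reproduce outside MediaWiki. The sketch below uses PHP's built-in htmlspecialchars() as a stand-in for Tidy's entity conversion. That substitution is an assumption made purely for illustration: Tidy is a separate tool and does far more than this, but the treatment of '&' is the same.

```php
<?php
# Stand-in for Tidy's entity conversion. htmlspecialchars() is NOT Tidy;
# it is used here only because it escapes "&" in the same way.
$script = "if (a && b) alert('both a and b are true!');";
$mangled = htmlspecialchars( $script, ENT_NOQUOTES );

echo $mangled, "\n";
# prints: if (a &amp;&amp; b) alert('both a and b are true!');
```

A browser handed the second form inside a SCRIPT tag sees '&amp;&amp;' as literal JavaScript source, which is a syntax error.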

Workaround

The workaround is actually rather simple - all you have to do is hide the data in plain sight, then unmask it later!

Put more precisely: have your extension emit something the rest of the parser will pass through untouched (in the code below, an HTML comment holding a base64-encoded copy of the real output), then hook into the parser at a later stage and swap the real output back in.

There are two hooks that are useful for this: ParserBeforeTidy and ParserAfterTidy.

As you can imagine, both of these hooks occur within the Parser class, one executing before Tidy cleanup and the other afterwards.
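Before wiring this into MediaWiki, the idea can be tried in isolation. The following is a minimal standalone sketch, not MediaWiki code: hideContent, toyMangle and revealContent are hypothetical names, and toyMangle is only a crude imitation of the whitespace and entity handling described earlier. It shows why the hidden form survives: it is a single line with no leading whitespace, and the base64 alphabet contains no '&'.

```php
<?php
# Hypothetical helper names; only base64_encode, base64_decode,
# preg_replace, preg_replace_callback and str_replace are real PHP functions.

# Hide: wrap the base64 form of the output in a one-line HTML comment.
function hideContent( $html ) {
    return '<!-- ENCODED_CONTENT ' . base64_encode( $html ) . ' -->';
}

# Toy stand-in for the later parser passes (not MediaWiki code): lines with
# leading whitespace are wrapped in <pre>, and every "&" becomes "&amp;".
function toyMangle( $text ) {
    $text = preg_replace( '/^ (.*)$/m', '<pre>$1</pre>', $text );
    return str_replace( '&', '&amp;', $text );
}

# Reveal: decode every hidden block back into the original markup.
function revealContent( $text ) {
    return preg_replace_callback(
        '/<!-- ENCODED_CONTENT ([0-9a-zA-Z+\/]+=*) -->/',
        function ( $m ) { return base64_decode( $m[1] ); },
        $text
    );
}

$script = "<script>\n  if (a && b) alert('hi');\n</script>";

$unprotected = toyMangle( $script );
$protected = revealContent( toyMangle( hideContent( $script ) ) );

var_dump( $unprotected === $script ); # bool(false): the script was altered
var_dump( $protected === $script );   # bool(true): the hidden copy survived
```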

Here is the original example extension rewritten to use this hide-and-replace technique:

$wgExtensionFunctions[] = "wfExampleExtension";
 
function wfExampleExtension() {
    global $wgParser;
    # register the extension with the WikiText parser
    $wgParser->setHook( "example", "renderExample" );
}
 
# The callback function for converting the input text to HTML output
function renderExample( $input, $argv, &$parser ) {
 
    # Determine the desired output
    $output = "Here is some text, isn't that great!";
 
    # Hiding content from parser (to be decoded later)
    return '<!-- ENCODED_CONTENT '.base64_encode($output).' -->';
}
 
# Process the encoded output
if (!function_exists('processEncodedOutput')) {
    $wgHooks['ParserAfterTidy'][] = 'processEncodedOutput';
    function processEncodedOutput( &$out, &$text ) {
 
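        # Find all hidden content and restore to normal.
        # The character class covers the whole base64 alphabet, including "/";
        # a callback replaces the "e" modifier, which PHP 7 removed.
        $text = preg_replace_callback(
            '/<!-- ENCODED_CONTENT ([0-9a-zA-Z+\/]+=*) -->/',
            function ( $matches ) {
                return base64_decode( $matches[1] );
            },
            $text
        );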
 
        return true;
    }
}

The output of this new extension is exactly as described for the original example extension. When you put "<example />" in a page, the rendered page contains "Here is some text, isn't that great!".

The difference is that for the other examples (those with more complicated output), the whitespace/list and Tidy processing have been bypassed and therefore do not mangle the output. Problem solved!
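As a sketch of that claim, here is a script-emitting callback like the earlier ones with only its return statement changed. It assumes the processEncodedOutput function from the listing above is registered on ParserAfterTidy; the script body combines the indentation and the '&&' from the two problem examples.

```php
<?php
# A script-emitting callback that hides its output.
# Assumes processEncodedOutput (see the listing above) is hooked into
# ParserAfterTidy so that the comment is decoded again later.
function renderExample( $input, $argv, &$parser ) {

    # Create a script block with indentation and "&&", both of which
    # the later parser passes would otherwise alter
    $output =
        "<script type='text/javascript'>/*<![CDATA[*/\n".
        "  if (a && b) alert('both a and b are true!');\n".
        "/*]]>*/</script>";

    # Hiding content from parser (to be decoded later)
    return '<!-- ENCODED_CONTENT '.base64_encode($output).' -->';
}
```

What the parser sees is a single HTML comment with no leading whitespace and no reserved characters, so there is nothing for the later passes to rewrite.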

Caveats and gotchas

The downside of hooking into ParserAfterTidy for this purpose is that it becomes possible for the output of your wiki page to be non-XHTML compliant since the output of an extension may not be well behaved.

This can be avoided by using ParserBeforeTidy, but then the HTML entity conversion problem returns, because Tidy runs after the content has been revealed. It's a tradeoff you have to weigh when choosing a hook.
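Switching between the two is a one-line change in where the handler is registered. A minimal sketch, assuming the processEncodedOutput function from the workaround listing and assuming the same handler signature suits both hooks:

```php
<?php
# Sketch: register the same handler on the earlier hook instead.
# $wgHooks is MediaWiki's global hook registry; processEncodedOutput is
# the function defined in the workaround listing (assumed to be loaded).
$wgHooks['ParserBeforeTidy'][] = 'processEncodedOutput';
```

With this registration the revealed output still bypasses whitespace and list processing, but it does pass through Tidy.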

RawHTML Disclaimer

BE WARNED! The RawHTML extension is only meant as a proof-of-concept extension and should not be installed lightly. It is dangerous since it introduces an XSS vulnerability on publicly editable wikis (in this context, both guest-editable and voluntary-registration wikis are considered "publicly editable").

Only consider installing it if you have a trusted, limited editorship. Use at your own risk!