Update: Since this article was originally published, I've found a much better way to keep extension output from being garbled.
If you've ever written a MediaWiki tag extension, you've likely run into this problem. Because of the way the Parser handles extension tags, some parsing still occurs after the extension has run, potentially mangling otherwise perfect output. The obvious question is how to prevent it.
This article will demonstrate one solution.
Suppose you want to insert a <form> into a MediaWiki article, triggered by an extension tag. You might be tempted to try something like this:
$wgExtensionFunctions[] = "wfExampleExtension";

function wfExampleExtension() {
    global $wgParser;
    # register the extension with the WikiText parser
    $wgParser->setHook( "example", "renderExample" );
}

# The callback function for converting the input text to HTML output
function renderExample( $input, $argv, &$parser ) {
    $output = "Here is the form: \n";
    $output .= "<form action='/path/to/form/processor/doSomething.php' method='post'>\n";
    $output .= "Enter your name: <input type='text' name='yourName' />\n\n";
    $output .= "<input type='submit' value='Submit' />\n";
    $output .= "</form>\n";
    return $output;
}
In your article, you dutifully call your extension like so (expecting to see your <form> exactly as specified):
<example />
After saving the page and viewing the HTML source, you realize that although some of your markup made it to the rendered page, the form tags are missing! They have been stripped by the Parser because certain tags (including <form>) are not permissible in MediaWiki markup.
You would also find that the newline characters ("\n" in the snippet) have been quietly converted into <p> tags. This is because the Parser's whitespace processing doesn't occur until after the extension tag insertion step.
The question of how to prevent this has been asked before. To which the answer has historically been:
This is no longer the case. Without changing anything in the MediaWiki core code, it is possible to slip your extension's output past the Parser unscathed. To do this, we will hook OutputPageBeforeHTML, which executes after all parsing has completed, just before page display.
Functions that hook OutputPageBeforeHTML are passed two parameters: $out, the OutputPage object, and $text, the rendered HTML that is about to be added to the page.
Since our hook function will accept $text by reference, we can modify it at will. This is how we'll sneak our extension output past the parser.
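To make the by-reference idea concrete, here's a rough standalone sketch. Note that runHandlersSketch() and markPageSketch() are made-up names standing in for MediaWiki's hook dispatch and for a hook function; neither is part of MediaWiki, and the real dispatcher does considerably more.

```php
<?php
# Hypothetical stand-in for a hook dispatcher: calls each handler in turn
function runHandlersSketch( array $handlers, &$out, &$text ) {
    foreach ( $handlers as $handler ) {
        # Putting references in the argument array preserves
        # by-reference semantics through call_user_func_array()
        call_user_func_array( $handler, array( &$out, &$text ) );
    }
}

# Hypothetical handler with the same signature as our hook function
function markPageSketch( &$out, &$text ) {
    $text .= "<!-- seen by handler -->";
    return true;
}

$out  = null;                       # placeholder for the OutputPage object
$text = "<p>Rendered page</p>";     # placeholder for the parsed HTML
runHandlersSketch( array( 'markPageSketch' ), $out, $text );
echo $text, "\n";
```

Because $text travels by reference the whole way, the change made inside the handler is visible to the caller afterwards.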
Consider this enhanced version of the previous example:
$wgExtensionFunctions[] = "wfExampleExtension";

function wfExampleExtension() {
    global $wgParser;
    # register the extension with the WikiText parser
    $wgParser->setHook( "example", "renderExample" );
}

# The callback function for converting the input text to HTML output
function renderExample( $input, $argv, &$parser ) {
    $output = "Here is the form: \n";
    $output .= "<form action='/path/to/form/processor/doSomething.php' method='post'>\n";
    $output .= "Enter your name: <input type='text' name='yourName' />\n\n";
    $output .= "<input type='submit' value='Submit' />\n";
    $output .= "</form>\n";
    # Using base64 within an XML comment to sneak past the parser
    return '<!-- ENCODED_CONTENT '.base64_encode($output).' -->';
}

# Wrapping this for safety, just in case another extension beat us to it!
if (!function_exists('processEncodedOutput')) {
    # Happens just before page display, after all processing
    $wgHooks['OutputPageBeforeHTML'][] = 'processEncodedOutput';

    # Search and replace the XML comments with the hidden, encoded content
    function processEncodedOutput( &$out, &$text ) {
        $text = preg_replace_callback(
            # base64 may contain letters, digits, '+' and '/', padded with '='
            '#<!-- ENCODED_CONTENT ([0-9a-zA-Z+/]+=*) -->#',
            function ( $matches ) {
                return base64_decode( $matches[1] );
            },
            $text
        );
        return true;
    }
}
Let's look at the important pieces individually. First, consider this line from renderExample():
return '<!-- ENCODED_CONTENT '.base64_encode($output).' -->';
Rather than returning the generated $output directly, we're hiding it (base64 encoded) inside an XML comment.
Base64 output consists only of alphanumeric characters plus '+', '/' and '='. The Parser therefore largely ignores it, since it looks just like regular text: no line breaks, HTML entities, or wiki markup characters.
Although XML comments are stripped from standard wiki text before page display, that stripping happens before the tag extension step, so a comment returned by the extension flies under the Parser's radar without modification.
Next, consider the processEncodedOutput() function:
# Search and replace the XML comments with the hidden, encoded content
function processEncodedOutput( &$out, &$text ) {
    $text = preg_replace_callback(
        # base64 may contain letters, digits, '+' and '/', padded with '='
        '#<!-- ENCODED_CONTENT ([0-9a-zA-Z+/]+=*) -->#',
        function ( $matches ) {
            return base64_decode( $matches[1] );
        },
        $text
    );
    return true;
}
Since we attached this to the 'OutputPageBeforeHTML' hook, processEncodedOutput() will execute just prior to page display. It scans through the rendered page looking for our XML comments. When found, these comments are replaced by the original, unencoded content.
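If you'd like to convince yourself that the round trip is lossless without touching a wiki, something like the following standalone sketch should do. The helper names encodeForParserSketch() and decodeEncodedContentSketch(), the '/submit.php' action, and the sample page are all invented for illustration; only base64_encode(), base64_decode() and preg_replace_callback() are real PHP functions. It uses preg_replace_callback() and a pattern that also accepts '/', since base64 can emit that character.

```php
<?php
# Hypothetical helper: hide HTML inside a comment, as renderExample() does
function encodeForParserSketch( $html ) {
    return '<!-- ENCODED_CONTENT ' . base64_encode( $html ) . ' -->';
}

# Hypothetical helper: restore every hidden block found in $text
function decodeEncodedContentSketch( $text ) {
    return preg_replace_callback(
        '#<!-- ENCODED_CONTENT ([0-9a-zA-Z+/]+=*) -->#',
        function ( $matches ) {
            return base64_decode( $matches[1] );
        },
        $text
    );
}

# Made-up form markup and surrounding page
$form = "<form action='/submit.php' method='post'>\n"
      . "<input type='submit' value='Submit' />\n"
      . "</form>\n";
$page = "<p>Intro</p>\n" . encodeForParserSketch( $form ) . "\n<p>Outro</p>";

# While encoded, the page contains no <form> tag for a parser to strip
$hidden = ( strpos( $page, '<form' ) === false );

# After decoding, the original markup is back, newlines included
$restored = decodeEncodedContentSketch( $page );

# "???" encodes to a string containing '/', so the pattern must allow it
$slashCase = decodeEncodedContentSketch( encodeForParserSketch( "???" ) );
```

The last line is there because an input such as "???" base64-encodes to a string containing '/'; a pattern that left that character out would silently skip such blocks.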
Please note that OutputPageBeforeHTML only executes on page view, not on "Show Preview", so you won't see the final output until after the page is saved. You will, however, see the "ENCODED_CONTENT" comments if you view the HTML source of the preview page.
Also, you may have trouble using this method if your MediaWiki install employs the Squid cache, as explained in Mediazilla's bug 7050.
The bug report explains:
It would therefore seem that any extension which hooks OutputPageBeforeHTML, as described in this article, runs the risk of putting additional burden on the server, since processEncodedOutput() is executed for every page and on every view.
We all know that "premature optimization is the root of all evil", but it's something to keep an eye out for, should you install an extension that uses this particular Parser bypass method.
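If the per-view cost ever does become a concern, one cheap mitigation might be to skip the regular expression entirely on pages that contain no marker. This is only a sketch under that assumption: decodeIfMarkedSketch() is a made-up name, and the guard merely trims the work done on each view; it does nothing about the caching behaviour discussed in the bug report.

```php
<?php
# Hypothetical variant of the decoding step with a cheap early exit
function decodeIfMarkedSketch( $text ) {
    # strpos() is a plain substring search, far cheaper than a regex pass
    if ( strpos( $text, '<!-- ENCODED_CONTENT ' ) === false ) {
        return $text;   # nothing to decode on this page
    }
    return preg_replace_callback(
        '#<!-- ENCODED_CONTENT ([0-9a-zA-Z+/]+=*) -->#',
        function ( $matches ) {
            return base64_decode( $matches[1] );
        },
        $text
    );
}

# A page without markers comes back untouched
$plain = decodeIfMarkedSketch( "<p>No extension output here</p>" );

# A page with a marker is decoded as before
$marked = decodeIfMarkedSketch(
    'A <!-- ENCODED_CONTENT ' . base64_encode( "<b>hi</b>" ) . ' --> B'
);
```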