HTML Purifier for PHP


User Provided Markup

On this site, I work with three markup systems regularly: the wiki uses Wikitext, the message board uses BBCode, and the Movable Type system uses plain old HTML. Movable Type doesn't do much parsing that I'm aware of, other than handling custom MT tags within my text for added functionality. That's to be expected, since only trusted people can create entries. The message board and the wiki use two different syntaxes to attack the same two problems:
  • Securing user provided markup
  • End user ease of use

Custom Syntax vs. HTML

The argument against implementing a user markup language is that any implementation is likely to be hastily built and therefore susceptible to any number of unforeseen problems. This is true. I know both BBCode and Wikitext parsers have had their issues in the past. Many implementations are mature enough that they should be considered safe, but the argument still stands because there are no universal parsing libraries and no standards document. Anybody implementing one of these things has to borrow, steal, or rewrite the code base, and the chances of reintroducing a long-ago fixed bug are pretty high.

The argument for an HTML implementation is that you can reasonably validate the input according to standards. Unfortunately, you can't trust that the end user is going to send you valid HTML unless, of course, the end user is using a WYSIWYG editor.

HTML Purifier

HTML Purifier is an HTML parser that will reasonably accept HTML input, remove unwanted tags and attributes, and output valid HTML, XHTML, etc. This isn't the first project to take on the task, but it looks to be more complete than other "secure" implementations.
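
For a sense of what that looks like in practice, here is a minimal sketch of calling the library. It assumes HTML Purifier was installed with Composer (package ezyang/htmlpurifier) so that the autoloader sits at vendor/autoload.php; the tag list and the sample input are made up for illustration.

```php
<?php
// Assumption: installed via Composer, so the autoloader is at vendor/autoload.php.
require_once __DIR__ . '/vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();

// Whitelist: only these elements (and the attributes in brackets) survive.
$config->set('HTML.Allowed', 'p,b,i,em,strong,a[href]');

// Skip the on-disk definition cache so this sketch needs no writable directory.
$config->set('Cache.DefinitionImpl', null);

$purifier = new HTMLPurifier($config);

// Illustrative input: an event-handler attribute and a script element.
$dirty = '<p onclick="evil()">Hi <b>there</b><script>alert(1)</script></p>';
$clean = $purifier->purify($dirty);

echo $clean, PHP_EOL;
```

The purify() call treats its argument as a fragment, so there is no need to wrap the input in a full document first.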

While it's interesting, I don't know that I would still want to go down this road. Just because you provide a WYSIWYG editor doesn't mean that end users will use it. I do like the idea of using a whitelist of acceptable tags rather than a blacklist. I also like the fact that HTML Purifier attacks the problem from the perspective that you are parsing a document fragment and not necessarily an entire document.

Where I differ philosophically is that the task of whitelisting user-provided tags should be simple, easily portable, and easy to read and update. It should also be very fast. I don't see that HTML Purifier has met those needs, and I think they added a great deal of complexity in order to satisfy the unreasonable desire that some people have to translate user-provided input from one version of HTML to another. My view of things may be a bit too simplistic, but in my experience the more complex you make a problem, the harder it is to resolve.
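
To make "simple" a little more concrete, here is a rough sketch of the kind of whitelist filter I have in mind, using PHP's bundled DOM extension. The function name, the tag list, and the http/https-only rule for links are all my own assumptions for illustration. It has not been hardened the way a mature library has, so treat it as a picture of the idea rather than something to deploy.

```php
<?php
// Hypothetical whitelist: tag name => attributes allowed on that tag.
const ALLOWED_TAGS = [
    'p' => [], 'b' => [], 'i' => [], 'em' => [], 'strong' => [],
    'a' => ['href'],
];

// Remove everything under $parent that is not on the whitelist.
function clean_children(DOMNode $parent): void
{
    // Copy the live child list first, because the tree changes while we walk it.
    foreach (iterator_to_array($parent->childNodes, false) as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            continue; // plain text is kept; it is escaped again on output
        }
        $tag = strtolower($child->nodeName);
        if (!($child instanceof DOMElement) || !array_key_exists($tag, ALLOWED_TAGS)) {
            // Simplest policy: drop the node and everything inside it.
            $parent->removeChild($child);
            continue;
        }
        foreach (iterator_to_array($child->attributes, false) as $attr) {
            if (!in_array(strtolower($attr->nodeName), ALLOWED_TAGS[$tag], true)) {
                $child->removeAttribute($attr->nodeName);
            }
        }
        // Assumed rule: links must be absolute http(s) URLs.
        if ($child->hasAttribute('href')
            && !preg_match('#^https?://#i', trim($child->getAttribute('href')))) {
            $child->removeAttribute('href');
        }
        clean_children($child);
    }
}

// Filter an HTML fragment (not a whole document) against the whitelist.
function whitelist_fragment(string $html): string
{
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $doc = new DOMDocument();
    // Wrap the fragment in a known element; the prefix hints UTF-8 to libxml.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper is the first div in document order.
    $root = $doc->getElementsByTagName('div')->item(0);
    if ($root === null) {
        return '';
    }
    clean_children($root);

    // Serialize only the wrapper's children, so the output is a fragment again.
    $out = '';
    foreach ($root->childNodes as $node) {
        $out .= $doc->saveHTML($node);
    }
    return $out;
}

echo whitelist_fragment('<p onclick="evil()">Hi <b>there</b><script>alert(1)</script></p>'), PHP_EOL;
```

Dropping a disallowed element together with its contents is the bluntest possible policy and will throw away harmless text inside, say, a span. Unwrapping such elements instead would be kinder to users, at the cost of a few more lines.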


