XML DTD Validation in Clojure: Turning It Off, Parsing Malformed XML

I wanted to parse some externally-generated and malformed HTML, so naturally I went to the short and sweet clojure.xml/parse function. I got a nasty error:

error: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

It seems that the W3C blocked access to the DTDs two years ago, but Java still tries to load them by default anyway. The following allows clojure.xml to run without checking external DTDs:

(defn startparse-sax-non-validating [s ch]
  (.. (doto (javax.xml.parsers.SAXParserFactory/newInstance)
        (.setValidating false)
        ;; SAX2 feature flags that stop the parser from fetching or
        ;; validating against external DTDs
        (.setFeature "http://apache.org/xml/features/nonvalidating/load-dtd-grammar" false)
        (.setFeature "http://apache.org/xml/features/nonvalidating/load-external-dtd" false)
        (.setFeature "http://xml.org/sax/features/validation" false)
        (.setFeature "http://xml.org/sax/features/external-general-entities" false)
        (.setFeature "http://xml.org/sax/features/external-parameter-entities" false))
      (newSAXParser)
      (parse s ch)))

Then you can simply call (xml/parse "sourcefile.xml" startparse-sax-non-validating). This is not the ideal solution — ideally, we want to use a locally cached DTD — but it works well enough for one-off code. Read on for further information.
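For concreteness, a minimal usage sketch; the file name is a placeholder, and it assumes the function above is already defined in the current namespace:

(require '[clojure.xml :as xml])

(def doc (xml/parse "sourcefile.xml" startparse-sax-non-validating))
;; doc is the usual clojure.xml tree of nested maps:
;; {:tag :html, :attrs nil, :content [...]}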

W3C’s malformed specification

The W3C webserver was being DDoSed by the massive number of XML parsers constantly trying to download the DTDs, so they just blocked everyone. This is a direct result of the poorly considered technical specification for DTD identification, and of URIs themselves, exacerbated by the W3C’s institutional culture and philosophy.

Technical commentary

The root of the problem is, as they put it, that “these [DTD URIs] are not hyperlinks; these URIs are used for identification” (emphasis in original). This is an abuse of the URL format. If it’s “for identification,” then it doesn’t need to specify a transport protocol! They are relying upon their particular interpretation of the old URI/URL debate, reducing the transport protocol element to a namespace identifier instead, and then assuming everyone will defer to their judgement and wisdom on the matter.

The W3C (and others who share their URL/URI interpretation) decided to conflate the semantic functions of locating and transporting a resource with the very distinct function of identifying a resource. “My name” is not the same thing as an instruction to “get in your car, drive to my street address, pick up someone with my name here, and then drive me back to your house,” even though there is indeed only one person with my name at this address. If you want to identify me, you would not give someone the latter instruction. Most likely, they didn’t think about this at all and just did it. Now everyone suffers for this poor design decision. They are being DDoSed and paying massive amounts of money to keep the servers going, and everyone’s XML parsers are broken because they’re now returning 503s for all the DTD URIs.

The W3C’s proposed solution is that everyone else should a) defer to their views on the URL/URI matter, and b) rewrite their XML parsers to be massively more complex and build local caching DTD catalogs for every single one-off parse. In practice, most people will just turn DTD validation off, potentially breaking a lot of other things.
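To be fair, a middle path is smaller than a full catalog implementation. Here is a minimal sketch of a startparse function that installs a SAX EntityResolver serving the DTD from a local copy; the cache path is hypothetical, and it assumes s is a String naming a file or URI:

(defn startparse-sax-cached-dtd [s ch]
  (let [reader (.. (javax.xml.parsers.SAXParserFactory/newInstance)
                   (newSAXParser)
                   (getXMLReader))]
    (doto reader
      (.setContentHandler ch)
      (.setEntityResolver
       (reify org.xml.sax.EntityResolver
         (resolveEntity [_ public-id system-id]
           ;; Serve the cached copy instead of hitting w3.org; returning
           ;; nil for anything else falls back to default resolution.
           (when (and system-id
                      (.endsWith ^String system-id "xhtml1-transitional.dtd"))
             (org.xml.sax.InputSource.
              (java.io.FileReader. "dtd-cache/xhtml1-transitional.dtd"))))))
      (.parse (org.xml.sax.InputSource. ^String s)))))

Then (xml/parse "sourcefile.xhtml" startparse-sax-cached-dtd) resolves the DTD locally without ever touching the network.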

Institutional commentary on the W3C itself

The tone of naive incredulity in the W3C’s plea is most amazing of all. They seem genuinely baffled as to why anyone would ever choose to ignore their wisdom on the URL/URI matter, and even more confused as to how anyone could ignore any detail of baroque and poorly communicated specifications like those for XML and DTD cataloging. “We spent a long time writing that, and we have [the self-appointed] institutional authority to write specifications and make determinations like what a URI should be, so surely people would never just ignore us!”

For them, the only conceivable explanation for every detail of the specification not being implemented is sheer ignorance. They even gently chide the readers about their failure to use URIs properly and to build DTD caches into everything, much as a kindergarten teacher would a well-meaning but rather dumb child who continually ignores instructions about how to use his crayons. They seem to think that if they just patiently remind people often enough that there is indeed a right way to do it, the way sanctioned by the appropriate institutional authorities (them, of course), then everything will finally be fine once people get the message.

This is the sort of ingenuous attitude that can only develop in a pure research institute and among people who mostly have very high power distance index scores. The high PDI explains their naive arrogation of global authority and their utterly genuine assumption that everyone ought to just do as they say, simply because they’re the (self-)designated authority. The research-institute setting means that they’re shut off from the economic real world. Since money is never a big issue, the worship of artificially created fancy titles and institutional authority can become rampant, and it’s perfectly fine to spend extended amounts of time parsing a two-page XML document “the right way” rather than spending 15 minutes hacking it and moving on to more pressing problems so your company doesn’t go bankrupt in the interim.

Moreover, they are still intransigently defending the original poor design choice to use URIs with embedded transport protocols as unique identifiers, because they (as an institution) cannot admit the possibility that this was a poor choice. That would create a great deal of cognitive dissonance vis-à-vis their conception of the nature of an institution and its role in dictating what is and is not correct in the world, and vis-à-vis their view that institutional authority must be right because, well, otherwise it wouldn’t be an institution. After all, everyone there does have more fancy titles than most people who are not there.

These attitudes are pervasive throughout most W3C work, and are the main reason that real-world implementations ignore much of their work.

Parsing malformed HTML

So, in any case, I have another problem besides Java trying to load the unavailable DTD. The input HTML is malformed. Not just a little; it’s utterly broken. It’s both externally-generated and irreplaceable, so fixing it at the source is not an option. I’m stuck with it; it’s the only way to get the data I need. (That’s real-world programming as opposed to academia right there.)

Disabling XML validation

Disabling all XML validation is the obvious solution to both problems, but there is no readily apparent way to do that in clojure.xml. Calling setValidating(false) on the factory is insufficient. Thanks to Stack Overflow, I found that you must also use setFeature() with several obscure “SAX2 Standard Feature Flags”, as above.

HTML Tidy to the rescue

This still leaves the problem of the malformed HTML. There does not seem to be any easy way to make SAX ignore errors. ((.setFeature "http://apache.org/xml/features/continue-after-fatal-error" true) has no apparent effect; it still doesn’t continue after a fatal parse error.)

In the end, I just ran the malformed input HTML through HTML Tidy before attempting to parse it. HTML Tidy coerces the malformed input into valid XML, and SAX is finally happy.
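For reference, a minimal sketch of that preprocessing step using JTidy, the Java port of HTML Tidy; this assumes JTidy is on the classpath, and the file names are placeholders:

(defn tidy-to-xml [in-file out-file]
  (with-open [in  (java.io.FileInputStream. in-file)
              out (java.io.FileOutputStream. out-file)]
    (doto (org.w3c.tidy.Tidy.)
      (.setXmlOut true)        ; emit well-formed XML rather than HTML
      (.setQuiet true)         ; suppress progress messages
      (.setShowWarnings false) ; we already know the input is broken
      (.parse in out))))

;; (tidy-to-xml "broken.html" "clean.xml")
;; (xml/parse "clean.xml" startparse-sax-non-validating)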

(I now get “Exception in thread "Swank REPL Thread" java.lang.RuntimeException: java.lang.IllegalMonitorStateException” from SLIME when I try to actually display the parsed XML in Emacs, but that’s another matter…)

4 Responses to XML DTD Validation in Clojure: Turning It Off, Parsing Malformed XML

  1. I had to do some HTML parsing in Clojure recently. After looking at several available solutions, I settled on using HtmlCleaner — it seems to have the cleanest syntax, at least for usage from Clojure. IMHO, SAX adds too much incidental complexity.

    Here is the code to parse HTML source and return the title and content of the page: https://gist.github.com/5cf012a929d5c35c98a0

    • Siddhartha,
      I agree about SAX: the benefits are not worth the additional complexity. HtmlCleaner looks much nicer. I’ll give it a try the next time I need to do some HTML parsing. Thanks!
      Best,
      Paul