Parsing Bad Markup

Bill de hÓra has a nice writeup on parsing junk markup. I think anyone who has tried to extract anything out of markup will have faced this problem. It is tempting to take the idealistic stand and refuse to read junk markup, since well-formedness is the responsibility of the author. However, that is the moon; on earth we still face a lot of non-well-formed markup. Bill mentions tools like BeautifulSoup which can handle bad markup. I have earlier worked with TagSoup in Java for something similar. They can be real saviours!
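To make the difference concrete, here is a minimal sketch in Python. It does not use BeautifulSoup or TagSoup; it swaps in the standard library's `html.parser`, which is similarly forgiving, so that the example runs without installing anything. The sample markup and the names `JUNK`, `LinkCollector`, `strict_parse_ok` and `extract_links` are my own, made up for illustration. A strict XML parser rejects the input outright, while the tolerant parser still lets us pull the link out.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Deliberately broken markup (hypothetical sample): <p>, <b> and <a> are
# never closed, and </html> is missing altogether.
JUNK = (
    "<html><body><p>First<p>Second <b>bold"
    '<a href="http://example.com/page">a link</body>'
)


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags without requiring well-formedness."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def strict_parse_ok(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def extract_links(markup):
    """Feed the markup to the tolerant parser and return the hrefs it saw."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print("strict parser accepts it:", strict_parse_ok(JUNK))
    print("links found anyway:", extract_links(JUNK))
```

Note that `html.parser` only reports tags as it meets them; it does not repair the tree the way BeautifulSoup or TagSoup do. For simple extraction that is often enough, but for anything that needs a cleaned-up document the dedicated tools are the better choice.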




Abhijit Nadgouda


This is the weblog of Abhijit Nadgouda where he writes down his thoughts on software development and related topics.