Bill de hÓra has a nice writeup on parsing junk markup. I think anyone who has tried to extract anything from markup has faced this problem. Refusing to read junk markup on the grounds that well-formedness is the author's responsibility seems idealistic. That is the moon; here on earth we still face a lot of non-well-formed markup. Bill mentions tools like BeautifulSoup, which can handle bad markup. I have previously used TagSoup in Java for something similar. They can be real saviours!
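
To make the difference concrete, here is a minimal sketch of strict versus tolerant parsing. It is not how BeautifulSoup or TagSoup work internally; it only uses Python's standard library (`xml.etree.ElementTree` as the strict parser, `html.parser.HTMLParser` as the forgiving one). The sample markup and the names `JUNK`, `TextCollector` and `strict_parse_ok` are made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical junk markup: unclosed <p>, mis-nested <b>/<i>, bare <br>,
# and a missing </html>.
JUNK = "<html><body><p>First<p>Second <b>bold<i>both</b></i><br></body>"


class TextCollector(HTMLParser):
    """Tolerant parser: records start tags and text, never checks nesting."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # Skip whitespace-only chunks, keep the rest trimmed.
        if data.strip():
            self.text.append(data.strip())


def strict_parse_ok(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


collector = TextCollector()
collector.feed(JUNK)
collector.close()

print("strict parser accepts it:", strict_parse_ok(JUNK))
print("tags seen by tolerant parser:", collector.tags)
print("text recovered:", collector.text)
```

The strict parser gives up on the first mismatched tag, while the tolerant one still hands back the text. Tools like BeautifulSoup and TagSoup go further and build a usable tree out of such input.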

