The easiest approach is probably to import the HTML documents as symbolic XML, "fix" them with ReplaceAll and an appropriate set of replacement rules, and then export the result back as HTML.
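Something along these lines, as a rough sketch only. The sample HTML string, the output file name `fixed.html`, and the `<b>` to `<strong>` rule are all made up for illustration, since I don't know yet what actually needs fixing in your files. I'm also assuming that writing the result with the "XML" exporter is acceptable; it produces XHTML-style markup.

```mathematica
(* Hypothetical sample input; for a real file use Import[path, {"HTML", "XMLObject"}] *)
html = "<html><body><p>Some <b>bold</b> text</p></body></html>";

(* Import the HTML as a symbolic XML expression (XMLObject / XMLElement) *)
xml = ImportString[html, {"HTML", "XMLObject"}];

(* Illustrative rule (an assumption): rename every <b> element to <strong>,
   keeping its attributes and content unchanged *)
rules = {
   XMLElement["b", attrs_, content_] :> XMLElement["strong", attrs, content]
   };

(* Apply the rules with ReplaceAll *)
fixed = xml /. rules;

(* Write the result back out; "fixed.html" is a hypothetical file name *)
Export["fixed.html", fixed, "XML"];
```

Note that ReplaceAll does not look inside a part it has already replaced, so elements nested inside other matching elements may need ReplaceRepeated (`//.`) instead.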
Could you please provide a couple of smallish files to try this out?