Converting HTML to Plain Text in Python - 军军小站|张军博客

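This article shows, by example, how to convert HTML to plain text in Python. The details are as follows:

Today a project required converting HTML to plain text. A quick search online turned up a wide variety of ways to do this in Python.

Here are the two methods I tried myself today, for the benefit of later readers: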

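Method 1:

1. Install nltk, which is available from PyPI.

(Note: it depends on the following packages: numpy, PyYAML. The clean_html function used below exists only in NLTK 2.x; in NLTK 3.0 it was replaced by a stub that raises NotImplementedError and recommends BeautifulSoup's get_text() instead.)

2. Test code: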

          
    >>> import nltk
    >>> aa = r'''
    <html>
                <body>
                    <b>Project:</b> DeHTML<br>
                    <b>Description</b>:<br>
                    This small script is intended to allow conversion from HTML markup to 
                    plain text.
                </body>
            </html>
            '''
    >>> aa
    '\n<html>\n            <body>\n                <b>Project:</b> DeHTML<br>\n                <b>Description</b>:<br>\n                This small script is intended to allow conversion from HTML markup to \n                plain text.\n            </body>\n        </html>\n        '
    >>> print nltk.clean_html(aa)
    Project: DeHTML
         Description :
        This small script is intended to allow conversion from HTML markup to
        plain text.
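Because clean_html is not usable in current NLTK releases, the same idea can be approximated with the standard library alone. The following is a minimal Python 3 sketch, not NLTK's actual implementation: the helper name strip_tags_regex is hypothetical, and regular expressions only cope with reasonably well-formed markup.

```python
import html
import re


def strip_tags_regex(markup):
    """Rough tag stripper (hypothetical helper, not part of nltk)."""
    # Drop <script>/<style> elements together with their contents.
    cleaned = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", "", markup)
    # Drop HTML comments.
    cleaned = re.sub(r"(?s)<!--.*?-->", "", cleaned)
    # Replace every remaining tag with a space so adjacent words stay apart.
    cleaned = re.sub(r"(?s)<[^>]*>", " ", cleaned)
    # Decode entities such as &amp;.
    cleaned = html.unescape(cleaned)
    # Collapse all whitespace runs (including newlines) into single spaces.
    return re.sub(r"\s+", " ", cleaned).strip()


sample = """
<html>
    <body>
        <b>Project:</b> DeHTML<br>
        <b>Description</b>:<br>
        This small script is intended to allow conversion from HTML markup to
        plain text.
    </body>
</html>
"""

print(strip_tags_regex(sample))
```

Unlike the session above, this sketch flattens everything onto one line, so it suits cases where line breaks in the result do not matter.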

Method 2:

If nltk feels too heavyweight and like overkill for this job, you can write the conversion yourself. The code is as follows:

          
    try:
        from html.parser import HTMLParser  # Python 3
    except ImportError:
        from HTMLParser import HTMLParser  # Python 2
    from re import sub
    from sys import stderr
    from traceback import print_exc

    class _DeHTMLParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.__text = []  # collected text fragments

        def handle_data(self, data):
            # Keep non-empty text and collapse internal whitespace runs.
            text = data.strip()
            if len(text) > 0:
                text = sub('[ \t\r\n]+', ' ', text)
                self.__text.append(text + ' ')

        def handle_starttag(self, tag, attrs):
            # <p> starts a new paragraph, <br> a new line.
            if tag == 'p':
                self.__text.append('\n\n')
            elif tag == 'br':
                self.__text.append('\n')

        def handle_startendtag(self, tag, attrs):
            # Self-closing <br/> is rendered as a blank line.
            if tag == 'br':
                self.__text.append('\n\n')

        def text(self):
            return ''.join(self.__text).strip()

    def dehtml(text):
        # On a parsing failure, log the traceback and return the input unchanged.
        try:
            parser = _DeHTMLParser()
            parser.feed(text)
            parser.close()
            return parser.text()
        except Exception:
            print_exc(file=stderr)
            return text

    def main():
        text = r'''
            <html>
                <body>
                    <b>Project:</b> DeHTML<br>
                    <b>Description</b>:<br>
                    This small script is intended to allow conversion from HTML markup to
                    plain text.
                </body>
            </html>
        '''
        print(dehtml(text))

    if __name__ == '__main__':
        main()

Output:

>>> ================================ RESTART ================================
>>>
Project: DeHTML
Description :
This small script is intended to allow conversion from HTML markup to plain text.
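One limitation of the Method 2 parser is worth illustrating: HTMLParser passes the contents of script and style elements to handle_data, so JavaScript or CSS in a real page would end up in the output. Below is a minimal Python 3 sketch of one way to skip them. The names TextOnlyParser and html_to_text are hypothetical and not part of the original script, and self-closing br tags are treated like ordinary ones here.

```python
from html.parser import HTMLParser
import re


class TextOnlyParser(HTMLParser):
    """Collects visible text and ignores <script>/<style> contents.

    Hypothetical illustration; not part of the original script.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "p":
            self._chunks.append("\n\n")
        elif tag == "br":
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # inside <script> or <style>: drop the data
        text = re.sub(r"\s+", " ", data.strip())
        if text:
            self._chunks.append(text + " ")

    def text(self):
        return "".join(self._chunks).strip()


def html_to_text(markup):
    parser = TextOnlyParser()
    parser.feed(markup)
    parser.close()
    return parser.text()


print(html_to_text(
    "<style>b { color: red; }</style>"
    "<p><b>Project:</b> DeHTML<br>"
    "<script>var x = 1;</script>plain text.</p>"
))
```

A depth counter is used instead of a boolean so that an unmatched closing tag cannot push the parser into a state where it drops visible text.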

I hope this article is helpful for your Python programming.
