Counting Word Frequencies in English Text - 军军小站|张军博客

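　　In NLP text classification you almost always count how often each word occurs in a document, because word frequency is one of the main features a classifier relies on. Each entry is a pair: the word is the key and its frequency is the value. What is the best container for this? A map, of course. A vector or a list would also work, but key-based lookup in them is slower: std::map is usually implemented as a red-black tree, so a lookup costs O(log n), whereas finding a key in a list or an unsorted vector means a linear scan, O(n).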
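To make the lookup comparison concrete, here is a minimal sketch of the same query against both containers. `WordCount`, `FindInVector` and `FindInMap` are hypothetical names used only for illustration; they are not part of the program in this post.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordCount;

// Linear search: every element may have to be inspected, so lookup is O(n).
// Returns the stored count, or 0 if the word is absent.
int FindInVector(const std::vector<WordCount>& counts, const std::string& word) {
    for (std::size_t i = 0; i < counts.size(); ++i) {
        if (counts[i].first == word) return counts[i].second;
    }
    return 0;
}

// std::map keeps its keys ordered, so find() needs only O(log n) comparisons.
// Returns the stored count, or 0 if the word is absent.
int FindInMap(const std::map<std::string, int>& counts, const std::string& word) {
    std::map<std::string, int>::const_iterator it = counts.find(word);
    return it == counts.end() ? 0 : it->second;
}
```

For a handful of words the difference is negligible; it starts to matter once the vocabulary grows to thousands of distinct words and every word of the input triggers a lookup.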

　　Code first.

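header

    #ifndef WORD_FREQUENCE_H_
    #define WORD_FREQUENCE_H_

    #include <cstddef>   // NULL
    #include <iostream>
    #include <map>
    #include <string>

    class WordFrequence{
    public:
        WordFrequence(): file_name_(NULL), text(NULL){}
        // Takes const char* so that a string literal can be passed in.
        WordFrequence(const char* file_name): file_name_(file_name), text(NULL){
            LoadFromFile();
            ReplaceSymbol();
            parse();
        }
        // Release the buffer allocated in LoadFromFile().
        ~WordFrequence(){ delete[] text; }
    private:
        // The class owns a raw buffer, so copying is disabled (declared, never defined).
        WordFrequence(const WordFrequence&);
        WordFrequence& operator=(const WordFrequence&);

        const char* file_name_;
        char* text;
        std::map<std::string, int> word_frequence_map_;
        void parse();
        void ReplaceSymbol();
        void LoadFromFile();
        bool IsWhiteChar(const char chr);
        friend std::ostream& operator<<(std::ostream& os, const WordFrequence& wf);
    };

    #endif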

cpp

    #include "word_frequence.h"
    #include <cstring>   // strlen
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>

    // Punctuation characters that ReplaceSymbol() turns into spaces.
    const char* symbols = "~!@#$%^&*()_+-=[]\\{}|:\";',./<>?";
    // Size of the text buffer; at most MAX_SIZE - 1 bytes of the file are read.
    const int MAX_SIZE = 100000;

    // '\0' counts as white space so that the terminator also ends the last word.
    bool WordFrequence::IsWhiteChar(const char chr){
        switch (chr){
        case '\t':
        case '\r':
        case '\n':
        case ' ':
        case '\0':
            return true;
        default:
            return false;
        }
    }

    void WordFrequence::LoadFromFile(){
        // Value-initialize the buffer: it is zero-filled, hence always null-terminated.
        text = new char[MAX_SIZE]();
        std::ifstream is(file_name_, std::fstream::in);
        if(!is){
            std::cerr<<"error: can't open file: "<<"["<<file_name_<<"]"<<std::endl;
            return;   // text stays empty
        }
        is.read(text, MAX_SIZE - 1);
    }

    void WordFrequence::parse(){
        word_frequence_map_.clear();
        int count = static_cast<int>(std::strlen(text));
        int index = 0;
        // Skip leading white space so that no empty word is counted.
        while(index < count && IsWhiteChar(text[index]))
            ++index;
        while(index < count){
            for(int i=index; i<=count; ++i){
                if(IsWhiteChar(text[i])){
                    // Build the word directly as a std::string; no temporary buffer to leak.
                    std::string str(text + index, i - index);
                    ++word_frequence_map_[str];
                    index = i+1;
                    // Skip the rest of the white-space run without walking past the terminator.
                    while(index < count && IsWhiteChar(text[index]))
                        ++index;
                    break;
                }
            }
        }
    }

    void WordFrequence::ReplaceSymbol(){
        const int symbol_count = static_cast<int>(std::strlen(symbols));
        int j=0;
        // Overwrite every punctuation character with a space so that it separates words.
        while(text[j] != '\0'){
            for(int i=0; i<symbol_count; ++i){
                if(text[j]==symbols[i])
                    text[j]=' ';
            }
            j++;
        }
    }

    std::ostream& operator<<(std::ostream& os, const WordFrequence& wf){
        os<<"word\t\tfrequence"<<std::endl;
        os<<"-----------------------"<<std::endl;
        std::map<std::string, int>::const_iterator i_begin = wf.word_frequence_map_.begin();
        std::map<std::string, int>::const_iterator i_end = wf.word_frequence_map_.end();
        while(i_begin != i_end){
            os<<i_begin->first<<"\t\t"<<i_begin->second<<std::endl;
            ++i_begin;
        }
        return os;
    }

      
                
main

    #include <iostream>
    #include "word_frequence.h"

    using namespace std;

    int main(int argc, char* argv[])
    {
        WordFrequence wf("d:\\test.txt");
        cout << wf;   // print the word/frequency table
        return 0;
    }

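　　The implementation is simple: load the text from the file, replace the punctuation in it with spaces, then scan the text once and put every word encountered into the map.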
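The same three steps can also be sketched over a std::string instead of a raw buffer. This is only an illustration under a few assumptions: `CountWords` is a hypothetical helper, it takes the text as an argument rather than loading a file, and it uses `std::ispunct` in place of the hand-written symbol table.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: counts the whitespace-separated words in `text`
// after replacing every punctuation character with a space.
std::map<std::string, int> CountWords(std::string text) {
    for (std::size_t i = 0; i < text.size(); ++i) {
        // The cast avoids undefined behaviour for negative char values.
        if (std::ispunct(static_cast<unsigned char>(text[i]))) text[i] = ' ';
    }
    std::map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) ++counts[word];   // >> skips any run of white space
    return counts;
}
```

Called as `CountWords("the cat, the dog.")`, it maps `the` to 2 and `cat` and `dog` to 1 each.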

      ++word_freq_map["word"];

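This one statement is wonderfully handy. It handles insertion into the map in a single line: if the key already exists, its value is incremented; if it does not, operator[] first inserts it with a value of 0 and the increment makes it 1.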
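To show what that one-liner saves, here is a small sketch that spells out the equivalent find-then-insert logic next to it. `AddWord` and `AddWordLonghand` are hypothetical names chosen for this comparison.

```cpp
#include <map>
#include <string>
#include <utility>

// The one-liner: operator[] inserts {word, 0} when the key is missing,
// and the increment then turns the count into 1.
void AddWord(std::map<std::string, int>& counts, const std::string& word) {
    ++counts[word];
}

// The same effect written out by hand.
void AddWordLonghand(std::map<std::string, int>& counts, const std::string& word) {
    std::map<std::string, int>::iterator it = counts.find(word);
    if (it == counts.end()) {
        counts.insert(std::make_pair(word, 1));   // first occurrence
    } else {
        ++it->second;                             // seen before
    }
}
```

One thing to keep in mind: operator[] inserts a missing key even when you only read through it, so use find() when you want to query a count without growing the map.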

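　　This program is a small exercise in using the map container, and it also touches on reading files. Note that after opening an ifstream you must always check whether the open succeeded. An input stream offers several ways to read: formatted extraction with >>, line-at-a-time reading with getline, character-at-a-time reading with get, and read, which pulls in a large block of raw bytes in one call. Whichever you choose, you need to know in advance what the file contains.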
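The four reading styles can be compared side by side on an in-memory stream. The helper names below (`ReadToken`, `ReadLine`, `ReadChar`, `ReadBlock`) are hypothetical, and an std::istringstream stands in for a file so that the sketch needs no input file.

```cpp
#include <cstddef>
#include <ios>
#include <sstream>
#include <string>

// Formatted extraction: skips leading white space, stops at the next white space.
std::string ReadToken(const std::string& input) {
    std::istringstream in(input);
    std::string token;
    in >> token;
    return token;
}

// Line-based: everything up to, but not including, the first '\n'.
std::string ReadLine(const std::string& input) {
    std::istringstream in(input);
    std::string line;
    std::getline(in, line);
    return line;
}

// Character-based: exactly one character, white space included.
char ReadChar(const std::string& input) {
    std::istringstream in(input);
    char c = '\0';
    in.get(c);
    return c;
}

// Block-based: up to n raw bytes in a single call.
std::string ReadBlock(const std::string& input, std::size_t n) {
    if (n == 0) return std::string();
    std::istringstream in(input);
    std::string block(n, '\0');
    in.read(&block[0], static_cast<std::streamsize>(n));
    block.resize(static_cast<std::size_t>(in.gcount()));   // keep only what was read
    return block;
}
```

Given the input `" hello world\nnext"`, `ReadToken` returns `hello`, `ReadLine` returns ` hello world` with its leading space, `ReadChar` returns the space, and `ReadBlock(input, 4)` returns ` hel`.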

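　　If you need to read a lot of data at once, read is a good choice: it fetches the whole block in one call and is efficient, whereas reading in a loop, one character or one token at a time, can be much slower.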
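As a rough sketch of the read-everything-at-once approach, the helper below sizes a std::string to the file and fills it with a single read call. `ReadWholeFile` is a hypothetical name, and the sketch assumes the file fits comfortably in memory.

```cpp
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper: loads the whole file at `path` into `out`.
// Returns false if the file cannot be opened or fully read.
bool ReadWholeFile(const char* path, std::string& out) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) return false;                  // always check that the open succeeded
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg(); // file size in bytes
    if (size < 0) return false;
    in.seekg(0, std::ios::beg);
    out.assign(static_cast<std::string::size_type>(size), '\0');
    if (size == 0) return true;             // empty file: nothing to read
    in.read(&out[0], static_cast<std::streamsize>(size));
    return in.gcount() == static_cast<std::streamsize>(size);
}
```

Unlike the fixed MAX_SIZE buffer, this version adapts to the file size, at the cost of one extra seek to measure the file.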
