Processing HCUP ASC-format data with Python and PyHCUP

Original source: https://github.com/jburke5/pyhcup

Outline

  • Environment setup
    • Python and Jupyter environment
    • conda virtual environment
  • About
  • Example Usage
    • Load a datafile/loadfile combination.
  • Sample program
  • Shortcut to loadfiles (meta data)
  • References


Translated by: season

Some US healthcare data is de-identified under HIPAA and made available to researchers for exploration at https://www.hcup-us.ahrq.gov/. However, the data is distributed in the uncommon fixed-width ASC format, so it has to be converted to CSV before it can be analyzed with big-data tools such as Spark.
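Once converted, the CSV loads straight into Spark. A minimal sketch, assuming a local PySpark installation (the file name is illustrative):

    # load the converted CSV into Spark (sketch; file name is illustrative)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hcup").getOrCreate()
    df = spark.read.csv("NY_SID_2016_CORE.csv", header=True, inferSchema=True)
    df.printSchema()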

Fortunately, back in 2015 someone wrote an open-source Python library that uses the SAS load programs to parse HCUP data. In this post we will walk through parsing and using HCUP's ASC data with Python.

Environment setup

Python and Jupyter environment

    # set environment variables
    export PATH="/root/anaconda2/bin/:$PATH"
    source ~/.bashrc

    jupyter notebook --no-browser --port 8888 --ip=0.0.0.0 --allow-root

    jupyter notebook --generate-config

In ~/ (or under C:\Users\Administrator on Windows), find the .jupyter folder and edit the jupyter_notebook_config.py file:

    # c.NotebookApp.notebook_dir = ''   # uncomment this line and set your notebook directory

conda virtual environment

    conda create -n iz_pyhcup --copy -y -q python=2.7 ipykernel pandas numpy

    source activate iz_pyhcup

    echo "y" | pip install PyHCUP

    echo "y" | pip install sqlalchemy

    source deactivate

About

PyHCUP is a Python library for parsing and importing data obtained from the United States Healthcare Cost and Utilization Project (http://hcup-us.ahrq.gov).


Data from HCUP comes as a fixed-width text file, with each column a specific width. However, the widths of these columns, and their names, are provided separately: HCUP distributes this meta data as either SAS or SPSS data loading programs.

PyHCUP is built to extract meta data from the SAS loading programs, then use that meta data to parse the actual data in the fixed-width text files. You’ll still need to acquire the actual data through HCUP.
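Conceptually, the parsing step is plain fixed-width reading: the SAS loadfile supplies the column names and widths, and the text file is sliced accordingly. A minimal sketch of the idea using pandas directly (the 'field' and 'width' column names mirror PyHCUP's meta DataFrame, but treat them as assumptions and verify against your version):

    # conceptual sketch of the parsing step: slice a fixed-width file
    # using the names/widths extracted from the SAS loadfile.
    # assumes meta_df has 'field' and 'width' columns (verify for your version)
    import pandas as pd

    def parse_fixed_width(datafile, meta_df):
        widths = meta_df['width'].astype(int).tolist()
        names = meta_df['field'].tolist()
        return pd.read_fwf(datafile, widths=widths, names=names, header=None)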

A more verbose set of instructions is available in a series of posts on the author's blog at http://bielism.blogspot.com/2013/12/hcup-and-python-pt-i-background.html.


Example Usage

Load a datafile/loadfile combination.

    import pyhcup

    # specify where your data and loadfiles live
    datafile = 'D:\\Users\\hcup\\sid\\NY_SID_2009_CORE.asc'
    loadfile = 'D:\\Users\\hcup\\sid\\sasload\\NY_SID_2009_CORE.sas'

    # pull basic meta from SAS loadfile
    meta_df = pyhcup.meta_from_sas(loadfile)

    # use meta knowledge to parse datafile into a pandas DataFrame
    df = pyhcup.read(datafile, meta_df)

    # that's it. use df from here.

Deal with very large files that cannot be held in memory in two ways.

  1. To import a subset of rows, such as for preliminary work or troubleshooting, specify nrows to read and/or skiprows to skip when calling read().
    # optionally specify nrows and/or skiprows to handle larger files
    df = pyhcup.read(datafile, meta_df, nrows=500000, skiprows=1000000)
  2. To iterate through chunks of rows, such as for importing into a database, pass a chunksize to the read() function above to create a generator yielding manageable-sized chunks.
    chunk_size = 500000
    reader = pyhcup.read(datafile, meta_df, chunksize=chunk_size)
    for df in reader:
        # do your business
        # such as replacing sentinel values (below)
        # or inserting into a database with another Python library

Whether you are pulling in all records or just a chunk of records, you can also replace all those pesky missing/invalid data placeholders from HCUP (this is less useful for generically parsing missing values for non-HCUP files).

    # fyi, this bulldozes through all values in all columns with no per-column control
    replaced = pyhcup.replace_sentinels(df)
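If you do need per-column control, plain pandas works as an alternative. A minimal sketch, assuming the column name and sentinel codes below (both are illustrative; check the HCUP documentation for the codes your file actually uses):

    # per-column alternative to replace_sentinels (sketch)
    # the sentinel codes and the 'AGE' column are illustrative --
    # check the HCUP docs for the codes used in your file
    import numpy as np

    sentinels = [-9, -8, -7, -6, -5]
    df['AGE'] = df['AGE'].replace(sentinels, np.nan)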

Sample program

The text above gives two ways to load large data files (the raw files are usually so large that loading one into pandas in a single pass will fail): one iterates through chunks, the other jumps straight to specific rows to work on a subset. Below is a sample script that converts an ASC file from the HCUP dataset into a standard CSV.

    #### save NY_SASD_2016_CORE.asc

    filename = "NY_SASD_2016_CORE.asc"

    data_path = filename
    load_path = 'NY_SASD_2016_CORE.sas'

    # build a pandas DataFrame object from meta data
    meta_df = pyhcup.sas.meta_from_sas(load_path)

    chunk_size = 500000
    reader = pyhcup.read(data_path, meta_df, chunksize=chunk_size)

    index = 1
    for df in reader:
        if index == 1:
            # first pass: drop the first two rows and create the file with a header
            index = index + 1
            df[2:].to_csv('NY_SASD_2016_CORE.csv', index=None)
        else:
            # later passes: append without the header
            index = index + 1
            df.to_csv('NY_SASD_2016_CORE.csv', mode='a', header=False, index=None)
        print(index)

Here are two wrapper functions that export the ASC file for a given state and year to CSV:

    ##################### batch write ####################################
    def write_hcupAsc_to_csv(file_name_for_status_And_Year):
        filename = file_name_for_status_And_Year + ".asc"
        load_path = file_name_for_status_And_Year + ".sas"
        save_name = file_name_for_status_And_Year + ".csv"

        meta_df = pyhcup.sas.meta_from_sas(load_path)

        chunk_size = 500000
        reader = pyhcup.read(filename, meta_df, chunksize=chunk_size)

        index = 1
        for df in reader:
            if index == 1:
                # first pass: drop the first two rows and create the file with a header
                index = index + 1
                df[2:].to_csv(save_name, index=None)
                print(type(df['KEY'].dtype))
            else:
                # later passes: append without the header
                index = index + 1
                df.to_csv(save_name, mode='a', header=False, index=None)
            print(index)

    ############ test write: starting from the second row, write nrows rows ############
    def write_Test_hcupAsc_to_csv(file_name_for_status_And_Year, save_name, nrows):
        filename = file_name_for_status_And_Year + ".asc"
        load_path = file_name_for_status_And_Year + ".sas"
        save_name = save_name + ".csv"

        meta_df = pyhcup.sas.meta_from_sas(load_path)

        df = pyhcup.read(filename, meta_df, nrows=nrows, skiprows=2)

        df.to_csv(save_name, index=None)
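With these in place, converting a file becomes a one-liner. For example (the .asc and .sas files are assumed to sit in the working directory):

    # full conversion: reads NY_SASD_2016_CORE.asc / .sas, writes NY_SASD_2016_CORE.csv
    write_hcupAsc_to_csv('NY_SASD_2016_CORE')

    # quick test: write only the first 1000 rows (after skipping two) to a sample file
    write_Test_hcupAsc_to_csv('NY_SASD_2016_CORE', 'NY_SASD_2016_CORE_sample', 1000)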

There is one more way to read the file: instead of the usual chunksize, we compute the position each read should start from.

    # second approach, without chunksize

    filename = "NY_SID_2016_CORE.asc"

    load_path = 'NY_SID_2016_CORE.sas'

    save_name = 'NY_SID_2016_CORE.csv'

    # build a pandas DataFrame object from meta data
    meta_df = pyhcup.sas.meta_from_sas(load_path)

    # count the lines in the file
    length = len(["" for line in open(filename, "r")])
    print(length)

    chunk_size = 500000

    step = int(length / chunk_size)

    # first chunk: skip the first two rows and create the file with a header
    df = pyhcup.read(filename, meta_df, nrows=chunk_size, skiprows=2)
    df.to_csv(save_name, index=None)

    # remaining chunks: append without the header
    # (note: any rows beyond step * chunk_size are not written)
    for i in range(1, step):
        df = pyhcup.read(filename, meta_df, nrows=chunk_size, skiprows=2 + i * chunk_size)
        df.to_csv(save_name, mode='a', header=False, index=None)
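Note that this variant re-scans the file from the top on every call, because skiprows still has to read past all of the skipped lines, so the chunksize generator above only makes one pass and is usually the faster choice. The skiprows variant is mainly useful when you want direct access to a particular row range.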

Shortcut to loadfiles (meta data)

The SAS loading program files provided by HCUP for the State Inpatient Database (SID), State Ambulatory Surgery Database (SASD), and State Emergency Department Database (SEDD) are bundled in this package for easy access. You can retrieve the meta data for these directly, without having to specify a loadfile path as described above.

Acquire meta in this way using the get_meta() function. You must pass a state abbreviation as the first argument and a year as the second argument, like so.

    meta_df = pyhcup.get_meta('NY', 2009)

By default, get_meta() acquires SID CORE data. Other meta can be acquired with the optional keyword arguments datafile (‘SID’, ‘SEDD’, or ‘SASD’) and category (‘CORE’, ‘CHGS’, ‘SEVERITY’, ‘DX_PR_GRPS’, or ‘AHAL’).

    # California emergency department charges meta for 2010
    ca_2010_emergency_charges_meta = pyhcup.get_meta('CA', 2010, datafile='SEDD', category='CHGS')

    # Arizona outpatient surgery DRG records meta for 2004
    az_2004_surg_groups_meta = pyhcup.get_meta('AZ', 2004, datafile='SASD', category='DX_PR_GRPS')  # etc.
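The meta acquired this way plugs straight into read(). A short example combining the two (the datafile path is illustrative):

    # use the bundled meta to parse a local datafile (path is illustrative)
    meta_df = pyhcup.get_meta('NY', 2009)
    df = pyhcup.read('D:\\Users\\hcup\\sid\\NY_SID_2009_CORE.asc', meta_df)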

References

http://bielism.blogspot.com/2013/12/hcup-and-python-pt-5-nulls-and-pre.html

