Introduction to sklearn
scikit-learn is a simple and efficient tool for data mining and data analysis.
It is built on NumPy, SciPy, and matplotlib.
By functionality, its main parts are:
Classification
Regression
Clustering
Dimensionality reduction
Model selection
The most frequently used modules are clustering, classification (svm, tree, linear models, etc.), decomposition, preprocessing, and metrics.
cluster
Reading the sklearn.cluster API, you will find two kinds of entries: classes implementing each clustering method, such as cluster.KMeans, and functions that can be called directly, for example:
sklearn.cluster.k_means(X, n_clusters, init='k-means++', precompute_distances='auto', n_init=10, max_iter=300, verbose=False, tol=0.0001, random_state=None, copy_x=True, n_jobs=1, algorithm='auto', return_n_iter=False)
So in practice there are correspondingly two ways to use each method.
sklearn.cluster provides nine clustering methods:
AffinityPropagation: affinity propagation
AgglomerativeClustering: agglomerative (hierarchical) clustering
Birch
DBSCAN
FeatureAgglomeration: feature agglomeration
KMeans: k-means clustering
MiniBatchKMeans
MeanShift
SpectralClustering: spectral clustering
Let's take the most familiar one, KMeans, as an example.
First, using the class constructor to build a KMeans clusterer. The KMeans constructor in the API is:
sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')
Meaning of the parameters:
n_clusters: the number of clusters, i.e. how many groups you want to partition the data into
init: method for choosing the initial cluster centers
n_init: number of times the algorithm is run with different initial centers; the best run is kept
max_iter: maximum number of iterations (the k-means algorithm is iterative)
tol: tolerance, i.e. the convergence condition on the k-means criterion
precompute_distances: whether to precompute pairwise distances
verbose: verbosity mode, i.e. how much progress output is printed (usually left at the default)
random_state: seed controlling the random generation of the initial cluster centers
copy_x: flag for whether the data may be modified; if True, a copy is made and the original data is left untouched
n_jobs: parallelism setting (number of jobs to run in parallel)
algorithm: which k-means implementation to use, one of 'auto', 'full', or 'elkan'; 'full' is the classical EM-style implementation
Here is a simple example:
import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 3)  # generate random data: 100 samples, 3 features

# suppose we want to build a clusterer with 3 clusters
estimator = KMeans(n_clusters=3)        # construct the clusterer
estimator.fit(data)                     # run the clustering
label_pred = estimator.labels_          # get the cluster labels
centroids = estimator.cluster_centers_  # get the cluster centers
inertia = estimator.inertia_            # get the final value of the clustering criterion
Using the k_means function directly:
import numpy as np
from sklearn import cluster

data = np.random.rand(100, 3)  # generate random data: 100 samples, 3 features
k = 3                          # suppose we want 3 clusters
centroid, label, inertia = cluster.k_means(data, k)
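The other classes in the list above follow the same construct/fit pattern. As a further illustration, here is a minimal DBSCAN sketch on the same random data; the eps and min_samples values are illustrative assumptions, not tuned settings:

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=5)  # illustrative parameter values
db.fit(data)
db_labels = db.labels_               # cluster labels; -1 marks noise points

Unlike KMeans, DBSCAN infers the number of clusters from density, so no n_clusters argument is passed.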
classification
Commonly used classification methods include:
KNN nearest neighbors: sklearn.neighbors
Logistic regression: sklearn.linear_model.LogisticRegression
SVM support vector machines: sklearn.svm
Naive Bayes: sklearn.naive_bayes
Decision trees: sklearn.tree
Neural networks: sklearn.neural_network
Let's take KNN (specifically, nearest neighbors classification) as an example to see how these methods are used:
import numpy as np
from sklearn import neighbors, datasets

# import some data to play with
iris = datasets.load_iris()
n_neighbors = 15

X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
y = iris.target

weights = 'distance'  # can also be set to 'uniform'

clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
clf.fit(X, y)

# if you have test data, just predict with the following functions;
# for example, xx, yy is constructed test data
h = 0.02  # step size of the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])  # Z is the label_pred
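If labeled test data is available, accuracy can be checked with the score method; a minimal sketch, reusing the training data here only as a stand-in for a real test set:

print(clf.score(X, y))  # mean accuracy; with real test data, clf.score(X_test, y_test)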
Another example, SVM:
from sklearn import svm

X = [[0, 0], [1, 1]]
y = [0, 1]

# build a support vector classifier
clf = svm.SVC()
# fit the training data to estimate the model parameters
clf.fit(X, y)
# predict the test points [2., 2.] and [3., 3.]
res = clf.predict([[2., 2.], [3., 3.]])
# print the predicted values
print(res)

# get support vectors
print("support vectors:", clf.support_vectors_)
# get indices of support vectors
print("indices of support vectors:", clf.support_)
# get number of support vectors for each class
print("number of support vectors for each class:", clf.n_support_)
Of course, SVM also has a corresponding regression model, SVR:
from sklearn import svm

X = [[0, 0], [2, 2]]
y = [0.5, 2.5]
clf = svm.SVR()
clf.fit(X, y)
res = clf.predict([[1, 1]])
print(res)
Logistic regression:
from sklearn import linear_model

X = [[0, 0], [1, 1]]
y = [0, 1]
logreg = linear_model.LogisticRegression(C=1e5)
# we create an instance of the logistic regression classifier and fit the data
logreg.fit(X, y)
res = logreg.predict([[2, 2]])
print(res)
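The remaining classifiers in the list above expose the same fit/predict interface. As one more illustration, a minimal decision tree sketch on the same toy data:

from sklearn import tree

X = [[0, 0], [1, 1]]
y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf.fit(X, y)
print(clf.predict([[2., 2.]]))  # predict the class of a new point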
preprocessing
What I typically use from this module is scaling. There are many scaler types, including:
StandardScaler
MaxAbsScaler
MinMaxScaler
RobustScaler
Normalizer
among other preprocessing operations.
Each scaler also has a corresponding function that can be called directly: scale(), maxabs_scale(), minmax_scale(), robust_scale(), normalize().
import numpy as np
from sklearn import preprocessing

X = np.random.rand(3, 4)

# using the scaler class
scaler = preprocessing.MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# using the scale function directly
X_scaled_convenient = preprocessing.minmax_scale(X)
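Note that Normalizer behaves differently from the other scalers: it rescales each sample (row) to unit norm, instead of scaling each feature (column). A minimal sketch, continuing with the X from the example above:

normalizer = preprocessing.Normalizer()  # default: unit L2 norm per sample
X_normed = normalizer.fit_transform(X)

# the equivalent function form
X_normed_fn = preprocessing.normalize(X)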
decomposition
NMF
import numpy as np
from sklearn.decomposition import NMF

X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])

model = NMF(n_components=2, init='random', random_state=0)
model.fit(X)

print(model.components_)          # the factor matrix H
print(model.reconstruction_err_)  # reconstruction error
print(model.n_iter_)              # actual number of iterations
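As with other decomposition estimators, transform returns the coefficient matrix, so the data is approximately the product of the two factors. Continuing from the fitted model above:

W = model.transform(X)  # coefficient matrix W, shape (6, 2)
H = model.components_   # factor matrix H, shape (2, 2)
# X is approximately reconstructed by W @ H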
PCA
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])

model = PCA(n_components=2)
model.fit(X)

print(model.components_)                # principal axes in feature space
print(model.n_components_)              # number of components kept
print(model.explained_variance_)        # variance explained by each component
print(model.explained_variance_ratio_)  # fraction of total variance per component
print(model.mean_)                      # per-feature empirical mean
print(model.noise_variance_)            # estimated noise variance
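fit only estimates the decomposition; to actually reduce the dimensionality, project the data with transform (or fit_transform). A minimal sketch keeping just the first principal component:

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)               # projected data, shape (6, 1)
X_restored = pca.inverse_transform(X_reduced)  # approximate reconstruction in the original space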
datasets
sklearn also ships several common datasets, such as iris, diabetes, digits, covtype, kddcup99, boston, and breast_cancer, each loadable with a method like sklearn.datasets.load_iris. The loader returns a dataset object; the data and labels are obtained as follows:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
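These arrays plug directly into any estimator. For instance, a minimal sketch splitting them into training and test sets with train_test_split, one of the model-selection utilities mentioned at the beginning:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)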