Introduction to sklearn
scikit-learn is a simple and efficient tool for data mining and data analysis.
It is built on NumPy, SciPy, and matplotlib.
By functionality, its main parts are:
Classification
Regression
Clustering
Dimensionality reduction
Model selection
The most frequently used modules are clustering, classification (svm, tree, linear models, etc.), decomposition, preprocessing, and metrics.
cluster
Reading the sklearn.cluster API, you will find two kinds of entries: classes implementing each clustering method, such as cluster.KMeans, and functions that can be called directly, for example:
sklearn.cluster.k_means(X, n_clusters, init='k-means++', precompute_distances='auto', n_init=10, max_iter=300, verbose=False, tol=0.0001, random_state=None, copy_x=True, n_jobs=1, algorithm='auto', return_n_iter=False)
So in practice there are correspondingly two ways to use each method.
sklearn.cluster provides nine clustering methods:
AffinityPropagation: affinity propagation
AgglomerativeClustering: agglomerative (hierarchical) clustering
Birch
DBSCAN
FeatureAgglomeration: feature agglomeration
KMeans: k-means clustering
MiniBatchKMeans
MeanShift
SpectralClustering: spectral clustering
Let's take the most familiar one, KMeans, as an example.
First, using the class constructor to build a KMeans clusterer. The KMeans constructor in the API is:
sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')
Meaning of the parameters:
n_clusters: the number of clusters, i.e. how many groups you want to partition the data into
init: method for choosing the initial cluster centers
n_init: number of times the algorithm is run with different initial centers; the best run is kept
max_iter: maximum number of iterations (the k-means algorithm is iterative)
tol: tolerance, i.e. the convergence condition on the k-means criterion
precompute_distances: whether to precompute pairwise distances
verbose: verbosity mode, i.e. how much progress output is printed (usually left at the default)
random_state: seed controlling the random generation of the initial cluster centers
copy_x: flag for whether the data may be modified; if True, a copy is made and the original data is left untouched
n_jobs: parallelism setting (number of jobs to run in parallel)
algorithm: which k-means implementation to use, one of 'auto', 'full', or 'elkan'; 'full' is the classical EM-style implementation
Here is a simple example:
import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 3)  # generate random data: 100 samples, 3 features

# suppose we want to build a clusterer with 3 clusters
estimator = KMeans(n_clusters=3)        # construct the clusterer
estimator.fit(data)                     # run the clustering
label_pred = estimator.labels_          # get the cluster labels
centroids = estimator.cluster_centers_  # get the cluster centers
inertia = estimator.inertia_            # get the final value of the clustering criterion
Using the k_means function directly:
import numpy as np
from sklearn import cluster

data = np.random.rand(100, 3)  # generate random data: 100 samples, 3 features
k = 3                          # suppose we want 3 clusters
centroid, label, inertia = cluster.k_means(data, k)
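The other classes in the list above follow the same construct/fit pattern. As a further illustration, here is a minimal DBSCAN sketch on the same random data; the eps and min_samples values are illustrative assumptions, not tuned settings:

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=5)  # illustrative parameter values
db.fit(data)
db_labels = db.labels_               # cluster labels; -1 marks noise points

Unlike KMeans, DBSCAN infers the number of clusters from density, so no n_clusters argument is passed.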
classification
Commonly used classification methods include:
KNN nearest neighbors: sklearn.neighbors
Logistic regression: sklearn.linear_model.LogisticRegression
SVM support vector machines: sklearn.svm
Naive Bayes: sklearn.naive_bayes
Decision trees: sklearn.tree
Neural networks: sklearn.neural_network
Let's take KNN (specifically, nearest neighbors classification) as an example to see how these methods are used:
import numpy as np
from sklearn import neighbors, datasets

# import some data to play with
iris = datasets.load_iris()
n_neighbors = 15

X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
y = iris.target

weights = 'distance'  # can also be set to 'uniform'

clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
clf.fit(X, y)

# if you have test data, just predict with the following functions;
# for example, xx, yy is constructed test data
h = 0.02  # step size of the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])  # Z is the label_pred
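If labeled test data is available, accuracy can be checked with the score method; a minimal sketch, reusing the training data here only as a stand-in for a real test set:

print(clf.score(X, y))  # mean accuracy; with real test data, clf.score(X_test, y_test)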
Another example, SVM:
from sklearn import svm

X = [[0, 0], [1, 1]]
y = [0, 1]

# build a support vector classifier
clf = svm.SVC()
# fit the training data to estimate the model parameters
clf.fit(X, y)
# predict the test points [2., 2.] and [3., 3.]
res = clf.predict([[2., 2.], [3., 3.]])
# print the predicted values
print(res)

# get support vectors
print("support vectors:", clf.support_vectors_)
# get indices of support vectors
print("indices of support vectors:", clf.support_)
# get number of support vectors for each class
print("number of support vectors for each class:", clf.n_support_)
Of course, SVM also has a corresponding regression model, SVR:
from sklearn import svm

X = [[0, 0], [2, 2]]
y = [0.5, 2.5]
clf = svm.SVR()
clf.fit(X, y)
res = clf.predict([[1, 1]])
print(res)
Logistic regression:
from sklearn import linear_model

X = [[0, 0], [1, 1]]
y = [0, 1]
logreg = linear_model.LogisticRegression(C=1e5)
# we create an instance of the logistic regression classifier and fit the data
logreg.fit(X, y)
res = logreg.predict([[2, 2]])
print(res)
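The remaining classifiers in the list above expose the same fit/predict interface. As one more illustration, a minimal decision tree sketch on the same toy data:

from sklearn import tree

X = [[0, 0], [1, 1]]
y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf.fit(X, y)
print(clf.predict([[2., 2.]]))  # predict the class of a new point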
preprocessing
What I typically use from this module is scaling. There are many scaler types, including:
StandardScaler
MaxAbsScaler
MinMaxScaler
RobustScaler
Normalizer
among other preprocessing operations.
Each scaler also has a corresponding function that can be called directly: scale(), maxabs_scale(), minmax_scale(), robust_scale(), normalize().
import numpy as np
from sklearn import preprocessing

X = np.random.rand(3, 4)

# using the scaler class
scaler = preprocessing.MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# using the scale function directly
X_scaled_convenient = preprocessing.minmax_scale(X)
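Note that Normalizer behaves differently from the other scalers: it rescales each sample (row) to unit norm, instead of scaling each feature (column). A minimal sketch, continuing with the X from the example above:

normalizer = preprocessing.Normalizer()  # default: unit L2 norm per sample
X_normed = normalizer.fit_transform(X)

# the equivalent function form
X_normed_fn = preprocessing.normalize(X)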
decomposition
NMF
import numpy as np
from sklearn.decomposition import NMF

X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])

model = NMF(n_components=2, init='random', random_state=0)
model.fit(X)

print(model.components_)          # the factor matrix H
print(model.reconstruction_err_)  # reconstruction error
print(model.n_iter_)              # actual number of iterations
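As with other decomposition estimators, transform returns the coefficient matrix, so the data is approximately the product of the two factors. Continuing from the fitted model above:

W = model.transform(X)  # coefficient matrix W, shape (6, 2)
H = model.components_   # factor matrix H, shape (2, 2)
# X is approximately reconstructed by W @ H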
PCA
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])

model = PCA(n_components=2)
model.fit(X)

print(model.components_)                # principal axes in feature space
print(model.n_components_)              # number of components kept
print(model.explained_variance_)        # variance explained by each component
print(model.explained_variance_ratio_)  # fraction of total variance per component
print(model.mean_)                      # per-feature empirical mean
print(model.noise_variance_)            # estimated noise variance
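fit only estimates the decomposition; to actually reduce the dimensionality, project the data with transform (or fit_transform). A minimal sketch keeping just the first principal component:

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)               # projected data, shape (6, 1)
X_restored = pca.inverse_transform(X_reduced)  # approximate reconstruction in the original space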
datasets
sklearn also ships several common datasets, such as iris, diabetes, digits, covtype, kddcup99, boston, and breast_cancer, each loadable with a method like sklearn.datasets.load_iris. The loader returns a dataset object; the data and labels are obtained as follows:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
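These arrays plug directly into any estimator. For instance, a minimal sketch splitting them into training and test sets with train_test_split, one of the model-selection utilities mentioned at the beginning:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)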