Data Analysis with scikit-learn
Using scrapy, I crawled Python job postings from Boss Zhipin (the crawler itself is covered in a separate post), then applied commonly used classification and regression methods to mine relationships in the data.
Reading the data

import numpy as np
import pandas as pd  # pandas is used throughout; the import was missing from the original listing
Normalizing the data

Discrete label values are converted to numbers so regression can be applied. Because this data is inherently discrete, non-integer values have no direct meaning; a fractional prediction is best read as a weight toward the nearest integer level.
A derived column mean_salary is computed from the salary field.

filename = 'Python_city.csv'
df = pd.read_csv(filename, encoding='utf-8')

# Map each discrete label to an ordered integer code.
cityname_mapping = {
    '北京 ': 1,
    '杭州 ': 2,
    '武汉 ': 3,
    '成都 ': 4,
    '长沙 ': 5}
df['cityname'] = df['cityname'].map(cityname_mapping)
experience_mapping = {
    '5-10年': 6,
    '3-5年': 5,
    '1-3年': 4,
    '1年以内': 3,
    '应届生': 2,
    '经验不限': 1}
df['experience'] = df['experience'].map(experience_mapping)
company_size_mapping = {
    '10000人以上': 6,
    '1000-9999人': 5,
    '500-999人': 4,
    '100-499人': 3,
    '20-99人': 2,
    '0-20人': 1}
df['company_size'] = df['company_size'].map(company_size_mapping)
education_mapping = {
    '博士': 5,
    '硕士': 4,
    '本科': 3,
    '大专': 2,
    '学历不限': 1}
df['education'] = df['education'].map(education_mapping)

# Split salary strings such as '10k-15k' into numeric bounds. Vectorized
# string operations replace the original row-by-row loop, whose chained
# assignments (df['low_salary'][...] = ...) trigger SettingWithCopyWarning
# and may not write back on recent pandas versions.
salary_bounds = df['salary'].str.replace('k', '').str.split('-', expand=True).astype(float)
df['low_salary'] = salary_bounds[0]
df['high_salary'] = salary_bounds[1]
df['mean_salary'] = (salary_bounds[0] + salary_bounds[1]) / 2
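The per-row parsing can also be factored into a small helper for testing in isolation; the function name `parse_salary` and the sample string '10k-15k' below are illustrative, not from the original post.

```python
def parse_salary(s):
    """Split a salary string such as '10k-15k' into (low, high, mean)."""
    low, high = (float(part.replace('k', '')) for part in s.split('-'))
    return low, high, (low + high) / 2

print(parse_salary('10k-15k'))  # (10.0, 15.0, 12.5)
```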
Handling missing values

Missing values can be dropped or filled, or handled with other methods such as Lagrange interpolation.

df = df.dropna(axis=0, how='any')  # drop rows with any missing value (the original forgot to assign the result)
# df = df.fillna(value=0)          # or fill with a constant instead
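The Lagrange interpolation mentioned above can be sketched in a few lines of pure Python; the sample points are made up (a linear trend with a gap at x=2), purely to show the idea.

```python
def lagrange_interp(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Known points around a missing value at x=2; the trend is linear (y = 10 + 2x),
# so the interpolated fill should come out to about 14.
print(lagrange_interp([0, 1, 3, 4], [10.0, 12.0, 16.0, 18.0], 2))
```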
Selecting training and test data

City, education level, and work experience are chosen as the feature dimensions for analyzing their influence on salary. Training and validation data are split 4:1.

from sklearn.model_selection import train_test_split

X_column = ['cityname', 'education', 'experience']
X = df[X_column]
y = df.mean_salary
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=17)
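As a sanity check on the 4:1 ratio, what `test_size=0.2` does can be mimicked with a seeded shuffle of the row indices. This is a simplified sketch of the idea, not scikit-learn's exact procedure.

```python
import random

def split_indices(n, test_size=0.2, seed=17):
    """Shuffle row indices with a fixed seed, then cut off the last fifth as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_size))
    return idx[n_test:], idx[:n_test]  # train, test

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```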
Regression methods

Several commonly used regression methods are trained and compared.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score  # K-fold cross-validation
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, explained_variance_score  # regression metrics
from sklearn.metrics import accuracy_score, log_loss  # classification metrics
from sklearn import linear_model
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.kernel_ridge import KernelRidge
from sklearn import svm
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.neural_network import MLPRegressor

Regressor = [
    linear_model.Ridge(alpha=.5),
    linear_model.LinearRegression(),
    GaussianProcessRegressor(random_state=42),
    KernelRidge(alpha=1.0),
    svm.SVR(),
    KNeighborsRegressor(n_neighbors=2),
    DecisionTreeRegressor(),
    MLPRegressor(),
]
log_cols = ["Regressor", "explained_variance_score", "mean_absolute_error"]
log = pd.DataFrame(columns=log_cols)
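cross_val_score is imported above but never actually called; one way it could be used is to K-fold-score a single regressor. The synthetic stand-in data below is an assumption, used only because the job DataFrame itself is not reproducible here.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)                    # stand-in features
y_demo = X_demo @ np.array([3.0, 1.0, 2.0])  # noiseless linear target

# 5-fold cross-validation; higher (closer to 0) is better for this scorer.
scores = cross_val_score(Ridge(alpha=0.5), X_demo, y_demo,
                         cv=5, scoring='neg_mean_absolute_error')
print(scores.mean())  # near 0: Ridge fits a linear target well
```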
Classification methods

Several commonly used classification methods are trained and compared.

from sklearn.model_selection import train_test_split, cross_val_score  # K-fold cross-validation
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    GaussianNB(),
    LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis()
]
log_cols = ["Classifier", "Accuracy", "Log Loss"]
log = pd.DataFrame(columns=log_cols)
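The training loop that follows is written for the regressors; for the `classifiers` list the natural checks are accuracy_score and log_loss computed from predict_proba. A sketch on synthetic data (mean_salary would first need binning into classes, which the post does not show, so make_classification stands in):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data in place of the binned salary labels.
Xc, yc = make_classification(n_samples=200, random_state=42)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, test_size=0.2, random_state=17)

clf = KNeighborsClassifier(3).fit(Xc_tr, yc_tr)
acc = accuracy_score(yc_te, clf.predict(Xc_te))
ll = log_loss(yc_te, clf.predict_proba(Xc_te))  # needs probability estimates
print("Accuracy: {:.4%}  Log Loss: {:.4f}".format(acc, ll))
```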
Train each model and print its scores on the validation set.

for clf in Regressor:  # or iterate over `classifiers` for classification
    clf.fit(X_train, y_train)
    name = clf.__class__.__name__
    print('****Results****')
    print(name)
    train_predictions = clf.predict(X_test)
    evs = explained_variance_score(y_test, train_predictions)
    print("explained_variance_score: {:.4%}".format(evs))
    mae = mean_absolute_error(y_test, train_predictions)
    print("mean_absolute_error: {}".format(mae))
    log_entry = pd.DataFrame([[name, evs * 100, mae]], columns=log_cols)
    log = pd.concat([log, log_entry], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
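Once the loop has filled `log`, sorting the frame ranks the methods. The rows below are made-up numbers purely for illustration, not results from the post.

```python
import pandas as pd

log_cols = ["Regressor", "explained_variance_score", "mean_absolute_error"]
log = pd.DataFrame([["Ridge", 61.0, 2.1],
                    ["SVR", 55.0, 2.4],
                    ["DecisionTreeRegressor", 58.0, 2.2]], columns=log_cols)

# Lower mean absolute error is better, so sort ascending and take the top row.
best = log.sort_values("mean_absolute_error").iloc[0]
print(best["Regressor"])  # Ridge: the lowest-error row in this fake table
```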
Visualizing the results

import seaborn as sns
sns.set_color_codes("muted")
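One way the muted seaborn palette could feed into a score chart, rendered off-screen; the Agg backend, the output filename, and the two example rows are assumptions, not from the post.

```python
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in scores; in the post this frame is built by the training loop.
log = pd.DataFrame([["Ridge", 61.0], ["SVR", 55.0]],
                   columns=["Regressor", "explained_variance_score"])

sns.set_color_codes("muted")
ax = sns.barplot(x="explained_variance_score", y="Regressor", data=log, color="b")
ax.set_xlabel("explained_variance_score (%)")
plt.savefig("regressor_scores.png")
```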
Results

Example output:

matplotlib chart: