Competition
Task
Predict a small molecule's clearance in the human body (the Label field in the data) from the molecule's property features.
Data
Field | Type | Description |
---|---|---|
ID | integer | Sample ID |
Molecule_max_phase | integer | Maximum development phase the molecule has reached |
Molecular weight | float | Molecular weight |
RO5_violations | integer | Number of violations of Lipinski's Rule of Five (RO5) |
AlogP | float | Compound lipophilicity (partition coefficient) calculated by ACD software (data from ChEMBL) |
Features | vector | Vectorized representation of the small molecule |
Label | enum/float | Clearance: the volume of body fluid from which the drug is removed per unit time by a unit of the body |
Note: the ID column must not be used as a feature in this competition.
Evaluation criteria
Scoring algorithm
RMSE is used; reference code:
import numpy as np
def calc_rmse(y_pred, y_true):
    return np.sqrt(((y_pred - y_true) ** 2).mean())
y_pred = np.array([1, 2, 1, 2, 3])
y_true = np.array([1, 1, 2, 2, 2])
rmse = calc_rmse(y_pred, y_true)
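# for the sample arrays above, rmse == sqrt(3/5) ≈ 0.7746
print(rmse)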
Score log
Run | Processing | Offline | Online |
---|---|---|---|
0323_1 | Raw features | 2.191179019653608 | |
0323_2 | Norm of the Features vector | 2.196203627935492 | |
0323_3 | exp of Molecular weight and AlogP | 2.1995688573340924 | |
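The 0323_3 run is not detailed in the steps below. A minimal sketch of what that transform presumably looks like, assuming np.exp is applied directly to the two raw columns after the data are loaded (the exact code is not shown in the post; exp of raw molecular weights produces extremely large values, which may be part of why the offline score got worse):

# hypothetical reconstruction of the 0323_3 transform (not in the original post)
for df in (train, test):
    df['Molecular weight'] = np.exp(df['Molecular weight'])
    df['AlogP'] = np.exp(df['AlogP'])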
Steps
0323_1
Operations performed
None; these are the raw features.
Code
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
import lightgbm as lgb
from tqdm import tqdm
import gc
import time
from numpy import nan
import category_encoders as ce
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split,cross_val_score
pd.set_option('max_columns', None)
pd.set_option('max_rows', None)
pd.set_option('float_format', lambda x: '%.6f' % x)
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import LabelEncoder
from sklearn import feature_selection
from sklearn.model_selection import KFold, StratifiedKFold
from scipy import stats
import datetime
from scipy.stats import entropy, kurtosis
import multiprocessing
from gensim.models.word2vec import LineSentence
from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from sklearn.metrics import f1_score, classification_report, mean_squared_error
seed = 2020
n_fold = 5
# load the data
train=pd.read_csv('data/train_0312.csv')
test=pd.read_csv('data/test_noLabel_0312.csv')
# parse the stringified Features vectors and expand them into separate columns
train['Features'] = train['Features'].apply(eval)
test['Features'] = test['Features'].apply(eval)
test = pd.concat([test, pd.DataFrame(test['Features'].tolist())], axis=1)
train = pd.concat([train, pd.DataFrame(train['Features'].tolist())], axis=1)
# choose the features to use
ycol = 'Label'
feature_names = list(
filter(lambda x: x not in [ycol,'ID','Features'], train.columns))
model = lgb.LGBMRegressor(num_leaves=70,
max_depth=-1,
learning_rate=0.01,
n_estimators=10000,
subsample=0.9,
colsample_bytree=0.4,
random_state=seed,
metric=None
)
oof = []
prediction = test[['ID']].copy()
prediction['Label'] = 0
df_importance_list = []
# training with K-fold cross-validation
kfold = KFold(n_splits=n_fold, shuffle=False)  # random_state has no effect when shuffle=False
for fold_id, (trn_idx, val_idx) in enumerate(kfold.split(train[feature_names])):
    X_train = train.iloc[trn_idx][feature_names]
    Y_train = train.iloc[trn_idx][ycol]
    X_val = train.iloc[val_idx][feature_names]
    Y_val = train.iloc[val_idx][ycol]
    print('\nFold_{} Training ================================\n'.format(fold_id + 1))
    # note: lightgbm >= 4.0 moves verbose/early_stopping_rounds into callbacks
    lgb_model = model.fit(X_train,
                          Y_train,
                          eval_names=['train', 'valid'],
                          eval_set=[(X_train, Y_train), (X_val, Y_val)],
                          verbose=300,
                          eval_metric='rmse',
                          early_stopping_rounds=100)
    # out-of-fold predictions for offline evaluation
    pred_val = lgb_model.predict(X_val, num_iteration=lgb_model.best_iteration_)
    df_oof = train.iloc[val_idx][['ID', ycol]].copy()
    df_oof['pred'] = pred_val
    oof.append(df_oof)
    # average the test-set predictions over the folds
    pred_test = lgb_model.predict(test[feature_names], num_iteration=lgb_model.best_iteration_)
    prediction['Label'] += pred_test / n_fold
    df_importance = pd.DataFrame({
        'column': feature_names,
        'importance': lgb_model.feature_importances_,
    })
    df_importance_list.append(df_importance)
    del lgb_model, pred_val, pred_test, X_train, Y_train, X_val, Y_val
    gc.collect()
df_importance = pd.concat(df_importance_list)
df_importance = df_importance.groupby(['column'])['importance'].agg(
'mean').sort_values(ascending=False).reset_index()
df_importance  # displays the averaged importances when run in a notebook
df_oof = pd.concat(oof)
rmse = np.sqrt(mean_squared_error(df_oof[ycol], df_oof['pred']))
print('rmse:', rmse)
df_sub = pd.read_csv('data/submit_examp_0312.csv')
sub = prediction.copy(deep=True)
sub.to_csv('submission/{}.csv'.format(rmse), index=False, encoding='utf-8')
Feature importance
 | column | importance |
---|---|---|
0 | AlogP | 4111.000000 |
1 | 3153 | 3345.200000 |
2 | 3154 | 3232.400000 |
3 | 3152 | 3205.800000 |
4 | 3163 | 2966.400000 |
5 | 3166 | 2814.800000 |
6 | 3157 | 2793.400000 |
7 | 3156 | 2685.000000 |
8 | Molecular weight | 2589.400000 |
9 | 3159 | 2526.600000 |
10 | 3161 | 2197.000000 |
11 | 3155 | 2146.400000 |
12 | 3158 | 2116.600000 |
13 | 3151 | 1794.200000 |
14 | 11 | 1408.200000 |
15 | 8 | 1264.400000 |
16 | 3167 | 1181.400000 |
17 | 3162 | 1080.200000 |
18 | 6 | 1042.400000 |
19 | 16 | 903.800000 |
20 | 3164 | 887.400000 |
Note: only the top-ranked features are shown.
0323_2
Operations performed
- Norm (magnitude) of the Features vector
Code
# forward-fill missing values
train.fillna(method='ffill', inplace=True)
test.fillna(method='ffill', inplace=True)
# 'mo': L2 norm of each Features vector
train['mo'] = train['Features'].apply(lambda v: np.linalg.norm(np.array(v)))
test['mo'] = test['Features'].apply(lambda v: np.linalg.norm(np.array(v)))
Feature importance
 | column | importance |
---|---|---|
0 | AlogP | 4235.400000 |
1 | 3153 | 3073.400000 |
2 | 3152 | 2963.800000 |
3 | 3154 | 2950.400000 |
4 | 3163 | 2847.400000 |
5 | 3166 | 2749.000000 |
6 | mo | 2670.200000 |
7 | 3157 | 2623.400000 |
8 | 3156 | 2566.800000 |
9 | 3159 | 2467.200000 |
10 | Molecular weight | 2212.200000 |
11 | 3158 | 1990.600000 |
12 | 3161 | 1979.400000 |
13 | 3155 | 1951.000000 |
14 | 3151 | 1489.800000 |
15 | 11 | 1315.200000 |
16 | 3167 | 1147.000000 |
17 | 8 | 1122.200000 |
18 | 6 | 968.400000 |
19 | 3162 | 931.400000 |
20 | 16 | 929.800000 |
Plots
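The plots themselves are not included in the post; a minimal sketch of how the averaged feature importances could be visualized with the already-imported matplotlib, assuming the df_importance frame computed above:

# hypothetical plotting snippet (not part of the original post)
top = df_importance.head(20)
plt.figure(figsize=(8, 6))
plt.barh(top['column'].astype(str)[::-1], top['importance'][::-1])
plt.xlabel('importance')
plt.title('Top 20 feature importances (mean over folds)')
plt.tight_layout()
plt.show()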