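Competition Problem

Task

Given the property features of a small molecule, predict its clearance rate in the human body (the Label field in the data).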

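Data

Field               Type        Description
ID                  Integer     Sample ID
Molecule_max_phase  Integer     Maximum phase of the molecule
Molecular weight    Float       Molecular weight
RO5_violations      Integer     Number of violations of the rule of five for new drugs (RO5)
AlogP               Float       Lipid partition coefficient of the compound, calculated with ACD software (data from ChEMBL)
Features            Vector      Vectorized representation of the small molecule
Label               Enum/Float  Volume of body fluid cleared of the drug per unit time per unit body

Note: in this competition the ID column must not be used as a feature.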

Scoring Criteria

Scoring Algorithm

Submissions are scored by RMSE. Reference code:

import numpy as np
def calc_rmse(y_pred, y_true):
    return np.sqrt(((y_pred - y_true) ** 2).mean())

y_pred = np.array([1, 2, 1, 2, 3])
y_true = np.array([1, 1, 2, 2, 2])
rmse = calc_rmse(y_pred, y_true)
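
For reference, the quantity computed by calc_rmse can be written out as below. The symbols are notation assumed for this note: n is the number of samples, and \hat{y}_i and y_i are the predicted and true values. The second line plugs in the sample arrays from the reference code.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}

% Sample arrays: the errors are (0, 1, -1, 0, 1), so the squared errors sum to 3
\mathrm{RMSE} = \sqrt{\frac{0 + 1 + 1 + 0 + 1}{5}} = \sqrt{0.6} \approx 0.7746
```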

Score Log

Run     Processing                                    Offline             Online
0323_1  Raw features                                  2.191179019653608
0323_2  Norm of the Features vector                   2.196203627935492
0323_3  exp applied to Molecular_weight and AlogP     2.1995688573340924
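
No code is recorded for run 0323_3, so the following is only a minimal sketch of what "exp applied to Molecular_weight and AlogP" could look like. The arrays are made-up stand-ins for the two columns, and has_overflow is a hypothetical helper introduced here for illustration.

```python
import numpy as np

# Toy values standing in for the "Molecular weight" and AlogP columns (made up)
molecular_weight = np.array([180.16, 342.30, 651.90])
alogp = np.array([1.31, -0.47, 4.85])

# Element-wise exponential, as described for run 0323_3
mw_exp = np.exp(molecular_weight)
alogp_exp = np.exp(alogp)

def has_overflow(values):
    """Hypothetical helper: True if exp() of any value overflows float64 to inf."""
    with np.errstate(over='ignore'):
        return bool(np.isinf(np.exp(values)).any())

print(has_overflow(molecular_weight))    # False
print(has_overflow(np.array([800.0])))   # True
```

float64 overflows for inputs above roughly 709.78, so exp of a very large molecular weight would become inf. Also, exp preserves the ordering of the values, and tree splits depend largely on that ordering, so a transform like this would not be expected to change a tree model much, which is consistent with the nearly identical scores in the table.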

Steps

0323_1

Operations performed

None; the raw features are used as-is.

Code

import gc

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

# Display settings: show all columns and rows, six decimal places
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.6f' % x)

seed = 2020
n_fold = 5

# Load the data
train = pd.read_csv('data/train_0312.csv')
test = pd.read_csv('data/test_noLabel_0312.csv')

def parse_features(s):
    # eval with an explicit namespace, so that any nan token in the string resolves to np.nan
    return eval(s, {'nan': np.nan})

# Parse each Features string into a list, then expand it into one column per element.
# Assigning the whole column avoids chained assignment and hard-coded row counts.
train['Features'] = train['Features'].apply(parse_features)
test['Features'] = test['Features'].apply(parse_features)
test = pd.concat([test, pd.DataFrame(test['Features'].tolist())], axis=1)
train = pd.concat([train, pd.DataFrame(train['Features'].tolist())], axis=1)

# Select the features to use: everything except the label, the ID and the raw Features column
ycol = 'Label'
feature_names = list(
    filter(lambda x: x not in [ycol,'ID','Features'], train.columns))
model = lgb.LGBMRegressor(num_leaves=70,
                          max_depth=-1,
                          learning_rate=0.01,
                          n_estimators=10000,
                          subsample=0.9,
                          colsample_bytree=0.4,
                          random_state=seed,
                          metric=None
                          )

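oof = []
# .copy() avoids pandas' SettingWithCopyWarning; 0.0 makes the column float
prediction = test[['ID']].copy()
prediction['Label'] = 0.0
df_importance_list = []
# Train with K-fold cross-validation.
# random_state is omitted: it has no effect when shuffle=False, and recent
# scikit-learn versions raise an error if both are given.
kfold = KFold(n_splits=n_fold, shuffle=False)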
for fold_id, (trn_idx, val_idx) in enumerate(kfold.split(train[feature_names])):
    X_train = train.iloc[trn_idx][feature_names]
    Y_train = train.iloc[trn_idx][ycol]

    X_val = train.iloc[val_idx][feature_names]
    Y_val = train.iloc[val_idx][ycol]

    print('\nFold_{} Training ================================\n'.format(fold_id+1))

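    # Early stopping and logging are passed as callbacks; the verbose= and
    # early_stopping_rounds= arguments of fit() were removed in LightGBM 4.0
    lgb_model = model.fit(X_train,
                          Y_train,
                          eval_names=['train', 'valid'],
                          eval_set=[(X_train, Y_train), (X_val, Y_val)],
                          eval_metric='rmse',
                          callbacks=[lgb.early_stopping(stopping_rounds=100),
                                     lgb.log_evaluation(period=300)]
                          )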

    pred_val = lgb_model.predict(X_val, num_iteration=lgb_model.best_iteration_)
    df_oof = train.iloc[val_idx][['ID', ycol]].copy()
    df_oof['pred'] = pred_val
    oof.append(df_oof)

    pred_test = lgb_model.predict(test[feature_names], num_iteration=lgb_model.best_iteration_)
    prediction['Label'] += pred_test / n_fold
    df_importance = pd.DataFrame({
        'column': feature_names,
        'importance': lgb_model.feature_importances_,
    })
    df_importance_list.append(df_importance)

    del lgb_model, pred_val, pred_test, X_train, Y_train, X_val, Y_val
    gc.collect()

df_importance = pd.concat(df_importance_list)
df_importance = df_importance.groupby(['column'])['importance'].agg(
    'mean').sort_values(ascending=False).reset_index()
df_importance

df_oof = pd.concat(oof)
rmse = np.sqrt(mean_squared_error(df_oof[ycol], df_oof['pred']))
print('rmse:', rmse)

df_sub = pd.read_csv('data/submit_examp_0312.csv')
sub = prediction.copy(deep=True)
sub.to_csv('submission/{}.csv'.format(rmse), index=False, encoding='utf-8')
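
The cross-validation scheme above (out-of-fold predictions for the offline score, fold-averaged predictions for the submission) can be sketched in isolation as follows. This is a simplified illustration on synthetic data, not the author's pipeline: a least-squares linear model stands in for LightGBM, and fit_linear, predict_linear and kfold_oof are hypothetical helpers.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear model with intercept (a stand-in for LightGBM)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def kfold_oof(X, y, X_test, n_fold=5):
    """Return out-of-fold predictions for X and fold-averaged predictions for X_test."""
    oof_pred = np.full(len(X), np.nan)
    test_pred = np.zeros(len(X_test))
    all_idx = np.arange(len(X))
    # Contiguous, unshuffled folds, like KFold(shuffle=False)
    for val_idx in np.array_split(all_idx, n_fold):
        trn_idx = np.setdiff1d(all_idx, val_idx)
        coef = fit_linear(X[trn_idx], y[trn_idx])
        # Each training row is predicted exactly once, by a model that never saw it
        oof_pred[val_idx] = predict_linear(coef, X[val_idx])
        # Each fold's model contributes 1/n_fold of the test prediction
        test_pred += predict_linear(coef, X_test) / n_fold
    return oof_pred, test_pred

# Synthetic data (made up for the sketch)
rng = np.random.default_rng(2020)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_test = rng.normal(size=(20, 3))

oof_pred, test_pred = kfold_oof(X, y, X_test)
rmse = np.sqrt(np.mean((oof_pred - y) ** 2))
print('offline rmse:', rmse)
```

Because every out-of-fold prediction comes from a model that did not train on that row, the RMSE computed over them serves as the offline score recorded in the score log.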

Feature Importance

column importance
0 AlogP 4111.000000
1 3153 3345.200000
2 3154 3232.400000
3 3152 3205.800000
4 3163 2966.400000
5 3166 2814.800000
6 3157 2793.400000
7 3156 2685.000000
8 Molecular weight 2589.400000
9 3159 2526.600000
10 3161 2197.000000
11 3155 2146.400000
12 3158 2116.600000
13 3151 1794.200000
14 11 1408.200000
15 8 1264.400000
16 3167 1181.400000
17 3162 1080.200000
18 6 1042.400000
19 16 903.800000
20 3164 887.400000

Note: only the top-ranked features (rows 0 to 20) are shown.
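
The table is the per-fold feature_importances_ averaged over the five folds. With LGBMRegressor's default importance_type of 'split', each value is the number of times a feature was used in a split. A minimal sketch of the averaging and ranking step, using made-up numbers and hypothetical feature names:

```python
import numpy as np

feature_names = ['AlogP', 'Molecular weight', 'f0', 'f1']   # hypothetical names

# One row of split counts per fold (made-up numbers)
fold_importances = np.array([
    [40, 25, 10, 5],
    [44, 21, 12, 3],
    [42, 23, 8, 7],
])

# Average over folds, then rank from most to least important
mean_importance = fold_importances.mean(axis=0)
order = np.argsort(-mean_importance)
ranking = [(feature_names[j], float(mean_importance[j])) for j in order]
print(ranking)
```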

0323_2

Operations performed

  1. Add the norm (magnitude) of the Features vector as a new feature.

Code

# Forward-fill missing values (fillna(method='ffill') is deprecated in recent pandas)
train = train.ffill()
test = test.ffill()

# 'mo' is the Euclidean norm (magnitude) of each Features vector.
# Computing the whole column at once avoids chained assignment and hard-coded row counts.
train['mo'] = train['Features'].apply(np.linalg.norm)
test['mo'] = test['Features'].apply(np.linalg.norm)
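
If the feature vectors are already available as a 2-D array, the same norm can be computed for all rows in a single call. A small sketch with made-up vectors (the array name features is an assumption for this example):

```python
import numpy as np

# Toy feature vectors, one row per molecule (made up)
features = np.array([
    [3.0, 4.0],
    [1.0, 0.0],
    [0.0, 0.0],
])

# axis=1 takes the Euclidean norm of each row
norms = np.linalg.norm(features, axis=1)
print(norms)   # [5. 1. 0.]
```

A vector that contains nan produces a nan norm, so missing entries inside the Features lists would carry over into the mo column.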

Feature Importance

column importance
0 AlogP 4235.400000
1 3153 3073.400000
2 3152 2963.800000
3 3154 2950.400000
4 3163 2847.400000
5 3166 2749.000000
6 mo 2670.200000
7 3157 2623.400000
8 3156 2566.800000
9 3159 2467.200000
10 Molecular weight 2212.200000
11 3158 1990.600000
12 3161 1979.400000
13 3155 1951.000000
14 3151 1489.800000
15 11 1315.200000
16 3167 1147.000000
17 8 1122.200000
18 6 968.400000
19 3162 931.400000
20 16 929.800000

Plots



Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 3.0. Please credit the source when reposting!