A Walkthrough of the IEEE-CIS Fraud Detection Challenge

Introduction

This is a brief walkthrough of the Kaggle challenge IEEE-CIS Fraud Detection. The process in this post is not meant to compete with the top solutions by performing extensive feature engineering and a greedy search for the best model and hyper-parameters. It just walks through the problem and demonstrates a relatively good solution, based on feature analysis and a few experiments that draw on other people's methods.

The goal of this challenge is to detect payment fraud using transaction and identity data. Predictions are evaluated on ROC AUC. For the reasons why this measure suits the problem better than Precision-Recall, refer to the discussion here.

Look into the data

The provided dataset is broken into two files named identity and transaction, which are joined by TransactionID (note that NOT all the transactions have corresponding identity information).
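As a minimal sketch of what this join implies, the toy frames below (made-up values, only two illustrative columns) show that a left join keeps every transaction and leaves the identity columns empty where no identity row exists:

```python
import pandas as pd

# Toy stand-ins for the two tables; all values are made up for illustration.
transaction = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "TransactionAmt": [50.0, 20.0, 117.5, 9.99],
})
identity = pd.DataFrame({
    "TransactionID": [2, 4],
    "DeviceType": ["mobile", "desktop"],
})

# A left join keeps every transaction; identity columns are NaN where absent.
merged = transaction.merge(identity, how="left", on="TransactionID", indicator=True)

# Share of transactions that have identity information.
has_identity = (merged["_merge"] == "both").mean()
print(merged)
print(f"share with identity info: {has_identity:.2f}")
```

The `indicator=True` option is only used here to count the matched rows; it is not needed for the actual merge.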

Transaction Table

  • TransactionDT: timedelta from a given reference datetime (not an actual timestamp). The unit appears to be seconds, so one day corresponds to 60 * 60 * 24 = 86400.
  • TransactionAmt: transaction payment amount in USD
  • ProductCD: product code, the product for each transaction
  • card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
  • addr: address
  • dist: distance
  • P_emaildomain and R_emaildomain: purchaser and recipient email domain
  • C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
  • D1-D15: timedelta, such as days between previous transaction, etc.
  • M1-M9: match, such as names on card and address, etc.
  • Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.
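Assuming TransactionDT really is measured in seconds, it can be turned into coarser time features. The sketch below uses made-up values, and the column names `day` and `hour` are hypothetical, not part of the dataset:

```python
import pandas as pd

SECONDS_PER_DAY = 60 * 60 * 24  # 86400

# Made-up TransactionDT values, assumed to be seconds from the reference point.
df = pd.DataFrame({"TransactionDT": [86400, 90000, 172800, 259199]})

# Whole days elapsed since the reference datetime.
df["day"] = df["TransactionDT"] // SECONDS_PER_DAY
# Hour within the day (0-23).
df["hour"] = (df["TransactionDT"] % SECONDS_PER_DAY) // 3600
print(df)
```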

Among these variables, categorical variables are:

  • ProductCD
  • card1 - card6
  • addr1, addr2
  • P_emaildomain, R_emaildomain
  • M1 - M9

Identity Table

The categorical variables in this table are:

  • DeviceType
  • DeviceInfo
  • id_12 - id_38

A more detailed explanation of the data can be found in the reply of this discussion.

Now let’s take a closer look at the data.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import numpy as np
import pandas as pd
import plotly.express as px


DATA_DIR = '/content/drive/My Drive/colab-data/fraud detect/data'

# reduce_mem_usage is defined in the data transformation code further below.
tran_train = reduce_mem_usage(pd.read_csv(f'{DATA_DIR}/train_transaction.csv'))
id_train = reduce_mem_usage(pd.read_csv(f'{DATA_DIR}/train_identity.csv'))

tran_train.info()
tran_train.head()
id_train.info()
id_train.head()
Mem. usage decreased to 542.35 Mb (69.4% reduction)
Mem. usage decreased to 25.86 Mb (42.7% reduction)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590540 entries, 0 to 590539
Columns: 394 entries, TransactionID to V339
dtypes: float16(332), float32(44), int16(1), int32(2), int8(1), object(14)
memory usage: 542.3+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144233 entries, 0 to 144232
Data columns (total 41 columns):
TransactionID    144233 non-null int32
id_01            144233 non-null float16
id_02            140872 non-null float32
id_03            66324 non-null float16
id_04            66324 non-null float16
id_05            136865 non-null float16
id_06            136865 non-null float16
id_07            5155 non-null float16
id_08            5155 non-null float16
id_09            74926 non-null float16
id_10            74926 non-null float16
id_11            140978 non-null float16
id_12            144233 non-null object
id_13            127320 non-null float16
id_14            80044 non-null float16
id_15            140985 non-null object
id_16            129340 non-null object
id_17            139369 non-null float16
id_18            45113 non-null float16
id_19            139318 non-null float16
id_20            139261 non-null float16
id_21            5159 non-null float16
id_22            5169 non-null float16
id_23            5169 non-null object
id_24            4747 non-null float16
id_25            5132 non-null float16
id_26            5163 non-null float16
id_27            5169 non-null object
id_28            140978 non-null object
id_29            140978 non-null object
id_30            77565 non-null object
id_31            140282 non-null object
id_32            77586 non-null float16
id_33            73289 non-null object
id_34            77805 non-null object
id_35            140985 non-null object
id_36            140985 non-null object
id_37            140985 non-null object
id_38            140985 non-null object
DeviceType       140810 non-null object
DeviceInfo       118666 non-null object
dtypes: float16(22), float32(1), int32(1), object(17)
memory usage: 25.9+ MB

is_fraud = tran_train[['isFraud', 'TransactionID']].groupby('isFraud').count()

is_fraud['ratio'] = is_fraud['TransactionID'] / is_fraud['TransactionID'].sum()
fig_Y = px.bar(is_fraud, x=is_fraud.index, y='TransactionID',
               text='ratio',
               labels={'TransactionID': 'Number of transactions',
                       'x': 'is fraud'})
fig_Y.update_traces(texttemplate='%{text:.6p}')

Very imbalanced target variable

Only about 3.5% of the transactions in the dataset are positives (isFraud = 1). For this classification problem it is very important to have a high true positive rate, that is, how many of all the fraud cases the model can identify. So recall is in a sense more important than precision here, and the macro average of recall would be a good side metric. Of course, in reality we need to consider the balance between the cost of a few missed frauds and the cost of handling flagged cases.
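To make these metrics concrete, here is a small sketch with made-up labels and the predictions of a hypothetical model. It only illustrates how recall, precision and macro-averaged recall are computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 2 frauds among 10 transactions.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A hypothetical model that catches one fraud and raises one false alarm.
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# Recall of the fraud class: TP / (TP + FN).
recall_fraud = recall_score(y_true, y_pred)
# Precision of the fraud class: TP / (TP + FP).
precision_fraud = precision_score(y_true, y_pred)
# Macro recall: unweighted mean of the recall of each class.
recall_macro = recall_score(y_true, y_pred, average="macro")

print(recall_fraud, precision_fraud, recall_macro)
```

Because macro recall weights both classes equally, missing frauds lowers it much more than plain accuracy would suggest.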

In addition, we need to put some effort into the sampling and train-validation split method, to ensure that the minority class samples have enough impact on the model during training. Class weights can also be set on the model to see whether they make a difference in performance.
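As a rough illustration of class weights, the sketch below computes scikit-learn's "balanced" weights for a made-up label vector with 3.5% positives. The label vector is an assumption for illustration, not the real data:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels with 3.5% positives, mimicking the imbalance described above.
y = np.array([0] * 965 + [1] * 35)

# "balanced" weights are n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))
```

Many scikit-learn classifiers, LogisticRegression among them, accept `class_weight="balanced"` directly, which applies the same weighting during training.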

Check missing values

Now let’s check whether there are missing values in the dataset. We can see from the table below that there are quite a lot of them.

It’s hard to tell how we should handle them before we look into each variable. Sometimes a missing value carries meaning in itself. It also depends on what kind of model we are going to use: with a tree model we can leave them as missing values.
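One way to keep the information that a value was missing is sketched below on a toy frame. The column `dist1_isna` and the "missing" category are hypothetical choices for illustration; the sentinel -999 mirrors the filler used later in this post:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and some gaps.
df = pd.DataFrame({
    "dist1": [3.0, np.nan, 12.0, np.nan],
    "DeviceType": ["mobile", None, "desktop", None],
})

# Record the fact that the value was missing before filling it.
df["dist1_isna"] = df["dist1"].isna().astype("int8")
# A sentinel far outside the normal range lets a tree split missing values off.
df["dist1"] = df["dist1"].fillna(-999)
# For a categorical column, treat "missing" as a category of its own.
df["DeviceType"] = df["DeviceType"].fillna("missing")
print(df)
```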

def missing_ratio_col(df):
    df_na = (df.isna().sum() / len(df)) * 100
    if isinstance(df, pd.DataFrame):
        df_na = df_na.drop(
            df_na[df_na == 0].index).sort_values(ascending=False)
        missing_data = pd.DataFrame(
            {'Missing Ratio %': df_na})
    else:
        missing_data = pd.DataFrame(
            {'Missing Ratio %': df_na}, index=[0])
            
    return missing_data

missing_ratio_col(tran_train)
missing_ratio_col(id_train)

Detailed look at each variable

There are already very good references on EDA and feature engineering for this dataset, so there is no point repeating them here. Please check the list here if you’re interested:

Data transformation pipeline

Based on the references and my own analysis, here is a pipeline of the transformations to perform on the dataset. It can be adjusted for experiments. The transformations are explained in the code comments.

import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from typing import List, Callable


DATA_DIR = '/content/drive/My Drive/colab-data/fraud detect/data'


def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

    
def load_df(test_set: bool = False, nrows: int = None, sample_ratio: float = None, reduce_mem: bool = True) -> pd.DataFrame:
    if test_set:
        tran = pd.read_csv(f'{DATA_DIR}/test_transaction.csv', nrows=nrows)
        ids = pd.read_csv(f'{DATA_DIR}/test_identity.csv', nrows=nrows)
    else:
        tran = pd.read_csv(f'{DATA_DIR}/train_transaction.csv', nrows=nrows)
        ids = pd.read_csv(f'{DATA_DIR}/train_identity.csv', nrows=nrows)

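    if sample_ratio:
        # Sample only the transactions; the left merge below picks up the
        # identity rows that belong to the sampled transactions.
        # RAND_STATE is defined in the dataset split code further below.
        size = int(len(tran) * sample_ratio)
        tran = tran.sample(n=size, random_state=RAND_STATE)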
    df = tran.merge(ids, how='left', on='TransactionID')
    if reduce_mem:
        reduce_mem_usage(df)
    return df


def cat_cols(df: pd.DataFrame) -> List[str]:
    cols: List[str] = []

    cols.append('ProductCD')

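    # Note: the substring match also picks up engineered columns whose names
    # contain 'card' (e.g. 'card1_addr1', 'card1_FE').
    cols_card = [c for c in df.columns if 'card' in c]
    cols.extend(cols_card)

    cols_addr = ['addr1', 'addr2']
    cols.extend(cols_addr)

    # Likewise, this matches engineered columns whose names contain 'email'.
    cols_emaildomain = [c for c in df if 'email' in c]
    cols.extend(cols_emaildomain)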

    cols_M = [c for c in df if c.startswith('M')]
    cols.extend(cols_M)

    cols.extend(['DeviceType', 'DeviceInfo'])

    cols_id = [c for c in df if c.startswith('id')]
    cols.extend(cols_id)

    return cols


def num_cols(df: pd.DataFrame, target_col='isFraud') -> List[str]:
    cols_cat = cat_cols(df)
    cats = df[cols_cat]
    cols_num = list(set(df.columns) - set(cols_cat))

    if target_col in cols_num:
        cols_num.remove(target_col)

    return cols_num


def missing_ratio_col(df):
    df_na = (df.isna().sum() / len(df)) * 100
    if isinstance(df, pd.DataFrame):
        df_na = df_na.drop(
            df_na[df_na == 0].index).sort_values(ascending=False)
        missing_data = pd.DataFrame({'Missing Ratio %': df_na})
    else:
        missing_data = pd.DataFrame({'Missing Ratio %': df_na}, index=[0])

    return missing_data


class NumColsNaMedianFiller(TransformerMixin, BaseEstimator):

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        cols_cat = cat_cols(df)
        cols_num = list(set(df.columns) - set(cols_cat))

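        for col in cols_num:
            median = df[col].median()
            # Assign back instead of a chained inplace fillna, which does not
            # update the DataFrame under pandas Copy-on-Write.
            df[col] = df[col].fillna(median)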

        return df


class NumColsNegFiller(TransformerMixin, BaseEstimator):

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        cols_num = num_cols(df)

        for col in cols_num:
            # Assign back instead of a chained inplace fillna.
            df[col] = df[col].fillna(-999)

        return df


class NumColsRatioDropper(TransformerMixin, BaseEstimator):
    def __init__(self, ratio: float = 0.5):
        self.ratio = ratio

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        # print(X[self.attribute_names].columns)

        cols_cat = cat_cols(df)
        cats = df[cols_cat]
        # nums = df.drop(columns=cols_cat)
        # cols_num = df[~df[cols_cat]].columns
        cols_num = list(set(df.columns) - set(cols_cat))
        nums = df[cols_num]

        ratio = self.ratio * 100
        missings = missing_ratio_col(nums)
        # print(missings)
        inds = missings[missings['Missing Ratio %'] > ratio].index
        df = df.drop(columns=inds)
        return df


class ColsDropper(TransformerMixin, BaseEstimator):
    def __init__(self, cols: List[str]):
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        # errors='ignore' skips columns already removed by an earlier step.
        return df.drop(columns=self.cols, errors='ignore')


class DataFrameSelector(TransformerMixin, BaseEstimator):
    def __init__(self, col_names):
        self.attribute_names = col_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print(X[self.attribute_names].columns)

        return X[self.attribute_names].values


class DummyEncoder(TransformerMixin, BaseEstimator):

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        cols_cat = cat_cols(df)

        cats = df[cols_cat]
        noncats = df.drop(columns=cols_cat)

        cats = cats.astype('category')
        cats_enc = pd.get_dummies(cats, prefix=cols_cat, dummy_na=True)

        return noncats.join(cats_enc)


# Label encoding is OK when we're using tree models
class MyLabelEncoder(TransformerMixin, BaseEstimator):

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        cols_cat = cat_cols(df)

        for col in cols_cat:
            df[col] = df[col].astype('category').cat.add_categories(
                'missing').fillna('missing')
            le = preprocessing.LabelEncoder()
            # TODO add test set together to encoding
            # le.fit(df[col].astype(str).values)
            df[col] = le.fit_transform(df[col].astype(str).values)
        return df


class FrequencyEncoder(TransformerMixin, BaseEstimator):
    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        for col in self.cols:
            vc = df[col].value_counts(dropna=True, normalize=True).to_dict()
            vc[-1] = -1
            nm = col + '_FE'
            df[nm] = df[col].map(vc)
            df[nm] = df[nm].astype('float32')
        return df


class CombineEncoder(TransformerMixin, BaseEstimator):
    def __init__(self, cols_pairs: List[List[str]]):
        self.cols_pairs = cols_pairs

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        for pair in self.cols_pairs:
            col1 = pair[0]
            col2 = pair[1]
            nm = col1 + '_' + col2
            df[nm] = df[col1].astype(str) + '_' + df[col2].astype(str)
            df[nm] = df[nm].astype('category')
            # print(nm, ', ', end='')
        return df


class AggregateEncoder(TransformerMixin, BaseEstimator):
    def __init__(self, main_cols: List[str], uids: List[str], aggr_types: List[str],
                 fill_na: bool = True, use_na: bool = False):
        self.main_cols = main_cols
        self.uids = uids
        self.aggr_types = aggr_types
        self.use_na = use_na
        self.fill_na = fill_na

    def fit(self, X, y=None):
        return self

    def transform(self, df):
        for col in self.main_cols:
            for uid in self.uids:
                for aggr_type in self.aggr_types:
                    col_new = f'{col}_{uid}_{aggr_type}'
                    tmp = df.groupby([uid])[col].agg([aggr_type]).reset_index().rename(
                        columns={aggr_type: col_new})
                    tmp.index = list(tmp[uid])
                    tmp = tmp[col_new].to_dict()
                    df[col_new] = df[uid].map(tmp).astype('float32')
                    if self.fill_na:
                        # Assign back instead of a chained inplace fillna.
                        df[col_new] = df[col_new].fillna(-1)
        return df
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[
    # Based on feature engineering from 
    # https://www.kaggle.com/cdeotte/xgb-fraud-with-magic-0-9600#Encoding-Functions
    ('combine_enc', CombineEncoder(
        [['card1', 'addr1'], ['card1_addr1', 'P_emaildomain']])),
    ('freq_enc', FrequencyEncoder(
        ['addr1', 'card1', 'card2', 'card3', 'P_emaildomain'])),
    ('aggr_enc', AggregateEncoder(['TransactionAmt', 'D9', 'D11'], [
        'card1', 'card1_addr1', 'card1_addr1_P_emaildomain'], ['mean', 'std'])),

    # Drop columns that have certain high ratio of missing values, and then fill
    # in values e.g. median value. May not be used if using a tree model.
    ('reduce_missing', NumColsRatioDropper(0.5)),
    ('fillna_median', NumColsNaMedianFiller()),

    # Drop some columns that will not be used
    ('drop_cols_basic', ColsDropper(['TransactionID', 'TransactionDT', 'D6', 
                                     'D7', 'D8', 'D9', 'D12', 'D13', 'D14', 'C3',
                                     'M5', 'id_08', 'id_33', 'card4', 'id_07', 
                                     'id_14', 'id_21', 'id_30', 'id_32', 'id_34'])),

    # Drop some columns based on feature importance got from a model.
    ('drop_cols_feat_importance', ColsDropper(
        ['V107', 'V117', 'V119', 'V120', 'V27', 'V28', 'V305'])),

    ('fillna_negative', NumColsNegFiller()),

    # Encode categorical variables. Depending on the kind of model we use, 
    # we can choose between label encoding and onehot encoding.
    # ('onehot_enc', DummyEncoder()),
    ('label_enc', MyLabelEncoder()),
])
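To show what the frequency and aggregate encoders in the pipeline produce, here is a simplified pandas-only sketch on a toy frame with made-up values. It mirrors the idea of FrequencyEncoder and AggregateEncoder but is not a drop-in replacement for them:

```python
import pandas as pd

# Toy data: three transactions on one card, one on another.
df = pd.DataFrame({
    "card1": [1001, 1001, 1001, 2002],
    "TransactionAmt": [10.0, 30.0, 20.0, 100.0],
})

# Frequency encoding: replace each value by its relative frequency.
freq = df["card1"].value_counts(normalize=True)
df["card1_FE"] = df["card1"].map(freq)

# Aggregate encoding: per-card mean of the amount, broadcast back to each row.
df["TransactionAmt_card1_mean"] = df.groupby("card1")["TransactionAmt"].transform("mean")
print(df)
```

The new column names follow the `{col}_FE` and `{col}_{uid}_{aggr_type}` patterns used by the classes above.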

Split dataset

As we want to predict future payment fraud based on past data, we should not shuffle the dataset when splitting it into training and validation sets, but use a time-based split instead.
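A time-based split can be sketched as follows on a toy frame with made-up values. The explicit sort is an assumption for data that is not already ordered by TransactionDT:

```python
import pandas as pd

# Toy data in arbitrary order; values are made up.
df = pd.DataFrame({
    "TransactionDT": [500, 100, 400, 200, 300, 800, 700, 600],
    "isFraud": [0, 0, 1, 0, 0, 1, 0, 0],
})

# Sort by time first so that later rows really are later transactions.
df = df.sort_values("TransactionDT").reset_index(drop=True)

# Hold out the most recent 25% as the validation set.
cut = int(len(df) * 0.75)
train, val = df.iloc[:cut], df.iloc[cut:]
print(len(train), len(val))
```

Every validation row is then later in time than every training row, which is closer to how the model would be used in practice.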

As this is an imbalanced dataset in which the positive class of the target variable makes up only about 3.5%, we may want to try sampling methods such as random over-sampling or SMOTE on the training set.

RAND_STATE = 20200119

def data_split_v1(X: pd.DataFrame, y: pd.Series):
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, shuffle=False, random_state=RAND_STATE)

    return X_train, X_val, y_train, y_val


def data_split_oversample_v1(X: pd.DataFrame, y: pd.Series):
    from imblearn.over_sampling import RandomOverSampler

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, shuffle=False, random_state=RAND_STATE)

    ros = RandomOverSampler(random_state=RAND_STATE)
    X_train, y_train = ros.fit_resample(X_train, y_train)

    return X_train, X_val, y_train, y_val


def data_split_smoteenn_v1(X: pd.DataFrame, y: pd.Series):
    from imblearn.combine import SMOTEENN

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, shuffle=False, random_state=RAND_STATE)

    ros = SMOTEENN(random_state=RAND_STATE)
    X_train, y_train = ros.fit_resample(X_train, y_train)

    return X_train, X_val, y_train, y_val
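For intuition, random over-sampling can be sketched in plain pandas as below. This is a simplified illustration on made-up data, not the imblearn implementation. Only the training portion is resampled, so the validation set keeps the original class ratio:

```python
import pandas as pd

# Toy training set: 8 legitimate transactions and 2 frauds (made-up values).
train = pd.DataFrame({"x": range(10), "isFraud": [0] * 8 + [1] * 2})

minority = train[train["isFraud"] == 1]
majority = train[train["isFraud"] == 0]

# Draw minority rows with replacement until both classes have the same size.
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)

# Combine and shuffle the balanced training set.
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=0)
counts = balanced["isFraud"].value_counts()
print(counts)
```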

Experiments

Now let’s start experimenting with simple models like Logistic Regression and complex models like Gradient Boosting.

Below is a scaffold for performing experiments.

import os
from datetime import datetime
import json
import pprint

from sklearn import metrics
from sklearn.pipeline import Pipeline
from typing import List, Callable
        
EXP_DIR = 'exp'

class Experiment:
    def __init__(self, df_nrows: int = None, transform_pipe: Pipeline = None,
                 data_split: Callable = None, model=None, model_class=None,
                 model_param: dict = None):
        self.df_nrows = df_nrows
        self.pipe = transform_pipe

        if data_split is None:
            self.data_split = data_split_v1
        else:
            self.data_split = data_split

        if model_class:
            self.model = model_class(**model_param)
        else:
            self.model = model

        self.model_param = model_param

    def transform(self, X):
        return self.pipe.fit_transform(X)

    def run(self, df_train: pd.DataFrame, save_exp: bool = True) -> float:
        # self.df = load_df(nrows=self.df_nrows)

        y = df_train['isFraud']
        X = df_train.drop(columns=['isFraud'])

        X = self.transform(X)
        # Keep the feature names so that save_result can report importances.
        self.feature_names = list(X.columns)

        X_train, X_val, Y_train, Y_val = self.data_split(X, y)

        # del X
        # gc.collect()

        self.model.fit(X_train, Y_train)

        # Note: scikit-learn classifiers return hard class labels from
        # predict(), while LgbmWrapper.predict() returns probabilities.
        # ROC AUC is more informative when computed from scores,
        # e.g. predict_proba(X_val)[:, 1].
        Y_pred = self.model.predict(X_val)
        self.last_roc_auc = metrics.roc_auc_score(Y_val, Y_pred)

        if save_exp:
            self.save_result()

        return self.last_roc_auc
    
    def save_result(self, feature_importance: bool = False):
        save_time = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
        result = {}
        result['roc_auc'] = self.last_roc_auc
        result['transform'] = list(self.pipe.named_steps.keys())
        result['model'] = self.model.__class__.__name__
        result['model_param'] = self.model_param
        result['data_split'] = self.data_split.__name__
        result['num_sample_rows'] = self.df_nrows
        result['save_time'] = save_time
        if feature_importance:
            # Use the feature names stored in run(); self.X and self.df
            # are never set on this class.
            if hasattr(self.model, 'feature_importances_'):
                result['feature_importances_'] = dict(
                    zip(self.feature_names, self.model.feature_importances_.tolist()))

        pprint.pprint(result, indent=4)

        if not os.path.exists(EXP_DIR):
            os.makedirs(EXP_DIR)
        with open(f'{EXP_DIR}/exp_{save_time}_{self.last_roc_auc:.4f}.json', 'w') as f:
            json.dump(result, f, indent=4)

import gc


del tran_train, id_train
gc.collect()

df_train = load_df()
Mem. usage decreased to 650.48 Mb (66.8% reduction)

Logistic Regression as baseline

def exp1():
    from sklearn.linear_model import LogisticRegression

    pipe = Pipeline(steps=[
        ('combine_enc', CombineEncoder(
            [['card1', 'addr1'], ['card1_addr1', 'P_emaildomain']])),
        ('freq_enc', FrequencyEncoder(
            ['addr1', 'card1', 'card2', 'card3', 'P_emaildomain'])),
        ('aggr_enc', AggregateEncoder(['TransactionAmt', 'D9', 'D11'], [
         'card1', 'card1_addr1', 'card1_addr1_P_emaildomain'], ['mean', 'std'])),

        ('reduce_missing', NumColsRatioDropper(0.3)),
        ('fillna_median', NumColsNaMedianFiller()),

        ('drop_cols_basic', ColsDropper(['TransactionID', 'TransactionDT', 'C3', 'M5', 'id_08', 'id_33', 'card4', 'id_07', 'id_14', 'id_21', 'id_30', 'id_32', 'id_34'])),

        # Though onehot encoding is more appropriate for logistic regression, we
        # don't have enough memory to encode that many variables. So we take a 
        # step back using label encoding.
        # ('onehot_enc', DummyEncoder()),
        ('label_enc', MyLabelEncoder()),
    ])

    exp = Experiment(transform_pipe=pipe,
                      data_split=data_split_v1,
                      model_class=LogisticRegression,
                      # just use the default hyper paramenters
                      model_param={},
                     )
    exp.run(df_train=df_train)

exp1()
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



{   'data_split': 'data_split_v1',
    'model': 'LogisticRegression',
    'model_param': {},
    'num_sample_rows': None,
    'roc_auc': 0.4956463187232307,
    'save_time': '2020-03-26_20-27-08',
    'transform': [   'combine_enc',
                     'freq_enc',
                     'aggr_enc',
                     'reduce_missing',
                     'fillna_median',
                     'drop_cols_basic',
                     'label_enc']}
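The convergence warning above suggests scaling the data. In addition, ROC AUC is generally more informative when it is computed from predicted probabilities rather than from hard class labels. The sketch below illustrates both points on synthetic data (not the competition data); `make_classification`, the scaler and all parameter values are assumptions chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data, roughly 5% positives.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Standardize features before the logistic regression to help convergence.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

labels = model.predict(X_val)              # hard 0/1 labels
scores = model.predict_proba(X_val)[:, 1]  # probability of the positive class

auc_labels = roc_auc_score(y_val, labels)
auc_scores = roc_auc_score(y_val, scores)
print(f"AUC from labels: {auc_labels:.4f}, AUC from probabilities: {auc_scores:.4f}")
```

With hard labels the ROC curve collapses to a single operating point, so on imbalanced data the score can sit near 0.5 even when the ranking of the probabilities is useful.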

Gradient Boosting with LightGBM

Now let’s try a Gradient Boosting tree model using the LightGBM implementation, and tune the hyper-parameters a little to make it a more complex model.

import lightgbm as lgb


class LgbmWrapper:
    def __init__(self, **param):
        self.param = param
        self.trained = None

    def fit(self, X_train, y_train):
        train = lgb.Dataset(X_train, label=y_train)
        self.trained = lgb.train(self.param, train)
        self.feature_importances_ = self.trained.feature_importance()
        return self.trained

    def predict(self, X_val):
        return self.trained.predict(X_val, num_iteration=self.trained.best_iteration)


def exp2():
    pipe = Pipeline(steps=[
        # Based on feature engineering from 
        # https://www.kaggle.com/cdeotte/xgb-fraud-with-magic-0-9600#Encoding-Functions
        ('combine_enc', CombineEncoder(
            [['card1', 'addr1'], ['card1_addr1', 'P_emaildomain']])),
        ('freq_enc', FrequencyEncoder(
            ['addr1', 'card1', 'card2', 'card3', 'P_emaildomain'])),
        ('aggr_enc', AggregateEncoder(['TransactionAmt', 'D9', 'D11'], [
            'card1', 'card1_addr1', 'card1_addr1_P_emaildomain'], ['mean', 'std'])),
    
        # Drop some columns that will not be used
        ('drop_cols_basic', ColsDropper(['TransactionID', 'TransactionDT', 'D6', 
                                        'D7', 'D8', 'D9', 'D12', 'D13', 'D14', 'C3',
                                        'M5', 'id_08', 'id_33', 'card4', 'id_07', 
                                        'id_14', 'id_21', 'id_30', 'id_32', 'id_34'])),
    
        # Drop some columns based on feature importance got from a model.
        # ('drop_cols_feat_importance', ColsDropper(
        #     ['V107', 'V117', 'V119', 'V120', 'V27', 'V28', 'V305'])),
    
        ('fillna_negative', NumColsNegFiller()),
    
        # Label encoding used for tree models.
        # ('onehot_enc', DummyEncoder()),
        ('label_enc', MyLabelEncoder()),
    ])

    param_lgbm = {'objective': 'binary',
                  'boosting_type': 'gbdt',
                  'metric': 'auc',
                  'learning_rate': 0.01,
                  'num_leaves': 2**8,
                  'max_depth': -1,
                  'tree_learner': 'serial',
                  'colsample_bytree': 0.7,
                  'subsample_freq': 1,
                  'subsample': 0.7,
                  'n_estimators': 10000,
                  #  'n_estimators': 80000,
                  'max_bin': 255,
                  'n_jobs': -1,
                  'verbose': -1,
                  'seed': RAND_STATE,
                  # 'early_stopping_rounds': 100,
                  }


    exp = Experiment(transform_pipe=pipe,
                    data_split=data_split_v1,
                     model_class=LgbmWrapper,
                     model_param=param_lgbm,
                     )
    exp.run(df_train=df_train)


exp2()
/usr/local/lib/python3.6/dist-packages/lightgbm/engine.py:118: UserWarning:

Found `n_estimators` in params. Will use it instead of argument



{   'data_split': 'data_split_v1',
    'model': 'LgbmWrapper',
    'model_param': {   'boosting_type': 'gbdt',
                       'colsample_bytree': 0.7,
                       'learning_rate': 0.01,
                       'max_bin': 255,
                       'max_depth': -1,
                       'metric': 'auc',
                       'n_estimators': 10000,
                       'n_jobs': -1,
                       'num_leaves': 256,
                       'objective': 'binary',
                       'seed': 20200119,
                       'subsample': 0.7,
                       'subsample_freq': 1,
                       'tree_learner': 'serial',
                       'verbose': -1},
    'num_sample_rows': None,
    'roc_auc': 0.919589853747652,
    'save_time': '2020-03-27_09-55-43',
    'transform': [   'combine_enc',
                     'freq_enc',
                     'aggr_enc',
                     'drop_cols_basic',
                     'fillna_negative',
                     'label_enc']}

So we got a local validation ROC AUC of about 0.9196, which looks like an OK score.

This model’s predictions on the test dataset scored 0.9398 on the public leaderboard and 0.9058 on the private leaderboard. These scores are still some way off the top scores, but good enough given that there are potentially many ways to improve. For example, we could perform more transformations and feature engineering, try other model implementations such as CatBoost and XGBoost, and search for better hyper-parameters. All of this assumes you have plenty of computing resources and time.

PENG, Cong
Ph.D. Student
