Exploratory Data Analysis #18

Inazuma110 · 2021-04-24T07:26:45Z

I'd like to understand the data a bit better.
For that purpose, a library called pandas-profiling turned out to be quite handy.

I ran pandas-profiling on the preprocessed data plus the target variable.
Please open the HTML files under https://github.com/Inazuma110/signate_studentcup2021/tree/inazuma110/data/html in a browser. Each feature is visualized in the report.
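
For reference, a report like the ones linked above could be produced roughly as follows. This is only a minimal sketch, not the script actually used in this repository: the helper names `build_profiling_frame` and `write_profile_report`, the column name `target`, and the output path are placeholders, and it assumes the `pandas-profiling` package is installed.

```python
import pandas as pd


def build_profiling_frame(features: pd.DataFrame, target: pd.Series,
                          target_name: str = "target") -> pd.DataFrame:
    """Join the preprocessed features and the target variable column-wise."""
    frame = features.copy()
    # Assign by position so that differing indexes do not introduce NaN.
    frame[target_name] = target.to_numpy()
    return frame


def write_profile_report(frame: pd.DataFrame, path: str = "report.html") -> None:
    """Write a pandas-profiling HTML report for `frame` to `path`."""
    # Imported lazily so the helper above works without pandas-profiling.
    from pandas_profiling import ProfileReport

    ProfileReport(frame, title="EDA report").to_file(path)
```

Calling `write_profile_report(build_profiling_frame(X, y))` with your own `X` and `y` would then write a single HTML file that can be opened in a browser.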

If you notice anything or have any thoughts on it, please leave a comment.

Inazuma110 · 2021-04-24T07:31:35Z

The tempo max and min features are very highly correlated, so I think it is fine to drop one of them.
In its place, it seems better to add #12.
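
To make "very highly correlated" checkable outside the HTML report, something like the following could list the offending pairs. It is a sketch under assumptions: the function name `highly_correlated_pairs` and the 0.9 threshold are arbitrary choices, and it uses the Pearson correlation that `DataFrame.corr` computes by default.

```python
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:  # NaN correlations never pass this check
                pairs.append((cols[i], cols[j], float(r)))
    return pairs
```

Each returned pair is a candidate for dropping one column or merging the two into a single feature.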

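Incidentally, a high-dimensional feature set apparently takes longer to compute and can also hurt generalization, so trimming features may help on both counts.
Reference: https://qiita.com/shimopino/items/5fee7504c7acf044a521
(↑ I'd also like to work on feature selection and the other techniques covered there.)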
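
As one very simple example of a filter-style feature selection step, constant or near-constant columns can be dropped by variance. This is a sketch only, not something already done in this issue: the function name `drop_low_variance` and the default threshold are assumptions, and because variance depends on scale, a non-zero threshold only makes sense for features on comparable scales.

```python
import pandas as pd


def drop_low_variance(df: pd.DataFrame, threshold: float = 0.0) -> pd.DataFrame:
    """Drop numeric columns whose sample variance is <= threshold."""
    variances = df.select_dtypes(include="number").var()
    to_drop = variances[variances <= threshold].index
    # Non-numeric columns are left untouched.
    return df.drop(columns=to_drop)
```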

Inazuma110 · 2021-04-26T05:34:04Z

Instead of dropping one of them, I replaced the two with a single feature holding their mean.

def tempo2mean(df):
    # Split each 'tempo' string on '-' into its min and max values
    bounds = df['tempo'].str.split('-', expand=True).astype(float)
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # Replace the range string with the mean of min and max
    df['tempo'] = (bounds[0] + bounds[1]) / 2
    return df

Inazuma110 added the Observation (notes on training results, etc.) label Apr 24, 2021

Inazuma110 mentioned this issue Apr 26, 2021

Prediction with kNN #16

Open
