Classification Algorithms: A Recap of the Steps Before Applying an Algorithm

Hello! EM here ^^
Starting today, I'd like to walk through a series on classification algorithms!
Specifically:
Logistic Regression
Support Vector Machine
Decision Trees
Ensemble methods:
Random Forests,
Gradient Boosting
I plan to cover the topics above in upcoming posts.
Let's get started right away!
First, as in any workflow, we import everything we need.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn as sk
pd.options.display.max_columns=500
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn import linear_model
 
 
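def PlotBoundaries(model, X, Y):
    '''
    Helper function that plots the decision boundaries of a model and data (X,Y).
    X must be a NumPy array with two columns, because it is indexed as X[:, 0].
    '''
    # Range of the grid: the extent of each of the two features, padded by 1
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))

    # Predict the class at every grid point, then reshape back to the grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Shade each region by its predicted class
    plt.contourf(xx, yy, Z, alpha=0.4)

    # Plot the data points on top
    plt.scatter(X[:, 0], X[:, 1], c=Y, s=12, edgecolor='k')
    plt.show()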
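
To make the grid step inside PlotBoundaries a little more concrete, here is a minimal sketch of how np.meshgrid and np.c_ turn two axis ranges into a list of (x, y) points, and how the flat predictions are folded back into the grid shape. The tiny ranges, the coarse step of 1.0, and the stand-in rule fake_predict are all assumptions made up for illustration; they are not part of the helper above.

```python
import numpy as np

# Toy ranges (hypothetical values, kept small so the grid is easy to read)
xx, yy = np.meshgrid(np.arange(0, 3, 1.0), np.arange(0, 2, 1.0))

# Each row of grid_points is one (x, y) location on the grid
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Stand-in for model.predict (hypothetical rule): class 1 where x + y >= 2
def fake_predict(points):
    return (points.sum(axis=1) >= 2).astype(int)

# Predictions come back flat, so reshape them to the grid before contourf
Z = fake_predict(grid_points).reshape(xx.shape)

print(xx.shape)           # rows follow the y values, columns follow the x values
print(grid_points.shape)  # one row per grid point, two coordinates each
print(Z)
```

The number of grid points grows with the range divided by the step on each axis, so a step of 0.01 over wide feature ranges can produce a very large grid. A larger step may be more practical in that case.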
 
 
 
Note: the data used in this section can be downloaded from the following URL.
https://www.kaggle.com/blastchar/telco-customer-churn
 
 
df = pd.read_csv('Telco-Customer-Churn.csv', index_col=0)
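# index_col=0 tells pandas which column to use as the index.
# Passing 0 makes the first column the index.
# Without it, a first column that has no header is loaded as a column named "Unnamed: 0".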
 
df.head()
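
The effect of index_col=0 is easy to check on a tiny in-memory CSV. The sketch below is only an illustration: the CSV text and its column names are made up, and it assumes the first column was saved without a header (which is what happens when a DataFrame is written out together with its index).

```python
import io
import pandas as pd

# Hypothetical CSV text whose first column (a saved index) has no header
csv_text = ",name,score\n0,a,10\n1,b,20\n"

# Without index_col, the unnamed first column is loaded as "Unnamed: 0"
df_plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_plain.columns.tolist())
print(df_indexed.columns.tolist())
```

If the first column of a file does have a header, index_col=0 simply turns that named column into the index.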
 
 
Converting categories to numbers
 
These algorithms cannot work directly on the raw data.
So the first task is to get the data into a usable format
by converting all categorical data into numbers.
 
 
# Two-valued columns: encode as 0/1
df['gender'] = np.where(df['gender'] == 'Female', 0, 1)
df['Partner'] = np.where(df['Partner'] == 'No', 0, 1)
df['Dependents'] = df['Dependents'].map({'No':0, 'Yes':1})
df['PhoneService'] = df['PhoneService'].map({'No':0, 'Yes':1})
df['PaperlessBilling'] = df['PaperlessBilling'].map({'No':0, 'Yes':1})
# Columns with three or more categories: one indicator column per category
df = pd.get_dummies(df, columns=['MultipleLines','InternetService','OnlineSecurity',
                                 'OnlineBackup','DeviceProtection','TechSupport',
                                 'StreamingTV','StreamingMovies','Contract',
                                 'PaymentMethod'])
df['Churn']=df['Churn'].map({'No':0, 'Yes':1})
# Replace blank (' ') TotalCharges entries with 0, then convert the column to numeric
df['TotalCharges'] = pd.to_numeric(np.where(df['TotalCharges']==' ', 0,
                                            df['TotalCharges']))
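
If it is not obvious what map and pd.get_dummies each produce, a toy table makes it visible. This is just a sketch with made-up rows that mimic two of the dataset's columns; it also shows where column names such as Contract_Month-to-month come from.

```python
import pandas as pd

# Hypothetical toy rows that mimic two of the dataset's columns
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# Two-valued column: map each label to 0/1
# (labels that are missing from the dict would become NaN)
toy["Partner"] = toy["Partner"].map({"No": 0, "Yes": 1})

# Multi-valued column: one indicator column per category,
# named <column>_<category>
toy = pd.get_dummies(toy, columns=["Contract"])

print(toy.columns.tolist())
print(toy)
```

One thing to keep in mind with map is that any value not listed in the dictionary silently turns into NaN, so it can be worth checking the unique values of a column first.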
 
 
 
Here we will use the feature universe described in the previous post.
So that the model can pick the top features, we define the feature universe as a list.
 
features_to_include = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'PaperlessBilling', 'MonthlyCharges', 'TotalCharges',
       'MultipleLines_No', 'MultipleLines_No phone service',
       'MultipleLines_Yes', 'InternetService_DSL',
       'InternetService_Fiber optic', 'InternetService_No',
       'OnlineSecurity_No', 'OnlineSecurity_No internet service',
       'OnlineSecurity_Yes', 'OnlineBackup_No',
       'OnlineBackup_No internet service', 'OnlineBackup_Yes',
       'DeviceProtection_No', 'DeviceProtection_No internet service',
       'DeviceProtection_Yes', 'TechSupport_No',
       'TechSupport_No internet service', 'TechSupport_Yes', 'StreamingTV_No',
       'StreamingTV_No internet service', 'StreamingTV_Yes',
       'StreamingMovies_No', 'StreamingMovies_No internet service',
       'StreamingMovies_Yes', 'Contract_Month-to-month', 'Contract_One year',
       'Contract_Two year', 'PaymentMethod_Bank transfer (automatic)',
       'PaymentMethod_Credit card (automatic)',
       'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check']
 
 
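Feature selection
We now have a universe of candidate features, and we use an algorithm
to select the best ones to include in the model.
To visualize the decision boundaries clearly, we will plot in two dimensions.
For that reason, we use scikit-learn's feature selection package
and keep only two features.
The function we use relies on a nonparametric method
and ranks the features by the mutual information (MI) they share with the target.
Note: Nonparametric method - Wikipedia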
 
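The mutual information (MI) between two random variables is a non-negative value (0 or greater)
that measures the dependency between the variables.
It equals zero if and only if the two random variables are independent.
The larger the value, the stronger the dependency.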
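
These two properties can be checked by hand on a small discrete example. The sketch below computes MI directly from its definition, the sum over all (x, y) pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))), using empirical frequencies. The helper mutual_information and the toy sequences are made up for illustration, and this is not the estimator that scikit-learn's mutual_info_classif uses internally.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI (in nats) of two discrete sequences, from empirical frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), with counts in place of p(x), p(y)
        mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
    return mi

# Hypothetical toy data
x = [0, 0, 1, 1, 0, 0, 1, 1]
y_dependent = list(x)                     # fully determined by x
y_independent = [0, 1, 0, 1, 0, 1, 0, 1]  # every (x, y) pair equally frequent

mi_dep = mutual_information(x, y_dependent)
mi_ind = mutual_information(x, y_independent)
print(mi_dep, mi_ind)
```

The fully dependent pair gives a clearly positive score, while the pair whose joint frequencies factor exactly into the marginals gives zero, which matches the description above.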
 
from sklearn import feature_selection
 
feature_rank = feature_selection.mutual_info_classif(df[features_to_include],
                                                     df['Churn'])
 
feature_rank_df = pd.DataFrame(list(zip(features_to_include, feature_rank)),
                               columns=['Feature', 'Score'])
 
 
feature_rank_df.sort_values(by='Score', ascending = False).head()
 
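Strictly speaking, we should pick the top-ranked Contract_Month-to-month as the best feature,
but it is a binary variable (one that takes only one of two values, such as 0 or 1, on or off),
so a plot of the separation between points would not be informative.
For this walkthrough we therefore pick two continuous variables instead.
two_features = ['tenure','MonthlyCharges']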
 
 
 
Splitting into train/test
 
X_trn, X_tst, Y_trn, Y_tst = train_test_split(df[features_to_include],
                                              df['Churn'], test_size=0.4)
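
To see what test_size=0.4 does, here is a small sketch on made-up data. The toy arrays are hypothetical, and random_state=0 is an addition for illustration only (the call above does not set it); it fixes the shuffle so the same rows land in each set on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features, binary target
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# 40% of the rows go to the test set, the remaining 60% to the training set
X_trn, X_tst, y_trn, y_tst = train_test_split(
    X, y, test_size=0.4, random_state=0
)

print(X_trn.shape, X_tst.shape)
```

train_test_split also accepts a stratify argument (for example stratify=y) that keeps the class proportions similar in both sets, which may be worth considering when the target classes are imbalanced.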
 
 
 
 
 
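So, how was it?
 
We quickly recapped the whole flow, from the imports through to the point where the data is ready for an algorithm.
In practice, data cleaning steps get added, and the process differs from person to person and from dataset to dataset,
so I hope you will explore a variety of workflows ^^
 
From the next post on, we will build on this material, apply the algorithms one by one,
and see how they behave, so stay tuned ^^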
 
 
 
 
 
 
 
Thank you for reading to the end!