Random Forest: how it works
Video link: https://www.bilibili.com/video/BV1Ra4y1E752/?spm_id_from=333.337.search-card.all.click&vd_source=a542d98d483fd367e498fc1f04b5dc10
Definition
A random forest is an ensemble of many decision trees
supervised machine learning: labelled data is required
In an image, every pixel has an intensity value, and we can map different value ranges to different textures
e.g. air < 10; 11 < pore < 60; pyrite > 170
These thresholds form a decision tree: root node, internal nodes, leaf nodes
Once all possible branches in our decision tree end in leaf nodes, we're done: we've trained a decision tree.
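The pixel-threshold rules above can be sketched as a tiny hand-written decision tree. The function name and the fallback "other" class are illustrative; the thresholds are the ones from the example:

```python
def classify_pixel(value):
    """Classify a pixel intensity into a texture class using the
    example thresholds: air < 10, 11 < pore < 60, pyrite > 170."""
    if value < 10:
        return "air"       # root-node test
    elif 11 < value < 60:
        return "pore"      # internal-node test
    elif value > 170:
        return "pyrite"    # leaf for the brightest pixels
    else:
        return "other"     # values not covered by the example rules

print(classify_pixel(5))    # air
print(classify_pixel(30))   # pore
print(classify_pixel(200))  # pyrite
```

Each `if` test is one internal node of the tree; each `return` is a leaf.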
Choosing split points
Split point: the test that gives the best split of the input data
pick the node that gives the best split: use Gini impurity
Gini impurity decides which split point is chosen: the candidate split with the lowest (weighted) impurity wins
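A minimal sketch of how Gini impurity scores a candidate split; the helper names are mine, but the formula (1 minus the sum of squared class proportions, size-weighted over the two children) is the standard one:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(left, right):
    """Impurity of a split = size-weighted average of the two children."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

pure = np.array([1, 1, 1, 1])
mixed = np.array([0, 1, 0, 1])
print(gini(pure))   # 0.0 (perfectly pure)
print(gini(mixed))  # 0.5 (worst case for two classes)
```

A split that separates the classes perfectly has weighted impurity 0, which is why it would be picked first.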
Weakness of a single decision tree: it overfits, performing well on the training set but poorly on the test set
Workflow
at each split, only a random subset of the features is available
pick the feature that gives the best split of the data
The final result is decided by a majority vote across all the decision trees
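The majority-vote step can be sketched with scikit-learn: each fitted tree in the forest's `estimators_` casts a vote and the most common class wins. (Note that scikit-learn's own `predict` averages class probabilities rather than hard votes, though for fully grown trees the two usually agree.) The toy dataset here is illustrative, not the video's data:

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for real pixel/feature data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Each tree votes on one sample; the majority class is the forest's answer
sample = X[:1]
votes = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
majority = Counter(votes).most_common(1)[0][0]

print("votes:", votes)
print("majority vote:", majority)
```

Because every tree was grown on a different bootstrap sample and different feature subsets, the trees disagree in different ways, and the vote averages their errors out.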
Random forest: code implementation
Video link: https://www.youtube.com/watch?v=YYjvkSJoui4
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
df = pd.read_csv("au_label.csv")
print(df.head())
sizes = df['label'].value_counts(sort = True)
print(sizes)
#counts how many rows there are for each label
df.drop(["people"], axis = 1, inplace = True)
#handle missing values
#df = df.dropna()
#convert non-numeric data to numeric
#eg: good-1 bad-0
#df.loc[df.Productivity == "bad", "Productivity"] = 0
#df.loc[df.Productivity == "good", "Productivity"] = 1
#(.loc avoids pandas chained-assignment warnings)
#define dependent variable
Y = df["label"].values
Y = Y.astype("int")
#Define independent variables
#column_selection3 = [' AU01_r', ' AU02_r', ' AU04_r', ' AU05_r', ' AU06_r', ' AU07_r', ' AU09_r', ' AU10_r', ' AU12_r', ' AU14_r', ' AU15_r', ' AU17_r', ' AU20_r', ' AU23_r', ' AU25_r', ' AU26_r',' AU45_r']
X = df.drop(labels = ["label"], axis = 1)
#now the data is ready
#split data into train and test datasets
#import the train/test split helper from sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state = 20)
#print(X_test)
#import the random forest model
from sklearn.ensemble import RandomForestClassifier
#define a model
model = RandomForestClassifier(n_estimators = 10, random_state = 30)
#train the model
model.fit(X_train, Y_train)
#test dataset
prediction_test = model.predict(X_test)
print(prediction_test)
#compare with Y_test, check if it is correct
from sklearn import metrics
print("Accuracy = ", metrics.accuracy_score(Y_test, prediction_test))
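Accuracy alone can hide which classes get confused with each other; `sklearn.metrics.confusion_matrix` shows that. A minimal sketch with made-up labels (not the actual au_label predictions):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]  # toy ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0]  # toy model predictions

print(accuracy_score(y_true, y_pred))    # 4 of 6 correct
print(confusion_matrix(y_true, y_pred))  # rows = true class, cols = predicted
```

Here the same code works on `Y_test` and `prediction_test` in place of the toy lists.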
#figure out which feature is the most important
#find the features with the largest importances
#print(model.feature_importances_)
feature_list = list(X.columns)
feature_imp = pd.Series(model.feature_importances_, index = feature_list).sort_values(ascending=False)
print(feature_imp)
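Since pyplot is already imported above, the importances can also be plotted. A self-contained sketch with synthetic data and illustrative AU-style column names, since au_label.csv is not included here:

```python
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the AU features
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=30)
cols = [f"AU0{i}_r" for i in range(1, 7)]  # illustrative column names

model = RandomForestClassifier(n_estimators = 10, random_state = 30).fit(X, y)

# Importances sum to 1.0; a horizontal bar chart makes the ranking easy to read
feature_imp = pd.Series(model.feature_importances_, index=cols)
feature_imp.sort_values().plot(kind="barh")
plt.xlabel("Feature importance")
plt.tight_layout()
plt.savefig("feature_importance.png")  # or plt.show() when running interactively
```

With the real data, `feature_imp` from the code above can be plotted the same way.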