A random forest is a popular machine learning technique used for both classification and regression tasks. It is an ensemble method that combines the predictions of many decision trees.
Here's how a random forest works:
- Data Preparation: The first step is to prepare the training data: each instance consists of a set of input features and a known label (for classification) or target value (for regression).
- Building Decision Trees: A random forest consists of a collection of decision trees. To build each tree, a random subset of the training data is selected, and a decision tree is constructed using that subset. The randomness comes from two sources: random sampling of the training data and random selection of features.
- Random Sampling: Each decision tree is built on a bootstrap sample of the training data, i.e., a sample drawn with replacement. Each tree therefore sees a slightly different set of training instances, which introduces diversity into the forest.
- Random Feature Selection: When constructing each tree, a random subset of features is selected at each split point. This means that each decision tree considers only a subset of available features, which again adds randomness to the forest.
- Tree Construction: Each decision tree is constructed by recursively partitioning the training data based on feature values. The goal is to create splits that minimize impurity or maximize information gain, depending on the splitting criterion used (for example, the Gini impurity of a node is 1 - Σ p_k², where p_k is the fraction of instances of class k in that node).
- Making Predictions: Once the random forest is trained, predictions are made by aggregating the predictions of all the individual trees. For classification tasks, the most common label predicted by the trees (a majority vote) is chosen as the final prediction; for regression tasks, the average of the predicted values is taken. The sketch after this list illustrates the bootstrap-and-vote idea.
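To make the bootstrap-and-vote mechanism concrete, here is a minimal from-scratch sketch. It is only an illustration under assumed choices (the iris toy dataset and 25 trees, neither of which comes from the example later in this post):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):  # 25 trees is an arbitrary choice for illustration
    # bootstrap sample: draw n indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" mimics random feature selection at each split
    trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx]))

# majority vote across the trees, one column of votes per sample
votes = np.stack([tree.predict(X) for tree in trees])
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("ensemble training accuracy:", (majority == y).mean())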
The idea behind random forests is that by combining multiple decision trees, each with its own random variation, the collective predictions become more robust and less prone to overfitting. The randomness in sampling the data and selecting features helps to reduce the correlation between trees and improves the overall accuracy and generalization of the model.
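As a quick illustration of this effect, a single tree can be compared against a forest on held-out data. This is only a sketch on synthetic data (make_classification with arbitrary settings), not a claim about any particular dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("single tree test accuracy :", tree.score(X_te, y_te))
print("random forest test accuracy:", forest.score(X_te, y_te))

On most runs the forest's held-out accuracy is higher, reflecting the variance reduction described above.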
Random forests have several advantages, such as handling high-dimensional data, capturing non-linear relationships, and being resistant to overfitting. They are widely used in various domains, including finance, healthcare, and computer vision, due to their flexibility and effectiveness in handling complex tasks.
Here is sample code for training and evaluating a random forest, using the tree_addhealth.csv dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# train_test_split moved from sklearn.cross_validation to sklearn.model_selection
from sklearn.model_selection import train_test_split
import sklearn.metrics
# Feature importance
from sklearn.ensemble import ExtraTreesClassifier
# Load the dataset and drop rows with missing values
AH_data = pd.read_csv("tree_addhealth.csv")
data_clean = AH_data.dropna()
print(data_clean.dtypes)
print(data_clean.describe())
# Split into training and testing sets
predictors = data_clean[['BIO_SEX','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN','age',
'ALCEVR1','ALCPROBS1','marever1','cocever1','inhever1','cigavail','DEP1','ESTEEM1','VIOL1',
'PASSIST','DEVIANT1','SCHCONN1','GPA1','EXPEL1','FAMCONCT','PARACTV','PARPRES']]
targets = data_clean.TREG1
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)
# Build model on training data
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train,tar_train)
predictions = classifier.predict(pred_test)
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)
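# (Optional sketch, not part of the original example:) pair each importance
# score with its column name so the output above is easier to read
for name, score in sorted(zip(predictors.columns, model.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(name, round(score, 4))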
"""
Running a different number of trees and see the effect
of that on the accuracy of the prediction
"""
trees = range(1, 26)
accuracy = np.zeros(len(trees))
for idx, n_trees in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n_trees)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)
plt.cla()
plt.plot(trees, accuracy)
plt.xlabel("Number of trees")
plt.ylabel("Test set accuracy")
plt.show()
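As an aside (a sketch beyond the original exercise): scikit-learn can also estimate generalization accuracy without a separate test loop via the oob_score option, which scores each instance using only the trees whose bootstrap samples did not include it:

forest = RandomForestClassifier(n_estimators=100, oob_score=True)  # 100 trees is an arbitrary choice
forest.fit(pred_train, tar_train)
print(forest.oob_score_)  # out-of-bag accuracy estimate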