A decision tree is a supervised machine learning algorithm used for both classification and regression tasks. It is a flowchart-like structure in which each internal node represents a test on a feature (attribute), each branch represents the outcome of that test, and each leaf node represents the final outcome or class label.
The concept of a decision tree is based on a series of binary decisions that lead to a final prediction or decision. It starts with a root node that represents the entire dataset. The tree recursively splits the dataset based on the selected features, creating branches or sub-trees, until a termination condition is met. The splitting process aims to maximize the information gain or minimize the impurity at each step.
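For example, the short sketch below (a toy example of my own, not tied to any particular library implementation) computes the entropy-based information gain of a candidate split on a small set of binary labels:
import numpy as np

def entropy(labels):
    # Shannon entropy of a label array (0 for a perfectly pure node)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Entropy of the parent minus the weighted entropy of the two children
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left   = np.array([0, 0, 0, 1])   # mostly class 0
right  = np.array([0, 1, 1, 1])   # mostly class 1
print(information_gain(parent, left, right))  # positive: the split reduces impurity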
In a classification decision tree, the leaf nodes represent class labels or categories. The tree learns from the training data by selecting the most informative features and creating decision rules to classify instances into different classes. During the training process, the algorithm evaluates various features and splits the data based on the one that provides the most information gain or the best impurity reduction.
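To inspect the decision rules a trained classifier has actually learned, scikit-learn's export_text can print the tree as nested if/else rules. The sketch below uses the Iris dataset bundled with scikit-learn purely for illustration:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)
# Each printed line is a threshold test on one feature; leaves show the predicted class
print(export_text(clf, feature_names=list(iris.feature_names)))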
In a regression decision tree, the leaf nodes represent continuous values or predicted outcomes. The algorithm recursively splits the data based on feature values to create sub-trees until a stopping criterion is met. The predicted outcome in a leaf node is typically the mean of the target variable values that fall within that region.
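As a minimal illustration (with made-up toy data), a depth-1 regression tree splits the inputs into two leaves and predicts the mean target value within each:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([1.0, 1.2, 0.8, 9.0, 9.5, 10.0])
reg = DecisionTreeRegressor(max_depth=1)  # a single split -> two leaves
reg.fit(X, y)
# Each prediction is the mean of the y values in that leaf (about 1.0 and 9.5)
print(reg.predict([[2.5], [11.5]]))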
Decision trees have several advantages. They are easy to interpret and visualize, which makes the decision-making process transparent. They can handle both categorical and numerical data as well as missing values, and they can capture non-linear relationships between features and the target variable.
However, decision trees can suffer from overfitting, where they fit the training data too closely and perform poorly on unseen data. To mitigate this, techniques such as pruning, setting stopping criteria (for example, a maximum depth or a minimum number of samples per leaf), and using ensemble methods like random forests or gradient boosting can be employed.
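In scikit-learn, for example, these stopping criteria and cost-complexity pruning are set through constructor parameters. The values below are illustrative, not tuned:
from sklearn.tree import DecisionTreeClassifier

pruned_clf = DecisionTreeClassifier(
    max_depth=5,           # limit the depth of the tree
    min_samples_leaf=20,   # require enough samples in each leaf
    ccp_alpha=0.001,       # cost-complexity pruning strength
    random_state=0,
)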
In summary, a decision tree is a hierarchical structure that uses a series of binary decisions to classify or predict outcomes. It is a versatile and interpretable algorithm used in various domains for both classification and regression tasks.
Here is the code for building a simple decision tree classification model with scikit-learn.
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
# Load the dataset
AH_data = pd.read_csv("tree_addhealth.csv")
data_clean = AH_data.dropna()
print(data_clean.dtypes)
print(data_clean.describe())
# """
# Modeling and Prediction
# """
# Split into training and testing sets
predictors = data_clean[['BIO_SEX', 'HISPANIC', 'WHITE', 'BLACK', 'NAMERICAN', 'ASIAN',
'age', 'ALCEVR1', 'ALCPROBS1', 'marever1', 'cocever1', 'inhever1', 'cigavail', 'DEP1',
'ESTEEM1', 'VIOL1', 'PASSIST', 'DEVIANT1', 'SCHCONN1', 'GPA1', 'EXPEL1', 'FAMCONCT', 'PARACTV',
'PARPRES']]
targets = data_clean.TREG1
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)
# Build model on training data
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))
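# Optionally, the classification_report imported above adds per-class precision and recall
print(classification_report(tar_test, predictions))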
# Displaying the decision tree
from sklearn import tree
from io import StringIO  # In Python 3, StringIO lives in io (the old StringIO module is Python 2 only)
from IPython.display import Image
out = StringIO()
tree.export_graphviz(classifier, out_file=out)
import pydotplus
graph = pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())
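If Graphviz and pydotplus are not available, a simpler alternative (assuming scikit-learn 0.21 or later) is the built-in plot_tree function, which draws the tree with matplotlib:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
plot_tree(classifier, filled=True, max_depth=3)  # limit the displayed depth for readability
plt.show()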