Introduction to Data Mining

Classification

Classification is a data mining technique used to predict the class label of a given data instance. It builds a model that maps input data attributes to the corresponding class. The model is built using a training dataset where the class label of each instance is known. Once the model is built, it can be used to predict the class of new, unseen instances.

Decision Trees

One of the most popular classification techniques is the decision tree. A decision tree is a tree-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class label. The tree is built recursively: at each internal node, the algorithm selects the best attribute on which to split the data, then repeats the process on each resulting subset. In algorithms such as ID3, the best attribute is the one that produces the highest information gain, which measures how much splitting on that attribute reduces the uncertainty (entropy) about the class label.
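As a rough illustration of how information gain can be computed, the sketch below measures entropy before and after a split and picks the attribute with the larger reduction. The function names, the toy rows, and the choice to treat income and age as already-discretized categories are all assumptions made for this example, not part of the course material.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: six customers with categorical attributes.
rows = [
    {"income": "high", "age": "young"},
    {"income": "high", "age": "young"},
    {"income": "high", "age": "old"},
    {"income": "low", "age": "young"},
    {"income": "low", "age": "young"},
    {"income": "low", "age": "old"},
]
labels = ["dislike", "dislike", "like", "like", "like", "like"]

# Score each candidate attribute and keep the one with the highest gain.
gains = {a: information_gain(rows, labels, a) for a in ("income", "age")}
best_attribute = max(gains, key=gains.get)
```

On this toy data, income yields the larger gain (roughly 0.46 bits against roughly 0.25 for age), so it would be chosen for the root split.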

Example

To illustrate how classification works, consider a dataset of customers who have purchased a product, where each record holds the customer's age, income, and the rating they gave the product (like or dislike). The goal is to predict whether a new customer will like the product based on their age and income. The dataset is split into a training set and a test set. The training set is used to build a decision tree model, and the test set, which the model has not seen during training, is used to evaluate its accuracy. The decision tree model might look like this:

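def predict(income, age):
    # Root node tests income (threshold 50k); age only matters for higher incomes.
    if income > 50_000:
        if age > 30:
            return "like"
        else:
            return "dislike"
    else:
        return "like"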

The model predicts that a customer with income > 50k and age > 30 will like the product, while a customer with income > 50k and age <= 30 will dislike the product. A customer with income <= 50k will like the product regardless of their age.
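The following sketch shows one way the train/test split described above could look in practice. The customer records are invented for illustration, the 75/25 split ratio and the fixed random seed are arbitrary choices, and the tree is written by hand rather than learned, so the training set only marks where a real learner would be fitted.

```python
import random


def predict(income, age):
    """The example tree from the text, with income in currency units."""
    if income > 50_000:
        return "like" if age > 30 else "dislike"
    return "like"


# Hypothetical labelled customers: (income, age, rating). Not real data.
customers = [
    (62_000, 45, "like"),
    (75_000, 24, "dislike"),
    (38_000, 52, "like"),
    (41_000, 29, "like"),
    (90_000, 33, "like"),
    (58_000, 27, "like"),      # the tree misclassifies this customer
    (30_000, 61, "dislike"),   # and this one
    (84_000, 22, "dislike"),
]

# Shuffle with a fixed seed, then hold out 25% of the records as the test set.
rng = random.Random(0)
shuffled = customers[:]
rng.shuffle(shuffled)
split = int(len(shuffled) * 0.75)
train_set, test_set = shuffled[:split], shuffled[split:]

# Apply the tree to the held-out records only.
correct = sum(predict(income, age) == rating for income, age, rating in test_set)
test_accuracy = correct / len(test_set)
```

With only two held-out records the accuracy estimate is very coarse; a realistic dataset would be far larger.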

Evaluation

Several measures can be used to evaluate a classification model, including accuracy, precision, recall, and F1 score. Accuracy is the proportion of all instances that are classified correctly. Precision is the proportion of true positives among the instances predicted as positive. Recall is the proportion of true positives among the instances that are actually positive. The F1 score is the harmonic mean of precision and recall, so it is high only when both are high.
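These four measures can be computed directly from lists of actual and predicted labels. The sketch below is a minimal version that assumes a binary problem with "like" as the positive class; the function name, the label lists, and the convention of returning 0.0 when a denominator is zero are choices made for this example.

```python
def classification_metrics(actual, predicted, positive="like"):
    """Return accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    # Guard against division by zero when the positive class is never
    # predicted or never occurs.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test-set labels and model predictions.
actual = ["like", "like", "dislike", "like", "dislike", "like"]
predicted = ["like", "dislike", "dislike", "like", "like", "like"]
metrics = classification_metrics(actual, predicted)
```

Here there are three true positives, one false positive, and one false negative, so precision, recall, and F1 all come out to 0.75, while accuracy is 4 out of 6.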


All courses were automatically generated using OpenAI's GPT-3. Your feedback helps us improve as we cannot manually review every course. Thank you!