Data Preprocessing and Balancing:
- Filtered out classes with fewer than two samples, so that every remaining category has at least two examples.
- Used the Synthetic Minority Over-sampling Technique (SMOTE) to balance the dataset; SMOTE addresses class imbalance by creating synthetic samples of the minority classes rather than duplicating existing ones.
- Ensured the dataset contains at least two classes with more than one sample each to proceed with SMOTE.
- After balancing, the dataset was split into a training set (used to fit the model) and a testing set (held out for evaluation).
- Scaled the data to ensure that no feature disproportionately affects the model’s performance.
- Employed a logistic regression model wrapped in a pipeline with data scaling for classification purposes.
- Trained the logistic regression model on the balanced and scaled training data.
Model Evaluation and Results:
- The model’s predictions were compared against the actual data in the test set to evaluate performance.
- The accuracy obtained was approximately 45%, which indicates that the model correctly predicted the class nearly half of the time.
- A confusion matrix was generated to show the model’s predictions in detail, highlighting where it got confused between different classes.
- A classification report provided a breakdown of performance metrics for each class, including precision (the fraction of predictions for a class that were correct), recall (the fraction of actual instances of a class that were correctly identified), and the F1-score (the harmonic mean of precision and recall).
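The evaluation steps above map directly onto scikit-learn's metrics API. A minimal sketch, assuming a fitted `model` and held-out `X_test`, `y_test` (the function name `evaluate` is illustrative):

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
)


def evaluate(model, X_test, y_test):
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)            # overall fraction correct
    cm = confusion_matrix(y_test, y_pred)           # rows: true, cols: predicted
    report = classification_report(y_test, y_pred)  # per-class precision/recall/F1
    return acc, cm, report
```

Reading the confusion matrix row by row shows, for each true class, where the model's predictions went; the classification report summarizes the same information as per-class precision, recall, and F1.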
Summary of Findings:
- The overall accuracy suggests the model may not be highly effective for this particular dataset as it stands.
- Some classes were predicted with high accuracy (notably those with a large number of synthetic samples created by SMOTE), while others were not. This could indicate that the model is overfitting to the resampled data; because SMOTE was applied before the train/test split, synthetic points derived from training examples can appear in the test set and inflate per-class scores. It could also reflect an inherent complexity in the dataset that makes some classes hard to distinguish.
- The detailed results from the confusion matrix and classification report suggest that the model’s performance varies significantly across different classes, with some classes having higher precision and recall than others.