Similarity Graph-Based Semi-supervised Methods for Multiclass Data Classification

The purpose of this study was to determine whether graph-based machine learning techniques, which have become more prevalent in recent years, can accurately classify data into one of many classes while requiring less labeled training data and parameter tuning than traditional machine learning algorithms. We hypothesized that traditional machine learning algorithms, such as support vector machines (SVM), neural networks, and random forests, would classify accurately with less labeled training data and parameter tuning than their graph-based counterparts. We tested three traditional algorithms (SVM, neural networks, and random forests) and two graph-based algorithms (k-nearest neighbors (KNN) and a graph-based adaptation of the classical Merriman-Bence-Osher scheme for approximating motion by mean curvature). We ran each algorithm on three datasets of varying dimensionality, or number of features: the banknote, letter recognition, and breast cancer datasets contained 5, 26, and 30 features, respectively. Each algorithm was trained on labeled data taken as a subset of each overall dataset, and results were averaged across four iterations. Our results did not support the hypothesis: the traditional algorithms did not outperform the graph-based techniques on all datasets, regardless of dimensionality. We determined that the accuracy of both graph-based and traditional classification algorithms depends directly on the number of features in each dataset, the number of classes in each dataset, and the amount of labeled training data used.
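
The evaluation protocol described above (train on a small labeled subset, score on the remaining points, average over four iterations) can be illustrated with a minimal sketch. This is not the study's code: it uses a plain k-nearest-neighbors classifier on synthetic two-class data, and the function names, the toy data generator, k = 5, and the 10% labeled fraction are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=5):
    """Label x by majority vote among its k nearest labeled points."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def accuracy_once(X, y, labeled_fraction, rng, k=5):
    """Train on a random labeled subset and score on the remaining points."""
    indices = list(range(len(X)))
    rng.shuffle(indices)
    # Keep at least k labeled points so the vote is well defined.
    n_labeled = max(k, int(labeled_fraction * len(X)))
    labeled, unlabeled = indices[:n_labeled], indices[n_labeled:]
    train_X = [X[i] for i in labeled]
    train_y = [y[i] for i in labeled]
    correct = sum(knn_predict(train_X, train_y, X[i], k) == y[i]
                  for i in unlabeled)
    return correct / len(unlabeled)


def mean_accuracy(X, y, labeled_fraction, iterations=4, seed=0, k=5):
    """Average accuracy over several random labeled subsets."""
    rng = random.Random(seed)
    scores = [accuracy_once(X, y, labeled_fraction, rng, k)
              for _ in range(iterations)]
    return sum(scores) / iterations


def make_toy_data(n_per_class=100, seed=0):
    """Hypothetical stand-in data: two Gaussian clusters in two dimensions."""
    rng = random.Random(seed)
    X, y = [], []
    for label, (cx, cy) in enumerate([(0.0, 0.0), (4.0, 4.0)]):
        for _ in range(n_per_class):
            X.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print(mean_accuracy(X, y, labeled_fraction=0.1))
```

Varying `labeled_fraction` while holding the other arguments fixed gives a rough view of how accuracy changes with the amount of labeled training data, which is the kind of comparison the study reports for each algorithm and dataset.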