Articles | Journal of Emerging Investigators

Similarity Graph-Based Semi-supervised Methods for Multiclass Data Classification

Balaji et al. | Sep 11, 2021

The purpose of the study was to determine whether graph-based machine learning techniques, which have increased prevalence in the last few years, can accurately classify data into one of many clusters, while requiring less labeled training data and parameter tuning as opposed to traditional machine learning algorithms. The results determined that the accuracy of graph-based and traditional classification algorithms depends directly upon the number of features of each dataset, the number of classes in each dataset, and the amount of labeled training data used.

Using text embedding models as text classifiers with medical data

Goel et al. | Nov 19, 2024

This article describes the classification of medical text data using vector databases and text embedding. Various large language models were used to generate this medical data for the classification task.

Collaboration beats heterogeneity: Improving federated learning-based waste classification

Chong et al. | Jul 18, 2023

Based on the success of deep learning, recent works have attempted to develop a waste classification model using deep neural networks. This work presents federated learning (FL) for a solution, as it allows participants to aid in training the model using their own data. Results showed that with less clients, having a higher participation ratio resulted in less accuracy degradation by the data heterogeneity.

Deep learning for pulsar detection: Investigating hyperparameter effects on TensorFlow classification accuracy

Upadhyay et al. | Jan 31, 2026

This study investigates how the hyperparameters epochs and batch size affect the classification accuracy of a convolutional neural network (CNN) trained on pulsar candidate data. Our results reveal that accuracy improves with increasing number of epochs and smaller batch sizes, suggesting that with optimized hyperparameters, high accuracy may be achievable with minimal training. These findings offer insights that could help create more efficient machine learning classification models for pulsar signal detection, with the potential of accelerating pulsar discovery and advancing astrophysical research.

Optimizing data augmentation to improve machine learning accuracy on endemic frog calls

Anand et al. | Mar 09, 2025

The mountain chain of the Western Ghats on the Indian peninsula, a UNESCO World Heritage site, is home to about 200 frog species, 89 of which are endemic. Distinctive to each frog species, their vocalizations can be used for species recognition. Manually surveying frogs at night during the rain in elephant and big cat forests is difficult, so being able to autonomously record ambient soundscapes and identify species is essential. An effective machine learning (ML) species classifier requires substantial training data from this area. The goal of this study was to assess data augmentation techniques on a dataset of frog vocalizations from this region, which has a minimal number of audio recordings per species. Consequently, enhancing an ML model’s performance with limited data is necessary. We analyzed the effects of four data augmentation techniques (Time Shifting, Noise Injection, Spectral Augmentation, and Test-Time Augmentation) individually and their combined effect on the frog vocalization data and the public environmental sounds dataset (ESC-50). The effect of combined data augmentation techniques improved the model's relative accuracy as the size of the dataset decreased. The combination of all four techniques improved the ML model’s classification accuracy on the frog calls dataset by 94%. This study established a data augmentation approach to maximize the classification accuracy with sparse data of frog call recordings, thereby creating a possibility to build a real-world automated field frog species identifier system. Such a system can significantly help in the conservation of frog species in this vital biodiversity hotspot.

Transfer learning and data augmentation in osteosarcoma cancer detection

Chu et al. | Jun 03, 2023

Osteosarcoma is a type of bone cancer that affects young adults and children. Early diagnosis of osteosarcoma is crucial to successful treatment. The current methods of diagnosis, which include imaging tests and biopsy, are time consuming and prone to human error. Hence, we used deep learning to extract patterns and detect osteosarcoma from histological images. We hypothesized that the combination of two different technologies (transfer learning and data augmentation) would improve the efficacy of osteosarcoma detection in histological images. The dataset used for the study consisted of histological images for osteosarcoma and was quite imbalanced as it contained very few images with tumors. Since transfer learning uses existing knowledge for the purpose of classification and detection, we hypothesized it would be proficient on such an imbalanced dataset. To further improve our learning, we used data augmentation to include variations in the dataset. We further evaluated the efficacy of different convolutional neural network models on this task. We obtained an accuracy of 91.18% using the transfer learning model MobileNetV2 as the base model with various geometric transformations, outperforming the state-of-the-art convolutional neural network based approach.

Predicting smoking status based on RNA sequencing data

Yang et al. | Aug 30, 2024

Given an association between nicotine addiction and gene expression, we hypothesized that expression of genes commonly associated with smoking status would have variable expression between smokers and non-smokers. To test whether gene expression varies between smokers and non-smokers, we analyzed two publicly-available datasets that profiled RNA gene expression from brain (nucleus accumbens) and lung tissue taken from patients identified as smokers or non-smokers. We discovered statistically significant differences in expression of dozens of genes between smokers and non-smokers. To test whether gene expression can be used to predict whether a patient is a smoker or non-smoker, we used gene expression as the training data for a logistic regression or random forest classification model. The random forest classifier trained on lung tissue data showed the most robust results, with area under curve (AUC) values consistently between 0.82 and 0.93. Both models trained on nucleus accumbens data had poorer performance, with AUC values consistently between 0.65 and 0.7 when using random forest. These results suggest gene expression can be used to predict smoking status using traditional machine learning models. Additionally, based on our random forest model, we proposed KCNJ3 and TXLNGY as two candidate markers of smoking status. These findings, coupled with other genes identified in this study, present promising avenues for advancing applications related to the genetic foundation of smoking-related characteristics.

Effects of data amount and variation in deep learning-based tuberculosis diagnosis in chest X-ray scans

Bhorkar et al. | Apr 28, 2025

The authors developed and tested machine learning methods to diagnose tuberculosis from pulmonary X-ray scans.

Model selection and optimization for poverty prediction on household data from Cambodia

Wong et al. | Sep 29, 2023

Here the authors sought to use three machine learning models to predict poverty levels in Cambodia based on available household data. They found teat multilayer perceptron outperformed the other models, with an accuracy of 87 %. They suggest that data-driven approaches such as these could be used more effectively target and alleviate poverty.

Class distinctions in automated domestic waste classification with a convolutional neural network

Goel et al. | Apr 29, 2025

Domestic waste classification using convolutional neural network

Browse Articles

Similarity Graph-Based Semi-supervised Methods for Multiclass Data Classification

Using text embedding models as text classifiers with medical data

Collaboration beats heterogeneity: Improving federated learning-based waste classification

Deep learning for pulsar detection: Investigating hyperparameter effects on TensorFlow classification accuracy

Optimizing data augmentation to improve machine learning accuracy on endemic frog calls

Transfer learning and data augmentation in osteosarcoma cancer detection

Predicting smoking status based on RNA sequencing data

Effects of data amount and variation in deep learning-based tuberculosis diagnosis in chest X-ray scans

Model selection and optimization for poverty prediction on household data from Cambodia

Class distinctions in automated domestic waste classification with a convolutional neural network

Search Articles

Popular Tags

Browse Articles

Search Articles

Category

School Level

Popular Tags