Articles | Journal of Emerging Investigators

Optimizing data augmentation to improve machine learning accuracy on endemic frog calls

Anand et al. | Mar 09, 2025

The mountain chain of the Western Ghats on the Indian peninsula, a UNESCO World Heritage site, is home to about 200 frog species, 89 of which are endemic. Distinctive to each frog species, their vocalizations can be used for species recognition. Manually surveying frogs at night during the rain in elephant and big cat forests is difficult, so being able to autonomously record ambient soundscapes and identify species is essential. An effective machine learning (ML) species classifier requires substantial training data from this area. The goal of this study was to assess data augmentation techniques on a dataset of frog vocalizations from this region, which has a minimal number of audio recordings per species. Consequently, enhancing an ML model’s performance with limited data is necessary. We analyzed the effects of four data augmentation techniques (Time Shifting, Noise Injection, Spectral Augmentation, and Test-Time Augmentation) individually and their combined effect on the frog vocalization data and the public environmental sounds dataset (ESC-50). The effect of combined data augmentation techniques improved the model's relative accuracy as the size of the dataset decreased. The combination of all four techniques improved the ML model’s classification accuracy on the frog calls dataset by 94%. This study established a data augmentation approach to maximize the classification accuracy with sparse data of frog call recordings, thereby creating a possibility to build a real-world automated field frog species identifier system. Such a system can significantly help in the conservation of frog species in this vital biodiversity hotspot.

Validating DTAPs with large language models: A novel approach to drug repurposing

Curtis et al. | Mar 02, 2025

Here, the authors investigated the integration of large language models (LLMs) with drug target affinity predictors (DTAPs) to improve drug repurposing, demonstrating a significant increase in prediction accuracy, particularly with GPT-4, for psychotropic drugs and the sigma-1 receptor. This novel approach offers to potentially accelerate and reduce the cost of drug discovery by efficiently identifying new therapeutic uses for existing drugs.

Forecasting air quality index: A statistical machine learning and deep learning approach

Pasula et al. | Feb 17, 2025

Here the authors investigated air quality forecasting in India, comparing traditional time series models like SARIMA with deep learning models like LSTM. The research found that SARIMA models, which capture seasonal variations, outperform LSTM models in predicting Air Quality Index (AQI) levels across multiple Indian cities, supporting the hypothesis that simpler models can be more effective for this specific task.

The effect of Poisson sprinkling methods on causal sets in 1+1-dimensional flat spacetime

Deshpande et al. | Feb 14, 2025

The causal set theory (CST) is a theory of the small-scale structure of spacetime, which provides a discrete approach to describing quantum gravity. Studying the properties of causal sets requires methods for constructing appropriate causal sets. The most commonly used approach is to perform a random sprinkling. However, there are different methods for sprinkling, and it is not clear how each commonly used method affects the results. We hypothesized that the methods would be statistically equivalent, but that some noticeable differences might occur, such as a more uniform distribution for the sub-interval sprinkling method compared to the direct sprinkling and edge bias compensation methods. We aimed to assess this hypothesis by analyzing the results of three different methods of sprinkling. For our analysis, we calculated distributions of the longest path length, interval size, and paths of various lengths for each sprinkling method. We found that the methods were statistically similar. However, one of the methods, sub-interval sprinkling, showed some slight advantages over the other two. These findings can serve as a point of reference for active researchers in the field of causal set theory, and is applicable to other research fields working with similar graphs.

Machine learning predictions of additively manufactured alloy crack susceptibilities

Gowda et al. | Nov 12, 2024

Additive manufacturing (AM) is transforming the production of complex metal parts, but challenges like internal cracking can arise, particularly in critical sectors such as aerospace and automotive. Traditional methods to assess cracking susceptibility are costly and time-consuming, prompting the use of machine learning (ML) for more efficient predictions. This study developed a multi-model ML pipeline that predicts solidification cracking susceptibility (SCS) more accurately by considering secondary alloy properties alongside composition, with Random Forest models showing the best performance, highlighting a promising direction for future research into SCS quantification.

Monitoring drought using explainable statistical machine learning models

Cheung et al. | Oct 28, 2024

Droughts have a wide range of effects, from ecosystems failing and crops dying, to increased illness and decreased water quality. Drought prediction is important because it can help communities, businesses, and governments plan and prepare for these detrimental effects. This study predicts drought conditions by using predictable weather patterns in machine learning models.

Gradient boosting with temporal feature extraction for modeling keystroke log data

Barretto et al. | Oct 04, 2024

Although there has been great progress in the field of Natural language processing (NLP) over the last few years, particularly with the development of attention-based models, less research has contributed towards modeling keystroke log data. State of the art methods handle textual data directly and while this has produced excellent results, the time complexity and resource usage are quite high for such methods. Additionally, these methods fail to incorporate the actual writing process when assessing text and instead solely focus on the content. Therefore, we proposed a framework for modeling textual data using keystroke-based features. Such methods pay attention to how a document or response was written, rather than the final text that was produced. These features are vastly different from the kind of features extracted from raw text but reveal information that is otherwise hidden. We hypothesized that pairing efficient machine learning techniques with keystroke log information should produce results comparable to transformer techniques, models which pay more or less attention to the different components of a text sequence in a far quicker time. Transformer-based methods dominate the field of NLP currently due to the strong understanding they display of natural language. We showed that models trained on keystroke log data are capable of effectively evaluating the quality of writing and do it in a significantly shorter amount of time compared to traditional methods. This is significant as it provides a necessary fast and cheap alternative to increasingly larger and slower LLMs.

Machine learning for retinopathy prediction: Unveiling the importance of age and HbA1c with XGBoost

Ramachandran et al. | Sep 05, 2024

The purpose of our study was to examine the correlation of glycosylated hemoglobin (HbA1c), blood pressure (BP) readings, and lipid levels with retinopathy. Our main hypothesis was that poor glycemic control, as evident by high HbA1c levels, high blood pressure, and abnormal lipid levels, causes an increased risk of retinopathy. We identified the top two features that were most important to the model as age and HbA1c. This indicates that older patients with poor glycemic control are more likely to show presence of retinopathy.

Adults’ attitudes toward non-alcoholic beer purchases and consumption by children and adolescents

An et al. | Aug 23, 2024

Consumption of non-alcoholic beverages, like non-alcoholic beer, is growing in popularity in the United States. These beverages raise important societal questions, such as whether minors should be allowed to purchase or consume non-alcoholic beer. An and An investigate this issue by surveying adults to see if they support minors purchasing and consuming non-alcoholic beer.

Vineyard vigilance: Harnessing deep learning for grapevine disease detection

Mandal et al. | Aug 21, 2024

Globally, the cultivation of 77.8 million tons of grapes each year underscores their significance in both diets and agriculture. However, grapevines face mounting threats from diseases such as black rot, Esca, and leaf blight. Traditional detection methods often lag, leading to reduced yields and poor fruit quality. To address this, authors used machine learning, specifically deep learning with Convolutional Neural Networks (CNNs), to enhance disease detection.

Browse Articles

Optimizing data augmentation to improve machine learning accuracy on endemic frog calls

Validating DTAPs with large language models: A novel approach to drug repurposing

Forecasting air quality index: A statistical machine learning and deep learning approach

The effect of Poisson sprinkling methods on causal sets in 1+1-dimensional flat spacetime

Machine learning predictions of additively manufactured alloy crack susceptibilities

Monitoring drought using explainable statistical machine learning models

Gradient boosting with temporal feature extraction for modeling keystroke log data

Machine learning for retinopathy prediction: Unveiling the importance of age and HbA1c with XGBoost

Adults’ attitudes toward non-alcoholic beer purchases and consumption by children and adolescents

Vineyard vigilance: Harnessing deep learning for grapevine disease detection

Search Articles

Popular Tags

Browse Articles

Search Articles

Category

School Level

Popular Tags