Articles | Journal of Emerging Investigators

A machine learning approach for abstraction and reasoning problems without large amounts of data

Isik et al. | Jun 25, 2022

While remarkable in its ability to mirror human cognition, machine learning and its associated algorithms often require extensive data to prove effective in completing tasks. However, data is not always plentiful: unpredictable events occur throughout our daily lives and demand flexibility from the artificial intelligence used in technologies such as personal assistants and self-driving vehicles. Driven by the need for AI to complete tasks without extensive training, the researchers in this article use fluid intelligence assessments to develop an algorithm capable of generalization and abstraction. By not prioritizing skill-based training, this article demonstrates the potential of focusing on a more generalized cognitive ability for artificial intelligence, which proves more flexible, and thus more human-like, in solving unique tasks than skill-focused algorithms.

A Data-Centric Analysis of “Stop and Frisk” in New York City

Bhat et al. | Apr 18, 2021

The death of George Floyd has shed light on the disproportionate level of policing affecting non-Whites in the United States of America. To explore whether non-Whites were disproportionately targeted by New York City's "Stop and Frisk" policy, the authors analyze publicly available data on the practice from 2003 to 2019. Their results suggest African Americans were indeed more likely to be stopped by the police until 2012, after which there was some improvement.

Optimizing tennis strategy: a data-driven analysis of point importance

Singla et al. | Dec 21, 2025

The authors looked at the importance of different point breakdowns for winning a game of tennis.

Effects of data amount and variation in deep learning-based tuberculosis diagnosis in chest X-ray scans

Bhorkar et al. | Apr 28, 2025

The authors developed and tested machine learning methods to diagnose tuberculosis from pulmonary X-ray scans.

Locating sources of a high energy cosmic ray extensive air shower using HiSPARC data

Aziz et al. | Oct 24, 2023

Using data provided by the University of Twente High School Project on Astrophysics Research with Cosmics (HiSPARC), the authors analyzed possible source locations of high-energy cosmic ray air showers. One example is an analysis of the high-energy air shower recorded in January 2014, in which Stellarium™ was used to discern its location.

Comparing model-centric and data-centric approaches to determine the efficiency of data-centric AI

La et al. | Apr 20, 2023

In this study, three models are used to test the hypothesis that data-centric artificial intelligence (AI) will improve the performance of machine learning.

Similarity Graph-Based Semi-supervised Methods for Multiclass Data Classification

Balaji et al. | Sep 11, 2021

The purpose of the study was to determine whether graph-based machine learning techniques, which have become more prevalent in the last few years, can accurately classify data into one of many clusters while requiring less labeled training data and parameter tuning than traditional machine learning algorithms. The results showed that the accuracy of graph-based and traditional classification algorithms depends directly on the number of features in each dataset, the number of classes in each dataset, and the amount of labeled training data used.

Evaluating the effectiveness of synthetic training data for day-ahead wind speed prediction in the Great Lakes

Wycoff et al. | Dec 21, 2025

The authors looked at the feasibility of predicting wind speeds with less reliance on historical data.

Uncovering the hidden trafficking trade with geographic data and natural language processing

Aqid et al. | Oct 14, 2024

The authors use machine learning to develop an evidence-based detection tool for identifying human trafficking.

Predicting smoking status based on RNA sequencing data

Yang et al. | Aug 30, 2024

Given an association between nicotine addiction and gene expression, we hypothesized that expression of genes commonly associated with smoking status would differ between smokers and non-smokers. To test whether gene expression varies between smokers and non-smokers, we analyzed two publicly available datasets that profiled RNA gene expression from brain (nucleus accumbens) and lung tissue taken from patients identified as smokers or non-smokers. We discovered statistically significant differences in expression of dozens of genes between smokers and non-smokers. To test whether gene expression can be used to predict whether a patient is a smoker or non-smoker, we used gene expression as the training data for a logistic regression or random forest classification model. The random forest classifier trained on lung tissue data showed the most robust results, with area under curve (AUC) values consistently between 0.82 and 0.93. Both models trained on nucleus accumbens data had poorer performance, with AUC values consistently between 0.65 and 0.7 when using random forest. These results suggest gene expression can be used to predict smoking status using traditional machine learning models. Additionally, based on our random forest model, we proposed KCNJ3 and TXLNGY as two candidate markers of smoking status. These findings, coupled with other genes identified in this study, present promising avenues for advancing applications related to the genetic foundation of smoking-related characteristics.
