Browse Articles

The impact of genetic analysis on the early detection of colorectal cancer

Agrawal et al. | Aug 24, 2023

The impact of genetic analysis on the early detection of colorectal cancer

Although the 5-year survival rate for colorectal cancer is below 10%, it increases to greater than 90% if it is diagnosed early. We hypothesized from our research that analyzing non-synonymous single nucleotide variants (SNVs) in a patient's exome sequence would be an indicator for high genetic risk of developing colorectal cancer.

Read More...

Quantitative definition of chemical synthetic pathway complexity of organic compounds

Baranwal et al. | Jun 19, 2023

Quantitative definition of chemical synthetic pathway complexity of organic compounds

Irrespective of the final application of a molecule, synthetic accessibility is the rate-determining step in discovering and developing novel entities. However, synthetic complexity is challenging to quantify as a single metric, since it is a composite of several measurable metrics, some of which include cost, safety, and availability. Moreover, defining a single synthetic accessibility metric for both natural products and non-natural products poses yet another challenge given the structural distinctions between these two classes of compounds. Here, we propose a model for synthetic accessibility of all chemical compounds, inspired by the Central Limit Theorem, and devise a novel synthetic accessibility metric assessing the overall feasibility of making chemical compounds that has been fitted to a Gaussian distribution.

Read More...

A novel encoding technique to improve non-weather-based models for solar photovoltaic forecasting

Ahmed et al. | Jun 09, 2023

A novel encoding technique to improve non-weather-based models for solar photovoltaic forecasting

Several studies have applied different machine learning (ML) techniques to the area of forecasting solar photovoltaic power production. Most of these studies use weather data as inputs to predict power production; however, there are numerous practical issues with the procurement of this data. This study proposes models that do not use weather data as inputs, but rather use past power production data as a more practical substitute to weather-based models. Our proposed models demonstrate a better, cheaper, and more reliable alternatives to current weather models.

Read More...

Modeling and optimization of epidemiological control policies through reinforcement learning

Rao et al. | May 23, 2023

Modeling and optimization of epidemiological control policies through reinforcement learning

Pandemics involve the high transmission of a disease that impacts global and local health and economic patterns. Epidemiological models help propose pandemic control strategies based on non-pharmaceutical interventions such as social distancing, curfews, and lockdowns, reducing the economic impact of these restrictions. In this research, we utilized an epidemiological Susceptible, Exposed, Infected, Recovered, Deceased (SEIRD) model – a compartmental model for virtually simulating a pandemic day by day.

Read More...

Using machine learning to develop a global coral bleaching predictor

Madireddy et al. | Feb 21, 2023

Using machine learning to develop a global coral bleaching predictor
Image credit: Madireddy, Bosch, and McCalla

Coral bleaching is a fatal process that reduces coral diversity, leads to habitat loss for marine organisms, and is a symptom of climate change. This process occurs when corals expel their symbiotic dinoflagellates, algae that photosynthesize within coral tissue providing corals with glucose. Restoration efforts have attempted to repair damaged reefs; however, there are over 360,000 square miles of coral reefs worldwide, making it challenging to target conservation efforts. Thus, predicting the likelihood of bleaching in a certain region would make it easier to allocate resources for conservation efforts. We developed a machine learning model to predict global locations at risk for coral bleaching. Data obtained from the Biological and Chemical Oceanography Data Management Office consisted of various coral bleaching events and the parameters under which the bleaching occurred. Sea surface temperature, sea surface temperature anomalies, longitude, latitude, and coral depth below the surface were the features found to be most correlated to coral bleaching. Thirty-nine machine learning models were tested to determine which one most accurately used the parameters of interest to predict the percentage of corals that would be bleached. A random forest regressor model with an R-squared value of 0.25 and a root mean squared error value of 7.91 was determined to be the best model for predicting coral bleaching. In the end, the random model had a 96% accuracy in predicting the percentage of corals that would be bleached. This prediction system can make it easier for researchers and conservationists to identify coral bleaching hotspots and properly allocate resources to prevent or mitigate bleaching events.

Read More...

Gradient boosting with temporal feature extraction for modeling keystroke log data

Barretto et al. | Oct 04, 2024

Gradient boosting with temporal feature extraction for modeling keystroke log data
Image credit: Barretto and Barretto 2024.

Although there has been great progress in the field of Natural language processing (NLP) over the last few years, particularly with the development of attention-based models, less research has contributed towards modeling keystroke log data. State of the art methods handle textual data directly and while this has produced excellent results, the time complexity and resource usage are quite high for such methods. Additionally, these methods fail to incorporate the actual writing process when assessing text and instead solely focus on the content. Therefore, we proposed a framework for modeling textual data using keystroke-based features. Such methods pay attention to how a document or response was written, rather than the final text that was produced. These features are vastly different from the kind of features extracted from raw text but reveal information that is otherwise hidden. We hypothesized that pairing efficient machine learning techniques with keystroke log information should produce results comparable to transformer techniques, models which pay more or less attention to the different components of a text sequence in a far quicker time. Transformer-based methods dominate the field of NLP currently due to the strong understanding they display of natural language. We showed that models trained on keystroke log data are capable of effectively evaluating the quality of writing and do it in a significantly shorter amount of time compared to traditional methods. This is significant as it provides a necessary fast and cheap alternative to increasingly larger and slower LLMs.

Read More...

Machine learning for retinopathy prediction: Unveiling the importance of age and HbA1c with XGBoost

Ramachandran et al. | Sep 05, 2024

Machine learning for retinopathy prediction: Unveiling the importance of age and HbA1c with XGBoost

The purpose of our study was to examine the correlation of glycosylated hemoglobin (HbA1c), blood pressure (BP) readings, and lipid levels with retinopathy. Our main hypothesis was that poor glycemic control, as evident by high HbA1c levels, high blood pressure, and abnormal lipid levels, causes an increased risk of retinopathy. We identified the top two features that were most important to the model as age and HbA1c. This indicates that older patients with poor glycemic control are more likely to show presence of retinopathy.

Read More...

Part of speech distributions for Grimm versus artificially generated fairy tales

Arvind et al. | Nov 16, 2024

Part of speech distributions for Grimm versus artificially generated fairy tales
Image credit: Nayalia Y.

Here, the authors wanted to explore mathematical paradoxes in which there are multiple contradictory interpretations or analyses for a problem. They used ChatGPT to generate a novel dataset of fairy tales. They found statistical differences between the artificially generated text and human produced text based on the distribution of parts of speech elements.

Read More...