Can the nucleotide content of a DNA sequence predict the sequence accessibility?

(1) St. John’s School, Houston, Texas, (2) The University of Texas MD Anderson Cancer Center, Houston, Texas

Image credit: Warren Umoh

Sequence accessibility is an important factor affecting gene expression. The accessibility, or openness, of a sequence influences the likelihood that a gene is transcribed and translated into a protein that performs functions and manifests traits. DNA, which carries the genes, is packaged as chromatin, of which there are two types: heterochromatin and euchromatin. Heterochromatin tends to be inaccessible, so its genes are often not expressed. In contrast, euchromatin is more accessible, and its genes are more readily expressed. The accessibility of a gene therefore depends on the type of chromatin it resides in, and greater accessibility means a greater likelihood of transcription and expression. Many factors may affect the accessibility of a gene. In this study, we hypothesized that the nucleotide content of a genetic sequence predicts its accessibility. Using a linear regression model, a machine learning method, we studied the relationship between nucleotide content and accessibility. DNA sequences are made up of four nucleotides: adenine, thymine, guanine, and cytosine. We compared the quantity of each of these nucleotides, either singly or in specific combinations of two nucleotides, with sequence accessibility in the K562 cell line. Of all the combinations tried, the cytosine-guanine content had the highest positive correlation with accessibility, and therefore with gene expression. This correlation allows us to better predict which genetic sequences will be more frequently expressed based solely on their nucleotide content and sequence. Predicting gene expression through machine learning algorithms promises to accelerate our understanding of the structure and function of specific gene sequences.
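As a rough illustration of the approach described above, the following Python sketch computes nucleotide content for a few sequences and fits a one-variable linear regression against accessibility. It is not the authors' code. The function names `nucleotide_fractions` and `fit_line` are hypothetical, the sequences and accessibility scores are invented toy values rather than K562 data, and a "combination of two nucleotides" is assumed here to mean the combined share of two bases in the sequence (for example, G plus C).

```python
from itertools import combinations


def nucleotide_fractions(seq):
    """Return the fraction of each base and of each unordered pair of bases."""
    seq = seq.upper()
    n = len(seq)
    single = {base: seq.count(base) / n for base in "ACGT"}
    # Assumption: a "combination" is the combined share of two bases (e.g. C + G).
    pairs = {a + b: single[a] + single[b] for a, b in combinations("ACGT", 2)}
    return {**single, **pairs}


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus Pearson r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Toy data invented for illustration only; not measurements from the study.
sequences = ["ATATATAT", "ATGCATAT", "GCGCATAT", "GCGCGCAT"]
accessibility = [0.1, 0.3, 0.5, 0.7]

# Combined cytosine + guanine share of each sequence.
cg_content = [nucleotide_fractions(s)["CG"] for s in sequences]
slope, intercept, r = fit_line(cg_content, accessibility)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")
```

Repeating the fit for each key returned by `nucleotide_fractions` and comparing the resulting correlation coefficients would mirror the comparison across single nucleotides and two-nucleotide combinations that the abstract describes.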
