Successful Predictive Modeling of Pollen Fitness Phenotypes Is Enabled by Measures of Expression Specificity

Event Speaker
Sebastian Mueller
Ph.D. student in Bioengineering, with a Machine Learning and Computational Biology emphasis
Event Type
CBEE Seminar
Date
Event Location
Kelly 1001
Event Description

The male gametophyte of flowering plants, primarily visible as pollen, is required for sexual reproduction. It delivers sperm cells to the female gametophyte for double fertilization, which enables the subsequent development of the seed. Because the gametophyte is haploid, mutations that affect pollen function can have a quantitative phenotypic effect on pollen fitness, detectable when the mutant transmission rate differs from the expected Mendelian ratio. In maize, a large set of fluorescently marked insertional mutations, the Ds-GFP lines, provides a resource for measuring the effect of single-gene mutations in the gametophyte. The measurement is the ratio of mutant (Green Fluorescent Protein-marked) to wild-type progeny kernels in reciprocal outcrosses, which shows how each mutation affects pollen fitness. We have developed a machine learning framework that uses expression profiling data (e.g., RNA-seq) and genomic feature data (e.g., Ka/Ks ratio) provided by MaizeGDB (https://mfs.maizegdb.org/) to predict which genes contribute significantly to pollen fitness in maize. The framework is based on a pollen fitness dataset for 267 validated Ds-GFP insertions into single genes, in which mutant transmission rates were measured with a computer vision pipeline that analyzes maize ear images. Models that predict whether a gene has a high fitness effect upon mutation (vs. no fitness effect) perform well, attaining auROC values up to 90%. Successful models correctly predict 7 of 9 pollen fitness mutants previously identified in the literature. Current analyses show that RNA and protein expression data are substantial contributors to predicting the pollen fitness class. Notably, tissue specificity information is the most critical input for achieving strong model performance. Other genomic features, such as amino acid composition and distribution, Ka/Ks ratio, and measures of synteny, can also contribute to well-performing models.
Our results suggest that expression data from either RNA-seq or proteomic profiling are among the most information-rich sources for predicting phenotype from genome-scale data.
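For readers unfamiliar with the two quantities the abstract relies on, the following is a minimal illustrative sketch, not the speaker's pipeline. It assumes that a heterozygous outcross is expected to transmit the GFP marker to 50% of kernels, so a deviation can be checked with an exact two-sided binomial test, and it uses the tau index as one common measure of tissue specificity. The abstract does not say which specificity measure or statistical test the framework uses; all function names and the example counts are hypothetical.

```python
from math import exp, lgamma, log


def transmission_rate(gfp_kernels, wildtype_kernels):
    """Fraction of progeny kernels carrying the GFP-marked (mutant) allele."""
    total = gfp_kernels + wildtype_kernels
    if total == 0:
        raise ValueError("no kernels counted")
    return gfp_kernels / total


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial test of k mutant kernels out of n.

    Assumption: under Mendelian transmission from a heterozygous parent,
    p = 0.5. The p-value sums the probabilities of all outcomes that are
    no more likely than the observed one.
    """
    def pmf(i):
        # Computed in log space so large kernel counts do not overflow.
        log_coeff = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        return exp(log_coeff + i * log(p) + (n - i) * log(1 - p))

    probs = [pmf(i) for i in range(n + 1)]
    cutoff = probs[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(q for q in probs if q <= cutoff))


def tau_specificity(expression):
    """Tau index of tissue specificity for one gene.

    expression: non-negative expression values, one per tissue.
    Returns 0.0 for uniform expression and 1.0 when the gene is
    expressed in a single tissue. Returning 0.0 for a gene with no
    expression anywhere is a convention chosen for this sketch.
    """
    n = len(expression)
    if n < 2:
        raise ValueError("need at least two tissues")
    peak = max(expression)
    if peak == 0:
        return 0.0
    return sum(1 - x / peak for x in expression) / (n - 1)


if __name__ == "__main__":
    # Hypothetical counts from one ear: 120 GFP kernels, 280 wild-type.
    rate = transmission_rate(120, 280)
    p_value = binomial_two_sided_p(120, 400)
    print(f"transmission rate = {rate:.3f}, p = {p_value:.3g}")
    # Hypothetical expression profile dominated by a single tissue.
    print(f"tau = {tau_specificity([0.5, 0.2, 40.0, 0.1]):.3f}")
```

In this sketch, a transmission rate well below 0.5 with a small p-value would flag a candidate pollen fitness defect, and a tau value near 1 would mark a gene as highly tissue-specific, the kind of feature the abstract reports as most informative.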

Speaker Biography

Sebastian is a PhD student in the BioE program, with a Machine Learning and Computational Biology emphasis. He completed his BS degree in Environmental Sciences at Tulane University (first 2 years) and Oregon State University (final 2 years), and his MS degree in Environmental Sciences at Oregon State University. He is pursuing his PhD studies in the Megraw Lab in the Department of Botany and Plant Pathology at Oregon State University (Computer Science adjunct affiliation), with an emphasis on genomics, bioinformatics, machine learning, and transcriptional regulation. His first project, co-advised by Prof. John Fowler, is on maize pollen development. The project seeks to predict phenotypic pollen fitness outcomes resulting from loss-of-function mutations in specific maize genes, using only genotypic and genomic information about these genes.