Handling Missing Data in Cancer Cell Line Datasets

Handling Missing Data in Cancer Cell Line Datasets

Category: Research Poster

Author(s): Mussa Hassen

Presenter(s): Mussa Hassen

Mentors(s): Tianying Wang

Ignoring missing data is a critical issue across multiple disciplines in research such as epidemiology, bioinformatics, and climate science, and has garnered increasing attention in recent years. Ignoring this form of missingness or interpreting it as random can result in biased estimations and inaccurate statistical inferences, jeopardizing research goals. This project aims to analyze missing data from the Cancer Cell Line Encyclopedia (CCLE). Using the R programming language, a database was curated by compiling data from multiple sources. Cancer cell line variables were combined based on genetic characteristics and other omics data, followed by a data-cleansing process to ensure reliability. In this project, visual tools were developed to dimensionalize the exploration of data patterns, including the distribution of diseases across different groups and the relationships between age, gender, and cancer types. Missing values in the dataset were imputed using the k-nearest neighbors classification algorithm. Key outcomes of this work include the establishment of a unified system for organizing and analyzing data, the identification of patterns and causes of missing data (decoding NA values from zeros), and the creation of visual representations to enhance data interpretation. This research contributes to a foundation for future studies that use cell line datasets for predicting cancer cell responses to treatments in personalized medicine for cancer patients.