Projects

Multi-omics integration and statistical modeling of BRD4-mediated transcriptional condensates identifies transcription factors linked to clinical outcomes

I collaborated with experimental biologists to investigate how acidic pH affects BRD4-mediated transcriptional condensates, immune gene regulation, and cancer-relevant regulatory programs. I integrated ATAC-seq, ChIP-seq, RNA-seq, and TCGA clinical outcome data across biological conditions, then applied statistical modeling, differential signal analysis, unsupervised clustering, transcription factor regulatory analysis, and survival analysis to connect epigenomic changes with transcriptional regulation and clinical outcome associations.

This analysis identified RELA, IRF family, and STAT family transcription factors as candidate regulators associated with BRD4-mediated, pH-sensitive transcriptional condensates in mouse macrophages. Experimental validation confirmed that these factors showed binding patterns aligned with BRD4 across pH conditions. In colon cancer, multivariable survival analysis further linked putative condensate-associated regulatory activity in tumor cells to poorer prognosis, while T cell-associated regulatory activity tended to be associated with better survival.

Source: bioRxiv | GitHub | Poster

Large-scale CTCF ChIP-seq integration and SQL-based genomic resource development

I built CTCFexplorer (www.ctcf.info), a PostgreSQL-backed genomic data resource and web platform for large-scale CTCF binding analysis. I collected, curated, and standardized 2,097 human and 1,345 mouse CTCF ChIP-seq datasets from NCBI GEO across diverse cell types, then developed a workflow for metadata curation, data cleaning, quality control, peak calling, signal processing, genome-wide binding integration, and statistical analysis.

The integrated analysis defined 531,851 high-confidence human and 297,825 mouse CTCF binding sites and identified constitutive and cell type-specific CTCF binding events at genome scale. I structured the processed data into SQL tables and built a web-accessible resource supporting search, visualization, download, data mining, and regulatory interpretation. The project also uses GitHub-based CI/CD to deploy updates automatically, which keeps the public database reproducible and maintainable.
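A minimal sketch of how binding sites could be stored in SQL and queried by genomic region. The table name, column names, rows, and the `sites_in_region` helper are hypothetical and chosen for illustration. The sketch uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a database server; it does not reproduce the actual CTCFexplorer schema.

```python
import sqlite3

# Hypothetical schema: one row per binding site, with the number of
# datasets in which a peak was called at that site.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE binding_site (
        site_id    INTEGER PRIMARY KEY,
        species    TEXT NOT NULL,
        chrom      TEXT NOT NULL,
        start_pos  INTEGER NOT NULL,
        end_pos    INTEGER NOT NULL,
        n_datasets INTEGER NOT NULL
    )
    """
)
# Index that supports region lookups within one species and chromosome.
conn.execute(
    "CREATE INDEX idx_site_region ON binding_site (species, chrom, start_pos)"
)

# Made-up rows for illustration only.
rows = [
    ("human", "chr1", 1000, 1200, 950),
    ("human", "chr1", 5000, 5150, 12),
    ("human", "chr2", 1100, 1300, 400),
    ("mouse", "chr1", 1050, 1250, 700),
]
conn.executemany(
    "INSERT INTO binding_site (species, chrom, start_pos, end_pos, n_datasets) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)


def sites_in_region(conn, species, chrom, start, end):
    """Return sites that overlap the half-open interval [start, end)."""
    query = """
        SELECT chrom, start_pos, end_pos, n_datasets
        FROM binding_site
        WHERE species = ? AND chrom = ?
          AND start_pos < ? AND end_pos > ?
        ORDER BY start_pos
    """
    return conn.execute(query, (species, chrom, end, start)).fetchall()


hits = sites_in_region(conn, "human", "chr1", 900, 2000)
```

Two intervals overlap when each one starts before the other ends, which is what the `start_pos < ? AND end_pos > ?` condition encodes. A column such as `n_datasets` is one possible way to separate constitutive sites from cell type-specific ones with a simple threshold.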

Source: ctcf.info | Cancer Research | GitHub | Poster

Statistical framework for transcription factor clustering and disease-relevant regulatory mechanism analysis

I developed a quantitative framework to measure the genomic clustering tendency of transcription factor motifs and ChIP-seq binding profiles at genome scale. The analysis covered 571 human transcription factor sequence motifs and 6,650 transcription factor ChIP-seq profiles across diverse cell types, enabling systematic comparison of how transcription factors cluster in regulatory regions and how these patterns relate to super-enhancer biology.

I integrated super-enhancer annotations, ATAC-seq, Hi-C, LLPS-related protein properties, ICGC mutation data, and TCGA clinical outcome data to connect clustered transcription factor binding with chromatin accessibility, chromatin interactions, transcriptional condensate biology, cancer mutation patterns, and patient survival. This work identified clustered transcription factor binding as a data-driven feature associated with cell-type-specific super-enhancers, active regulatory regions, and cancer-relevant transcriptional mechanisms.

Source: Nucleic Acids Research | GitHub

Machine learning framework for cell-type-specific 3D chromatin structure analysis

I developed an end-to-end supervised machine learning workflow for single-cell spatial chromatin tracing data from mouse cortex. The workflow included data cleaning, coordinate imputation, Delaunay tessellation-based 3D feature engineering, random forest classification, 5-fold cross-validation, learning-curve analysis, visualization, and biological interpretation.

I converted 39,735 single-cell chromatin traces into 11,628 structural edge features and achieved approximately 0.8 classification accuracy for major cell-type groups, demonstrating that 3D chromatin architecture contains cell-type-specific structural signals. The project shows how topology-based feature design and supervised learning can extract interpretable biological patterns from high-dimensional spatial genome organization data.
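A minimal sketch of Delaunay edge features feeding a random forest with 5-fold cross-validation, using SciPy and scikit-learn. The encoding is an assumption: each trace becomes a binary vector over all locus pairs, set to 1 when the pair shares a Delaunay edge. With n loci this gives n(n-1)/2 features, and n = 153 would give 11,628, which matches the feature count above, although the project's actual encoding may differ. The traces, labels, and the `delaunay_edge_features` helper are synthetic or hypothetical.

```python
import numpy as np
from scipy.spatial import Delaunay
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def delaunay_edge_features(coords):
    """Binary vector over all locus pairs: 1 if the pair shares a Delaunay edge."""
    n = len(coords)
    tri = Delaunay(coords)  # 3D input yields tetrahedra (4 vertex indices each)
    adjacency = np.zeros((n, n), dtype=bool)
    for simplex in tri.simplices:
        for i in simplex:
            for j in simplex:
                if i < j:
                    adjacency[i, j] = True
    # Flatten the upper triangle so every trace has the same feature layout.
    return adjacency[np.triu_indices(n, k=1)].astype(np.uint8)


# Synthetic stand-in data: each trace is a 3D random walk over n_loci loci.
# Class 1 traces are stretched along x so the two classes differ in shape.
rng = np.random.default_rng(0)
n_traces, n_loci = 60, 12
labels = np.repeat([0, 1], n_traces // 2)

features = []
for label in labels:
    coords = np.cumsum(rng.normal(size=(n_loci, 3)), axis=0)
    if label == 1:
        coords[:, 0] *= 3.0
    features.append(delaunay_edge_features(coords))
features = np.vstack(features)

model = RandomForestClassifier(n_estimators=100, random_state=0)
# For classifiers, an integer cv uses stratified folds.
scores = cross_val_score(model, features, labels, cv=5)
```

Real traces contain missing coordinates, so an imputation step would need to run before the tessellation. The accuracy on this synthetic data says nothing about the accuracy reported above.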

Source: GitHub

Protein sequence-structure-function modeling using topology-based computational methods

I built a curated structural protein dataset of 10,220 nonredundant protein chains with codon, amino acid, secondary-structure, and Cα-coordinate annotations. I developed Delaunay tessellation-based topology features and amino-acid/codon-level statistical potentials to model relationships among synonymous codons, protein topology, secondary structure, and function.
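A minimal sketch of one common way to define a four-body statistical potential over Delaunay quadruplets: the log-likelihood ratio log10(f_obs / p_exp), where f_obs is the observed frequency of a residue quadruplet and p_exp is its expected frequency under independent residue composition. This form is an assumption here, not necessarily the potential used in the project. The `quadruplet_potential` helper, the three-letter alphabet, and the input data are hypothetical. The same calculation would apply at codon level by replacing residue labels with codon labels.

```python
import math
from collections import Counter


def quadruplet_potential(quadruplets, residue_freq):
    """Score each unordered quadruplet type by log10(observed / expected)."""
    # Sort residues so that order within a tetrahedron does not matter.
    counts = Counter(tuple(sorted(q)) for q in quadruplets)
    total = sum(counts.values())
    potential = {}
    for quad, n_obs in counts.items():
        f_obs = n_obs / total
        # Multinomial coefficient 4! / (t1! * t2! * ...) counts the orderings
        # of a quadruplet whose residue types occur t1, t2, ... times.
        multiplicity = math.factorial(4)
        p_single_order = 1.0
        for residue, t in Counter(quad).items():
            multiplicity //= math.factorial(t)
            p_single_order *= residue_freq[residue] ** t
        p_exp = multiplicity * p_single_order
        potential[quad] = math.log10(f_obs / p_exp)
    return potential


# Made-up inputs: in practice the quadruplets would come from the vertices
# of Delaunay tetrahedra built on C-alpha coordinates.
residue_freq = {"A": 0.5, "L": 0.25, "G": 0.25}
quadruplets = [
    ("A", "A", "L", "G"),
    ("G", "A", "L", "A"),
    ("A", "A", "A", "A"),
    ("L", "L", "G", "G"),
]
potential = quadruplet_potential(quadruplets, residue_freq)
```

A positive score means the quadruplet occurs more often than composition alone predicts, and a negative score means it occurs less often. Summing scores over the tetrahedra that contain a residue gives a per-residue profile, and comparing that profile before and after a substitution is one way to frame computational mutagenesis.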

I applied computational mutagenesis, statistical testing, and machine learning to assess cancer-associated synonymous mutations and alpha-helix structural patterns. This work extended protein topology modeling from amino-acid-level analysis to codon-level interpretation and provided a quantitative framework for evaluating how sequence variation may influence protein structure and function.

Source: GitHub