Hi! I am Jennifer. I study Engineering Science at the University of Toronto. During this program, I have studied multiple engineering disciplines, built programming skills, and explored a range of interdisciplinary topics. Along the way, I developed a strong interest in data science and machine learning. My current research focuses on large foundation model alignment, reinforcement learning, and agent-human alignment (multi-agent interactions and alignment). I am deeply interested in using AI to solve real-world problems, and I am particularly fascinated by its potential to enhance human capabilities and to serve social good. I am fortunate to be guided by Professor Julian McAuley, Professor Han Liu, Professor William Cunningham, and Professor Luis Seco on various research projects.
In my free time, I enjoy outdoor activities and all kinds of sports. I also love exploring new technologies: you can see my blog posts here. If you'd like to chat, feel free to shoot me an email!
Cell-JEPA: Latent Representation Learning for Single-Cell Transcriptomics
WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization
Read More
Group-Relative Policy Optimization (GRPO) has emerged as an effective approach for training language models on complex reasoning tasks by normalizing rewards within groups of rollouts. However, GRPO's group-relative advantage estimation critically depends on dense step-wise reward signals throughout the reasoning process. In practice, obtaining such dense supervision requires expensive human annotations of intermediate reasoning steps or carefully designed step-wise reward functions. This creates a significant challenge specific to group-relative methods: while GRPO performs best with dense intermediate feedback, real-world scenarios often provide only sparse outcome supervision, such as final answer correctness or binary trajectory labels. We propose Weakly-Supervised Group-Relative Policy Optimization (WS-GRPO), which addresses this limitation by learning to extract dense preference signals from sparse outcome supervision while preserving GRPO's group-relative normalization benefits. WS-GRPO operates in two phases: first, it trains a preference model to distinguish between successful and unsuccessful reasoning patterns using only trajectory-level outcomes; second, it leverages this learned preference model to provide step-wise weakly-supervised rewards that are combined with sparse terminal rewards during group-relative policy optimization. By treating consecutive partial trajectories as preference pairs, our method generates dense feedback signals that complement GRPO's group normalization mechanism without requiring step-by-step human annotations. Theoretically, we provide guarantees for WS-GRPO establishing preference-model consistency under trajectory-level supervision, policy robustness to preference errors with controllable degradation rates, and generalization bounds that decompose error sources across policy learning, preference modeling, and their interaction.
Our experiments on reasoning benchmarks demonstrate that WS-GRPO achieves competitive performance using only weak supervision, making group-relative policy optimization practical when detailed process supervision is limited.
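To make the two ingredients of the abstract concrete, here is a minimal sketch of (a) GRPO-style group-relative advantage normalization and (b) blending weak step-wise preference scores with a sparse terminal reward. The function names and the mixing weight `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each rollout's reward within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma < 1e-8:  # all rollouts tied: no relative learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

def blended_step_rewards(step_scores, terminal_reward, alpha=0.5):
    """Combine weak step-wise preference scores with a sparse terminal reward.

    step_scores come from the learned preference model; alpha is an
    illustrative mixing weight, not a value from the paper.
    """
    rewards = [alpha * s for s in step_scores]
    rewards[-1] += (1.0 - alpha) * terminal_reward  # terminal reward lands on the last step
    return rewards
```

Normalizing within the group means only relative quality matters: tied rollouts yield zero advantage, which is exactly why dense step-wise signals help differentiate otherwise-identical sparse outcomes.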
SceneAlign: Aligning Multimodal Reasoning to Scene Graphs in Complex Visual Scenes
From Verifiable Rewards to Policy Learning: A Survey of Reinforcement Learning from Verifiable Rewards
Read More
Reinforcement Learning from Verifiable Rewards (RLVR) trains language model policies using feedback from verifiers such as exact-match checkers, execution-based tests, or constraint validators. Despite its success across mathematical reasoning, code generation, and instruction following, the RLVR literature still lacks unified terminology and systematic organization. This survey provides the first comprehensive systematization of RLVR methods. We formalize theoretical foundations and introduce three orthogonal taxonomies: (1) when verification is applied (terminal, step-level, trajectory-level); (2) how rewards are computed (human ground truth, model judges, programmatic oracles, environment feedback); and (3) which policy-learning regime is used (on-policy, hybrid). By establishing comprehensive taxonomies and consolidating evaluation resources, this survey provides a foundation for systematic comparison, informed method selection, and identification of open challenges in RLVR research.
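As a concrete illustration of the verifier types named above, here is a minimal sketch of two terminal, programmatic verifiers: an exact-match checker and an execution-based test harness. The function names and the pass-fraction reward are illustrative assumptions, not a specific method from the survey.

```python
def exact_match_reward(prediction, reference):
    """Programmatic-oracle verifier: 1.0 iff the final answer matches exactly."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0

def unit_test_reward(code_str, tests):
    """Execution-based verifier: fraction of unit tests the generated code passes."""
    namespace = {}
    try:
        exec(code_str, namespace)  # run the model-generated code
    except Exception:
        return 0.0                 # code that does not even execute earns nothing
    passed = 0
    for test in tests:
        try:
            test(namespace)        # each test asserts on the defined functions
            passed += 1
        except Exception:
            pass
    return passed / len(tests)
```

Both verifiers return a scalar reward computed purely from the model's output, which is what makes the supervision "verifiable" rather than learned or human-annotated.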
Realistic CDSS Drug Dosing with End-to-end Recurrent Q-learning for Dual Vasopressor Control
ADPv2: A Hierarchical Histological Tissue Type-Annotated Dataset for Potential Biomarker Discovery of Colorectal Disease
Basics
Name: Yuntong (Jennifer) Zhang
Label: 4th year Undergrad in Engineering Science at University of Toronto
Email: jenniferyt.zhang@mail.utoronto.ca
Summary: Interested in foundation models, RL, AI for society.
Education
BASc in Engineering Science
University of Toronto
Graduated with High Distinction.
Work
Total Fund Risk Analyst Intern
Ontario Teachers' Pension Plan
- Developed novel statistical algorithms for financial risk calculations such as VaR and Expected Tail Loss.
- Researched and implemented LLM multi-agent systems for chatbot Question-Answering with in-house large financial datasets.
- Enhanced and maintained the current risk calculation pipeline, including data extraction, risk scenario generation, risk calculation, and risk dashboarding.
#LangChain, #Python, #SQL, #Julia, #CSharp, #Flask
Business Analyst Intern
EQ Bank
- Conducted comprehensive analysis of the bank's financial reconciliation process, identifying key bottlenecks and inefficiencies.
- Developed an automated reconciliation system for financial reporting, streamlining the workflow and reducing manual errors.
- Designed and implemented an ETL pipeline for financial data, improving data consistency and reducing data redundancy by 30%.
#Python, #SQL, #Tableau, #VBA, #Excel
Data Engineering Intern
National Research Council
- Processed and analyzed large-scale geographical data to investigate factors contributing to fire incidents along railroads.
- Developed and compared machine learning models to predict active fire locations, achieving an accuracy of 89%.
- Created an interactive Tableau dashboard to visualize predicted high-risk fire areas along railroad routes, providing actionable insights for preventive measures.
- Collaborated with a cross-functional team to integrate machine learning models into the existing data pipeline, enhancing the overall efficiency of the data processing workflow.
#scikit-learn, #Optuna, #geopandas, #Python, #CSharp, #Tableau, #SQL
Skills
- Programming: Python, SQL, C, C++, Julia, HTML, CSS, JavaScript
- Deep Learning Programming: PyTorch, JAX
- Machine Learning:
- Deep Neural Networks
- Computer Vision
- Reinforcement Learning
- Self-supervised Learning
- Representation Learning
- Foundation Models
Interests
- Visual Reasoning of Large Foundation Models
- Reinforcement Learning
- AI Alignment
- AI and Society