Hi! I am Jennifer. I study Engineering Science at the University of Toronto, where I have learned about a range of engineering disciplines, programming skills, and interdisciplinary topics, and developed a strong interest in data science and machine learning. My current research focuses on large foundation model alignment, reinforcement learning, and agent-human alignment (multi-agent interactions and alignment). I am deeply interested in using AI to solve real-world problems, and I am particularly fascinated by its potential to enhance human capabilities and serve social good. I am fortunate to be guided by Professor Julian McAuley, Professor Han Liu, Professor William Cunningham, and Professor Luis Seco on various research projects.

In my free time, I enjoy outdoor activities and all kinds of sports. I love exploring new technologies: you can see my blog posts here. If you'd like to have a chat, feel free to shoot me an email!


In-Context Algorithm Emulation in Fixed-Weight Transformers

J. Y.-C. Hu*, Hude Liu*, J. Y. Zhang*, Han Liu
ICLR 2026. * Equal contribution.

Are Hallucinations Bad Estimations?

Hude Liu*, J. Y.-C. Hu*, J. Y. Zhang, Z. Song, Han Liu
Preprint.

Cell-JEPA: Latent Representation Learning for Single-Cell Transcriptomics

A. Elsheikh*, R.-X. Wang*, W. Wu*, Y. Wen, P. Dibaeinia, J. Y. Zhang, J. Y.-C. Hu, M. Knudson, S. Badu, S.-H. Sun, A. A. Khan, H. Liu
In submission to ICML 2026.

pdf

WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization

G. Mundada, R. Surana, J. Y. Zhang, X. Li, T. Yu, L. Yao, J. Shang, J. McAuley, J. Wu
In submission to ICML 2026.


Group-Relative Policy Optimization (GRPO) has emerged as an effective approach for training language models on complex reasoning tasks by normalizing rewards within groups of rollouts. However, GRPO's group-relative advantage estimation critically depends on dense step-wise reward signals throughout the reasoning process. In practice, obtaining such dense supervision requires expensive human annotations of intermediate reasoning steps or carefully designed step-wise reward functions. This creates a challenge specific to group-relative methods: while GRPO performs best with dense intermediate feedback, real-world scenarios often provide only sparse outcome supervision, such as final-answer correctness or binary trajectory labels. We propose Weakly-Supervised Group-Relative Policy Optimization (WS-GRPO), which addresses this limitation by learning to extract dense preference signals from sparse outcome supervision while preserving GRPO's group-relative normalization benefits. WS-GRPO operates in two phases: first, it trains a preference model to distinguish between successful and unsuccessful reasoning patterns using only trajectory-level outcomes; second, it leverages this learned preference model to provide step-wise weakly-supervised rewards that are combined with sparse terminal rewards during group-relative policy optimization. By treating consecutive partial trajectories as preference pairs, our method generates dense feedback signals that complement GRPO's group normalization mechanism without requiring step-by-step human annotations. Theoretically, we provide comprehensive guarantees for WS-GRPO, establishing preference-model consistency under trajectory-level supervision, policy robustness to preference errors with controllable degradation rates, and generalization bounds that decompose error sources across policy learning, preference modeling, and their interaction. Our experiments on reasoning benchmarks demonstrate that WS-GRPO achieves competitive performance using only weak supervision, making group-relative policy optimization practical when detailed process supervision is limited.
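For intuition, here is a minimal Python sketch of the two ideas the abstract combines: blending dense preference-model scores with a sparse terminal reward, then normalizing returns within a rollout group. All function names, the mixing weight `alpha`, and the averaging scheme are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of WS-GRPO-style reward shaping; not the authors' code.
import numpy as np

def group_relative_advantages(returns: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style baseline: normalize returns within a group of rollouts."""
    return (returns - returns.mean()) / (returns.std() + eps)

def blended_return(step_prefs, terminal, alpha=0.5):
    """Blend dense per-step preference-model scores with a sparse 0/1 outcome.

    step_prefs: per-step scores from a learned preference model (assumed given).
    terminal:   trajectory-level outcome label (e.g., final-answer correctness).
    alpha:      mixing weight; a hypothetical tunable hyperparameter.
    """
    return alpha * float(np.mean(step_prefs)) + (1.0 - alpha) * terminal

# Example group of 4 rollouts for the same prompt:
# (per-step preference scores, terminal reward).
group = [
    ([0.2, 0.4, 0.7], 1.0),
    ([0.1, 0.3, 0.2], 0.0),
    ([0.5, 0.6, 0.8], 1.0),
    ([0.3, 0.2, 0.1], 0.0),
]
returns = np.array([blended_return(p, t) for p, t in group])
advantages = group_relative_advantages(returns)
print(advantages)  # zero-mean advantages within the group
```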

SceneAlign: Aligning Multimodal Reasoning to Scene Graphs in Complex Visual Scenes

C. Wang*, X. Li*, J. Y. Zhang, J. Wu, C. Huang, L. Yao, J. McAuley, J. Shang
In submission to ACL 2026.

pdf

From Verifiable Rewards to Policy Learning: A Survey of Reinforcement Learning from Verifiable Rewards

G. Mundada, R. Surana, S. Yu, J. Y. Zhang, Z. Huang, Y. Xiong, X. Li, Y. Xia, R. Jain, C. Huang, N. L. Kuang, T. Yu, R. A. Rossi, D. Zhou, L. Yao, J. Shang, J. McAuley, J. Wu
In submission to ACL 2026.


Reinforcement Learning from Verifiable Rewards (RLVR) trains language model policies using feedback from verifiers such as exact-match checkers, execution-based tests, or constraint validators. Despite its success across mathematical reasoning, code generation, and instruction following, the RLVR literature still lacks unified terminology and systematic organization. This survey provides the first comprehensive systematization of RLVR methods. We formalize theoretical foundations and introduce three orthogonal taxonomies: (1) when verification is applied (terminal, step-level, trajectory-level); (2) how rewards are computed (human ground truth, model judges, programmatic oracles, environment feedback); and (3) which policy-learning regime is used (on-policy, hybrid). By establishing comprehensive taxonomies and consolidating evaluation resources, this survey provides a foundation for systematic comparison, informed method selection, and identification of open challenges in RLVR research.
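As a toy illustration of the exact-match style of verifier the abstract mentions, here is a small Python sketch of a verifiable reward function. The answer-extraction convention (take the last number in the response) is an assumption for illustration only; real verifiers are task-specific.

```python
# Hypothetical exact-match verifiable reward; not a method from the survey.
import re

def exact_match_reward(response: str, gold_answer: str) -> float:
    """Return 1.0 if the response's extracted final answer equals the gold answer."""
    # Simplistic convention: treat the last number in the text as the final answer.
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not matches:
        return 0.0
    return 1.0 if matches[-1] == gold_answer else 0.0

print(exact_match_reward("... so the result is 42", "42"))  # 1.0
print(exact_match_reward("the answer might be 41", "42"))   # 0.0
```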

Realistic CDSS Drug Dosing with End-to-end Recurrent Q-learning for Dual Vasopressor Control

W. Y. Zou, J. Feng, A. Kalimouttou, J. Y. Zhang, C. W. Seymour, R. Pirracchio
NeurIPS 2025 Workshop Learning from Time Series for Health.

pdf

ADPv2: A Hierarchical Histological Tissue Type-Annotated Dataset for Potential Biomarker Discovery of Colorectal Disease

Z. Yang, K. Li, S. G. Ramandi, P. Brassard, H. Khellaf, V. Q-H. Trinh, J. Y. Zhang, L. Chen, C. Rowsell, S. Varma, K. Plataniotis, M. S. Hosseini
Journal of Pathology Informatics.

pdf

Methane Emission Drivers and Baseline Calculations

J. Y. Zhang, B. A. Faraz, L. A. Seco
Preprint.

Basics

Name: Yuntong (Jennifer) Zhang

Label: 4th-year Undergraduate in Engineering Science at the University of Toronto

Email: jenniferyt.zhang@mail.utoronto.ca

Summary: Interested in foundation models, RL, AI for society.

Education

2021 - 2026

BASc in Engineering Science

University of Toronto

Graduated with High Distinction.

Work

Sep 2024 - May 2025

Total Fund Risk Analyst Intern

Ontario Teachers' Pension Plan

  • Developed novel statistical algorithms for financial risk calculations such as VaR and Expected Tail Loss.
  • Researched and implemented LLM multi-agent systems for chatbot Question-Answering with in-house large financial datasets.
  • Enhanced and maintained the existing risk calculation pipeline, including data extraction, risk scenario generation, risk calculation, and risk dashboarding.

#LangChain, #Python, #SQL, #Julia, #CSharp, #Flask


May 2024 - Aug 2024

Business Analyst Intern

EQ Bank

  • Conducted comprehensive analysis of the bank's financial reconciliation process, identifying key bottlenecks and inefficiencies.
  • Developed an automated reconciliation system for financial reporting, streamlining the workflow and reducing manual errors.
  • Designed and implemented an ETL pipeline for financial data, improving data consistency and reducing data redundancy by 30%.

#Python, #SQL, #Tableau, #VBA, #Excel


May 2023 - Dec 2023

Data Engineering Intern

National Research Council

  • Processed and analyzed large-scale geographical data to investigate factors contributing to fire incidents along railroads.
  • Developed and compared machine learning models to predict active fire locations, achieving an accuracy of 89%.
  • Created an interactive Tableau dashboard to visualize predicted high-risk fire areas along railroad routes, providing actionable insights for preventive measures.
  • Collaborated with a cross-functional team to integrate machine learning models into the existing data pipeline, enhancing the overall efficiency of the data processing workflow.

#scikit-learn, #Optuna, #GeoPandas, #Python, #CSharp, #Tableau, #SQL

Skills

  • Programming: Python, SQL, C, C++, Julia, HTML, CSS, JavaScript
  • Deep Learning Programming: PyTorch, JAX
  • Machine Learning:
    • Deep Neural Networks
    • Computer Vision
    • Reinforcement Learning
    • Self-supervised Learning
    • Representation Learning
    • Foundation Models

Interests

  • Visual Reasoning of Large Foundation Models
  • Reinforcement Learning
  • AI Alignment
  • AI and Society