GR2 Technical Report

Yufei Li, Zaiwei Zhang, Mingfu Liang, Kavosh Asadi, Jay Xu, Jimmy Kim, Chongyang Bai, Jieyi Zhang, Hongye Xie, Prachi Agrawal, Dian Yu, Tianyi Chen, Jean-Pascal Billaud, Garret Buell, YK Zhu, Sachin Patil, Brooke Bian, Zhou Fang, Kevin Huang, Shiva Sudanagunta, Yuzhen Huang, Emma Lu, Chris O'Brien, Yang Song, Lihong Li, Jacob Tao, Zhicheng Zhu, Chao Li, Gaoxiang Liu, Neil Wu, Zhongyin Hu, Li Han, Loki Chen, Ming Lei, Greg Rehm, Siyuan Song, Tianwei Zhang, Li Li, Ketan Singh, Yavuz Yetim, Ilyas Atishev, Satendra Gera, Ashkan Sadeghi, Rachel Yan, Nikko Mizutani, Shuaiwen Wang, Song Yang, Zhijing Li, Jiang Liu, Mengying Sun, Fei Tian, Xiaohan Wei, Chonglin Sun, Parish Aggarwal, Kaushik Rangadurai, Zhi Hua, Frank Shyu, Ruchit Sharma, Liyuan Li, Shike Mei, Wenlin Chen, Santanu Kolay, Ben Schulte, Deepak Chandra, Adam Song, Sandeep Pandey, Xi Liu, Hamed Firooz, Luke Simon

Industrial recommendation systems serve billions of users through a multi-stage funnel -- retrieval, early-stage ranking, and re-ranking -- where the final re-ranking step disproportionately shapes user engagement and downstream performance, particularly for carousel and grid display formats. Despite growing enthusiasm for Large Language Models (LLMs) in recommendation, three gaps hinder industrial adoption: (1) most efforts target retrieval and ranking, leaving re-ranking -- the stage closest to the final user experience -- largely underexplored; (2) LLMs are typically deployed zero-shot or via supervised fine-tuning, underutilizing the reasoning capabilities unlocked by reinforcement learning (RL) on verifiable rewards; (3) deployed catalogs index billions of items with non-semantic identifiers that lie outside any base-LLM vocabulary. We present GR2 (Generative Reasoning Re-Ranker), an end-to-end framework that combines (i) mid-training on semantic IDs produced by a tokenizer with >=99% uniqueness, (ii) reasoning-trace distilled from a stronger teacher via targeted prompting and rejection sampling, and (iii) RL with verifiable rewards purpose-built for re-ranking. To make GR2 resource-viable, we further (iv) introduce a context compressor that amortizes training cost, On-Policy Distillation (OPD) as a scalable alternative to SFT -- which we find collapses at industrial scale -- and reasoning distillation for low-latency serving. GR2 delivers +18.7% R@1, +7.1% R@3, and +9.6% N@3 over legacy baselines on industrial-scale traffic. We further find that reward design is critical in re-ranking: LLMs often hack rewards by preserving the incoming order or exploiting position bias, motivating conditional verifiable rewards as essential industrial components.

Open Source

Research Brief

GR2 is an end-to-end Generative Reasoning Re-Ranker framework that leverages mid-training, reasoning distillation, and reinforcement learning with verifiable rewards to significantly improve industrial-scale recommendation re-ranking performance.

Industrial recommendation systems rely heavily on the final re-ranking stage to engage users, especially for carousel and grid displays. Current Large Language Model (LLM) applications in this domain face three key challenges: the re-ranking stage is largely unexplored, LLM reasoning is underused because models are deployed zero-shot or with supervised fine-tuning (SFT) rather than trained with Reinforcement Learning (RL) on verifiable rewards, and industrial catalogs use non-semantic item identifiers that lie outside any base-LLM vocabulary. To address these, GR2 (Generative Reasoning Re-Ranker) combines three components: (i) mid-training on semantic IDs produced by a tokenizer with >=99% uniqueness, (ii) reasoning traces distilled from a stronger teacher model via targeted prompting and rejection sampling, and (iii) RL with verifiable rewards designed specifically for re-ranking. For resource efficiency, GR2 also adds a context compressor that amortizes training cost, On-Policy Distillation (OPD) as a scalable alternative to SFT (which the authors report collapses at industrial scale), and reasoning distillation for low-latency serving. The paper reports gains of +18.7% R@1, +7.1% R@3, and +9.6% N@3 over legacy baselines on industrial-scale traffic. A critical finding is that reward design matters: LLMs often hack rewards by preserving the incoming order or exploiting position bias, which motivates conditional verifiable rewards.

Potential Applications

• E-commerce product recommendation carousels and grids, where item order significantly impacts sales.
• Content feed personalization (e.g., news articles, videos, social media posts) on platforms like YouTube, TikTok, or Instagram.
• Discovery interfaces for services or applications, improving the relevance and engagement of displayed options.
• Advertising platforms, optimizing the sequencing of ad placements within user interfaces.

Paper Trustworthiness Index

30 / 100

High Skepticism / Self-Published

This document should be treated with critical skepticism. It contains unverified scientific claims or was self-published.

Verified AI Assessment: This credibility analysis was generated by Gemini 2.5 Flash analyzing the full paper text, references, and metadata.

Core Pillars Breakdown

Author & Institutional Track Record

0 / 25

The abstract does not provide any information about the authors or their institutional affiliations, making it impossible to assess their track record or prestige.

Technical Rigor & Methodology

25 / 30

The paper outlines an end-to-end framework with multiple advanced components (mid-training, reasoning distillation, RL with verifiable rewards, context compression, OPD). It identifies specific industrial gaps, reports quantitative performance metrics against legacy baselines, and discusses insights on reward design, suggesting strong technical rigor and experimentation.

Reproducibility & Openness

0 / 25

The abstract makes no mention of open-sourcing code, datasets, or pre-trained weights, nor does it provide any links to repositories or supplementary materials for reproducibility.

Community Vetting & Peer Review

5 / 20

The paper is described as a 'Technical Report,' which typically indicates a preprint or an internal document, and it does not state that it was peer-reviewed by a recognized conference or journal. This limits the initial community vetting score.
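
As a consistency check, the four pillar scores above can be summed and compared with the headline index. The sketch below is a minimal illustration; the dictionary keys are shorthand labels introduced here, not names used by the scoring tool.

```python
# Pillar scores as (awarded, maximum), copied from the breakdown above.
# The key names are shorthand invented for this sketch.
pillars = {
    "author_track_record": (0, 25),
    "technical_rigor": (25, 30),
    "reproducibility": (0, 25),
    "community_vetting": (5, 20),
}

total = sum(awarded for awarded, _ in pillars.values())
maximum = sum(cap for _, cap in pillars.values())
print(f"{total}/{maximum}")  # prints "30/100"
```

The sum matches the reported index, so the headline figure appears to be a plain unweighted total of the pillars.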

Detailed Evidence Assessment

Verified Evidence & Citations

GR2 combines mid-training, reasoning-trace distillation, and RL with verifiable rewards.

“combines (i) mid-training on semantic IDs produced by a tokenizer with >=99% uniqueness, (ii) reasoning-trace distilled from a stronger teacher via targeted prompting and rejection sampling, and (iii) RL with verifiable rewards purpose-built for re-ranking.”
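
The abstract does not say how ">=99% uniqueness" is measured. One plausible reading, assumed here, is the fraction of catalog items whose semantic-ID token tuple is shared with no other item. The function name `uniqueness_rate` and the toy catalog are hypothetical and only illustrate that reading.

```python
from collections import Counter

def uniqueness_rate(semantic_ids):
    """Fraction of items whose semantic-ID tuple is shared with no other item.

    Assumed definition: the source abstract does not specify the metric.
    """
    ids = [tuple(s) for s in semantic_ids]
    if not ids:
        return 0.0
    counts = Counter(ids)
    unique_items = sum(1 for s in ids if counts[s] == 1)
    return unique_items / len(ids)

# Toy catalog: each item is a short tuple of codebook indices.
catalog = [
    (12, 7, 3),
    (12, 7, 4),
    (5, 1, 9),
    (5, 1, 9),  # collides with the previous item
]
rate = uniqueness_rate(catalog)
print(rate)  # 0.5: two of the four items have a tuple no other item shares
```

Under this reading, a collision means two distinct items become indistinguishable to the model, which is why a high uniqueness rate matters for a generative re-ranker that emits item IDs token by token.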

GR2 introduces a context compressor, On-Policy Distillation (OPD), and reasoning distillation for resource viability and serving.

“To make GR2 resource-viable, we further (iv) introduce a context compressor that amortizes training cost, On-Policy Distillation (OPD) as a scalable alternative to SFT -- which we find collapses at industrial scale -- and reasoning distillation for low-latency serving.”

GR2 achieves significant performance gains over legacy baselines.

“GR2 delivers +18.7% R@1, +7.1% R@3, and +9.6% N@3 over legacy baselines on industrial-scale traffic.”
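
The abstract does not expand the abbreviations R@k and N@k. The sketch below assumes they stand for Recall@k and NDCG@k with binary relevance, which is a common convention but is not confirmed by the source. The function names and the toy lists are illustrative only.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Share of relevant items that appear in the top-k of the ranked list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: DCG of the top-k divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

incoming = ["a", "b", "c", "d"]   # order produced by the upstream ranker
reranked = ["c", "a", "b", "d"]   # order after a hypothetical re-ranker
clicked = {"c"}                   # the single relevant item

print(recall_at_k(incoming, clicked, 1), recall_at_k(reranked, clicked, 1))
print(ndcg_at_k(incoming, clicked, 3), ndcg_at_k(reranked, clicked, 3))
```

In this toy case, moving the clicked item from third to first position raises Recall@1 from 0 to 1 and NDCG@3 from 0.5 to 1. The abstract does not state whether its reported gains are relative or absolute differences in such metrics.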

Reward design is critical in re-ranking, as LLMs can hack rewards by preserving the incoming order or exploiting position bias.

“We further find that reward design is critical in re-ranking: LLMs often hack rewards by preserving the incoming order or exploiting position bias, motivating conditional verifiable rewards as essential industrial components.”
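
The abstract does not define "conditional verifiable rewards," so the following is not the paper's reward. It is a hypothetical sketch of the failure mode described in the quote: a naive top-k reward can be collected by a policy that simply echoes the incoming order, whereas a reward that pays only under an added condition (here, an assumed one: the output must be a valid permutation and must place the clicked item strictly higher than the incoming order did) cannot. All function names are invented for illustration.

```python
def rank_of(item, order):
    """0-based position of item in order, or len(order) if absent."""
    return order.index(item) if item in order else len(order)

def naive_reward(output, clicked, k=3):
    """Pays 1.0 whenever the clicked item lands in the top-k."""
    return 1.0 if clicked in output[:k] else 0.0

def conditional_reward(output, incoming, clicked, k=3):
    """Hypothetical conditional reward (an assumption, not the paper's).

    Pays only if the output is a valid permutation of the incoming list
    and the clicked item is in the top-k, strictly above its incoming rank.
    """
    if sorted(output) != sorted(incoming):
        return 0.0  # dropped, duplicated, or invented items
    improved = rank_of(clicked, output) < rank_of(clicked, incoming)
    return 1.0 if improved and clicked in output[:k] else 0.0

incoming = ["a", "b", "c", "d"]
echo = list(incoming)           # policy that copies the incoming order
rerank = ["b", "a", "c", "d"]   # policy that promotes the clicked item
clicked = "b"

print(naive_reward(echo, clicked))                    # 1.0: echoing is rewarded
print(conditional_reward(echo, incoming, clicked))    # 0.0: no improvement
print(conditional_reward(rerank, incoming, clicked))  # 1.0: real promotion
```

This particular condition has an obvious gap: when the incoming order already places the clicked item first, no output can earn the reward. A production design would need to handle that case, which is one reason the exact conditions matter.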

Uncertainties & Omissions

• Omission: Specific details about the dataset(s) used beyond 'industrial-scale traffic'.

• Omission: Names of the 'legacy baselines' used for comparison.

• Omission: Details on the specific RL algorithms or reward functions implemented.

• Omission: Information on the authors' affiliations, funding, or institutions.

• Omission: Links or information regarding the public availability of code, data, or models.

• Uncertainty: The exact nature and scale of the 'industrial-scale traffic' and 'legacy baselines' are not specified, making direct comparison to other research difficult.

• Uncertainty: The specific tokenizer architecture and how 'semantic IDs' are generated are not detailed.

• Uncertainty: The composition of the 'stronger teacher' model for reasoning-trace distillation is not described.

• Uncertainty: The definition and verification process of 'conditional verifiable rewards' are not elaborated.