From Pixels to Words -- Towards Native One-Vision Models at Scale Paper • 2605.28820 • Published 5 days ago • 68
LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence Paper • 2605.25979 • Published 7 days ago • 25
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning Paper • 2605.20342 • Published 13 days ago • 34
FileGram: Grounding Agent Personalization in File-System Behavioral Traces Paper • 2604.04901 • Published Apr 6 • 40
LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model Paper • 2603.01068 • Published Mar 1 • 22
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale Paper • 2510.14979 • Published Oct 16, 2025 • 70
view article Article NEO-unify: Building Native Multimodal Unified Models End to End sensenova • Mar 5 • 163
UniG2U-Bench: Do Unified Models Advance Multimodal Understanding? Paper • 2603.03241 • Published Mar 3 • 87
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling Paper • 2602.12279 • Published Feb 12 • 20
CoPE-VideoLM: Codec Primitives For Efficient Video Language Models Paper • 2602.13191 • Published Feb 13 • 32
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence Paper • 2602.08683 • Published Feb 9 • 52
GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning Paper • 2602.12099 • Published Feb 12 • 62
Innovator-VL: A Multimodal Large Language Model for Scientific Discovery Paper • 2601.19325 • Published Jan 27 • 82
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding Paper • 2601.10611 • Published Jan 15 • 35
DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset Paper • 2601.10305 • Published Jan 15 • 37