Nice to meet you!
I am a researcher at Google DeepMind, specializing in multimodal machine learning with a primary focus on audio, post-training, and reinforcement learning. I am a core contributor to the development and delivery of the Gemini 2.5 series, advancing state-of-the-art audio generation.
Research Interests
- Machine Learning: Deep learning, large-scale transformer architectures, interpretable and explainable AI, model efficiency and optimization
- Natural Language Processing: Sequence modeling, attention mechanisms, language understanding, multilingual and cross-lingual models, generative language models
- Audio & Speech: Audio tokenization, modality alignment, speech/sound understanding, audio generation, and multimodal learning
Publications
- Lei, Zhihong, Xingyu Na, Mingbin Xu, Ernest Pusateri, Christophe Van Gysel, Yuanyuan Zhang, Shiyi Han, and Zhen Huang. Contextualization of ASR with LLM using phonetic retrieval-based augmentation. In 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5. IEEE, 2025.
- Han, Shiyi, Mingbin Xu, Zhihong Lei, Zhen Huang, and Xingyu Na. Enhancing CTC-based speech recognition with diverse modeling units. In Proc. Interspeech 2024, pp. 4583-4587. 2024.
- Lei, Zhihong, Ernest Pusateri, Shiyi Han, Leo Liu, Mingbin Xu, Tim Ng, Ruchir Travadi et al. Personalization of CTC-based end-to-end speech recognition using pronunciation-driven subword tokenization. In 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10096-10100. IEEE, 2024.
- Xu, Mingbin, Alex Jin, Sicheng Wang, Mu Su, Tim Ng, Henry Mason, Shiyi Han et al. Conformer-Based Speech Recognition on Extreme Edge-Computing Devices. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pp. 131-139. 2024.
- Xu, Mingbin, Congzheng Song, Ye Tian, Neha Agrawal, Filip Granqvist, Rogier van Dalen, Xiao Zhang et al. Training large-vocabulary neural language models by private federated learning for resource-constrained devices. In 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5. IEEE, 2023.
- Lei, Zhihong, Mingbin Xu, Shiyi Han, Leo Liu, Zhen Huang, Tim Ng, Yuanyuan Zhang et al. Acoustic model fusion for end-to-end speech recognition. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 1-7. IEEE, 2023.
- Yang, Cheng, Maosong Sun, Haoran Liu, Shiyi Han, Zhiyuan Liu, and Huanbo Luan. Neural diffusion model for microscopic cascade study. IEEE Transactions on Knowledge and Data Engineering 33, no. 3 (2019): 1128-1139.
Past Experience
- Researcher, Character AI
- Worked on voice cloning and text-to-speech.
- Senior Researcher, Apple
- Improved on-device ASR at the model, algorithm, and system levels.
- Projects I worked on were showcased at WWDC 2023 and 2024 and received positive mentions in MKBHD’s videos. Usage of our on-device ASR system increased 2 to 6 times in many locales.
- ML Engineer, LinkedIn
- Worked on named entity tagging and linking to resolve ambiguous entities in the knowledge graph.
- M.S. in Computer Science, Brown University
- Worked with a professor and a local hospital on a project using existing language models to help accelerate biology research.
- Learned from a group of professors who were laser-focused on teaching quality. The operating systems, algorithms, and programming languages courses are some of my fondest memories.
- ML Intern, LinkedIn
- Fine-tuned pretrained language models for document classification.
- B.S. in Computer Science, Beihang University
- Took many foundational courses, such as computer architecture and compilers, that I remain grateful for.
- Research Intern, Microsoft Research Asia
- Applied Transformer models to language modeling tasks, but did not scale up the training data.
- Worked on reinforcement learning for machine translation. Also learned to use Theano in a TensorFlow/PyTorch era.
- Inspired by capsule networks and worked on layer weight compression.
- Research Intern, Bytedance AI Lab
- Researched recommender systems and speech-related topics.
Favorite places
So far, probably Tokyo ≈ NYC ≥ London > Bay Area ≈ Seattle. I’m a foodie, so I’m quite biased.