Nice to meet you!

I am a researcher at Google DeepMind, specializing in multimodal machine learning with a primary focus on audio, post-training, and reinforcement learning. I am a core contributor to the development and delivery of the Gemini 2.5 series, advancing state-of-the-art audio generation.

Research Interests

  • Machine Learning: Deep learning, large-scale transformer architectures, interpretable and explainable AI, model efficiency and optimization
  • Natural Language Processing: Sequence modeling, attention mechanisms, language understanding, multilingual and cross-lingual models, generative language models
  • Audio & Speech: Audio tokenization, modality alignment, speech/sound understanding, audio generation, and multimodal learning

Publications

  • Lei, Zhihong, Xingyu Na, Mingbin Xu, Ernest Pusateri, Christophe Van Gysel, Yuanyuan Zhang, Shiyi Han, and Zhen Huang.
    Contextualization of ASR with LLM using phonetic retrieval-based augmentation.
    In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5. IEEE, 2025.

  • Han, Shiyi, Mingbin Xu, Zhihong Lei, Zhen Huang, and Xingyu Na.
    Enhancing CTC-based speech recognition with diverse modeling units.
    In Proc. Interspeech 2024, pp. 4583-4587. 2024.

  • Lei, Zhihong, Ernest Pusateri, Shiyi Han, Leo Liu, Mingbin Xu, Tim Ng, Ruchir Travadi et al.
    Personalization of CTC-based end-to-end speech recognition using pronunciation-driven subword tokenization.
    In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10096-10100. IEEE, 2024.

  • Xu, Mingbin, Alex Jin, Sicheng Wang, Mu Su, Tim Ng, Henry Mason, Shiyi Han et al.
    Conformer-Based Speech Recognition On Extreme Edge-Computing Devices.
    In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pp. 131-139. 2024.

  • Xu, Mingbin, Congzheng Song, Ye Tian, Neha Agrawal, Filip Granqvist, Rogier van Dalen, Xiao Zhang et al.
    Training large-vocabulary neural language models by private federated learning for resource-constrained devices.
    In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5. IEEE, 2023.

  • Lei, Zhihong, Mingbin Xu, Shiyi Han, Leo Liu, Zhen Huang, Tim Ng, Yuanyuan Zhang et al.
    Acoustic model fusion for end-to-end speech recognition.
    In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 1-7. IEEE, 2023.

  • Yang, Cheng, Maosong Sun, Haoran Liu, Shiyi Han, Zhiyuan Liu, and Huanbo Luan.
    Neural diffusion model for microscopic cascade study.
    IEEE Transactions on Knowledge and Data Engineering 33, no. 3 (2019): 1128-1139.

Past Experience

  • Researcher, Character AI
    • Worked on voice cloning and text-to-speech.
  • Senior Researcher, Apple
    • Improved on-device ASR at the model, algorithm, and system levels.
    • Projects I worked on were showcased at WWDC 2023 and 2024, featured in MKBHD’s videos, and received positive reactions; usage of our on-device ASR system grew 2 to 6 times in many locales.
  • ML Engineer, LinkedIn
    • Worked on named entity tagging and linking to resolve ambiguous entities in the knowledge graph.
  • M.S. in Computer Science, Brown University
    • Worked with a professor and a local hospital on a project using existing language models to help accelerate biology research.
    • Learned from a group of professors who were laser-focused on teaching quality. The operating systems, algorithms, and programming languages courses are some of my fondest memories.
  • ML Intern, LinkedIn
    • Fine-tuned pretrained language models for document classification.
  • B.S. in Computer Science, Beihang University
    • Took many foundational courses, such as computer architecture and compilers, that I remain grateful for.
  • Research Intern, Microsoft Research Asia
    • Applied Transformer models to language modeling tasks, but did not scale up the training data.
    • Worked on reinforcement learning for machine translation. Also learned to use Theano in a TensorFlow/PyTorch era.
    • Inspired by capsule networks and worked on layer weight compression.
  • Research Intern, Bytedance AI Lab
    • Researched recommender systems and speech-related topics.

Favorite places

So far, probably Tokyo ≈ NYC ≥ London > Bay Area ≈ Seattle. I’m a foodie, so I’m heavily biased.