Nice to meet you!

I am a researcher at Google DeepMind, specializing in multimodal machine learning with a primary focus on audio, post-training, and reinforcement learning. I am a core contributor to the development and delivery of the Gemini 2.5 series, advancing state-of-the-art audio generation.

Research Interests

  • Machine Learning: Deep learning, large-scale transformer architectures, interpretable and explainable AI, model efficiency and optimization
  • Natural Language Processing: Sequence modeling, attention mechanisms, language understanding, multilingual and cross-lingual models, generative language models
  • Audio & Speech: Audio tokenization, modality alignment, speech/sound understanding, audio generation, and multimodal learning

Publications

  • Lei, Zhihong, Xingyu Na, Mingbin Xu, Ernest Pusateri, Christophe Van Gysel, Yuanyuan Zhang, Shiyi Han, and Zhen Huang.
    Contextualization of ASR with LLM using phonetic retrieval-based augmentation.
    In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5. IEEE, 2025.

  • Han, Shiyi, Mingbin Xu, Zhihong Lei, Zhen Huang, and Xingyu Na.
    Enhancing CTC-based speech recognition with diverse modeling units.
    In Proc. Interspeech 2024, pp. 4583-4587. 2024.

  • Lei, Zhihong, Ernest Pusateri, Shiyi Han, Leo Liu, Mingbin Xu, Tim Ng, Ruchir Travadi et al.
    Personalization of CTC-based end-to-end speech recognition using pronunciation-driven subword tokenization.
    In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10096-10100. IEEE, 2024.

  • Xu, Mingbin, Alex Jin, Sicheng Wang, Mu Su, Tim Ng, Henry Mason, Shiyi Han et al.
    Conformer-Based Speech Recognition On Extreme Edge-Computing Devices.
    In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), pp. 131-139. 2024.

  • Xu, Mingbin, Congzheng Song, Ye Tian, Neha Agrawal, Filip Granqvist, Rogier van Dalen, Xiao Zhang et al.
    Training large-vocabulary neural language models by private federated learning for resource-constrained devices.
    In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5. IEEE, 2023.

  • Lei, Zhihong, Mingbin Xu, Shiyi Han, Leo Liu, Zhen Huang, Tim Ng, Yuanyuan Zhang et al.
    Acoustic model fusion for end-to-end speech recognition.
    In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 1-7. IEEE, 2023.

  • Yang, Cheng, Maosong Sun, Haoran Liu, Shiyi Han, Zhiyuan Liu, and Huanbo Luan.
    Neural diffusion model for microscopic cascade study.
    IEEE Transactions on Knowledge and Data Engineering 33, no. 3 (2019): 1128-1139.

Past Experience

  • Researcher, Character AI
    • Worked on voice cloning and text-to-speech.
  • Senior Researcher, Apple
    • Improved on-device ASR at the model, algorithm, and system levels.
    • Projects I worked on were showcased at WWDC 2023 and 2024, featured in MKBHD’s videos, and received positive reactions; usage of our on-device ASR system grew 2 to 6 times in many locales.
  • ML Engineer, LinkedIn
    • Worked on named entity tagging and linking to resolve ambiguous entities in the knowledge graph.
  • M.S. in Computer Science, Brown University
    • Worked with a professor and a local hospital on a project using existing language models to help accelerate biology research.
    • Learned from a group of professors who were laser-focused on teaching quality. The operating systems, algorithms, and programming languages courses are some of my fondest memories.
  • ML Intern, LinkedIn
    • Fine-tuned pretrained language models for document classification.
  • B.S. in Computer Science, Beihang University
    • Took many foundational courses, such as computer architecture and compilers, that I remain grateful for.
  • Research Intern, Microsoft Research Asia
    • Applied Transformer models to language modeling tasks, but did not scale up the training data.
    • Worked on reinforcement learning for machine translation. Also learned to use Theano in a TensorFlow/PyTorch era.
    • Inspired by capsule networks and worked on layer weight compression.
  • Research Intern, Bytedance AI Lab
    • Researched recommender systems and speech-related topics.

Favorite places

So far, probably Tokyo ≈ NYC ≥ London > Bay Area ≈ Seattle. I’m a foodie, so I’m heavily biased.