Joanna Hong
- joanna2587 [at] gmail [dot] com
- joannahong.github.io
- New York City, New York

I am a Research Scientist at Google DeepMind in New York City, working on smart dictation and voice editing for Audio Gemini. I received my Ph.D. in Electrical Engineering from KAIST, advised by Professor Yong Man Ro in the Integrated Vision Language Lab. My thesis focused on human speech understanding through multimodal representation learning and was recognized with the Outstanding Dissertation Award from the School of Electrical Engineering. My research centers on building robust and scalable speech and audio technologies for human-AI interaction, including speech enhancement, separation, and speaker diarization. I am also interested in multimodal learning that integrates audio, visual, and textual modalities to improve machine understanding. More information can be found in my Curriculum Vitae.
Work Experiences
Research Scientist, Google DeepMind
Working on the Audio Gemini Dictation team, advancing speech and audio capabilities for Audio Gemini, with a focus on voice dictation and intelligent voice editing systems.
- Manager: Dr. Quan Wang
Member of Technical Staff, Trillion Labs
Contributed to the development of Trillion-7B, a multilingual 7-billion-parameter large language model designed for practical, real-world applications. Work included optimizing the model architecture and supporting the training infrastructure, aligned with Trillion Labs’ mission to build a powerful Korean foundation model.
Research Scientist Intern
Worked on robust audio-visual representation learning under missing-modality scenarios, enabling the recovery of absent information when only a single modality (e.g., audio or video) is available.
- Manager: Dr. Anurag Kumar
- Peers: Buye Xu, Jacob Donley, Ke Tan, Honglie Chen, Sanjeel Parekh
Publications
Conferences (* indicates equal contribution)
- Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [link]
  Se Jin Park, Chae Won Kim, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeonghun Yeo, and Yong Man Ro
  Association for Computational Linguistics (ACL), 2024 (Oral)
- Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model [link]
  Joanna Hong, Se Jin Park, and Yong Man Ro
  Findings of the Association for Computational Linguistics: EMNLP, 2023
- DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding [link]
  Jeongsoo Choi*, Joanna Hong*, and Yong Man Ro
  IEEE/CVF International Conference on Computer Vision (ICCV), 2023
- Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring [link]
  Joanna Hong*, Minsu Kim*, Jeongsoo Choi, and Yong Man Ro
  IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
- Lip-to-Speech Synthesis in the Wild with Multi-task Learning [link]
  Minsu Kim*, Joanna Hong*, and Yong Man Ro
  IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023
- VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection [link]
  Joanna Hong*, Minsu Kim, and Yong Man Ro
  European Conference on Computer Vision (ECCV), 2022
- Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition [link]
  Joanna Hong*, Minsu Kim*, Daehun Yoo, and Yong Man Ro
  Interspeech, 2022 (Oral)
- SyncTalkFace: Talking Face Generation with Precise Lip-syncing via Audio-Lip Memory [link]
  Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, and Yong Man Ro
  AAAI Conference on Artificial Intelligence (AAAI), 2022 (Oral)
- Lip to Speech Synthesis with Visual Context Attentional GAN [link]
  Minsu Kim, Joanna Hong, and Yong Man Ro
  Conference on Neural Information Processing Systems (NeurIPS), 2021
- Multi-Modality Associative Bridging Through Memory: Speech Sound Recollected From Face Video [link]
  Minsu Kim*, Joanna Hong*, Se Jin Park, and Yong Man Ro
  IEEE/CVF International Conference on Computer Vision (ICCV), 2021
- Unsupervised Disentangling of Viewpoint and Residues Variations by Substituting Representations for Robust Face Recognition [link]
  Minsu Kim, Joanna Hong, Junho Kim, Hong Joo Lee, and Yong Man Ro
  International Conference on Pattern Recognition (ICPR), 2021
- Comprehensive Facial Expression Synthesis Using Human-Interpretable Language [link]
  Joanna Hong, Jung Uk Kim, Sangmin Lee, and Yong Man Ro
  IEEE International Conference on Image Processing (ICIP), 2020
- Face Tells Detailed Expression: Generating Comprehensive Facial Expression Sentence Through Facial Action Units [link]
  Joanna Hong, Hong Joo Lee, Yelin Kim, and Yong Man Ro
  International Conference on Multimedia Modeling (MMM), 2020
Journals
- Speech Reconstruction with Reminiscent Sound via Visual Voice Memory [link]
  Joanna Hong, Minsu Kim, Se Jin Park, and Yong Man Ro
  IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2021
- CroMM-VSR: Cross-Modal Memory Augmented Visual Speech Recognition [link]
  Minsu Kim, Joanna Hong, Se Jin Park, and Yong Man Ro
  IEEE Transactions on Multimedia (TMM), 2021