Joanna Hong

Research Scientist


I am a Research Scientist at Google DeepMind in New York City, working on smart dictation and voice editing for Audio Gemini. I received my Ph.D. in Electrical Engineering from KAIST, advised by Professor Yong Man Ro in the Integrated Vision Language Lab. My thesis focused on human speech understanding through multimodal representation learning and was recognized with the Outstanding Dissertation Award from the School of Electrical Engineering. My research centers on building robust and scalable speech and audio technologies for human-AI interaction, including speech enhancement, separation, and speaker diarization. I am also interested in multimodal learning that integrates audio, visual, and textual modalities to improve machine understanding. More information can be found in my Curriculum Vitae.


Work Experience

Research Scientist

Google DeepMind | 2025 - present

Working on the Audio Gemini Dictation team, advancing speech and audio capabilities for Audio Gemini, with a focus on voice dictation and intelligent voice editing systems.

Member of Technical Staff

Trillion Labs | 2024 - 2025

Contributed to the development of Trillion-7B, a multilingual 7-billion-parameter large language model designed for practical, real-world applications. Efforts included optimizing model architecture and supporting training infrastructure, aligned with Trillion Labs’ mission to build a powerful Korean foundation model.

Research Scientist Intern

Meta Reality Labs | 2023 - 2024

Worked on robust audiovisual representation learning under missing-modality scenarios, enabling the recovery of absent information when only a single modality (e.g., audio or video) is available.

  • Manager: Dr. Anurag Kumar
  • Peers: Buye Xu, Jacob Donley, Ke Tan, Honglie Chen, Sanjeel Parekh

Publications

Conferences (* indicates equal contribution)

  • Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [link]
  • Se Jin Park, Chae Won Kim, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeonghun Yeo, and Yong Man Ro
    Association for Computational Linguistics (ACL), 2024 (Oral)

  • Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model [link]
  • Joanna Hong, Se Jin Park, and Yong Man Ro
    Findings of the Association for Computational Linguistics: EMNLP, 2023

  • DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding [link]
  • Jeongsoo Choi*, Joanna Hong*, and Yong Man Ro
    IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  • Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring [link]
  • Joanna Hong*, Minsu Kim*, Jeongsoo Choi, and Yong Man Ro
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  • Lip-to-Speech Synthesis in the Wild with Multi-task Learning [link]
  • Minsu Kim*, Joanna Hong*, and Yong Man Ro
    IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

  • VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection [link]
  • Joanna Hong*, Minsu Kim, and Yong Man Ro
    European Conference on Computer Vision (ECCV), 2022

  • Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition [link]
  • Joanna Hong*, Minsu Kim*, Daehun Yoo, and Yong Man Ro
    Interspeech, 2022 (Oral)

  • SyncTalkFace: Talking Face Generation with Precise Lip-syncing via Audio-Lip Memory [link]
  • Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, and Yong Man Ro
    AAAI Conference on Artificial Intelligence (AAAI), 2022 (Oral)

  • Lip to Speech Synthesis with Visual Context Attentional GAN [link]
  • Minsu Kim, Joanna Hong, and Yong Man Ro
    Conference on Neural Information Processing Systems (NeurIPS), 2021

  • Multi-Modality Associative Bridging Through Memory: Speech Sound Recollected From Face Video [link]
  • Minsu Kim*, Joanna Hong*, Se Jin Park, and Yong Man Ro
    IEEE/CVF International Conference on Computer Vision (ICCV), 2021

  • Unsupervised Disentangling of Viewpoint and Residues Variations by Substituting Representations for Robust Face Recognition [link]
  • Minsu Kim, Joanna Hong, Junho Kim, Hong Joo Lee, and Yong Man Ro
    International Conference on Pattern Recognition (ICPR), 2021

  • Comprehensive Facial Expression Synthesis Using Human-Interpretable Language [link]
  • Joanna Hong, Jung Uk Kim, Sangmin Lee, and Yong Man Ro
    IEEE International Conference on Image Processing (ICIP), 2020

  • Face Tells Detailed Expression: Generating Comprehensive Facial Expression Sentence Through Facial Action Units [link]
  • Joanna Hong, Hong Joo Lee, Yelin Kim, and Yong Man Ro
    International Conference on Multimedia Modeling (MMM), 2020

Journals

  • Speech Reconstruction with Reminiscent Sound via Visual Voice Memory [link]
  • Joanna Hong, Minsu Kim, Se Jin Park, and Yong Man Ro
    IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2021

  • CroMM-VSR: Cross-Modal Memory Augmented Visual Speech Recognition [link]
  • Minsu Kim, Joanna Hong, Se Jin Park, and Yong Man Ro
    IEEE Transactions on Multimedia (TMM), 2021