Blaise Cruz


Mabuhay! 👋

I’m a PhD student at MBZUAI supervised by Dr. Alham Fikri Aji and Prof. Thamar Solorio specializing in problems at the intersection of Multilinguality and Low-resource Languages.

In particular, I am interested in understanding how models behave in low-resource multilingual settings. I’ve collaborated with many talented colleagues on various topics under this umbrella, including:

  • Code Switching – Multilingual speakers naturally code-switch between two or more languages when speaking to peers, but multilingual models still struggle to understand and produce code-switched text.
  • Resources & Evaluation – More data is often the best remedy for “very little data”. In addition to working on Filipino resources, I have also done work for Southeast Asian Languages and beyond.
  • Applications in Low-resource – Employing creative techniques to improve performance in tasks such as Multilingual Translation, Question Generation, Fake News Detection, and more – all constrained under low-resource settings.

Prior to my PhD, I was Lead Research Engineer at Samsung Research in the Philippines where I worked on low-resource machine translation and dialogue generation. I have also been previously affiliated with the University of the Philippines, De La Salle University, and Senti AI.

If you’re interested in collaborating or if you want to chat about low-resource languages, feel free to get in touch! You may reach me by email at me (at) blaisecruz (dot) com.


News

Jan 26, 2025 Two papers, World Cuisines and Thank You, Stingray, have been accepted to NAACL 2025!
Dec 01, 2024 We are proud to announce the creation of the ACL Special Interest Group on Southeast Asian NLP (ACL SIGSEA)!
Oct 18, 2024 We release World Cuisines, a massive multilingual and multicultural VQA benchmark dataset. A preprint is available.
Sep 25, 2024 CVQA has been accepted as an oral presentation at NeurIPS 2024!
Sep 09, 2024 One paper at EMNLP and two papers at WMT 2024 have been accepted!

Latest posts

Jun 12, 2024 Welcome!

Selected Publications

  1. EMNLP
    SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
    Holy Lovenia†, Rahmad Mahendra†, Salsabil Maulana Akbar†, Lester James V. Miranda†, Jennifer Santoso†, Elyanah Aco†, Akhdan Fadhilah†, Jonibek Mansurov†, Joseph Marvin Imperial†, Onno P. Kampman†, Joel Ruben Antony Moniz†, Muhammad Ravi Shulthan Habibi†, Frederikus Hudi†, Railey Montalan†, Ryan Ignatius, and 46 more authors
    In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
  2. EMNLP
    Oral Presentation
    Multilingual Large Language Models Are Not (Yet) Code-Switchers
    Ruochen Zhang*, Samuel Cahyawijaya*, Jan Christian Blaise Cruz*, Genta Indra Winata*, and Alham Fikri Aji*
    In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
  3. LREC
    Improving Large-scale Language Models and Resources for Filipino
    Jan Christian Blaise Cruz and Charibeth Cheng
    In Proceedings of the 13th Language Resources and Evaluation Conference (LREC), 2022