WAXAL: A large-scale open resource for African language speech technology

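Summary:
Google releases WAXAL, a large-scale speech dataset for African languages, to help close the digital divide
On March 6, 2026, the Google Research team released WAXAL, a large-scale open speech dataset intended as foundational infrastructure for African language speech technology. The initial release covers 27 native languages of Sub-Saharan Africa, spoken by over 100 million people across more than 26 countries, and is published under an open license to support speech technology that is more inclusive and better matched to local linguistic characteristics.
Voice technologies such as voice assistants and automated transcription have changed how people interact with computers, but their benefits have long been concentrated in a handful of high-resource languages. In Sub-Saharan Africa, home to over 2,000 languages, hundreds of millions of people still cannot access these technologies in their native tongues. To address this, Google Research has worked since 2021 with African universities and community organizations to build the WAXAL dataset.
WAXAL has two core components. The first is approximately 1,846 hours of automatic speech recognition (ASR) data, collected by asking participants to describe images so that the recordings capture natural, unscripted speech, including tonal nuances and code-switching. The second is over 565 hours of high-fidelity text-to-speech (TTS) data recorded by local community members, some of it in custom-built studio boxes to ensure audio quality. All data is released under a Creative Commons license (CC-BY-4.0) to support research and application development.
The project follows a "by Africa, for Africa" principle: data collection was led entirely by African academic and community organizations. Key partners include Makerere University in Uganda, the University of Ghana, Digital Umuganda (in partnership with Addis Ababa University), and the African Institute for Mathematical Sciences Senegal. This model helps the data reflect how the languages are actually spoken, and partners retain ownership of the data they collected while sharing a commitment to open release.
Partner teams have already produced several research outputs within this framework: the first open-source dataset for Akan speakers with speech impairments, a 5,000-hour speech corpus covering five Ghanaian languages, a benchmark of four leading speech models across 13 African languages, and a systematic review of existing speech resources for African languages. This work offers a reference for building inclusive speech technology in low-resource settings.
The release of WAXAL marks a key step toward closing the language digital divide. Google says it will continue to expand the dataset and hopes it will serve as a vital resource for the digital preservation of African languages and a foundation for future innovation.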
English source:
WAXAL: A large-scale open resource for African language speech technology
March 6, 2026
Tavonga Siyavora, Senior Product Manager, and Abdoulaye Diack, Program Manager, Google Research
WAXAL provides a critical, open-access foundation for African speech technology. Featuring a large corpus of ASR and TTS data for 27 native languages under a highly permissive license, WAXAL empowers the African AI ecosystem to build robust speech systems that better reflect the region's unique linguistic diversity.
Voice-enabled technologies like virtual assistants and automated transcription have transformed how we interact with computers. However, their benefits disproportionately favor a handful of high-resource languages. This divide has left hundreds of millions of people — particularly in Sub-Saharan Africa, home to over 2,000 distinct languages — unable to access essential technology in their native tongues. Several years ago, the team at Google Research set out to help tackle this problem.
To address this critical need, we introduce WAXAL: a large-scale, openly accessible speech dataset that initially covers 27 Sub-Saharan African languages spoken by over 100 million speakers across more than 26 countries. Developed through a multi-year effort beginning in 2021, in collaboration with African academic and community organizations, WAXAL provides the high-quality, permissively licensed data necessary to build robust speech systems. Setting a foundational milestone, this initial release features approximately 1,846 hours of transcribed natural speech for automatic speech recognition (ASR) and over 565 hours of high-fidelity recordings for text-to-speech (TTS). We are releasing these resources under a Creative Commons license (CC-BY-4.0) to catalyze research and enable inclusive voice-enabled technologies tailored to the unique linguistic characteristics of the continent. We intend for the WAXAL collection to continuously evolve and expand to include additional languages as part of our ongoing effort to bridge the digital divide.
Introducing WAXAL
By addressing critical data scarcity for over 100 million speakers, WAXAL aims to empower the regional AI research ecosystem. To support the development of robust speech technologies, the corpus integrates two specialized datasets designed to provide comprehensive coverage for both speech recognition and synthesis tasks.
- WAXAL-ASR (Spontaneous Understanding): Comprising approximately 1,846 hours of transcribed audio, this dataset captures natural, unscripted speech. Instead of reading scripts, diverse participants were asked to describe visual stimuli covering 50+ topics in their native language. This image-prompted elicitation captured authentic linguistic variations, including tonal nuances and code-switching. This method successfully yielded more natural speech than traditional methods.
- WAXAL-TTS (High-Fidelity Generation): Designed to facilitate the creation of natural-sounding synthetic voices, this dataset contains over 565 hours of high-quality, phonetically balanced audio. The TTS collection process was highly collaborative: local community members worked in pairs to draft scripts of 10,000–20,000 words, alternating reader and recorder roles. To ensure professional-grade acoustics, some participants used project funding to build custom studio boxes. The resulting recordings were then segmented, matched with the script text, and reviewed for accuracy and quality.
The WAXAL corpus's dual focus on unscripted ASR data and high-fidelity TTS audio is designed to enable the development of full-duplex conversational systems. Specifically, the ASR component facilitates the modeling of varied, spontaneous speech input typical of real-world scenarios, while the high-quality TTS component provides the clean reference data required for generating clear, natural output. The table below lists the 27 languages currently included in the dataset:
Anchoring in the African AI ecosystem
Crucial to the WAXAL project was our commitment to working with, and contributing directly to, the African AI ecosystem. The data collection effort was led entirely by African academic and community organizations, guided by Google experts on world-class data collection practices. This collaborative approach ensured the corpus was built by and for the community it serves. Using a shared methodology, each partner focused on a specific subset of languages. Our partners included Makerere University, which collected ASR and/or TTS data for nine different languages, and the University of Ghana, which focused its efforts on eight languages, using the ASR image-prompted data collection methodology outlined above. Additional key collaborators were Digital Umuganda, in partnership with Addis Ababa University, who were instrumental in leading the ASR collection for several regional languages. For the high-quality, studio-recorded voices, Media Trust, Loud n Clear and African Institute for Mathematical Sciences Senegal spearheaded the TTS recordings across various regional languages.
This framework is rooted in the principle that our partners retain ownership of the collected data, alongside a shared commitment to make all datasets openly available to the broader community. This deep collaboration and open-access philosophy have already enabled notable derivative research and publications.
- Through this framework, our partners have already enabled new research, such as the development of a cookbook for community-driven collection of impaired speech. This research resulted in the first open-source dataset for Akan speakers with conditions like cerebral palsy and stammering, and demonstrated that in-person, image-prompted elicitation is more effective than text-based prompts for these populations. This work provides a vital roadmap for developing inclusive speech technologies in low-resource environments.
- Furthermore, the initiative supported a major study that introduced a 5,000-hour speech corpus for five Ghanaian languages — Akan, Ewe, Dagbani, Dagaare, and Ikposo. This work established infrastructure for building robust ASR and TTS systems tailored to the linguistic diversity of West Africa by using a controlled crowdsourcing approach to capture natural, spontaneous intonations.
- Other essential research has focused on benchmarking four state-of-the-art models (Whisper, XLS-R, MMS, and W2v-BERT) across 13 African languages. This study analyzed how performance scales with increased training data, offering key insights into data efficiency and highlighting that scaling benefits are strongly dependent on linguistic complexity and domain alignment.
- Finally, a systematic literature review was published, cataloging 74 datasets across 111 African languages to map the current frontier of speech technology. This review emphasized the urgent need for multi-domain conversational corpora and the adoption of linguistically informed metrics, such as Character Error Rate (CER), to better evaluate performance in morphologically rich and tonal language contexts.
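To make the point about Character Error Rate concrete, here is a minimal sketch of how CER and word error rate (WER) are commonly computed: the Levenshtein edit distance between a reference transcript and a model hypothesis, divided by the reference length. It is not taken from the review or from any WAXAL tooling. The function names, the NFC normalization step, and the Swahili example phrase are assumptions chosen for illustration only.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference unit
                curr[j - 1] + 1,     # insertion of a hypothesis unit
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the number of reference units."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    """Character error rate. NFC normalization (an assumption here) keeps
    composed and decomposed tone marks from counting as different characters."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    return error_rate(list(ref), list(hyp))


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


# Illustrative phrase only: one wrong character inside a long, inflected word.
reference = "ninakupenda sana"
hypothesis = "ninakupende sana"
print(f"WER = {wer(reference, hypothesis):.4f}")
print(f"CER = {cer(reference, hypothesis):.4f}")
```

In this example a single wrong character makes one of the two words wrong, so WER penalizes the whole word while CER penalizes only one character out of sixteen. That gap is one reason character-level metrics can be more informative for morphologically rich languages, where a single word carries several morphemes.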
Conclusion and future directions
WAXAL represents a key milestone in bridging the digital divide, offering a high-quality, open-access speech resource for 27 Sub-Saharan African languages. Developed through deep collaboration with African academic and community organizations, this initiative empowers the continent’s AI ecosystem and preserves linguistic diversity. We hope WAXAL will continue to serve as a vital resource for the digital preservation of African languages and a foundation for future innovations. Google remains committed to this effort, with plans to continuously expand the WAXAL dataset.
Acknowledgements
We are grateful to our partners at Makerere University, the University of Ghana, Digital Umuganda, Addis Ababa University, the African Institute for Mathematical Sciences Senegal, Media Trust and Loud and Clear Communications Ltd for their essential contributions in reducing the language gap and building a more inclusive digital future for millions of speakers across the African continent.
Article link: https://blog.qimuai.cn/?post=3506