Unlocking Data from Silos

Gates Foundation Supported Datasets for AI Model Training

This program develops a scalable framework for unlocking high-quality proprietary content through ethical licensing agreements. The initiative bridges the data gap that prevents development of high-performing AI models for underrepresented languages and domains by facilitating access to datasets for researchers and organizations working on AI for social good.

Apply for Access

Available Datasets

Explore our curated collection of datasets designed to advance AI research and development for public good.

Chichewa Language Dataset

A comprehensive dataset of Chichewa language text and audio, including literature, news articles, household surveys, and radio broadcasts. Data sources include the National Statistical Office, Malawi Institute of Education, the Malawi Times, Malawi Capital Radio, and others.

Learn more →
MITS Dataset

Minimally Invasive Tissue Sampling (MITS) data, including clinical records, imaging data, and patient demographics. The data cover samples taken from five countries: Nepal, Rwanda, Sierra Leone, Uganda, and Bangladesh.

Learn more →

Apply to Access Datasets

Researchers, academic institutions, and organizations working on AI for social good can apply for access to these datasets. The application process involves submitting a research proposal, demonstrating ethical data handling practices, and committing to open-source publication of derived insights. We prioritize applications that show potential for significant impact on global development challenges and align with the Gates Foundation's mission.

Submit Application

Applications are reviewed on a rolling basis. The typical response time is 2 to 3 weeks.