Development Data Partnership Clickthrough Sublicense Agreement
All datasets in this catalog are governed by the following sublicense agreement. By accessing any dataset, you agree to these terms.
1. Parties; Clickthrough Acceptance; Authority
This Clickthrough Sublicense Agreement (this "Sublicense") is entered into by and between: the International Bank for Reconstruction and Development ("IBRD") and the International Development Association ("IDA"), collectively, the "World Bank"; and the individual natural person accepting these terms and, where such individual is acting within the scope of employment, appointment, or affiliation, the employing or affiliated institution on whose behalf such individual acts (collectively, "Authorized User" or "You").
By selecting "I Agree," submitting, participating in, or acting under an Approved Proposal through the Development Data Partnership Platform (the "Portal"), or accessing any Dataset, You (i) acknowledge that You have read and understood this Sublicense, (ii) agree to be legally bound by its terms, and (iii) consent to the use of electronic records and signatures for purposes of formation and evidencing acceptance.
If You do not agree, You shall not access or use any Dataset. You represent and warrant that You have full legal power and authority to enter into this Sublicense and, where applicable, to bind Your institution. Authorized User and the affiliated institution (if any) shall be jointly and severally liable for all obligations and liabilities arising under this Sublicense.
2. Eligibility; Approved Proposal; Scope of Authorization
2.1 Proposal-Based Access. Access to any Dataset is conditioned on the World Bank's prior written approval of a Proposal submitted through the Portal (an "Approved Proposal"). The World Bank may approve, deny, condition, suspend, or revoke approval in its sole discretion at any time.
2.2 Purpose; Duration; No Expansion by Proposal. You may access and use the Dataset solely (a) for the purposes expressly described in the Approved Proposal, (b) for Non-Commercial Research Use, and (c) during the project duration approved in the Approved Proposal. An Approved Proposal is incorporated by reference solely to delimit scope and conditions of use and shall not amend this Sublicense or expand any rights granted herein.
2.3 Material Changes Require Prior Approval. Any "Material Change" requires renewed approval through the Portal prior to implementation. "Material Change" includes, without limitation, any change to: purpose; methodology; model architecture; base model; research personnel; compute environment; storage architecture; access controls; duration; anticipated outputs; publication plan; or release plan for Derived Outputs.
3. Definitions
Authorized User means a natural person affiliated with an eligible institution, identified by name in an Approved Proposal, who has accepted this Sublicense and has been issued access credentials by the World Bank. Access and authentication must be conducted solely through the institutional, domain-linked email address verified in the Proposal. Authorization is personal, limited, non-transferable, and non-delegable.
Dataset means any organized collection of data made available through the Portal and/or API, including without limitation audio, text, metadata, sampled content, segmented or chunked content, shuffled content, transformed content, or other provisioned content.
Data means any information from or derived from a Dataset made available through the Portal and/or API, including without limitation chunked, shuffled, sampled, altered, transformed, watermarked, security-marked, trace-marked, or otherwise protected content.
Model means any machine learning, artificial intelligence, statistical, computational, or automated system trained, fine-tuned, evaluated, or otherwise derived in whole or in part from the Dataset.
Derived Output means any Model or artifact created using the Dataset, including without limitation model weights, embeddings, evaluation checkpoints, tokenizers, vocabularies, synthetic data, fine-tuning artifacts, evaluation artifacts, or other outputs, provided that such output does not permit reconstruction of more than an insubstantial portion of the Dataset.
Non-Commercial Research Use means research, education, or development activity conducted without the primary intent of commercial advantage or monetary compensation and without commercial deployment, commercialization, productization, or monetization of any output, whether directly or indirectly.
Permissive Open-Source License means an Open Source Initiative (OSI)-approved license that permits commercial and non-commercial use, modification, and redistribution with minimal restrictions (e.g., MIT, Apache 2.0, BSD 2-Clause, BSD 3-Clause).
4. License Grant; Scope; Restrictions
4.1 Limited License. Subject to the terms and conditions of this Sublicense, the World Bank grants You a limited, non-exclusive, non-transferable, non-sublicensable, revocable, royalty-free license to access and use the Dataset solely for Non-Commercial Research Use and solely within the scope of the Approved Proposal.
4.2 Permitted Uses. You may: (a) access and process the Dataset for the approved research purpose; (b) create Derived Outputs in compliance with all restrictions; and (c) publish research findings that do not reveal, reconstruct, or enable reconstruction of Dataset content.
4.3 Prohibited Uses. You shall not: redistribute, sublicense, or make the Dataset available to third parties; use the Dataset for commercial purposes; attempt to re-identify, de-anonymize, or reverse-engineer any content; use the Dataset in any retrieval-augmented generation (RAG) system; or store or process the Dataset outside the approved compute environment.
5. Data Handling; Security; Retention
You shall implement and maintain administrative, technical, and physical safeguards appropriate to the sensitivity of the Dataset. Access must be limited to Authorized Users named in the Approved Proposal. All copies of the Dataset must be securely deleted upon expiration or termination of the Approved Proposal, and You shall certify such deletion in writing within thirty (30) days.
6. Model Training; Derived Outputs; Release
Any Model trained or fine-tuned using the Dataset must be released under a Permissive Open-Source License. Model weights, training code, and evaluation artifacts must be made publicly available. You shall implement safeguards to prevent memorization, verbatim reproduction, or reconstruction of Dataset content in Model outputs.
7. Publication; Attribution
You may publish research findings derived from the Dataset, provided that: publications do not reveal raw Dataset content; the World Bank and applicable Data Providers are acknowledged; and the World Bank is notified of publications within thirty (30) days.
8. Monitoring; Audit; Enforcement
The World Bank reserves the right to monitor usage, conduct audits, and request compliance reports. You shall cooperate fully with any audit or investigation. Non-compliance may result in immediate termination of access, revocation of the Approved Proposal, and referral for further action.
9. Termination
This Sublicense may be terminated by the World Bank at any time, with or without cause, upon written notice. Upon termination, all rights granted cease immediately and You must destroy all copies of the Dataset and certify such destruction.
10. Intellectual Property
Derivative Works and Derived Outputs are owned by You, subject to the mandatory open release obligations, reconstruction limitations, and full compliance with this Sublicense. Ownership of the Dataset and all rights therein remain with the World Bank and/or the applicable Dataset providers.
11. Indemnification
You shall indemnify, defend, and hold harmless the World Bank and its officers, employees, and agents from and against any and all claims, demands, actions, proceedings, liabilities, damages, losses, costs, and expenses arising out of or relating to any breach of this Sublicense, misuse of the Dataset, violation of applicable law, distribution or release of Derived Outputs, any conflict with upstream base-model license terms, and any verbatim or near-verbatim reproduction caused by Model architecture, training choices, or release choices.
12. Disclaimer
THE DATASET IS PROVIDED "AS IS" AND "AS AVAILABLE." THE WORLD BANK DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, ACCURACY, COMPLETENESS, TITLE, AND NON-INFRINGEMENT.
13. Limitation of Liability
TO THE MAXIMUM EXTENT PERMITTED AND CONSISTENT WITH THE WORLD BANK'S PRIVILEGES AND IMMUNITIES, IN NO EVENT SHALL THE WORLD BANK BE LIABLE FOR ANY INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL, EXEMPLARY, OR PUNITIVE DAMAGES. THE WORLD BANK'S AGGREGATE LIABILITY ARISING OUT OF OR RELATING TO THIS SUBLICENSE SHALL NOT EXCEED ONE THOUSAND UNITED STATES DOLLARS (USD $1,000).
14. Privileges and Immunities
Nothing in this Sublicense shall be construed as a waiver, express or implied, of any privileges and immunities of IBRD or IDA, all of which are expressly reserved.
15. Dispute Resolution
The Parties shall attempt in good faith to resolve any dispute amicably within ninety (90) days of written notice. Any dispute not resolved within such period shall be finally settled by arbitration under the UNCITRAL Arbitration Rules by three (3) arbitrators. The appointing authority shall be the Secretary-General of the Permanent Court of Arbitration. The language shall be English.
16. Entire Agreement; Hierarchy; Severability
This Sublicense, together with the Approved Proposal solely for purposes of delimiting authorized scope, constitutes the entire agreement governing Dataset access and use. In the event of any conflict, this Sublicense controls unless dataset-specific Portal terms expressly state otherwise in writing. If any provision is held invalid or unenforceable, the remainder shall remain in full force and effect.
Clickthrough Certification
By clicking "I Agree," You certify and warrant that: You are authorized to bind Yourself and Your institution (if applicable); You will use the Dataset solely for Non-Commercial Research Use and strictly within the Approved Proposal; You will not redistribute raw Dataset content; You will not use the Dataset in any RAG system; You will publicly release Models under a Permissive Open-Source License as required; and You understand that access is conditional, revocable, monitored, and subject to enforcement.
How Data are Transferred
Rather than offering bulk downloads of entire collections, this catalog delivers content exclusively through the API. This design is a deliberate requirement of the IP licensing agreements with each data provider: datasets are made available only for permitted research and AI-training purposes, and the terms of those agreements must be enforced programmatically at the point of transfer.
The API serves files as discrete, size-bounded chunks and enforces per-request and per-key rate limits. This means:
- No unrestricted bulk export. Consumers cannot download an entire collection in a single request. The limit parameter caps how many file records are returned per call, and when the result set exceeds that cap a random sample is returned rather than the full set.
- Controlled access via API keys. Every request must include a valid X-Api-Key header. Keys are issued only to approved research teams, making it possible to audit usage, suspend access, and enforce the scope of use agreed with the provider.
- Compliance with IP licensing terms. Data providers retain ownership of the underlying content. The chunked, API-mediated delivery model ensures that the platform can uphold usage restrictions (e.g., non-commercial research only, attribution requirements, redistribution prohibitions) that would be impossible to enforce with unrestricted file downloads.
IP Protection
To protect the intellectual property rights of content providers, the API enforces two complementary mechanisms that prevent reconstruction of original source material.
Chunked & Randomized Delivery
All content is pre-segmented into small, paragraph-sized chunks before storage. When a query is executed, only the relevant chunks are returned — and their order is randomized on every request. This ensures that even repeated queries for the same content yield a different ordering each time, making it computationally impractical to reassemble or reconstruct the original source document from API responses.
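The effect of randomized delivery can be illustrated with a small simulation. This is only a sketch of the behaviour described above, not the platform's implementation: the function name serve_chunks and the sample chunks are hypothetical, and the service's actual shuffling method is not documented here.

```python
import random


def serve_chunks(chunks, rng=None):
    """Return the matching chunks in a freshly randomized order.

    Hypothetical stand-in for the server-side behaviour; the stored
    chunks themselves are never reordered.
    """
    if rng is None:
        rng = random.Random()
    shuffled = list(chunks)  # copy so the stored order is left untouched
    rng.shuffle(shuffled)
    return shuffled


# Illustrative paragraph-sized chunks from one source document.
stored = [f"paragraph {i}" for i in range(1, 9)]

first = serve_chunks(stored)
second = serve_chunks(stored)

# Each response carries the same content, but the original sequence
# is not part of what the caller receives.
same_content = sorted(first) == sorted(second) == sorted(stored)
```

Because a response carries no positional information, a caller who collects every chunk still has to guess the original ordering.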
Daily API Call Limits
Each registered user is subject to a daily rate limit on API calls. This cap is enforced per API key and resets at midnight UTC. Rate limits bound the total volume of content any single user can retrieve within a 24-hour window, providing an additional safeguard against bulk extraction attempts. If your limit is exceeded, the API returns a 429 Too Many Requests response.
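A client can react to a 429 by pausing until the cap resets. The sketch below computes that pause from the midnight-UTC reset described above. The helper names are hypothetical, and it does not assume the API sends a Retry-After header, since none is documented here.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_utc_midnight(now=None):
    """Seconds from `now` until the next 00:00 UTC, when the daily cap resets."""
    if now is None:
        now = datetime.now(timezone.utc)
    now = now.astimezone(timezone.utc)  # normalize aware datetimes to UTC
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return (next_midnight - now).total_seconds()


def wait_time_for(status_code, now=None):
    """Return how long to pause after a response: 0 unless the cap was hit."""
    if status_code == 429:
        return seconds_until_utc_midnight(now)
    return 0.0
```

Pass a timezone-aware datetime when supplying `now`; a naive value would be interpreted as local time by astimezone.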
API Overview
The Low Resource Language Library API is an ASP.NET 8 Web API that serves Croissant 1.0 JSON-LD responses from a SQLite/PostgreSQL database of dataset metadata. Use it to search and filter file-level records by keyword, theme, author, and year. The API is read-only and requires an API key passed via the X-Api-Key header.
Base URL: https://api.malawi-lrl.org/v1
Authentication: Pass your key on every request via the X-Api-Key header.
Swagger UI: Open https://api.malawi-lrl.org/v1/swagger in your browser, click Authorize (top right), and enter your API key; all test requests will then include the header automatically.
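As a minimal sketch of the authentication scheme, the snippet below builds a GET request that carries the X-Api-Key header, using only the standard library. The path "/example-endpoint", the key "YOUR_API_KEY", and the filter values are placeholders, not documented values; take the real endpoint paths from the Swagger UI.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.malawi-lrl.org/v1"


def build_request(path, api_key, **filters):
    """Build (but do not send) a GET request carrying the X-Api-Key header.

    `path` is a placeholder argument: endpoint paths are not listed here.
    """
    # Drop unset filters so only supplied parameters reach the query string.
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    url = f"{BASE_URL}/{path.lstrip('/')}"
    if query:
        url = f"{url}?{query}"
    return Request(url, headers={"X-Api-Key": api_key}, method="GET")


# Hypothetical endpoint, key, and filter values, for illustration only.
req = build_request(
    "/example-endpoint", "YOUR_API_KEY", keyword="maize", year="2024", limit=10
)
```

The request can then be sent with urllib.request.urlopen(req). Note that urllib stores the header name as "X-api-key" internally; HTTP header names are case-insensitive, so this does not change what the server sees.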
Endpoints
All parameters are optional.
When the total number of matching files exceeds limit, a random sample is returned.
| Parameter | Type | Default | Description |
|---|---|---|---|
| keyword | string | — | Partial, case-insensitive keyword match |
| theme | string | — | Partial, case-insensitive UNBIS theme match |
| author | string | — | Partial, case-insensitive author name match |
| year | string | — | Exact year, e.g. 2024 |
| limit | int | 30 | Max files in response (1–100). Random sample taken when total > limit. |
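The limit semantics in the table can be modelled in a few lines. This is an illustrative sketch, not the server's code: select_files is a hypothetical name, and rejecting an out-of-range limit with an error is an assumption, since the table gives the 1–100 range but not how violations are handled.

```python
import random

MIN_LIMIT, MAX_LIMIT, DEFAULT_LIMIT = 1, 100, 30


def select_files(matches, limit=DEFAULT_LIMIT, rng=None):
    """Return all matches, or a random sample of `limit` when there are more."""
    if not MIN_LIMIT <= limit <= MAX_LIMIT:
        # Assumption: out-of-range values are rejected rather than clamped.
        raise ValueError(f"limit must be between {MIN_LIMIT} and {MAX_LIMIT}")
    matches = list(matches)
    if len(matches) <= limit:
        return matches  # result set fits, so nothing is dropped
    if rng is None:
        rng = random.Random()
    return rng.sample(matches, limit)  # sampling without replacement
```

A practical consequence is that repeating a broad query returns different subsets, so narrow the filters when a stable, complete result set is needed.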
Example requests:
Example response:
Returns the API status and the total number of files in the database. Also requires X-Api-Key.
Python SDK — DDPLRLL Dataset Reader
Python client for the Low Resource Language Library API. Queries Croissant JSON-LD metadata, downloads the referenced PDF files, and saves a local JSON-LD with rewritten file paths.
Installation
Configuration
All settings can be provided via CLI flags, environment variables (prefixed DDPLRLL_), or a .env file.
| Env Variable | CLI Flag | Default | Description |
|---|---|---|---|
| DDPLRLL_API_BASE_URL | --api-url | https://lrllapi.azurewebsites.net | Base URL of the API |
| DDPLRLL_API_KEY | --api-key | (empty) | X-Api-Key header value |
| DDPLRLL_KEYWORD | --keyword | — | Filter by keyword |
| DDPLRLL_THEME | --theme | — | Filter by theme |
| DDPLRLL_AUTHOR | --author | — | Filter by author |
| DDPLRLL_YEAR | --year | — | Filter by year |
| DDPLRLL_LIMIT | --limit | 30 | Max file entries (1–100) |
| DDPLRLL_OUTPUT_DIR | --output | ./output | Output directory |
| DDPLRLL_DOWNLOAD_FILES | --no-download | true | Download PDFs |
| DDPLRLL_MAX_CONCURRENT_DOWNLOADS | --concurrency | 5 | Parallel downloads |
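The layering of CLI flags, environment variables, and defaults can be sketched as follows for three of the settings above. This is not the SDK's own code. The precedence order (CLI flag, then environment variable, then default) is an assumption, resolve_settings is a hypothetical name, and .env loading is left out.

```python
import argparse
import os

# Defaults and variable names taken from the configuration table.
DEFAULTS = {"limit": 30, "output": "./output", "concurrency": 5}
ENV_NAMES = {
    "limit": "DDPLRLL_LIMIT",
    "output": "DDPLRLL_OUTPUT_DIR",
    "concurrency": "DDPLRLL_MAX_CONCURRENT_DOWNLOADS",
}


def resolve_settings(argv, environ=None):
    """Resolve each setting as: CLI flag, else env variable, else default."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--limit", type=int)
    parser.add_argument("--output")
    parser.add_argument("--concurrency", type=int)
    args = parser.parse_args(argv)

    settings = {}
    for name, default in DEFAULTS.items():
        cli_value = getattr(args, name)
        env_value = environ.get(ENV_NAMES[name])
        if cli_value is not None:
            settings[name] = cli_value
        elif env_value is not None:
            # Environment values arrive as strings; cast to the default's type.
            settings[name] = type(default)(env_value)
        else:
            settings[name] = default
    return settings
```

For example, resolve_settings(["--limit", "50"], {"DDPLRLL_LIMIT": "10"}) yields a limit of 50 under this assumed precedence, with the other settings falling back to their defaults.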
CLI Usage
Python API
Output Structure
After downloading, each scContentUrl in dataset.jsonld is rewritten from the remote URL to the absolute local path:
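A rewrite of this kind could look like the sketch below. It is illustrative only: the function names are hypothetical, the surrounding document layout is not assumed (the walk is recursive), and deriving the local filename from the last URL path segment is an assumption about how downloads are named.

```python
from pathlib import Path
from urllib.parse import urlparse

URL_KEY = "scContentUrl"  # key named in the text above


def local_path_for(url, output_dir):
    """Map a remote URL to an absolute path inside output_dir.

    Assumption: the local filename is the last segment of the URL path.
    """
    filename = Path(urlparse(url).path).name
    return str((Path(output_dir) / filename).resolve())


def rewrite_content_urls(node, output_dir):
    """Return a copy of a parsed JSON-LD tree with URL_KEY values made local."""
    if isinstance(node, dict):
        return {
            key: (
                local_path_for(value, output_dir)
                if key == URL_KEY and isinstance(value, str)
                else rewrite_content_urls(value, output_dir)
            )
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [rewrite_content_urls(item, output_dir) for item in node]
    return node  # strings, numbers, booleans, and None pass through
```

The function returns a new structure rather than mutating its input, so the response as received from the API stays available for comparison.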
Using with mlcroissant
The saved dataset.jsonld is a valid Croissant 1.0 document. Install the mlcroissant package to load it directly:
Load and iterate records:
Inspect metadata:
End-to-end: ddplrll-reader → mlcroissant:
Using with pandas
Using with Hugging Face Datasets
End-to-end: query → pandas → analysis
Metadata Standards
Standards incorporated across all schemas:
Metadata Schemas by Data Type
About This Catalog
This catalog provides a centralized registry of low-resource language datasets, intended for researchers and engineers working on natural language processing (NLP), automatic speech recognition (ASR), and large language model (LLM) fine-tuning for low-resource languages.
This catalog aims to bridge the language data gap by licensing data from broadcasters, publishers, government institutions, and community organizations for non-commercial use.
🎯 Purpose
Provide a single point of discovery for available proprietary low-resource language data that is suitable for ML model training, so researchers don't have to hunt across disparate sources.
📋 Data Types
Audio (radio broadcasts, podcasts, recorded surveys), text (news articles, books, educational materials), and structured transcriptions paired with audio for ASR training.
🔑 Access
All metadata is freely browsable. Data downloads require registration and acceptance of the data use agreement -- see licensing restrictions on the "license" tab.
🤝 Contributing
We welcome new data contributions. If your organization produces low-resource language content and would like to make it available for research, please contact us to discuss ingestion and licensing.
📖 Citation
If you use data from this catalog in your research, please cite the catalog and the individual data providers. Citation formats are available on each collection's detail page.
⚖️ Ethics & Privacy
Household survey recordings have been reviewed for PII and anonymized. All data is shared under terms agreed upon with the originating institutions. Researchers must adhere to the data use agreement.
Apply to Access Datasets
Researchers, academic institutions, and organizations working on AI for social good can apply for access to these datasets. The application process involves submitting a research proposal, demonstrating ethical data handling practices, and committing to open-source publication of derived insights. We prioritize applications that show potential for significant impact on global development challenges and align with the Gates Foundation's mission.
Applications are reviewed on a rolling basis. Typical response time is 2–3 weeks.