Low-Resource Language Data Catalog

Development Data Partnership Clickthrough Sublicense Agreement

All datasets in this catalog are governed by the following sublicense agreement. By accessing any dataset, you agree to these terms.

1. Parties; Clickthrough Acceptance; Authority

This Clickthrough Sublicense Agreement (this "Sublicense") is entered into by and between: the International Bank for Reconstruction and Development ("IBRD") and the International Development Association ("IDA"), collectively, the "World Bank"; and the individual natural person accepting these terms and, where such individual is acting within the scope of employment, appointment, or affiliation, the employing or affiliated institution on whose behalf such individual acts (collectively, "Authorized User" or "You").

By selecting "I Agree," submitting, participating in, or acting under an Approved Proposal through the Development Data Partnership Platform (the "Portal"), or accessing any Dataset, You (i) acknowledge that You have read and understood this Sublicense, (ii) agree to be legally bound by its terms, and (iii) consent to the use of electronic records and signatures for purposes of formation and evidencing acceptance.

If You do not agree, You shall not access or use any Dataset. You represent and warrant that You have full legal power and authority to enter into this Sublicense and, where applicable, to bind Your institution. Authorized User and the affiliated institution (if any) shall be jointly and severally liable for all obligations and liabilities arising under this Sublicense.

2. Eligibility; Approved Proposal; Scope of Authorization

2.1 Proposal-Based Access. Access to any Dataset is conditioned on the World Bank's prior written approval of a Proposal submitted through the Portal (an "Approved Proposal"). The World Bank may approve, deny, condition, suspend, or revoke approval in its sole discretion at any time.

2.2 Purpose; Duration; No Expansion by Proposal. You may access and use the Dataset solely (a) for the purposes expressly described in the Approved Proposal, (b) for Non-Commercial Research Use, and (c) during the project duration approved in the Approved Proposal. An Approved Proposal is incorporated by reference solely to delimit scope and conditions of use and shall not amend this Sublicense or expand any rights granted herein.

2.3 Material Changes Require Prior Approval. Any "Material Change" requires renewed approval through the Portal prior to implementation. "Material Change" includes, without limitation, any change to: purpose; methodology; model architecture; base model; research personnel; compute environment; storage architecture; access controls; duration; anticipated outputs; publication plan; or release plan for Derived Outputs.

3. Definitions

Authorized User means a natural person affiliated with an eligible institution, identified by name in an Approved Proposal, who has accepted this Sublicense and has been issued access credentials by the World Bank. Access and authentication must be conducted solely through the institutional, domain-linked email address verified in the Proposal. Authorization is personal, limited, non-transferable, and non-delegable.

Dataset means any organized collection of data made available through the Portal and/or API, including without limitation audio, text, metadata, sampled content, segmented or chunked content, shuffled content, transformed content, or other provisioned content.

Data means any information from or derived from a Dataset made available through the Portal and/or API, including without limitation chunked, shuffled, sampled, altered, transformed, watermarked, security-marked, trace-marked, or otherwise protected content.

Model means any machine learning, artificial intelligence, statistical, computational, or automated system trained, fine-tuned, evaluated, or otherwise derived in whole or in part from the Dataset.

Derived Output means any Model or artifact created using the Dataset, including without limitation model weights, embeddings, evaluation checkpoints, tokenizers, vocabularies, synthetic data, fine-tuning artifacts, evaluation artifacts, or other outputs, provided that such output does not permit reconstruction of more than an insubstantial portion of the Dataset.

Non-Commercial Research Use means research, education, or development activity conducted without the primary intent of commercial advantage or monetary compensation and without commercial deployment, commercialization, productization, or monetization of any output, whether directly or indirectly.

Permissive Open-Source License means an Open Source Initiative (OSI)-approved license that permits commercial and non-commercial use, modification, and redistribution with minimal restrictions (e.g., MIT, Apache 2.0, BSD 2-Clause, BSD 3-Clause).

4. License Grant; Scope; Restrictions

4.1 Limited License. Subject to the terms and conditions of this Sublicense, the World Bank grants You a limited, non-exclusive, non-transferable, non-sublicensable, revocable, royalty-free license to access and use the Dataset solely for Non-Commercial Research Use and solely within the scope of the Approved Proposal.

4.2 Permitted Uses. You may: (a) access and process the Dataset for the approved research purpose; (b) create Derived Outputs in compliance with all restrictions; and (c) publish research findings that do not reveal, reconstruct, or enable reconstruction of Dataset content.

4.3 Prohibited Uses. You shall not: redistribute, sublicense, or make the Dataset available to third parties; use the Dataset for commercial purposes; attempt to re-identify, de-anonymize, or reverse-engineer any content; use the Dataset in any retrieval-augmented generation (RAG) system; or store or process the Dataset outside the approved compute environment.

5. Data Handling; Security; Retention

You shall implement and maintain administrative, technical, and physical safeguards appropriate to the sensitivity of the Dataset. Access must be limited to Authorized Users named in the Approved Proposal. All copies of the Dataset must be securely deleted upon expiration or termination of the Approved Proposal, and You shall certify such deletion in writing within thirty (30) days.

6. Model Training; Derived Outputs; Release

Any Model trained or fine-tuned using the Dataset must be released under a Permissive Open-Source License. Model weights, training code, and evaluation artifacts must be made publicly available. You shall implement safeguards to prevent memorization, verbatim reproduction, or reconstruction of Dataset content in Model outputs.

7. Publication; Attribution

You may publish research findings derived from the Dataset, provided that: publications do not reveal raw Dataset content; the World Bank and applicable Data Providers are acknowledged; and the World Bank is notified of publications within thirty (30) days.

8. Monitoring; Audit; Enforcement

The World Bank reserves the right to monitor usage, conduct audits, and request compliance reports. You shall cooperate fully with any audit or investigation. Non-compliance may result in immediate termination of access, revocation of the Approved Proposal, and referral for further action.

9. Termination

This Sublicense may be terminated by the World Bank at any time, with or without cause, upon written notice. Upon termination, all rights granted cease immediately and You must destroy all copies of the Dataset and certify such destruction.

10. Intellectual Property

Derivative Works and Derived Outputs are owned by You, subject to the mandatory open release obligations, reconstruction limitations, and full compliance with this Sublicense. Ownership of the Dataset and all rights therein remain with the World Bank and/or the applicable Dataset providers.

11. Indemnification

You shall indemnify, defend, and hold harmless the World Bank and its officers, employees, and agents from and against any and all claims, demands, actions, proceedings, liabilities, damages, losses, costs, and expenses arising out of or relating to any breach of this Sublicense, misuse of the Dataset, violation of applicable law, distribution or release of Derived Outputs, any conflict with upstream base-model license terms, and any verbatim or near-verbatim reproduction caused by Model architecture, training choices, or release choices.

12. Disclaimer

THE DATASET IS PROVIDED "AS IS" AND "AS AVAILABLE." THE WORLD BANK DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, ACCURACY, COMPLETENESS, TITLE, AND NON-INFRINGEMENT.

13. Limitation of Liability

TO THE MAXIMUM EXTENT PERMITTED AND CONSISTENT WITH THE WORLD BANK'S PRIVILEGES AND IMMUNITIES, IN NO EVENT SHALL THE WORLD BANK BE LIABLE FOR ANY INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL, EXEMPLARY, OR PUNITIVE DAMAGES. THE WORLD BANK'S AGGREGATE LIABILITY ARISING OUT OF OR RELATING TO THIS SUBLICENSE SHALL NOT EXCEED ONE THOUSAND UNITED STATES DOLLARS (USD $1,000).

14. Privileges and Immunities

Nothing in this Sublicense shall be construed as a waiver, express or implied, of any privileges and immunities of IBRD or IDA, all of which are expressly reserved.

15. Dispute Resolution

The Parties shall attempt in good faith to resolve any dispute amicably within ninety (90) days of written notice. Any dispute not resolved within such period shall be finally settled by arbitration under the UNCITRAL Arbitration Rules by three (3) arbitrators. The appointing authority shall be the Secretary-General of the Permanent Court of Arbitration. The language shall be English.

16. Entire Agreement; Hierarchy; Severability

This Sublicense, together with the Approved Proposal solely for purposes of delimiting authorized scope, constitutes the entire agreement governing Dataset access and use. In the event of any conflict, this Sublicense controls unless dataset-specific Portal terms expressly state otherwise in writing. If any provision is held invalid or unenforceable, the remainder shall remain in full force and effect.

Clickthrough Certification

By clicking "I Agree," You certify and warrant that: You are authorized to bind Yourself and Your institution (if applicable); You will use the Dataset solely for Non-Commercial Research Use and strictly within the Approved Proposal; You will not redistribute raw Dataset content; You will not use the Dataset in any RAG system; You will publicly release Models under a Permissive Open-Source License as required; and You understand that access is conditional, revocable, monitored, and subject to enforcement.

How Data are Transferred

Rather than offering bulk downloads of entire collections, this catalog delivers content exclusively through the API. This design is a deliberate requirement of the IP licensing agreements with each data provider: datasets are made available only for permitted research and AI-training purposes, and the terms of those agreements must be enforced programmatically at the point of transfer.

The API serves files as discrete, size-bounded chunks and enforces per-request and per-key rate limits. This means:

No unrestricted bulk export. Consumers cannot download an entire collection in a single request. The limit parameter caps how many file records are returned per call, and when the result set exceeds that cap a random sample is returned rather than the full set.
Controlled access via API keys. Every request must include a valid X-Api-Key header. Keys are issued only to approved research teams, making it possible to audit usage, suspend access, and enforce the scope of use agreed with the provider.
Compliance with IP licensing terms. Data providers retain ownership of the underlying content. The chunked, API-mediated delivery model ensures that the platform can uphold usage restrictions (e.g., non-commercial research only, attribution requirements, redistribution prohibitions) that would be impossible to enforce with unrestricted file downloads.

IP Protection

To protect the intellectual property rights of content providers, the API enforces two complementary mechanisms that prevent reconstruction of original source material.

🔀

Chunked & Randomized Delivery

All content is pre-segmented into small, paragraph-sized chunks before storage. When a query is executed, only the relevant chunks are returned — and their order is randomized on every request. This ensures that even repeated queries for the same content yield a different ordering each time, making it computationally impractical to reassemble or reconstruct the original source document from API responses.
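As a rough illustration of why per-request shuffling hinders reassembly, the sketch below returns a fresh random selection of chunks on each call. It is not the server's implementation; `serve_chunks` and the sample chunk strings are hypothetical names used only for this example.

```python
import random


def serve_chunks(chunks, limit, rng=None):
    """Return at most `limit` chunks in a fresh random order.

    Hypothetical helper: it only mimics the behaviour described above
    (sampling plus per-request shuffling), not the real API code.
    """
    rng = rng or random.Random()
    k = min(limit, len(chunks))
    # random.sample draws k distinct items in random order, so the
    # original document order is not preserved in the result.
    return rng.sample(chunks, k)


chunks = [f"paragraph-{i}" for i in range(10)]
print(serve_chunks(chunks, limit=5))
print(serve_chunks(chunks, limit=5))  # usually a different selection and order
```

Because neither the selection nor the order is stable between calls, a client cannot rely on positions in a response to infer where a chunk sat in the source document.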

📊

Daily API Call Limits

Each registered user is subject to a daily rate limit on API calls. This cap is enforced per API key and resets at midnight UTC. Rate limits bound the total volume of content any single user can retrieve within a 24-hour window, providing an additional safeguard against bulk extraction attempts. If your limit is exceeded, the API returns a 429 Too Many Requests response.
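A client can plan around the daily cap by computing how long remains until the midnight UTC reset once a 429 is received. The sketch below is a minimal example under that assumption; the helper names `seconds_until_utc_midnight` and `limit_exceeded` are hypothetical and not part of the API or SDK.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_utc_midnight(now=None):
    """Seconds from `now` until the next midnight UTC (the stated reset time)."""
    now = now or datetime.now(timezone.utc)
    now = now.astimezone(timezone.utc)
    # Move one day ahead, then truncate to 00:00:00 to get the next reset.
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return (next_midnight - now).total_seconds()


def limit_exceeded(status_code):
    """True when the response status signals 429 Too Many Requests."""
    return status_code == 429


if limit_exceeded(429):
    print(f"Daily limit reached; retry in about {seconds_until_utc_midnight():.0f} s")
```

Waiting for the reset rather than retrying immediately avoids spending further calls on requests that will also be rejected.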

API Overview

The Low Resource Language Library API is an ASP.NET 8 Web API that serves Croissant 1.0 JSON-LD responses from a SQLite/PostgreSQL database of dataset metadata. Use it to search and filter file-level records by keyword, theme, author, and year. The API is read-only and requires an API key passed via the X-Api-Key header.

Base URL: https://api.malawi-lrl.org/v1

Authentication: Pass your key on every request via the X-Api-Key header.

Swagger UI: Open https://api.malawi-lrl.org/v1/swagger in your browser, click Authorize (top right), enter your API key, and all test requests will include the header automatically.

Endpoints

GET /api/datasets/query Search and filter datasets; returns Croissant JSON-LD

All parameters are optional. When the total number of matching files exceeds limit, a random sample is returned.

Parameter	Type	Default	Description
`keyword`	string	—	Partial, case-insensitive keyword match
`theme`	string	—	Partial, case-insensitive UNBIS theme match
`author`	string	—	Partial, case-insensitive author name match
`year`	string	—	Exact year, e.g. `2024`
`limit`	int	30	Max files in response (1–100). Random sample taken when total > limit.
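The parameters in the table can be assembled into a request URL with the standard library. The sketch below is one possible way to do it; `build_query_url` and `fetch_json` are hypothetical helpers, and the 1–100 check simply mirrors the documented range for `limit` on the client side.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_query_url(base_url, keyword=None, theme=None, author=None,
                    year=None, limit=30):
    """Build a /api/datasets/query URL from the documented parameters."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    params = {"keyword": keyword, "theme": theme, "author": author,
              "year": year, "limit": limit}
    # All filters are optional, so unset ones are left out of the query string.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base_url.rstrip('/')}/api/datasets/query?{query}"


def fetch_json(url, api_key):
    """Send a GET request with the X-Api-Key header and parse the JSON body."""
    request = urllib.request.Request(url, headers={"X-Api-Key": api_key})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


url = build_query_url("http://localhost:5000", keyword="malaria", limit=10)
print(url)
# data = fetch_json(url, "dev-key-12345")  # requires a running API instance
```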

Example requests:

              # All files tagged with "malaria", limit 10
              curl -H "X-Api-Key: dev-key-12345" \
                   "http://localhost:5000/api/datasets/query?keyword=malaria&limit=10"

              # Health-themed files from 2024
              curl -H "X-Api-Key: dev-key-12345" \
                   "http://localhost:5000/api/datasets/query?theme=Health&year=2024"

              # Combine filters
              curl -H "X-Api-Key: dev-key-12345" \
                   "http://localhost:5000/api/datasets/query?keyword=election&theme=Political&author=Banda&limit=5"
            

Example response:

              {
                "@context": [ "https://mlcommons.org/working-groups/data/croissant/", { ... } ],
                "generatedAt": "2026-02-27T10:00:00Z",
                "samplingInfo": {
                  "totalMatched": 120,
                  "limit": 30,
                  "returned": 30,
                  "randomSample": true
                },
                "@graph": [
                  {
                    "@type": "sc:Dataset",
                    "@id": "DDP-NATION-2024",
                    "scName": "Nation Newspaper 2024",
                    "distribution": [
                      {
                        "@type": "cr:FileObject",
                        "@id": "file-2024-abc123",
                        "scName": "Nation_On_Sunday_20240315_p001.pdf",
                        "scKeywords": ["malaria", "hospital"],
                        "dcatTheme": ["Health"],
                        ...
                      }
                    ]
                  }
                ]
              }
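Since a response may be a random sample rather than the full match set, it can be useful to check `samplingInfo` before treating the result as complete. The sketch below is a minimal example that uses only the keys shown in the example response; `summarize_sampling` is a hypothetical helper, not part of the SDK.

```python
def summarize_sampling(response):
    """Describe how much of the matching set a query response covers."""
    info = response.get("samplingInfo", {})
    total = info.get("totalMatched", 0)
    returned = info.get("returned", 0)
    # Collect the @id of every file object across all datasets in @graph.
    file_ids = [
        file_obj.get("@id")
        for dataset in response.get("@graph", [])
        for file_obj in dataset.get("distribution", [])
    ]
    return {
        "file_ids": file_ids,
        "is_partial": bool(info.get("randomSample")) or returned < total,
        "coverage": returned / total if total else 1.0,
    }


example = {
    "samplingInfo": {"totalMatched": 120, "limit": 30,
                     "returned": 30, "randomSample": True},
    "@graph": [{"@id": "DDP-NATION-2024",
                "distribution": [{"@id": "file-2024-abc123"}]}],
}
print(summarize_sampling(example))
```

For the example values above, 30 of 120 matching files are returned, so the coverage is 0.25 and the result is flagged as partial.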
            

GET /api/datasets/health Return API status and total file count

Returns the API status and the total number of files in the database. Also requires X-Api-Key.

              curl -H "X-Api-Key: dev-key-12345" \
                   "http://localhost:5000/api/datasets/health"
            

Python SDK — DDPLRLL Dataset Reader

Python client for the Low Resource Language Library API. Queries Croissant JSON-LD metadata, downloads the referenced PDF files, and saves a local JSON-LD with rewritten file paths.

Installation

          # From the project root (editable / dev install)
          pip install -e .
        

Configuration

All settings can be provided via CLI flags, environment variables (prefixed DDPLRLL_), or a .env file.

Env Variable	CLI Flag	Default	Description
`DDPLRLL_API_BASE_URL`	`--api-url`	`https://lrllapi.azurewebsites.net`	Base URL of the API
`DDPLRLL_API_KEY`	`--api-key`	(empty)	`X-Api-Key` header value
`DDPLRLL_KEYWORD`	`--keyword`	—	Filter by keyword
`DDPLRLL_THEME`	`--theme`	—	Filter by theme
`DDPLRLL_AUTHOR`	`--author`	—	Filter by author
`DDPLRLL_YEAR`	`--year`	—	Filter by year
`DDPLRLL_LIMIT`	`--limit`	`30`	Max file entries (1–100)
`DDPLRLL_OUTPUT_DIR`	`--output`	`./output`	Output directory
`DDPLRLL_DOWNLOAD_FILES`	`--no-download`	`true`	Download PDFs (pass `--no-download` to skip downloads)
`DDPLRLL_MAX_CONCURRENT_DOWNLOADS`	`--concurrency`	`5`	Parallel downloads
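To show how the `DDPLRLL_`-prefixed variables map onto settings, the sketch below reads a subset of them from a mapping and applies the defaults from the table. It is an illustration only, not the SDK's `Settings` class; `load_settings`, `DEFAULTS`, and the rule for parsing boolean strings are assumptions.

```python
import os

# Defaults copied from the configuration table above.
DEFAULTS = {
    "API_BASE_URL": "https://lrllapi.azurewebsites.net",
    "API_KEY": "",
    "LIMIT": "30",
    "OUTPUT_DIR": "./output",
    "DOWNLOAD_FILES": "true",
    "MAX_CONCURRENT_DOWNLOADS": "5",
}


def load_settings(env=None, prefix="DDPLRLL_"):
    """Read prefixed variables from `env`, falling back to the defaults."""
    env = os.environ if env is None else env
    raw = {name: env.get(prefix + name, default)
           for name, default in DEFAULTS.items()}
    return {
        "api_base_url": raw["API_BASE_URL"],
        "api_key": raw["API_KEY"],
        "limit": int(raw["LIMIT"]),
        "output_dir": raw["OUTPUT_DIR"],
        # Assumed convention: "1", "true", or "yes" enable downloads.
        "download_files": raw["DOWNLOAD_FILES"].strip().lower() in {"1", "true", "yes"},
        "max_concurrent_downloads": int(raw["MAX_CONCURRENT_DOWNLOADS"]),
    }


print(load_settings({"DDPLRLL_LIMIT": "10", "DDPLRLL_DOWNLOAD_FILES": "false"}))
```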

CLI Usage

          # Full pipeline: query + download + save JSON-LD
          ddplrll-reader run \
            --api-url https://lrllapi.azurewebsites.net \
            --api-key MY_KEY \
            --keyword malaria \
            --year 2024 \
            --limit 10 \
            --output ./my-output

          # Query only (no file downloads)
          ddplrll-reader run --api-key MY_KEY --keyword health --no-download

          # Health check
          ddplrll-reader health --api-url https://lrllapi.azurewebsites.net
        

Python API

          from ddplrll_reader import DdplrllDatasetClient, Settings

          # Configure
          settings = Settings(
            api_base_url="https://lrllapi.azurewebsites.net",
            api_key="MY_KEY",
            output_dir="./output",
          )

          client = DdplrllDatasetClient(settings)

          # Full pipeline: query → download PDFs → save JSON-LD
          jsonld_path = client.run(keyword="malaria", year="2024", limit=10)
          print(f"Saved to {jsonld_path}")

          # Query only (returns raw dict)
          data = client.query(keyword="health", theme="Education")

          # Query with Pydantic validation
          response = client.query_validated(keyword="malaria")
          for dataset in response.graph or []:
            print(dataset.sc_name)
            for f in dataset.distribution or []:
              print(f" {f.sc_name} → {f.sc_content_url}")
        

Output Structure

          output/
          ├── dataset.jsonld          # Croissant JSON-LD with local file paths
          └── files/
              ├── file-2022-465a93ae.pdf
              ├── file-2022-1cafc7a4.pdf
              └── ...
        

After downloading, each scContentUrl in dataset.jsonld is rewritten from the remote URL to the absolute local path:

          "scContentUrl": "https://lrllapi.azurewebsites.net/api/files/file-2022-465a93ae"
          →
          "scContentUrl": "/Users/you/output/files/file-2022-465a93ae.pdf"
        

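The rewrite described above can be sketched as follows. This is an assumption about how such a mapping might work rather than the SDK's actual code: `local_path_for` and `rewrite_content_urls` are hypothetical names, and the `.pdf` extension and `files/` subdirectory are taken from the output structure shown earlier.

```python
from pathlib import Path
from urllib.parse import urlparse


def local_path_for(content_url, output_dir, extension=".pdf"):
    """Map a remote scContentUrl to a file path under <output_dir>/files."""
    # The file ID is the last path segment of the remote URL.
    file_id = urlparse(content_url).path.rstrip("/").rsplit("/", 1)[-1]
    return Path(output_dir) / "files" / f"{file_id}{extension}"


def rewrite_content_urls(doc, output_dir):
    """Replace every remote scContentUrl in a JSON-LD dict with a local path."""
    for dataset in doc.get("@graph", []):
        for file_obj in dataset.get("distribution", []):
            url = file_obj.get("scContentUrl")
            if url and url.startswith(("http://", "https://")):
                file_obj["scContentUrl"] = local_path_for(url, output_dir).as_posix()
    return doc


remote = "https://lrllapi.azurewebsites.net/api/files/file-2022-465a93ae"
print(local_path_for(remote, "/Users/you/output").as_posix())
```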
Using with mlcroissant

The saved dataset.jsonld is a valid Croissant 1.0 document. Install the mlcroissant package to load it directly:

pip install mlcroissant

Load and iterate records:

          from mlcroissant import Dataset

          ds = Dataset(jsonld=jsonld_path)
          records = ds.records("default")

          for record in records:
            print(record)
        

Inspect metadata:

          from mlcroissant import Dataset

          ds = Dataset(jsonld="output/dataset.jsonld")

          # Top-level metadata
          print(ds.metadata.name)
          print(ds.metadata.description)

          # List all record sets
          for record_set in ds.metadata.record_sets:
            print(record_set.name, "–", len(record_set.fields), "fields")
        

End-to-end: ddplrll-reader → mlcroissant:

          from ddplrll_reader import DdplrllDatasetClient, Settings
          from mlcroissant import Dataset

          # 1. Query and download
          client = DdplrllDatasetClient(Settings(
            api_base_url="https://lrllapi.azurewebsites.net",
            api_key="MY_KEY",
          ))
          jsonld_path = client.run(keyword="health", year="2023", limit=50)

          # 2. Load with mlcroissant
          ds = Dataset(jsonld=jsonld_path)
          for record in ds.records("default"):
            print(record)
        

Using with pandas

          import json
          import pandas as pd

          # Load the JSON-LD
          with open("output/dataset.jsonld", encoding="utf-8") as f:
            data = json.load(f)

          # Flatten all file objects across every dataset/year into a DataFrame.
          # Keys follow the example response above ("@graph", "@id").
          rows = []
          for dataset_node in data.get("@graph", []):
            dataset_id = dataset_node.get("@id")
            year = dataset_node.get("scTemporalCoverage")
            for file_obj in dataset_node.get("distribution", []):
              rows.append({
                "dataset_id": dataset_id,
                "year": year,
                "file_id": file_obj.get("@id"),
                "name": file_obj.get("scName"),
                "author": file_obj.get("scAuthor"),
                "local_path": file_obj.get("scContentUrl"),
                "size_bytes": file_obj.get("scContentSize"),
                "word_count": file_obj.get("scWordCount"),
                "token_count": file_obj.get("ddpvTokenCount"),
                "keywords": file_obj.get("scKeywords"),
                "themes": file_obj.get("dcatTheme"),
              })

          df = pd.DataFrame(rows)
          print(df.head())
          print(f"\nTotal files: {len(df)}")
          print(f"Total tokens: {df['token_count'].sum():,}")
        

Using with Hugging Face Datasets

          import json
          from datasets import Dataset

          with open("output/dataset.jsonld", encoding="utf-8") as f:
            data = json.load(f)

          # Build a flat list of records.
          # Keys follow the example response above ("@graph", "@id").
          records = []
          for dataset_node in data.get("@graph", []):
            for file_obj in dataset_node.get("distribution", []):
              records.append({
                "file_id": file_obj["@id"],
                "name": file_obj.get("scName"),
                "author": file_obj.get("scAuthor"),
                "year": dataset_node.get("scTemporalCoverage"),
                "local_path": file_obj.get("scContentUrl"),
                "word_count": file_obj.get("scWordCount"),
                "token_count": file_obj.get("ddpvTokenCount"),
                # Join list fields into strings; tolerate missing or null values
                "keywords": ", ".join(file_obj.get("scKeywords") or []),
                "themes": ", ".join(file_obj.get("dcatTheme") or []),
              })

          ds = Dataset.from_list(records)
          print(ds)
          print(ds[0])

          # Filter, shuffle, split
          ds_health = ds.filter(lambda r: "Health" in r["themes"])
          train_test = ds_health.train_test_split(test_size=0.2)
          print(train_test)
        

End-to-end: query → pandas → analysis

          import json

          import pandas as pd
          from ddplrll_reader import DdplrllDatasetClient, Settings

          # 1. Query and download
          client = DdplrllDatasetClient(Settings(
            api_base_url="https://lrllapi.azurewebsites.net",
            api_key="MY_KEY",
          ))
          jsonld_path = client.run(keyword="health", year="2023", limit=50)

          # 2. Load into pandas
          with open(jsonld_path, encoding="utf-8") as f:
            data = json.load(f)

          rows = [
            {
              "name": fo.get("scName"),
              "author": fo.get("scAuthor"),
              "words": fo.get("scWordCount"),
              "tokens": fo.get("ddpvTokenCount"),
              "themes": fo.get("dcatTheme"),
              "path": fo.get("scContentUrl"),
            }
            for node in data.get("@graph", [])
            for fo in node.get("distribution", [])
          ]
          df = pd.DataFrame(rows)

          # 3. Analyse
          print(df.describe())
          print(df.groupby("author")["tokens"].sum().sort_values(ascending=False))
        

Metadata Schemas

All datasets in this catalog are documented using structured metadata to ensure they are as discoverable and reusable as possible for AI model training and fine-tuning. The metadata collection is guided by four core objectives:

🔍 Discoverability: Making datasets easy to find and correctly identify

📊 Usability: Providing clear information so users can assess and work with datasets effectively

🔁 Reproducibility: Documenting processes to enable verification and reuse

⚖️ Ethical Practice: Ensuring transparent documentation of provenance, licensing, and ethical safeguards

No single existing standard fully met the project’s requirements. Therefore, we adopted an application profile approach, combining fields from multiple standards. It draws on nine established standards alongside a custom namespace, the Development Data Partnership Vocabulary (DDPV), to capture project-specific attributes while maintaining compatibility with existing vocabularies and tools.

Metadata Standards

Standards incorporated across all schemas:

Croissant
Schema.org Dataset Vocabulary
Dublin Core Terms
Data Catalog Vocabulary (DCAT)
Data Quality Vocabulary (DQV)
Provenance Ontology (PROV-O)
RDF Schema (RDFS)
EBUCore
Open Language Archives Community (OLAC)
Bibliographic Ontology (BIBO)
Data Documentation Initiative (DDI)
DDPV

Metadata Schemas by Data Type

We developed metadata schemas based on the standards above for three primary data types: audio, text, and video. While the majority of the fields are similar, the core differences lie in the technical fields required for each data type. Additional information on the fields, their definitions, cardinality requirements, and collection methods can be found in the resources section below.

📦 Schema Resources & Downloads

📖 Full Documentation (PDF): Complete documentation describing the metadata schemas and the process used to develop them.
🗂️ DDPV Vocabulary (WEB): A dedicated webpage describing and maintaining the custom metadata vocabulary.
📄 Text Metadata Schema (JSON-LD): JSON schema for text and print/news datasets.
🎙️ Audio Metadata Schema (JSON-LD): JSON schema for audio and broadcast datasets.
🎬 Video Metadata Schema (JSON-LD): JSON schema for video datasets.
🔊 Example Audio Record (JSON-LD): A populated example metadata record for an audio dataset.

About This Catalog

This catalog provides a centralized registry of low-resource language datasets, intended for researchers and engineers working on natural language processing (NLP), automatic speech recognition (ASR), and large language model (LLM) fine-tuning for low-resource languages.

This catalog aims to bridge the language data gap by licensing data from broadcasters, publishers, government institutions, and community organizations for non-commercial use.

🎯 Purpose

Provide a single point of discovery for available proprietary low-resource language data that is suitable for ML model training, so researchers don't have to hunt across disparate sources.

📋 Data Types

Audio (radio broadcasts, podcasts, recorded surveys), text (news articles, books, educational materials), and structured transcriptions paired with audio for ASR training.

🔑 Access

All metadata is freely browsable. Data downloads require registration and acceptance of the data use agreement -- see licensing restrictions on the "license" tab.

🤝 Contributing

We welcome new data contributions. If your organization produces low-resource language content and would like to make it available for research, please contact us to discuss ingestion and licensing.

📖 Citation

If you use data from this catalog in your research, please cite the catalog and the individual data providers. Citation formats are available on each collection's detail page.

⚖️ Ethics & Privacy

Household survey recordings have been reviewed for PII and anonymized. All data is shared under terms agreed upon with the originating institutions. Researchers must adhere to the data use agreement.

Apply to Access Datasets

Researchers, academic institutions, and organizations working on AI for social good can apply for access to these datasets. The application process involves submitting a research proposal, demonstrating ethical data handling practices, and committing to open-source publication of derived insights. We prioritize applications that show potential for significant impact on global development challenges and align with the Gates Foundation's mission.

Submit Application

Applications are reviewed on a rolling basis. Typical response time is 2–3 weeks.

Gates Foundation Supported Datasets for AI Model Training