DDPV

Development Data Partnership Vocabulary

A custom metadata vocabulary for documenting low-resource language audio, video, and text datasets in AI/ML contexts, with specialized terms for PII screening, sensitive content handling, participant demographics, and quality metrics.

Overview

The Development Data Partnership Vocabulary (DDPV) defines specialized metadata terms that extend standard schemas for documenting datasets in low-resource language AI libraries. The vocabulary was developed to support the Gates Foundation-funded initiative to democratize access to high-quality datasets for low-resource language AI model training.

Namespace Declaration
@prefix ddpv: <https://datapartnership.org/ddvp-metadata-terms#> .

Version: 1.0

Date Published: January 15, 2025

License: CC BY 4.0

Status: Testing
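
The namespace declared above is what turns a compact name such as ddpv:piiScreening into the full term URI listed with each entry below. As a minimal sketch (the helper name expand_term and the PREFIXES table are illustrative and not part of the vocabulary), the expansion can be written as:

```python
# Namespace IRI copied verbatim from the namespace declaration.
DDPV_NS = "https://datapartnership.org/ddvp-metadata-terms#"

# Hypothetical prefix table; only the ddpv entry comes from this document.
PREFIXES = {"ddpv": DDPV_NS}


def expand_term(compact: str, prefixes: dict[str, str]) -> str:
    """Expand a compact name like 'ddpv:piiScreening' into a full IRI."""
    prefix, sep, local = compact.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {compact!r}")
    return prefixes[prefix] + local


print(expand_term("ddpv:piiScreening", PREFIXES))
```

A JSON-LD processor performs the same expansion when the prefix is mapped in the document's @context.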

Purpose and Scope

DDPV provides terms for:

  • Data Governance: PII screening methods, sensitive content classification, attribution requirements
  • Participant Metadata: Demographics, age ranges, dialect/regional variations, roles
  • Technical Quality: Equipment specifications, aggregation types, file counting
  • Multi-Modal Support: Audio, video, and text-specific metadata

Standards Compatibility

DDPV extends and works alongside standards such as:

  • Croissant 1.0 (MLCommons) - Core dataset structure
  • EBUCore - Technical audio/video metadata
  • OLAC - Language archive metadata
  • Schema.org - General web semantics
  • RAI (MLCommons) - Responsible AI properties
  • DQV - Data quality vocabulary

Vocabulary Terms

Data Governance Terms

PII Screening

URI: https://datapartnership.org/ddvp-metadata-terms#piiScreening

Term:
ddpv:piiScreening
Definition:
Indicates whether any file in the dataset contains personally identifiable information (PII). Set to true if PII is present; false otherwise.
Type:
Property
Domain:
sc:Dataset
Range:
xsd:boolean
Status:
Stable
See also:
ddpv:piiScreeningMethod, ddpv:piiNotes
Example Usage
{
  "ddpv:piiScreening": true
}

PII Screening Method

URI: https://datapartnership.org/ddvp-metadata-terms#piiScreeningMethod

Term:
ddpv:piiScreeningMethod
Definition:
Method(s) applied to identify, remove, or mask personally identifiable information in the dataset.
Type:
Property
Domain:
sc:Dataset
Range:
One or more of: "manual_review", "automated_ner", "audio_redaction", "video_redaction", "hybrid", "other"
Status:
Stable
Note:
If "other" is specified, use ddpv:piiScreeningMethodOther to provide details.
Example Usage
{
  "ddpv:piiScreeningMethod": [
    "automated_ner",
    "manual_review"
  ]
}

PII Screening Method (Other)

URI: https://datapartnership.org/ddvp-metadata-terms#piiScreeningMethodOther

Term:
ddpv:piiScreeningMethodOther
Definition:
Free-text description of the PII screening method when "other" is selected in ddpv:piiScreeningMethod.
Type:
Property
Domain:
sc:Dataset
Range:
xsd:string
Status:
Stable
Example Usage
{
  "ddpv:piiScreeningMethod": ["other"],
  "ddpv:piiScreeningMethodOther": "Custom facial recognition masking"
}
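
The two PII screening method terms depend on each other: the method values come from a fixed list, and "other" calls for a free-text description. A minimal sketch of a consistency check follows; the function name check_pii_methods is hypothetical, and treating a missing description as a problem is one reading of the note above, not a rule the vocabulary enforces.

```python
# Allowed values copied from the Range of ddpv:piiScreeningMethod.
ALLOWED_METHODS = {
    "manual_review", "automated_ner", "audio_redaction",
    "video_redaction", "hybrid", "other",
}


def check_pii_methods(record: dict) -> list[str]:
    """Return a list of problems found in the PII screening fields."""
    problems = []
    methods = record.get("ddpv:piiScreeningMethod", [])
    if isinstance(methods, str):  # tolerate a single value
        methods = [methods]
    for method in methods:
        if method not in ALLOWED_METHODS:
            problems.append(f"unknown method: {method}")
    # Assumption: "other" without a description counts as a problem.
    if "other" in methods and not record.get("ddpv:piiScreeningMethodOther"):
        problems.append("'other' given without ddpv:piiScreeningMethodOther")
    return problems


record = {
    "ddpv:piiScreeningMethod": ["other"],
    "ddpv:piiScreeningMethodOther": "Custom facial recognition masking",
}
print(check_pii_methods(record))
```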

PII Notes

URI: https://datapartnership.org/ddvp-metadata-terms#piiNotes

Term:
ddpv:piiNotes
Definition:
Short note describing the type of personally identifiable information present in the dataset, if any (e.g., "Names and phone numbers in transcripts").
Type:
Property
Domain:
sc:Dataset
Range:
xsd:string
Status:
Stable
Example Usage
{
  "ddpv:piiNotes": "Participant names mentioned in recordings"
}

Sensitive Content

URI: https://datapartnership.org/ddvp-metadata-terms#sensitiveContent

Term:
ddpv:sensitiveContent
Definition:
Indicates whether the dataset contains sensitive, restricted, classified, or potentially harmful content. Set to true if such content is present; false otherwise.
Type:
Property
Domain:
sc:Dataset
Range:
xsd:boolean
Status:
Stable
See also:
ddpv:sensitiveNotes
Example Usage
{
  "ddpv:sensitiveContent": true
}

Sensitive Content Notes

URI: https://datapartnership.org/ddvp-metadata-terms#sensitiveNotes

Term:
ddpv:sensitiveNotes
Definition:
Array of descriptions specifying the type(s) of sensitive content present (e.g., "hate speech", "military data", "medical records").
Type:
Property
Domain:
sc:Dataset
Range:
Array of xsd:string
Status:
Stable
Example Usage
{
  "ddpv:sensitiveNotes": [
    "politically sensitive content",
    "religious themes"
  ]
}

Attribution Required

URI: https://datapartnership.org/ddvp-metadata-terms#attribution

Term:
ddpv:attribution
Definition:
Indicates whether users must credit the source when using the dataset. Default is true.
Type:
Property
Domain:
sc:Dataset
Range:
xsd:boolean
Status:
Stable
Example Usage
{
  "ddpv:attribution": true
}

Third-Party Restrictions

URI: https://datapartnership.org/ddvp-metadata-terms#thirdPartyRestrictions

Term:
ddpv:thirdPartyRestrictions
Definition:
Description of any third-party intellectual property rights or usage restrictions that apply to the dataset.
Type:
Property
Domain:
sc:Dataset
Range:
xsd:string
Status:
Stable
Example Usage
{
  "ddpv:thirdPartyRestrictions": "Music tracks subject to copyright"
}

Retention Policy

URI: https://datapartnership.org/ddvp-metadata-terms#retentionPolicy

Term:
ddpv:retentionPolicy
Definition:
Description of the data retention policy applicable to the dataset, including storage duration and deletion procedures.
Type:
Property
Domain:
sc:Dataset
Range:
xsd:string
Status:
Stable
Example Usage
{
  "ddpv:retentionPolicy": "Data retained for 5 years"
}

Participant Metadata Terms

Participants

URI: https://datapartnership.org/ddvp-metadata-terms#participants

Term:
ddpv:participants
Definition:
Array of participant information objects for audio/video files, including speakers, interviewers, and other contributors. Each participant object should include role and optional demographic information.
Type:
Property
Domain:
cr:FileObject (audio or video files)
Range:
Array of Participant objects
Status:
Stable
See also:
ddpv:ageRange, ddpv:dialectRegion
Example Usage
{
  "ddpv:participants": [
    {
      "olac:role": "speaker",
      "olac:code": "SPK001",
      "sc:gender": "female",
      "ddpv:ageRange": "26-35",
      "ddpv:dialectRegion": "Northern"
    }
  ]
}

Age Range

URI: https://datapartnership.org/ddvp-metadata-terms#ageRange

Term:
ddpv:ageRange
Definition:
Age range bucket for a participant (e.g., "18-25", "26-35", "36-50"). Used within participant metadata to categorize speakers/contributors by age group.
Type:
Property
Domain:
Participant object within ddpv:participants
Range:
xsd:string
Status:
Stable
Recommended Values:
"0-12", "13-17", "18-25", "26-35", "36-50", "51-65", "65+"
Example Usage
{
  "ddpv:ageRange": "26-35"
}

Dialect Region

URI: https://datapartnership.org/ddvp-metadata-terms#dialectRegion

Term:
ddpv:dialectRegion
Definition:
Dialect or regional variation descriptor for a participant's speech (e.g., "Urban", "Coastal", "Northern", "Southern"). Used to document linguistic variation within a language.
Type:
Property
Domain:
Participant object within ddpv:participants
Range:
xsd:string
Status:
Stable
Example Usage
{
  "ddpv:dialectRegion": "Central"
}
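
Participant objects combine a role with optional demographic fields, and ddpv:ageRange has a list of recommended (not mandatory) values. The sketch below flags participant objects that lack a role or use an age bucket outside the recommended list. The function name check_participant is hypothetical, and because the values are only recommended, the results are treated as warnings rather than errors.

```python
# Recommended buckets copied from the ddpv:ageRange entry.
RECOMMENDED_AGE_RANGES = (
    "0-12", "13-17", "18-25", "26-35", "36-50", "51-65", "65+",
)


def check_participant(participant: dict) -> list[str]:
    """Return warnings for one object in ddpv:participants."""
    warnings = []
    if "olac:role" not in participant:
        warnings.append("missing olac:role")
    age = participant.get("ddpv:ageRange")
    if age is not None and age not in RECOMMENDED_AGE_RANGES:
        warnings.append(f"ddpv:ageRange {age!r} is not a recommended value")
    return warnings


speaker = {
    "olac:role": "speaker",
    "olac:code": "SPK001",
    "ddpv:ageRange": "26-35",
    "ddpv:dialectRegion": "Northern",
}
print(check_participant(speaker))
```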

General Technical Quality Terms

Equipment Type

URI: https://datapartnership.org/ddvp-metadata-terms#equipmentType

Term:
ddpv:equipmentType
Definition:
Recording device, microphone model, or camera model used to capture the media file (free text description).
Type:
Property
Domain:
cr:FileObject (audio or video files)
Range:
xsd:string
Status:
Stable
Example Usage
{
  "ddpv:equipmentType": "Zoom H6 with Sennheiser MKH 416"
}

Aggregation Type

URI: https://datapartnership.org/ddvp-metadata-terms#aggregationType

Term:
ddpv:aggregationType
Definition:
Method used to derive dataset-level quality metric values from file-level measurements (e.g., "mean", "median", "sum", "weighted_mean").
Type:
Property
Domain:
dqv:QualityMeasurement
Range:
One of: "mean", "median", "sum", "min", "max", "count", "weighted_mean", "percentage"
Status:
Stable
Example Usage
{
  "@type": "dqv:QualityMeasurement",
  "dqv:isMeasurementOf": "ebucore:duration",
  "dqv:value": 145.3,
  "schema:unitText": "seconds",
  "ddpv:aggregationType": "mean"
}
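
To illustrate how an aggregation type relates file-level values to a dataset-level measurement, the sketch below derives a dqv:QualityMeasurement object from a list of per-file values. The helper make_measurement and the sample durations are hypothetical. "weighted_mean" is left out because this document does not say which weights apply.

```python
import statistics

# Map aggregation names from the ddpv:aggregationType range to functions.
AGGREGATORS = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "sum": sum,
    "min": min,
    "max": max,
    "count": len,
}


def make_measurement(dimension: str, values: list, aggregation: str, unit: str) -> dict:
    """Aggregate file-level values into one dataset-level measurement."""
    if not values:
        raise ValueError("at least one file-level value is required")
    return {
        "@type": "dqv:QualityMeasurement",
        "dqv:isMeasurementOf": dimension,
        "dqv:value": AGGREGATORS[aggregation](values),
        "schema:unitText": unit,
        "ddpv:aggregationType": aggregation,
    }


# Illustrative per-file durations in seconds (made-up sample values).
durations = [120.0, 150.0, 180.0]
print(make_measurement("ebucore:duration", durations, "mean", "seconds"))
```

Recording the aggregation type next to the value lets a consumer tell a mean duration from a total duration without consulting external documentation.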

Number of Files

URI: https://datapartnership.org/ddvp-metadata-terms#NumFiles

Term:
ddpv:NumFiles
Definition:
Total count of files in the dataset. Used as a measurement dimension in dqv:QualityMeasurement to report dataset size.
Type:
Measurement Dimension
Domain:
dqv:isMeasurementOf
Range:
xsd:integer
Status:
Stable
Example Usage
{
  "@type": "dqv:QualityMeasurement",
  "dqv:isMeasurementOf": "ddpv:NumFiles",
  "dqv:value": 1250,
  "schema:unitText": "count",
  "ddpv:aggregationType": "count"
}

Text Dataset Terms

The following terms are specific to text datasets and provide metadata for documenting text files and corpus-level quality metrics.

File-Level Properties

Character Count

URI: https://datapartnership.org/ddvp-metadata-terms#charCount

Term:
ddpv:charCount
Definition:
Total number of characters in the text document.
Type:
Property
Domain:
cr:FileObject (text files)
Range:
xsd:integer (minimum: 0)
Status:
Stable
Example Usage
{
  "ddpv:charCount": 125847
}

Token Count

URI: https://datapartnership.org/ddvp-metadata-terms#tokenCount

Term:
ddpv:tokenCount
Definition:
Total number of tokens in the text document (tokenization method should be documented in dataset metadata).
Type:
Property
Domain:
cr:FileObject (text files)
Range:
xsd:integer (minimum: 0)
Status:
Stable
Example Usage
{
  "ddpv:tokenCount": 24563
}
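
As a rough sketch of how the two file-level counts might be produced, the function below counts characters and tokens for one document. Whitespace tokenization is an assumption made purely for illustration; the vocabulary leaves the tokenization method to the dataset metadata. The helper name text_counts is hypothetical.

```python
def text_counts(text: str) -> dict:
    """Compute ddpv:charCount and ddpv:tokenCount for one document."""
    return {
        # len() counts Unicode code points, not bytes or grapheme clusters.
        "ddpv:charCount": len(text),
        # Assumption: tokens are whitespace-separated words.
        "ddpv:tokenCount": len(text.split()),
    }


print(text_counts("habari ya asubuhi"))
```

A subword tokenizer would give a different ddpv:tokenCount for the same text, which is why the method needs to be documented alongside the counts.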

Quality Metrics

URI: https://datapartnership.org/ddvp-metadata-terms#qualityMetrics

Term:
ddpv:qualityMetrics
Definition:
Object containing quality metrics for this text document.
Type:
Property
Domain:
cr:FileObject (text files)
Range:
Object
Status:
Stable
Properties:
  • perplexityScore: Perplexity score (lower = more natural text)
Example Usage
{
  "ddpv:qualityMetrics": {
    "perplexityScore": 142.3
  }
}
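
This document does not specify how perplexityScore is computed. Under the common definition, perplexity is the exponential of the negative mean per-token log-probability assigned by a language model. The sketch below assumes that definition and assumes the log-probabilities (natural log) are already available from some model; the function name perplexity is illustrative.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    mean_log_prob = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-mean_log_prob)


# A model that gives every token probability 0.25 has perplexity of about 4.
score = perplexity([math.log(0.25)] * 4)
print({"ddpv:qualityMetrics": {"perplexityScore": round(score, 1)}})
```

Scores are only comparable when they come from the same model and tokenizer.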

Dataset-Level Quality Measurements for Text

These terms are used as values for dqv:isMeasurementOf in dataset-level quality measurements.

Total Tokens

URI: https://datapartnership.org/ddvp-metadata-terms#totalTokens

Term:
ddpv:totalTokens
Definition:
Aggregate count of tokens across all documents in the dataset.
Type:
Measurement Dimension
Domain:
dqv:isMeasurementOf
Range:
xsd:number
Status:
Stable
Example Usage
{
  "@type": "dqv:QualityMeasurement",
  "dqv:isMeasurementOf": "ddpv:totalTokens",
  "dqv:value": 12584320,
  "schema:unitText": "tokens",
  "ddpv:aggregationType": "sum"
}

Vocabulary Size

URI: https://datapartnership.org/ddvp-metadata-terms#vocabularySize

Term:
ddpv:vocabularySize
Definition:
Number of unique tokens (vocabulary) in the dataset.
Type:
Measurement Dimension
Domain:
dqv:isMeasurementOf
Range:
xsd:number
Status:
Stable
Example Usage
{
  "@type": "dqv:QualityMeasurement",
  "dqv:isMeasurementOf": "ddpv:vocabularySize",
  "dqv:value": 98540,
  "schema:unitText": "unique tokens",
  "ddpv:aggregationType": "count"
}

Average Characters Per Document

URI: https://datapartnership.org/ddvp-metadata-terms#avgCharsPerDoc

Term:
ddpv:avgCharsPerDoc
Definition:
Average number of characters per document in the dataset.
Type:
Measurement Dimension
Domain:
dqv:isMeasurementOf
Range:
xsd:number
Status:
Stable
Example Usage
{
  "@type": "dqv:QualityMeasurement",
  "dqv:isMeasurementOf": "ddpv:avgCharsPerDoc",
  "dqv:value": 3542.8,
  "schema:unitText": "characters",
  "ddpv:aggregationType": "mean"
}

Average Tokens Per Document

URI: https://datapartnership.org/ddvp-metadata-terms#avgTokensPerDoc

Term:
ddpv:avgTokensPerDoc
Definition:
Average number of tokens per document in the dataset.
Type:
Measurement Dimension
Domain:
dqv:isMeasurementOf
Range:
xsd:number
Status:
Stable
Example Usage
{
  "@type": "dqv:QualityMeasurement",
  "dqv:isMeasurementOf": "ddpv:avgTokensPerDoc",
  "dqv:value": 1245.6,
  "schema:unitText": "tokens",
  "ddpv:aggregationType": "mean"
}
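
The four corpus-level dimensions above (total tokens, vocabulary size, and the two per-document averages) can all be derived from the documents themselves. The sketch below assumes whitespace tokenization and case-sensitive token identity, neither of which is prescribed by the vocabulary. The function name corpus_measurements is hypothetical.

```python
def corpus_measurements(docs: list[str]) -> dict:
    """Compute dataset-level values for the text measurement dimensions."""
    if not docs:
        raise ValueError("the corpus must contain at least one document")
    # Assumption: whitespace tokenization, case-sensitive tokens.
    token_lists = [doc.split() for doc in docs]
    total_tokens = sum(len(tokens) for tokens in token_lists)
    vocabulary = {token for tokens in token_lists for token in tokens}
    return {
        "ddpv:totalTokens": total_tokens,
        "ddpv:vocabularySize": len(vocabulary),
        "ddpv:avgCharsPerDoc": sum(len(doc) for doc in docs) / len(docs),
        "ddpv:avgTokensPerDoc": total_tokens / len(docs),
    }


docs = ["a b c", "a b", "d"]
print(corpus_measurements(docs))
```

Each value would then be wrapped in its own dqv:QualityMeasurement object, as in the examples above.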

NER Coverage

URI: https://datapartnership.org/ddvp-metadata-terms#NER_Coverage

Term:
ddpv:NER_Coverage
Definition:
Percentage of documents in the dataset that include Named Entity Recognition (NER) annotations.
Type:
Measurement Dimension
Domain:
dqv:isMeasurementOf
Range:
xsd:number (0-100)
Status:
Stable
Example Usage
{
  "@type": "dqv:QualityMeasurement",
  "dqv:isMeasurementOf": "ddpv:NER_Coverage",
  "dqv:value": 78.5,
  "schema:unitText": "percentage",
  "ddpv:aggregationType": "percentage"
}
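
Following the definition above, NER coverage is the share of documents that carry NER annotations, expressed on a 0-100 scale. A minimal sketch follows, assuming each document has already been reduced to a boolean flag; how that flag is determined is outside this vocabulary, and the function name ner_coverage is hypothetical.

```python
def ner_coverage(has_ner: list[bool]) -> float:
    """Percentage (0-100) of documents that include NER annotations."""
    if not has_ner:
        raise ValueError("the dataset must contain at least one document")
    return 100.0 * sum(has_ner) / len(has_ner)


# Three of four illustrative documents are annotated.
print(ner_coverage([True, True, True, False]))
```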