data/sample_pings_dc.parquet — synthetic D.C. ping sampleThis is a synthetic dataset. It exists solely to give workshop attendees a small, license-safe, deterministic stand-in for an Irys or Quadrant feed so they can run the
pingkitnotebooks on a free Codespace without requesting real vendor data. Every record was generated byscripts/generate_sample.pywith a fixed random seed. No real person, device, or trajectory is represented.
Approximately 1.9 M ping records for 10 000 simulated mobile devices moving around Washington, D.C. between Monday 1 June 2026 00:00 UTC and Sunday 7 June 2026 23:59 UTC. The shape of the data — schema, value ranges, accuracy distribution, weekday commute peaks, weekend leisure trips, commutes that concentrate on a handful of employment centres, and three real SDK-panel pathologies it is explicitly calibrated to reproduce. It was designed to mirror what an attendee would see in a real Irys or Quadrant export:
The calibration target was a real Quadrant daily feed (Bangkok, 2024-01-01). The values themselves are not real.
| Property | Value |
|---|---|
| Rows | ~1.9 M |
| Devices | 10 000 |
| Window | 2026-06-01 00:00 UTC → 2026-06-07 23:59 UTC (7 days) |
| Geographic extent | 38.79 – 38.99 N, -77.12 – -76.91 W (D.C. bounding box) |
| File size | ~47 MB (Parquet, zstd) |
| Sources | "irys" and "quadrant", ~50 / 50 split |
| Pings per device | heavy-tailed: median ≈ 70, p90 ≈ 570, p99 ≈ 1 600, max ≈ 3 100; Gini ≈ 0.71; ~7% single-ping, ~29% ≤5 |
| Inter-ping gap | bursty: median ≈ 0.6 min, ~57% of gaps < 1 min, long tail of multi-minute silences |
| Seed | 20260605 (default in the generator) |
The canonical workshop schema is the seven columns below. They are a deliberately compact subset of what Irys and Quadrant deliver; we picked the fields that every transport workflow needs and named them vendor-agnostically. The mapping back to each vendor’s own column names is in the next section.
| column | dtype | description |
|---|---|---|
device_id |
string | Stable pseudonymous device hash. 16 hex chars, prefixed ir_ or qd_ by source. |
timestamp_utc |
int64 (ms epoch) | Event time in milliseconds since Unix epoch, UTC. |
timestamp_iso |
string (ISO 8601) | Same instant as timestamp_utc, formatted YYYY-MM-DDTHH:MM:SSZ for readability. |
lat |
float64 | WGS-84 latitude, after horizontal-error noise. |
lon |
float64 | WGS-84 longitude, after horizontal-error noise. |
horizontal_accuracy_m |
float32 | Reported horizontal accuracy in metres (1-σ proxy). |
source |
category | "irys" or "quadrant" — which vendor this row would have come from. |
Records are sorted by (device_id, timestamp_utc) so per-device timestamps
are monotonic non-decreasing.
Each of the 10 000 devices is given a home cell sampled from D.C. census
block groups with weights derived from the TIGER/Line 2020 land-area
distribution (smaller / denser block groups get more draws). With
70% probability the device is also given a work cell — but work is
not drawn from the residential surface. Instead it is assigned from a
short list of 14 D.C. employment centres (downtown / K Street, Federal
Triangle, Capitol Hill, Foggy Bottom, L’Enfant Plaza, Navy Yard, Union
Station / NoMa, Dupont, Georgetown, Columbia Heights, Friendship Heights,
Van Ness, St. Elizabeths / DHS) using a gravity model: the probability
of a given centre is proportional to its job weight divided by
(distance_from_home + 1 km)^2. Real commutes have strong distance decay
and concentrate on a few job hubs, so this both adds realism and correlates
home with work — which is what lets the OD matrix aggregate above a
k-anonymity threshold instead of dissolving into singletons. Devices without
a work cell still emit pings — errands, walks, social trips — centred on home.
Each device is also assigned a heavy-tailed activity profile so the panel looks like a real one rather than 10 000 identical participants. The class sets how many of the 7 days the device is present and a weekly ping budget drawn from a log-normal:
| Class | Share | Days present | Weekly ping budget (median) | Effect |
|---|---|---|---|---|
| transient | 20% | 1 | ~2 (1–5) | the sparse tail — single-ping and ≤5-ping devices |
| occasional | 28% | 1–3 | ~7 (2–30) | sporadic |
| regular | 37% | 5–7 | ~170 (60–450) | the backbone — home/work inferable |
| power | 15% | 7 | ~680 (250–2 700) | power users — dominate ping volume |
The budget is enforced by thinning: the device’s full bursty week is simulated, then whole bursts are dropped at random until it lands on its budget. Dropping whole bursts — rather than individual pings — preserves the sub-minute within-burst cadence, so a heavily thinned device looks like one or two tight clusters of sightings, not an evenly spaced handful. The result is the real panel shape: a sparse tail (~7% single-ping devices, ~29% ≤5), heavy skew (top 10% of devices hold ~52% of pings; Gini ≈ 0.71; max ≈ 3 100 vs. a median of ≈ 70), and a bursty cadence (median gap ≈ 0.6 min; ~57% of consecutive gaps < 1 min). These targets are calibrated against a real Quadrant daily feed (Bangkok, 2024-01-01). The class label is deliberately not written to the file: analysts must discover the imbalance from the data, exactly as with a real feed.
The 7-day per-device schedule is then simulated (only on the device’s active days):
Within every segment above, pings are not evenly spaced. They are emitted in
bursts — an app wakes, fires a handful of pings a few seconds apart
(~15 s within a burst), then goes quiet for minutes (mean ≈ 3 min between bursts
while moving, ≈ 6 min while idle); each burst holds 1 + Poisson(2.2) pings.
This is the event-triggered SDK signature: most consecutive gaps are sub-minute,
with a long tail of multi-minute silences. The per-device thinning that enforces
the ping budget then drops whole bursts, which lengthens the silences for
low-activity devices without erasing the sub-minute within-burst structure.
A horizontal-accuracy noise term is then applied to every row. The base
σ is 8 m at open sky and 25 m inside a coarse downtown
“urban-canyon” box (38.89 – 38.91 N, −77.04 – −77.02 W). The reported
horizontal_accuracy_m is the base σ jittered by a small log-normal so it
does not look unrealistically uniform.
The whole pipeline draws from a single numpy.random.default_rng(seed). No
clock, no random, no thread-local state. Running the script twice with the
same seed and --n-devices produces identical bytes.
| Parameter | Value |
|---|---|
| Seed | 20260605 |
| Devices | 10 000 |
| Week | 2026-06-01 → 2026-06-07 UTC |
| Work-cell probability | 0.70 |
| Work assignment | gravity over 14 employment centres |
| Gravity exponent / softening | α = 2.0, d₀ = 1 km |
| Employment-centre jitter σ | 120 m |
| Activity classes | transient 20% / occasional 28% / regular 37% / power 15% |
| Weekly ping budget, median | ~2 / ~7 / ~170 / ~680 (lognormal) |
| Days present | 1 / 1–3 / 5–7 / 7 |
| Midday-hop probability | 0.20 (of work-having devices) |
| Inter-burst gap, moving | 3 min (Exponential) |
| Inter-burst gap, stationary | 6 min (Exponential) |
| Within-burst gap | 0.25 min (~15 s, Exponential) |
| Burst size | 1 + Poisson(2.2) (mean ≈ 3.2 pings) |
| Path jitter σ | 10 m |
| Accuracy σ, open sky | 8 m |
| Accuracy σ, urban canyon | 25 m |
| Indoor drop fraction, dwell > 2h | 0.5 (0.6–0.7 on weekend evenings) |
| Commute speed | 22 km/h |
| Weekend / errand speed | 18 km/h |
| AM departure | 07:30 local, σ = 45 min |
| PM departure | 17:30 local, σ = 60 min |
| Source split | ~50 / 50 ("irys" / "quadrant") |
| Local timezone offset | UTC-4 (EDT, fixed for this window) |
| Parquet compression | zstd |
Re-running the generator with the defaults should produce output close to these targets. Small variation is fine as the simulation is stochastic.
| Check | Expected |
|---|---|
| Row count | ~1.8 – 2.1 M |
| Distinct devices present | ~9 990 of 10 000 (a few are thinned to nothing) |
| Pings per device (median) | ~70 |
| Pings per device (heavy tail) | p90 ≈ 570, p99 ≈ 1 600, max ≈ 3 100; Gini ≈ 0.71 |
| Sparse tail | ~7% single-ping devices, ~29% ≤5 pings |
| Inter-ping gap (bursty) | median ≈ 0.6 min; ~57% of consecutive gaps < 1 min |
timestamp_utc range |
2026-06-01 00:00 UTC → 2026-06-07 23:59 UTC (no spillover) |
| Lat range | inside 38.79 – 38.99 |
| Lon range | inside -77.12 – -76.91 |
| Per-device timestamps | monotonically non-decreasing |
| Source split | ~50 / 50 |
| Median accuracy (open-sky rows) | ~8 m |
| Median accuracy (urban-canyon rows) | ~25 m |
| Devices with inferable home cell | ~52% of present (rest too sparse to locate) |
| Devices with inferable work cell | ~47% of present (~90% of home-devices) |
| Non-trivial OD flows at k=5 / k=15 | ~150 / ~17 (illustrates k-anonymity suppression) |
| File size | < 100 MB; ~47 MB with defaults |
The generator prints all of the above in its summary block, so you can spot a regression immediately.
From the repo root, in the configured devcontainer / Codespace:
python scripts/generate_sample.py
Optional flags:
python scripts/generate_sample.py \
--output data/sample_pings_dc.parquet \
--n-devices 10000 \
--seed 20260605
The script writes the boundary cache to _planning/dc_boundaries.geojson
(gitignored) the first time it runs; subsequent runs are fully offline and
take roughly 30–45 seconds on a free 2-core Codespace.
If TIGER is unreachable, the script logs a warning and falls back to ward-level centroids (see Provenance below). The output schema is identical; only the spatial pattern of home / work points becomes blockier.
horizontal_accuracy and
~1–5% exact duplicates. Those two pathologies are not modelled here, so the
dedupe / missingness QC steps run on essentially clean input (they are taught
for when you graduate to a real feed).consent field. We
intentionally omit it so the workshop can demonstrate why that matters in
the privacy section.tl_2020_11_bg.zip, ≈540 KB). Source URL:
https://www2.census.gov/geo/tiger/TIGER2020/BG/tl_2020_11_bg.zip. Public
domain (works of the U.S. federal government).ALAND (land area, square metres) so the script does not
need a second download (and an API key) for ACS population. This is a
pragmatic proxy that biases draws toward smaller, denser block groups; it
is good enough to produce a plausible spatial pattern for a workshop demo
but is not an accurate population surface.DC_WARD_FALLBACK in
scripts/generate_sample.py). The home / work pattern becomes blockier
but the schema and statistics remain valid.LEISURE_POIS in the generator. Coordinates approximate; no real
foot-traffic data was used.EMPLOYMENT_CENTERS with rough relative job weights (downtown / federal
core largest). Coordinates are approximate landmark centroids; the weights
are illustrative, not derived from LEHD LODES or any employment dataset.
Workplaces are assigned to these centres by the gravity model described
above.This file is synthetic and contains no real personal data. It is
licensed under the same Mozilla Public License 2.0 (MPL-2.0) as the rest of
the pingkit repository and is safe to share for training, demos, and
testing. It is not suitable input for any analysis whose results will
be cited, published, or used to make programmatic, operational, or policy
decisions. Real Irys and Quadrant data must be obtained through the
Development Data Partnership Portal under the existing master licences
(see Subject Brief §1–2).
Even though this dataset is synthetic, it preserves the shape of a real ping feed — including the property that four spatio-temporal points are enough to single out almost any individual trajectory in a population of millions (de Montjoye et al. 2013, Subject Brief §4). Treat it as if it were real when practising privacy techniques.