pingkit

data/sample_pings_dc.parquet — synthetic D.C. ping sample

This is a synthetic dataset. It exists solely to give workshop attendees a small, license-safe, deterministic stand-in for an Irys or Quadrant feed so they can run the pingkit notebooks on a free Codespace without requesting real vendor data. Every record was generated by scripts/generate_sample.py with a fixed random seed. No real person, device, or trajectory is represented.

Overview

Approximately 1.9 M ping records for 10 000 simulated mobile devices moving around Washington, D.C. between Monday 1 June 2026 00:00 UTC and Sunday 7 June 2026 23:59 UTC. The shape of the data was designed to mirror what an attendee would see in a real Irys or Quadrant export: schema, value ranges, accuracy distribution, weekday commute peaks, weekend leisure trips, commutes that concentrate on a handful of employment centres, and the three real SDK-panel pathologies it is explicitly calibrated to reproduce (a sparse tail of barely-seen devices, heavy skew in pings per device, and a bursty ping cadence).

The calibration target was a real Quadrant daily feed (Bangkok, 2024-01-01). The values themselves are not real.

| Property | Value |
| --- | --- |
| Rows | ~1.9 M |
| Devices | 10 000 |
| Window | 2026-06-01 00:00 UTC → 2026-06-07 23:59 UTC (7 days) |
| Geographic extent | latitude 38.79 to 38.99, longitude -77.12 to -76.91, in decimal degrees (D.C. bounding box) |
| File size | ~47 MB (Parquet, zstd) |
| Sources | "irys" and "quadrant", ~50 / 50 split |
| Pings per device | heavy-tailed: median ≈ 70, p90 ≈ 570, p99 ≈ 1 600, max ≈ 3 100; Gini ≈ 0.71; ~7% single-ping, ~29% ≤5 |
| Inter-ping gap | bursty: median ≈ 0.6 min, ~57% of gaps < 1 min, long tail of multi-minute silences |
| Seed | 20260605 (default in the generator) |

Schema

The canonical workshop schema is the seven columns below. They are a deliberately compact subset of what Irys and Quadrant deliver; we picked the fields that every transport workflow needs and named them vendor-agnostically. The mapping back to each vendor’s own column names is in the next section.

| column | dtype | description |
| --- | --- | --- |
| device_id | string | Stable pseudonymous device hash. 16 hex chars, prefixed ir_ or qd_ by source. |
| timestamp_utc | int64 (ms epoch) | Event time in milliseconds since Unix epoch, UTC. |
| timestamp_iso | string (ISO 8601) | Same instant as timestamp_utc, formatted YYYY-MM-DDTHH:MM:SSZ for readability. |
| lat | float64 | WGS-84 latitude, after horizontal-error noise. |
| lon | float64 | WGS-84 longitude, after horizontal-error noise. |
| horizontal_accuracy_m | float32 | Reported horizontal accuracy in metres (1-σ proxy). |
| source | category | "irys" or "quadrant": which vendor this row would have come from. |

Records are sorted by (device_id, timestamp_utc) so per-device timestamps are monotonic non-decreasing.
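The two timestamp columns and the sort guarantee can be sanity-checked with a few lines of standard-library Python. This is a minimal sketch, not part of pingkit: the helper names ms_to_iso and is_sorted_per_device are hypothetical, and the rows are assumed to have already been read into (device_id, timestamp_utc) tuples in file order.

```python
from datetime import datetime, timezone


def ms_to_iso(ms: int) -> str:
    """Format a millisecond Unix epoch as YYYY-MM-DDTHH:MM:SSZ (UTC)."""
    # Integer division drops the sub-second part, matching the second-level
    # resolution of the timestamp_iso column.
    dt = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


def is_sorted_per_device(rows) -> bool:
    """True if (device_id, timestamp_utc) tuples are in non-decreasing order.

    Tuple comparison is lexicographic, so this checks the sort key
    (device_id, timestamp_utc) in a single pass over adjacent pairs.
    """
    rows = list(rows)
    return all(a <= b for a, b in zip(rows, rows[1:]))


# Start of the sample window, expressed as a millisecond epoch.
window_start_ms = int(datetime(2026, 6, 1, tzinfo=timezone.utc).timestamp() * 1000)
print(ms_to_iso(window_start_ms))
```

Equal timestamps within a device pass the check, which is what "monotonic non-decreasing" allows.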

Generation method

Each of the 10 000 devices is given a home cell sampled from D.C. census block groups, with weights derived from the TIGER/Line 2020 land-area distribution (smaller, denser block groups get more draws). With 70% probability the device is also given a work cell, but work is not drawn from the residential surface. Instead it is assigned from a short list of 14 D.C. employment centres (including downtown / K Street, Federal Triangle, Capitol Hill, Foggy Bottom, L’Enfant Plaza, Navy Yard, Union Station / NoMa, Dupont, Georgetown, Columbia Heights, Friendship Heights, Van Ness, St. Elizabeths / DHS) using a gravity model: the probability of a given centre is proportional to its job weight divided by (distance_from_home + 1 km)^2. Real commutes have strong distance decay and concentrate on a few job hubs, so this both adds realism and correlates home with work, which is what lets the OD matrix aggregate above a k-anonymity threshold instead of dissolving into singletons. Devices without a work cell still emit pings (errands, walks, social trips) centred on home.
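The gravity assignment described above can be sketched as follows. This is an illustration under stated assumptions, not the code in scripts/generate_sample.py: the centre coordinates and job weights are made-up placeholders, distance is taken as great-circle (haversine) distance, and the function names are hypothetical. Only the functional form, weight / (distance + 1 km)^2, comes from the text.

```python
import math
import random


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS-84 points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def work_probabilities(home, centres, alpha=2.0, d0_km=1.0):
    """P(centre) proportional to job_weight / (distance_from_home + d0)^alpha."""
    raw = {
        name: weight / (haversine_km(home[0], home[1], lat, lon) + d0_km) ** alpha
        for name, (lat, lon, weight) in centres.items()
    }
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


# Placeholder centres: (lat, lon, job_weight). Values are illustrative only.
CENTRES = {
    "Downtown / K Street": (38.9025, -77.0390, 10.0),
    "Navy Yard": (38.8765, -77.0030, 4.0),
    "Friendship Heights": (38.9600, -77.0850, 2.0),
}

home = (38.9280, -77.0320)  # an arbitrary home point inside the bounding box
probs = work_probabilities(home, CENTRES)

# Draw one work centre for this device from the gravity probabilities.
rng = random.Random(20260605)
work = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

The 1 km softening term keeps the weight finite for a device whose home sits on top of a centre, and it flattens the preference among centres that are all within walking distance.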

Each device is also assigned one of four activity classes, so the panel looks like a real one rather than 10 000 identical participants. The class sets how many of the 7 days the device is present and the median of the log-normal from which its weekly ping budget is drawn:

| Class | Share | Days present | Weekly ping budget (median) | Effect |
| --- | --- | --- | --- | --- |
| transient | 20% | 1 | ~2 (1–5) | the sparse tail: single-ping and ≤5-ping devices |
| occasional | 28% | 1–3 | ~7 (2–30) | sporadic |
| regular | 37% | 5–7 | ~170 (60–450) | the backbone: home/work inferable |
| power | 15% | 7 | ~680 (250–2 700) | power users who dominate ping volume |
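A rough sketch of how a device profile could be drawn from the class table. The shares, day ranges, and budget medians are taken from the table; everything else is an assumption made for illustration: the log-normal spread SIGMA is a hypothetical value, days present are drawn uniformly within the class range, and the bracketed budget ranges in the table are not enforced here because the text does not say whether they are clip limits or quantiles.

```python
import math
import random

# name: (share, (min_days, max_days), median_weekly_budget), from the class table
CLASSES = {
    "transient": (0.20, (1, 1), 2),
    "occasional": (0.28, (1, 3), 7),
    "regular": (0.37, (5, 7), 170),
    "power": (0.15, (7, 7), 680),
}
SIGMA = 0.8  # hypothetical log-normal spread; the real value is not documented here


def draw_profile(rng):
    """Return (class_name, days_present, weekly_ping_budget) for one device."""
    names = list(CLASSES)
    shares = [CLASSES[n][0] for n in names]
    name = rng.choices(names, weights=shares, k=1)[0]
    _, (lo, hi), median = CLASSES[name]
    days = rng.randint(lo, hi)  # inclusive on both ends (uniform is an assumption)
    # A log-normal with mu = ln(median) has that median by construction.
    budget = max(1, round(rng.lognormvariate(math.log(median), SIGMA)))
    return name, days, budget


rng = random.Random(20260605)
profiles = [draw_profile(rng) for _ in range(5000)]
```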

The budget is enforced by thinning: the device’s full bursty week is simulated, then whole bursts are dropped at random until it lands on its budget. Dropping whole bursts — rather than individual pings — preserves the sub-minute within-burst cadence, so a heavily thinned device looks like one or two tight clusters of sightings, not an evenly spaced handful. The result is the real panel shape: a sparse tail (~7% single-ping devices, ~29% ≤5), heavy skew (top 10% of devices hold ~52% of pings; Gini ≈ 0.71; max ≈ 3 100 vs. a median of ≈ 70), and a bursty cadence (median gap ≈ 0.6 min; ~57% of consecutive gaps < 1 min). These targets are calibrated against a real Quadrant daily feed (Bangkok, 2024-01-01). The class label is deliberately not written to the file: analysts must discover the imbalance from the data, exactly as with a real feed.
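The burst-level thinning can be sketched like this. It is one possible reading of "dropped at random until it lands on its budget", not the generator's actual code: here a uniformly random burst is removed while the ping total still exceeds the budget, so the result may undershoot the budget or, for very small budgets, end up empty. The name thin_to_budget is hypothetical.

```python
import random


def thin_to_budget(bursts, budget, rng):
    """Drop whole bursts at random until the total ping count is <= budget.

    bursts: list of bursts, each a list of ping times in minutes.
    Surviving bursts are returned intact and in their original order, so the
    sub-minute spacing inside each burst is untouched.
    """
    kept = list(bursts)
    while kept and sum(len(b) for b in kept) > budget:
        kept.pop(rng.randrange(len(kept)))
    return kept


week = [[0.0, 0.2, 0.5], [10.0, 10.3], [30.0], [45.0, 45.2, 45.4, 45.6]]
thinned = thin_to_budget(week, budget=4, rng=random.Random(1))
```

Thinning individual pings instead would stretch every gap and turn the within-burst cadence into an evenly spaced handful, which is the outcome the text says the generator avoids.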

The 7-day per-device schedule is then simulated (only on the device’s active days):

Within every segment above, pings are not evenly spaced. They are emitted in bursts — an app wakes, fires a handful of pings a few seconds apart (~15 s within a burst), then goes quiet for minutes (mean ≈ 3 min between bursts while moving, ≈ 6 min while idle); each burst holds 1 + Poisson(2.2) pings. This is the event-triggered SDK signature: most consecutive gaps are sub-minute, with a long tail of multi-minute silences. The per-device thinning that enforces the ping budget then drops whole bursts, which lengthens the silences for low-activity devices without erasing the sub-minute within-burst structure.
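The burst process can be illustrated with the parameters quoted above (exponential gaps with means of 3 or 6 min between bursts and 0.25 min within a burst, burst size 1 + Poisson(2.2)). This is a standard-library sketch rather than the generator's numpy implementation; the function names are hypothetical, and the Poisson draw uses Knuth's multiplication method because the random module has no Poisson sampler.

```python
import math
import random


def poisson(rng, lam):
    """Poisson(lam) draw via Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def emit_bursts(duration_min, moving, rng):
    """Ping times (minutes from segment start) for one segment."""
    mean_gap = 3.0 if moving else 6.0  # mean silence between bursts, minutes
    within_gap = 0.25                  # mean gap inside a burst (~15 s)
    pings = []
    t = rng.expovariate(1.0 / mean_gap)
    while t < duration_min:
        for i in range(1 + poisson(rng, 2.2)):
            if i > 0:
                t += rng.expovariate(1.0 / within_gap)
            if t >= duration_min:
                break  # never spill past the end of the segment
            pings.append(t)
        t += rng.expovariate(1.0 / mean_gap)
    return pings


pings = emit_bursts(600, moving=True, rng=random.Random(7))
gaps = [b - a for a, b in zip(pings, pings[1:])]
sub_minute_share = sum(g < 1.0 for g in gaps) / len(gaps)
```

Before thinning, the sub-minute share from this sketch sits above the ~57% quoted for the finished file, which is consistent with thinning lengthening the silences between surviving bursts.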

A horizontal-accuracy noise term is then applied to every row. The base σ is 8 m under open sky and 25 m inside a coarse downtown “urban-canyon” box (38.89 – 38.91 N, −77.04 – −77.02 W). The reported horizontal_accuracy_m is the base σ jittered by a small log-normal so it does not look unrealistically uniform.
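A minimal sketch of the noise step, under several assumptions that the text does not settle: the position error is modelled as independent Gaussian offsets north and east with standard deviation equal to the base σ, the canyon test uses the pre-noise position, metres are converted to degrees with a flat-earth approximation, and jitter_sigma is a hypothetical value for the "small log-normal". Function names are illustrative.

```python
import math
import random

# Urban-canyon box from the text: (lat_min, lat_max, lon_min, lon_max)
CANYON = (38.89, 38.91, -77.04, -77.02)
M_PER_DEG_LAT = 111_320.0  # approximate metres per degree of latitude


def base_sigma_m(lat, lon):
    """25 m inside the urban-canyon box, 8 m elsewhere."""
    lat_lo, lat_hi, lon_lo, lon_hi = CANYON
    in_canyon = lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi
    return 25.0 if in_canyon else 8.0


def add_noise(lat, lon, rng, jitter_sigma=0.15):
    """Return (noisy_lat, noisy_lon, reported_accuracy_m)."""
    sigma = base_sigma_m(lat, lon)
    north_m = rng.gauss(0.0, sigma)
    east_m = rng.gauss(0.0, sigma)
    noisy_lat = lat + north_m / M_PER_DEG_LAT
    # A degree of longitude shrinks with the cosine of latitude.
    noisy_lon = lon + east_m / (M_PER_DEG_LAT * math.cos(math.radians(lat)))
    # Median of the jitter factor is 1, so the median report equals sigma.
    reported = sigma * rng.lognormvariate(0.0, jitter_sigma)
    return noisy_lat, noisy_lon, reported


rng = random.Random(20260605)
noisy = add_noise(38.90, -77.03, rng)
```

Centring the log-normal at zero in log space keeps the median reported accuracy at the base σ, which lines up with the ~8 m and ~25 m medians listed under the validation expectations.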

The whole pipeline draws from a single numpy.random.default_rng(seed): no wall clock, no standard-library random module, no thread-local state. Running the script twice with the same --seed and --n-devices produces identical bytes.
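One way to confirm the byte-identical claim is to hash two outputs and compare the digests. The helper below is a generic sketch; sha256_of is a hypothetical name and the two file paths in the usage comment are placeholders, not paths the generator writes by default.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to keep memory flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (placeholder paths): generate twice with the same --seed and
# --n-devices into two different --output files, then compare.
# same = sha256_of("run_a.parquet") == sha256_of("run_b.parquet")
```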

Fixed parameters

| Parameter | Value |
| --- | --- |
| Seed | 20260605 |
| Devices | 10 000 |
| Week | 2026-06-01 → 2026-06-07 UTC |
| Work-cell probability | 0.70 |
| Work assignment | gravity over 14 employment centres |
| Gravity exponent / softening | α = 2.0, d₀ = 1 km |
| Employment-centre jitter σ | 120 m |
| Activity classes | transient 20% / occasional 28% / regular 37% / power 15% |
| Weekly ping budget, median | ~2 / ~7 / ~170 / ~680 (lognormal) |
| Days present | 1 / 1–3 / 5–7 / 7 |
| Midday-hop probability | 0.20 (of work-having devices) |
| Inter-burst gap, moving | 3 min (Exponential) |
| Inter-burst gap, stationary | 6 min (Exponential) |
| Within-burst gap | 0.25 min (~15 s, Exponential) |
| Burst size | 1 + Poisson(2.2) (mean ≈ 3.2 pings) |
| Path jitter σ | 10 m |
| Accuracy σ, open sky | 8 m |
| Accuracy σ, urban canyon | 25 m |
| Indoor drop fraction, dwell > 2h | 0.5 (0.6–0.7 on weekend evenings) |
| Commute speed | 22 km/h |
| Weekend / errand speed | 18 km/h |
| AM departure | 07:30 local, σ = 45 min |
| PM departure | 17:30 local, σ = 60 min |
| Source split | ~50 / 50 ("irys" / "quadrant") |
| Local timezone offset | UTC-4 (EDT, fixed for this window) |
| Parquet compression | zstd |

Validation expectations

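Re-running the generator with the defaults should reproduce these targets; because the run is seeded, the same --seed and --n-devices give identical output. With a different seed, small variation is expected because the simulation is stochastic.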

| Check | Expected |
| --- | --- |
| Row count | ~1.8 – 2.1 M |
| Distinct devices present | ~9 990 of 10 000 (a few are thinned to nothing) |
| Pings per device (median) | ~70 |
| Pings per device (heavy tail) | p90 ≈ 570, p99 ≈ 1 600, max ≈ 3 100; Gini ≈ 0.71 |
| Sparse tail | ~7% single-ping devices, ~29% ≤5 pings |
| Inter-ping gap (bursty) | median ≈ 0.6 min; ~57% of consecutive gaps < 1 min |
| timestamp_utc range | 2026-06-01 00:00 UTC → 2026-06-07 23:59 UTC (no spillover) |
| Lat range | inside 38.79 – 38.99 |
| Lon range | inside -77.12 – -76.91 |
| Per-device timestamps | monotonically non-decreasing |
| Source split | ~50 / 50 |
| Median accuracy (open-sky rows) | ~8 m |
| Median accuracy (urban-canyon rows) | ~25 m |
| Devices with inferable home cell | ~52% of present (rest too sparse to locate) |
| Devices with inferable work cell | ~47% of present (~90% of home-devices) |
| Non-trivial OD flows at k=5 / k=15 | ~150 / ~17 (illustrates k-anonymity suppression) |
| File size | < 100 MB; ~47 MB with defaults |

The generator prints all of the above in its summary block, so you can spot a regression immediately.
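For readers who want to recompute the panel-shape checks themselves, the sketch below derives the Gini coefficient and the sparse-tail shares from a plain list of device_id values (one entry per ping). It is independent of the generator's own summary code; the function names gini and panel_shape are hypothetical, and reading the Parquet file into that list is left to whichever reader you have available.

```python
from collections import Counter


def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def panel_shape(device_ids):
    """Pings-per-device summary from one device_id per ping row."""
    counts = list(Counter(device_ids).values())
    n = len(counts)
    return {
        "devices": n,
        "single_ping_share": sum(c == 1 for c in counts) / n,
        "le5_share": sum(c <= 5 for c in counts) / n,
        "gini": gini(counts),
    }


# Tiny made-up example: three devices with 2, 1, and 6 pings.
shape = panel_shape(["a", "a", "b", "c", "c", "c", "c", "c", "c"])
```

Only devices that appear in the file are counted, so the shares here are "of present" devices, the same denominator the validation table uses.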

How to regenerate

From the repo root, in the configured devcontainer / Codespace:

python scripts/generate_sample.py

Optional flags:

python scripts/generate_sample.py \
    --output data/sample_pings_dc.parquet \
    --n-devices 10000 \
    --seed 20260605

The script writes the boundary cache to _planning/dc_boundaries.geojson (gitignored) the first time it runs; subsequent runs are fully offline and take roughly 30–45 seconds on a free 2-core Codespace.

If TIGER is unreachable, the script logs a warning and falls back to ward-level centroids (see Provenance below). The output schema is identical; only the spatial pattern of home / work points becomes blockier.

Limitations

Provenance

Ethical / licensing disclaimer

This file is synthetic and contains no real personal data. It is licensed under the same Mozilla Public License 2.0 (MPL-2.0) as the rest of the pingkit repository and is safe to share for training, demos, and testing. It is not suitable input for any analysis whose results will be cited, published, or used to make programmatic, operational, or policy decisions. Real Irys and Quadrant data must be obtained through the Development Data Partnership Portal under the existing master licences (see Subject Brief §1–2).

Even though this dataset is synthetic, it preserves the shape of a real ping feed — including the property that four spatio-temporal points are enough to single out almost any individual trajectory in a population of millions (de Montjoye et al. 2013, Subject Brief §4). Treat it as if it were real when practising privacy techniques.