pingkit

`data/sample_pings_dc.parquet` — synthetic D.C. ping sample

This is a synthetic dataset. It exists solely to give workshop attendees a small, license-safe, deterministic stand-in for an Irys or Quadrant feed so they can run the pingkit notebooks on a free Codespace without requesting real vendor data. Every record was generated by scripts/generate_sample.py with a fixed random seed. No real person, device, or trajectory is represented.

Overview

Approximately 1.9 M ping records for 10 000 simulated mobile devices moving around Washington, D.C. between Monday 1 June 2026 00:00 UTC and Sunday 7 June 2026 23:59 UTC. The shape of the data (schema, value ranges, accuracy distribution, weekday commute peaks, weekend leisure trips, and commutes that concentrate on a handful of employment centres) was designed to mirror what an attendee would see in a real Irys or Quadrant export. It is also explicitly calibrated to reproduce three real SDK-panel pathologies:

a sparse tail — ~7% of devices emit a single ping, ~29% emit ≤5;
heavy skew — the top 10% of devices hold ~half of all pings (Gini ≈ 0.71);
a bursty, event-triggered cadence — most consecutive pings are sub-minute (inside short bursts), separated by long silences.

The calibration target was a real Quadrant daily feed (Bangkok, 2024-01-01). The values themselves are not real.

Property	Value
Rows	~1.9 M
Devices	10 000
Window	2026-06-01 00:00 UTC → 2026-06-07 23:59 UTC (7 days)
Geographic extent	38.79 – 38.99 N, 77.12 – 76.91 W (D.C. bounding box; stored longitudes are negative, -77.12 to -76.91)
File size	~47 MB (Parquet, zstd)
Sources	`"irys"` and `"quadrant"`, ~50 / 50 split
Pings per device	heavy-tailed: median ≈ 70, p90 ≈ 570, p99 ≈ 1 600, max ≈ 3 100; Gini ≈ 0.71; ~7% single-ping, ~29% ≤5
Inter-ping gap	bursty: median ≈ 0.6 min, ~57% of gaps < 1 min, long tail of multi-minute silences
Seed	`20260605` (default in the generator)

Schema

The canonical workshop schema is the seven columns below. They are a deliberately compact subset of what Irys and Quadrant deliver; we picked the fields that every transport workflow needs and named them vendor-agnostically. The mapping back to each vendor’s own column names is in the next section.

column	dtype	description
`device_id`	string	Stable pseudonymous device hash. 16 hex chars, prefixed `ir_` or `qd_` by source.
`timestamp_utc`	int64 (ms epoch)	Event time in milliseconds since Unix epoch, UTC.
`timestamp_iso`	string (ISO 8601)	Same instant as `timestamp_utc`, formatted `YYYY-MM-DDTHH:MM:SSZ` for readability.
`lat`	float64	WGS-84 latitude, after horizontal-error noise.
`lon`	float64	WGS-84 longitude, after horizontal-error noise.
`horizontal_accuracy_m`	float32	Reported horizontal accuracy in metres (1-σ proxy).
`source`	category	`"irys"` or `"quadrant"` — which vendor this row would have come from.

Records are sorted by (device_id, timestamp_utc) so per-device timestamps are monotonic non-decreasing.
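
The relationship between the two timestamp columns, and the sort guarantee, can be checked with a few lines of standard-library Python. This is a minimal sketch: the helper names `ms_to_iso` and `is_sorted_per_device` are illustrative and are not part of pingkit.

```python
from datetime import datetime, timezone

def ms_to_iso(ts_ms: int) -> str:
    """Format a millisecond Unix epoch as YYYY-MM-DDTHH:MM:SSZ (UTC)."""
    # Integer division drops the sub-second part, matching the ISO column's precision.
    dt = datetime.fromtimestamp(ts_ms // 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

def is_sorted_per_device(rows) -> bool:
    """rows: iterable of (device_id, timestamp_utc) tuples in file order.

    Returns True when the rows are ordered by (device_id, timestamp_utc),
    which implies non-decreasing timestamps within each device.
    """
    prev = None
    for key in rows:
        if prev is not None and key < prev:
            return False
        prev = key
    return True

# The first millisecond of the sample window.
window_start_iso = ms_to_iso(1780272000000)
```

Ties are allowed: two pings from the same device with an identical `timestamp_utc` still count as sorted.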

Generation method

Each of the 10 000 devices is given a home cell sampled from D.C. census block groups with weights derived from the TIGER/Line 2020 land-area distribution (smaller / denser block groups get more draws). With 70% probability the device is also given a work cell, but work is not drawn from the residential surface. Instead it is assigned from a short list of 14 D.C. employment centres (including downtown / K Street, Federal Triangle, Capitol Hill, Foggy Bottom, L’Enfant Plaza, Navy Yard, Union Station / NoMa, Dupont, Georgetown, Columbia Heights, Friendship Heights, Van Ness, and St. Elizabeths / DHS) using a gravity model: the probability of a given centre is proportional to its job weight divided by (distance_from_home + 1 km)^2. Real commutes have strong distance decay and concentrate on a few job hubs, so this both adds realism and correlates home with work, which is what lets the OD matrix aggregate above a k-anonymity threshold instead of dissolving into singletons. Devices without a work cell still emit pings (errands, walks, social trips) centred on home.
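
The gravity assignment can be sketched as follows. This is an illustration of the stated formula only. The centre names, coordinates, and job weights below are hypothetical, and the use of haversine distance is an assumption; the generator's own distance measure and weights live in scripts/generate_sample.py.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS-84 points."""
    r = 6371.0088  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def gravity_probabilities(home, centres, alpha=2.0, d0_km=1.0):
    """P(centre) is proportional to job_weight / (distance_km + d0_km) ** alpha."""
    scores = {
        name: weight / (haversine_km(home[0], home[1], lat, lon) + d0_km) ** alpha
        for name, (lat, lon, weight) in centres.items()
    }
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}

# Hypothetical centres: (lat, lon, relative job weight). Illustrative values only.
centres = {
    "downtown": (38.902, -77.033, 10.0),
    "navy_yard": (38.876, -77.003, 4.0),
    "friendship_heights": (38.960, -77.085, 2.0),
}
home = (38.930, -77.032)  # hypothetical home cell
probs = gravity_probabilities(home, centres)
```

The softening term d₀ = 1 km keeps the score finite when a home cell sits on top of a centre, and the exponent α = 2 controls how quickly distant centres lose probability.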

Each device is also assigned a heavy-tailed activity profile so the panel looks like a real one rather than 10 000 identical participants. The class sets how many of the 7 days the device is present and a weekly ping budget drawn from a log-normal:

Class	Share	Days present	Weekly ping budget (median)	Effect
transient	20%	1	~2 (1–5)	the sparse tail — single-ping and ≤5-ping devices
occasional	28%	1–3	~7 (2–30)	sporadic
regular	37%	5–7	~170 (60–450)	the backbone — home/work inferable
power	15%	7	~680 (250–2 700)	power users — dominate ping volume

The budget is enforced by thinning: the device’s full bursty week is simulated, then whole bursts are dropped at random until it lands on its budget. Dropping whole bursts — rather than individual pings — preserves the sub-minute within-burst cadence, so a heavily thinned device looks like one or two tight clusters of sightings, not an evenly spaced handful. The result is the real panel shape: a sparse tail (~7% single-ping devices, ~29% ≤5), heavy skew (top 10% of devices hold ~52% of pings; Gini ≈ 0.71; max ≈ 3 100 vs. a median of ≈ 70), and a bursty cadence (median gap ≈ 0.6 min; ~57% of consecutive gaps < 1 min). These targets are calibrated against a real Quadrant daily feed (Bangkok, 2024-01-01). The class label is deliberately not written to the file: analysts must discover the imbalance from the data, exactly as with a real feed.
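
A minimal sketch of burst-level thinning is shown below. It assumes dropping stops as soon as the remaining ping count is at or below the budget; the generator's exact stopping rule may differ. The function name `thin_bursts` is illustrative, and the standard-library `random.Random` stands in for the generator's numpy RNG only to keep the example dependency-free.

```python
import random

def thin_bursts(bursts, budget, rng):
    """Drop whole bursts at random until the ping count is within budget.

    bursts: list of lists of ping timestamps, one inner list per burst.
    Assumption: stop at the first point where total pings <= budget.
    Surviving bursts keep their original order and internal spacing.
    """
    kept = list(bursts)
    total = sum(len(b) for b in kept)
    while kept and total > budget:
        idx = rng.randrange(len(kept))
        total -= len(kept.pop(idx))
    return kept

# Four bursts (timestamps in seconds), thinned to a budget of 4 pings.
bursts = [[0, 15, 30], [600, 615], [1200], [1800, 1815, 1830, 1845]]
kept = thin_bursts(bursts, budget=4, rng=random.Random(0))
```

Because bursts are removed whole, every surviving burst still shows the original within-burst gaps. A device whose budget is smaller than any of its bursts can end up with no pings at all, which is consistent with a few devices being thinned to nothing.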

The 7-day per-device schedule is then simulated (only on the device’s active days):

Weekdays (Mon–Fri) with a work cell. AM departure is drawn from a Gaussian centred at 07:30 local (σ ≈ 45 min) and PM departure from a Gaussian centred at 17:30 local (σ ≈ 60 min). The device emits pings while stationary at home, while in transit home → work and work → home (positions linearly interpolated with ~10 m Gaussian jitter), and while stationary at work. About 20% of work-having devices take a short midday hop near work and back. Indoor “quiet” gaps drop ~50% of pings on any stationary dwell longer than two hours.
Weekdays without a work cell. 0–2 short errands between 09:00 and 19:00 local to a leisure POI or a random nearby point, then home.
Weekend (Sat–Sun). 1–3 leisure trips per day to one of ten hard-coded D.C. landmarks (National Mall, Georgetown waterfront, U Street, Adams Morgan, Capitol Hill, H Street NE, Anacostia waterfront, Navy Yard, etc.). Overall ping volume is lower than weekdays.
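
The timing side of the weekday schedule can be sketched with the fixed parameters from this page (07:30 / 17:30 local means, UTC-4 offset, 22 km/h commute speed). The function names are illustrative, the ~10 m path jitter is omitted, and `random.Random` is used here only as a dependency-free stand-in for the generator's numpy RNG.

```python
import random
from datetime import date, datetime, timedelta, timezone

EDT = timezone(timedelta(hours=-4))  # fixed UTC-4 offset for this window

def sample_departure_utc(day, mean_hour, mean_minute, sigma_min, rng):
    """Draw a local departure time from a Gaussian and return it in UTC."""
    local_mean = datetime(day.year, day.month, day.day, mean_hour, mean_minute, tzinfo=EDT)
    offset = timedelta(minutes=rng.gauss(0.0, sigma_min))
    return (local_mean + offset).astimezone(timezone.utc)

def travel_minutes(distance_km, speed_kmh=22.0):
    """Travel time in minutes at a constant speed."""
    return 60.0 * distance_km / speed_kmh

def interpolate(a, b, frac):
    """Position a fraction `frac` of the way from point a to point b (lat, lon)."""
    return (a[0] + frac * (b[0] - a[0]), a[1] + frac * (b[1] - a[1]))

rng = random.Random(20260605)
am = sample_departure_utc(date(2026, 6, 1), 7, 30, 45.0, rng)   # AM departure, UTC
pm = sample_departure_utc(date(2026, 6, 1), 17, 30, 60.0, rng)  # PM departure, UTC
```

A mean departure of 07:30 local corresponds to 11:30 UTC in this window, which is worth remembering when reading commute peaks off `timestamp_utc`.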

Within every segment above, pings are not evenly spaced. They are emitted in bursts — an app wakes, fires a handful of pings a few seconds apart (~15 s within a burst), then goes quiet for minutes (mean ≈ 3 min between bursts while moving, ≈ 6 min while idle); each burst holds 1 + Poisson(2.2) pings. This is the event-triggered SDK signature: most consecutive gaps are sub-minute, with a long tail of multi-minute silences. The per-device thinning that enforces the ping budget then drops whole bursts, which lengthens the silences for low-activity devices without erasing the sub-minute within-burst structure.
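
The burst process described above can be sketched as below, using the means from the fixed-parameter table (3 min / 6 min between bursts, ~15 s within a burst, burst size 1 + Poisson(2.2)). Two assumptions are made for illustration: each silence is measured from the end of the previous burst, and the standard-library `random.Random` replaces the generator's numpy RNG so the example has no dependencies.

```python
import math
import random

def poisson(lam, rng):
    """Poisson draw via Knuth's multiplication method (fine for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def emit_bursts(start_s, end_s, moving, rng):
    """Return ping times in seconds for one segment, grouped into bursts."""
    inter_burst_mean_s = 180.0 if moving else 360.0  # 3 min moving, 6 min idle
    within_burst_mean_s = 15.0                       # 0.25 min
    bursts, t = [], start_s
    while True:
        t += rng.expovariate(1.0 / inter_burst_mean_s)  # silence before the burst
        if t >= end_s:
            return bursts
        burst = []
        for _ in range(1 + poisson(2.2, rng)):
            if t >= end_s:
                break  # truncate a burst that would run past the segment
            burst.append(t)
            t += rng.expovariate(1.0 / within_burst_mean_s)
        bursts.append(burst)

# Six hours of stationary dwell for one hypothetical device.
bursts = emit_bursts(0.0, 6 * 3600.0, moving=False, rng=random.Random(20260605))
ping_times = [t for burst in bursts for t in burst]
```

The nested-list output is the same shape a burst-level thinning step would consume, so whole bursts can be dropped without touching within-burst spacing.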

A horizontal-accuracy noise term is then applied to every row. The base σ is 8 m at open sky and 25 m inside a coarse downtown “urban-canyon” box (latitude 38.89 to 38.91, longitude −77.04 to −77.02). The reported horizontal_accuracy_m is the base σ jittered by a small log-normal so it does not look unrealistically uniform.
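
A sketch of the accuracy model follows. The box bounds and base σ values come from this page; the log-normal spread `jitter_sigma=0.15` is a hypothetical value chosen for illustration, since the generator's own setting is not documented here.

```python
import random

# Urban-canyon box: lat_min, lat_max, lon_min, lon_max.
CANYON = (38.89, 38.91, -77.04, -77.02)

def base_sigma_m(lat, lon):
    """Base horizontal sigma in metres: 25 inside the canyon box, 8 elsewhere."""
    lat_min, lat_max, lon_min, lon_max = CANYON
    in_canyon = lat_min <= lat <= lat_max and lon_min <= lon <= lon_max
    return 25.0 if in_canyon else 8.0

def reported_accuracy_m(lat, lon, rng, jitter_sigma=0.15):
    """Base sigma multiplied by a log-normal factor whose median is 1.

    jitter_sigma is a hypothetical spread, not the generator's value.
    """
    return base_sigma_m(lat, lon) * rng.lognormvariate(0.0, jitter_sigma)

rng = random.Random(20260605)
downtown = reported_accuracy_m(38.90, -77.03, rng)   # inside the box
uptown = reported_accuracy_m(38.95, -77.00, rng)     # open sky
```

A log-normal factor with location 0 has median 1, so the median reported accuracy stays at the base σ in each zone.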

The whole pipeline draws from a single numpy.random.default_rng(seed). Nothing depends on the wall clock, the standard-library random module, or thread-local state. Running the script twice with the same seed and --n-devices produces identical bytes.
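
One way to confirm the byte-for-byte claim is to hash two outputs written by separate runs (for example with `--output` pointed at two different paths). The helper names below are illustrative, not part of pingkit.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_bytes(path_a, path_b):
    """True when both files have identical content."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Comparing digests rather than row counts catches differences in row order or compression settings that summary statistics would miss.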

Fixed parameters

Parameter	Value
Seed	`20260605`
Devices	10 000
Week	2026-06-01 → 2026-06-07 UTC
Work-cell probability	0.70
Work assignment	gravity over 14 employment centres
Gravity exponent / softening	α = 2.0, d₀ = 1 km
Employment-centre jitter σ	120 m
Activity classes	transient 20% / occasional 28% / regular 37% / power 15%
Weekly ping budget, median	~2 / ~7 / ~170 / ~680 (lognormal)
Days present	1 / 1–3 / 5–7 / 7
Midday-hop probability	0.20 (of work-having devices)
Inter-burst gap, moving	3 min (Exponential)
Inter-burst gap, stationary	6 min (Exponential)
Within-burst gap	0.25 min (~15 s, Exponential)
Burst size	1 + Poisson(2.2) (mean ≈ 3.2 pings)
Path jitter σ	10 m
Accuracy σ, open sky	8 m
Accuracy σ, urban canyon	25 m
Indoor drop fraction, dwell > 2h	0.5 (0.6–0.7 on weekend evenings)
Commute speed	22 km/h
Weekend / errand speed	18 km/h
AM departure	07:30 local, σ = 45 min
PM departure	17:30 local, σ = 60 min
Source split	~50 / 50 (`"irys"` / `"quadrant"`)
Local timezone offset	UTC-4 (EDT, fixed for this window)
Parquet compression	zstd

Validation expectations

Re-running the generator with the defaults should reproduce these targets. Because the seed is fixed, the default run is deterministic; expect small variation only when you change --seed.

Check	Expected
Row count	~1.8 – 2.1 M
Distinct devices present	~9 990 of 10 000 (a few are thinned to nothing)
Pings per device (median)	~70
Pings per device (heavy tail)	p90 ≈ 570, p99 ≈ 1 600, max ≈ 3 100; Gini ≈ 0.71
Sparse tail	~7% single-ping devices, ~29% ≤5 pings
Inter-ping gap (bursty)	median ≈ 0.6 min; ~57% of consecutive gaps < 1 min
`timestamp_utc` range	2026-06-01 00:00 UTC → 2026-06-07 23:59 UTC (no spillover)
Lat range	inside 38.79 – 38.99
Lon range	inside -77.12 – -76.91
Per-device timestamps	monotonically non-decreasing
Source split	~50 / 50
Median accuracy (open-sky rows)	~8 m
Median accuracy (urban-canyon rows)	~25 m
Devices with inferable home cell	~52% of present (rest too sparse to locate)
Devices with inferable work cell	~47% of present (~90% of home-devices)
Non-trivial OD flows at k=5 / k=15	~150 / ~17 (illustrates k-anonymity suppression)
File size	< 100 MB; ~47 MB with defaults

The generator prints all of the above in its summary block, so you can spot a regression immediately.
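
If you want to recompute the panel-shape checks yourself, the sketch below derives them from the `device_id` column alone. The function names are illustrative, and the shares are taken over devices present in the file, which is an assumption about how the targets above are defined.

```python
from collections import Counter
from statistics import median

def gini(counts):
    """Gini coefficient of non-negative counts (0 = equal, near 1 = concentrated)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n

def panel_shape(device_ids):
    """Summarise pings per device from an iterable of device_id values."""
    counts = sorted(Counter(device_ids).values(), reverse=True)
    n = len(counts)
    top = counts[: max(1, n // 10)]  # the top 10% of devices by ping count
    return {
        "devices": n,
        "median": median(counts),
        "single_ping_share": sum(c == 1 for c in counts) / n,
        "le5_share": sum(c <= 5 for c in counts) / n,
        "top10_share": sum(top) / sum(counts),
        "gini": gini(counts),
    }

# Tiny illustrative panel: one heavy device and two single-ping devices.
shape = panel_shape(["a"] * 8 + ["b", "c"])
```

On the real file, pass the `device_id` column as a plain Python list or any other iterable.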

How to regenerate

From the repo root, in the configured devcontainer / Codespace:

python scripts/generate_sample.py

Optional flags:

python scripts/generate_sample.py \
    --output data/sample_pings_dc.parquet \
    --n-devices 10000 \
    --seed 20260605

The script writes the boundary cache to _planning/dc_boundaries.geojson (gitignored) the first time it runs; subsequent runs are fully offline and take roughly 30–45 seconds on a free 2-core Codespace.

If TIGER is unreachable, the script logs a warning and falls back to ward-level centroids (see Provenance below). The output schema is identical; only the spatial pattern of home / work points becomes blockier.

Limitations

Synthetic ≠ representative. The dataset reproduces three real SDK-panel pathologies: a sparse tail (single-ping and ≤5-ping devices), a heavily skewed pings-per-device distribution (a few power users hold most of the volume), and a bursty, event-triggered cadence (sub-minute bursts separated by long silences). All three are calibrated to a real Quadrant daily feed (Bangkok, 2024-01-01), so the data-quality exercise has something to bite on. It does not reproduce demographic bias: real Irys and Quadrant panels skew toward Android, free-app, and younger users (Subject Brief §1, §2), whereas here the panel is a uniform random draw on a population-weighted geography with a synthetic activity profile bolted on. Do not draw conclusions about real D.C. travel from it.
Intentionally clean. This sample has no missing values and no duplicate rows, whereas a real raw feed carries ~15% missing horizontal_accuracy and ~1–5% exact duplicates. Those two pathologies are not modelled here, so the dedupe / missingness QC steps run on essentially clean input; they are taught so that you already know them when you graduate to a real feed.
No consent flag. Real Quadrant rows include a consent field. We intentionally omit it so the workshop can demonstrate why that matters in the privacy section.
No sensitive-POI exclusion applied. Real production pipelines must strip pings within ~150 m of medical, religious, school, shelter, and military POIs (FTC v. Kochava, 4 May 2026 — Subject Brief §4). This sample is intentionally raw so the workshop can demo the exclusion step.
No privacy aggregation applied. Pings are at full spatio-temporal resolution. Re-identification risk is the whole point of the privacy exercise — the workshop shows attendees how to aggregate to H3 res 8 and enforce k ≥ 15 before any output leaves the secure environment.
No iOS/Android, no app-id, no carrier columns. The schema is the workshop minimum, not a full vendor record.
Urban-canyon model is a single rectangle. Real GNSS accuracy degradation depends on building geometry and is far more local than this.
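
The k-anonymity step mentioned under privacy aggregation can be sketched as a suppression filter over origin-destination flows. Cell ids are treated as opaque strings here (in the workshop they would be H3 res 8 indexes computed elsewhere), and counting distinct devices rather than trips is an assumption about how k is measured.

```python
from collections import defaultdict

def od_flows(trips, k=15):
    """Aggregate trips to OD flows and suppress flows with fewer than k devices.

    trips: iterable of (device_id, origin_cell, dest_cell).
    Returns {(origin_cell, dest_cell): n_distinct_devices} for flows with n >= k.
    """
    devices = defaultdict(set)
    for device_id, origin, dest in trips:
        devices[(origin, dest)].add(device_id)
    return {od: len(ids) for od, ids in devices.items() if len(ids) >= k}

# Hypothetical trips: 15 devices travel A -> B, 14 devices travel A -> C.
trips = [(f"ir_{i:02d}", "A", "B") for i in range(15)]
trips += [(f"qd_{i:02d}", "A", "C") for i in range(14)]
trips.append(("ir_00", "A", "B"))  # a repeat trip by the same device
released = od_flows(trips, k=15)
```

Raising k from 5 to 15 removes most small flows, which is the suppression effect the validation table illustrates with its k=5 and k=15 flow counts.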

Provenance

Spatial backbone. D.C. block-group polygons are pulled at first run from the U.S. Census Bureau TIGER/Line 2020 dataset for state FIPS 11 (tl_2020_11_bg.zip, ≈540 KB). Source URL: https://www2.census.gov/geo/tiger/TIGER2020/BG/tl_2020_11_bg.zip. Public domain (works of the U.S. federal government).
Block-group weighting. We approximate population density by inverse square root of ALAND (land area, square metres) so the script does not need a second download (and an API key) for ACS population. This is a pragmatic proxy that biases draws toward smaller, denser block groups; it is good enough to produce a plausible spatial pattern for a workshop demo but is not an accurate population surface.
Fallback (TIGER unreachable). If the TIGER download fails, the script switches to a hard-coded table of 8 D.C. ward centroids weighted by 2020 Decennial Census P1 counts (see DC_WARD_FALLBACK in scripts/generate_sample.py). The home / work pattern becomes blockier but the schema and statistics remain valid.
Leisure POIs. Ten well-known D.C. landmarks are hard-coded in LEISURE_POIS in the generator. Coordinates approximate; no real foot-traffic data was used.
Employment centres. Fourteen D.C. job hubs are hard-coded in EMPLOYMENT_CENTERS with rough relative job weights (downtown / federal core largest). Coordinates are approximate landmark centroids; the weights are illustrative, not derived from LEHD LODES or any employment dataset. Workplaces are assigned to these centres by the gravity model described above.
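
The block-group weighting described above reduces to a one-line normalisation. The sketch below assumes every ALAND value is strictly positive (water-only block groups would need to be filtered first); the function name and the example areas are illustrative.

```python
import math
import random

def block_group_weights(aland_m2):
    """Normalised sampling weights proportional to 1 / sqrt(ALAND).

    Assumes every land area is strictly positive.
    """
    raw = [1.0 / math.sqrt(area) for area in aland_m2]
    total = sum(raw)
    return [w / total for w in raw]

# Two hypothetical block groups: 0.01 km^2 and 1 km^2 of land.
weights = block_group_weights([10_000.0, 1_000_000.0])

# Draw home cells by index; random.Random stands in for the generator's numpy RNG.
homes = random.Random(20260605).choices(range(len(weights)), weights=weights, k=5)
```

A block group with 100 times less land gets 10 times the weight, so the square root damps the bias toward very small polygons compared with a plain inverse-area weight.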

Ethical / licensing disclaimer

This file is synthetic and contains no real personal data. It is licensed under the same Mozilla Public License 2.0 (MPL-2.0) as the rest of the pingkit repository and is safe to share for training, demos, and testing. It is not suitable input for any analysis whose results will be cited, published, or used to make programmatic, operational, or policy decisions. Real Irys and Quadrant data must be obtained through the Development Data Partnership Portal under the existing master licences (see Subject Brief §1–2).

Even though this dataset is synthetic, it preserves the shape of a real ping feed — including the property that four spatio-temporal points are enough to single out almost any individual trajectory in a population of millions (de Montjoye et al. 2013, Subject Brief §4). Treat it as if it were real when practising privacy techniques.
