Skip to main content
DuckDB can query Parquet files directly over HTTP with zero download or setup. This is the fastest way to explore the data interactively.

Setup

Install DuckDB, then query directly:
duckdb
SELECT sender, COUNT(*) as n
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
GROUP BY sender ORDER BY n DESC LIMIT 20;

Example Queries

Count emails by sender

SELECT sender, COUNT(*) as n
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
GROUP BY sender ORDER BY n DESC LIMIT 20;

Emails sent by Epstein

SELECT subject, sent_at, to_recipients
FROM read_parquet('https://data.jmail.world/v1/emails-slim.parquet')
WHERE epstein_is_sender = true
ORDER BY sent_at DESC LIMIT 20;

Search documents

SELECT original_filename, document_description, page_count
FROM read_parquet('https://data.jmail.world/v1/documents.parquet')
WHERE document_description ILIKE '%flight%'
LIMIT 20;

Join photos with people

SELECT p.name, COUNT(*) as appearances
FROM read_parquet('https://data.jmail.world/v1/photo_faces.parquet') pf
JOIN read_parquet('https://data.jmail.world/v1/people.parquet') p
  ON pf.person_id = p.id
GROUP BY p.name ORDER BY appearances DESC;

Star counts by entity type

SELECT entity_type, SUM(count) as total_stars
FROM read_parquet('https://data.jmail.world/v1/star_counts.parquet')
GROUP BY entity_type ORDER BY total_stars DESC;

Using with Polars

Polars can also read Parquet over HTTP:
import polars as pl

df = pl.read_parquet("https://data.jmail.world/v1/emails-slim.parquet")
print(df.group_by("sender").len().sort("len", descending=True).head(20))

Using with Pandas

import pandas as pd

df = pd.read_parquet("https://data.jmail.world/v1/emails-slim.parquet")
print(df.groupby("sender").size().sort_values(ascending=False).head(20))