client.documents(include_text=False) -> pd.DataFrame
Downloads document metadata. With include_text=True, also downloads the sharded full-text files and adds each document's extracted text.
Parameters
include_text (bool, default False): When True, downloads the individual full-text shards and concatenates them into a single DataFrame. This is a much larger download.
Returns
pd.DataFrame with document metadata, optionally including extracted_text.
Example
from jmail import JmailClient
client = JmailClient()
# Metadata only
docs = client.documents()
# With full extracted text (large download)
docs_full = client.documents(include_text=True)
# Search document descriptions
flights = docs[docs.document_description.str.contains("flight", case=False, na=False)]
Columns
| Column | Type | Description |
|---|---|---|
| id | int | Unique document ID |
| source | string | Source (doj, house_oversight) |
| release_batch | string | Volume/batch identifier |
| original_filename | string | Original filename |
| page_count | int | Number of pages |
| size | int | File size in bytes |
| document_description | string | AI-generated description |
| has_thumbnail | bool | Whether a thumbnail exists |
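As a rough illustration of working with these columns, the sketch below totals documents, pages, and bytes per source. The rows are invented stand-ins so the snippet runs offline (the release_batch values and filenames in particular are placeholders, not real identifiers), and `summarize_by_source` is a hypothetical helper, not part of the client.

```python
import pandas as pd

# Invented stand-in rows using the documented column names.
# Real data would come from client.documents().
docs = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "source": ["doj", "doj", "house_oversight"],
        "release_batch": ["batch-a", "batch-b", "batch-c"],  # placeholder values
        "original_filename": ["a.pdf", "b.pdf", "c.pdf"],
        "page_count": [10, 4, 7],
        "size": [120_000, 48_000, 90_000],
        "document_description": ["Flight log", "Letter", None],
        "has_thumbnail": [True, False, True],
    }
)


def summarize_by_source(df: pd.DataFrame) -> pd.DataFrame:
    """Count documents and total pages/bytes for each source."""
    return (
        df.groupby("source")
        .agg(
            documents=("id", "count"),
            pages=("page_count", "sum"),
            total_bytes=("size", "sum"),
        )
        .sort_values("pages", ascending=False)
    )


summary = summarize_by_source(docs)
print(summary)
```

The same function should apply unchanged to the DataFrame returned by client.documents(), assuming the columns match the table above.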
Additional Column (include_text=True)
| Column | Type | Description |
|---|---|---|
| extracted_text | string | Full extracted text from the document |
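Once extracted_text is present, a phrase search over the full text could look something like the sketch below. The sample frame is invented so the snippet runs without a download, and `search_text` is a hypothetical helper rather than a client method. It assumes some documents may have no extracted text, which `na=False` treats as non-matching.

```python
import pandas as pd

# Invented stand-in for client.documents(include_text=True).
sample = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "original_filename": ["a.pdf", "b.pdf", "c.pdf", "d.pdf"],
        "extracted_text": [
            "Flight manifest for the charter",
            "Quarterly invoice",
            "Notes on the return flight",
            None,  # assumed possible: a document with no extracted text
        ],
    }
)


def search_text(df: pd.DataFrame, phrase: str) -> pd.DataFrame:
    """Return id and filename for rows whose text contains the phrase."""
    mask = df["extracted_text"].str.contains(
        phrase,
        case=False,   # case-insensitive match
        na=False,     # missing text counts as no match
        regex=False,  # treat the phrase literally, not as a pattern
    )
    return df.loc[mask, ["id", "original_filename"]]


hits = search_text(sample, "FLIGHT")
print(hits)
```

Passing `regex=False` keeps characters such as `.` or `(` in the phrase from being interpreted as a pattern.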
Full-Text Shards
When using include_text=True, the client downloads these shards and concatenates them:
| Shard | Contents |
|---|---|
| VOL00008 | DOJ Volume 8 documents |
| VOL00009 | DOJ Volume 9 documents |
| VOL00010 | DOJ Volume 10 documents |
| DataSet11 | DOJ Dataset 11 documents |
| other | House Oversight, court records, etc. |
Direct URLs
https://data.jmail.world/v1/documents.parquet
https://data.jmail.world/v1/documents-full/VOL00008.parquet
https://data.jmail.world/v1/documents-full/VOL00009.parquet
https://data.jmail.world/v1/documents-full/VOL00010.parquet
https://data.jmail.world/v1/documents-full/DataSet11.parquet
https://data.jmail.world/v1/documents-full/other.parquet
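If only one shard is needed, the URLs above can also be read without the client. The following is a minimal sketch, not the client's own implementation: `shard_url` and `load_shard` are hypothetical helpers, and it assumes pandas with a Parquet engine such as pyarrow is installed and that the host is reachable. The column layout of the shard files is not confirmed here and may differ from what client.documents(include_text=True) returns.

```python
BASE_URL = "https://data.jmail.world/v1"
METADATA_URL = f"{BASE_URL}/documents.parquet"
SHARDS = ("VOL00008", "VOL00009", "VOL00010", "DataSet11", "other")


def shard_url(shard: str) -> str:
    """Return the full-text Parquet URL for one of the listed shards."""
    if shard not in SHARDS:
        raise ValueError(f"unknown shard {shard!r}; expected one of {SHARDS}")
    return f"{BASE_URL}/documents-full/{shard}.parquet"


def load_shard(shard: str):
    """Read one shard into a DataFrame (requires network access)."""
    import pandas as pd  # imported lazily so the URL helpers work without pandas

    return pd.read_parquet(shard_url(shard))


# Print the URL for every shard listed above.
for name in SHARDS:
    print(shard_url(name))
```

For example, `load_shard("VOL00008")` would fetch only the DOJ Volume 8 file, which avoids downloading all five shards.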