client.documents(include_text=False) -> pd.DataFrame
Downloads document metadata as a pandas DataFrame. With include_text=True, it also downloads the sharded full-text files and adds each document's extracted text.

Parameters

include_text (bool, default: False)
When True, downloads individual full-text shards and concatenates them into a single DataFrame. This is a much larger download.

Returns

pd.DataFrame with document metadata, optionally including extracted_text.

Example

from jmail import JmailClient

client = JmailClient()

# Metadata only
docs = client.documents()

# With full extracted text (large download)
docs_full = client.documents(include_text=True)

# Search document descriptions
flights = docs[docs.document_description.str.contains("flight", case=False, na=False)]

Columns

Column                Type    Description
id                    int     Unique document ID
source                string  Source (doj, house_oversight)
release_batch         string  Volume/batch identifier
original_filename     string  Original filename
page_count            int     Number of pages
size                  int     File size in bytes
document_description  string  AI-generated description
has_thumbnail         bool    Whether a thumbnail exists
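
As an illustration of working with these columns, the sketch below totals documents, pages, and bytes per source. It runs on a small synthetic DataFrame that only mimics the documented column names; summarize_by_source is a hypothetical helper, not part of the client.

```python
import pandas as pd


def summarize_by_source(docs: pd.DataFrame) -> pd.DataFrame:
    """Count documents and total pages/bytes per source (hypothetical helper)."""
    return (
        docs.groupby("source")
        .agg(
            documents=("id", "count"),
            pages=("page_count", "sum"),
            total_bytes=("size", "sum"),
        )
        .sort_values("documents", ascending=False)
    )


# Synthetic rows standing in for the result of client.documents()
sample = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "source": ["doj", "doj", "house_oversight"],
        "page_count": [10, 5, 2],
        "size": [1000, 500, 200],
    }
)
summary = summarize_by_source(sample)
```

With real data, the same call would be summarize_by_source(client.documents()).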

Additional Column (include_text=True)

Column          Type    Description
extracted_text  string  Full extracted text from the document
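
Once extracted_text is present, it can be searched the same way as document_description. The following is a minimal sketch on synthetic data; search_text is a hypothetical helper, and it assumes documents without text carry a missing value in that column.

```python
import pandas as pd


def search_text(docs: pd.DataFrame, term: str) -> pd.DataFrame:
    """Return rows whose extracted_text contains term, ignoring case."""
    if "extracted_text" not in docs.columns:
        raise KeyError("extracted_text is missing; load with include_text=True")
    mask = docs["extracted_text"].str.contains(
        term, case=False, na=False, regex=False  # literal match, skip missing text
    )
    return docs[mask]


# Synthetic rows standing in for client.documents(include_text=True)
sample = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "extracted_text": ["Flight log, page 1", None, "Deposition transcript"],
    }
)
hits = search_text(sample, "flight")
```

Passing regex=False treats the search term as a literal string, which avoids surprises when the term contains characters such as "." or "(".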

Full-Text Shards

When using include_text=True, the client downloads these shards and concatenates them:
Shard      Contents
VOL00008   DOJ Volume 8 documents
VOL00009   DOJ Volume 9 documents
VOL00010   DOJ Volume 10 documents
DataSet11  DOJ Dataset 11 documents
other      House Oversight, court records, etc.

Direct URLs

https://data.jmail.world/v1/documents.parquet
https://data.jmail.world/v1/documents-full/VOL00008.parquet
https://data.jmail.world/v1/documents-full/VOL00009.parquet
https://data.jmail.world/v1/documents-full/VOL00010.parquet
https://data.jmail.world/v1/documents-full/DataSet11.parquet
https://data.jmail.world/v1/documents-full/other.parquet
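
The shard URLs follow one pattern, so the files can also be read without the client. The sketch below is an illustration under stated assumptions, not the client's actual implementation: BASE_URL, SHARDS, shard_url, and load_full_text are hypothetical names, every shard is assumed to share the same columns, and pandas needs a Parquet engine such as pyarrow installed.

```python
BASE_URL = "https://data.jmail.world/v1"
SHARDS = ("VOL00008", "VOL00009", "VOL00010", "DataSet11", "other")


def shard_url(shard: str) -> str:
    """Build the direct URL for one full-text shard."""
    return f"{BASE_URL}/documents-full/{shard}.parquet"


def load_full_text(shards=SHARDS, read=None):
    """Read each shard and concatenate the results into one DataFrame.

    `read` defaults to pandas.read_parquet; pass another callable to
    load from local copies instead. Assumes all shards share a schema.
    """
    import pandas as pd  # imported lazily so shard_url works without pandas

    read = read or pd.read_parquet
    frames = [read(shard_url(name)) for name in shards]
    return pd.concat(frames, ignore_index=True)


urls = [shard_url(name) for name in SHARDS]
```

Calling load_full_text(shards=("VOL00008",)) would fetch a single shard, which may be useful when only one volume is needed.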