API
get_html_page(url)
Fetch and parse an HTML page.
This function downloads the HTML content from a given URL and parses it into a BeautifulSoup object, which can then be queried using CSS selectors or tag-based navigation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `url` | `str` | The URL of the HTML page to retrieve. | required |
Returns:
| Type | Description |
|---|---|
| `BeautifulSoup` | A parsed representation of the HTML document. |
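As a rough illustration of the two query styles mentioned above, the sketch below parses a small inline HTML string rather than calling `get_html_page`, so it runs without network access. The markup and file names are invented for the example; only the `BeautifulSoup` calls are standard bs4 API.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = """
<html><body>
  <h2>User Guide</h2>
  <ul>
    <li><a href="/files/guide_2020.pdf">2020 (1.8 MB)</a></li>
    <li><a href="/files/guide_2021.pdf">2021 (531 KB)</a></li>
  </ul>
</body></html>
"""

# get_html_page(url) is documented to return an object of this type.
soup = BeautifulSoup(html, "html.parser")

# CSS selector query: every anchor with an href inside a list item.
links = [(a.get_text(strip=True), a["href"]) for a in soup.select("li a[href]")]

# Tag-based navigation: find the first <h2>, then walk to the list after it.
heading = soup.find("h2")
first_item = heading.find_next("ul").find("li")
```

Given a real page, the same queries would presumably be run on the object returned by `get_html_page(url)`.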
parse_file_size_mb(file_size, rounding=5)
Convert a file size expression to megabytes.
A vectorized Polars expression that parses file size strings containing KB, MB, or GB units and converts them to a numeric value in megabytes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_size` | `Expr` | A Polars expression resolving to file size strings (e.g. "531 KB", "1.8 MB", "1 GB"). | required |
| `rounding` | `int` | Number of decimal places to round to. Default is 5. | `5` |
Returns:
| Type | Description |
|---|---|
| `pl.Expr` | A Polars expression resolving to file sizes in megabytes. |
Examples:
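The conversion described above can be sketched for a single string in plain Python. This is not the package's implementation, which is a vectorized Polars expression. The helper name `size_to_mb` is hypothetical, and the binary factors (1 MB = 1024 KB, 1 GB = 1024 MB) are an assumption; the package may use decimal factors of 1000 instead.

```python
import re
from typing import Optional

# Assumed binary conversion factors; the package may use 1000-based factors.
_TO_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}


def size_to_mb(text: str, rounding: int = 5) -> Optional[float]:
    """Convert a string such as "531 KB" to megabytes; None if unparseable."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(KB|MB|GB)", text, flags=re.IGNORECASE)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2).upper()
    return round(value * _TO_MB[unit], rounding)


sizes = [size_to_mb(s) for s in ["531 KB", "1.8 MB", "1 GB"]]
# sizes == [0.51855, 1.8, 1024.0] under the assumed binary factors
```

With the package function itself, the call would presumably be applied to a column expression, for instance `parse_file_size_mb(pl.col("file_size"))` inside `with_columns`.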
scrape_mult_mort_user_guide(url)
Scrape Mortality Multiple Cause-of-Death user guide links from the CDC.
Extracts downloadable file links from the CDC Mortality Multiple Cause-of-Death documentation page and returns a Polars DataFrame with metadata about each file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `url` | `str` | URL of the CDC mortality documentation page. Typically: https://www.cdc.gov/nchs/nvss/mortality_public_use_data.htm | required |
Returns:
| Type | Description |
|---|---|
| `pl.DataFrame` | A DataFrame with the columns listed below. |

- section (str): Always "mortality_multiple"
- subsection (str): Always "User Guide"
- link_text (str): Text of the download link
- year (int): Four-digit year extracted from link text, filled down for sub-items
- file_size (str): File size string, if present
- url (str): Absolute URL to the file
- file_type (str): File extension
- file_size_mb (float): File size converted to megabytes
Notes
1997 and 1998 entries link to separate HTML pages containing many PDFs and are not scraped by this function as of 2/17/2026.
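How the `year`, `url`, and `file_type` columns described above might be derived is sketched below in plain Python. The link texts and paths are invented, and the regular expression, the fill-down loop, and the dot-less extension are assumptions about behaviour rather than the package's actual code.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

BASE = "https://www.cdc.gov/nchs/nvss/mortality_public_use_data.htm"

# Hypothetical (link_text, href) pairs; the sub-item carries no year of its own.
links = [
    ("2003 User Guide (1.8 MB)", "/nchs/data/example/guide2003.pdf"),
    ("Addendum", "/nchs/data/example/addendum.pdf"),
    ("2004 User Guide (531 KB)", "/nchs/data/example/guide2004.zip"),
]

rows = []
last_year = None
for text, href in links:
    # Take a four-digit year from the link text when one is present...
    match = re.search(r"\b(?:19|20)\d{2}\b", text)
    if match:
        last_year = int(match.group(0))
    # ...otherwise keep the most recent year (the "fill down" step).
    url = urljoin(BASE, href)  # resolve the relative href to an absolute URL
    file_type = PurePosixPath(urlparse(url).path).suffix.lstrip(".")
    rows.append(
        {"link_text": text, "year": last_year, "url": url, "file_type": file_type}
    )
```

In a DataFrame, the fill-down step would more naturally be a forward fill on the `year` column; the loop is used here only to keep the sketch dependency-free.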
scrape_all_sections(url, url_pdf=None)
Scrape all CDC Vital Statistics sections.
Downloads and combines all the main CDC Vital Statistics sections
into a single DataFrame. Optionally scrapes the separate Mortality Multiple
Cause-of-Death documentation page and merges it in.
Args:
url (str): The CDC Vital Stats page URL.
Typically: https://www.cdc.gov/nchs/data_access/VitalStatsOnline.htm
url_pdf (str | None): Optional CDC Mortality documentation page URL.
If provided, the placeholder mortality user guide link is replaced
with the full set of scraped links from that page.
Typically: https://www.cdc.gov/nchs/nvss/mortality_public_use_data.htm
Returns:
pl.DataFrame: A DataFrame with columns:
- section (str): Section name
- subsection (str): Subsection name
- link_text (str): Text of the download link
- year (int): Four-digit year
- file_size (str): Raw file size string
- url (str): Absolute URL to the file
- file_type (str): File extension
- file_size_mb (float): File size converted to megabytes
Notes:
A known typo in the CDC source data causes one file size to appear as
"10.2.MB" instead of "10.2 MB". This is corrected automatically.
Example:
```python
df = scrape_all_sections(
    url="https://www.cdc.gov/nchs/data_access/VitalStatsOnline.htm",
    url_pdf="https://www.cdc.gov/nchs/nvss/mortality_public_use_data.htm",
)
```
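The automatic correction of the "10.2.MB" typo mentioned in the notes for `scrape_all_sections` could be handled along these lines. This is a generalized sketch with a hypothetical helper name, `normalize_size`; the package may simply replace that one literal string.

```python
import re


def normalize_size(text: str) -> str:
    """Replace a stray '.' between the number and the unit with a space."""
    # Match a digit followed by '.' only when a unit comes straight after it.
    return re.sub(r"(\d)\.(?=(?:KB|MB|GB)\b)", r"\1 ", text)


fixed = normalize_size("10.2.MB")     # "10.2 MB"
unchanged = normalize_size("1.8 MB")  # left as-is
```

The lookahead keeps ordinary decimal points intact, so well-formed sizes pass through unchanged.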
load_cdc_data()
Load the pre-scraped CDC Vital Statistics dataset.
Returns a Polars DataFrame containing all CDC Vital Statistics
download links and metadata, bundled with the package.
Returns:
pl.DataFrame: A DataFrame with columns:
- section (str): Section name
- subsection (str): Subsection name
- link_text (str): Text of the download link
- year (int): Four-digit year
- file_size (str): Raw file size string
- url (str): Absolute URL to the file
- file_type (str): File extension
- file_size_mb (float): File size converted to megabytes
Example:
```python
import polars as pl

from usdeathspy import load_cdc_data

df = load_cdc_data()

# Keep only the rows belonging to the "births" section.
df.filter(pl.col("section") == "births")
```