API

get_html_page(url)

Fetch and parse an HTML page.

This function downloads the HTML content from a given URL and parses it into a BeautifulSoup object, which can then be queried using CSS selectors or tag-based navigation.

Parameters:

    url (str): The URL of the HTML page to retrieve. Required.

Returns:

    BeautifulSoup: A parsed representation of the HTML document.

Example
soup = get_html_page(
    "https://www.cdc.gov/nchs/data_access/VitalStatsOnline.htm"
)
soup.select_one("a#births")

parse_file_size_mb(file_size, rounding=5)

Convert a file size expression to megabytes.

A vectorized Polars expression that parses file size strings containing KB, MB, or GB units and converts them to a numeric value in megabytes.

Parameters:

    file_size (Expr): A Polars expression resolving to file size strings
        (e.g. "531 KB", "1.8 MB", "1 GB"). Required.
    rounding (int): Number of decimal places to round to. Default is 5.

Returns:

    pl.Expr: A Polars expression resolving to file sizes in megabytes.

Examples:

>>> import polars as pl
>>> df = pl.DataFrame({"file_size": ["531 KB", "1.8 MB", "1 GB", None]})
>>> df.with_columns(parse_file_size_mb(pl.col("file_size")).alias("mb"))
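
The unit conversion described above can be illustrated with a plain-Python sketch. This is not the package's vectorized Polars implementation: `size_to_mb` is a hypothetical scalar helper, and the binary multiples (1 MB = 1024 KB, 1 GB = 1024 MB) are an assumption, since the documentation does not say which convention `parse_file_size_mb` uses.

```python
import re

# Assumption: binary multiples. The package may use decimal (1000-based) ones.
_TO_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}

# A number (optionally with a decimal part), optional whitespace, then a unit.
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB)\s*$", re.IGNORECASE)


def size_to_mb(text, rounding=5):
    """Hypothetical helper: convert a size string to megabytes.

    Returns None for missing or unparseable input.
    """
    if text is None:
        return None
    match = _SIZE_RE.match(text)
    if match is None:
        return None
    value, unit = match.groups()
    return round(float(value) * _TO_MB[unit.upper()], rounding)


sizes = ["531 KB", "1.8 MB", "1 GB", None]
converted = [size_to_mb(s) for s in sizes]
```

Returning `None` for missing input mirrors the `None` entry in the example DataFrame; how the real expression treats nulls and malformed strings is not stated here.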

scrape_mult_mort_user_guide(url)

Scrape Mortality Multiple Cause-of-Death user guide links from the CDC.

Extracts downloadable file links from the CDC Mortality Multiple Cause-of-Death documentation page and returns a Polars DataFrame with metadata about each file.

Parameters:

    url (str): URL of the CDC mortality documentation page. Required.
        Typically: https://www.cdc.gov/nchs/nvss/mortality_public_use_data.htm

Returns:

    pl.DataFrame: A DataFrame with columns:
        - section (str): Always "mortality_multiple"
        - subsection (str): Always "User Guide"
        - link_text (str): Text of the download link
        - year (int): Four-digit year extracted from link text, filled down for sub-items
        - file_size (str): File size string, if present
        - url (str): Absolute URL to the file
        - file_type (str): File extension
        - file_size_mb (float): File size converted to megabytes

Notes

The 1997 and 1998 entries link to separate HTML pages, each containing many PDFs, and are not scraped by this function as of February 17, 2026.
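
The "filled down" behavior of the `year` column can be sketched in plain Python. This is only an illustration of the idea: `years_filled_down` is a hypothetical helper, the link texts are made up, and the regular expression (a four-digit year starting with 19 or 20) is an assumption about how years appear in the link text.

```python
import re

# Assumption: years appear as standalone four-digit numbers, 1900-2099.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def years_filled_down(link_texts):
    """Hypothetical helper: extract a year per link, carrying the last
    seen year forward to sub-items whose text contains no year."""
    years = []
    last_year = None
    for text in link_texts:
        match = YEAR_RE.search(text)
        if match:
            last_year = int(match.group(0))
        years.append(last_year)
    return years


# Made-up link texts, for illustration only.
links = ["2020 User Guide", "Addendum", "2019 User Guide"]
years = years_filled_down(links)
```

Here the sub-item "Addendum" inherits 2020 from the entry above it. An item that appears before any year has been seen stays `None`.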

scrape_all_sections(url, url_pdf=None)

Scrape all CDC Vital Statistics sections.

Downloads and combines all the main CDC Vital Statistics sections
into a single DataFrame. Optionally scrapes the separate Mortality Multiple
Cause-of-Death documentation page and merges it in.

Args:
    url (str): The CDC Vital Stats page URL.
        Typically: https://www.cdc.gov/nchs/data_access/VitalStatsOnline.htm
    url_pdf (str | None): Optional CDC Mortality documentation page URL.
        If provided, the placeholder mortality user guide link is replaced
        with the full set of scraped links from that page.
        Typically: https://www.cdc.gov/nchs/nvss/mortality_public_use_data.htm

Returns:
    pl.DataFrame: A DataFrame with columns:
        - section (str): Section name
        - subsection (str): Subsection name
        - link_text (str): Text of the download link
        - year (int): Four-digit year
        - file_size (str): Raw file size string
        - url (str): Absolute URL to the file
        - file_type (str): File extension
        - file_size_mb (float): File size converted to megabytes

Notes:
    A known typo in the CDC source data causes one file size to appear as
    "10.2.MB" instead of "10.2 MB". This is corrected automatically.
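
One way such a correction could work is a small regular-expression substitution. This is a sketch only: `fix_size_typo` is a hypothetical name, and the documentation does not say how the package actually applies the fix.

```python
import re

# A digit, a stray ".", then a unit: "10.2.MB" matches at "2.MB".
_STRAY_DOT = re.compile(r"(\d)\.(KB|MB|GB)\b", re.IGNORECASE)


def fix_size_typo(text):
    """Hypothetical helper: replace a stray dot before the unit with a space."""
    return _STRAY_DOT.sub(r"\1 \2", text)


fixed = fix_size_typo("10.2.MB")
```

Well-formed strings such as "10.2 MB" pass through unchanged, because the pattern only matches a dot that sits directly before the unit.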

Example:

df = scrape_all_sections(
    url="https://www.cdc.gov/nchs/data_access/VitalStatsOnline.htm",
    url_pdf="https://www.cdc.gov/nchs/nvss/mortality_public_use_data.htm",
)

load_cdc_data()

Load the pre-scraped CDC Vital Statistics dataset.

Returns a Polars DataFrame containing all CDC Vital Statistics
download links and metadata, bundled with the package.

Returns:
    pl.DataFrame: A DataFrame with columns:
        - section (str): Section name
        - subsection (str): Subsection name
        - link_text (str): Text of the download link
        - year (int): Four-digit year
        - file_size (str): Raw file size string
        - url (str): Absolute URL to the file
        - file_type (str): File extension
        - file_size_mb (float): File size converted to megabytes

Example:

import polars as pl
from usdeathspy import load_cdc_data

df = load_cdc_data()
df.filter(pl.col("section") == "births")