Scrape file links from a CDC Vital Statistics section — scrape_cdc

Extracts downloadable file links from a CDC Vital Statistics page section identified by an anchor ID. The function locates the section from its anchor, collects the links inside each .listScroll container, and returns a tibble with metadata about each file.

scrape_cdc_section(page, anchor_id, section_name, subsection_names)

Arguments

page: An HTML document returned by rvest::read_html().
anchor_id: Character string giving the HTML anchor ID for the section.
section_name: Human-readable name of the section.
subsection_names: Character vector of subsection names. Its length must equal the number of .listScroll containers found in the section.
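
The length requirement on subsection_names can be illustrated with a minimal sketch. It is not the package's implementation: the inline HTML, the subsection labels, and the error message are all hypothetical, and it assumes rvest 1.0 or later is installed.

```r
library(rvest)

# Hypothetical stand-in for a section with two .listScroll containers.
page <- minimal_html('
  <div class="listScroll"><a href="a.zip">2020</a><a href="b.zip">2019</a></div>
  <div class="listScroll"><a href="guide.pdf">User Guide</a></div>
')

subsection_names <- c("Data files", "Documentation")  # illustrative labels

containers <- html_elements(page, ".listScroll")

# Fail early if names cannot be paired one-to-one with containers.
if (length(containers) != length(subsection_names)) {
  stop("Expected ", length(containers), " subsection names, got ",
       length(subsection_names))
}

# Count the links in each container, then repeat each label once per link.
links_per_container <- vapply(
  seq_along(containers),
  function(i) length(html_elements(containers[[i]], "a")),
  integer(1)
)
subsection <- rep(subsection_names, times = links_per_container)
```

Pairing names with containers by position is why a length mismatch has to be an error rather than a warning: a silent recycle would attach the wrong subsection label to some files.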

Value

A tibble with columns:

section: Section name
subsection: Subsection name
link_text: Text of the download link
year: Extracted year or leading label
file_size: File size string, if present
url: Absolute URL to the file
file_type: File extension
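
As a rough illustration of how the derived columns might be computed from a link's text and href, here is a sketch under stated assumptions. The link-text format ("2020 (1.2 MB)"), the regular expressions, the base URL, and the helper extract_first() are all hypothetical; the sketch handles only a leading four-digit year, not the "leading label" case, and the function's actual parsing rules may differ.

```r
# Hypothetical inputs: link text and relative hrefs as they might appear on a page.
link_text <- c("2020 (1.2 MB)", "User Guide")
href      <- c("/files/data_2020.zip", "/files/guide.pdf")
base_url  <- "https://example.org/section/index.htm"  # placeholder, not a CDC URL

# Return the first regex match per element, or NA where nothing matches.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

# Resolve relative hrefs against the page URL.
url <- xml2::url_absolute(href, base_url)

files <- tibble::tibble(
  link_text = link_text,
  year      = extract_first(link_text, "^\\d{4}"),               # leading 4-digit year
  file_size = extract_first(link_text, "[0-9.]+\\s*(KB|MB|GB)"), # e.g. "1.2 MB"
  url       = url,
  file_type = tools::file_ext(url)                                # extension without dot
)
```

Links without a recognizable year or size keep NA in those columns, which matches the "if present" wording used for file_size above.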

Details

The function assumes that the CDC page places file links in .listScroll containers and that the anchor is nested three levels below the section root. If the page structure changes, the DOM traversal may need to be updated.
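
The traversal described here can be sketched as follows. This is an assumption-laden illustration, not the function's source: the HTML fragment, the anchor ID "births", and the variable names are invented, and the three calls to xml2::xml_parent() simply encode the "three levels" assumption stated above.

```r
library(rvest)

# Hypothetical page fragment: the anchor sits three levels below the section root.
page <- minimal_html('
  <div class="section">
    <div><div><a id="births"></a></div></div>
    <div class="listScroll"><a href="/files/a_2020.zip">2020 (1.2 MB)</a></div>
    <div class="listScroll"><a href="/files/guide.pdf">User Guide</a></div>
  </div>
')

anchor_id <- "births"  # illustrative anchor ID

# Locate the anchor element by its ID.
anchor <- html_element(page, paste0("#", anchor_id))

# Climb three parents to reach the assumed section root.
section_root <- anchor
for (i in 1:3) {
  section_root <- xml2::xml_parent(section_root)
}

# Collect the link nodes inside every .listScroll container under the root.
containers <- html_elements(section_root, ".listScroll")
links <- html_elements(containers, "a")

hrefs <- html_attr(links, "href")
texts <- html_text2(links)
```

If the page layout changes, the number of parent steps and the .listScroll selector are the two places most likely to need adjusting.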