API documentation

class crec.record.Record(start_date: str | datetime = None, end_date: str | datetime = None, dates: List[str | datetime] = None, granule_ids: List[str] = None, read_directory: str = None, granule_class_filter: List[str] = None, parse: bool = True, write: bool | str = False, zipped: bool = True, batch_size: int = 3, batch_wait: int | bool = False, rate_limit_wait: int | bool = 30, retry_limit: bool | int = 5, api_key: str = None, print_logs: bool = True, write_logs: bool = False, write_path: str = None)[source]

A collection of Congressional Record data from GovInfo’s Congressional Record API.

Parameters:

start_date (Union[str, datetime.datetime] = None) – First date for which CREC data will be retrieved.
end_date (Union[str, datetime.datetime] = None) – Last date for which CREC data will be retrieved.
dates (List[Union[str, datetime.datetime]] = None) – A custom list of dates to be used instead of the range created by start_date and end_date.
granule_ids (List[str] = None) – A list of official granule identifiers to be used instead of start_date and end_date or dates.
read_directory (str = None) – A directory to read in XML and HTML files from. There should be one XML file and one HTML file per granule in this directory.
granule_class_filter (List[str] = None) –
If provided, only granules with a class listed in granule_class_filter will be retrieved. If granule_class_filter is None, all granules are included. The class options are:
- HOUSE
- SENATE
- EXTENSIONS
- DAILYDIGEST
parse (bool = True) – A boolean that indicates whether or not the text of granules should be parsed. See Granule.parse_htm() for more information.
write (Union[bool, str] = False) – If write is False, then granule text (htm files) and metadata (xml files) will not be written to disk. Otherwise, write should be a path where those files should be written to.
zipped (bool = True) – Determines if granules should be requested individually or in zips. Only applies to calls where dates are used; if you are requesting individual granule identifiers, granules are always requested individually.
batch_size (int = 3) – The number of requests to send asynchronously at the same time. Too high a number will result in frequent rate limit issues. When requesting zip files, this number should be low. When requesting files individually, this should be much higher. To avoid rate limiting, try around 200.
batch_wait (Union[int, bool] = False) – If batch_wait is an int, then after requesting granule data in each batch of size batch_size, the program will halt for batch_wait seconds. Otherwise, batch_wait should be False, and the program will not pause after each batch. When requesting zip files, a batch_wait is not necessary. When requesting files individually, the batch_wait should be around 2-5 seconds.
rate_limit_wait (Union[int, bool] = 300) – If rate_limit_wait is an int, then exceeding the GovInfo rate limit will cause the program to halt for rate_limit_wait seconds. Otherwise, rate_limit_wait should be False, and exceeding the rate limit will throw an uncaught exception.
retry_limit (Union[bool, int] = 5) – If retry_limit is an int, then the program will attempt to request URLs up to retry_limit times before moving on. Otherwise, retry_limit should be False, and URLs will only be tried once.
api_key (str = None) – API key from GovInfo. Can be obtained by visiting https://www.govinfo.gov/api-signup
print_logs (bool) – A boolean that determines whether or not logs are printed to stdout.
write_logs (bool) – A boolean that determines whether or not logs are written to disk.
write_path (str = None) – A filename to write logs to. Must be provided if write_logs is True.
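
As an illustration of how start_date and end_date describe a range of days, the sketch below expands two endpoints into one date string per day. It is not taken from the crec source: the helper name expand_date_range, the inclusive treatment of end_date, and the YYYY-MM-DD string format are all assumptions made for this example.

```python
from datetime import datetime, timedelta
from typing import List, Union


def expand_date_range(
    start_date: Union[str, datetime], end_date: Union[str, datetime]
) -> List[str]:
    """Hypothetical helper: list every day from start_date to end_date, inclusive."""

    def to_date(value: Union[str, datetime]):
        # Accept either a datetime or a 'YYYY-MM-DD' string (assumed format).
        if isinstance(value, datetime):
            return value.date()
        return datetime.strptime(value, "%Y-%m-%d").date()

    start, end = to_date(start_date), to_date(end_date)
    span = (end - start).days
    # An end date before the start date yields an empty list.
    return [(start + timedelta(days=offset)).isoformat() for offset in range(span + 1)]
```

A list produced this way has the same shape as the dates parameter, which accepts a custom list in place of the range.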

passages[source]

Stores the Passage objects associated with the retrieved granules.

Type:: PassageCollection

paragraphs[source]

Stores the Paragraph objects associated with the retrieved granules.

Type:: ParagraphCollection

property incomplete_days: Set[str][source]: A set of date strings that did not have all of their associated granule identifiers retrieved.

property incomplete_granules: Set[str][source]: A set of granule identifiers that did not have their data retrieved, parsed, or written to disk (depending on requested behavior).

property raw_text: List[str][source]: A list containing the text of each granule, with elements like headers, page numbers, and times left in place.

property clean_text: List[str][source]: A list containing the text of each granule with elements like headers, page numbers, and times removed.

async crec.granule.get_granule_ids(date: str, client: GovInfoClient, granule_class_filters: List[str], logger: Logger) → Tuple[bool, List[str]][source]

A function to retrieve the granule identifiers associated with a specific day. Takes as an input a date string, a GovInfoClient object, a list of granule_class_filters, and a Logger object.

This function calls the /granules endpoint of the GovInfo API, which returns these identifiers. If granule_class_filters is provided, only granules with a class listed in it will be retrieved. Otherwise, all granules are included.

class crec.granule.Granule(granule_id: str)[source]

Represents a single GovInfo granule and its associated metadata and text.

Parameters:: granule_id (str) – The granule’s identifier

attributes[source]

A dictionary of information describing the granule. Possible keys are:

granuleDate
granuleId
searchTitle
granuleClass
subGranuleClass
chamber

Type:: dict

xml_url[source]

A relative URL that reaches the /mods endpoint of the GovInfo API to request the granule’s metadata.

Type:: str

htm_url[source]

A relative URL that reaches the /htm endpoint of the GovInfo API to request the granule’s text.

Type:: str

raw_text[source]

The text of the granule, with elements like headers, page numbers, and times left in place.

Type:: str

clean_text[source]

The text of the granule with elements like headers, page numbers, and times removed.

Type:: str

speakers[source]

A mapping between speaker identifiers and Speaker objects for all of the speakers in the granule, including speakers referred to by title only (e.g., The PRESIDENT pro tempore).

Type:: Dict[str, Speaker]

valid_responses[source]

A boolean that indicates whether the metadata and text requests both properly resolved.

Type:: bool

parsed[source]

A boolean that indicates whether the text of the granule was successfully parsed.

Type:: bool

written[source]

A boolean that indicates whether the metadata and text of the granule were successfully written to disk.

Type:: bool

complete[source]

A boolean that indicates whether the desired behavior (parsing and writing) was achieved. If both parse and write are True, then parsed and written must be True for the granule to be complete; if parse is True and write is False, only parsed must be true for the granule to be complete; if parse is False and write is True, only written must be true for the granule to be complete. Finally, if both parse and write are False, the granule is automatically complete.

Type:: bool
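
The rule for complete can be written as a small truth function. The sketch below is a restatement of the description above, not the library's code, and the function name is_complete is hypothetical.

```python
from typing import Union


def is_complete(parse: bool, write: Union[bool, str], parsed: bool, written: bool) -> bool:
    """Hypothetical restatement of the completeness rule for a granule."""
    # Parsing only matters if it was requested.
    parse_ok = parsed or not parse
    # write is False or a path string; a non-empty path means writing was requested.
    write_ok = written or not write
    return parse_ok and write_ok
```

Because write may be either False or a path, the check relies on truthiness rather than comparing against True.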

parse_exception[source]

In the case of a parsing exception, that exception is assigned to this attribute.

Type:: Exception

write_exception[source]

In the case of a writing exception, that exception is assigned to this attribute.

Type:: Exception

passages[source]

Stores the Passage objects associated with this granule.

Type:: PassageCollection

paragraphs[source]

Stores the Paragraph objects associated with this granule.

Type:: ParagraphCollection

async async_get(client: GovInfoClient, parse: bool, write: bool | str) → None[source]: Takes as an input a GovInfoClient object, and booleans indicating whether the granule’s data should be parsed and/or written to disk. Requests the granule’s metadata and text, and proceeds from there.

parse_responses(xml_response: Response | Element, htm_response: Response | str) → None[source]: Takes both the metadata (xml) and text (htm) responses returned from the GovInfoClient object. Tries to parse both of them. In the case of an error, saves the error to the Granule.parse_exception attribute.

write_responses(write: str, xml_response: Response | Element, htm_response: Response | str) → None[source]: Takes a write path, and both the metadata (xml) and text (htm) responses returned from the GovInfoClient object. Tries to write both of them to disk. In the case of an error, saves the error to the Granule.write_exception attribute.

parse_xml(root: Element) → None[source]: Parses the xml response. First, it updates the Granule.attributes dictionary. Then, it finds all listed Members of Congress who spoke during the course of the granule. For each one, it instantiates a Speaker object, generates a unique speaker identifier, and adds the pair to the granule’s mapping of speakers.

parse_htm(raw_text) → None[source]: Parses the text response. Starts by removing common non-speech elements: the title, the footer, page numbers, and times. Then, it calls the Granule.find_titled_speakers() and Granule.find_passages() functions.

find_titled_speakers() → None[source]: Searches through the cleaned text to find instances of ‘titled speakers,’ that is, speakers who are not listed in the xml and are only referred to by title. Examples of this type of speaker include The PRESIDING OFFICER and The CHIEF JUSTICE. For each titled speaker, it creates a new Speaker object, generates a unique speaker identifier, and adds the pair to the granule’s mapping of speakers.

find_passages() → None[source]

Splits the cleaned text into Passage objects. This process goes as follows:

First, the function checks the length of the speakers attribute. If the length is zero, there are no known speakers. As such, the entire cleaned text is assigned to a single Passage with an unknown speaker.

If there is at least one speaker, the function continues. The idea here is to take advantage of the ways the Congressional Record introduces new speakers. For Members of Congress, the Congressional Record places their honorific (Mr., Ms., Dr., etc.) and their fully capitalized last name at the beginning of the paragraph where they begin speaking.

An example may look like this:

  Ms. TENNEY. Mr. Speaker, I rise today to recognize a new record in
Oswego County, New York. The town of Redfield now has the record for
the most snowfall in 48 hours. An astonishing 62 inches of snow fell in
this idyllic town along the Salmon River with only 550 people near Lake
Ontario.

Or this:

  Mr. McCONNELL. Mr. President, I ask unanimous consent that the Senate
proceed to legislative session for a period of morning business, with
Senators permitted to speak therein for up to 10 minutes each.

Notice that the last name is not always fully capitalized. There are some other inconsistencies as well. Some examples include:

Mr. RODNEY DAVIS of Illinois.
Ms. ROS-LEHTINEN.

Finally, titled speakers follow slightly different patterns. Examples include:

The SPEAKER pro tempore.
The CHIEF JUSTICE.
Mr. Manager RASKIN.

Thankfully, the metadata of each granule lists the “parsed name” of each speaking Member of Congress, or how their name will appear in the text when they begin speaking. These parsed names are assigned to the appropriate Speaker objects when Granule.parse_xml() is called.

Similarly, the parsed names of titled speakers are found and assigned when Granule.find_titled_speakers() is called.

These parsed names are used to construct a regex search string that can identify new speakers at the beginning of paragraphs. The function splits the cleaned text up at these new-speaker-matches, and assigns each piece of text to a Passage object attributed to the corresponding speaker.
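
The splitting step described above can be sketched with the standard re module. This is a minimal illustration under stated assumptions, not the pattern crec builds: the function name split_into_passages, the "UNKNOWN" placeholder, the plain dict from parsed name to speaker identifier, and the exact regex are all invented for this example.

```python
import re
from typing import Dict, List, Tuple

UNKNOWN = "UNKNOWN"  # placeholder identifier for text with no known speaker


def split_into_passages(clean_text: str, parsed_names: Dict[str, str]) -> List[Tuple[str, str]]:
    """Hypothetical sketch: split text wherever a line opens with a parsed name.

    parsed_names maps a parsed name (e.g. 'Ms. TENNEY') to a speaker identifier.
    Returns (speaker_id, text) pairs in document order.
    """
    if not parsed_names:
        # No known speakers: the whole text becomes one unattributed passage.
        return [(UNKNOWN, " ".join(clean_text.split()))]

    # Longest names first, so a longer name wins over one that is its prefix.
    ordered = sorted(parsed_names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # A new speaker is: optional indentation, the parsed name, a period, whitespace.
    pattern = re.compile(rf"^[ \t]*({alternatives})\.\s", flags=re.MULTILINE)

    matches = list(pattern.finditer(clean_text))
    passages: List[Tuple[str, str]] = []

    # Anything before the first recognised speaker has no known speaker.
    leading = clean_text[: matches[0].start()] if matches else clean_text
    if leading.strip():
        passages.append((UNKNOWN, " ".join(leading.split())))

    for index, match in enumerate(matches):
        end = matches[index + 1].start() if index + 1 < len(matches) else len(clean_text)
        body = clean_text[match.end():end]
        passages.append((parsed_names[match.group(1)], " ".join(body.split())))
    return passages
```

Escaping each name with re.escape matters because parsed names contain periods, and anchoring on the start of a line keeps a mid-sentence address such as "Mr. Speaker," from being mistaken for a new speaker.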

class crec.text.Paragraph(granule_attributes: dict, paragraph_id: int, passage_id: int, speaker: Speaker, text: str)[source]

A class to represent a single paragraph of text from the Congressional Record.

Parameters:

granule_attributes (dict) – A dictionary of attributes associated with the Granule object this paragraph is derived from. For a list of possible keys, see Granule.attributes.
paragraph_id (int) – An integer indicating the index of this paragraph within the Passage it comes from. Starts from 1.
passage_id (int) – An integer indicating the index of the Passage object this paragraph comes from within the Granule it comes from. Starts from 1.
speaker (Speaker) – The speaker (Member of Congress or titled speaker) that this paragraph belongs to.
text (str) – The text of the paragraph. To eliminate extra whitespace, the text is ultimately split into tokens and rejoined.
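
The split-and-rejoin step mentioned for text is a common Python idiom. The sketch below shows it in isolation; the function name normalize_whitespace is hypothetical and the library may apply it differently.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empty tokens.
    return " ".join(text.split())
```

Applied to a hard-wrapped paragraph from the Congressional Record, this removes the leading indentation and joins the wrapped lines into one line.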

class crec.text.Passage(granule_attributes: dict, passage_id: int, speaker: Speaker, text: str = '')[source]

A class to represent a single passage of text from the Congressional Record.

Parameters:

granule_attributes (dict) – A dictionary of attributes associated with the Granule object this passage is derived from. For a list of possible keys, see Granule.attributes.
passage_id (int) – An integer indicating the index of this passage within the Granule it comes from. Starts from 1.
speaker (Speaker) – The speaker (Member of Congress or titled speaker) that this passage belongs to.
text (str) – The text of the passage. To eliminate extra whitespace, the text is ultimately split into tokens and rejoined.

paragraph_collection[source]

A collection of Paragraph objects that belong to this passage.

Type:: ParagraphCollection

clean_text[source]

The concatenation of the clean text of all of the paragraphs associated with this passage, separated by newlines.

Type:: str

split_into_paragraphs(text)[source]: Splits the passage’s text into Paragraph objects.

class crec.text.ParagraphCollection[source]

A collection of Paragraph objects.

paragraphs[source]

A list of paragraph objects.

Type:: List[Paragraph]

merge(other: ParagraphCollection) → None[source]: Merges another ParagraphCollection with itself by concatenating the two ParagraphCollection.paragraphs lists together.

add(paragraph: Paragraph)[source]: Adds a single Paragraph object to ParagraphCollection.paragraphs.

to_list(include_unknown_speakers: bool = False, search: str = None) → List[Paragraph][source]

Returns a list of Paragraph objects that meet the desired criteria.

Parameters:

include_unknown_speakers (bool = False) – Occasionally, a Granule finds passages and paragraphs with no known speaker. This parameter controls whether such paragraphs should be kept or filtered out.
search (str = None) – If provided, only paragraphs whose text contains search (ignoring case) are included.
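
The two filters can be pictured as follows. This is a sketch of the documented behavior only: DemoParagraph is a hypothetical stand-in for Paragraph, and representing an unknown speaker as None is an assumption of the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DemoParagraph:
    """Hypothetical stand-in for a Paragraph; speaker None means 'unknown'."""
    speaker: Optional[str]
    text: str


def filter_paragraphs(
    paragraphs: List[DemoParagraph],
    include_unknown_speakers: bool = False,
    search: Optional[str] = None,
) -> List[DemoParagraph]:
    """Keep paragraphs that pass the speaker and search criteria."""
    kept = []
    for paragraph in paragraphs:
        if paragraph.speaker is None and not include_unknown_speakers:
            continue  # drop unattributed text unless explicitly requested
        if search is not None and search.lower() not in paragraph.text.lower():
            continue  # case-insensitive substring match
        kept.append(paragraph)
    return kept
```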

to_df(include_unknown_speakers: bool = False, granule_attributes: List[str] = ['granuleDate', 'granuleId'], speaker_attributes: List[str] = ['bioGuideId'], search: str = None) → DataFrame[source]

Construct and return a pd.DataFrame object from paragraphs that meet the desired criteria.

Parameters:

include_unknown_speakers (bool = False) – Occasionally, a Granule finds passages and paragraphs with no known speaker. This parameter controls whether such paragraphs should be kept or filtered out.
granule_attributes (List[str] = [granuleDate, granuleId]) – Each entry in this list will be an additional column in the final pd.DataFrame. For a full list of options, see Granule.attributes.
speaker_attributes (List[str] = [bioGuideId]) – Each entry in this list will be an additional column in the final pd.DataFrame. For a full list of options, see Speaker.attributes.
search (str = None) – If provided, only paragraphs whose text contains search (ignoring case) are included.

class crec.text.PassageCollection[source]

A collection of Passage objects.

passages[source]

A list of passage objects.

Type:: List[Passage]

paragraphs[source]

Stores the Paragraph objects associated with each passage.

Type:: ParagraphCollection

merge(other: PassageCollection) → None[source]: Merges another PassageCollection with itself by concatenating the two PassageCollection.passages lists together.

add(passage: Passage)[source]: Adds a single Passage object to PassageCollection.passages. Ensures that the Passage is non-empty.

to_list(include_unknown_speakers: bool = False, search: str = None) → List[Passage][source]

Returns a list of Passage objects that meet the desired criteria.

Parameters:

include_unknown_speakers (bool = False) – Occasionally, a Granule finds passages and paragraphs with no known speaker. This parameter controls whether such passages should be kept or filtered out.
search (str = None) – If provided, only passages whose text contains search (ignoring case) are included.

to_df(include_unknown_speakers: bool = False, granule_attributes: List[str] = ['granuleDate', 'granuleId'], speaker_attributes: List[str] = ['bioGuideId'], search: str = None) → DataFrame[source]

Construct and return a pd.DataFrame object from passages that meet the desired criteria.

Parameters:

include_unknown_speakers (bool = False) – Occasionally, a Granule finds passages and paragraphs with no known speaker. This parameter controls whether such passages should be kept or filtered out.
granule_attributes (List[str] = [granuleDate, granuleId]) – Each entry in this list will be an additional column in the final pd.DataFrame. For a full list of options, see Granule.attributes.
speaker_attributes (List[str] = [bioGuideId]) – Each entry in this list will be an additional column in the final pd.DataFrame. For a full list of options, see Speaker.attributes.
search (str = None) – If provided, only passages whose text contains search (ignoring case) are included.
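
To show how granule_attributes and speaker_attributes turn into columns, the sketch below builds one row per item as a plain dictionary. It is an illustration, not the library's implementation: the build_rows helper, the nested-dict input shape, and the "text" column name are assumptions, and the identifier values in any sample data are placeholders.

```python
from typing import Dict, List, Sequence


def build_rows(
    items: List[dict],
    granule_attributes: Sequence[str] = ("granuleDate", "granuleId"),
    speaker_attributes: Sequence[str] = ("bioGuideId",),
) -> List[Dict[str, object]]:
    """Hypothetical sketch: one row per item, one column per requested attribute."""
    rows = []
    for item in items:
        row: Dict[str, object] = {}
        for key in granule_attributes:
            # Missing attributes become None rather than raising KeyError.
            row[key] = item["granule_attributes"].get(key)
        for key in speaker_attributes:
            row[key] = item["speaker_attributes"].get(key)
        row["text"] = item["text"]  # assumed column name for the text itself
        rows.append(row)
    return rows
```

A list of dictionaries like this can be passed directly to the pandas.DataFrame constructor, which uses the keys as column names.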

class crec.speaker.Speaker(attributes: Dict[str, str] = None, names: Dict[str, str] = None, titled: bool = False)[source]

A class to represent a single speaker – either a Member of Congress or a titled speaker.

Parameters:

attributes (Dict[str, str] = None) –
A dictionary of attributes that describe the speaker. Possible keys are:
- authorityId
- bioGuideId
- chamber
- congress
- gpoId
- party
- role
- state
names (Dict[str, str] = None) –
A dictionary of names the speaker goes by. Possible keys are:
- parsed
- authority-fnf (First Last)
- authority-lnf (Last, First)
titled (bool = False) – A boolean indicating whether the speaker is a titled speaker (President pro tempore, Chief Justice, etc.), and not a Member of Congress.

classmethod from_title(title: str) → Speaker[source]

This classmethod instantiates a Speaker object from a title alone. The provided title is set to be both the parsed and first-last name of the speaker.

Parameters:: title (str) – The name of the speaker.

classmethod from_member(member: Element) → Speaker[source]

This classmethod instantiates a Speaker object to represent a Member of Congress. The Member’s attributes and names are assigned to the speaker’s attributes and names.

Parameters:: member (xml.etree.ElementTree.Element) – An xml element that represents a Member of Congress.

get_attribute(attribute: str) → str[source]: Returns the requested speaker attribute if it exists; otherwise, returns None.

class crec.downloader.AsyncLoopHandler[source]

Class to handle asynchronous requests. Useful especially in the case where a user is writing code inside an IPython environment.

Adapted from https://stackoverflow.com/a/66055205/17834461

run()[source]

Method representing the thread’s activity.

You may override this method in a subclass. The standard run() method invokes the callable object passed to the object’s constructor as the target argument, if any, with sequential and keyword arguments taken from the args and kwargs arguments, respectively.
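
The general pattern of running an event loop on a separate thread, which is useful when the calling environment (such as IPython) already has a loop running, can be sketched as below. This is a generic illustration of the technique, not the AsyncLoopHandler source; the class name DemoLoopThread and its submit and shutdown methods are hypothetical.

```python
import asyncio
import threading


class DemoLoopThread(threading.Thread):
    """Hypothetical sketch: a daemon thread that owns its own asyncio event loop."""

    def __init__(self) -> None:
        super().__init__(daemon=True)
        self.loop = asyncio.new_event_loop()

    def run(self) -> None:
        # Overrides Thread.run(): bind the loop to this thread and serve forever.
        asyncio.set_event_loop(self.loop)
        self.loop.run_forever()

    def submit(self, coroutine):
        """Run a coroutine on the background loop and block until it finishes."""
        future = asyncio.run_coroutine_threadsafe(coroutine, self.loop)
        return future.result()

    def shutdown(self) -> None:
        # stop() must be scheduled from inside the loop's own thread.
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.join()
        self.loop.close()
```

Because the coroutine runs on the background thread's loop, the caller never needs to call asyncio.run() itself, which would fail inside an already-running loop.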

class crec.downloader.Downloader(granule_class_filter: List[str], parse: bool, write: bool | str, zipped: bool, batch_size: int, batch_wait: bool | int, rate_limit_wait: bool | int, retry_limit: bool | int, api_key: str, logger: Logger)[source]

Supervises and controls requests made through a GovInfoClient object.

Parameters:

granule_class_filter (List[str] = None) –
If provided, only granules with a class listed in granule_class_filter will be retrieved. If granule_class_filter is None, all granules are included. The class options are:
- HOUSE
- SENATE
- EXTENSIONS
- DAILYDIGEST
parse (bool = True) – A boolean that indicates whether or not the text of granules should be parsed.
write (Union[bool, str] = False) – If write is False, then granule text (htm files) and metadata (xml files) will not be written to disk. Otherwise, write should be a path where those files should be written to.
zipped (bool = True) – Determines if granules should be requested individually or in zips. Only applies to calls where dates are used; if you are requesting individual granule identifiers, granules are always requested individually.
batch_size (int = 3) – The number of requests to send asynchronously at the same time. Too high a number will result in frequent rate limit issues. When requesting zip files, this number should be low. When requesting files individually, this should be much higher. To avoid rate limiting, try around 200.
batch_wait (Union[int, bool] = False) – If batch_wait is an int, then after requesting granule data in each batch of size batch_size, the program will halt for batch_wait seconds. Otherwise, batch_wait should be False, and the program will not pause after each batch. When requesting zip files, a batch_wait is not necessary. When requesting files individually, the batch_wait should be around 2-5 seconds.
rate_limit_wait (Union[int, bool] = 300) – If rate_limit_wait is an int, then exceeding the GovInfo rate limit will cause the program to halt for rate_limit_wait seconds. Otherwise, rate_limit_wait should be False, and exceeding the rate limit will throw an uncaught exception.
retry_limit (Union[bool, int] = 5) – If retry_limit is an int, then the program will attempt to request URLs up to retry_limit times before moving on. Otherwise, retry_limit should be False, and URLs will only be tried once.
api_key (str = None) – API key from GovInfo. Can be obtained by visiting https://www.govinfo.gov/api-signup
logger (Logger) – An object that handles outputting logs.

client[source]

A GovInfoClient object that is used to make requests.

Type:: GovInfoClient

incomplete_days[source]

A set of date strings that did not have all of their associated granule identifiers retrieved.

Type:: Set[str]

incomplete_granules[source]

A set of granule identifiers that did not have their data retrieved, parsed, or written to disk (depending on requested behavior).

Type:: Set[str]

async get_granules_in_batch(granules: List[Granule], client: GovInfoClient) → List[Granule][source]: Takes as an input a list of Granule objects and a GovInfoClient. Splits the granules into batches of size self.batch_size. Within each batch, an asyncio.Task is created for each granule, which consists of getting the granule’s data, and potentially parsing and writing it (depending on self.parse and self.write). Should only be called internally.
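
The batching behavior controlled by batch_size and batch_wait can be sketched with asyncio alone. This is a generic illustration, not the Downloader code: the run_in_batches function and its worker argument are hypothetical, and the worker stands in for whatever coroutine fetches a single granule.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar, Union

T = TypeVar("T")
R = TypeVar("R")


async def run_in_batches(
    items: List[T],
    worker: Callable[[T], Awaitable[R]],
    batch_size: int = 3,
    batch_wait: Union[int, bool] = False,
) -> List[R]:
    """Hypothetical sketch: process items concurrently, batch_size at a time."""
    results: List[R] = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # One task per item; the whole batch runs concurrently.
        tasks = [asyncio.create_task(worker(item)) for item in batch]
        # gather() returns results in the same order as the tasks.
        results.extend(await asyncio.gather(*tasks))
        if batch_wait is not False:
            # Pause after the batch to stay under the rate limit.
            await asyncio.sleep(batch_wait)
    return results
```

Only the requests inside a batch overlap; the next batch does not start until every task in the current one has finished.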

async get_granules_from_ids(granule_ids: List[str], client: GovInfoClient) → List[Granule][source]: Takes as an input a list of granule identifiers and a GovInfoClient. Initializes a Granule object for each identifier. Hands the initialized granules over to Downloader.get_granules_in_batch(). Should only be called internally.

get_from_ids(granule_ids: List[str] = []) → List[Granule][source]: Takes as an input a list of granule identifiers and executes the Downloader.get_granules_from_ids() coroutine.

get_from_directory(directory: str) → List[Granule][source]: Takes as an input a path and creates a Granule object for each pair of XML and HTML files in that directory.

async get_granule_ids_from_dates(dates: List[str], client: GovInfoClient) → List[str][source]: Takes as an input a list of date strings and a GovInfoClient. Then, for each date, calls the get_granule_ids() function to get the granule identifiers associated with that given day. Also keeps track of whether all granule identifiers from a particular day were retrieved; if not, that day will be added to self.incomplete_days.

async get_granules_from_dates(dates: List[str]) → List[Granule][source]: Takes as an input a list of date strings. Gets the granule identifiers associated with those dates using Downloader.get_granule_ids_from_dates(), and passes those along to Downloader.get_granules_from_ids().

async get_zips_from_dates_in_batch(dates: List[str], client: GovInfoClient) → List[ZipFile][source]: Takes as an input a list of date strings. Returns the zipped files associated with those dates.

granules_from_zips(zips: List[ZipFile]) → List[Granule][source]: Takes as an input a set of zipped files. For each zipped file, generates a set of Granule objects corresponding to the files within those zips.

async get_granules_from_zips(dates: List[str]) → List[Granule][source]: Takes as an input a list of date strings. Returns a set of Granule objects from those days, but requests the granules in zip files as opposed to one at a time.

get_from_dates(dates: List[str] = []) → List[Granule][source]: Takes as an input a list of date strings and executes the Downloader.get_granules_from_zips() coroutine if self.zipped is True or the Downloader.get_granules_from_dates() coroutine if self.zipped is False.

exception crec.api.RateLimitError[source]: Thrown if the rate limit is exceeded and the Record object’s rate_limit_wait parameter is False.

exception crec.api.APIKeyError[source]: Thrown if an API key is not provided or is invalid.

class crec.api.GovInfoClient(rate_limit_wait: bool | int, retry_limit: bool | int, logger: Logger, api_key: str)[source]

Handles requesting data from the GovInfo API. Inherits from httpx.AsyncClient so that requests can be made asynchronously.

Parameters:

rate_limit_wait (Union[bool, int]) – If rate_limit_wait is an int, then exceeding the GovInfo rate limit will cause the program to wait for rate_limit_wait seconds. Otherwise, rate_limit_wait should be False, and exceeding the rate limit will throw an uncaught exception.
retry_limit (Union[bool, int]) – If retry_limit is an int, then the program will attempt to request URLs up to retry_limit times before moving on. Otherwise, retry_limit should be False, and URLs will only be tried once.
logger (Logger) – An object that handles outputting logs.
api_key (str = None) – API key from GovInfo. Can be obtained by visiting https://www.govinfo.gov/api-signup

async get(url: str, params: dict = {}, use_api: bool = True) → Tuple[bool, Response | None][source]: Extends httpx.AsyncClient.get(). Controls waiting and retrying URLs, and handles GovInfo-specific query parameters like the api_key. Returns a tuple consisting of a boolean indicating whether or not the request was successful and the response itself (the response could be None if the request fails self.retry_limit times).
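
The interaction between rate_limit_wait and retry_limit can be sketched as a retry loop around an arbitrary fetch coroutine. This is not the GovInfoClient implementation: the function get_with_retries, the fetch argument, and the two Demo exception classes are hypothetical, and counting a rate-limited attempt against retry_limit is an assumption of this example.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional, Tuple, Union


class DemoRateLimitError(Exception):
    """Hypothetical signal that the server reported a rate limit."""


class DemoRequestError(Exception):
    """Hypothetical signal that a request failed for any other reason."""


async def get_with_retries(
    fetch: Callable[[str], Awaitable[Any]],
    url: str,
    rate_limit_wait: Union[bool, int],
    retry_limit: Union[bool, int],
) -> Tuple[bool, Optional[Any]]:
    """Return (True, response) on success or (False, None) once attempts run out."""
    # retry_limit False means a single attempt.
    attempts = retry_limit if retry_limit is not False else 1
    for _ in range(attempts):
        try:
            return True, await fetch(url)
        except DemoRateLimitError:
            if rate_limit_wait is False:
                raise  # no waiting configured: let the caller see the error
            await asyncio.sleep(rate_limit_wait)
        except DemoRequestError:
            continue  # try again until the attempts are used up
    return False, None
```

Returning a success flag alongside the response lets the caller record incomplete granules without wrapping every request in its own try block.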

class crec.logger.DuplicateFilter(rate_limit_wait: bool | int)[source]

A custom class that inherits from logging.Filter to filter out duplicate rate limit logs.

Parameters:: rate_limit_wait (Union[bool, int]) – Determines how long the filter should wait before outputting another rate limit error message.

filter(record)[source]: Determines whether or not to output the current record. First, this function checks whether the last two messages in a row were both rate limit exceptions. Then, it checks whether less than rate_limit_wait seconds have passed. If both conditions are True, the record is skipped. Otherwise, the record is output.
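
A filter with this behavior can be sketched as a logging.Filter subclass. The sketch is an approximation under stated assumptions, not the crec class: DemoDuplicateFilter is a hypothetical name, and recognising rate limit messages by the substring "rate limit" is invented for this example.

```python
import logging
import time
from typing import Union


class DemoDuplicateFilter(logging.Filter):
    """Hypothetical sketch: suppress back-to-back rate limit messages."""

    def __init__(self, rate_limit_wait: Union[bool, int], marker: str = "rate limit") -> None:
        super().__init__()
        # With rate_limit_wait False there is no window, so nothing is suppressed.
        self.window = rate_limit_wait if rate_limit_wait is not False else 0
        self.marker = marker
        self.last_was_rate_limit = False
        self.last_rate_limit_time = 0.0

    def filter(self, record: logging.LogRecord) -> bool:
        is_rate_limit = self.marker in record.getMessage().lower()
        now = time.monotonic()
        # Duplicate: this message and the previous one are both rate limit
        # messages, and the last one that was let through is still recent.
        duplicate = (
            is_rate_limit
            and self.last_was_rate_limit
            and (now - self.last_rate_limit_time) < self.window
        )
        self.last_was_rate_limit = is_rate_limit
        if is_rate_limit and not duplicate:
            self.last_rate_limit_time = now
        return not duplicate  # False tells the logging module to drop the record
```

An instance would be attached with addFilter() on a logger or handler, after which repeated rate limit warnings collapse into one message per waiting period.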

class crec.logger.Logger(rate_limit_wait: bool | int, print_logs: bool, write_logs: bool, write_path: str)[source]

A custom logger to handle the logging of status updates. Maintains a logging queue so that logs which are sent during an asynchronous event loop are non-blocking.

Parameters:

rate_limit_wait (Union[bool, int]) – Determines how long the logger should wait before outputting another rate limit error message.
print_logs (bool) – A boolean that determines whether or not logs are printed to stdout.
write_logs (bool) – A boolean that determines whether or not logs are written to disk.
write_path (str = None) – A filename to write logs to. Must be provided if write_logs is True.
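
Queue-based logging of this kind is available in the standard library through QueueHandler and QueueListener. The sketch below shows that general pattern and is not the crec Logger itself; the function build_queue_logger is hypothetical.

```python
import logging
import logging.handlers
import queue
from typing import List, Tuple


def build_queue_logger(
    name: str, handlers: List[logging.Handler]
) -> Tuple[logging.Logger, logging.handlers.QueueListener]:
    """Hypothetical sketch: a logger whose records are handled on a listener thread."""
    log_queue: queue.Queue = queue.Queue()
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger
    # Logging calls only enqueue the record, which is fast and non-blocking.
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    # The listener drains the queue on its own thread and calls the real handlers.
    listener = logging.handlers.QueueListener(log_queue, *handlers)
    listener.start()
    return logger, listener
```

A StreamHandler would cover the print_logs case and a FileHandler the write_logs case; calling listener.stop() processes any records still in the queue before the listener thread exits.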

log(message: str, level: str = 'info') → None[source]

Outputs a log.

Parameters:

message (str) – The message to be logged.
level (str) – The level for the message to be logged.