API documentation
- class crec.record.Record(start_date: str | datetime = None, end_date: str | datetime = None, dates: List[str | datetime] = None, granule_ids: List[str] = None, read_directory: str = None, granule_class_filter: List[str] = None, parse: bool = True, write: bool | str = False, zipped: bool = True, batch_size: int = 3, batch_wait: int | bool = False, rate_limit_wait: int | bool = 30, retry_limit: bool | int = 5, api_key: str = None, print_logs: bool = True, write_logs: bool = False, write_path: str = None)[source]
A collection of Congressional Record data from GovInfo’s Congressional Record API.
- Parameters:
start_date (Union[str, datetime.datetime] = None) – First date for which CREC data will be retrieved.
end_date (Union[str, datetime.datetime] = None) – Last date for which CREC data will be retrieved.
dates (List[Union[str, datetime.datetime]] = None) – A custom list of dates to be used instead of the range created by start_date and end_date.
granule_ids (List[str] = None) – A list of official granule identifiers to be used instead of start_date and end_date or dates.
read_directory (str = None) – A directory to read XML and HTML files from. There should be one XML file and one HTML file per granule in this directory.
granule_class_filter (List[str] = None) – If provided, only granules with a class listed in granule_class_filter will be retrieved. If granule_class_filter is None, all granules are included. The class options are: HOUSE, SENATE, EXTENSIONS, DAILYDIGEST.
parse (bool = True) – A boolean that indicates whether or not the text of granules should be parsed. See Granule.parse_htm() for more information.
write (Union[bool, str] = False) – If write is False, then granule text (htm files) and metadata (xml files) will not be written to disk. Otherwise, write should be a path where those files should be written.
zipped (bool = True) – Determines whether granules should be requested individually or in zips. Only applies to calls where dates are used; if you are requesting individual granule identifiers, granules are always requested individually.
batch_size (int = 3) – The number of requests to send asynchronously at the same time. Too high a number will result in frequent rate limit issues. When requesting zip files, this number should be low. When requesting files individually, this should be much higher. To avoid rate limiting, try around 200.
batch_wait (Union[int, bool] = False) – If batch_wait is an int, then after requesting granule data in each batch of size batch_size, the program will pause for batch_wait seconds. Otherwise, batch_wait should be False, and the program will not pause after each batch. When requesting zip files, a batch_wait is not necessary. When requesting files individually, batch_wait should be around 2-5 seconds.
rate_limit_wait (Union[int, bool] = 30) – If rate_limit_wait is an int, then exceeding the GovInfo rate limit will cause the program to pause for rate_limit_wait seconds. Otherwise, rate_limit_wait should be False, and exceeding the rate limit will raise an uncaught exception.
retry_limit (Union[bool, int] = 5) – If retry_limit is an int, then the program will attempt to request URLs up to retry_limit times before moving on. Otherwise, retry_limit should be False, and URLs will only be tried once.
api_key (str = None) – API key from GovInfo. Can be obtained by visiting https://www.govinfo.gov/api-signup
print_logs (bool) – A boolean that determines whether or not logs are printed to stdout.
write_logs (bool) – A boolean that determines whether or not logs are written to disk.
write_path (str = None) – A filename to write logs to. Must be provided if write_logs is True.
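To make the relationship between start_date, end_date, and dates more concrete, here is a minimal sketch of how an inclusive date range could be expanded into individual days. The helper names to_date and expand_dates and the YYYY-MM-DD string format are assumptions made for illustration; they are not part of crec.

```python
from datetime import date, datetime, timedelta
from typing import List, Union

DateLike = Union[str, datetime]


def to_date(value: DateLike) -> date:
    """Convert a 'YYYY-MM-DD' string (assumed format) or a datetime to a date."""
    if isinstance(value, datetime):
        return value.date()
    return datetime.strptime(value, "%Y-%m-%d").date()


def expand_dates(start_date: DateLike, end_date: DateLike) -> List[str]:
    """Hypothetical helper: list every day from start_date to end_date, inclusive."""
    start, end = to_date(start_date), to_date(end_date)
    if end < start:
        raise ValueError("end_date must not precede start_date")
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]
```

A custom dates list would simply bypass this expansion and be used as given.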
- property incomplete_days: Set[str][source]
A set of date strings that did not have all of their associated granule identifiers retrieved.
- property incomplete_granules: Set[str][source]
A set of granule identifiers that did not have their data retrieved, parsed, or written to disk (depending on requested behavior).
- async crec.granule.get_granule_ids(date: str, client: GovInfoClient, granule_class_filters: List[str], logger: Logger) Tuple[bool, List[str]][source]
A function to retrieve the granule identifiers associated with a specific day. Takes as input a date string, a GovInfoClient object, a list of granule_class_filters, and a Logger object.
This function reaches the /granules endpoint of the GovInfo API, which returns these identifiers. If provided, only granules with a class listed in granule_class_filters will be retrieved. Otherwise, all granules are included.
- class crec.granule.Granule(granule_id: str)[source]
Represents a single GovInfo granule and its associated metadata and text.
- Parameters:
granule_id (str) – The granule’s identifier
- attributes[source]
A dictionary of information describing the granule. Possible keys are:
granuleDate, granuleId, searchTitle, granuleClass, subGranuleClass, chamber
- Type:
dict
- xml_url[source]
A relative URL that reaches the /mods endpoint of the GovInfo API to request the granule’s metadata.
- Type:
str
- htm_url[source]
A relative URL that reaches the /htm endpoint of the GovInfo API to request the granule’s text.
- Type:
str
- raw_text[source]
The text of the granule before elements like headers, page numbers, and times are removed.
- Type:
str
- clean_text[source]
The text of the granule with elements like headers, page numbers, and times removed.
- Type:
str
- speakers[source]
A mapping between speaker identifiers and Speaker objects for all of the speakers in the granule, including speakers referred to by title only (e.g. The PRESIDENT pro tempore).
- Type:
Dict[str, Speaker]
- valid_responses[source]
A boolean that indicates whether the metadata and text requests both properly resolved.
- Type:
bool
- parsed[source]
A boolean that indicates whether the text of the granule was successfully parsed.
- Type:
bool
- written[source]
A boolean that indicates whether the metadata and text of the granule were successfully written to disk.
- Type:
bool
- complete[source]
A boolean that indicates whether the desired behavior (parsing and writing) was achieved. If both parse and write are True, then parsed and written must both be True for the granule to be complete; if parse is True and write is False, only parsed must be True for the granule to be complete; if parse is False and write is True, only written must be True for the granule to be complete. Finally, if both parse and write are False, the granule is automatically complete.
- Type:
bool
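The four cases described for complete reduce to one rule: a step only counts against completeness if it was requested. The following is a minimal sketch of that rule as a standalone function; the name is_complete is hypothetical and this is not crec’s own code.

```python
from typing import Union


def is_complete(parse: bool, write: Union[bool, str], parsed: bool, written: bool) -> bool:
    """Return True when every requested step (parsing, writing) succeeded."""
    # write may be False or a path string; any non-empty path means "write requested".
    write_requested = bool(write)
    parse_ok = parsed or not parse
    write_ok = written or not write_requested
    return parse_ok and write_ok
```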
- parse_exception[source]
In the case of a parsing exception, that exception is assigned to this attribute.
- Type:
Exception
- write_exception[source]
In the case of a writing exception, that exception is assigned to this attribute.
- Type:
Exception
- async async_get(client: GovInfoClient, parse: bool, write: bool | str) None[source]
Takes as input a GovInfoClient object, and booleans indicating whether the granule’s data should be parsed and/or written to disk. Requests the granule’s metadata and text, and proceeds from there.
- parse_responses(xml_response: Response | Element, htm_response: Response | str) None[source]
Takes both the metadata (xml) and text (htm) responses returned by the GovInfoClient object. Tries to parse both of them. In the case of an error, saves the error to the Granule.parse_exception attribute.
- write_responses(write: str, xml_response: Response | Element, htm_response: Response | str) None[source]
Takes a write path, and both the metadata (xml) and text (htm) responses returned by the GovInfoClient object. Tries to write both of them to disk. In the case of an error, saves the error to the Granule.write_exception attribute.
- parse_xml(root: Element) None[source]
Parses the xml response. First, it updates the Granule.attributes dictionary. Then, it finds all listed Members of Congress who spoke during the course of the granule. For each one, it creates a Speaker object and a unique speaker identifier, and adds those to the granule’s mapping of speakers.
- parse_htm(raw_text) None[source]
Parses the text response. Starts by removing common non-speech elements: the title, the footer, page numbers, and times. Then, it calls the Granule.find_titled_speakers() and Granule.find_passages() functions.
- find_titled_speakers() None[source]
Searches through the cleaned text to find instances of ‘titled speakers,’ that is, speakers who are not listed in the xml and are only referred to by title. Examples of this type of speaker include The PRESIDING OFFICER and The CHIEF JUSTICE. For each titled speaker, it creates a new Speaker object and a unique speaker identifier, and adds those to the granule’s mapping of speakers.
- find_passages() None[source]
Splits the cleaned text into Passage objects. This process goes as follows:
First, the function checks the length of the speakers attribute. If the length is zero, there are no known speakers. As such, the entire cleaned text is assigned to a single Passage with an unknown speaker.
If there is at least one speaker, the function continues. The idea here is to take advantage of the ways the Congressional Record introduces new speakers. For Members of Congress, the Congressional Record places their honorific (Mr., Ms., Dr., etc.) and their fully capitalized last name at the beginning of the paragraph where they begin speaking.
An example may look like this:
Ms. TENNEY. Mr. Speaker, I rise today to recognize a new record in Oswego County, New York. The town of Redfield now has the record for the most snowfall in 48 hours. An astonishing 62 inches of snow fell in this idyllic town along the Salmon River with only 550 people near Lake Ontario.
Or this:
Mr. McCONNELL. Mr. President, I ask unanimous consent that the Senate proceed to legislative session for a period of morning business, with Senators permitted to speak therein for up to 10 minutes each.
Notice that the last name is not always fully capitalized. There are some other inconsistencies as well. Some examples include:
Mr. RODNEY DAVIS of Illinois.
Ms. ROS-LEHTINEN.
Finally, titled speakers follow slightly different patterns. Examples include:
The SPEAKER pro tempore.
The CHIEF JUSTICE.
Mr. Manager RASKIN.
Thankfully, the metadata of each granule lists the “parsed name” of each speaking Member of Congress, or how their name will appear in the text when they begin speaking. These parsed names are assigned to the appropriate Speaker objects when Granule.parse_xml() is called.
Similarly, the parsed names of titled speakers are found and assigned when Granule.find_titled_speakers() is called.
These parsed names are used to construct a regex search string that can identify new speakers at the beginning of paragraphs. The function splits the cleaned text up at these new-speaker matches and assigns each piece of text to a Passage object attributed to the corresponding speaker.
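The splitting step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the regex crec actually builds: the function name split_passages is hypothetical, speakers are represented by their parsed-name strings, and a new speaker is assumed to be a parsed name at the start of a line followed by a period.

```python
import re
from typing import List, Optional, Tuple


def split_passages(text: str, parsed_names: List[str]) -> List[Tuple[Optional[str], str]]:
    """Split text into (speaker, passage) pairs at new-speaker matches."""
    if not parsed_names:
        # No known speakers: the whole text is one passage with an unknown speaker.
        return [(None, text.strip())]
    # Try longer names first so a full name wins over a shorter name it starts with.
    names = sorted(parsed_names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    # Assumed pattern: a parsed name at the start of a line, followed by a period.
    pattern = re.compile(rf"^\s*({alternatives})\.\s+", flags=re.MULTILINE)

    passages: List[Tuple[Optional[str], str]] = []
    speaker: Optional[str] = None
    position = 0
    for match in pattern.finditer(text):
        chunk = text[position:match.start()].strip()
        if chunk:
            passages.append((speaker, chunk))
        speaker = match.group(1)
        position = match.end()
    tail = text[position:].strip()
    if tail:
        passages.append((speaker, tail))
    return passages
```

Escaping each name with re.escape matters here because parsed names contain periods (Mr., Ms.) that would otherwise match any character.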
- class crec.text.Paragraph(granule_attributes: dict, paragraph_id: int, passage_id: int, speaker: Speaker, text: str)[source]
A class to represent a single paragraph of text from the Congressional Record.
- Parameters:
granule_attributes (dict) – A dictionary of attributes associated with the Granule object this paragraph is derived from. For a list of possible keys, see Granule.attributes.
paragraph_id (int) – An integer indicating the index of this paragraph within the Passage it comes from. Starts from 1.
passage_id (int) – An integer indicating the index of the Passage object this paragraph comes from within the Granule it comes from. Starts from 1.
speaker (Speaker) – The speaker (Member of Congress or titled speaker) that this paragraph belongs to.
text (str) – The text of the paragraph. To eliminate extra whitespace, the text is ultimately split into tokens and rejoined.
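Splitting text into tokens and rejoining it is a common Python idiom for collapsing extra whitespace. A minimal sketch, assuming whitespace-delimited tokens (the function name normalize_whitespace is hypothetical):

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empty tokens.
    return " ".join(text.split())
```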
- class crec.text.Passage(granule_attributes: dict, passage_id: int, speaker: Speaker, text: str = '')[source]
A class to represent a single passage of text from the Congressional Record.
- Parameters:
granule_attributes (dict) – A dictionary of attributes associated with the Granule object this passage is derived from. For a list of possible keys, see Granule.attributes.
passage_id (int) – An integer indicating the index of this passage within the Granule it comes from. Starts from 1.
speaker (Speaker) – The speaker (Member of Congress or titled speaker) that this passage belongs to.
text (str) – The text of the passage. To eliminate extra whitespace, the text is ultimately split into tokens and rejoined.
- class crec.text.ParagraphCollection[source]
A collection of Paragraph objects.
- merge(other: ParagraphCollection) None[source]
Merges another ParagraphCollection with itself by concatenating the two ParagraphCollection.paragraphs lists together.
- add(paragraph: Paragraph)[source]
Adds a single Paragraph object to ParagraphCollection.paragraphs.
- to_list(include_unknown_speakers: bool = False, search: str = None) List[Paragraph][source]
Returns a list of Paragraph objects that meet the desired criteria.
- Parameters:
include_unknown_speakers (bool = False) – Occasionally, a Granule finds passages and paragraphs with no known speaker. This parameter controls whether such paragraphs should be kept or filtered out.
search (str = None) – If provided, only paragraphs whose text contains search (ignoring case) are included.
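The two filtering criteria can be illustrated with a small standalone sketch. The Item class is a hypothetical stand-in for a Paragraph, and representing an unknown speaker as None is an assumption; crec may represent it differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a Paragraph: a speaker name (or None) and text."""
    speaker: Optional[str]
    text: str


def filter_items(items: List[Item], include_unknown_speakers: bool = False,
                 search: Optional[str] = None) -> List[Item]:
    """Keep the items that pass both the speaker and the search criteria."""
    kept = []
    for item in items:
        if item.speaker is None and not include_unknown_speakers:
            continue  # drop text with no known speaker
        if search is not None and search.lower() not in item.text.lower():
            continue  # case-insensitive substring match
        kept.append(item)
    return kept
```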
- to_df(include_unknown_speakers: bool = False, granule_attributes: List[str] = ['granuleDate', 'granuleId'], speaker_attributes: List[str] = ['bioGuideId'], search: str = None) DataFrame[source]
Construct and return a pd.DataFrame object from paragraphs that meet the desired criteria.
- Parameters:
include_unknown_speakers (bool = False) – Occasionally, a Granule finds passages and paragraphs with no known speaker. This parameter controls whether such paragraphs should be kept or filtered out.
granule_attributes (List[str] = [granuleDate, granuleId]) – Each entry in this list will be an additional column in the final pd.DataFrame. For a full list of options, see Granule.attributes.
speaker_attributes (List[str] = [bioGuideId]) – Each entry in this list will be an additional column in the final pd.DataFrame. For a full list of options, see Speaker.attributes.
search (str = None) – If provided, only paragraphs whose text contains search (ignoring case) are included.
- class crec.text.PassageCollection[source]
A collection of Passage objects.
- merge(other: PassageCollection) None[source]
Merges another PassageCollection with itself by concatenating the two PassageCollection.passages lists together.
- add(passage: Passage)[source]
Adds a single Passage object to PassageCollection.passages. Ensures that the Passage is non-empty.
- to_list(include_unknown_speakers: bool = False, search: str = None) List[Passage][source]
Returns a list of Passage objects that meet the desired criteria.
- Parameters:
include_unknown_speakers (bool = False) – Occasionally, a Granule finds passages and paragraphs with no known speaker. This parameter controls whether such passages should be kept or filtered out.
search (str = None) – If provided, only passages whose text contains search (ignoring case) are included.
- to_df(include_unknown_speakers: bool = False, granule_attributes: List[str] = ['granuleDate', 'granuleId'], speaker_attributes: List[str] = ['bioGuideId'], search: str = None) DataFrame[source]
Construct and return a pd.DataFrame object from passages that meet the desired criteria.
- Parameters:
include_unknown_speakers (bool = False) – Occasionally, a Granule finds passages and paragraphs with no known speaker. This parameter controls whether such passages should be kept or filtered out.
granule_attributes (List[str] = [granuleDate, granuleId]) – Each entry in this list will be an additional column in the final pd.DataFrame. For a full list of options, see Granule.attributes.
speaker_attributes (List[str] = [bioGuideId]) – Each entry in this list will be an additional column in the final pd.DataFrame. For a full list of options, see Speaker.attributes.
search (str = None) – If provided, only passages whose text contains search (ignoring case) are included.
- class crec.speaker.Speaker(attributes: Dict[str, str] = None, names: Dict[str, str] = None, titled: bool = False)[source]
A class to represent a single speaker – either a Member of Congress or a titled speaker.
- Parameters:
attributes (Dict[str, str] = None) –
A dictionary of attributes that describe the speaker. Possible keys are:
authorityId, bioGuideId, chamber, congress, gpoId, party, role, state
names (Dict[str, str] = None) –
A dictionary of names the speaker goes by. Possible keys are:
parsed, authority-fnf (First Last), authority-lnf (Last, First)
titled (bool = False) – A boolean indicating whether the speaker is a titled speaker (President pro tempore, Chief Justice, etc.), and not a Member of Congress.
- classmethod from_title(title: str) Speaker[source]
This classmethod instantiates a Speaker object from a title alone. The provided title is set to be both the parsed and first-last name of the speaker.
- Parameters:
title (str) – The name of the speaker.
- classmethod from_member(member: Element) Speaker[source]
This classmethod instantiates a Speaker object to represent a Member of Congress. The Member’s attributes and names are assigned to the speaker’s attributes and names.
- Parameters:
member (xml.etree.ElementTree.Element) – An xml element that represents a Member of Congress.
- class crec.downloader.AsyncLoopHandler[source]
Class to handle asynchronous requests. Useful especially in the case where a user is writing code inside an IPython environment.
Adapted from https://stackoverflow.com/a/66055205/17834461
- run()[source]
Method representing the thread’s activity.
You may override this method in a subclass. The standard run() method invokes the callable object passed to the object’s constructor as the target argument, if any, with sequential and keyword arguments taken from the args and kwargs arguments, respectively.
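In IPython or Jupyter an event loop is typically already running in the main thread, so asyncio.run() cannot be called there. One common workaround, sketched below, is to run a separate event loop in a background thread and submit coroutines to it. The class name LoopThread and its submit and stop methods are hypothetical; this illustrates the general pattern rather than AsyncLoopHandler’s actual implementation.

```python
import asyncio
import threading


class LoopThread(threading.Thread):
    """Hypothetical sketch: a daemon thread that owns its own asyncio event loop."""

    def __init__(self) -> None:
        super().__init__(daemon=True)
        self.loop = asyncio.new_event_loop()

    def run(self) -> None:
        # Executes in the new thread and blocks until stop() is called.
        asyncio.set_event_loop(self.loop)
        self.loop.run_forever()

    def submit(self, coroutine):
        """Schedule a coroutine on the background loop and wait for its result."""
        future = asyncio.run_coroutine_threadsafe(coroutine, self.loop)
        return future.result()

    def stop(self) -> None:
        """Stop the loop from outside its thread, then clean up."""
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.join()
        self.loop.close()


async def add(a: int, b: int) -> int:
    """Tiny coroutine standing in for an asynchronous request."""
    await asyncio.sleep(0)
    return a + b
```

Usage is start(), one or more submit(...) calls, then stop().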
- class crec.downloader.Downloader(granule_class_filter: List[str], parse: bool, write: bool | str, zipped: bool, batch_size: int, batch_wait: bool | int, rate_limit_wait: bool | int, retry_limit: bool | int, api_key: str, logger: Logger)[source]
Supervises and controls requests made through a GovInfoClient object.
- Parameters:
granule_class_filter (List[str] = None) – If provided, only granules with a class listed in granule_class_filter will be retrieved. If granule_class_filter is None, all granules are included. The class options are: HOUSE, SENATE, EXTENSIONS, DAILYDIGEST.
parse (bool = True) – A boolean that indicates whether or not the text of granules should be parsed.
write (Union[bool, str] = False) – If write is False, then granule text (htm files) and metadata (xml files) will not be written to disk. Otherwise, write should be a path where those files should be written.
zipped (bool = True) – Determines whether granules should be requested individually or in zips. Only applies to calls where dates are used; if you are requesting individual granule identifiers, granules are always requested individually.
batch_size (int = 3) – The number of requests to send asynchronously at the same time. Too high a number will result in frequent rate limit issues. When requesting zip files, this number should be low. When requesting files individually, this should be much higher. To avoid rate limiting, try around 200.
batch_wait (Union[int, bool] = False) – If batch_wait is an int, then after requesting granule data in each batch of size batch_size, the program will pause for batch_wait seconds. Otherwise, batch_wait should be False, and the program will not pause after each batch. When requesting zip files, a batch_wait is not necessary. When requesting files individually, batch_wait should be around 2-5 seconds.
rate_limit_wait (Union[int, bool] = 300) – If rate_limit_wait is an int, then exceeding the GovInfo rate limit will cause the program to pause for rate_limit_wait seconds. Otherwise, rate_limit_wait should be False, and exceeding the rate limit will raise an uncaught exception.
retry_limit (Union[bool, int] = 5) – If retry_limit is an int, then the program will attempt to request URLs up to retry_limit times before moving on. Otherwise, retry_limit should be False, and URLs will only be tried once.
api_key (str = None) – API key from GovInfo. Can be obtained by visiting https://www.govinfo.gov/api-signup
logger (Logger) – An object that handles outputting logs.
- client[source]
A GovInfoClient object that is used to make requests.
- Type:
GovInfoClient
- incomplete_days[source]
A set of date strings that did not have all of their associated granule identifiers retrieved.
- Type:
Set[str]
- incomplete_granules[source]
A set of granule identifiers that did not have their data retrieved, parsed, or written to disk (depending on requested behavior).
- Type:
Set[str]
- async get_granules_in_batch(granules: List[Granule], client: GovInfoClient) List[Granule][source]
Takes as input a list of Granule objects and a GovInfoClient. Splits the granules into batches of size self.batch_size. Within each batch, an asyncio.Task is created for each granule, which consists of getting the granule’s data and potentially parsing and writing it (depending on self.parse and self.write). Should only be called internally.
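The batching behavior described here, together with the batch_wait pause, can be sketched with plain asyncio. The names run_in_batches and fake_fetch are hypothetical, and fake_fetch merely stands in for a network request; this is an illustration of the pattern, not the Downloader’s code.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar, Union

T = TypeVar("T")
R = TypeVar("R")


async def run_in_batches(items: Sequence[T], worker: Callable[[T], Awaitable[R]],
                         batch_size: int, batch_wait: Union[int, bool] = False) -> List[R]:
    """Process items concurrently, batch_size at a time, pausing between batches."""
    results: List[R] = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # One task per item; the whole batch runs concurrently.
        tasks = [asyncio.create_task(worker(item)) for item in batch]
        results.extend(await asyncio.gather(*tasks))
        is_last = start + batch_size >= len(items)
        if batch_wait is not False and not is_last:
            await asyncio.sleep(batch_wait)
    return results


async def fake_fetch(granule_id: str) -> str:
    """Stand-in for a network request."""
    await asyncio.sleep(0)
    return granule_id.upper()
```

asyncio.gather returns results in the order the tasks were passed in, so the output order matches the input order even though the requests overlap.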
- async get_granules_from_ids(granule_ids: List[str], client: GovInfoClient) List[Granule][source]
Takes as input a list of granule identifiers and a GovInfoClient. Initializes a Granule object for each identifier. Hands the initialized granules over to Downloader.get_granules_in_batch(). Should only be called internally.
- get_from_ids(granule_ids: List[str] = []) List[Granule][source]
Takes as input a list of granule identifiers and executes the Downloader.get_granules_from_ids() coroutine.
- get_from_directory(directory: str) List[Granule][source]
Takes as input a path and creates a Granule object for each set of XML and HTML files in that directory.
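Pairing each granule’s two files could look something like the sketch below. It assumes that a granule’s metadata and text files share a file stem and use the .xml and .htm extensions; both the naming scheme and the helper name pair_granule_files are assumptions, not documented crec behavior.

```python
from pathlib import Path
from typing import Dict, Tuple


def pair_granule_files(directory: str) -> Dict[str, Tuple[Path, Path]]:
    """Map each file stem to its (xml, htm) pair; stems missing either file are skipped."""
    root = Path(directory)
    xml_files = {path.stem: path for path in root.glob("*.xml")}
    htm_files = {path.stem: path for path in root.glob("*.htm")}
    # Only stems present in both sets form a complete granule.
    shared = sorted(xml_files.keys() & htm_files.keys())
    return {stem: (xml_files[stem], htm_files[stem]) for stem in shared}
```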
- async get_granule_ids_from_dates(dates: List[str], client: GovInfoClient) List[str][source]
Takes as input a list of date strings and a GovInfoClient. Then, for each date, calls the get_granule_ids() function to get the granule identifiers associated with that given day. Also keeps track of whether all granule identifiers from a particular day were retrieved; if not, that day will be added to self.incomplete_days.
- async get_granules_from_dates(dates: List[str]) List[Granule][source]
Takes as input a list of date strings. Gets the granule identifiers associated with those dates using Downloader.get_granule_ids_from_dates(), and passes those along to Downloader.get_granules_from_ids().
- async get_zips_from_dates_in_batch(dates: List[str], client: GovInfoClient) List[ZipFile][source]
Takes as an input a list of date strings. Returns the zipped files associated with those dates.
- granules_from_zips(zips: List[ZipFile]) List[Granule][source]
Takes as input a set of zipped files. For each zipped file, generates a set of Granule objects corresponding to the files within those zips.
- async get_granules_from_zips(dates: List[str]) List[Granule][source]
Takes as input a list of date strings. Returns a set of Granule objects from those days, but requests the granules in zip files as opposed to one at a time.
- get_from_dates(dates: List[str] = []) List[Granule][source]
Takes as input a list of date strings and executes the Downloader.get_granules_from_zips() coroutine if self.zipped is True or the Downloader.get_granules_from_dates() coroutine if self.zipped is False.
- exception crec.api.RateLimitError[source]
Raised if the rate limit is exceeded and the Record object’s rate_limit_wait parameter is False.
- class crec.api.GovInfoClient(rate_limit_wait: bool | int, retry_limit: bool | int, logger: Logger, api_key: str)[source]
Handles requesting data from the GovInfo API. Inherits from httpx.AsyncClient so that requests can be made asynchronously.
- Parameters:
rate_limit_wait (Union[bool, int]) – If rate_limit_wait is an int, then exceeding the GovInfo rate limit will cause the program to wait for rate_limit_wait seconds. Otherwise, rate_limit_wait should be False, and exceeding the rate limit will raise an uncaught exception.
retry_limit (Union[bool, int]) – If retry_limit is an int, then the program will attempt to request URLs up to retry_limit times before moving on. Otherwise, retry_limit should be False, and URLs will only be tried once.
logger (Logger) – An object that handles outputting logs.
api_key (str = None) – API key from GovInfo. Can be obtained by visiting https://www.govinfo.gov/api-signup
- async get(url: str, params: dict = {}, use_api: bool = True) Tuple[bool, Response | None][source]
Extends httpx.AsyncClient.get(). Controls waiting and retrying URLs, and handles GovInfo-specific query parameters like the api_key. Returns a tuple consisting of a boolean indicating whether or not the request was successful and the response itself (the response could be None if the request fails self.retry_limit times).
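The wait-and-retry behavior and the (success, response) return shape can be sketched without any HTTP library. Everything named below is hypothetical: RateLimited stands in for whatever signals a rate limit, fetch stands in for the underlying request, and counting a rate-limit wait as one attempt is an assumption rather than documented behavior.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple, Union


class RateLimited(Exception):
    """Hypothetical signal that the server refused a request due to rate limiting."""


async def get_with_retries(fetch: Callable[[], Awaitable[str]],
                           rate_limit_wait: Union[bool, int],
                           retry_limit: Union[bool, int]) -> Tuple[bool, Optional[str]]:
    """Call fetch until it succeeds or the retry budget is spent."""
    # retry_limit=False means the request is only tried once.
    attempts = retry_limit if retry_limit is not False else 1
    for _ in range(attempts):
        try:
            return True, await fetch()
        except RateLimited:
            if rate_limit_wait is False:
                raise  # no waiting configured: surface the error to the caller
            await asyncio.sleep(rate_limit_wait)
        except ConnectionError:
            continue  # transient failure: try again
    return False, None
```

Returning a flag alongside the response lets callers record failed requests (for example, as incomplete granules) without wrapping every call in exception handling.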
- class crec.logger.DuplicateFilter(rate_limit_wait: bool | int)[source]
A custom class that inherits from logging.Filter to filter out duplicate rate limit logs.
- Parameters:
rate_limit_wait (Union[bool, int]) – Determines how long the filter should wait before outputting another rate limit error message.
- filter(record)[source]
Determines whether or not to output the current record. First, this function checks whether the last two messages in a row were both rate limit exceptions. Then, it checks whether fewer than rate_limit_wait seconds have passed. If all conditions are True, then the record is skipped. Otherwise, the record is output.
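A filter along these lines can be built on logging.Filter from the standard library. The sketch below is an approximation under stated assumptions: the class name is hypothetical, rate limit records are recognized by the phrase "rate limit" in the message, and elapsed time is measured from the last rate limit message that was let through.

```python
import logging
import time


class RateLimitDuplicateFilter(logging.Filter):
    """Hypothetical sketch: suppress repeated rate limit messages for `wait` seconds."""

    def __init__(self, wait: float) -> None:
        super().__init__()
        self.wait = wait
        self.last_was_rate_limit = False
        self.last_emitted = float("-inf")

    def filter(self, record: logging.LogRecord) -> bool:
        # Assumption: rate limit logs can be recognized by their message text.
        is_rate_limit = "rate limit" in record.getMessage().lower()
        now = time.monotonic()
        repeated = is_rate_limit and self.last_was_rate_limit
        too_soon = now - self.last_emitted < self.wait
        self.last_was_rate_limit = is_rate_limit
        if repeated and too_soon:
            return False  # skip the duplicate
        if is_rate_limit:
            self.last_emitted = now
        return True
```

Returning False from filter() tells the logging machinery to drop the record; any other message in between resets the "in a row" condition.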
- class crec.logger.Logger(rate_limit_wait: bool | int, print_logs: bool, write_logs: bool, write_path: str)[source]
A custom logger to handle the logging of status updates. Maintains a logging queue so that logs which are sent during an asynchronous event loop are non-blocking.
- Parameters:
rate_limit_wait (Union[bool, int]) – Determines how long the logger should wait before outputting another rate limit error message.
print_logs (bool) – A boolean that determines whether or not logs are printed to stdout.
write_logs (bool) – A boolean that determines whether or not logs are written to disk.
write_path (str = None) – A filename to write logs to. Must be provided if write_logs is True.
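Queue-based, non-blocking logging is available in the standard library through logging.handlers.QueueHandler and QueueListener. The sketch below shows that general pattern; the helper build_queued_logger and the in-memory ListHandler are hypothetical, and crec’s Logger may be organized differently.

```python
import logging
import logging.handlers
import queue
from typing import List


class ListHandler(logging.Handler):
    """Collects formatted messages in memory; stands in for stdout or a file."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(self.format(record))


def build_queued_logger(name: str, target: logging.Handler):
    """Return a logger that enqueues records and a listener that drains them."""
    log_queue: queue.Queue = queue.Queue()
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Producers only put records on the queue, which returns immediately.
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    # The listener forwards records to the (possibly slow) handler on its own thread.
    listener = logging.handlers.QueueListener(log_queue, target)
    return logger, listener
```

Call listener.start() before logging and listener.stop() at shutdown; stop() waits for queued records to be handled before returning.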