data_collection package#
Submodules#
data_collection.CrossrefWrapper module#
- class academic_metrics.data_collection.CrossrefWrapper.CrossrefWrapper(*, scraper, base_url='https://api.crossref.org/works', affiliation='Salisbury%20University', from_year=2017, to_year=2024, from_month=1, to_month=12, test_run=False, run_scraper=True)[source]#
Bases: object
A wrapper class for interacting with the Crossref API to fetch and process publication data.
- logger#
Logger for logging messages.
- Type:
logging.Logger
- semaphore#
Semaphore to control the rate of concurrent requests.
- Type:
- fetch_data(session: aiohttp.ClientSession, url: str, headers: dict[str, Any], retries: int, retry_delay: int) -> dict[str, Any] | None: Fetches data from the given URL using aiohttp.
- build_request_url(base_url: str, affiliation: str, from_date: str, to_date: str, n_element: str, sort_type: str, sort_ord: str, cursor: str, has_abstract: bool | None = False) -> str: Builds the request URL for the Crossref API.
- process_items(data: dict[str, Any], from_date: str, to_date: str, affiliation: str | None = 'salisbury univ') -> list[dict[str, Any]]: Processes the items fetched from the Crossref API, filtering by date and affiliation.
- _get_last_day_of_month(year: int, month: int) -> int: Returns the last day of the given month in the given year.
- fetch_data_for_multiple_years() -> list[dict[str, Any]][source]#
Fetches data for multiple years asynchronously.
- serialize_to_json(output_file: str) -> None: Serializes the fetched data to a JSON file.
- run_all_process(save_offline: bool = False) -> Union[None, List[Dict[str, Any]]]: Runs all data fetching and processing.
- __init__(*, scraper, base_url='https://api.crossref.org/works', affiliation='Salisbury%20University', from_year=2017, to_year=2024, from_month=1, to_month=12, test_run=False, run_scraper=True)[source]#
Initializes the CrossrefWrapper with the given parameters.
- Parameters:
base_url (str) – The base URL for the Crossref API.
affiliation (str) – The affiliation to filter publications by.
from_year (int) – The starting year for the publication search.
to_year (int) – The ending year for the publication search.
logger (logging.Logger, optional) – Logger for logging messages. Defaults to None.
- async fetch_data(session, url, headers, retries, retry_delay)[source]#
Fetches data from the given URL using aiohttp.
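- Parameters:
session (aiohttp.ClientSession) – The aiohttp session to use for the request.
url (str) – The URL to fetch data from.
headers (dict[str, Any]) – The headers to send with the request.
retries (int) – Number of retries in case of failure.
retry_delay (int) – Delay between retries in seconds.
- Returns:
The JSON data fetched from the URL, or None if an error occurs.
- Return type:
dict[str, Any] | None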
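The retry behaviour described for fetch_data can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: `fetch_with_retries` and `make_flaky_fetch` are hypothetical names, and the simulated coroutine stands in for the real aiohttp GET request.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


async def fetch_with_retries(
    fetch_once: Callable[[], Awaitable[dict[str, Any]]],
    retries: int,
    retry_delay: float,
) -> Optional[dict[str, Any]]:
    """Call fetch_once up to `retries` times; return None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return await fetch_once()
        except Exception as exc:  # a real client would catch narrower errors
            print(f"attempt {attempt}/{retries} failed: {exc}")
            if attempt < retries:
                await asyncio.sleep(retry_delay)
    return None


def make_flaky_fetch(failures: int) -> Callable[[], Awaitable[dict[str, Any]]]:
    """Hypothetical stand-in for an HTTP GET: fails `failures` times, then succeeds."""
    state = {"calls": 0}

    async def fetch_once() -> dict[str, Any]:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError("simulated network error")
        return {"status": "ok", "calls": state["calls"]}

    return fetch_once


# Two failures followed by a success: the third attempt returns data.
recovered = asyncio.run(fetch_with_retries(make_flaky_fetch(2), retries=5, retry_delay=0))
# More failures than retries: the helper gives up and returns None.
exhausted = asyncio.run(fetch_with_retries(make_flaky_fetch(9), retries=3, retry_delay=0))
```

Returning None instead of raising matches the documented contract, so callers have to check the result before using it.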
- build_request_url(base_url, affiliation, from_date, to_date, n_element, sort_type, sort_ord, cursor, has_abstract=False)[source]#
Builds the request URL for the Crossref API.
- Parameters:
base_url (str) – The base URL for the Crossref API.
affiliation (str) – The affiliation to filter publications by.
from_date (str) – The starting date for the publication search.
to_date (str) – The ending date for the publication search.
n_element (str) – Number of elements to fetch per request.
sort_type (str) – The type of sorting to apply.
sort_ord (str) – The order of sorting (asc or desc).
cursor (str) – The cursor for pagination.
has_abstract (bool, optional) – Whether to filter for publications with abstracts. Defaults to False.
- Returns:
The constructed request URL.
- Return type:
str
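As a rough illustration of how these parameters could map onto a Crossref query string, the sketch below assembles a URL from the public Crossref REST API parameters (`query.affiliation`, `filter`, `rows`, `sort`, `order`, `cursor`). The function name `build_crossref_url` and the exact parameter mapping are assumptions; the package's own method may build the URL differently.

```python
def build_crossref_url(
    base_url: str,
    affiliation: str,
    from_date: str,
    to_date: str,
    n_element: str,
    sort_type: str,
    sort_ord: str,
    cursor: str,
    has_abstract: bool = False,
) -> str:
    """Assemble a Crossref /works query URL (illustrative, not the package's code)."""
    filters = [f"from-pub-date:{from_date}", f"until-pub-date:{to_date}"]
    if has_abstract:
        filters.append("has-abstract:true")
    params = [
        f"query.affiliation={affiliation}",  # assumed to be percent-encoded already
        f"filter={','.join(filters)}",
        f"rows={n_element}",
        f"sort={sort_type}",
        f"order={sort_ord}",
        f"cursor={cursor}",
    ]
    return f"{base_url}?{'&'.join(params)}"


url = build_crossref_url(
    "https://api.crossref.org/works",
    "Salisbury%20University",
    "2018-01-01",
    "2024-10-09",
    "1000",
    "relevance",
    "desc",
    "*",
)
url_with_abstracts = build_crossref_url(
    "https://api.crossref.org/works",
    "Salisbury%20University",
    "2018-01-01",
    "2024-10-09",
    "1000",
    "relevance",
    "desc",
    "*",
    has_abstract=True,
)
```

The affiliation is passed through unchanged because the documented default ('Salisbury%20University') is already percent-encoded.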
- process_items(data, from_date, to_date, affiliation='salisbury univ')[source]#
Processes the items fetched from the Crossref API, filtering by date and affiliation.
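- Parameters:
data (dict[str, Any]) – The data fetched from the Crossref API.
from_date (str) – The starting date for the publication search.
to_date (str) – The ending date for the publication search.
affiliation (str | None, optional) – The affiliation to filter publications by. Defaults to 'salisbury univ'.
- Returns:
The filtered list of items.
- Return type:
list[dict[str, Any]]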
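The date and affiliation filtering might look roughly like the sketch below. It assumes the usual Crossref response layout (`message.items`, a `published.date-parts` list, and per-author `affiliation` entries with a `name`); the helper names and the sample records, including their DOIs, are made up for illustration and do not come from the package.

```python
from datetime import date
from typing import Any, Optional


def _published_date(item: dict[str, Any]) -> Optional[date]:
    """Read the first 'date-parts' entry; a missing month or day defaults to 1."""
    parts = item.get("published", {}).get("date-parts", [[]])[0]
    if not parts or parts[0] is None:
        return None
    year, month, day = (list(parts) + [1, 1])[:3]
    return date(year, month, day)


def filter_items(
    data: dict[str, Any],
    from_date: str,
    to_date: str,
    affiliation: Optional[str] = "salisbury univ",
) -> list[dict[str, Any]]:
    """Keep items published within the range that list a matching affiliation."""
    start, end = date.fromisoformat(from_date), date.fromisoformat(to_date)
    needle = (affiliation or "").lower()
    kept = []
    for item in data.get("message", {}).get("items", []):
        published = _published_date(item)
        if published is None or not (start <= published <= end):
            continue
        names = [
            aff.get("name", "").lower()
            for author in item.get("author", [])
            for aff in author.get("affiliation", [])
        ]
        # Case-insensitive substring match against every author affiliation.
        if any(needle in name for name in names):
            kept.append(item)
    return kept


# Made-up sample records for demonstration only.
sample = {"message": {"items": [
    {"DOI": "10.1000/a", "published": {"date-parts": [[2020, 5]]},
     "author": [{"affiliation": [{"name": "Salisbury University, MD"}]}]},
    {"DOI": "10.1000/b", "published": {"date-parts": [[2016, 1, 1]]},
     "author": [{"affiliation": [{"name": "Salisbury University"}]}]},
    {"DOI": "10.1000/c", "published": {"date-parts": [[2021]]},
     "author": [{"affiliation": [{"name": "Another University"}]}]},
]}}
kept_dois = [item["DOI"] for item in filter_items(sample, "2018-01-01", "2024-10-09")]
```

A substring match on a lowercase prefix such as 'salisbury univ' tolerates variants like "Salisbury University, MD" and "Salisbury Univ.".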
- async acollect_yrange(session, from_date='2018-01-01', to_date='2024-10-09', n_element='1000', sort_type='relevance', sort_ord='desc', cursor='*', retries=5, retry_delay=3)[source]#
Collects data for a range of years asynchronously.
- Parameters:
session (aiohttp.ClientSession) – The aiohttp session to use for the request.
from_date (str) – The starting date for the publication search.
to_date (str) – The ending date for the publication search.
n_element (str) – Number of elements to fetch per request.
sort_type (str) – The type of sorting to apply.
sort_ord (str) – The order of sorting (asc or desc).
cursor (str) – The cursor for pagination.
retries (int) – Number of retries in case of failure.
retry_delay (int) – Delay between retries in seconds.
- Returns:
A tuple containing the list of items and the next cursor.
- Return type:
tuple
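Cursor-based collection of this kind generally follows a loop like the one sketched here: request a page, append its items, and continue with the `next-cursor` value until a page comes back empty. The sketch uses a hypothetical in-memory `fake_fetch_page` in place of real HTTP requests, and the stopping rule is an assumption about how the method behaves.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional

Page = dict[str, Any]


async def collect_all_pages(
    fetch_page: Callable[[str], Awaitable[Optional[Page]]],
    semaphore: asyncio.Semaphore,
    cursor: str = "*",
) -> tuple[list[dict[str, Any]], Optional[str]]:
    """Follow 'next-cursor' values until a page is empty or the fetch fails."""
    items: list[dict[str, Any]] = []
    next_cursor: Optional[str] = cursor
    while next_cursor is not None:
        async with semaphore:  # cap the number of requests in flight
            page = await fetch_page(next_cursor)
        if page is None:
            break
        message = page.get("message", {})
        batch = message.get("items", [])
        if not batch:
            break
        items.extend(batch)
        next_cursor = message.get("next-cursor")
    return items, next_cursor


# Hypothetical responses keyed by the cursor that requests them.
FAKE_PAGES: dict[str, Page] = {
    "*": {"message": {"items": [{"id": 1}, {"id": 2}], "next-cursor": "c1"}},
    "c1": {"message": {"items": [{"id": 3}], "next-cursor": "c2"}},
    "c2": {"message": {"items": [], "next-cursor": "c2"}},
}


async def fake_fetch_page(cursor: str) -> Optional[Page]:
    return FAKE_PAGES.get(cursor)


async def main() -> tuple[list[dict[str, Any]], Optional[str]]:
    semaphore = asyncio.Semaphore(2)
    return await collect_all_pages(fake_fetch_page, semaphore)


all_items, last_cursor = asyncio.run(main())
```

The semaphore has little effect in a single sequential loop; it becomes useful when several date ranges are collected concurrently and share one limit.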
- _get_last_day_of_month(year, month)[source]#
Returns the last day of the given month in the given year. Handles leap years for February.
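One common way to get the last day of a month, leap years included, is `calendar.monthrange` from the standard library. The sketch below shows that approach; whether the package uses it internally is not stated here, and `last_day_of_month` is an illustrative name.

```python
import calendar


def last_day_of_month(year: int, month: int) -> int:
    """Return the number of days in the given month of the given year."""
    # monthrange returns (weekday of the first day, number of days in the month)
    return calendar.monthrange(year, month)[1]


feb_leap = last_day_of_month(2024, 2)
feb_common = last_day_of_month(2023, 2)
december = last_day_of_month(2024, 12)
```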
- run_afetch_yrange()[source]#
Runs the asynchronous data fetch for multiple years.
- Returns:
The instance of the class for method chaining.
- Return type:
self
- final_data_process()[source]#
Processes the final data, filling in missing abstracts.
- Returns:
The instance of the class for method chaining.
- Return type:
self
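Because both methods return the instance, calls can be chained. The stand-in class below only mirrors that return-self pattern; `PipelineSketch`, its methods, and its placeholder data are hypothetical and do not reproduce the real fetching or abstract-filling logic.

```python
from typing import Any


class PipelineSketch:
    """Hypothetical stand-in that mirrors the return-self pattern."""

    def __init__(self) -> None:
        self.items: list[dict[str, Any]] = []

    def run_fetch(self) -> "PipelineSketch":
        # Placeholder for the asynchronous fetch step.
        self.items = [
            {"title": "A", "abstract": None},
            {"title": "B", "abstract": "text"},
        ]
        return self

    def final_process(self) -> "PipelineSketch":
        # Placeholder for filling in missing abstracts.
        for item in self.items:
            if item["abstract"] is None:
                item["abstract"] = "(filled in later)"
        return self


result = PipelineSketch().run_fetch().final_process().items
```

With the documented class, the equivalent chain would presumably be `wrapper.run_afetch_yrange().final_data_process()` on a constructed CrossrefWrapper instance.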
data_collection.scraper module#
- class academic_metrics.data_collection.scraper.CleanerOutput(**data)[source]#
Bases: BaseModel
Pydantic model for the cleaner output.
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'extra_context': FieldInfo(annotation=Dict[str, Any], required=True), 'page_content': FieldInfo(annotation=str, required=True)}#
Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- class academic_metrics.data_collection.scraper.Scraper(api_key)[source]#
Bases: object
Scraper class for fetching and processing abstracts from URLs.
- client#
The OpenAI client.
- Type:
OpenAI
- options#
The Selenium options.
- Type:
Options
- service#
The Selenium service.
- Type:
Service
- setup_chain(output_list: list[str]) -> dict[str, Any] | None: Set up and run the chain.
- get_abstract(url: str, return_raw_output: bool | None = False) -> tuple[str | None, dict[str, Any] | None]: Fetch and process the abstract from a given URL.
- __init__(api_key)[source]#
Initialize the Scraper with API key and logger.
- Parameters:
api_key (str) – The OpenAI API key.
- _setup_selenium_options()[source]#
Set up Selenium Firefox options.
- Returns:
The Selenium options.
- Return type:
Options
- get_abstract(url, return_raw_output=False)[source]#
Fetches and processes the abstract from a given URL.
This function uses Selenium to fetch the content of the provided URL in headless mode. It then parses the HTML content using BeautifulSoup and attempts to find the abstract in common HTML tags such as <meta>, <p>, <div>, <article>, and <span>. The collected content is processed using a chain of prompts managed by the ChainManager.
- Parameters:
url (str) – The URL of the web page to fetch the abstract from.
- Returns:
A tuple (abstract, extra_context) holding the processed abstract and additional context; each element is None if no abstract is found or an error occurs.
- Return type:
tuple[str | None, dict[str, Any] | None]
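The tag-scanning step can be approximated without Selenium or BeautifulSoup. The sketch below swaps in the standard library's `html.parser` to collect candidate text from `<meta>` content attributes and from `<p>`, `<div>`, `<article>`, and `<span>` elements. It is a simplified illustration with hypothetical names; it skips page loading and the prompt chain entirely.

```python
from html.parser import HTMLParser

CANDIDATE_TAGS = {"p", "div", "article", "span"}


class CandidateTextCollector(HTMLParser):
    """Collect text from <meta content=...> and from p/div/article/span elements."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._depth = 0  # how many candidate tags are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            content = dict(attrs).get("content")
            if content:
                self.chunks.append(content.strip())
        elif tag in CANDIDATE_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in CANDIDATE_TAGS and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Only keep text that sits inside at least one candidate tag.
        if text and self._depth > 0:
            self.chunks.append(text)


def collect_candidate_text(html: str) -> list[str]:
    parser = CandidateTextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


SAMPLE_HTML = """
<html><head>
<title>Example paper</title>
<meta name="description" content="We study sample widgets.">
</head><body>
<nav>Home</nav>
<div class="abstract"><p>Abstract: widgets are useful.</p></div>
</body></html>
"""
chunks = collect_candidate_text(SAMPLE_HTML)
```

In the documented method, a list like `chunks` would then be handed to the prompt chain, which decides which text, if any, is the abstract.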