data_collection package#

Submodules#

data_collection.CrossrefWrapper module#

class academic_metrics.data_collection.CrossrefWrapper.CrossrefWrapper(*, scraper, base_url='https://api.crossref.org/works', affiliation='Salisbury%20University', from_year=2017, to_year=2024, from_month=1, to_month=12, test_run=False, run_scraper=True)[source]#

Bases: object

A wrapper class for interacting with the Crossref API to fetch and process publication data.

base_url#

The base URL for the Crossref API.

Type:

str

affiliation#

The affiliation to filter publications by.

Type:

str

from_year#

The starting year for the publication search.

Type:

int

to_year#

The ending year for the publication search.

Type:

int

logger#

Logger for logging messages.

Type:

logging.Logger

MAX_CONCURRENT_REQUESTS#

Maximum number of concurrent requests allowed.

Type:

int

semaphore#

Semaphore to control the rate of concurrent requests.

Type:

asyncio.Semaphore

years#

List of years to fetch data for.

Type:

list

data#

Data fetched from the Crossref API.

Type:

dict

fetch_data(session: aiohttp.ClientSession, url: str, headers: dict[str, Any], retries: int, retry_delay: int) -> dict[str, Any] | None

Fetches data from the given URL using aiohttp.

build_request_url(base_url: str, affiliation: str, from_date: str, to_date: str, n_element: str, sort_type: str, sort_ord: str, cursor: str, has_abstract: bool | None = False) -> str

Builds the request URL for the Crossref API.

process_items(data: dict[str, Any], from_date: str, to_date: str, affiliation: str | None = "salisbury univ") -> list[dict[str, Any]]

Processes the items fetched from the Crossref API, filtering by date and affiliation.

_get_last_day_of_month(year: int, month: int) -> int

Returns the last day of the given month in the given year.

fetch_data_for_multiple_years() -> list[dict[str, Any]][source]#

Fetches data for multiple years asynchronously.

serialize_to_json(output_file: str) -> None

Serializes the fetched data to a JSON file.

final_data_process() -> Self[source]#

Processes the final data, filling in missing abstracts.

get_result_list() -> list[dict[str, Any]][source]#

Get the result list.

run_all_process(save_offline: bool = False) -> Union[None, List[Dict[str, Any]]]

Run all data fetching and processing.

__init__(*, scraper, base_url='https://api.crossref.org/works', affiliation='Salisbury%20University', from_year=2017, to_year=2024, from_month=1, to_month=12, test_run=False, run_scraper=True)[source]#

Initializes the CrossrefWrapper with the given parameters.

Parameters:
  • base_url (str) – The base URL for the Crossref API.

  • affiliation (str) – The affiliation to filter publications by.

  • from_year (int) – The starting year for the publication search.

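  • to_year (int) – The ending year for the publication search.

  • from_month (int) – The starting month for the publication search.

  • to_month (int) – The ending month for the publication search.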

  • logger (logging.Logger, optional) – Logger for logging messages. Defaults to None.

async fetch_data(session, url, headers, retries, retry_delay)[source]#

Fetches data from the given URL using aiohttp.

Parameters:
  • session (aiohttp.ClientSession) – The aiohttp session to use for the request.

  • url (str) – The URL to fetch data from.

  • headers (dict) – Headers to include in the request.

  • retries (int) – Number of retries in case of failure.

  • retry_delay (int) – Delay between retries in seconds.

Returns:

The JSON data fetched from the URL, or None if an error occurs.

Return type:

dict[str, Any] | None
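
The retry behaviour described for fetch_data could be sketched roughly as follows. This is an illustration under stated assumptions, not the package's implementation: `fetch_with_retries` and `get_json` are hypothetical names, and `get_json` stands in for the aiohttp request so that the example runs with only the standard library. The fixed delay between attempts is also an assumption.

```python
from __future__ import annotations

import asyncio
from typing import Any, Awaitable, Callable

# Hypothetical cap on in-flight requests, mirroring MAX_CONCURRENT_REQUESTS.
MAX_CONCURRENT_REQUESTS = 2


async def fetch_with_retries(
    get_json: Callable[[str], Awaitable[dict[str, Any]]],
    url: str,
    semaphore: asyncio.Semaphore,
    retries: int = 5,
    retry_delay: float = 3,
) -> dict[str, Any] | None:
    """Call get_json(url) up to `retries` times; return None if every attempt fails."""
    for attempt in range(retries):
        try:
            # The semaphore bounds how many requests run at the same time.
            async with semaphore:
                return await get_json(url)
        except Exception:
            # Assumed policy: wait a fixed delay before the next attempt.
            if attempt < retries - 1:
                await asyncio.sleep(retry_delay)
    return None


async def _demo() -> dict[str, Any] | None:
    calls = {"n": 0}

    async def flaky(url: str) -> dict[str, Any]:
        # Stand-in for an HTTP request: fails twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("temporary failure")
        return {"status": "ok", "url": url}

    semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
    return await fetch_with_retries(
        flaky, "https://api.crossref.org/works", semaphore, retries=5, retry_delay=0
    )


result = asyncio.run(_demo())
```

In real use, `get_json` would wrap the aiohttp session call and JSON decoding; catching a narrower exception type than `Exception` is usually preferable.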

build_request_url(base_url, affiliation, from_date, to_date, n_element, sort_type, sort_ord, cursor, has_abstract=False)[source]#

Builds the request URL for the Crossref API.

Parameters:
  • base_url (str) – The base URL for the Crossref API.

  • affiliation (str) – The affiliation to filter publications by.

  • from_date (str) – The starting date for the publication search.

  • to_date (str) – The ending date for the publication search.

  • n_element (str) – Number of elements to fetch per request.

  • sort_type (str) – The type of sorting to apply.

  • sort_ord (str) – The order of sorting (asc or desc).

  • cursor (str) – The cursor for pagination.

  • has_abstract (bool, optional) – Whether to filter for publications with abstracts. Defaults to False.

Returns:

The constructed request URL.

Return type:

str
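
As a rough illustration of what such a URL might look like, the sketch below assembles a query from the documented parameters. The function name `sketch_build_request_url` is hypothetical, and the mapping onto the Crossref query parameters (`query.affiliation`, `filter`, `sort`, `order`, `rows`, `cursor`) is an assumption about how the method works, not something this page confirms. The affiliation is assumed to be URL-encoded already, as in the class default.

```python
def sketch_build_request_url(
    base_url: str,
    affiliation: str,
    from_date: str,
    to_date: str,
    n_element: str,
    sort_type: str,
    sort_ord: str,
    cursor: str,
    has_abstract: bool = False,
) -> str:
    """Assemble a Crossref /works query URL from already URL-encoded parts."""
    # Date bounds are expressed as Crossref filters.
    filters = [f"from-pub-date:{from_date}", f"until-pub-date:{to_date}"]
    if has_abstract:
        # Assumed: restrict results to records that carry an abstract.
        filters.append("has-abstract:true")
    params = [
        f"query.affiliation={affiliation}",
        f"filter={','.join(filters)}",
        f"sort={sort_type}",
        f"order={sort_ord}",
        f"rows={n_element}",
        f"cursor={cursor}",
    ]
    return f"{base_url}?{'&'.join(params)}"


url = sketch_build_request_url(
    "https://api.crossref.org/works",
    "Salisbury%20University",
    "2018-01-01",
    "2024-10-09",
    "1000",
    "relevance",
    "desc",
    "*",
)
```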

process_items(data, from_date, to_date, affiliation='salisbury univ')[source]#

Processes the items fetched from the Crossref API, filtering by date and affiliation.

Parameters:
  • data (dict) – The data fetched from the Crossref API.

  • from_date (str) – The starting date for the publication search.

  • to_date (str) – The ending date for the publication search.

  • affiliation (str, optional) – The affiliation to filter publications by. Defaults to “salisbury univ”.

Returns:

The filtered list of items.

Return type:

list
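
The date and affiliation filtering could look something like the sketch below. It assumes that `data` is a raw Crossref response with items under `message.items`, that publication dates come from the `published` field's `date-parts`, and that the affiliation match is a case-insensitive substring test. The function names and the sample records are made up for illustration.

```python
from __future__ import annotations

from datetime import date
from typing import Any


def _published_date(item: dict[str, Any]) -> date | None:
    """Read Crossref date-parts ([[year, month, day]]), padding missing parts with 1."""
    parts = item.get("published", {}).get("date-parts", [[]])[0]
    if not parts or parts[0] is None:
        return None
    year, month, day = (list(parts) + [1, 1])[:3]
    return date(year, month, day)


def sketch_process_items(
    data: dict[str, Any],
    from_date: str,
    to_date: str,
    affiliation: str = "salisbury univ",
) -> list[dict[str, Any]]:
    """Keep items published within [from_date, to_date] that list the affiliation."""
    start = date.fromisoformat(from_date)
    end = date.fromisoformat(to_date)
    kept = []
    for item in data.get("message", {}).get("items", []):
        published = _published_date(item)
        if published is None or not (start <= published <= end):
            continue
        # Collect every affiliation name across all authors, lowercased.
        names = [
            aff.get("name", "").lower()
            for author in item.get("author", [])
            for aff in author.get("affiliation", [])
        ]
        if any(affiliation in name for name in names):
            kept.append(item)
    return kept


# Made-up sample response with three records.
sample = {"message": {"items": [
    {"DOI": "10.1/a", "published": {"date-parts": [[2020, 3, 15]]},
     "author": [{"affiliation": [{"name": "Salisbury University, MD"}]}]},
    {"DOI": "10.1/b", "published": {"date-parts": [[2016]]},
     "author": [{"affiliation": [{"name": "Salisbury University"}]}]},
    {"DOI": "10.1/c", "published": {"date-parts": [[2021, 6]]},
     "author": [{"affiliation": [{"name": "Other College"}]}]},
]}}

kept_dois = [item["DOI"] for item in sketch_process_items(sample, "2018-01-01", "2024-10-09")]
```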

async acollect_yrange(session, from_date='2018-01-01', to_date='2024-10-09', n_element='1000', sort_type='relevance', sort_ord='desc', cursor='*', retries=5, retry_delay=3)[source]#

Collects data for a range of years asynchronously.

Parameters:
  • session (aiohttp.ClientSession) – The aiohttp session to use for the request.

  • from_date (str) – The starting date for the publication search.

  • to_date (str) – The ending date for the publication search.

  • n_element (str) – Number of elements to fetch per request.

  • sort_type (str) – The type of sorting to apply.

  • sort_ord (str) – The order of sorting (asc or desc).

  • cursor (str) – The cursor for pagination.

  • retries (int) – Number of retries in case of failure.

  • retry_delay (int) – Delay between retries in seconds.

Returns:

A tuple containing the list of items and the next cursor.

Return type:

tuple
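
Cursor-based pagination of this kind can be sketched as a loop that follows the cursor returned with each page until a page comes back empty. The sketch assumes the Crossref response layout (`message.items` and `message.next-cursor`). `collect_all_pages` and `fetch_page` are hypothetical names, and `fetch_page` stands in for the real HTTP call so the example needs no network access.

```python
from __future__ import annotations

import asyncio
from typing import Any, Awaitable, Callable


async def collect_all_pages(
    fetch_page: Callable[[str], Awaitable[dict[str, Any] | None]],
    cursor: str = "*",
) -> list[dict[str, Any]]:
    """Follow next-cursor values until a page is missing or has no items."""
    items: list[dict[str, Any]] = []
    while True:
        page = await fetch_page(cursor)
        if page is None:
            break  # The request failed after all retries.
        message = page.get("message", {})
        batch = message.get("items", [])
        if not batch:
            break  # An empty page marks the end of the result set.
        items.extend(batch)
        next_cursor = message.get("next-cursor")
        if not next_cursor:
            break
        cursor = next_cursor
    return items


# Made-up pages keyed by cursor, standing in for API responses.
FAKE_PAGES = {
    "*": {"message": {"items": [{"id": 1}, {"id": 2}], "next-cursor": "c1"}},
    "c1": {"message": {"items": [{"id": 3}], "next-cursor": "c2"}},
    "c2": {"message": {"items": [], "next-cursor": "c2"}},
}


async def fake_fetch_page(cursor: str) -> dict[str, Any] | None:
    return FAKE_PAGES.get(cursor)


all_ids = [item["id"] for item in asyncio.run(collect_all_pages(fake_fetch_page))]
```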

_get_last_day_of_month(year, month)[source]#

Returns the last day of the given month in the given year. Handles leap years for February.

Parameters:
  • year (int) – The year to check.

  • month (int) – The month to check.

Returns:

The last day of the given month in the given year.

Return type:

int
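
One way to compute this with the standard library is `calendar.monthrange`, which already accounts for leap years. The helper name below is hypothetical, and the package's own method may be implemented differently.

```python
import calendar


def last_day_of_month(year: int, month: int) -> int:
    """Return the number of days in the given month of the given year."""
    # monthrange returns (weekday of the first day, number of days in the month).
    return calendar.monthrange(year, month)[1]


leap_feb = last_day_of_month(2024, 2)    # 29
plain_feb = last_day_of_month(2023, 2)   # 28
december = last_day_of_month(2024, 12)   # 31
```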

async fetch_data_for_multiple_years()[source]#

Fetches data for multiple years asynchronously.

Returns:

The list of items fetched from the Crossref API.

Return type:

list[dict[str, Any]]

run_afetch_yrange()[source]#

Runs the asynchronous data fetch for multiple years.

Returns:

The instance of the class for method chaining.

Return type:

Self

serialize_to_json(output_file)[source]#

Serializes the fetched data to a JSON file.

Parameters:

output_file (str) – The path to the output JSON file.

Return type:

None

final_data_process()[source]#

Processes the final data, filling in missing abstracts.

Returns:

The instance of the class for method chaining.

Return type:

Self

get_result_list()[source]#

Get the result list.

Returns:

The result list (self.result).

Return type:

list[dict[str, Any]]

run_all_process(save_offline=False)[source]#

Run all data fetching and processing

Parameters:

save_offline (bool) – Whether to save the offline data.

Returns:

The result list or None

Return type:

Union[None, List[Dict[str, Any]]]
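
Putting the documented pieces together, a typical call sequence might look like the sketch below. It uses only the constructors and methods listed on this page. The helper name `collect_publications` and the year range are illustrative, and the commented-out chain is an assumption about what run_all_process does internally. Running it requires the academic_metrics package, network access, and a valid OpenAI API key.

```python
from __future__ import annotations

from typing import Any


def collect_publications(api_key: str) -> list[dict[str, Any]] | None:
    """Fetch and process publications with the classes documented on this page."""
    # Imported here so that defining this helper does not require the package.
    from academic_metrics.data_collection.CrossrefWrapper import CrossrefWrapper
    from academic_metrics.data_collection.scraper import Scraper

    scraper = Scraper(api_key=api_key)
    # All CrossrefWrapper arguments are keyword-only.
    wrapper = CrossrefWrapper(scraper=scraper, from_year=2022, to_year=2023)

    # One-call form.
    return wrapper.run_all_process(save_offline=False)
    # Assumed equivalent, using the chainable methods:
    # return wrapper.run_afetch_yrange().final_data_process().get_result_list()
```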

data_collection.scraper module#

class academic_metrics.data_collection.scraper.CleanerOutput(**data)[source]#

Bases: BaseModel

Pydantic model for the cleaner output.

page_content#

The page content.

Type:

str

extra_context#

The extra context.

Type:

Dict[str, Any]

page_content: str#
extra_context: Dict[str, Any]#
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'extra_context': FieldInfo(annotation=Dict[str, Any], required=True), 'page_content': FieldInfo(annotation=str, required=True)}#

Metadata about the fields defined on the model, mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

class academic_metrics.data_collection.scraper.Scraper(api_key)[source]#

Bases: object

Scraper class for fetching and processing abstracts from URLs.

api_key#

The OpenAI API key.

Type:

str

client#

The OpenAI client.

Type:

OpenAI

options#

The Selenium options.

Type:

Options

service#

The Selenium service.

Type:

Service

raw_results#

The raw results.

Type:

list[dict[str, Any]]

_setup_selenium_options()[source]#

Set up Selenium Firefox options.

setup_chain(output_list: list[str]) -> dict[str, Any] | None

Set up and run the chain.

get_abstract(url: str, return_raw_output: bool | None = False) -> tuple[str | None, dict[str, Any] | None]

Fetch and process the abstract from a given URL.

save_raw_results()[source]#

Save the raw results to a JSON file.

__init__(api_key)[source]#

Initialize the Scraper with API key and logger.

Parameters:

api_key (str) – The OpenAI API key.

_setup_selenium_options()[source]#

Set up Selenium Firefox options.

Returns:

The Selenium options.

Return type:

Options

setup_chain(output_list)[source]#

Set up and run the chain.

Parameters:

output_list (List[str]) – The output list.

Returns:

The result of the chain.

Return type:

Dict[str, Any] | None

get_abstract(url, return_raw_output=False)[source]#

Fetches and processes the abstract from a given URL.

This function uses Selenium to fetch the content of the provided URL in headless mode. It then parses the HTML content using BeautifulSoup and attempts to find the abstract in common HTML tags such as <meta>, <p>, <div>, <article>, and <span>. The collected content is processed using a chain of prompts managed by the ChainManager.

Parameters:
  • url (str) – The URL of the web page to fetch the abstract from.

  • return_raw_output (bool, optional) – Whether to return the raw output. Defaults to False.

Returns:

The processed abstract and additional context, or None if no abstract is found or an error occurs.

Return type:

tuple[str | None, dict[str, Any] | None]
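
The tag-scanning step can be illustrated with a small, dependency-free sketch. It deliberately swaps Selenium and BeautifulSoup for the standard library's `html.parser` and works on a hard-coded HTML string, so it shows only the idea of collecting candidate text from `<meta>` and `<p>` tags. The class name, the list of meta names, and the sample page are assumptions, and the later prompt-chain processing is not covered.

```python
from html.parser import HTMLParser


class AbstractCandidateParser(HTMLParser):
    """Collect text from descriptive <meta> tags and from <p> elements."""

    # Assumed set of meta names that commonly carry an abstract or summary.
    META_NAMES = {"description", "citation_abstract", "dc.description"}

    def __init__(self) -> None:
        super().__init__()
        self.candidates = []
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta":
            name = (attr_map.get("name") or "").lower()
            content = attr_map.get("content")
            if content and name in self.META_NAMES:
                self.candidates.append(content.strip())
        elif tag == "p":
            # Start buffering the text of this paragraph.
            self._in_paragraph = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            text = "".join(self._buffer).strip()
            if text:
                self.candidates.append(text)
            self._in_paragraph = False


# Made-up page content for illustration.
SAMPLE_HTML = """
<html><head>
<meta name="description" content="We study widget growth.">
</head><body>
<p>Abstract: Widgets grow quickly.</p>
<div>Footer</div>
</body></html>
"""

parser = AbstractCandidateParser()
parser.feed(SAMPLE_HTML)
candidates = parser.candidates
```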

save_raw_results()[source]#

Save the raw results to a JSON file.

Module contents#