core package#

Submodules#

core.category_processor module#

class academic_metrics.core.category_processor.CategoryProcessor(utils, dataclass_factory, warning_manager, taxonomy_util, log_to_console=True)[source]#

Bases: object

Processes and organizes academic publication data by categories.

This class handles the processing of classified publication data, organizing it into categories and generating various statistics. It manages faculty affiliations, article details, and category relationships.

Parameters:

See __init__ below: utils, dataclass_factory, warning_manager, taxonomy_util, and log_to_console.

utils#

Utility functions for data processing.

Type:

Utilities

warning_manager#

System for handling and logging warnings.

Type:

WarningManager

dataclass_factory#

Factory for creating data model instances.

Type:

DataClassFactory

taxonomy_util#

Utility for managing publication taxonomy.

Type:

Taxonomy

category_data#

Mapping of categories to their information.

Type:

Dict[str, CategoryInfo]

faculty_stats#

Faculty statistics by category.

Type:

Dict[str, FacultyStats]

global_faculty_stats#

Global faculty statistics.

Type:

Dict[str, GlobalFacultyStats]

category_article_stats#

Article statistics by category.

Type:

Dict[str, CrossrefArticleStats]

articles#

List of processed article details.

Type:

List[CrossrefArticleDetails]

logger#

Logger instance for this class.

Type:

logging.Logger

log_file_path#

Path to the log file.

Type:

str

process_data_list()[source]#

Process a list of publication data items

get_category_data()[source]#

Get processed category data

get_category_article_stats()[source]#

Get article statistics by category

get_articles()[source]#

Get list of processed articles

get_faculty_stats()[source]#

Get faculty statistics by category

get_global_faculty_stats()[source]#

Get global faculty statistics

call_get_attributes()[source]#

Extract attributes from raw data

update_category_stats()[source]#

Update statistics for a category

update_faculty_stats()[source]#

Update faculty statistics

update_global_faculty_stats()[source]#

Update global faculty statistics

update_category_article_stats()[source]#

Update article statistics by category

create_article_object()[source]#

Create a new article object

clean_faculty_affiliations()[source]#

Clean faculty affiliation data

clean_faculty_members()[source]#

Clean faculty member data

initialize_categories()[source]#

Initialize category data structures

_collect_all_affiliations()[source]#

Collect all faculty affiliations

_generate_url()[source]#

Generate URL from string

_generate_normal_id()[source]#

Generate normalized ID from strings

__init__(utils, dataclass_factory, warning_manager, taxonomy_util, log_to_console=True)[source]#

Initialize the CategoryProcessor with required dependencies.

Sets up logging configuration and initializes all required components for processing publication data, including utilities, factories, and data structures for storing category, faculty, and article information.

Parameters:
  • utils (Utilities) – Utility functions for data processing. Type: academic_metrics.core.utilities.Utilities

  • dataclass_factory (DataClassFactory) – Factory for creating data model instances. Type: academic_metrics.core.data_class_factory.DataClassFactory

  • warning_manager (WarningManager) – System for handling and logging warnings. Type: academic_metrics.core.warning_manager.WarningManager

  • taxonomy_util (Taxonomy) – Utility for managing publication taxonomy. Type: academic_metrics.core.taxonomy.Taxonomy

  • log_to_console (bool | None) – Whether to log output to console. Type: bool | None. Defaults to LOG_TO_CONSOLE.

Raises:
  • ValueError – If required dependencies are not properly initialized

  • IOError – If log file cannot be created or accessed

Notes

Initializes the following data structures:

  • category_data: Dictionary mapping categories to their information

  • faculty_stats: Dictionary tracking faculty statistics by category

  • global_faculty_stats: Dictionary tracking global faculty statistics

  • category_article_stats: Dictionary tracking article stats per category

  • articles: List of CrossrefArticleDetails objects for ground truth data
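As a rough illustration of the state listed above, the sketch below models the five containers with plain Python types. ProcessorState is a hypothetical stand-in and not part of academic_metrics; the real class stores CategoryInfo, FacultyStats, and related model objects rather than Any.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ProcessorState:
    """Hypothetical stand-in for the containers CategoryProcessor initializes."""

    category_data: Dict[str, Any] = field(default_factory=dict)
    faculty_stats: Dict[str, Any] = field(default_factory=dict)
    global_faculty_stats: Dict[str, Any] = field(default_factory=dict)
    category_article_stats: Dict[str, Any] = field(default_factory=dict)
    articles: List[Any] = field(default_factory=list)


# Each instance gets its own empty containers (no shared mutable defaults).
state = ProcessorState()
```

Using default_factory keeps two instances from sharing the same dictionary or list, which matters if more than one processor is alive at a time.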

process_data_list(data)[source]#

Process a list of publication data items.

Takes raw publication data and processes each item through several stages:

  1. Extracts base attributes

  2. Initializes category information

  3. Generates URL maps for categories

  4. Cleans faculty and affiliation data

  5. Updates various statistics (category, faculty, article)

  6. Creates article objects

Parameters:

data (List[Dict]) – List of raw publication data dictionaries to process. Type: List[Dict[str, Any]]

Raises:
  • ValueError – If required attributes are missing from data

  • Exception – If category information cannot be initialized

Return type:

None

Notes

  • Processes each publication through all stages sequentially

  • Updates multiple data structures during processing

  • Maintains relationships between categories, faculty, and articles

  • Performs data cleaning and normalization
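The sequential, per-item staging described above can be pictured with a small generic pipeline. This is only a sketch of the pattern: run_stages, extract_title, and count_authors are hypothetical names, and the actual stages in CategoryProcessor are the methods documented on this page.

```python
from typing import Any, Callable, Dict, List

# Each stage takes the working dict for one publication and returns it updated.
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_stages(items: List[Dict[str, Any]], stages: List[Stage]) -> List[Dict[str, Any]]:
    """Apply every stage to every item, in order (illustrative only)."""
    results = []
    for item in items:
        working = dict(item)  # shallow copy so the raw input dict is not mutated
        for stage in stages:
            working = stage(working)
        results.append(working)
    return results


def extract_title(item: Dict[str, Any]) -> Dict[str, Any]:
    item["title"] = item.get("title", "").strip()
    return item


def count_authors(item: Dict[str, Any]) -> Dict[str, Any]:
    item["author_count"] = len(item.get("faculty_members", []))
    return item


processed = run_stages(
    [{"title": "  Graph Theory  ", "faculty_members": ["A. Doe", "B. Roe"]}],
    [extract_title, count_authors],
)
```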

_test_category_processor(raw_attributes)[source]#

Test method for validating category processing functionality.

This private method is used for testing the category processor’s ability to handle raw attribute data and properly process it through the category system.

Parameters:

raw_attributes (Dict[str, Any]) – Dictionary of raw attributes to test processing. Type: Dict[str, Any]

Return type:

None

Notes

  • Used for internal testing purposes only

  • Validates category processing pipeline

  • Does not modify production data

  • Helps ensure data integrity

call_get_attributes(*, data)[source]#

Extract and process attributes from raw publication data.

Extracts various attributes including categories, authors, departments, titles, citations, abstracts, licenses, publication dates, journal info, URLs, DOIs, and themes from the raw data.

Parameters:

data (Dict[str, Any]) – Raw publication data dictionary. Type: Dict[str, Any]

Returns:

Dictionary containing extracted and processed attributes.

Type: Dict[str, Any]

Contains:

  • categories (List[str]): List of publication categories

  • faculty_members (List[str]): List of faculty authors

  • faculty_affiliations (Dict[str, str]): Faculty to department mapping

  • title (str): Publication title

  • tc_count (int): Citation count

  • abstract (str): Publication abstract

  • license_url (str): License URL

  • date_published_print (str): Print publication date

  • date_published_online (str): Online publication date

  • journal (str): Journal name

  • download_url (str): Download URL

  • doi (str): Digital Object Identifier

  • themes (List[str]): List of publication themes

Return type:

Dict[str, Any]

Raises:

Exception – If no category is found in the data

Notes

  • Extracts all available attributes from raw data

  • Performs basic validation of required fields

  • Handles missing optional fields gracefully

  • Maintains data types for each attribute
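A minimal sketch of the "required category, optional everything else" behavior follows. The input key names and the defaults are assumptions made for illustration; extract_attributes is a hypothetical function, not the library's implementation, and it covers only a subset of the documented fields.

```python
from typing import Any, Dict


def extract_attributes(data: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical extraction: categories are required, other fields get defaults."""
    categories = data.get("categories")
    if not categories:
        # Mirrors the documented behavior: a record without a category is an error.
        raise Exception("No category found in data")
    return {
        "categories": categories,
        "faculty_members": data.get("faculty_members", []),
        "title": data.get("title", ""),
        "tc_count": int(data.get("tc_count", 0)),  # keep the citation count an int
        "doi": data.get("doi", ""),
        "themes": data.get("themes", []),
    }


attrs = extract_attributes(
    {"categories": ["Mathematics"], "title": "On Sets", "tc_count": "3"}
)
```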

update_category_stats(**kwargs)[source]#

Update statistics for each category based on processed article data.

Updates category information including faculty members, departments, titles, citation counts, DOIs, and themes. Also calculates derived statistics like faculty count, department count, article count, and citation averages.

Parameters:

**kwargs

Keyword arguments containing article data. Required arguments:

  • title (str): Article title

    Type: str

  • doi (str): Digital Object Identifier

    Type: str

  • tc_count (int): Citation count

    Type: int

  • faculty_members (list): List of faculty authors

    Type: List[str]

  • all_affiliations (set): Set of department affiliations

    Type: Set[str]

  • themes (list): List of article themes

    Type: List[str]

  • all_categories (list): List of all categories

    Type: List[str]

  • url_maps (dict): Category URL mappings

    Type: Dict[str, Dict[str, str]]

Raises:
  • KeyError – If required kwargs are missing

  • ValueError – If category information cannot be updated

Return type:

None

Notes

  • Updates multiple statistics per category

  • Calculates derived metrics from raw data

  • Maintains relationships between entities

  • Handles missing optional data gracefully

  • Updates both raw counts and computed averages
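The relationship between the raw values and the derived statistics (faculty count, department count, article count, citation average) can be sketched as follows. CategoryTotals is a hypothetical stand-in for CategoryInfo, and its field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class CategoryTotals:
    """Hypothetical stand-in for CategoryInfo; field names are assumptions."""

    faculty: Set[str] = field(default_factory=set)
    departments: Set[str] = field(default_factory=set)
    article_count: int = 0
    tc_count: int = 0

    def add_article(self, tc_count: int, faculty_members: Iterable[str],
                    all_affiliations: Iterable[str]) -> None:
        # Sets absorb repeated names, so the counts below stay distinct counts.
        self.faculty.update(faculty_members)
        self.departments.update(all_affiliations)
        self.article_count += 1
        self.tc_count += tc_count

    @property
    def faculty_count(self) -> int:
        return len(self.faculty)

    @property
    def department_count(self) -> int:
        return len(self.departments)

    @property
    def citation_average(self) -> float:
        # Guard against division by zero for a category with no articles.
        return self.tc_count / self.article_count if self.article_count else 0.0


totals = CategoryTotals()
totals.add_article(4, ["A. Doe"], {"Math"})
totals.add_article(2, ["A. Doe", "B. Roe"], {"Math", "Physics"})
```

Computing the counts and the average from the stored values on demand means they cannot drift out of sync with the raw data.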

update_faculty_stats(**kwargs)[source]#

Update faculty statistics for each category.

Updates faculty member information including department affiliations, publication titles, DOIs, citation counts, and article counts. Creates or updates faculty statistics entries for each category.

Parameters:

**kwargs

Keyword arguments containing faculty and article data. Required arguments:

  • faculty_members (List): List of faculty authors

    Type: List[str]

  • faculty_affiliations (Dict): Faculty department mappings

    Type: Dict[str, List[str]]

  • title (str): Article title

    Type: str

  • doi (str): Digital Object Identifier

    Type: str

  • tc_count (int): Citation count

    Type: int

  • all_categories (List): List of all categories

    Type: List[str]

  • url_maps (Dict): Category URL mappings

    Type: Dict[str, Dict[str, str]]

Raises:
  • KeyError – If required kwargs are missing

  • ValueError – If faculty statistics cannot be updated

Return type:

None

Notes

  • Updates statistics for each faculty member

  • Maintains faculty-department relationships

  • Tracks publication metrics per faculty

  • Handles multiple department affiliations

  • Updates both individual and aggregate statistics
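One way to picture the per-category bookkeeping is a nested mapping from category to faculty member to running totals. The sketch below uses plain dictionaries instead of FacultyStats objects; record_article and the entry keys are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def new_entry() -> dict:
    return {
        "titles": [],
        "dois": [],
        "total_citations": 0,
        "article_count": 0,
        "departments": set(),
    }


# category -> faculty member -> running totals
faculty_stats = defaultdict(lambda: defaultdict(new_entry))


def record_article(all_categories: List[str], faculty_members: List[str],
                   faculty_affiliations: Dict[str, List[str]],
                   title: str, doi: str, tc_count: int) -> None:
    for category in all_categories:
        for member in faculty_members:
            entry = faculty_stats[category][member]  # created on first access
            entry["titles"].append(title)
            entry["dois"].append(doi)
            entry["total_citations"] += tc_count
            entry["article_count"] += 1
            # A member may belong to several departments.
            entry["departments"].update(faculty_affiliations.get(member, []))


record_article(
    all_categories=["Mathematics", "Algebra"],
    faculty_members=["A. Doe"],
    faculty_affiliations={"A. Doe": ["Math", "Statistics"]},
    title="On Rings",
    doi="10.1000/rings",
    tc_count=5,
)
```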

update_global_faculty_stats(**kwargs)[source]#

Update global statistics for each faculty member.

Creates or updates global faculty statistics including total citations, article counts, department affiliations, DOIs, titles, categories, and category URLs across all publication categories.

Parameters:

**kwargs

Keyword arguments containing faculty and article data. Required arguments:

  • faculty_members (List): List of faculty authors

    Type: List[str]

  • faculty_affiliations (Dict): Faculty department mappings

    Type: Dict[str, List[str]]

  • title (str): Article title

    Type: str

  • doi (str): Digital Object Identifier

    Type: str

  • tc_count (int): Citation count

    Type: int

  • all_categories (List): List of all categories

    Type: List[str]

  • top_level_categories (List): Top-level categories

    Type: List[str]

  • mid_level_categories (List): Mid-level categories

    Type: List[str]

  • low_level_categories (List): Low-level categories

    Type: List[str]

  • url_maps (Dict): Category URL mappings

    Type: Dict[str, Dict[str, str]]

  • themes (List): Article themes

    Type: List[str]

  • journal (str): Journal name

    Type: str

Raises:
  • KeyError – If required kwargs are missing

  • ValueError – If global faculty statistics cannot be updated

Return type:

None

Notes

  • Updates global metrics for each faculty member

  • Tracks statistics across all categories

  • Maintains hierarchical category relationships

  • Handles multiple department affiliations

  • Aggregates publication metrics globally

update_category_article_stats(**kwargs)[source]#

Update article statistics for each category.

Creates or updates article statistics including titles, citations, faculty members, affiliations, abstracts, licenses, publication dates, and URLs. Organizes articles by their category levels (top, mid, low).

Parameters:

**kwargs

Keyword arguments containing article data. Required arguments:

  • title (str): Article title

    Type: str

  • doi (str): Digital Object Identifier

    Type: str

  • tc_count (int): Citation count

    Type: int

  • faculty_members (List): List of faculty authors

    Type: List[str]

  • faculty_affiliations (Dict): Faculty department mappings

    Type: Dict[str, List[str]]

  • abstract (str): Article abstract

    Type: str

  • license_url (str): License URL

    Type: str

  • date_published_print (str): Print publication date

    Type: str

  • date_published_online (str): Online publication date

    Type: str

  • journal (str): Journal name

    Type: str

  • download_url (str): Download URL

    Type: str

  • themes (List): Article themes

    Type: List[str]

  • all_categories (List): List of all categories

    Type: List[str]

  • low_level_categories (List): Low-level categories

    Type: List[str]

  • mid_level_categories (List): Mid-level categories

    Type: List[str]

  • url_maps (Dict): Category URL mappings

    Type: Dict[str, Dict[str, str]]

Raises:
  • KeyError – If required kwargs are missing

  • ValueError – If article statistics cannot be updated

Return type:

None

Notes

  • Updates statistics for each category level

  • Maintains hierarchical relationships

  • Tracks detailed article metadata

  • Links articles to faculty and departments

  • Preserves publication timeline information

create_article_object(**kwargs)[source]#

Create a new article object with complete metadata.

Creates a CrossrefArticleDetails object containing all article information, including category relationships, URLs, and metadata. Handles URL generation for different category levels and maintains category hierarchies.

Parameters:

**kwargs

Keyword arguments containing article data. Required arguments:

  • doi (str): Digital Object Identifier

    Type: str

  • title (str): Article title

    Type: str

  • tc_count (int): Citation count

    Type: int

  • faculty_members (List): Faculty authors

    Type: List[str]

  • faculty_affiliations (Dict): Faculty affiliations

    Type: Dict[str, List[str]]

  • abstract (str): Article abstract

    Type: str

  • license_url (str): License URL

    Type: str

  • date_published_print (str): Print publication date

    Type: str

  • date_published_online (str): Online publication date

    Type: str

  • journal (str): Journal name

    Type: str

  • download_url (str): Download URL

    Type: str

  • themes (List): Article themes

    Type: List[str]

  • all_categories (List): All categories

    Type: List[str]

  • top_level_categories (List): Top-level categories

    Type: List[str]

  • mid_level_categories (List): Mid-level categories

    Type: List[str]

  • low_level_categories (List): Low-level categories

    Type: List[str]

Raises:
  • KeyError – If required kwargs are missing

  • ValueError – If article object cannot be created

Return type:

None

Notes

  • Creates CrossrefArticleDetails instance

  • Generates URLs for all category levels

  • Maintains category hierarchies

  • Preserves all article metadata

  • Links faculty and department relationships

clean_faculty_affiliations(faculty_affiliations)[source]#

Clean and format faculty affiliation data.

Processes raw faculty affiliation mappings to ensure consistent formatting and remove any invalid or malformed data.

Parameters:

faculty_affiliations (Dict) – Raw faculty affiliation mappings. Type: Dict[str, Any]

Returns:

Cleaned faculty affiliation mappings.

Type: Dict[str, Any]

Return type:

Dict

Notes

  • Removes invalid entries

  • Normalizes department names

  • Handles missing or malformed data

  • Maintains faculty-department relationships

clean_faculty_members(faculty_members)[source]#

Clean and filter faculty member names.

Processes raw faculty member names to ensure consistent formatting and remove any invalid or empty entries.

Parameters:

faculty_members (List) – Raw list of faculty member names. Type: List[str]

Returns:

Cleaned list of faculty member names.

Type: List[str]. Excludes empty strings and invalid entries.

Return type:

List

Notes

  • Removes empty strings

  • Normalizes name formats

  • Filters invalid entries

  • Maintains unique entries
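The cleaning rules in the notes above could look like the sketch below. What counts as an "invalid" entry is not specified on this page, so treating non-string values as invalid is an assumption, and clean_names is a hypothetical helper.

```python
from typing import Iterable, List


def clean_names(names: Iterable[str]) -> List[str]:
    """Normalize whitespace, drop blanks and non-strings, keep first occurrences."""
    seen = set()
    cleaned = []
    for name in names:
        if not isinstance(name, str):
            continue  # assumption: non-string values count as invalid
        name = " ".join(name.split())  # trim and collapse runs of whitespace
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned


cleaned = clean_names(["  Ada   Lovelace ", "", "Alan Turing", None, "Ada Lovelace"])
```

A seen-set alongside the output list removes duplicates while preserving the original author order, which a plain set() would lose.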

initialize_categories(categories)[source]#

Initialize category data structures for all category levels.

Creates CategoryInfo instances for each category and organizes them by level in the taxonomy hierarchy (top, mid, low).

Parameters:

categories (Dict) – Categories organized by level. Type: Dict[str, List[str]]. Keys must be: “top”, “mid”, “low”.

Returns:

Organized category data.

Type: Dict[str, List[str]]

Contains:

  • top_level_categories (List[str]): List of top-level categories

  • mid_level_categories (List[str]): List of mid-level categories

  • low_level_categories (List[str]): List of low-level categories

  • all_categories (List[str]): List of all categories

Return type:

Dict

Raises:

ValueError – If category initialization fails

Notes

  • Creates CategoryInfo instances for each category

  • Maintains hierarchical relationships

  • Validates category levels

  • Ensures unique category names

  • Preserves taxonomy structure
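The input and output shapes described for this method can be sketched as below. organize_categories is a hypothetical function that only reshapes the lists; it does not create CategoryInfo instances, and removing duplicates from all_categories is an assumption based on the "unique category names" note.

```python
from typing import Dict, List


def organize_categories(categories: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Validate the three level keys and build the documented result shape."""
    missing = {"top", "mid", "low"} - set(categories)
    if missing:
        raise ValueError(f"Missing category levels: {sorted(missing)}")
    top, mid, low = categories["top"], categories["mid"], categories["low"]
    return {
        "top_level_categories": list(top),
        "mid_level_categories": list(mid),
        "low_level_categories": list(low),
        # dict.fromkeys removes duplicates while keeping first-seen order.
        "all_categories": list(dict.fromkeys([*top, *mid, *low])),
    }


organized = organize_categories(
    {"top": ["Science"], "mid": ["Physics"], "low": ["Optics", "Physics"]}
)
```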

get_category_data()[source]#

Get the processed category data.

Provides access to the complete mapping of categories and their associated information, including statistics and relationships.

Returns:

Mapping of categories to their information.

Type: Dict[str, academic_metrics.models.category_info.CategoryInfo]

Return type:

Dict

Notes

  • Returns complete category hierarchy

  • Includes all category statistics

  • Contains faculty and article relationships

  • Preserves category metadata

get_category_article_stats()[source]#

Get article statistics organized by category.

Provides access to the complete mapping of categories to their associated article statistics, including metrics and metadata.

Returns:

Mapping of categories to their article statistics.

Type: Dict[str, academic_metrics.models.crossref_article_stats.CrossrefArticleStats]

Return type:

Dict

Notes

  • Returns statistics for all categories

  • Includes article counts and metrics

  • Contains citation information

  • Preserves publication metadata

  • Maintains category relationships

get_articles()[source]#

Get the list of processed articles.

Provides access to the complete list of processed articles with their full details and metadata.

Returns:

List of all processed article details.

Type: List[academic_metrics.models.crossref_article_details.CrossrefArticleDetails]

Return type:

List

Notes

  • Returns all processed articles

  • Includes complete article metadata

  • Contains category assignments

  • Preserves faculty relationships

  • Maintains publication details

get_faculty_stats()[source]#

Get faculty statistics organized by category.

Provides access to the complete mapping of categories to their associated faculty statistics, including publication metrics and relationships.

Returns:

Mapping of categories to their faculty statistics.

Type: Dict[str, academic_metrics.models.faculty_stats.FacultyStats]

Return type:

Dict

Notes

  • Returns statistics for all categories

  • Includes faculty publication counts

  • Contains citation metrics

  • Preserves department affiliations

  • Maintains category-specific metrics

get_global_faculty_stats()[source]#

Get global statistics for all faculty members.

Provides access to the complete mapping of faculty members to their global statistics across all categories and publications.

Returns:

Mapping of faculty members to their global statistics.

Type: Dict[str, academic_metrics.models.global_faculty_stats.GlobalFacultyStats]

Return type:

Dict

Notes

  • Returns aggregate statistics per faculty

  • Includes cross-category metrics

  • Contains total publication counts

  • Preserves all department affiliations

  • Maintains complete publication history

static _collect_all_affiliations(faculty_affiliations, logger)[source]#

Collect all unique department affiliations.

Extracts and deduplicates all department affiliations from the faculty to department mapping dictionary.

Parameters:
  • faculty_affiliations (Dict) – Faculty to department mappings. Type: Dict[str, Any]

  • logger (logging.Logger) – Logger instance for tracking operations. Type: logging.Logger

Returns:

Set of unique department affiliations.

Type: Set[str]

Return type:

set

Notes

  • Removes duplicate departments

  • Handles missing affiliations

  • Validates department names

  • Maintains unique entries only
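Flattening a faculty-to-department mapping into a set of unique departments might look like this. Because the values are typed as Any, the sketch accepts either a single string or a collection of strings and skips anything else; that handling is an assumption, and collect_affiliations is a hypothetical name.

```python
from typing import Any, Dict, Set


def collect_affiliations(faculty_affiliations: Dict[str, Any]) -> Set[str]:
    """Gather every non-empty department name into one set."""
    departments: Set[str] = set()
    for value in faculty_affiliations.values():
        if isinstance(value, str):
            value = [value]  # a single department given as a bare string
        elif not isinstance(value, (list, tuple, set)):
            continue  # missing or unexpected value, e.g. None
        departments.update(d for d in value if isinstance(d, str) and d)
    return departments


all_affiliations = collect_affiliations(
    {"A. Doe": ["Math", "Physics"], "B. Roe": "Math", "C. Poe": None}
)
```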

static _generate_url(string, logger=None)[source]#

Generate a URL-safe string.

Converts an input string into a URL-safe format by removing special characters, replacing spaces, and ensuring proper encoding.

Parameters:
  • string (str) – Input string to encode. Type: str

  • logger (logging.Logger | None) – Logger instance to use for logging. Type: logging.Logger | None. Defaults to None.

Returns:

URL-encoded string.

Type: str

Return type:

str

Notes

  • Removes special characters

  • Replaces spaces with hyphens

  • Converts to lowercase

  • Ensures URL-safe encoding
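The four steps in the notes above can be approximated with the standard library. This is a sketch of the general slug technique, not the library's code: slugify is a hypothetical name, and restricting the output to ASCII letters, digits, and hyphens is an assumption.

```python
import re
from urllib.parse import quote


def slugify(text: str) -> str:
    """Lowercase, drop special characters, hyphenate whitespace, then encode."""
    text = text.strip().lower()
    text = re.sub(r"[^a-z0-9\s-]", "", text)  # drop special characters
    text = re.sub(r"[\s-]+", "-", text).strip("-")  # whitespace runs -> one hyphen
    return quote(text)  # leaves this alphabet unchanged; shown for the encoding step


slug = slugify("Computer Science & Engineering")
```

Note that this variant also drops non-ASCII letters, so a name such as "Économie" would lose its first character; a real implementation may choose to transliterate instead.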

static _generate_normal_id(strings, logger=None)[source]#

Generate a normalized ID from a list of strings.

Combines multiple strings into a single normalized identifier, ensuring consistent formatting and URL-safe characters.

Parameters:
  • strings (list) – List of strings to combine into an ID. Type: List[str]

  • logger (logging.Logger | None) – Logger instance to use for logging. Type: logging.Logger | None. Defaults to None.

Returns:

Normalized ID string.

Type: str. Format: lowercase, hyphen-separated.

Return type:

str

Notes

  • Combines multiple strings

  • Converts to lowercase

  • Replaces spaces with hyphens

  • Removes special characters

  • Ensures consistent formatting
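A lowercase, hyphen-separated identifier built from several strings could be produced as follows. normal_id is a hypothetical helper; unlike the notes above, it replaces each run of special characters with a hyphen rather than deleting it, which is a deliberate simplification.

```python
import re
from typing import List


def normal_id(strings: List[str]) -> str:
    """Join the inputs into one lowercase, hyphen-separated identifier."""
    parts = []
    for s in strings:
        # Any run of characters outside a-z and 0-9 becomes a single hyphen.
        s = re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")
        if s:  # skip inputs that normalize to nothing
            parts.append(s)
    return "-".join(parts)


identifier = normal_id(["Jane Doe", "Computer Science", "10.1000/xyz"])
```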

Module contents#