core package#
Submodules#
core.category_processor module#
- class academic_metrics.core.category_processor.CategoryProcessor(utils, dataclass_factory, warning_manager, taxonomy_util, log_to_console=True)[source]#
Bases: object
Processes and organizes academic publication data by categories.
This class handles the processing of classified publication data, organizing it into categories and generating various statistics. It manages faculty affiliations, article details, and category relationships.
- Parameters:
See __init__ for the constructor arguments.
- warning_manager#
System for handling and logging warnings.
- Type:
WarningManager
- dataclass_factory#
Factory for creating data model instances.
- Type:
DataClassFactory
- category_data#
Mapping of categories to their information.
- Type:
Dict[str, CategoryInfo]
- faculty_stats#
Faculty statistics by category.
- Type:
Dict[str, FacultyStats]
- global_faculty_stats#
Global faculty statistics.
- Type:
Dict[str, GlobalFacultyStats]
- category_article_stats#
Article statistics by category.
- Type:
Dict[str, CrossrefArticleStats]
- articles#
List of processed article details.
- Type:
List[CrossrefArticleDetails]
- logger#
Logger instance for this class.
- Type:
logging.Logger
- __init__(utils, dataclass_factory, warning_manager, taxonomy_util, log_to_console=True)[source]#
Initialize the CategoryProcessor with required dependencies.
Sets up logging configuration and initializes all required components for processing publication data, including utilities, factories, and data structures for storing category, faculty, and article information.
- Parameters:
utils (Utilities) – Utility functions for data processing. Type: academic_metrics.core.utilities.Utilities
dataclass_factory (DataClassFactory) – Factory for creating data model instances. Type: academic_metrics.core.data_class_factory.DataClassFactory
warning_manager (WarningManager) – System for handling and logging warnings. Type: academic_metrics.core.warning_manager.WarningManager
taxonomy_util (Taxonomy) – Utility for managing publication taxonomy. Type: academic_metrics.core.taxonomy.Taxonomy
log_to_console (bool | None) – Whether to log output to console. Type: bool | None. Defaults to LOG_TO_CONSOLE.
- Raises:
ValueError – If required dependencies are not properly initialized
IOError – If log file cannot be created or accessed
Notes
Initializes the following data structures:
- category_data: Dictionary mapping categories to their information
- faculty_stats: Dictionary tracking faculty statistics by category
- global_faculty_stats: Dictionary tracking global faculty statistics
- category_article_stats: Dictionary tracking article stats per category
- articles: List of CrossrefArticleDetails objects for ground truth data
- process_data_list(data)[source]#
Process a list of publication data items.
Takes raw publication data and processes each item through several stages:
1. Extracts base attributes
2. Initializes category information
3. Generates URL maps for categories
4. Cleans faculty and affiliation data
5. Updates various statistics (category, faculty, article)
6. Creates article objects
- Parameters:
data (List[Dict]) – List of raw publication data dictionaries to process. Type: List[Dict[str, Any]]
- Raises:
ValueError – If required attributes are missing from data
Exception – If category information cannot be initialized
- Return type:
None
Notes
Processes each publication through all stages sequentially
Updates multiple data structures during processing
Maintains relationships between categories, faculty, and articles
Performs data cleaning and normalization
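The stages listed above can be pictured as a single loop over the input. The following is a minimal sketch under stated assumptions, not the library's implementation: `process_items` is a hypothetical stand-in, each item is assumed to carry a flat `categories` list, a `faculty_members` list, and a `tc_count`, and URL map generation and article object creation (stages 3 and 6) are omitted.

```python
from typing import Any, Dict, List


def process_items(data: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Hypothetical stand-in for the staged loop described above."""
    category_data: Dict[str, Dict[str, Any]] = {}
    for item in data:
        # Stage 1: extract base attributes; a missing category is an error.
        categories = item.get("categories")
        if not categories:
            raise ValueError("required attribute 'categories' is missing")
        # Stage 4: clean faculty data by dropping blank names.
        faculty = [
            name.strip()
            for name in item.get("faculty_members", [])
            if name.strip()
        ]
        for category in categories:
            # Stage 2: initialize category information on first sight.
            info = category_data.setdefault(
                category, {"faculty": set(), "article_count": 0, "tc_count": 0}
            )
            # Stage 5: update the per-category statistics.
            info["faculty"].update(faculty)
            info["article_count"] += 1
            info["tc_count"] += item.get("tc_count", 0)
    return category_data
```

The real method stores its results on the processor instance (see the attributes listed for the class) rather than returning them; the return value here only keeps the sketch easy to inspect.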
- _test_category_processor(raw_attributes)[source]#
Test method for validating category processing functionality.
This private method is used for testing the category processor’s ability to handle raw attribute data and properly process it through the category system.
- Parameters:
raw_attributes (Dict[str, Any]) – Dictionary of raw attributes to test processing. Type: Dict[str, Any]
- Return type:
None
Notes
Used for internal testing purposes only
Validates category processing pipeline
Does not modify production data
Helps ensure data integrity
- call_get_attributes(*, data)[source]#
Extract and process attributes from raw publication data.
Extracts various attributes including categories, authors, departments, titles, citations, abstracts, licenses, publication dates, journal info, URLs, DOIs, and themes from the raw data.
- Parameters:
data (Dict[str, Any]) – Raw publication data dictionary. Type: Dict[str, Any]
- Returns:
- Dictionary containing extracted and processed attributes.
Type: Dict[str, Any]. Contains:
- categories (List[str]): List of publication categories
- faculty_members (List[str]): List of faculty authors
- faculty_affiliations (Dict[str, str]): Faculty to department mapping
- title (str): Publication title
- tc_count (int): Citation count
- abstract (str): Publication abstract
- license_url (str): License URL
- date_published_print (str): Print publication date
- date_published_online (str): Online publication date
- journal (str): Journal name
- download_url (str): Download URL
- doi (str): Digital Object Identifier
- themes (List[str]): List of publication themes
- Return type:
Dict[str, Any]
- Raises:
Exception – If no category is found in the data
Notes
Extracts all available attributes from raw data
Performs basic validation of required fields
Handles missing optional fields gracefully
Maintains data types for each attribute
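As an illustration of the behaviour described above (a required category, optional fields handled gracefully), here is a minimal sketch. The function `extract_attributes`, the input key names, and the chosen defaults are all assumptions; the actual raw data layout is not documented in this section.

```python
from typing import Any, Dict


def extract_attributes(data: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical extraction: 'categories' is required, the rest default."""
    categories = data.get("categories")
    if not categories:
        # Mirrors the documented behaviour: no category is an error.
        raise Exception("No category found in the data")
    return {
        "categories": categories,
        "faculty_members": data.get("faculty_members", []),
        "title": data.get("title", ""),
        # Coerce so that tc_count keeps its documented int type.
        "tc_count": int(data.get("tc_count", 0)),
        "doi": data.get("doi", ""),
        "themes": data.get("themes", []),
    }
```

Only a subset of the documented keys is shown; the remaining string fields would follow the same default pattern.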
- update_category_stats(**kwargs)[source]#
Update statistics for each category based on processed article data.
Updates category information including faculty members, departments, titles, citation counts, DOIs, and themes. Also calculates derived statistics like faculty count, department count, article count, and citation averages.
- Parameters:
**kwargs –
Keyword arguments containing article data. Required arguments:
- title (str): Article title
Type: str
- doi (str): Digital Object Identifier
Type: str
- tc_count (int): Citation count
Type: int
- faculty_members (list): List of faculty authors
Type: List[str]
- all_affiliations (set): Set of department affiliations
Type: Set[str]
- themes (list): List of article themes
Type: List[str]
- all_categories (list): List of all categories
Type: List[str]
- url_maps (dict): Category URL mappings
Type: Dict[str, Dict[str, str]]
- Raises:
KeyError – If required kwargs are missing
ValueError – If category information cannot be updated
- Return type:
None
Notes
Updates multiple statistics per category
Calculates derived metrics from raw data
Maintains relationships between entities
Handles missing optional data gracefully
Updates both raw counts and computed averages
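The split between raw counts and derived metrics can be sketched as follows. This is an assumption-laden illustration: `CategoryStatsSketch` and `update_stats` are hypothetical names, and the real CategoryInfo model may store and compute these fields differently.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class CategoryStatsSketch:
    """Hypothetical per-category record (not the library's CategoryInfo)."""
    faculty: Set[str] = field(default_factory=set)
    departments: Set[str] = field(default_factory=set)
    titles: List[str] = field(default_factory=list)
    tc_count: int = 0
    faculty_count: int = 0
    department_count: int = 0
    article_count: int = 0
    citation_average: float = 0.0


def update_stats(
    stats: CategoryStatsSketch,
    *,
    title: str,
    tc_count: int,
    faculty_members: Iterable[str],
    all_affiliations: Iterable[str],
) -> None:
    # Raw data: accumulate members, departments, titles, and citations.
    stats.faculty.update(faculty_members)
    stats.departments.update(all_affiliations)
    stats.titles.append(title)
    stats.tc_count += tc_count
    # Derived metrics: recomputed from the raw data after every update.
    stats.faculty_count = len(stats.faculty)
    stats.department_count = len(stats.departments)
    stats.article_count = len(stats.titles)
    stats.citation_average = stats.tc_count / stats.article_count
```

Recomputing the derived fields on each call keeps them consistent with the raw sets, at the cost of a little repeated work per article.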
- update_faculty_stats(**kwargs)[source]#
Update faculty statistics for each category.
Updates faculty member information including department affiliations, publication titles, DOIs, citation counts, and article counts. Creates or updates faculty statistics entries for each category.
- Parameters:
**kwargs –
Keyword arguments containing faculty and article data. Required arguments:
- faculty_members (List): List of faculty authors
Type: List[str]
- faculty_affiliations (Dict): Faculty department mappings
Type: Dict[str, List[str]]
- title (str): Article title
Type: str
- doi (str): Digital Object Identifier
Type: str
- tc_count (int): Citation count
Type: int
- all_categories (List): List of all categories
Type: List[str]
- url_maps (Dict): Category URL mappings
Type: Dict[str, Dict[str, str]]
- Raises:
KeyError – If required kwargs are missing
ValueError – If faculty statistics cannot be updated
- Return type:
None
Notes
Updates statistics for each faculty member
Maintains faculty-department relationships
Tracks publication metrics per faculty
Handles multiple department affiliations
Updates both individual and aggregate statistics
- update_global_faculty_stats(**kwargs)[source]#
Update global statistics for each faculty member.
Creates or updates global faculty statistics including total citations, article counts, department affiliations, DOIs, titles, categories, and category URLs across all publication categories.
- Parameters:
**kwargs –
Keyword arguments containing faculty and article data. Required arguments:
- faculty_members (List): List of faculty authors
Type: List[str]
- faculty_affiliations (Dict): Faculty department mappings
Type: Dict[str, List[str]]
- title (str): Article title
Type: str
- doi (str): Digital Object Identifier
Type: str
- tc_count (int): Citation count
Type: int
- all_categories (List): List of all categories
Type: List[str]
- top_level_categories (List): Top-level categories
Type: List[str]
- mid_level_categories (List): Mid-level categories
Type: List[str]
- low_level_categories (List): Low-level categories
Type: List[str]
- url_maps (Dict): Category URL mappings
Type: Dict[str, Dict[str, str]]
- themes (List): Article themes
Type: List[str]
- journal (str): Journal name
Type: str
- Raises:
KeyError – If required kwargs are missing
ValueError – If global faculty statistics cannot be updated
- Return type:
None
Notes
Updates global metrics for each faculty member
Tracks statistics across all categories
Maintains hierarchical category relationships
Handles multiple department affiliations
Aggregates publication metrics globally
- update_category_article_stats(**kwargs)[source]#
Update article statistics for each category.
Creates or updates article statistics including titles, citations, faculty members, affiliations, abstracts, licenses, publication dates, and URLs. Organizes articles by their category levels (top, mid, low).
- Parameters:
**kwargs –
Keyword arguments containing article data. Required arguments:
- title (str): Article title
Type: str
- doi (str): Digital Object Identifier
Type: str
- tc_count (int): Citation count
Type: int
- faculty_members (List): List of faculty authors
Type: List[str]
- faculty_affiliations (Dict): Faculty department mappings
Type: Dict[str, List[str]]
- abstract (str): Article abstract
Type: str
- license_url (str): License URL
Type: str
- date_published_print (str): Print publication date
Type: str
- date_published_online (str): Online publication date
Type: str
- journal (str): Journal name
Type: str
- download_url (str): Download URL
Type: str
- themes (List): Article themes
Type: List[str]
- all_categories (List): List of all categories
Type: List[str]
- low_level_categories (List): Low-level categories
Type: List[str]
- mid_level_categories (List): Mid-level categories
Type: List[str]
- url_maps (Dict): Category URL mappings
Type: Dict[str, Dict[str, str]]
- Raises:
KeyError – If required kwargs are missing
ValueError – If article statistics cannot be updated
- Return type:
None
Notes
Updates statistics for each category level
Maintains hierarchical relationships
Tracks detailed article metadata
Links articles to faculty and departments
Preserves publication timeline information
- create_article_object(**kwargs)[source]#
Create a new article object with complete metadata.
Creates a CrossrefArticleDetails object containing all article information, including category relationships, URLs, and metadata. Handles URL generation for different category levels and maintains category hierarchies.
- Parameters:
**kwargs –
Keyword arguments containing article data. Required arguments:
- doi (str): Digital Object Identifier
Type: str
- title (str): Article title
Type: str
- tc_count (int): Citation count
Type: int
- faculty_members (List): Faculty authors
Type: List[str]
- faculty_affiliations (Dict): Faculty affiliations
Type: Dict[str, List[str]]
- abstract (str): Article abstract
Type: str
- license_url (str): License URL
Type: str
- date_published_print (str): Print publication date
Type: str
- date_published_online (str): Online publication date
Type: str
- journal (str): Journal name
Type: str
- download_url (str): Download URL
Type: str
- themes (List): Article themes
Type: List[str]
- all_categories (List): All categories
Type: List[str]
- top_level_categories (List): Top-level categories
Type: List[str]
- mid_level_categories (List): Mid-level categories
Type: List[str]
- low_level_categories (List): Low-level categories
Type: List[str]
- Raises:
KeyError – If required kwargs are missing
ValueError – If article object cannot be created
- Return type:
None
Notes
Creates CrossrefArticleDetails instance
Generates URLs for all category levels
Maintains category hierarchies
Preserves all article metadata
Links faculty and department relationships
- clean_faculty_affiliations(faculty_affiliations)[source]#
Clean and format faculty affiliation data.
Processes raw faculty affiliation mappings to ensure consistent formatting and remove any invalid or malformed data.
- Parameters:
faculty_affiliations (Dict) – Raw faculty affiliation mappings. Type: Dict[str, Any]
- Returns:
- Cleaned faculty affiliation mappings.
Type: Dict[str, Any]
- Return type:
Dict
Notes
Removes invalid entries
Normalizes department names
Handles missing or malformed data
Maintains faculty-department relationships
- clean_faculty_members(faculty_members)[source]#
Clean and filter faculty member names.
Processes raw faculty member names to ensure consistent formatting and remove any invalid or empty entries.
- Parameters:
faculty_members (List) – Raw list of faculty member names. Type: List[str]
- Returns:
- Cleaned list of faculty member names.
Type: List[str]. Excludes empty strings and invalid entries.
- Return type:
List
Notes
Removes empty strings
Normalizes name formats
Filters invalid entries
Maintains unique entries
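A minimal sketch of this kind of cleaning is shown below. The function name `clean_members` and the specific rules (collapse whitespace, skip non-strings, keep first occurrence) are assumptions chosen to match the notes above, not the library's actual logic.

```python
from typing import Any, List


def clean_members(faculty_members: List[Any]) -> List[str]:
    """Hypothetical cleaner: drop blanks and non-strings, keep unique names."""
    seen = set()
    cleaned: List[str] = []
    for name in faculty_members:
        if not isinstance(name, str):
            continue  # filter invalid entries
        normalized = " ".join(name.split())  # collapse stray whitespace
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned
```

Using a list plus a `seen` set, rather than a bare set, keeps the original author order while still removing duplicates.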
- initialize_categories(categories)[source]#
Initialize category data structures for all category levels.
Creates CategoryInfo instances for each category and organizes them by level in the taxonomy hierarchy (top, mid, low).
- Parameters:
categories (Dict) – Categories organized by level. Type: Dict[str, List[str]]. Keys must be: “top”, “mid”, “low”.
- Returns:
- Organized category data.
Type: Dict[str, List[str]]. Contains:
- top_level_categories (List[str]): List of top-level categories
- mid_level_categories (List[str]): List of mid-level categories
- low_level_categories (List[str]): List of low-level categories
- all_categories (List[str]): List of all categories
- Return type:
Dict
- Raises:
ValueError – If category initialization fails
Notes
Creates CategoryInfo instances for each category
Maintains hierarchical relationships
Validates category levels
Ensures unique category names
Preserves taxonomy structure
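The input and output shapes described above can be sketched like this. `organize_categories` is a hypothetical name, and the sketch covers only the reshaping of the level lists; creating CategoryInfo instances is left out because their constructor is not documented here.

```python
from typing import Dict, List


def organize_categories(categories: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Hypothetical reshaping of {"top", "mid", "low"} into the documented keys."""
    missing = {"top", "mid", "low"} - set(categories)
    if missing:
        raise ValueError(f"missing category levels: {sorted(missing)}")
    top, mid, low = categories["top"], categories["mid"], categories["low"]
    # dict.fromkeys removes duplicates while preserving first-seen order.
    all_categories = list(dict.fromkeys([*top, *mid, *low]))
    return {
        "top_level_categories": list(top),
        "mid_level_categories": list(mid),
        "low_level_categories": list(low),
        "all_categories": all_categories,
    }
```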
- get_category_data()[source]#
Get the processed category data.
Provides access to the complete mapping of categories and their associated information, including statistics and relationships.
- Returns:
- Mapping of categories to their information.
Type: Dict[str, academic_metrics.models.category_info.CategoryInfo]
- Return type:
Dict
Notes
Returns complete category hierarchy
Includes all category statistics
Contains faculty and article relationships
Preserves category metadata
- get_category_article_stats()[source]#
Get article statistics organized by category.
Provides access to the complete mapping of categories to their associated article statistics, including metrics and metadata.
- Returns:
- Mapping of categories to their article statistics.
Type: Dict[str, academic_metrics.models.crossref_article_stats.CrossrefArticleStats]
- Return type:
Dict
Notes
Returns statistics for all categories
Includes article counts and metrics
Contains citation information
Preserves publication metadata
Maintains category relationships
- get_articles()[source]#
Get the list of processed articles.
Provides access to the complete list of processed articles with their full details and metadata.
- Returns:
- List of all processed article details.
Type: List[academic_metrics.models.crossref_article_details.CrossrefArticleDetails]
- Return type:
List
Notes
Returns all processed articles
Includes complete article metadata
Contains category assignments
Preserves faculty relationships
Maintains publication details
- get_faculty_stats()[source]#
Get faculty statistics organized by category.
Provides access to the complete mapping of categories to their associated faculty statistics, including publication metrics and relationships.
- Returns:
- Mapping of categories to their faculty statistics.
Type: Dict[str, academic_metrics.models.faculty_stats.FacultyStats]
- Return type:
Dict
Notes
Returns statistics for all categories
Includes faculty publication counts
Contains citation metrics
Preserves department affiliations
Maintains category-specific metrics
- get_global_faculty_stats()[source]#
Get global statistics for all faculty members.
Provides access to the complete mapping of faculty members to their global statistics across all categories and publications.
- Returns:
- Mapping of faculty members to their global statistics.
Type: Dict[str, academic_metrics.models.global_faculty_stats.GlobalFacultyStats]
- Return type:
Dict
Notes
Returns aggregate statistics per faculty
Includes cross-category metrics
Contains total publication counts
Preserves all department affiliations
Maintains complete publication history
- static _collect_all_affiliations(faculty_affiliations, logger)[source]#
Collect all unique department affiliations.
Extracts and deduplicates all department affiliations from the faculty to department mapping dictionary.
- Parameters:
faculty_affiliations (Dict) – Faculty to department mappings. Type: Dict[str, Any]
logger (logging.Logger) – Logger instance for tracking operations. Type: logging.Logger
- Returns:
- Set of unique department affiliations.
Type: Set[str]
- Return type:
Set[str]
Notes
Removes duplicate departments
Handles missing affiliations
Validates department names
Maintains unique entries only
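Deduplicating affiliations amounts to folding every faculty member's departments into one set. The sketch below is an assumption: `collect_affiliations` is a hypothetical name, and mapping values are assumed to be a single string, a list of strings, or empty.

```python
import logging
from typing import Any, Dict, Set


def collect_affiliations(
    faculty_affiliations: Dict[str, Any], logger: logging.Logger
) -> Set[str]:
    """Hypothetical collector: union of all department names in the mapping."""
    collected: Set[str] = set()
    for name, departments in faculty_affiliations.items():
        if isinstance(departments, str):
            departments = [departments]  # treat a single string as one entry
        if not isinstance(departments, (list, tuple, set)) or not departments:
            logger.debug("No affiliations listed for %s", name)
            continue
        # Keep only non-empty string entries; the set removes duplicates.
        collected.update(d for d in departments if isinstance(d, str) and d)
    return collected
```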
- static _generate_url(string, logger=None)[source]#
Generate a URL-safe string.
Converts an input string into a URL-safe format by removing special characters, replacing spaces, and ensuring proper encoding.
- Parameters:
string (str) – Input string to encode. Type: str
logger (logging.Logger | None) – Logger instance to use for logging. Type: logging.Logger | None. Defaults to None.
- Returns:
- URL-encoded string.
Type: str
- Return type:
str
Notes
Removes special characters
Replaces spaces with hyphens
Converts to lowercase
Ensures URL-safe encoding
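The transformation in the notes above resembles a conventional slug function. The following sketch is one plausible reading, not the library's code: `slugify` is a hypothetical name, and the exact character set kept (a-z, 0-9, hyphen) is an assumption.

```python
import re


def slugify(string: str) -> str:
    """Hypothetical slug helper: lowercase, hyphens for spaces, ASCII only."""
    lowered = string.strip().lower()
    hyphenated = re.sub(r"\s+", "-", lowered)        # spaces become hyphens
    cleaned = re.sub(r"[^a-z0-9-]", "", hyphenated)  # drop special characters
    # Removing characters can leave doubled hyphens; collapse and trim them.
    return re.sub(r"-{2,}", "-", cleaned).strip("-")


print(slugify("Computer Science & Engineering"))  # computer-science-engineering
```

If the real method percent-encodes characters instead of dropping them, its output will differ from this sketch for names that contain punctuation.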
- static _generate_normal_id(strings, logger=None)[source]#
Generate a normalized ID from a list of strings.
Combines multiple strings into a single normalized identifier, ensuring consistent formatting and URL-safe characters.
- Parameters:
strings (list) – List of strings to combine into an ID. Type: List[str]
logger (logging.Logger | None) – Logger instance to use for logging. Type: logging.Logger | None. Defaults to None.
- Returns:
- Normalized ID string.
Type: str. Format: lowercase, hyphen-separated.
- Return type:
str
Notes
Combines multiple strings
Converts to lowercase
Replaces spaces with hyphens
Removes special characters
Ensures consistent formatting
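Combining several strings into one identifier can be sketched as below. `normal_id` is a hypothetical name, and the rule that any run of non-alphanumeric characters becomes a single hyphen is an assumption consistent with the documented "lowercase, hyphen-separated" format.

```python
import re
from typing import List


def normal_id(strings: List[str]) -> str:
    """Hypothetical ID builder: normalize each part, then join with hyphens."""
    parts = []
    for text in strings:
        # Lowercase, then turn every run of other characters into one hyphen.
        part = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
        if part:  # skip inputs that normalize to nothing
            parts.append(part)
    return "-".join(parts)


print(normal_id(["Jane Doe", "Computer Science"]))  # jane-doe-computer-science
```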