orchestrators package#
Submodules#
orchestrators.category_data_orchestrator module#
- class academic_metrics.orchestrators.category_data_orchestrator.CategoryDataOrchestrator(*, data, output_dir_path, category_processor, faculty_postprocessor, department_postprocessor, strategy_factory, dataclass_factory, warning_manager, utilities, extend=False)[source]#
Bases: object
Orchestrates the processing and organization of academic publication data.
This class manages the workflow of processing classified publication data through the following stages:
1. Processing raw data through CategoryProcessor
2. Managing faculty/department relationships
3. Generating statistical outputs
4. Serializing results to JSON files
- data#
Raw classified publication data to process.
- Type:
List[Dict]
- strategy_factory#
Factory for creating processing strategies.
- Type:
StrategyFactory
- warning_manager#
System for handling and logging warnings.
- Type:
WarningManager
- dataclass_factory#
Factory for creating data model instances.
- Type:
DataClassFactory
- category_processor#
Processor for category-related operations.
- Type:
CategoryProcessor
- faculty_postprocessor#
Processor for faculty data refinement.
- Type:
FacultyPostprocessor
- final_category_data#
Processed category statistics.
- Type:
List[Dict]
- final_faculty_data#
Processed faculty statistics.
- Type:
List[Dict]
- final_article_stats_data#
Processed article statistics.
- Type:
List[Dict]
- final_article_data#
Processed article details.
- Type:
List[Dict]
- final_global_faculty_data#
Processed global faculty statistics.
- Type:
List[Dict]
- logger#
Logger instance for this class.
- Type:
- run_orchestrator()
Executes the main data processing workflow.
- get_final_category_data()
Returns processed category data.
- get_final_faculty_data()
Returns processed faculty data.
- get_final_global_faculty_data()
Returns processed global faculty data.
- get_final_article_stats_data()
Returns processed article statistics.
- get_final_article_data()
Returns processed article details.
- _save_all_results()
Saves all processed data to files.
- _refine_faculty_sets()
Refines faculty sets by removing duplicates.
- _refine_faculty_stats()
Refines faculty statistics based on name variations.
- _clean_category_data()
Prepares category data by removing unwanted keys.
- _serialize_and_save_category_data()
Serializes and saves category data.
- _serialize_and_save_faculty_stats()
Serializes and saves faculty statistics.
- _serialize_and_save_global_faculty_stats()
Serializes and saves global faculty statistics.
- _serialize_and_save_category_article_stats()
Serializes and saves article statistics.
- _serialize_and_save_articles()
Serializes and saves article details.
- _flatten_to_list()
Flattens nested data structures into a list.
- _write_to_json()
Writes data to a JSON file.
- __init__(*, data, output_dir_path, category_processor, faculty_postprocessor, department_postprocessor, strategy_factory, dataclass_factory, warning_manager, utilities, extend=False)[source]#
Initialize the CategoryDataOrchestrator with required components and settings.
Sets up logging configuration with both file and console handlers and initializes internal data structures for storing processed results.
- Parameters:
data (List[Dict]) – Raw classified publication data to process.
output_dir_path (str) – Directory path where output files will be saved.
category_processor (CategoryProcessor) – Processor for handling category-related operations.
faculty_postprocessor (FacultyPostprocessor) – Processor for faculty data refinement.
department_postprocessor (DepartmentPostprocessor) – Processor for department data refinement.
strategy_factory (StrategyFactory) – Factory for creating processing strategies.
dataclass_factory (DataClassFactory) – Factory for creating data model instances.
warning_manager (WarningManager) – System for handling and logging warnings.
utilities (Utilities) – General utility functions.
extend (bool, optional) – Whether to extend existing data files. Defaults to False.
- Raises:
ValueError – If output directory path doesn’t exist or isn’t writable.
TypeError – If any of the processor or factory arguments are of incorrect type.
- run_orchestrator(category_data=None)[source]#
Execute the main data processing workflow.
Processes the raw publication data through several stages:
1. Processes data through CategoryProcessor
2. Gets category data for faculty set refinement
3. Refines faculty sets to remove duplicates
4. Refines faculty statistics with name variations
5. Saves all processed results to files
- Raises:
ValueError – If category data processing fails.
IOError – If saving results to files fails.
- Return type:
- get_final_category_data()[source]#
Retrieve the processed category data.
- Returns:
List of processed category data dictionaries.
- Return type:
List[Dict]
- Raises:
ValueError – If final category data hasn’t been generated yet.
- get_final_faculty_data()[source]#
Retrieve the processed faculty data.
- Returns:
List of processed faculty data dictionaries.
- Return type:
List[Dict]
- Raises:
ValueError – If final faculty data hasn’t been generated yet.
- get_final_global_faculty_data()[source]#
Retrieve the processed global faculty data.
- Returns:
List of processed global faculty data dictionaries.
- Return type:
List[Dict]
- Raises:
ValueError – If final global faculty data hasn’t been generated yet.
- get_final_article_stats_data()[source]#
Retrieve the processed article statistics data.
- Returns:
List of processed article statistics dictionaries.
- Return type:
List[Dict]
- Raises:
ValueError – If final article statistics data hasn’t been generated yet.
- get_final_article_data()[source]#
Retrieve the processed article data.
- Returns:
List of processed article data dictionaries.
- Return type:
List[Dict]
- Raises:
ValueError – If final article data hasn’t been generated yet.
- _save_all_results()[source]#
Save all processed data to their respective JSON files.
Serializes and saves:
1. Category data
2. Faculty statistics
3. Article statistics
4. Article details
5. Global faculty statistics
- _refine_faculty(faculty_postprocessor, category_dict)[source]#
Refines faculty sets by removing near duplicates and updating counts.
Uses FacultyPostprocessor to clean faculty data by removing near-duplicate entries and updating all related faculty and department counts.
- Parameters:
faculty_postprocessor (FacultyPostprocessor) – Postprocessor for faculty data. Type: academic_metrics.postprocessors.faculty_postprocessor.FacultyPostprocessor
category_dict (dict) – Dictionary of categories and their information. Type: Dict[str, academic_metrics.models.category_info.CategoryInfo]
- Return type:
None
Notes
Removes near-duplicate faculty entries
Updates faculty counts per category
Updates department counts
Maintains faculty-department relationships
Ensures data consistency after refinement
- _refine_faculty_stats(*, faculty_stats, variations, category_dict)[source]#
Refines faculty statistics based on name variations.
Processes faculty statistics to account for name variations, ensuring accurate attribution of publications and metrics across all faculty members.
- Parameters:
faculty_stats (Dict) – Dictionary of faculty statistics. Type: Dict[str, academic_metrics.models.faculty_stats.FacultyStats]
variations (Dict) – Dictionary of name variations. Type: Dict[str, academic_metrics.models.string_variation.StringVariation]
category_dict (Dict) – Dictionary of categories and their information. Type: Dict[str, academic_metrics.models.category_info.CategoryInfo]
- Return type:
Notes
Iterates through all categories
Processes each faculty member
Applies name variation matching
Updates publication counts
Ensures metric consistency
Maintains statistical accuracy
- _refine_departments(department_postprocessor, category_dict)[source]#
Processes department sets by removing near duplicates and updates counts.
Uses DepartmentPostprocessor to clean department data by removing near-duplicate entries and updating all related department counts and relationships.
- Parameters:
department_postprocessor (DepartmentPostprocessor) – Postprocessor for department data. Type: academic_metrics.postprocessors.department_postprocessor.DepartmentPostprocessor
category_dict (dict) – Dictionary of categories and their information. Type: Dict[str, academic_metrics.models.category_info.CategoryInfo]
- Return type:
None
Notes
Removes near-duplicate department entries
Updates department counts per category
Maintains faculty-department relationships
Ensures naming consistency
Preserves hierarchical relationships
Updates all related statistics
- _clean_category_data(category_data)[source]#
Prepare category data by removing unwanted keys.
Cleans the raw category data by removing specified keys that are not needed for further processing or analysis.
- Parameters:
category_data (Dict) – Raw category data to clean. Type: Dict[str, academic_metrics.models.category_info.CategoryInfo]
- Returns:
- Cleaned category data with specified keys removed.
Type: Dict[str, Dict]
- Return type:
Notes
Identifies and removes unnecessary keys
Preserves essential category information
Ensures data consistency
Prepares data for downstream processing
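The key-removal step can be sketched as follows. This is a minimal illustration, assuming each category has already been converted to a plain dictionary; the function name `clean_category_data` and the entries of `UNWANTED_KEYS` are hypothetical, since the reference does not list which keys are removed.

```python
from typing import Any, Dict, FrozenSet

# Hypothetical key names; the keys actually removed are not documented here.
UNWANTED_KEYS: FrozenSet[str] = frozenset({"files", "faculty", "departments"})


def clean_category_data(
    category_data: Dict[str, Dict[str, Any]],
    unwanted: FrozenSet[str] = UNWANTED_KEYS,
) -> Dict[str, Dict[str, Any]]:
    """Return a copy of category_data with the unwanted keys dropped."""
    return {
        name: {key: value for key, value in info.items() if key not in unwanted}
        for name, info in category_data.items()
    }


raw = {"Physics": {"faculty_count": 2, "faculty": ["A", "B"], "files": ["x.json"]}}
cleaned = clean_category_data(raw)
```

Building a new dictionary rather than deleting keys in place leaves the input untouched, which keeps the raw data available for later stages.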
- _serialize_and_save_category_data(*, output_path, category_data)[source]#
Serialize and save category data to JSON file.
Converts the category data dictionary to JSON format and saves it to the specified file path, creating directories if needed.
- Parameters:
output_path (str) – Path where the JSON file will be saved. Type: str
category_data (Dict) – Category data to serialize. Type: Dict[str, Dict]
- Raises:
IOError – If file writing fails or directory creation fails
- Return type:
Notes
Creates output directory if needed
Serializes data to JSON format
Handles nested dictionary structures
Ensures proper file encoding
Validates output before saving
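A rough sketch of this serialize-and-save step, assuming the category objects are dataclasses that may hold set-valued fields. `CategoryInfoSketch`, its field names, and the function name `serialize_and_save` are hypothetical stand-ins, not the package's actual definitions.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Dict, Set


@dataclass
class CategoryInfoSketch:
    """Hypothetical stand-in for CategoryInfo; field names are assumptions."""

    faculty_count: int = 0
    faculty: Set[str] = field(default_factory=set)


def _json_default(value: Any) -> Any:
    """Convert values that the json module cannot encode natively."""
    if isinstance(value, (set, frozenset)):
        return sorted(value)
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def serialize_and_save(
    output_path: str, category_data: Dict[str, CategoryInfoSketch]
) -> None:
    path = Path(output_path)
    # Create the output directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {name: asdict(info) for name, info in category_data.items()}
    with path.open("w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=4, default=_json_default)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "nested" / "category_data.json"
    serialize_and_save(str(target), {"Physics": CategoryInfoSketch(2, {"B", "A"})})
    saved = json.loads(target.read_text(encoding="utf-8"))
```

Sets are written as sorted lists so that repeated runs produce stable files.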
- _serialize_and_save_faculty_stats(*, output_path, faculty_stats)[source]#
Serialize and save faculty statistics to JSON file.
Converts the faculty statistics dictionary to JSON format and saves it to the specified file path, creating directories if needed.
- Parameters:
output_path (str) – Path where the JSON file will be saved. Type: str
faculty_stats (Dict) – Faculty statistics to serialize. Type: Dict[str, academic_metrics.models.faculty_stats.FacultyStats]
- Raises:
IOError – If file writing fails or directory creation fails
- Return type:
Notes
Creates output directory if needed
Serializes data to JSON format
Handles nested dictionary structures
Ensures proper file encoding
Validates output before saving
- _serialize_and_save_global_faculty_stats(*, output_path, global_faculty_stats)[source]#
Serialize and save global faculty statistics to JSON file.
Converts the global faculty statistics dictionary to JSON format and saves it to the specified file path, creating directories if needed.
- Parameters:
output_path (str) – Path where the JSON file will be saved. Type: str
global_faculty_stats (Dict) – Global faculty statistics to serialize. Type: Dict[str, academic_metrics.models.faculty_stats.FacultyStats]
- Raises:
IOError – If file writing fails or directory creation fails
- Return type:
Notes
Creates output directory if needed
Serializes data to JSON format
Handles nested dictionary structures
Ensures proper file encoding
Validates output before saving
- _serialize_and_save_category_article_stats(*, output_path, article_stats)[source]#
Serialize and save article statistics to JSON file.
Converts the category article statistics dictionary to JSON format and saves it to the specified file path, creating directories if needed.
- Parameters:
output_path (str) – Path where the JSON file will be saved. Type: str
article_stats (Dict) – Article statistics to serialize. Type: Dict[str, academic_metrics.models.crossref_article_stats.CrossrefArticleStats]
- Raises:
IOError – If file writing fails or directory creation fails
- Return type:
Notes
Creates output directory if needed
Serializes data to JSON format
Handles nested dictionary structures
Ensures proper file encoding
Validates output before saving
- _serialize_and_save_articles(*, output_path, articles)[source]#
Serialize and save article details to JSON file.
Converts the list of article details to JSON format and saves it to the specified file path, creating directories if needed.
- Parameters:
output_path (str) – Path where the JSON file will be saved. Type: str
articles (List) – Article details to serialize. Type: List[academic_metrics.models.crossref_article_details.CrossrefArticleDetails]
- Raises:
IOError – If file writing fails or directory creation fails
- Return type:
Notes
Creates output directory if needed
Serializes data to JSON format
Handles complex article objects
Ensures proper file encoding
Validates output before saving
- _flatten_to_list(data)[source]#
Recursively flatten nested dictionaries/lists into a flat list.
Transforms a complex nested structure of dictionaries and lists into a single flat list of dictionaries, preserving all data.
- Parameters:
data (Union[Dict, List]) – Nested structure of dictionaries and lists. Type: Union[Dict[str, Any], List[Dict[str, Any]]]
- Returns:
- Flattened list of dictionaries.
Type: List[Dict[str, Any]]
- Return type:
List
Notes
Handles arbitrary nesting depth
Preserves dictionary contents
Maintains data relationships
Removes nested structure
Keeps all original values
Examples
Input:
    {
        "cat1": {
            "article_map": {
                "doi1": {"title": "Article 1"},
                "doi2": {"title": "Article 2"}
            }
        }
    }
Output:
    [
        {"title": "Article 1"},
        {"title": "Article 2"}
    ]
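One way to get this behaviour is a short recursive function. The sketch below is not the package's implementation; it assumes that a dictionary counts as a leaf record when none of its values is itself a dictionary, which matches the documented example but may differ from the real rule.

```python
from typing import Any, Dict, List, Union

Nested = Union[Dict[str, Any], List[Any]]


def flatten_to_list(data: Nested) -> List[Dict[str, Any]]:
    """Collect leaf dictionaries from arbitrarily nested dicts and lists."""
    if isinstance(data, list):
        flat: List[Dict[str, Any]] = []
        for element in data:
            flat.extend(flatten_to_list(element))
        return flat

    if isinstance(data, dict):
        # Assumed leaf rule: a dict with no dict-valued entries is a record.
        if not any(isinstance(value, dict) for value in data.values()):
            return [data]
        flat = []
        for value in data.values():
            if isinstance(value, (dict, list)):
                flat.extend(flatten_to_list(value))
        return flat

    # Scalars carry no record of their own.
    return []


nested = {
    "cat1": {
        "article_map": {
            "doi1": {"title": "Article 1"},
            "doi2": {"title": "Article 2"},
        }
    }
}
flat_articles = flatten_to_list(nested)
```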
- _write_to_json(data, output_path)[source]#
Write data to JSON file, handling extend mode.
Writes the provided data to a JSON file at the specified path, creating directories if needed and handling both new files and the extension of existing files.
- Parameters:
data (Union[List[Dict], Dict]) – Data to write to file. Type: Union[List[Dict[str, Any]], Dict[str, Any]]
output_path (str) – Path where the JSON file will be saved. Type: str
- Raises:
IOError – If file operations fail (creation, writing, or directory access)
- Return type:
Notes
Creates output directory if needed
Handles both new and existing files
Supports list and dictionary data
Ensures proper JSON formatting
Validates file permissions
Maintains data integrity
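The extend behaviour might look roughly like the sketch below. The merge rule (lists are concatenated, dictionaries are updated) is an assumption, and `extend` is passed as a function argument here only to keep the example self-contained; in the class it is a constructor setting.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Union

JsonData = Union[List[Dict[str, Any]], Dict[str, Any]]


def write_to_json(data: JsonData, output_path: str, extend: bool = False) -> None:
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)

    if extend and path.exists():
        with path.open("r", encoding="utf-8") as handle:
            existing = json.load(handle)
        # Assumed merge rule: concatenate lists, update dictionaries.
        if isinstance(existing, list) and isinstance(data, list):
            data = existing + data
        elif isinstance(existing, dict) and isinstance(data, dict):
            data = {**existing, **data}
        else:
            raise ValueError("Existing file and new data have different shapes")

    with path.open("w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=4)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "out" / "articles.json"
    write_to_json([{"doi": "10.1000/a"}], str(target))
    write_to_json([{"doi": "10.1000/b"}], str(target), extend=True)
    merged = json.loads(target.read_text(encoding="utf-8"))
```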
orchestrators.classification_orchestrator module#
- academic_metrics.orchestrators.classification_orchestrator.ClassificationResultsDict#
Type alias for a dictionary mapping DOIs to lists of classification results.
This type alias is used to represent the return type of the get_classification_results_by_doi() method.
- academic_metrics.orchestrators.classification_orchestrator.ClassificationResultsTuple#
Type alias for a tuple containing lists of classification results.
This type alias is used to represent the return type of the get_classification_results_by_doi() method.
Notes
Format of the tuple is (top_categories, mid_categories, low_categories, themes)
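For orientation, the two aliases could be declared as shown below, following the shapes given in the `_inject_categories` parameter description. The sample category and theme values are made up for illustration.

```python
from typing import Dict, List, Tuple

# Shapes taken from the _inject_categories documentation.
ClassificationResultsDict = Dict[str, List[str]]
ClassificationResultsTuple = Tuple[List[str], List[str], List[str], List[str]]

# Illustrative values only.
as_dict: ClassificationResultsDict = {
    "top_categories": ["Computer science"],
    "mid_categories": ["Machine learning"],
    "low_categories": ["Text classification"],
    "themes": ["bibliometrics"],
}

# Tuple order: (top_categories, mid_categories, low_categories, themes)
as_tuple: ClassificationResultsTuple = (
    as_dict["top_categories"],
    as_dict["mid_categories"],
    as_dict["low_categories"],
    as_dict["themes"],
)
```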
- class academic_metrics.orchestrators.classification_orchestrator.ClassificationOrchestrator(abstract_classifier_factory, utilities)[source]#
Bases: object
Manages the classification process for research abstracts.
Orchestrates the process of extracting DOIs and abstracts from research metadata, classifying them using AbstractClassifier, and integrating results back into the original data. Tracks unclassified items for monitoring.
- abstract_classifier_factory#
Factory function for AbstractClassifier instances.
- Type:
Callable[…, AbstractClassifier]
- unclassified_dois#
DOIs of unclassified items. Type: List[str]
- Type:
List
- unclassified_abstracts#
Abstracts of unclassified items. Type: List[str]
- Type:
List
- unclassified_doi_abstract_dict#
Maps unclassified DOIs to abstracts. Type: Dict[str, str]
- Type:
Dict
- unclassified_items#
Complete metadata of unclassified items. Type: List[Dict[str, Any]]
- Type:
List
- unclassified_details#
Organized unclassified data. Type: Dict[str, Union[List[str], List[Dict[str, Any]]]]
Contains:
- dois: List of unclassified DOIs
- abstracts: List of unclassified abstracts
- items: List of unclassified metadata items
- Type:
Dict
- run_classification() List[Dict][source]#
Processes and classifies a list of research metadata dictionaries.
- get_unclassified_doi_abstract_dict() Dict[str, str][source]#
Returns the DOI to abstract mapping dictionary for unclassified items.
- _classification_orchestrator() List[Dict][source]#
Core classification logic for processing research metadata.
- _extract_categories() ClassificationResultsDict | ClassificationResultsTuple[source]#
Gets classification results for a specific DOI.
- _retrieve_doi_abstract() Tuple[str, str]#
Extracts DOI and abstract from a research metadata dictionary.
- _update_classified_instance_variables() None[source]#
Updates tracking variables for unclassified items.
- _normalize_abstract() str[source]#
Normalizes an abstract by removing LaTeX and converting any resulting unicode to ASCII.
- __init__(abstract_classifier_factory, utilities)[source]#
Initialize the ClassificationOrchestrator.
Sets up the orchestrator with required dependencies for classifying research abstracts and managing the classification process.
- Parameters:
abstract_classifier_factory (Callable) – Factory function for AbstractClassifier. Type: Callable[[Dict[str, str]], AbstractClassifier]
utilities (Utilities) – Utilities instance for attribute extraction. Type: Utilities
- Returns:
None
Notes
Initializes tracking variables for unclassified items
Sets up classification status flags
Prepares data structures for results
Validates factory function compatibility
- run_classification(data, pre_classification_model='gpt-4o-mini', classification_model='gpt-4o-mini', theme_model='gpt-4o-mini')[source]#
Processes and classifies a list of research metadata dictionaries.
Extracts abstracts from research metadata, classifies them using specified AI models, and injects the classification results back into the original data.
- Parameters:
data (list) – List of dictionaries containing research metadata. Type: List[Dict[str, Any]]
pre_classification_model (str | None) – Model for pre-classification processing. Type: str | None. Defaults to “gpt-4o-mini”.
classification_model (str | None) – Model for main classification. Type: str | None. Defaults to “gpt-4o-mini”.
theme_model (str | None) – Model for theme extraction. Type: str | None. Defaults to “gpt-4o-mini”.
- Returns:
- Modified data with classifications injected.
Type: List[Dict[str, Any]]
Includes:
- Original metadata
- Classification results
- Theme information
- Processing status
- Return type:
List
Notes
Processes each item sequentially
Tracks unclassified items
Handles missing abstracts
Updates internal statistics
Maintains original data structure
- get_unclassified_item_count()[source]#
Gets the number of unclassified items.
Retrieves the count of items that could not be classified during the classification process.
- Returns:
- Number of unclassified items.
Type: int
- Return type:
- Raises:
RuntimeError – If classification has not been run yet
Notes
Validates classification status
Returns current count
Includes all unclassified types
Requires prior classification run
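The "requires prior classification run" check can be sketched as a simple flag guard. The class `UnclassifiedTrackerSketch` and its `mark_ran` method are hypothetical; only the `_classification_ran` flag name and the `RuntimeError` come from this reference.

```python
from typing import List


class UnclassifiedTrackerSketch:
    """Hypothetical stand-in showing the run-before-read guard."""

    def __init__(self) -> None:
        self._classification_ran = False
        self._unclassified_dois: List[str] = []

    def mark_ran(self) -> None:
        self._classification_ran = True

    def get_unclassified_item_count(self) -> int:
        # Refuse to report results before classification has run.
        if not self._classification_ran:
            raise RuntimeError("Run classification before requesting results.")
        return len(self._unclassified_dois)


tracker = UnclassifiedTrackerSketch()
try:
    tracker.get_unclassified_item_count()
    raised_before_run = False
except RuntimeError:
    raised_before_run = True

tracker.mark_ran()
count_after_run = tracker.get_unclassified_item_count()
```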
- get_unclassified_dois()[source]#
Gets the DOIs of unclassified items.
Retrieves the list of Digital Object Identifiers (DOIs) for items that could not be classified during the classification process.
- Returns:
- List of unclassified DOIs.
Type: List[str]. Empty list if all items were classified.
- Return type:
List
- Raises:
RuntimeError – If classification has not been run yet
Notes
Validates classification status
Returns unique DOIs only
Maintains original DOI format
Requires prior classification run
- get_unclassified_abstracts()[source]#
Gets the abstracts of unclassified items.
Retrieves the list of research abstracts for items that could not be classified during the classification process.
- Returns:
- List of unclassified abstracts.
Type: List[str]. Empty list if all items were classified.
- Return type:
List
- Raises:
RuntimeError – If classification has not been run yet
Notes
Validates classification status
Returns normalized abstracts
Maintains text formatting
Requires prior classification run
May include empty abstracts
- get_unclassified_doi_abstract_dict()[source]#
Gets the DOI to abstract mapping dictionary for unclassified items.
Retrieves a dictionary that maps Digital Object Identifiers (DOIs) to their corresponding abstracts for items that could not be classified.
- Returns:
- Dictionary mapping unclassified DOIs to abstracts.
Type: Dict[str, str]. Keys: DOIs (str). Values: Abstracts (str). Empty dict if all items were classified.
- Return type:
Dict
- Raises:
RuntimeError – If classification has not been run yet
Notes
Validates classification status
Maintains DOI-abstract relationships
Contains normalized abstracts
Requires prior classification run
Preserves original DOI format
- get_unclassified_items()[source]#
Gets the unclassified items.
Retrieves the complete list of research items that could not be classified, including all their original metadata.
- Returns:
- List of unclassified items with full metadata.
Type: List[Dict[str, Any]]. Empty list if all items were classified. Each dict contains complete item metadata.
- Return type:
List
- Raises:
RuntimeError – If classification has not been run yet
Notes
Validates classification status
Returns complete metadata
Preserves original structure
Requires prior classification run
Maintains all item attributes
- get_unclassified_details_dict()[source]#
Gets the details of unclassified items.
Retrieves a comprehensive dictionary containing organized information about all unclassified items, including DOIs, abstracts, and complete metadata.
- Returns:
- Organized details of unclassified items.
Type: Dict[str, Union[List[str], List[Dict[str, Any]]]]
Contains:
- dois: List[str] - Unclassified DOIs
- abstracts: List[str] - Unclassified abstracts
- items: List[Dict] - Complete metadata
- Return type:
Dict
- Raises:
RuntimeError – If classification has not been run yet
Notes
Validates classification status
Provides structured access
Groups related information
Requires prior classification run
Maintains data relationships
- _classification_orchestrator(data, pre_classification_model='gpt-4o-mini', classification_model='gpt-4o-mini', theme_model='gpt-4o-mini')[source]#
Core classification logic for processing research metadata.
Implements the main classification workflow, processing research metadata through multiple stages of classification and theme extraction.
- Parameters:
data (List) – List of dictionaries containing research metadata. Type: List[Dict[str, Any]]
pre_classification_model (str | None) – Model for pre-classification processing. Type: str | None. Defaults to “gpt-4o-mini”.
classification_model (str | None) – Model for main classification. Type: str | None. Defaults to “gpt-4o-mini”.
theme_model (str | None) – Model for theme extraction. Type: str | None. Defaults to “gpt-4o-mini”.
- Returns:
- Modified data with classifications and themes injected.
Type: List[Dict[str, Any]]
Includes:
- Original metadata
- Classification results
- Theme information
- Processing status
- Return type:
List
Notes
Processes items sequentially
Handles classification failures
Tracks unclassified items
Updates internal statistics
Maintains data integrity
Manages model selection
- _inject_categories(data, categories)[source]#
Adds classification results to a research metadata dictionary.
Injects classification categories and themes into the provided metadata dictionary, handling both dictionary and tuple result formats.
- Parameters:
data (Dict) – Research metadata dictionary. Type: Dict[str, Any]
categories (Union) – Classification results including categories and themes. Type: Union[ClassificationResultsDict, ClassificationResultsTuple], where:
- ClassificationResultsDict: Dict[str, List[str]]
Format: {"top_categories": List[str], "mid_categories": List[str], "low_categories": List[str], "themes": List[str]}
- ClassificationResultsTuple: Tuple[List[str], List[str], List[str], List[str]]
Format: (top_level_categories: List[str], mid_level_categories: List[str], low_level_categories: List[str], themes: List[str])
- Raises:
ValueError – If categories is neither a dict nor a tuple
- Return type:
Notes
Modifies input dictionary in-place
Handles both result formats
Preserves existing metadata
Validates category structure
Maintains hierarchical relationships
Provides default empty lists for missing dictionary keys
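A minimal sketch of handling both result formats is shown below. It assumes the results are stored on the metadata dictionary under the same four key names used by the dictionary format; the actual storage keys are not stated in this reference, and `inject_categories` is an illustrative name.

```python
from typing import Any, Dict, List, Tuple, Union

ResultsDict = Dict[str, List[str]]
ResultsTuple = Tuple[List[str], List[str], List[str], List[str]]

# Assumed storage keys, in the same order as the tuple format.
KEYS = ("top_categories", "mid_categories", "low_categories", "themes")


def inject_categories(
    data: Dict[str, Any], categories: Union[ResultsDict, ResultsTuple]
) -> None:
    """Add classification results to data in place."""
    if isinstance(categories, dict):
        for key in KEYS:
            # Missing dictionary keys fall back to an empty list.
            data[key] = list(categories.get(key, []))
    elif isinstance(categories, tuple):
        if len(categories) != len(KEYS):
            raise ValueError("tuple results must have exactly four entries")
        for key, values in zip(KEYS, categories):
            data[key] = list(values)
    else:
        raise ValueError("categories must be a dict or a tuple")


paper = {"DOI": "10.1000/a"}
inject_categories(paper, {"top_categories": ["Physics"], "themes": ["optics"]})

other: Dict[str, Any] = {}
inject_categories(other, (["A"], ["B"], ["C"], ["D"]))
```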
- _extract_categories(doi, classifier)[source]#
Gets classification results for a specific DOI.
Retrieves the classification categories and themes for a given DOI using the provided classifier instance. Supports both dictionary and tuple result formats.
- Parameters:
doi (str) – DOI identifier for the research item. Type: str
classifier (AbstractClassifier) – Classifier instance that performed classification. Type:
AbstractClassifier
- Returns:
- Classification results including categories and themes.
Type: Union[ClassificationResultsDict, ClassificationResultsTuple], where:
- ClassificationResultsDict: Format: {"top_categories": List[str], "mid_categories": List[str], "low_categories": List[str], "themes": List[str]}
- ClassificationResultsTuple: Format: (top_level_categories: List[str], mid_level_categories: List[str], low_level_categories: List[str], themes: List[str])
- Return type:
Union
Notes
Utilizes classifier to obtain results
Supports multiple result formats
Ensures DOI is valid and classified
Handles missing classification gracefully
- _make_doi_abstract_dict(doi, abstract)[source]#
Creates a DOI to abstract mapping dictionary.
Constructs a dictionary that maps a given DOI to its corresponding abstract, ensuring both values are provided.
- Parameters:
- Returns:
- Dictionary mapping DOI to abstract.
Type: Dict[str, str] Format: {doi: abstract}
- Return type:
- Raises:
ValueError – If either DOI or abstract is missing
Notes
Ensures both DOI and abstract are non-empty
Provides a simple mapping structure
Validates input before mapping
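The validation and mapping described here amount to very little code. The sketch below uses the hypothetical name `make_doi_abstract_dict` and treats empty strings and `None` alike as missing, which is an assumption.

```python
from typing import Dict, Optional


def make_doi_abstract_dict(doi: Optional[str], abstract: Optional[str]) -> Dict[str, str]:
    """Map a single DOI to its abstract, rejecting missing values."""
    if not doi or not abstract:
        raise ValueError("Both DOI and abstract are required")
    return {doi: abstract}


mapping = make_doi_abstract_dict("10.1000/example", "A short abstract.")
```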
- _get_classification_dependencies(item)[source]#
Extracts DOI, abstract, and extra context from a research metadata dictionary.
Uses the utilities module to safely extract required attributes from the research metadata, handling missing or invalid values.
- Parameters:
item (dict) – Research metadata dictionary. Type: Dict[str, Any]
- Returns:
- DOI, abstract, and extra context.
Type: Tuple[str, str, dict]
Format: (doi: str | None, abstract: str | None, extra_context: dict | None)
- Return type:
Notes
Uses Utilities for extraction
- Extracts attributes:
Returns None for any missing attributes
Preserves original attribute values
Handles missing or malformed data gracefully
- _update_classified_instance_variables(item, doi, abstract)[source]#
Updates tracking variables for unclassified items.
Maintains multiple tracking collections for items that couldn’t be classified, ensuring consistent record-keeping across different data structures.
- Parameters:
- Return type:
- Returns:
None
Notes
- Updates instance variables:
unclassified_details_dict
Handles missing values by using “NULL” placeholder
Maintains parallel data structures for different access patterns
Preserves original metadata in unclassified items list
Increments unclassified item counter
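The parallel bookkeeping can be pictured with the sketch below. The collection names follow the attributes documented for this class; the class name `UnclassifiedRecorderSketch`, the `record` method, and the counter name are hypothetical.

```python
from typing import Any, Dict, List, Optional

PLACEHOLDER = "NULL"  # placeholder for missing values, as noted above


class UnclassifiedRecorderSketch:
    """Hypothetical stand-in that keeps the tracking collections in step."""

    def __init__(self) -> None:
        self.unclassified_dois: List[str] = []
        self.unclassified_abstracts: List[str] = []
        self.unclassified_doi_abstract_dict: Dict[str, str] = {}
        self.unclassified_items: List[Dict[str, Any]] = []
        self.unclassified_item_count = 0  # assumed counter name

    def record(
        self, item: Dict[str, Any], doi: Optional[str], abstract: Optional[str]
    ) -> None:
        doi_value = doi if doi else PLACEHOLDER
        abstract_value = abstract if abstract else PLACEHOLDER
        self.unclassified_dois.append(doi_value)
        self.unclassified_abstracts.append(abstract_value)
        # Several items without a DOI would share the "NULL" key here.
        self.unclassified_doi_abstract_dict[doi_value] = abstract_value
        self.unclassified_items.append(item)
        self.unclassified_item_count += 1


recorder = UnclassifiedRecorderSketch()
recorder.record({"title": "No abstract"}, "10.1000/example", None)
```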
- _set_classification_ran_true()[source]#
Sets the classification ran flag to true.
Updates the internal state to indicate that classification process has been executed.
- Parameters:
None
- Return type:
- Returns:
None
Notes
Updates _classification_ran
Used for validation checks
State cannot be reset to false
- _validate_classification_ran(classification_ran)[source]#
Checks if classification has been run.
Verifies whether the classification process has been executed by checking the internal state flag.
- Parameters:
None
- Returns:
- True if classification has been run, False otherwise.
Type: bool
- Return type:
Notes
Reads _classification_ran
Used for validation before accessing results
Cannot detect if classification is currently running
- _normalize_abstract(abstract)[source]#
Normalizes an abstract by removing LaTeX and converting any resulting unicode to ASCII.
Processes research abstract text through two stages:
1. Converts LaTeX notation to unicode text
2. Converts unicode characters to ASCII equivalents
- Parameters:
abstract (str) – Research abstract text. Type: str May contain LaTeX math notation and unicode characters.
- Returns:
- Normalized abstract text.
Type: str Contains only ASCII characters.
- Return type:
Notes
Uses LatexNodes2Text for LaTeX conversion
Uses Unidecode for unicode to ASCII conversion
Handles mathematical notation
Preserves text structure
Removes special characters
Math mode set to “text” for consistent conversion
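To illustrate only the second stage (unicode to ASCII), here is a standard-library stand-in. It is not the Unidecode-based conversion the method uses: it decomposes accented characters and drops anything that has no ASCII form, whereas Unidecode transliterates many such characters instead of dropping them. The function name `to_ascii` is hypothetical, and the LaTeX stage is not covered.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Strip accents via NFKD decomposition and drop non-ASCII leftovers."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# "na\u00efve caf\u00e9" is "naive cafe" written with two accented letters.
ascii_text = to_ascii("na\u00efve caf\u00e9")
```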