orchestrators package#

Submodules#

orchestrators.category_data_orchestrator module#

class academic_metrics.orchestrators.category_data_orchestrator.CategoryDataOrchestrator(*, data, output_dir_path, category_processor, faculty_postprocessor, department_postprocessor, strategy_factory, dataclass_factory, warning_manager, utilities, extend=False)[source]#

Bases: object

Orchestrates the processing and organization of academic publication data.

This class manages the workflow of processing classified publication data through the following stages:

1. Processing raw data through CategoryProcessor
2. Managing faculty/department relationships
3. Generating statistical outputs
4. Serializing results to JSON files

data#

Raw classified publication data to process.

Type:: List[Dict]

output_dir_path#

Directory path for output files.

Type:: str

extend#

Whether to extend existing data files.

Type:: bool

strategy_factory#

Factory for creating processing strategies.

Type:: StrategyFactory

warning_manager#

System for handling and logging warnings.

Type:: WarningManager

dataclass_factory#

Factory for creating data model instances.

Type:: DataClassFactory

utils#

General utility functions.

Type:: Utilities

category_processor#

Processor for category-related operations.

Type:: CategoryProcessor

faculty_postprocessor#

Processor for faculty data refinement.

Type:: FacultyPostprocessor

final_category_data#

Processed category statistics.

Type:: List[Dict]

final_faculty_data#

Processed faculty statistics.

Type:: List[Dict]

final_article_stats_data#

Processed article statistics.

Type:: List[Dict]

final_article_data#

Processed article details.

Type:: List[Dict]

final_global_faculty_data#

Processed global faculty statistics.

Type:: List[Dict]

logger#

Logger instance for this class.

Type:: logging.Logger

log_file_path#

Path to the log file.

Type:: str

run_orchestrator(): Executes the main data processing workflow.

get_final_category_data(): Returns processed category data.

get_final_faculty_data(): Returns processed faculty data.

get_final_global_faculty_data(): Returns processed global faculty data.

get_final_article_stats_data(): Returns processed article statistics.

get_final_article_data(): Returns processed article details.

_save_all_results(): Saves all processed data to files.

_refine_faculty_sets(): Refines faculty sets by removing duplicates.

_refine_faculty_stats(): Refines faculty statistics based on name variations.

_clean_category_data(): Prepares category data by removing unwanted keys.

_serialize_and_save_category_data(): Serializes and saves category data.

_serialize_and_save_faculty_stats(): Serializes and saves faculty statistics.

_serialize_and_save_global_faculty_stats(): Serializes and saves global faculty statistics.

_serialize_and_save_category_article_stats(): Serializes and saves article statistics.

_serialize_and_save_articles(): Serializes and saves article details.

_flatten_to_list(): Flattens nested data structures into a list.

_write_to_json(): Writes data to a JSON file.

__init__(*, data, output_dir_path, category_processor, faculty_postprocessor, department_postprocessor, strategy_factory, dataclass_factory, warning_manager, utilities, extend=False)[source]#

Initialize the CategoryDataOrchestrator with required components and settings.

Sets up logging configuration with both file and console handlers and initializes internal data structures for storing processed results.

Parameters:

data (List[Dict]) – Raw classified publication data to process.
output_dir_path (str) – Directory path where output files will be saved.
category_processor (CategoryProcessor) – Processor for handling category-related operations.
faculty_postprocessor (FacultyPostprocessor) – Processor for faculty data refinement.
department_postprocessor (DepartmentPostprocessor) – Processor for department data refinement.
strategy_factory (StrategyFactory) – Factory for creating processing strategies.
dataclass_factory (DataClassFactory) – Factory for creating data model instances.
warning_manager (WarningManager) – System for handling and logging warnings.
utilities (Utilities) – General utility functions.
extend (bool, optional) – Whether to extend existing data files. Defaults to False.

Raises:

ValueError – If output directory path doesn’t exist or isn’t writable.
TypeError – If any of the processor or factory arguments are of incorrect type.

run_orchestrator(category_data=None)[source]#

Execute the main data processing workflow.

Processes the raw publication data through several stages:

1. Processes data through CategoryProcessor
2. Gets category data for faculty set refinement
3. Refines faculty sets to remove duplicates
4. Refines faculty statistics with name variations
5. Saves all processed results to files

Raises:

ValueError – If category data processing fails.
IOError – If saving results to files fails.

Return type:

None

get_final_category_data()[source]#

Retrieve the processed category data.

Returns:: List of processed category data dictionaries.
Return type:: List[Dict]
Raises:: ValueError – If final category data hasn’t been generated yet.

get_final_faculty_data()[source]#

Retrieve the processed faculty data.

Returns:: List of processed faculty data dictionaries.
Return type:: List[Dict]
Raises:: ValueError – If final faculty data hasn’t been generated yet.

get_final_global_faculty_data()[source]#

Retrieve the processed global faculty data.

Returns:: List of processed global faculty data dictionaries.
Return type:: List[Dict]
Raises:: ValueError – If final global faculty data hasn’t been generated yet.

get_final_article_stats_data()[source]#

Retrieve the processed article statistics data.

Returns:: List of processed article statistics dictionaries.
Return type:: List[Dict]
Raises:: ValueError – If final article statistics data hasn’t been generated yet.

get_final_article_data()[source]#

Retrieve the processed article data.

Returns:: List of processed article data dictionaries.
Return type:: List[Dict]
Raises:: ValueError – If final article data hasn’t been generated yet.
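
The getters above all share one contract: they raise ValueError when the corresponding data has not been generated yet. A minimal sketch of that guard follows. The class `ResultsHolder` and its `run()` method are hypothetical stand-ins, not part of academic_metrics, and the real orchestrator may track "not generated" differently.

```python
from typing import Dict, List, Optional


class ResultsHolder:
    """Hypothetical stand-in for the orchestrator's result storage."""

    def __init__(self) -> None:
        # Nothing has been generated until the workflow runs.
        self.final_category_data: Optional[List[Dict]] = None

    def run(self, records: List[Dict]) -> None:
        # Stand-in for run_orchestrator(); here it only copies the records.
        self.final_category_data = list(records)

    def get_final_category_data(self) -> List[Dict]:
        # Fail loudly rather than hand back missing data.
        if self.final_category_data is None:
            raise ValueError("Final category data has not been generated yet.")
        return self.final_category_data
```

Calling the getter before `run()` raises ValueError; after `run()` it returns the stored list.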

_save_all_results()[source]#

Save all processed data to their respective JSON files.

Serializes and saves: 1. Category data 2. Faculty statistics 3. Article statistics 4. Article details 5. Global faculty statistics

Each of the five outputs is written to its own JSON file.

Raises:: IOError – If any file operations fail.
Return type:: None

_refine_faculty(faculty_postprocessor, category_dict)[source]#

Refines faculty sets by removing near duplicates and updating counts.

Uses FacultyPostprocessor to clean faculty data by removing near-duplicate entries and updating all related faculty and department counts.

Parameters:

faculty_postprocessor (FacultyPostprocessor) – Postprocessor for faculty data. Type: academic_metrics.postprocessors.faculty_postprocessor.FacultyPostprocessor
category_dict (dict) – Dictionary of categories and their information. Type: Dict[str, academic_metrics.models.category_info.CategoryInfo]

Return type:

None

Notes

Removes near-duplicate faculty entries
Updates faculty counts per category
Updates department counts
Maintains faculty-department relationships
Ensures data consistency after refinement

_refine_faculty_stats(*, faculty_stats, variations, category_dict)[source]#

Refines faculty statistics based on name variations.

Processes faculty statistics to account for name variations, ensuring accurate attribution of publications and metrics across all faculty members.

Parameters:

faculty_stats (Dict) – Dictionary of faculty statistics. Type: Dict[str, academic_metrics.models.faculty_stats.FacultyStats]
variations (Dict) – Dictionary of name variations. Type: Dict[str, academic_metrics.models.string_variation.StringVariation]
category_dict (Dict) – Dictionary of categories and their information. Type: Dict[str, academic_metrics.models.category_info.CategoryInfo]

Return type:

None

Notes

Iterates through all categories
Processes each faculty member
Applies name variation matching
Updates publication counts
Ensures metric consistency
Maintains statistical accuracy

_refine_departments(department_postprocessor, category_dict)[source]#

Refines department sets by removing near duplicates and updating counts.

Uses DepartmentPostprocessor to clean department data by removing near-duplicate entries and updating all related department counts and relationships.

Parameters:

department_postprocessor (DepartmentPostprocessor) – Postprocessor for department data. Type: academic_metrics.postprocessors.department_postprocessor.DepartmentPostprocessor
category_dict (dict) – Dictionary of categories and their information. Type: Dict[str, academic_metrics.models.category_info.CategoryInfo]

Return type:

None

Notes

Removes near-duplicate department entries
Updates department counts per category
Maintains faculty-department relationships
Ensures naming consistency
Preserves hierarchical relationships
Updates all related statistics

_clean_category_data(category_data)[source]#

Prepare category data by removing unwanted keys.

Cleans the raw category data by removing specified keys that are not needed for further processing or analysis.

Parameters:

category_data (Dict) – Raw category data to clean. Type: Dict[str, academic_metrics.models.category_info.CategoryInfo]

Returns:

Cleaned category data with specified keys removed. Type: Dict[str, Dict]

Return type:

dict

Notes

Identifies and removes unnecessary keys
Preserves essential category information
Ensures data consistency
Prepares data for downstream processing
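
The key-removal step can be sketched as below. This is an illustration under stated assumptions: `CategoryInfoSketch` is a hypothetical, trimmed-down stand-in for CategoryInfo, and the excluded key name `files` is invented for the example. The source does not say which keys the real method removes.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Set


@dataclass
class CategoryInfoSketch:
    """Hypothetical stand-in for CategoryInfo (field names are assumptions)."""

    category_name: str
    faculty_count: int = 0
    files: List[str] = field(default_factory=list)  # assumed "unwanted" key


def clean_category_data(
    category_data: Dict[str, CategoryInfoSketch],
    exclude_keys: Set[str],
) -> Dict[str, Dict[str, Any]]:
    """Convert each category object to a dict and drop the excluded keys."""
    cleaned: Dict[str, Dict[str, Any]] = {}
    for name, info in category_data.items():
        as_dict = asdict(info)
        cleaned[name] = {k: v for k, v in as_dict.items() if k not in exclude_keys}
    return cleaned


raw = {"cat1": CategoryInfoSketch("cat1", faculty_count=3, files=["a.json"])}
cleaned = clean_category_data(raw, exclude_keys={"files"})
```

The input maps category names to objects and the output maps them to plain dictionaries, matching the documented parameter and return types.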

_serialize_and_save_category_data(*, output_path, category_data)[source]#

Serialize and save category data to JSON file.

Converts the category data dictionary to JSON format and saves it to the specified file path, creating directories if needed.

Parameters:

output_path (str) – Path where the JSON file will be saved. Type: str
category_data (Dict) – Category data to serialize. Type: Dict[str, Dict]

Raises:

IOError – If file writing fails or directory creation fails

Return type:

None

Notes

Creates output directory if needed
Serializes data to JSON format
Handles nested dictionary structures
Ensures proper file encoding
Validates output before saving

_serialize_and_save_faculty_stats(*, output_path, faculty_stats)[source]#

Serialize and save faculty statistics to JSON file.

Converts the faculty statistics dictionary to JSON format and saves it to the specified file path, creating directories if needed.

Parameters:

output_path (str) – Path where the JSON file will be saved. Type: str
faculty_stats (Dict) – Faculty statistics to serialize. Type: Dict[str, academic_metrics.models.faculty_stats.FacultyStats]

Raises:

IOError – If file writing fails or directory creation fails

Return type:

None

Notes

Creates output directory if needed
Serializes data to JSON format
Handles nested dictionary structures
Ensures proper file encoding
Validates output before saving

_serialize_and_save_global_faculty_stats(*, output_path, global_faculty_stats)[source]#

Serialize and save global faculty statistics to JSON file.

Converts the global faculty statistics dictionary to JSON format and saves it to the specified file path, creating directories if needed.

Parameters:

output_path (str) – Path where the JSON file will be saved. Type: str
global_faculty_stats (Dict) – Global faculty statistics to serialize. Type: Dict[str, academic_metrics.models.faculty_stats.FacultyStats]

Raises:

IOError – If file writing fails or directory creation fails

Return type:

None

Notes

Creates output directory if needed
Serializes data to JSON format
Handles nested dictionary structures
Ensures proper file encoding
Validates output before saving

_serialize_and_save_category_article_stats(*, output_path, article_stats)[source]#

Serialize and save article statistics to JSON file.

Converts the category article statistics dictionary to JSON format and saves it to the specified file path, creating directories if needed.

Parameters:

output_path (str) – Path where the JSON file will be saved. Type: str
article_stats (Dict) – Article statistics to serialize. Type: Dict[str, academic_metrics.models.crossref_article_stats.CrossrefArticleStats]

Raises:

IOError – If file writing fails or directory creation fails

Return type:

None

Notes

Creates output directory if needed
Serializes data to JSON format
Handles nested dictionary structures
Ensures proper file encoding
Validates output before saving

_serialize_and_save_articles(*, output_path, articles)[source]#

Serialize and save article details to JSON file.

Converts the list of article details to JSON format and saves it to the specified file path, creating directories if needed.

Parameters:

output_path (str) – Path where the JSON file will be saved. Type: str
articles (List) – Article details to serialize. Type: List[academic_metrics.models.crossref_article_details.CrossrefArticleDetails]

Raises:

IOError – If file writing fails or directory creation fails

Return type:

None

Notes

Creates output directory if needed
Serializes data to JSON format
Handles complex article objects
Ensures proper file encoding
Validates output before saving

_flatten_to_list(data)[source]#

Recursively flatten nested dictionaries/lists into a flat list.

Transforms a complex nested structure of dictionaries and lists into a single flat list of dictionaries, preserving all data.

Parameters:

data (Union[Dict, List]) – Nested structure of dictionaries and lists. Type: Union[Dict[str, Any], List[Dict[str, Any]]]

Returns:

Flattened list of dictionaries. Type: List[Dict[str, Any]]

Return type:

List

Notes

Handles arbitrary nesting depth
Preserves dictionary contents
Maintains data relationships
Removes nested structure
Keeps all original values

Examples

Input:

    {
        "cat1": {
            "article_map": {
                "doi1": {"title": "Article 1"},
                "doi2": {"title": "Article 2"}
            }
        }
    }

Output:

    [
        {"title": "Article 1"},
        {"title": "Article 2"}
    ]
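
One way to get the behaviour shown in the example is the recursive sketch below. It assumes that a dictionary with no dictionary values is a leaf record to keep, and that anything else is a container to descend into. The real method may decide differently, for instance for dictionaries that mix scalar and nested values.

```python
from typing import Any, Dict, List, Union


def flatten_to_list(data: Union[Dict[str, Any], List[Any]]) -> List[Dict[str, Any]]:
    """Collect the innermost dictionaries of a nested dict/list structure."""
    if isinstance(data, list):
        flattened: List[Dict[str, Any]] = []
        for element in data:
            flattened.extend(flatten_to_list(element))
        return flattened

    if isinstance(data, dict):
        # Assumption: a dict holding at least one dict value is a container.
        if any(isinstance(value, dict) for value in data.values()):
            flattened = []
            for value in data.values():
                # Scalars sitting next to nested dicts are dropped here.
                flattened.extend(flatten_to_list(value))
            return flattened
        # No nested dicts: treat this dict as a leaf record.
        return [data]

    # Scalars carry no record of their own.
    return []


nested = {
    "cat1": {
        "article_map": {
            "doi1": {"title": "Article 1"},
            "doi2": {"title": "Article 2"},
        }
    }
}
flat = flatten_to_list(nested)
```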

_write_to_json(data, output_path)[source]#

Write data to JSON file, handling extend mode.

Writes the provided data to a JSON file at the specified path, creating directories if needed. Handles both writing a new file and extending an existing one.

Parameters:

data (Union[List[Dict], Dict]) – Data to write to file. Type: Union[List[Dict[str, Any]], Dict[str, Any]]
output_path (str) – Path where the JSON file will be saved. Type: str

Raises:

IOError – If file operations fail (creation, writing, or directory access)

Return type:

None

Notes

Creates output directory if needed
Handles both new and existing files
Supports list and dictionary data
Ensures proper JSON formatting
Validates file permissions
Maintains data integrity
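
The extend behaviour could look like the following sketch. It makes several assumptions that the source does not confirm: `extend` is passed as an argument here rather than read from the instance, existing lists are concatenated with new ones, existing dictionaries are merged with new keys taking precedence, and mismatched types are rejected.

```python
import json
import os
from typing import Any, Dict, List, Union

JsonData = Union[List[Dict[str, Any]], Dict[str, Any]]


def write_to_json(data: JsonData, output_path: str, extend: bool = False) -> None:
    """Write data as JSON, optionally combining it with an existing file."""
    directory = os.path.dirname(output_path)
    if directory:
        os.makedirs(directory, exist_ok=True)

    if extend and os.path.exists(output_path):
        with open(output_path, "r", encoding="utf-8") as handle:
            existing = json.load(handle)
        if isinstance(existing, list) and isinstance(data, list):
            data = existing + data  # assumed: append new records
        elif isinstance(existing, dict) and isinstance(data, dict):
            data = {**existing, **data}  # assumed: new keys override old ones
        else:
            raise ValueError("Existing file and new data have different shapes.")

    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=4)
```

Without `extend=True`, an existing file is simply overwritten.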

orchestrators.classification_orchestrator module#

academic_metrics.orchestrators.classification_orchestrator.ClassificationResultsDict#

Type alias for a dictionary mapping DOIs to lists of classification results.

This type alias is used to represent the return type of the get_classification_results_by_doi() method.

alias of Dict[str, List[str]]

academic_metrics.orchestrators.classification_orchestrator.ClassificationResultsTuple#

Type alias for a tuple containing lists of classification results.

This type alias is used to represent the return type of the get_classification_results_by_doi() method.

Notes

Format of the tuple is (top_categories, mid_categories, low_categories, themes)

alias of Tuple[List[str], List[str], List[str], List[str]]
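
As a quick illustration, the two aliases can be written out and used as below. The alias definitions follow the "alias of" lines above. The sample category names and the helper `tuple_to_dict` are hypothetical, and the dictionary keys are taken from the format documented for _inject_categories.

```python
from typing import Dict, List, Tuple

ClassificationResultsDict = Dict[str, List[str]]
ClassificationResultsTuple = Tuple[List[str], List[str], List[str], List[str]]

# Tuple order: (top_categories, mid_categories, low_categories, themes)
example: ClassificationResultsTuple = (
    ["Education"],
    ["Teacher education"],
    ["Curriculum studies"],
    ["assessment", "pedagogy"],
)


def tuple_to_dict(results: ClassificationResultsTuple) -> ClassificationResultsDict:
    """Hypothetical helper: name each tuple position with its documented key."""
    top, mid, low, themes = results
    return {
        "top_categories": top,
        "mid_categories": mid,
        "low_categories": low,
        "themes": themes,
    }


as_dict = tuple_to_dict(example)
```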

class academic_metrics.orchestrators.classification_orchestrator.ClassificationOrchestrator(abstract_classifier_factory, utilities)[source]#

Bases: object

Manages the classification process for research abstracts.

Orchestrates the process of extracting DOIs and abstracts from research metadata, classifying them using AbstractClassifier, and integrating results back into the original data. Tracks unclassified items for monitoring.

abstract_classifier_factory#

Factory function for AbstractClassifier instances.

Type:: Callable[…, AbstractClassifier]

taxonomy#

Classification hierarchy for AbstractClassifier.

Type:: Taxonomy

utilities#

Utilities for attribute extraction.

Type:: Utilities

ai_api_key#

API key for AI service access.

Type:: str

unclassified_item_count#

Count of unclassified items.

Type:: int

unclassified_dois#

DOIs of unclassified items.

Type:: List[str]

unclassified_abstracts#

Abstracts of unclassified items.

Type:: List[str]

unclassified_doi_abstract_dict#

Maps unclassified DOIs to abstracts.

Type:: Dict[str, str]

unclassified_items#

Complete metadata of unclassified items.

Type:: List[Dict[str, Any]]

unclassified_details#

Organized unclassified data. Contains:

- dois: List of unclassified DOIs
- abstracts: List of unclassified abstracts
- items: List of unclassified metadata items

Type:: Dict[str, Union[List[str], List[Dict[str, Any]]]]

run_classification() → List[Dict][source]#: Processes and classifies a list of research metadata dictionaries.

get_unclassified_item_count() → int[source]#: Returns the number of unclassified items.

get_unclassified_dois() → List[str][source]#: Returns the DOIs of unclassified items.

get_unclassified_abstracts() → List[str][source]#: Returns the abstracts of unclassified items.

get_unclassified_doi_abstract_dict() → Dict[str, str][source]#: Returns the DOI to abstract mapping dictionary for unclassified items.

get_unclassified_items() → List[Dict][source]#: Returns the unclassified items.

get_unclassified_details_dict() → Dict[source]#: Returns the details of unclassified items.

_classification_orchestrator() → List[Dict][source]#: Core classification logic for processing research metadata.

_inject_categories() → None[source]#: Adds classification results to a research metadata dictionary.

_extract_categories() → ClassificationResultsDict | ClassificationResultsTuple[source]#: Gets classification results for a specific DOI.

_make_doi_abstract_dict() → Dict[str, str][source]#: Creates a DOI to abstract mapping dictionary.

_retrieve_doi_abstract() → Tuple[str, str]#: Extracts DOI and abstract from a research metadata dictionary.

_update_classified_instance_variables() → None[source]#: Updates tracking variables for unclassified items.

_set_classification_ran_true() → None[source]#: Sets the classification ran flag to true.

_has_ran_classification() → bool[source]#: Checks if classification has been run.

_validate_classification_ran() → None[source]#: Validates if classification has been run.

_normalize_abstract() → str[source]#: Normalizes an abstract by removing LaTeX and converting any resulting unicode to ASCII.

__init__(abstract_classifier_factory, utilities)[source]#

Initialize the ClassificationOrchestrator.

Sets up the orchestrator with required dependencies for classifying research abstracts and managing the classification process.

Parameters:

abstract_classifier_factory (Callable) – Factory function for AbstractClassifier. Type: Callable[[Dict[str, str]], AbstractClassifier]
utilities (Utilities) – Utilities instance for attribute extraction. Type: Utilities

Returns:

None

Notes

Initializes tracking variables for unclassified items
Sets up classification status flags
Prepares data structures for results
Validates factory function compatibility

run_classification(data, pre_classification_model='gpt-4o-mini', classification_model='gpt-4o-mini', theme_model='gpt-4o-mini')[source]#

Processes and classifies a list of research metadata dictionaries.

Extracts abstracts from research metadata, classifies them using specified AI models, and injects the classification results back into the original data.

Parameters:

data (list) – List of dictionaries containing research metadata. Type: List[Dict[str, Any]]
pre_classification_model (str | None) – Model for pre-classification processing. Defaults to "gpt-4o-mini".
classification_model (str | None) – Model for main classification. Defaults to "gpt-4o-mini".
theme_model (str | None) – Model for theme extraction. Defaults to "gpt-4o-mini".

Returns:

Modified data with classifications injected. Type: List[Dict[str, Any]]. Includes:

- Original metadata
- Classification results
- Theme information
- Processing status

Return type:

List

Notes

Processes each item sequentially
Tracks unclassified items
Handles missing abstracts
Updates internal statistics
Maintains original data structure

get_unclassified_item_count()[source]#

Gets the number of unclassified items.

Retrieves the count of items that could not be classified during the classification process.

Returns:

Number of unclassified items. Type: int

Return type:

int

Raises:

RuntimeError – If classification has not been run yet

Notes

Validates classification status
Returns current count
Includes all unclassified types
Requires prior classification run

get_unclassified_dois()[source]#

Gets the DOIs of unclassified items.

Retrieves the list of Digital Object Identifiers (DOIs) for items that could not be classified during the classification process.

Returns:

List of unclassified DOIs. Type: List[str]. Empty list if all items were classified.

Return type:

List

Raises:

RuntimeError – If classification has not been run yet

Notes

Validates classification status
Returns unique DOIs only
Maintains original DOI format
Requires prior classification run

get_unclassified_abstracts()[source]#

Gets the abstracts of unclassified items.

Retrieves the list of research abstracts for items that could not be classified during the classification process.

Returns:

List of unclassified abstracts. Type: List[str]. Empty list if all items were classified.

Return type:

List

Raises:

RuntimeError – If classification has not been run yet

Notes

Validates classification status
Returns normalized abstracts
Maintains text formatting
Requires prior classification run
May include empty abstracts

get_unclassified_doi_abstract_dict()[source]#

Gets the DOI to abstract mapping dictionary for unclassified items.

Retrieves a dictionary that maps Digital Object Identifiers (DOIs) to their corresponding abstracts for items that could not be classified.

Returns:

Dictionary mapping unclassified DOIs to abstracts. Type: Dict[str, str] (keys are DOIs, values are abstracts). Empty dict if all items were classified.

Return type:

Dict

Raises:

RuntimeError – If classification has not been run yet

Notes

Validates classification status
Maintains DOI-abstract relationships
Contains normalized abstracts
Requires prior classification run
Preserves original DOI format

get_unclassified_items()[source]#

Gets the unclassified items.

Retrieves the complete list of research items that could not be classified, including all their original metadata.

Returns:

List of unclassified items with full metadata. Type: List[Dict[str, Any]]. Each dict contains complete item metadata. Empty list if all items were classified.

Return type:

List

Raises:

RuntimeError – If classification has not been run yet

Notes

Validates classification status
Returns complete metadata
Preserves original structure
Requires prior classification run
Maintains all item attributes

get_unclassified_details_dict()[source]#

Gets the details of unclassified items.

Retrieves a comprehensive dictionary containing organized information about all unclassified items, including DOIs, abstracts, and complete metadata.

Returns:

Organized details of unclassified items. Type: Dict[str, Union[List[str], List[Dict[str, Any]]]]. Contains:

- dois: List[str] - Unclassified DOIs
- abstracts: List[str] - Unclassified abstracts
- items: List[Dict] - Complete metadata

Return type:

Dict

Raises:

RuntimeError – If classification has not been run yet

Notes

Validates classification status
Provides structured access
Groups related information
Requires prior classification run
Maintains data relationships

_classification_orchestrator(data, pre_classification_model='gpt-4o-mini', classification_model='gpt-4o-mini', theme_model='gpt-4o-mini')[source]#

Core classification logic for processing research metadata.

Implements the main classification workflow, processing research metadata through multiple stages of classification and theme extraction.

Parameters:

data (List) – List of dictionaries containing research metadata. Type: List[Dict[str, Any]]
pre_classification_model (str | None) – Model for pre-classification processing. Defaults to "gpt-4o-mini".
classification_model (str | None) – Model for main classification. Defaults to "gpt-4o-mini".
theme_model (str | None) – Model for theme extraction. Defaults to "gpt-4o-mini".

Returns:

Modified data with classifications and themes injected. Type: List[Dict[str, Any]]. Includes:

- Original metadata
- Classification results
- Theme information
- Processing status

Return type:

List

Notes

Processes items sequentially
Handles classification failures
Tracks unclassified items
Updates internal statistics
Maintains data integrity
Manages model selection

_inject_categories(data, categories)[source]#

Adds classification results to a research metadata dictionary.

Injects classification categories and themes into the provided metadata dictionary, handling both dictionary and tuple result formats.

Parameters:

data (Dict) – Research metadata dictionary. Type: Dict[str, Any]
categories (Union) – Classification results including categories and themes. Type: Union[ClassificationResultsDict, ClassificationResultsTuple], where:

- ClassificationResultsDict: Dict[str, List[str]]

  Format: {
      "top_categories": List[str],
      "mid_categories": List[str],
      "low_categories": List[str],
      "themes": List[str]
  }

- ClassificationResultsTuple: Tuple[List[str], List[str], List[str], List[str]]

  Format: (
      top_level_categories: List[str],
      mid_level_categories: List[str],
      low_level_categories: List[str],
      themes: List[str]
  )

Raises:

ValueError – If categories is neither a dict nor a tuple

Return type:

None

Notes

Modifies input dictionary in-place
Handles both result formats
Preserves existing metadata
Validates category structure
Maintains hierarchical relationships
Provides default empty lists for missing dictionary keys
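
The two accepted formats can be handled as in the sketch below. It assumes the results are written onto the metadata dictionary under the same four key names used in the dictionary format. The source does not state the destination keys, so treat them as illustrative.

```python
from typing import Any, Dict, List, Tuple, Union

ClassificationResultsDict = Dict[str, List[str]]
ClassificationResultsTuple = Tuple[List[str], List[str], List[str], List[str]]

# Assumed destination keys, in the documented tuple order.
CATEGORY_KEYS = ("top_categories", "mid_categories", "low_categories", "themes")


def inject_categories(
    data: Dict[str, Any],
    categories: Union[ClassificationResultsDict, ClassificationResultsTuple],
) -> None:
    """Add classification results to a metadata dict in place."""
    if isinstance(categories, dict):
        # Missing keys fall back to empty lists.
        values = [categories.get(key, []) for key in CATEGORY_KEYS]
    elif isinstance(categories, tuple):
        if len(categories) != len(CATEGORY_KEYS):
            raise ValueError("Expected a 4-tuple of category lists.")
        values = list(categories)
    else:
        raise ValueError("categories must be a dict or a tuple.")

    for key, value in zip(CATEGORY_KEYS, values):
        data[key] = value
```

Existing metadata keys are left untouched, since only the four category keys are assigned.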

_extract_categories(doi, classifier)[source]#

Gets classification results for a specific DOI.

Retrieves the classification categories and themes for a given DOI using the provided classifier instance. Supports both dictionary and tuple result formats.

Parameters:

doi (str) – DOI identifier for the research item. Type: str
classifier (AbstractClassifier) – Classifier instance that performed classification. Type: AbstractClassifier

Returns:

Classification results including categories and themes. Type: Union[ClassificationResultsDict, ClassificationResultsTuple], where:

- ClassificationResultsDict:

  Format: {
      "top_categories": List[str],
      "mid_categories": List[str],
      "low_categories": List[str],
      "themes": List[str]
  }

- ClassificationResultsTuple:

  Format: (
      top_level_categories: List[str],
      mid_level_categories: List[str],
      low_level_categories: List[str],
      themes: List[str]
  )

Return type:

Union

Notes

Utilizes classifier to obtain results
Supports multiple result formats
Ensures DOI is valid and classified
Handles missing classification gracefully

_make_doi_abstract_dict(doi, abstract)[source]#

Creates a DOI to abstract mapping dictionary.

Constructs a dictionary that maps a given DOI to its corresponding abstract, ensuring both values are provided.

Parameters:

doi (str) – DOI identifier for the research item. Type: str
abstract (str) – Research abstract text. Type: str

Returns:

Dictionary mapping DOI to abstract. Type: Dict[str, str]. Format: {doi: abstract}

Return type:

dict

Raises:

ValueError – If either DOI or abstract is missing

Notes

Ensures both DOI and abstract are non-empty
Provides a simple mapping structure
Validates input before mapping

_get_classification_dependencies(item)[source]#

Extracts DOI, abstract, and extra context from a research metadata dictionary.

Uses the utilities module to safely extract required attributes from the research metadata, handling missing or invalid values.

Parameters:

item (dict) – Research metadata dictionary. Type: Dict[str, Any]

Returns:

DOI, abstract, and extra context. Type: Tuple[str, str, dict]

Format: (
    doi: str | None,
    abstract: str | None,
    extra_context: dict | None
)

Return type:

tuple

Notes

Uses Utilities for extraction
Extracts attributes:
Returns None for any missing attributes
Preserves original attribute values
Handles missing or malformed data gracefully

_update_classified_instance_variables(item, doi, abstract)[source]#

Updates tracking variables for unclassified items.

Maintains multiple tracking collections for items that couldn’t be classified, ensuring consistent record-keeping across different data structures.

Parameters:

item (dict) – Research metadata dictionary. Type: Dict[str, Any]
doi (str) – DOI identifier. Type: str | None
abstract (str) – Research abstract text. Type: str | None

Return type:

None

Returns:

None

Notes

Updates instance variables:
- unclassified_item_count
- unclassified_dois
- unclassified_abstracts
- unclassified_doi_abstract_dict
- unclassified_items
- unclassified_details_dict
Handles missing values by using “NULL” placeholder
Maintains parallel data structures for different access patterns
Preserves original metadata in unclassified items list
Increments unclassified item counter
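
The bookkeeping listed in the notes can be sketched as follows. `UnclassifiedTracker` and its `record()` method are hypothetical names. The attribute names and the "NULL" placeholder come from the notes, while the exact update order is an assumption.

```python
from typing import Any, Dict, List, Optional


class UnclassifiedTracker:
    """Hypothetical holder for the parallel unclassified-item collections."""

    def __init__(self) -> None:
        self.unclassified_item_count: int = 0
        self.unclassified_dois: List[str] = []
        self.unclassified_abstracts: List[str] = []
        self.unclassified_doi_abstract_dict: Dict[str, str] = {}
        self.unclassified_items: List[Dict[str, Any]] = []
        self.unclassified_details_dict: Dict[str, list] = {
            "dois": [],
            "abstracts": [],
            "items": [],
        }

    def record(self, item: Dict[str, Any], doi: Optional[str], abstract: Optional[str]) -> None:
        # Missing values are stored under the "NULL" placeholder.
        doi_value = doi if doi is not None else "NULL"
        abstract_value = abstract if abstract is not None else "NULL"

        self.unclassified_item_count += 1
        self.unclassified_dois.append(doi_value)
        self.unclassified_abstracts.append(abstract_value)
        # Note: several items without a DOI would share the "NULL" key here.
        self.unclassified_doi_abstract_dict[doi_value] = abstract_value
        self.unclassified_items.append(item)
        self.unclassified_details_dict["dois"].append(doi_value)
        self.unclassified_details_dict["abstracts"].append(abstract_value)
        self.unclassified_details_dict["items"].append(item)
```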

_set_classification_ran_true()[source]#

Sets the classification ran flag to true.

Updates the internal state to indicate that the classification process has been executed.

Parameters:: None
Return type:: None
Returns:: None

Notes

Updates _classification_ran
Used for validation checks
State cannot be reset to false

_has_ran_classification()[source]#

Checks if classification has been run.

Returns:: True if classification has been run, False otherwise.
Return type:: bool

_validate_classification_ran(classification_ran)[source]#

Checks if classification has been run.

Verifies whether the classification process has been executed by checking the internal state flag.

Parameters:

None

Returns:

True if classification has been run, False otherwise.: Type: bool

Return type:

bool

Notes

Reads _classification_ran
Used for validation before accessing results
Cannot detect if classification is currently running
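
Taken together, the flag helpers and the RuntimeError raised by the public getters suggest a guard pattern like the sketch below. This is a simplified illustration, not the library's implementation. `ClassificationState` is a hypothetical class, the validation helper here takes no argument, and it is assumed to be the place where RuntimeError is raised.

```python
from typing import List


class ClassificationState:
    """Hypothetical sketch of the classification-ran guard."""

    def __init__(self) -> None:
        self._classification_ran: bool = False
        self._unclassified_dois: List[str] = []

    def _set_classification_ran_true(self) -> None:
        # One-way switch: the sketch offers no way to reset it.
        self._classification_ran = True

    def _has_ran_classification(self) -> bool:
        return self._classification_ran

    def _validate_classification_ran(self) -> None:
        # Assumed behaviour: raise instead of returning a status.
        if not self._has_ran_classification():
            raise RuntimeError("Classification has not been run yet.")

    def get_unclassified_dois(self) -> List[str]:
        self._validate_classification_ran()
        return self._unclassified_dois
```

Every public getter would call the validation helper first, so results cannot be read before a classification run.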

_normalize_abstract(abstract)[source]#

Normalizes an abstract by removing LaTeX and converting any resulting unicode to ASCII.

Processes research abstract text through two stages: 1. Converts LaTeX notation to unicode text 2. Converts unicode characters to ASCII equivalents

Parameters:

abstract (str) – Research abstract text. Type: str May contain LaTeX math notation and unicode characters.

Returns:

Normalized abstract text.: Type: str Contains only ASCII characters.

Return type:

str

Notes

Uses LatexNodes2Text for LaTeX conversion
Uses Unidecode for unicode to ASCII conversion
Handles mathematical notation
Preserves text structure
Removes special characters
Math mode set to “text” for consistent conversion
