orchestrators package#

Submodules#

orchestrators.category_data_orchestrator module#

class academic_metrics.orchestrators.category_data_orchestrator.CategoryDataOrchestrator(*, data, output_dir_path, category_processor, faculty_postprocessor, department_postprocessor, strategy_factory, dataclass_factory, warning_manager, utilities, extend=False)[source]#

Bases: object

Orchestrates the processing and organization of academic publication data.

This class manages the workflow of processing classified publication data through the following stages:

  1. Processing raw data through CategoryProcessor
  2. Managing faculty/department relationships
  3. Generating statistical outputs
  4. Serializing results to JSON files

data#

Raw classified publication data to process.

Type:

List[Dict]

output_dir_path#

Directory path for output files.

Type:

str

extend#

Whether to extend existing data files.

Type:

bool

strategy_factory#

Factory for creating processing strategies.

Type:

StrategyFactory

warning_manager#

System for handling and logging warnings.

Type:

WarningManager

dataclass_factory#

Factory for creating data model instances.

Type:

DataClassFactory

utils#

General utility functions.

Type:

Utilities

category_processor#

Processor for category-related operations.

Type:

CategoryProcessor

faculty_postprocessor#

Processor for faculty data refinement.

Type:

FacultyPostprocessor

final_category_data#

Processed category statistics.

Type:

List[Dict]

final_faculty_data#

Processed faculty statistics.

Type:

List[Dict]

final_article_stats_data#

Processed article statistics.

Type:

List[Dict]

final_article_data#

Processed article details.

Type:

List[Dict]

final_global_faculty_data#

Processed global faculty statistics.

Type:

List[Dict]

logger#

Logger instance for this class.

Type:

logging.Logger

log_file_path#

Path to the log file.

Type:

str

run_orchestrator()

Executes the main data processing workflow.

get_final_category_data()

Returns processed category data.

get_final_faculty_data()

Returns processed faculty data.

get_final_global_faculty_data()

Returns processed global faculty data.

get_final_article_stats_data()

Returns processed article statistics.

get_final_article_data()

Returns processed article details.

_save_all_results()

Saves all processed data to files.

_refine_faculty()

Refines faculty sets by removing duplicates.

_refine_faculty_stats()

Refines faculty statistics based on name variations.

_clean_category_data()

Prepares category data by removing unwanted keys.

_serialize_and_save_category_data()

Serializes and saves category data.

_serialize_and_save_faculty_stats()

Serializes and saves faculty statistics.

_serialize_and_save_global_faculty_stats()

Serializes and saves global faculty statistics.

_serialize_and_save_category_article_stats()

Serializes and saves article statistics.

_serialize_and_save_articles()

Serializes and saves article details.

_flatten_to_list()

Flattens nested data structures into a list.

_write_to_json()

Writes data to JSON file.

__init__(*, data, output_dir_path, category_processor, faculty_postprocessor, department_postprocessor, strategy_factory, dataclass_factory, warning_manager, utilities, extend=False)[source]#

Initialize the CategoryDataOrchestrator with required components and settings.

Sets up logging configuration with both file and console handlers and initializes internal data structures for storing processed results.

Parameters:
  • data (List[Dict]) – Raw classified publication data to process.

  • output_dir_path (str) – Directory path where output files will be saved.

  • category_processor (CategoryProcessor) – Processor for handling category-related operations.

  • faculty_postprocessor (FacultyPostprocessor) – Processor for faculty data refinement.

  • department_postprocessor (DepartmentPostprocessor) – Postprocessor for department data.

  • strategy_factory (StrategyFactory) – Factory for creating processing strategies.

  • dataclass_factory (DataClassFactory) – Factory for creating data model instances.

  • warning_manager (WarningManager) – System for handling and logging warnings.

  • utilities (Utilities) – General utility functions.

  • extend (bool, optional) – Whether to extend existing data files. Defaults to False.

Raises:
  • ValueError – If output directory path doesn’t exist or isn’t writable.

  • TypeError – If any of the processor or factory arguments are of incorrect type.
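
The ValueError condition above can be illustrated with a small standalone check. This is only a sketch: validate_output_dir is a hypothetical helper, not part of academic_metrics, and the real constructor may validate the path differently.

```python
import os


def validate_output_dir(output_dir_path: str) -> str:
    """Hypothetical helper: return the path if it is a writable directory."""
    if not os.path.isdir(output_dir_path):
        raise ValueError(f"Output directory does not exist: {output_dir_path}")
    if not os.access(output_dir_path, os.W_OK):
        raise ValueError(f"Output directory is not writable: {output_dir_path}")
    return output_dir_path
```

Running a check like this before constructing the orchestrator surfaces a bad path early instead of at save time.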

run_orchestrator(category_data=None)[source]#

Execute the main data processing workflow.

Processes the raw publication data through the following stages:

  1. Processes data through CategoryProcessor
  2. Gets category data for faculty set refinement
  3. Refines faculty sets to remove duplicates
  4. Refines faculty statistics with name variations
  5. Saves all processed results to files

Raises:
  • ValueError – If category data processing fails.

  • IOError – If saving results to files fails.

Return type:

None

get_final_category_data()[source]#

Retrieve the processed category data.

Returns:

List of processed category data dictionaries.

Return type:

List[Dict]

Raises:

ValueError – If final category data hasn’t been generated yet.
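
The getters in this class share one contract: they raise ValueError until the corresponding result exists. A minimal sketch of that pattern follows. ResultHolder and its setter are hypothetical stand-ins, and using None as the "not generated" marker is an assumption rather than the library's documented behavior.

```python
from typing import Any, Dict, List, Optional


class ResultHolder:
    """Hypothetical stand-in showing the 'raise until generated' getter pattern."""

    def __init__(self) -> None:
        # Assumption: None marks "not generated yet"; an empty list is a valid result.
        self._final_category_data: Optional[List[Dict[str, Any]]] = None

    def set_final_category_data(self, rows: List[Dict[str, Any]]) -> None:
        self._final_category_data = rows

    def get_final_category_data(self) -> List[Dict[str, Any]]:
        if self._final_category_data is None:
            raise ValueError("Final category data has not been generated yet.")
        return self._final_category_data
```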

get_final_faculty_data()[source]#

Retrieve the processed faculty data.

Returns:

List of processed faculty data dictionaries.

Return type:

List[Dict]

Raises:

ValueError – If final faculty data hasn’t been generated yet.

get_final_global_faculty_data()[source]#

Retrieve the processed global faculty data.

Returns:

List of processed global faculty data dictionaries.

Return type:

List[Dict]

Raises:

ValueError – If final global faculty data hasn’t been generated yet.

get_final_article_stats_data()[source]#

Retrieve the processed article statistics data.

Returns:

List of processed article statistics dictionaries.

Return type:

List[Dict]

Raises:

ValueError – If final article statistics data hasn’t been generated yet.

get_final_article_data()[source]#

Retrieve the processed article data.

Returns:

List of processed article data dictionaries.

Return type:

List[Dict]

Raises:

ValueError – If final article data hasn’t been generated yet.

_save_all_results()[source]#

Save all processed data to their respective JSON files.

Serializes and saves:

  1. Category data
  2. Faculty statistics
  3. Article statistics
  4. Article details
  5. Global faculty statistics

Raises:

IOError – If any file operations fail.

Return type:

None

_refine_faculty(faculty_postprocessor, category_dict)[source]#

Refines faculty sets by removing near duplicates and updating counts.

Uses FacultyPostprocessor to clean faculty data by removing near-duplicate entries and updating all related faculty and department counts.

Parameters:
  • faculty_postprocessor (FacultyPostprocessor) – Postprocessor for faculty data. Type: academic_metrics.postprocessors.faculty_postprocessor.FacultyPostprocessor

  • category_dict (dict) – Dictionary of categories and their information. Type: Dict[str, academic_metrics.models.category_info.CategoryInfo]

Return type:

None

Notes

  • Removes near-duplicate faculty entries

  • Updates faculty counts per category

  • Updates department counts

  • Maintains faculty-department relationships

  • Ensures data consistency after refinement
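
How FacultyPostprocessor detects near duplicates is not described on this page. As one possible illustration, the sketch below uses difflib.SequenceMatcher from the standard library; the function name, the 0.9 threshold, and the "keep the first spelling seen" rule are all assumptions.

```python
from difflib import SequenceMatcher
from typing import Iterable, List


def _normalize(name: str) -> str:
    # Lowercase and collapse runs of whitespace before comparing.
    return " ".join(name.lower().split())


def remove_near_duplicates(names: Iterable[str], threshold: float = 0.9) -> List[str]:
    """Keep the first spelling seen; drop later names that are nearly identical."""
    kept: List[str] = []
    for name in names:
        candidate = _normalize(name)
        is_duplicate = any(
            SequenceMatcher(None, candidate, _normalize(existing)).ratio() >= threshold
            for existing in kept
        )
        if not is_duplicate:
            kept.append(name)
    return kept


faculty = ["John Smith", "Jon Smith", "Alice Wong"]
unique_faculty = remove_near_duplicates(faculty)
```

After a pass like this, per-category faculty counts would be recomputed from the surviving names, which matches the "updates counts" notes above.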

_refine_faculty_stats(*, faculty_stats, variations, category_dict)[source]#

Refines faculty statistics based on name variations.

Processes faculty statistics to account for name variations, ensuring accurate attribution of publications and metrics across all faculty members.

Parameters:
  • faculty_stats (Dict) – Dictionary of faculty statistics. Type: Dict[str, academic_metrics.models.faculty_stats.FacultyStats]

  • variations (Dict) – Dictionary of name variations. Type: Dict[str, academic_metrics.models.string_variation.StringVariation]

  • category_dict (Dict) – Dictionary of categories and their information. Type: Dict[str, academic_metrics.models.category_info.CategoryInfo]

Return type:

None

Notes

  • Iterates through all categories

  • Processes each faculty member

  • Applies name variation matching

  • Updates publication counts

  • Ensures metric consistency

  • Maintains statistical accuracy
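
The idea of attributing metrics across name variations can be sketched with plain dictionaries. The real FacultyStats and StringVariation models are not shown on this page, so the function and the variation-to-canonical mapping below are hypothetical stand-ins.

```python
from collections import defaultdict
from typing import Dict


def merge_counts_by_variation(
    counts: Dict[str, int], variation_to_canonical: Dict[str, str]
) -> Dict[str, int]:
    """Sum publication counts under each canonical name."""
    merged: Dict[str, int] = defaultdict(int)
    for name, count in counts.items():
        # Names with no known variation are treated as already canonical.
        canonical = variation_to_canonical.get(name, name)
        merged[canonical] += count
    return dict(merged)
```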

_refine_departments(department_postprocessor, category_dict)[source]#

Processes department sets by removing near duplicates and updates counts.

Uses DepartmentPostprocessor to clean department data by removing near-duplicate entries and updating all related department counts and relationships.

Parameters:
  • department_postprocessor (DepartmentPostprocessor) – Postprocessor for department data. Type: academic_metrics.postprocessors.department_postprocessor.DepartmentPostprocessor

  • category_dict (dict) – Dictionary of categories and their information. Type: Dict[str, academic_metrics.models.category_info.CategoryInfo]

Return type:

None

Notes

  • Removes near-duplicate department entries

  • Updates department counts per category

  • Maintains faculty-department relationships

  • Ensures naming consistency

  • Preserves hierarchical relationships

  • Updates all related statistics

_clean_category_data(category_data)[source]#

Prepare category data by removing unwanted keys.

Cleans the raw category data by removing specified keys that are not needed for further processing or analysis.

Parameters:

category_data (Dict) – Raw category data to clean. Type: Dict[str, academic_metrics.models.category_info.CategoryInfo]

Returns:

Cleaned category data with specified keys removed.

Type: Dict[str, Dict]

Return type:

dict

Notes

  • Identifies and removes unnecessary keys

  • Preserves essential category information

  • Ensures data consistency

  • Prepares data for downstream processing
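
A minimal sketch of the key-removal step, assuming the category data has already been converted to plain dictionaries. Which keys the library actually removes is not listed here, so the key names in the usage line are made up for illustration.

```python
from typing import Any, Dict, Iterable


def drop_keys(
    category_data: Dict[str, Dict[str, Any]], unwanted: Iterable[str]
) -> Dict[str, Dict[str, Any]]:
    """Return a copy of each category dict without the unwanted keys."""
    unwanted_set = set(unwanted)
    return {
        category: {k: v for k, v in info.items() if k not in unwanted_set}
        for category, info in category_data.items()
    }


# "scratch" is a hypothetical key name used only for this example.
cleaned = drop_keys({"Biology": {"faculty_count": 3, "scratch": [1, 2]}}, ["scratch"])
```

Building a new dictionary instead of deleting keys in place leaves the input untouched for any later stage that still needs it.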

_serialize_and_save_category_data(*, output_path, category_data)[source]#

Serialize and save category data to JSON file.

Converts the category data dictionary to JSON format and saves it to the specified file path, creating directories if needed.

Parameters:
  • output_path (str) – Path where the JSON file will be saved. Type: str

  • category_data (Dict) – Category data to serialize. Type: Dict[str, Dict]

Raises:

IOError – If file writing fails or directory creation fails

Return type:

None

Notes

  • Creates output directory if needed

  • Serializes data to JSON format

  • Handles nested dictionary structures

  • Ensures proper file encoding

  • Validates output before saving
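
The serialize-and-save methods all follow the same shape: create the directory, convert model objects to dictionaries, and write JSON. The sketch below shows that shape with a hypothetical CategoryRecord dataclass; it is not the library's CategoryInfo model, and the output layout (a JSON list) is an assumption.

```python
import json
import os
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class CategoryRecord:
    """Hypothetical stand-in for a category model."""
    name: str
    faculty_count: int = 0
    faculty: List[str] = field(default_factory=list)


def save_categories(output_path: str, categories: Dict[str, CategoryRecord]) -> None:
    """Serialize the records to a JSON list, creating parent directories if needed."""
    parent = os.path.dirname(output_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    payload = [asdict(record) for record in categories.values()]
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=4)
```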

_serialize_and_save_faculty_stats(*, output_path, faculty_stats)[source]#

Serialize and save faculty statistics to JSON file.

Converts the faculty statistics dictionary to JSON format and saves it to the specified file path, creating directories if needed.

Parameters:
  • output_path (str) – Path where the JSON file will be saved. Type: str

  • faculty_stats (Dict) – Faculty statistics to serialize. Type: Dict[str, academic_metrics.models.faculty_stats.FacultyStats]

Raises:

IOError – If file writing fails or directory creation fails

Return type:

None

Notes

  • Creates output directory if needed

  • Serializes data to JSON format

  • Handles nested dictionary structures

  • Ensures proper file encoding

  • Validates output before saving

_serialize_and_save_global_faculty_stats(*, output_path, global_faculty_stats)[source]#

Serialize and save global faculty statistics to JSON file.

Converts the global faculty statistics dictionary to JSON format and saves it to the specified file path, creating directories if needed.

Parameters:
  • output_path (str) – Path where the JSON file will be saved. Type: str

  • global_faculty_stats (Dict) – Global faculty statistics to serialize. Type: Dict[str, academic_metrics.models.faculty_stats.FacultyStats]

Raises:

IOError – If file writing fails or directory creation fails

Return type:

None

Notes

  • Creates output directory if needed

  • Serializes data to JSON format

  • Handles nested dictionary structures

  • Ensures proper file encoding

  • Validates output before saving

_serialize_and_save_category_article_stats(*, output_path, article_stats)[source]#

Serialize and save article statistics to JSON file.

Converts the category article statistics dictionary to JSON format and saves it to the specified file path, creating directories if needed.

Parameters:
  • output_path (str) – Path where the JSON file will be saved. Type: str

  • article_stats (Dict) – Article statistics to serialize. Type: Dict[str, academic_metrics.models.crossref_article_stats.CrossrefArticleStats]

Raises:

IOError – If file writing fails or directory creation fails

Return type:

None

Notes

  • Creates output directory if needed

  • Serializes data to JSON format

  • Handles nested dictionary structures

  • Ensures proper file encoding

  • Validates output before saving

_serialize_and_save_articles(*, output_path, articles)[source]#

Serialize and save article details to JSON file.

Converts the list of article details to JSON format and saves it to the specified file path, creating directories if needed.

Parameters:
  • output_path (str) – Path where the JSON file will be saved. Type: str

  • articles (List) – Article details to serialize. Type: List[academic_metrics.models.crossref_article_details.CrossrefArticleDetails]

Raises:

IOError – If file writing fails or directory creation fails

Return type:

None

Notes

  • Creates output directory if needed

  • Serializes data to JSON format

  • Handles complex article objects

  • Ensures proper file encoding

  • Validates output before saving

_flatten_to_list(data)[source]#

Recursively flatten nested dictionaries/lists into a flat list.

Transforms a complex nested structure of dictionaries and lists into a single flat list of dictionaries, preserving all data.

Parameters:

data (Union[Dict, List]) – Nested structure of dictionaries and lists. Type: Union[Dict[str, Any], List[Dict[str, Any]]]

Returns:

Flattened list of dictionaries.

Type: List[Dict[str, Any]]

Return type:

List

Notes

  • Handles arbitrary nesting depth

  • Preserves dictionary contents

  • Maintains data relationships

  • Removes nested structure

  • Keeps all original values

Examples

Input:

    {
        "cat1": {
            "article_map": {
                "doi1": {"title": "Article 1"},
                "doi2": {"title": "Article 2"}
            }
        }
    }

Output:

    [
        {"title": "Article 1"},
        {"title": "Article 2"}
    ]
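
A recursive flattening of this kind can be sketched as follows. The rule used to tell a container from a leaf (a non-empty dictionary whose values are all dictionaries or lists is a container, anything else is a leaf) is an assumption chosen to reproduce the example above; the library's own rule may differ.

```python
from typing import Any, Dict, List, Union

Nested = Union[Dict[str, Any], List[Any]]


def flatten_to_list(data: Nested) -> List[Dict[str, Any]]:
    """Collect leaf dictionaries from arbitrarily nested dicts and lists."""
    flat: List[Dict[str, Any]] = []
    if isinstance(data, list):
        for element in data:
            if isinstance(element, (dict, list)):
                flat.extend(flatten_to_list(element))
    elif isinstance(data, dict):
        values = list(data.values())
        is_container = bool(values) and all(isinstance(v, (dict, list)) for v in values)
        if is_container:
            for value in values:
                flat.extend(flatten_to_list(value))
        else:
            # Assumption: a dict holding at least one scalar value is a leaf record.
            flat.append(data)
    return flat


nested = {
    "cat1": {
        "article_map": {
            "doi1": {"title": "Article 1"},
            "doi2": {"title": "Article 2"},
        }
    }
}
articles = flatten_to_list(nested)
```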

_write_to_json(data, output_path)[source]#

Write data to JSON file, handling extend mode.

Writes the provided data to a JSON file at the specified path, creating directories if needed and handling both new files and file extensions.

Parameters:
  • data (Union[List[Dict], Dict]) – Data to write to file. Type: Union[List[Dict[str, Any]], Dict[str, Any]]

  • output_path (str) – Path where the JSON file will be saved. Type: str

Raises:

IOError – If file operations fail (creation, writing, or directory access)

Return type:

None

Notes

  • Creates output directory if needed

  • Handles both new and existing files

  • Supports list and dictionary data

  • Ensures proper JSON formatting

  • Validates file permissions

  • Maintains data integrity
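
Extend mode can be pictured as "load, merge, rewrite". The sketch below is an assumption about those semantics, not the library's code: the standalone function takes extend as a parameter (the documented method has no such parameter), appends when both sides are lists, and merges keys when both are dictionaries.

```python
import json
import os
from typing import Any, Dict, List, Union

JsonData = Union[List[Dict[str, Any]], Dict[str, Any]]


def write_to_json(data: JsonData, output_path: str, extend: bool = False) -> None:
    """Write data as JSON; with extend=True, merge into an existing file first."""
    parent = os.path.dirname(output_path)
    if parent:
        os.makedirs(parent, exist_ok=True)

    if extend and os.path.exists(output_path):
        with open(output_path, "r", encoding="utf-8") as handle:
            existing = json.load(handle)
        if isinstance(existing, list) and isinstance(data, list):
            data = existing + data
        elif isinstance(existing, dict) and isinstance(data, dict):
            data = {**existing, **data}
        else:
            raise ValueError("Existing file and new data have different JSON shapes.")

    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=4)
```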

orchestrators.classification_orchestrator module#

academic_metrics.orchestrators.classification_orchestrator.ClassificationResultsDict#

Type alias for a dictionary mapping DOIs to lists of classification results.

This type alias is used to represent the return type of the get_classification_results_by_doi() method.

alias of Dict[str, List[str]]

academic_metrics.orchestrators.classification_orchestrator.ClassificationResultsTuple#

Type alias for a tuple containing lists of classification results.

This type alias is used to represent the return type of the get_classification_results_by_doi() method.

Notes

  • Format of the tuple is (top_categories, mid_categories, low_categories, themes)

alias of Tuple[List[str], List[str], List[str], List[str]]

class academic_metrics.orchestrators.classification_orchestrator.ClassificationOrchestrator(abstract_classifier_factory, utilities)[source]#

Bases: object

Manages the classification process for research abstracts.

Orchestrates the process of extracting DOIs and abstracts from research metadata, classifying them using AbstractClassifier, and integrating results back into the original data. Tracks unclassified items for monitoring.

abstract_classifier_factory#

Factory function for AbstractClassifier instances.

Type:

Callable[…, AbstractClassifier]

taxonomy#

Classification hierarchy for AbstractClassifier.

Type:

Taxonomy

utilities#

Utilities for attribute extraction.

Type:

Utilities

ai_api_key#

API key for AI service access.

Type:

str

unclassified_item_count#

Count of unclassified items. Type: int

Type:

int

unclassified_dois#

DOIs of unclassified items. Type: List[str]

Type:

List

unclassified_abstracts#

Abstracts of unclassified items. Type: List[str]

Type:

List

unclassified_doi_abstract_dict#

Maps unclassified DOIs to abstracts. Type: Dict[str, str]

Type:

Dict

unclassified_items#

Complete metadata of unclassified items. Type: List[Dict[str, Any]]

Type:

List

unclassified_details#

Organized unclassified data. Type: Dict[str, Union[List[str], List[Dict[str, Any]]]]. Contains:

  • dois: List of unclassified DOIs

  • abstracts: List of unclassified abstracts

  • items: List of unclassified metadata items

Type:

Dict

run_classification() → List[Dict][source]#

Processes and classifies a list of research metadata dictionaries.

get_unclassified_item_count() → int[source]#

Returns the number of unclassified items.

get_unclassified_dois() → List[str][source]#

Returns the DOIs of unclassified items.

get_unclassified_abstracts() → List[str][source]#

Returns the abstracts of unclassified items.

get_unclassified_doi_abstract_dict() → Dict[str, str][source]#

Returns the DOI to abstract mapping dictionary for unclassified items.

get_unclassified_items() → List[Dict][source]#

Returns the unclassified items.

get_unclassified_details_dict() → Dict[source]#

Returns the details of unclassified items.

_classification_orchestrator() → List[Dict][source]#

Core classification logic for processing research metadata.

_inject_categories() → None[source]#

Adds classification results to a research metadata dictionary.

_extract_categories() → ClassificationResultsDict | ClassificationResultsTuple[source]#

Gets classification results for a specific DOI.

_make_doi_abstract_dict() → Dict[str, str][source]#

Creates a DOI to abstract mapping dictionary.

_retrieve_doi_abstract() → Tuple[str, str]#

Extracts DOI and abstract from a research metadata dictionary.

_update_classified_instance_variables() → None[source]#

Updates tracking variables for unclassified items.

_set_classification_ran_true() → None[source]#

Sets the classification ran flag to true.

_has_ran_classification() → bool[source]#

Checks if classification has been run.

_validate_classification_ran() → None[source]#

Validates if classification has been run.

_normalize_abstract() → str[source]#

Normalizes an abstract by removing LaTeX and converting any resulting unicode to ASCII.

__init__(abstract_classifier_factory, utilities)[source]#

Initialize the ClassificationOrchestrator.

Sets up the orchestrator with required dependencies for classifying research abstracts and managing the classification process.

Parameters:
  • abstract_classifier_factory (Callable) – Factory function for AbstractClassifier. Type: Callable[[Dict[str, str]], AbstractClassifier]

  • utilities (Utilities) – Utilities instance for attribute extraction. Type: Utilities

Returns:

None

Notes

  • Initializes tracking variables for unclassified items

  • Sets up classification status flags

  • Prepares data structures for results

  • Validates factory function compatibility

run_classification(data, pre_classification_model='gpt-4o-mini', classification_model='gpt-4o-mini', theme_model='gpt-4o-mini')[source]#

Processes and classifies a list of research metadata dictionaries.

Extracts abstracts from research metadata, classifies them using specified AI models, and injects the classification results back into the original data.

Parameters:
  • data (list) – List of dictionaries containing research metadata. Type: List[Dict[str, Any]]

  • pre_classification_model (str | None) – Model for pre-classification processing. Type: str | None. Defaults to “gpt-4o-mini”.

  • classification_model (str | None) – Model for main classification. Type: str | None. Defaults to “gpt-4o-mini”.

  • theme_model (str | None) – Model for theme extraction. Type: str | None. Defaults to “gpt-4o-mini”.

Returns:

Modified data with classifications injected.

Type: List[Dict[str, Any]]. Includes:

  • Original metadata

  • Classification results

  • Theme information

  • Processing status

Return type:

List

Notes

  • Processes each item sequentially

  • Tracks unclassified items

  • Handles missing abstracts

  • Updates internal statistics

  • Maintains original data structure

get_unclassified_item_count()[source]#

Gets the number of unclassified items.

Retrieves the count of items that could not be classified during the classification process.

Returns:

Number of unclassified items.

Type: int

Return type:

int

Raises:

RuntimeError – If classification has not been run yet

Notes

  • Validates classification status

  • Returns current count

  • Includes all unclassified types

  • Requires prior classification run
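
All of the get_unclassified_* accessors require a prior classification run and raise RuntimeError otherwise. A minimal sketch of that guard is shown below. ClassificationState is a hypothetical stand-in, and its rule for what counts as unclassified (an empty abstract) is invented for the example.

```python
from typing import Iterable


class ClassificationState:
    """Hypothetical stand-in for the 'must run first' guard."""

    def __init__(self) -> None:
        self._classification_ran = False
        self._unclassified_item_count = 0

    def run(self, abstracts: Iterable[str]) -> None:
        # Stand-in rule: an empty abstract counts as unclassified.
        self._unclassified_item_count = sum(1 for a in abstracts if not a)
        self._classification_ran = True

    def get_unclassified_item_count(self) -> int:
        if not self._classification_ran:
            raise RuntimeError("Classification has not been run yet.")
        return self._unclassified_item_count
```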

get_unclassified_dois()[source]#

Gets the DOIs of unclassified items.

Retrieves the list of Digital Object Identifiers (DOIs) for items that could not be classified during the classification process.

Returns:

List of unclassified DOIs.

Type: List[str]. Empty list if all items were classified.

Return type:

List

Raises:

RuntimeError – If classification has not been run yet

Notes

  • Validates classification status

  • Returns unique DOIs only

  • Maintains original DOI format

  • Requires prior classification run

get_unclassified_abstracts()[source]#

Gets the abstracts of unclassified items.

Retrieves the list of research abstracts for items that could not be classified during the classification process.

Returns:

List of unclassified abstracts.

Type: List[str]. Empty list if all items were classified.

Return type:

List

Raises:

RuntimeError – If classification has not been run yet

Notes

  • Validates classification status

  • Returns normalized abstracts

  • Maintains text formatting

  • Requires prior classification run

  • May include empty abstracts

get_unclassified_doi_abstract_dict()[source]#

Gets the DOI to abstract mapping dictionary for unclassified items.

Retrieves a dictionary that maps Digital Object Identifiers (DOIs) to their corresponding abstracts for items that could not be classified.

Returns:

Dictionary mapping unclassified DOIs to abstracts.

Type: Dict[str, str]. Keys: DOIs (str). Values: Abstracts (str). Empty dict if all items were classified.

Return type:

Dict

Raises:

RuntimeError – If classification has not been run yet

Notes

  • Validates classification status

  • Maintains DOI-abstract relationships

  • Contains normalized abstracts

  • Requires prior classification run

  • Preserves original DOI format

get_unclassified_items()[source]#

Gets the unclassified items.

Retrieves the complete list of research items that could not be classified, including all their original metadata.

Returns:

List of unclassified items with full metadata.

Type: List[Dict[str, Any]]. Empty list if all items were classified. Each dict contains complete item metadata.

Return type:

List

Raises:

RuntimeError – If classification has not been run yet

Notes

  • Validates classification status

  • Returns complete metadata

  • Preserves original structure

  • Requires prior classification run

  • Maintains all item attributes

get_unclassified_details_dict()[source]#

Gets the details of unclassified items.

Retrieves a comprehensive dictionary containing organized information about all unclassified items, including DOIs, abstracts, and complete metadata.

Returns:

Organized details of unclassified items.

Type: Dict[str, Union[List[str], List[Dict[str, Any]]]]. Contains:

  • dois: List[str] – Unclassified DOIs

  • abstracts: List[str] – Unclassified abstracts

  • items: List[Dict] – Complete metadata

Return type:

Dict

Raises:

RuntimeError – If classification has not been run yet

Notes

  • Validates classification status

  • Provides structured access

  • Groups related information

  • Requires prior classification run

  • Maintains data relationships

_classification_orchestrator(data, pre_classification_model='gpt-4o-mini', classification_model='gpt-4o-mini', theme_model='gpt-4o-mini')[source]#

Core classification logic for processing research metadata.

Implements the main classification workflow, processing research metadata through multiple stages of classification and theme extraction.

Parameters:
  • data (List) – List of dictionaries containing research metadata. Type: List[Dict[str, Any]]

  • pre_classification_model (str | None) – Model for pre-classification processing. Type: str | None. Defaults to “gpt-4o-mini”.

  • classification_model (str | None) – Model for main classification. Type: str | None. Defaults to “gpt-4o-mini”.

  • theme_model (str | None) – Model for theme extraction. Type: str | None. Defaults to “gpt-4o-mini”.

Returns:

Modified data with classifications and themes injected.

Type: List[Dict[str, Any]]. Includes:

  • Original metadata

  • Classification results

  • Theme information

  • Processing status

Return type:

List

Notes

  • Processes items sequentially

  • Handles classification failures

  • Tracks unclassified items

  • Updates internal statistics

  • Maintains data integrity

  • Manages model selection

_inject_categories(data, categories)[source]#

Adds classification results to a research metadata dictionary.

Injects classification categories and themes into the provided metadata dictionary, handling both dictionary and tuple result formats.

Parameters:
  • data (Dict) – Research metadata dictionary. Type: Dict[str, Any]

  • categories (Union) –

    Classification results including categories and themes. Type: Union[ClassificationResultsDict, ClassificationResultsTuple], where:

    • ClassificationResultsDict: Dict[str, List[str]]
      Format: {"top_categories": List[str], "mid_categories": List[str], "low_categories": List[str], "themes": List[str]}

    • ClassificationResultsTuple: Tuple[List[str], List[str], List[str], List[str]]
      Format: (top_level_categories: List[str], mid_level_categories: List[str], low_level_categories: List[str], themes: List[str])

Raises:

ValueError – If categories is neither a dict nor a tuple

Return type:

None

Notes

  • Modifies input dictionary in-place

  • Handles both result formats

  • Preserves existing metadata

  • Validates category structure

  • Maintains hierarchical relationships

  • Provides default empty lists for missing dictionary keys
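
The two accepted result formats and the ValueError fallback can be sketched as below. The type aliases match the ones documented for this module; writing the results to top-level keys of the metadata dictionary is an assumption, since the page does not say where they are stored.

```python
from typing import Any, Dict, List, Tuple, Union

ClassificationResultsDict = Dict[str, List[str]]
ClassificationResultsTuple = Tuple[List[str], List[str], List[str], List[str]]

KEYS = ("top_categories", "mid_categories", "low_categories", "themes")


def inject_categories(
    data: Dict[str, Any],
    categories: Union[ClassificationResultsDict, ClassificationResultsTuple],
) -> None:
    """Add classification results to data in place."""
    if isinstance(categories, dict):
        # Missing dictionary keys fall back to empty lists.
        values = [categories.get(key, []) for key in KEYS]
    elif isinstance(categories, tuple):
        values = list(categories)
    else:
        raise ValueError("categories must be a dict or a tuple")
    # Assumption: results are stored as top-level keys of the metadata dict.
    for key, value in zip(KEYS, values):
        data[key] = value
```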

_extract_categories(doi, classifier)[source]#

Gets classification results for a specific DOI.

Retrieves the classification categories and themes for a given DOI using the provided classifier instance. Supports both dictionary and tuple result formats.

Parameters:
  • doi (str) – DOI to look up.

  • classifier (AbstractClassifier) – Classifier instance used to obtain the results.

Returns:

Classification results including categories and themes.

Type: Union[ClassificationResultsDict, ClassificationResultsTuple], where:

  • ClassificationResultsDict:
    Format: {"top_categories": List[str], "mid_categories": List[str], "low_categories": List[str], "themes": List[str]}

  • ClassificationResultsTuple:
    Format: (top_level_categories: List[str], mid_level_categories: List[str], low_level_categories: List[str], themes: List[str])

Return type:

Union

Notes

  • Utilizes classifier to obtain results

  • Supports multiple result formats

  • Ensures DOI is valid and classified

  • Handles missing classification gracefully

_make_doi_abstract_dict(doi, abstract)[source]#

Creates a DOI to abstract mapping dictionary.

Constructs a dictionary that maps a given DOI to its corresponding abstract, ensuring both values are provided.

Parameters:
  • doi (str) – DOI identifier for the research item. Type: str

  • abstract (str) – Research abstract text. Type: str

Returns:

Dictionary mapping DOI to abstract.

Type: Dict[str, str]. Format: {doi: abstract}

Return type:

dict

Raises:

ValueError – If either DOI or abstract is missing

Notes

  • Ensures both DOI and abstract are non-empty

  • Provides a simple mapping structure

  • Validates input before mapping
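
The behavior described above is small enough to sketch in full. This is an illustration of the documented contract (single-entry mapping, ValueError on a missing value), not the library's source, and the DOI in the usage line is a made-up placeholder.

```python
from typing import Dict


def make_doi_abstract_dict(doi: str, abstract: str) -> Dict[str, str]:
    """Map one DOI to its abstract, rejecting empty or missing values."""
    if not doi or not abstract:
        raise ValueError("Both DOI and abstract are required.")
    return {doi: abstract}


# "10.1234/example" is a placeholder DOI used only for this example.
mapping = make_doi_abstract_dict("10.1234/example", "We study an example.")
```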

_get_classification_dependencies(item)[source]#

Extracts DOI, abstract, and extra context from a research metadata dictionary.

Uses the utilities module to safely extract required attributes from the research metadata, handling missing or invalid values.

Parameters:

item (dict) – Research metadata dictionary. Type: Dict[str, Any]

Returns:

DOI, abstract, and extra context.

Type: Tuple[str, str, dict]
Format: (doi: str | None, abstract: str | None, extra_context: dict | None)

Return type:

tuple

_update_classified_instance_variables(item, doi, abstract)[source]#

Updates tracking variables for unclassified items.

Maintains multiple tracking collections for items that couldn’t be classified, ensuring consistent record-keeping across different data structures.

Parameters:
  • item (dict) – Research metadata dictionary. Type: Dict[str, Any]

  • doi (str) – DOI identifier. Type: str | None

  • abstract (str) – Research abstract text. Type: str | None

Return type:

None

Returns:

None
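
Keeping several tracking collections in step can be sketched as follows. UnclassifiedTracker is a hypothetical stand-in; in particular, skipping a missing DOI or abstract rather than storing None is an assumption.

```python
from typing import Any, Dict, List, Optional


class UnclassifiedTracker:
    """Hypothetical stand-in that records unclassified items in parallel collections."""

    def __init__(self) -> None:
        self.count = 0
        self.dois: List[str] = []
        self.abstracts: List[str] = []
        self.doi_abstract: Dict[str, str] = {}
        self.items: List[Dict[str, Any]] = []

    def record(self, item: Dict[str, Any], doi: Optional[str], abstract: Optional[str]) -> None:
        self.count += 1
        self.items.append(item)
        # Assumption: missing values are skipped instead of stored as None.
        if doi:
            self.dois.append(doi)
        if abstract:
            self.abstracts.append(abstract)
        if doi and abstract:
            self.doi_abstract[doi] = abstract

    def details(self) -> Dict[str, Any]:
        return {"dois": self.dois, "abstracts": self.abstracts, "items": self.items}
```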

_set_classification_ran_true()[source]#

Sets the classification ran flag to true.

Updates the internal state to indicate that classification process has been executed.

Parameters:

None

Return type:

None

Returns:

None

Notes

  • Updates _classification_ran

  • Used for validation checks

  • State cannot be reset to false

_has_ran_classification()[source]#

Checks if classification has been run.

Returns:

True if classification has been run, False otherwise.

Return type:

bool

_validate_classification_ran(classification_ran)[source]#

Checks if classification has been run.

Verifies whether the classification process has been executed by checking the internal state flag.

Parameters:

None

Returns:

True if classification has been run, False otherwise.

Type: bool

Return type:

bool

Notes

  • Reads _classification_ran

  • Used for validation before accessing results

  • Cannot detect if classification is currently running

_normalize_abstract(abstract)[source]#

Normalizes an abstract by removing LaTeX and converting any resulting unicode to ASCII.

Processes research abstract text through two stages:

  1. Converts LaTeX notation to unicode text
  2. Converts unicode characters to ASCII equivalents

Parameters:

abstract (str) – Research abstract text. Type: str. May contain LaTeX math notation and unicode characters.

Returns:

Normalized abstract text.

Type: str. Contains only ASCII characters.

Return type:

str

Notes

  • Uses LatexNodes2Text for LaTeX conversion

  • Uses Unidecode for unicode to ASCII conversion

  • Handles mathematical notation

  • Preserves text structure

  • Removes special characters

  • Math mode set to “text” for consistent conversion
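
The second stage (unicode to ASCII) can be approximated with the standard library alone. The sketch below swaps Unidecode for unicodedata so that it runs without extra packages; it strips diacritics and drops characters that have no ASCII decomposition, whereas Unidecode transliterates more broadly. The LaTeX stage is not shown, and to_ascii is a hypothetical name.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Strip diacritics via NFKD decomposition and drop remaining non-ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```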

Module contents#