AI package#

Submodules#

AI.abstract_classifier module#

class academic_metrics.AI.abstract_classifier.AbstractClassifier(taxonomy, doi_to_abstract_dict, api_key, log_to_console=True, extra_context=None, pre_classification_model='gpt-4o-mini', classification_model='gpt-4o-mini', theme_model='gpt-4o-mini', max_classification_retries=3)[source]#

Bases: object

A class for processing research paper abstracts through AI-powered analysis and classification.

This class manages a complete pipeline for analyzing academic paper abstracts, including:

- Method extraction from abstracts
- Sentence-by-sentence analysis
- Abstract summarization
- Hierarchical taxonomy classification
- Theme recognition and analysis

The pipeline uses three separate chain managers for different stages of processing:

1. Pre-classification: Method extraction, sentence analysis, and summarization
2. Classification: Hierarchical taxonomy classification
3. Theme Recognition: Theme identification and analysis

Parameters:

taxonomy (Taxonomy) – Taxonomy instance containing the classification hierarchy
doi_to_abstract_dict (Dict[str, str]) – Mapping of DOIs to abstract texts
api_key (str) – API key for LLM access
log_to_console (bool, optional) – Whether to log output to console. Defaults to True
extra_context (Dict[str, Any], optional) – Additional context for classification. Defaults to None
pre_classification_model (str, optional) – Model name for pre-classification tasks. Defaults to “gpt-4o-mini”
classification_model (str, optional) – Model name for classification tasks. Defaults to “gpt-4o-mini”
theme_model (str, optional) – Model name for theme recognition tasks. Defaults to “gpt-4o-mini”
max_classification_retries (int, optional) – Maximum retries for failed classifications. Defaults to 3

classification_results#

Processed results by DOI, containing categories and themes

Type: Dict[str, Dict]

raw_classification_outputs#

Raw outputs from the classification chain

Type: List[Dict]

raw_theme_outputs#

Raw theme analysis results by DOI

Type: Dict[str, Dict]

classify()[source]#: Process all abstracts through the complete pipeline

get_classification_results_by_doi(doi, return_type)[source]#: Get results for a specific DOI

get_raw_classification_outputs()[source]#: Get all raw classification outputs

get_raw_theme_results()[source]#: Get all raw theme analysis results

save_classification_results(output_path)[source]#: Save processed results to JSON

save_raw_classification_results(output_path)[source]#: Save raw classification outputs

save_raw_theme_results(output_path)[source]#: Save raw theme results

Raises:

ValueError – If required attributes are missing or invalid
TypeError – If api_key cannot be converted to string

__init__(taxonomy, doi_to_abstract_dict, api_key, log_to_console=True, extra_context=None, pre_classification_model='gpt-4o-mini', classification_model='gpt-4o-mini', theme_model='gpt-4o-mini', max_classification_retries=3)[source]#

Initializes a new AbstractClassifier instance.

Sets up the complete classification pipeline including chain managers for pre-classification, classification, and theme recognition. Initializes data structures for storing results and configures logging.

Parameters:

taxonomy (academic_metrics.utils.taxonomy_util.Taxonomy) – Taxonomy instance containing the hierarchical category structure.
doi_to_abstract_dict (Dict[str, str]) – Dictionary mapping DOIs to their abstract texts.
api_key (str) – API key for accessing the language model service.
log_to_console (bool | None) – Whether to output logs to console. Defaults to LOG_TO_CONSOLE config value.
extra_context (Dict[str, Any] | None) – Additional context for classification. Defaults to None.
pre_classification_model (str | None) – Model name for pre-classification tasks. Defaults to “gpt-4o-mini”.
classification_model (str | None) – Model name for classification tasks. Defaults to “gpt-4o-mini”.
theme_model (str | None) – Model name for theme recognition tasks. Defaults to “gpt-4o-mini”.
max_classification_retries (int | None) – Maximum attempts for failed classifications. Defaults to 3.

Raises:

ValueError – If api_key is empty or invalid.
TypeError – If api_key cannot be converted to string.
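As a rough illustration of how the constructor and the public methods documented on this page fit together, the sketch below wraps one full run in a helper. The helper name run_pipeline and its parameters are hypothetical; only the import path, the keyword arguments, and the method names come from this page. How the Taxonomy instance is built is not covered here, so it is passed in as an argument.

```python
from typing import Any, Dict


def run_pipeline(
    taxonomy: Any,
    doi_to_abstract: Dict[str, str],
    api_key: str,
    output_path: str,
) -> Dict[str, Dict[str, Any]]:
    """Classify every abstract, save the results, and return them keyed by DOI."""
    # Imported lazily so the sketch can be loaded without the package installed.
    from academic_metrics.AI.abstract_classifier import AbstractClassifier

    classifier = AbstractClassifier(
        taxonomy=taxonomy,
        doi_to_abstract_dict=doi_to_abstract,
        api_key=api_key,
        log_to_console=False,
        max_classification_retries=3,
    )
    # classify() and the save_* methods are documented as returning self,
    # so the calls can be chained.
    classifier.classify().save_classification_results(output_path)
    return {
        doi: classifier.get_classification_results_by_doi(doi, return_type=dict)
        for doi in doi_to_abstract
    }
```

Running the helper requires the package, a valid API key, and network access, since classify() calls the configured language models.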

_run_initial_api_key_validation(api_key)[source]#

Validates the API key format and presence during initialization.

Performs initial validation of the API key to ensure it exists and can be converted to a string type. Called during class initialization before any API operations are attempted.

Parameters:

api_key (str) – The API key to validate. Should be a non-empty string or a value that can be converted to a string.

Raises:

ValueError – If the API key is None, is empty, or cannot be converted to a string.

Return type:

None
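The checks described above might look roughly like the following. This is a minimal sketch, not the library's implementation: the function name validate_api_key is hypothetical, and unlike the documented method (which returns None) it returns the normalized key so the result is easy to inspect.

```python
from typing import Any


def validate_api_key(api_key: Any) -> str:
    """Return the key as a non-empty string, or raise ValueError."""
    if api_key is None:
        raise ValueError("API key is required")
    try:
        key = str(api_key).strip()
    except Exception as exc:  # str() can fail for objects with a broken __str__
        raise ValueError("API key could not be converted to a string") from exc
    if not key:
        raise ValueError("API key must not be empty")
    return key
```

Validating before any chain manager is created means a missing key fails immediately rather than on the first model call.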

_initialize_pre_classification_chain_manager()[source]#

Initializes a new ChainManager instance for pre-classification tasks.

Creates and configures a ChainManager specifically for the pre-classification stage of the pipeline, which includes method extraction, sentence analysis, and abstract summarization.

Returns:

A new ChainManager instance (academic_metrics.ChainBuilder.ChainBuilder.ChainManager) configured with:

- Model: self._pre_classification_model
- Temperature: 0.0 (deterministic outputs)
- Console logging: Based on self.log_to_console setting

Return type:

ChainManager

_initialize_classification_chain_manager()[source]#

Initializes a new ChainManager instance for taxonomy classification tasks.

Creates and configures a ChainManager specifically for the classification stage of the pipeline, which handles the hierarchical taxonomy classification of abstracts at all levels (top, mid, and low).

Returns:

A new ChainManager instance (academic_metrics.ChainBuilder.ChainBuilder.ChainManager) configured with:

- Model: self._classification_model
- Temperature: 0.0 (deterministic outputs)
- Console logging: Based on self.log_to_console setting

Return type:

ChainManager

_initialize_theme_chain_manager()[source]#

Initializes a new ChainManager instance for theme recognition tasks.

Creates and configures a ChainManager specifically for the theme recognition stage of the pipeline, which identifies key themes and concepts from classified abstracts.

Parameters:

None

Returns:

A new ChainManager instance (academic_metrics.ChainBuilder.ChainBuilder.ChainManager) configured with:

- Model: self._theme_model
- Temperature: 0.9 (creative theme generation)
- Console logging: Based on self.log_to_console setting

Return type:

ChainManager

_add_method_extraction_layer(chain_manager)[source]#

Adds the method extraction processing layer to the chain manager.

This layer analyzes abstracts to identify and extract research methods, techniques, and approaches used in the paper.

Parameters:

chain_manager (academic_metrics.ChainBuilder.ChainBuilder.ChainManager) – The chain manager to add the layer to.

Returns:

self, to allow method chaining (academic_metrics.AI.abstract_classifier.AbstractClassifier).

Return type:

Self

Notes

System prompt: METHOD_EXTRACTION_SYSTEM_MESSAGE
Human prompt: HUMAN_MESSAGE_PROMPT
Primary parser: JSON with MethodExtractionOutput Pydantic model
Fallback parser: String output if JSON parsing fails
Output key: “method_json_output”
No preprocessor or postprocessor
No output key error ignoring

_add_sentence_analysis_layer(chain_manager)[source]#

Adds the sentence-by-sentence analysis layer to the chain manager.

This layer performs detailed analysis of each sentence in the abstract, identifying key components like objectives, methods, results, and conclusions.

Parameters:

chain_manager (academic_metrics.ChainBuilder.ChainBuilder.ChainManager) – The chain manager to add the layer to.

Returns:

self, to allow method chaining (academic_metrics.AI.abstract_classifier.AbstractClassifier).

Return type:

Self

Notes

System prompt: ABSTRACT_SENTENCE_ANALYSIS_SYSTEM_MESSAGE
Human prompt: HUMAN_MESSAGE_PROMPT
Primary parser: JSON with AbstractSentenceAnalysis Pydantic model
Fallback parser: String output if JSON parsing fails
Output key: “sentence_analysis_output”
No preprocessor or postprocessor
No output key error ignoring

_add_summary_layer(chain_manager)[source]#

Adds the abstract summarization layer to the chain manager.

This layer generates a concise summary of the abstract, capturing the main points and key findings in a structured format.

Parameters:

chain_manager (academic_metrics.ChainBuilder.ChainBuilder.ChainManager) – The chain manager to add the layer to.

Returns:

self, to allow method chaining (academic_metrics.AI.abstract_classifier.AbstractClassifier).

Return type:

Self

Notes

System prompt: ABSTRACT_SUMMARY_SYSTEM_MESSAGE
Human prompt: HUMAN_MESSAGE_PROMPT
Primary parser: JSON with AbstractSummary Pydantic model
Fallback parser: String output if JSON parsing fails
Output key: “abstract_summary_output”
No preprocessor or postprocessor
No output key error ignoring

_add_classification_layer(chain_manager)[source]#

Adds the taxonomy classification layer to the chain manager.

This layer performs hierarchical classification of abstracts according to the taxonomy structure, categorizing content at top, mid, and low levels.

Parameters:

chain_manager (academic_metrics.ChainBuilder.ChainBuilder.ChainManager) – The chain manager to add the layer to.

Returns:

self, to allow method chaining (academic_metrics.AI.abstract_classifier.AbstractClassifier).

Return type:

Self

Notes

System prompt: CLASSIFICATION_SYSTEM_MESSAGE
Human prompt: HUMAN_MESSAGE_PROMPT
Primary parser: JSON with ClassificationOutput Pydantic model
No fallback parser (classification must succeed)
Output key: “classification_output”
No preprocessor or postprocessor
No output key error ignoring

_add_theme_recognition_layer(chain_manager)[source]#

Adds the theme recognition layer to the chain manager.

This layer identifies and extracts key themes, concepts, and patterns from the abstract, providing a higher-level thematic analysis.

Parameters:

chain_manager (academic_metrics.ChainBuilder.ChainBuilder.ChainManager) – The chain manager to add the layer to.

Returns:

self, to allow method chaining (academic_metrics.AI.abstract_classifier.AbstractClassifier).

Return type:

Self

Notes

System prompt: THEME_RECOGNITION_SYSTEM_MESSAGE
Human prompt: HUMAN_MESSAGE_PROMPT
Primary parser: JSON with ThemeAnalysis Pydantic model
No fallback parser
Output key: “theme_output”
No preprocessor or postprocessor
No output key error ignoring
Uses higher temperature setting for creative theme generation

_get_classification_results_by_doi(doi)[source]#

Retrieves the raw classification results for a specific DOI.

This private method provides direct access to the classification results dictionary for a given DOI, without theme processing. It’s used internally during the classification pipeline, particularly before theme recognition processing.

Parameters:

doi (str) – The DOI identifier for the abstract to retrieve results for.

Returns:

The raw classification results dictionary (Dict[str, Any]) containing:

- Top-level categories as keys
- Nested dictionaries of mid-level categories
- Lists of low-level categories

Return type:

Dict[str, Any]

Notes

Returns the raw defaultdict structure
Does not include theme information (unlike the public get_classification_results_by_doi)
Does not support different return types
Used internally during the classification pipeline

get_classification_results_by_doi(doi, return_type=<class 'dict'>)[source]#

Retrieves all categories and themes for a specific abstract via a DOI lookup.

This method provides access to the complete classification results for an abstract, including all taxonomy levels (top, mid, low) and identified themes. Results can be returned either as a dictionary or as a tuple of lists.

Parameters:

doi (str) – The DOI identifier for the abstract to retrieve results for.
return_type (type[dict] | type[tuple]) – The desired return type class. Use dict for a dictionary return or tuple for a tuple return. Defaults to dict.

Returns:

The classification results in the requested format:

Type: Union[Tuple[str, …], Dict[str, Any]]

If dict return type:

top_categories (List[str]): Top-level taxonomy categories
mid_categories (List[str]): Mid-level taxonomy categories
low_categories (List[str]): Low-level taxonomy categories
themes (List[str]): Identified themes for the abstract

If tuple return type:

Tuple of (top_categories, mid_categories, low_categories, themes) where each element is a List[str]

Return type:

Union[Tuple[str, …], Dict[str, Any]]

Notes

Categories at each level are returned in order of classification
Low-level categories are deduplicated while preserving order
Returns empty lists for categories/themes if DOI not found
Theme list will be empty if theme recognition hasn’t been run
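The notes above describe how the nested per-DOI structure is flattened into four lists. A minimal sketch of that flattening, assuming the nested layout shown elsewhere on this page (top category, then mid category, then a list of low categories, plus a "themes" key), could look like this. The function name flatten_doi_results is hypothetical and the code is not taken from the library.

```python
from typing import Any, Dict, List, Tuple, Union

FlatDict = Dict[str, List[str]]
FlatTuple = Tuple[List[str], List[str], List[str], List[str]]


def flatten_doi_results(
    doi_results: Dict[str, Any], return_type: type = dict
) -> Union[FlatDict, FlatTuple]:
    """Flatten {top: {mid: [low, ...]}, "themes": [...]} into four lists."""
    top: List[str] = []
    mid: List[str] = []
    low: List[str] = []
    for key, value in doi_results.items():
        if key == "themes":
            continue  # themes are stored beside the category tree
        top.append(key)
        for mid_name, low_names in value.items():
            mid.append(mid_name)
            low.extend(low_names)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    low = list(dict.fromkeys(low))
    themes = list(doi_results.get("themes", []))
    if return_type is tuple:
        return top, mid, low, themes
    return {
        "top_categories": top,
        "mid_categories": mid,
        "low_categories": low,
        "themes": themes,
    }
```

An empty input yields four empty lists, which matches the documented behavior for a DOI that is not found.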

classify_abstract(abstract, doi, prompt_variables, level='top', parent_category=None, current_dict=None)[source]#

Recursively classifies an abstract through the taxonomy hierarchy.

This method implements a depth-first traversal of the taxonomy tree, classifying the abstract at each level and recursively processing subcategories. It maintains state using a nested defaultdict structure that mirrors the taxonomy hierarchy.

Parameters:

abstract (str) – The text of the abstract to classify.
doi (str) – The DOI identifier for the abstract.
prompt_variables (Dict[str, Any]) – Variables required for classification.

  Pre-classification requirements:

  - method_json_output: Method extraction results
  - sentence_analysis_output: Sentence analysis results
  - abstract_summary_output: Abstract summary

  Classification requirements:

  - abstract: The abstract text
  - categories: Available categories for current level
  - CLASSIFICATION_JSON_FORMAT: Format specification
  - TAXONOMY_EXAMPLE: Example classifications

level (str | None) – Current taxonomy level (“top”, “mid”, or “low”). Defaults to “top”.
parent_category (str | None) – The parent category from previous level. Defaults to None.
current_dict (Dict[str, Any] | None) – Current position in classification results. Defaults to None.

Return type:

None

Returns:

None

Raises:

ValueError – If classification fails validation after max retries.
Exception – If any other error occurs during classification.

Notes

Pre-classification must run method extraction, sentence analysis, and summarization
Top level classification processes into top categories then recursively into subcategories
Mid level classification processes into mid categories under parent then into low categories
Low level classification appends results to parent mid category’s list
Validates all classified categories against taxonomy
Retries classification up to max_classification_retries times
On final retry, bans invalid categories to force valid results
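The depth-first traversal described above can be illustrated without any model calls. In this sketch the taxonomy is a small hypothetical dictionary, and the pick argument stands in for the LLM classification step (here a simple keyword matcher). None of these names come from the library, and validation and retries are left out to keep the recursion visible.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional

# Hypothetical toy taxonomy: top -> mid -> [low]
TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "Computer science": {
        "Machine learning": ["Neural networks", "Clustering"],
        "Databases": ["Query optimization"],
    },
    "Biology": {"Genetics": ["Genomics"]},
}

Picker = Callable[[str, List[str]], List[str]]


def keyword_pick(abstract: str, options: List[str]) -> List[str]:
    """Stand-in for the model: keep options named in the abstract."""
    text = abstract.lower()
    return [option for option in options if option.lower() in text]


def classify_recursive(
    abstract: str,
    pick: Picker,
    results: Dict[str, Dict[str, List[str]]],
    level: str = "top",
    top: Optional[str] = None,
    mid: Optional[str] = None,
) -> None:
    """Classify at one level, then descend into each chosen category."""
    if level == "top":
        for chosen_top in pick(abstract, list(TAXONOMY)):
            classify_recursive(abstract, pick, results, "mid", top=chosen_top)
    elif level == "mid":
        for chosen_mid in pick(abstract, list(TAXONOMY[top])):
            classify_recursive(abstract, pick, results, "low", top=top, mid=chosen_mid)
    else:
        # Low-level results are appended to the parent mid category's list.
        results[top][mid].extend(pick(abstract, TAXONOMY[top][mid]))


results: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
classify_recursive(
    "We apply machine learning, specifically neural networks, "
    "to a computer science benchmark.",
    keyword_pick,
    results,
)
```

Each branch finishes its low-level step before the next sibling category is visited, which is what makes the traversal depth-first.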

extract_classified_categories(classification_output)[source]#

Extracts category names from a classification output object.

Flattens the nested structure of ClassificationOutput into a simple list of category names. Handles multiple classifications within the output object.

Parameters:

classification_output (academic_metrics.AI.models.ClassificationOutput) – Pydantic model containing classification results. Structure:

    {
        "classifications": [
            {"categories": ["category1", "category2"], "confidence": 0.95},
            {"categories": ["category3"], "confidence": 0.85}
        ]
    }

Returns:

Flattened list of all classified category names.

Return type:

List[str]

Notes

Extracts categories from all classification entries
Maintains the order of categories as they appear
Ignores confidence scores in the output
Does not deduplicate categories
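The flattening described in these notes amounts to a nested iteration. The sketch below uses plain dataclasses as stand-ins for the Pydantic ClassificationOutput model so that it runs without extra dependencies. The class and function names are hypothetical, and only the classifications, categories, and confidence fields are taken from the structure shown above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClassificationSketch:
    """Stand-in for one entry of the 'classifications' list."""
    categories: List[str]
    confidence: float


@dataclass
class ClassificationOutputSketch:
    """Stand-in for the ClassificationOutput model."""
    classifications: List[ClassificationSketch] = field(default_factory=list)


def extract_categories(output: ClassificationOutputSketch) -> List[str]:
    """Collect category names in order; confidence is ignored, duplicates are kept."""
    return [
        category
        for item in output.classifications
        for category in item.categories
    ]
```

Because duplicates are kept at this stage, any deduplication has to happen later, for example when results are flattened for a DOI lookup.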

is_valid_category(category, level)[source]#

Validates if a category exists in the taxonomy at the specified level.

This method delegates category validation to the taxonomy instance, checking whether a given category exists at the specified taxonomy level.

Parameters:

category (str) – The category name to validate.
level (str) – The taxonomy level to check against. Must be one of: “top”, “mid”, or “low”.

Returns:

True if the category exists at the specified level, False otherwise.

Return type:

bool

Notes

Used to validate classified categories before processing
Triggers retry logic if invalid categories are found
Supports the category banning mechanism on final retries
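One way to picture how validation, retries, and banning interact is the loop below. It is a sketch under assumptions, not the library's code: classify_once is a hypothetical callable standing in for the classification chain, and how banned categories are actually communicated to the model is not documented on this page.

```python
from typing import Callable, List, Set


def classify_with_retries(
    classify_once: Callable[[List[str]], List[str]],
    valid_categories: Set[str],
    max_retries: int = 3,
) -> List[str]:
    """Call classify_once until every returned category is valid.

    classify_once receives the list of banned category names and returns
    the chosen categories. The ban list stays empty until the final attempt.
    """
    seen_invalid: List[str] = []
    for attempt in range(1, max_retries + 1):
        is_final_attempt = attempt == max_retries
        banned = list(seen_invalid) if is_final_attempt else []
        categories = classify_once(banned)
        invalid = [name for name in categories if name not in valid_categories]
        if not invalid:
            return categories
        # Remember every invalid name so the final attempt can ban it.
        for name in invalid:
            if name not in seen_invalid:
                seen_invalid.append(name)
    raise ValueError(
        f"Invalid categories after {max_retries} attempts: {seen_invalid}"
    )
```

Raising ValueError once the attempts are exhausted mirrors the exception documented for classify_abstract.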

classify()[source]#

Orchestrates the complete classification pipeline for all abstracts.

This method manages the end-to-end processing of all abstracts present in the doi_to_abstract_dict dictionary through three stages: pre-classification, classification, and theme recognition.

Parameters:

None

Returns:

self, to allow method chaining (academic_metrics.AI.abstract_classifier.AbstractClassifier).

Return type:

Self

Notes

Pipeline Stages:

- Pre-classification:
  - Method extraction: Identifies research methods and techniques
  - Sentence analysis: Analyzes abstract structure and components
  - Summarization: Generates structured abstract summary
- Classification:
  - Uses enriched data from pre-classification
  - Recursively classifies through taxonomy levels
  - Validates and retries invalid classifications
- Theme Recognition:
  - Processes classified abstracts
  - Identifies key themes and concepts
  - Uses higher temperature for creative analysis

State Updates:

- classification_results: Nested defaultdict structure:

      {
          "doi1": {
              "top_category1": {
                  "mid_category1": ["low1", "low2"],
                  "mid_category2": ["low3", "low4"]
              },
              "themes": ["theme1", "theme2"]
          }
      }

- raw_classification_outputs: List of raw outputs from classification
- raw_theme_outputs: Dictionary mapping DOIs to theme analysis results

Processing Details:

- Processes abstracts sequentially
- Requires initialized chain managers
- Updates multiple result stores
- Maintains logging throughout process
- Chains data between processing stages
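The nested defaultdict layout used for classification_results can be reproduced with the standard library. The following is a sketch of one way to build such a store. The factory name new_results_store is hypothetical, and the exact nesting the library uses internally is an assumption based on the structure shown above.

```python
import json
from collections import defaultdict


def new_results_store() -> defaultdict:
    """DOI -> top category -> mid category -> list of low categories."""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(list)))


store = new_results_store()
doi = "10.1234/example"

# Intermediate levels are created on first access, so no setup is needed.
store[doi]["top_category1"]["mid_category1"].extend(["low1", "low2"])
store[doi]["top_category1"]["mid_category2"].extend(["low3", "low4"])

# Themes are stored as a plain list next to the category tree.
store[doi]["themes"] = ["theme1", "theme2"]

# defaultdict is a dict subclass, so it serializes to JSON directly.
serialized = json.dumps(store, indent=2)
```

One side effect of this design is that merely reading a missing key creates it, so lookups for unknown DOIs are better done with `in` or `.get()`.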

_make_dirs_helper(output_path)[source]#

Creates necessary directories for an output file path.

This private helper method ensures that all directories in the path exist, creating them if necessary. Used by save methods before writing files.

Parameters:

output_path (str) – The full path where a file will be saved. Can be either an absolute or a relative path.

Return type:

None

Notes

Creates directories recursively
Uses exist_ok=True to handle existing directories
Creates parent directories only (not the file itself)
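A helper with the behavior described in these notes can be written in a few lines with the standard library. This is a minimal sketch and not necessarily how the method is implemented. The function name make_parent_dirs is hypothetical, and the guard for bare filenames is an added assumption.

```python
import os


def make_parent_dirs(output_path: str) -> None:
    """Create the directories leading to output_path, but not the file itself."""
    parent = os.path.dirname(output_path)
    if parent:  # a bare filename such as "results.json" has no parent to create
        os.makedirs(parent, exist_ok=True)
```

With exist_ok=True the helper can be called before every save without first checking whether the directory already exists.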

save_classification_results(output_path)[source]#

Saves processed classification results to a JSON file.

Writes the complete classification results dictionary to a JSON file, creating any necessary directories in the process. The output includes all categories and themes for all processed abstracts.

Parameters:

output_path (str) – Path where the JSON file should be saved. Can be an absolute or a relative path.

Returns:

self, to allow method chaining (academic_metrics.AI.abstract_classifier.AbstractClassifier).

Return type:

Self

Notes

Output Format:

    {
        "doi1": {
            "top_category1": {
                "mid_category1": ["low1", "low2"],
                "mid_category2": ["low3", "low4"]
            },
            "themes": ["theme1", "theme2"]
        }
    }

get_classification_results_dict()[source]#

Retrieves processed classification results for all processed abstracts.

Provides direct access to the complete classification results dictionary, containing all categories and themes for every processed abstract.

Returns:

A dictionary (Dict[str, Dict[str, Any]]) where:

- Keys are DOI strings
- Values are nested dictionaries containing:

      {
          "top_category1": {
              "mid_category1": ["low1", "low2"],
              "mid_category2": ["low3", "low4"]
          },
          "themes": ["theme1", "theme2"]
      }

Return type:

Dict[str, Dict[str, Any]]

Notes

Returns the raw defaultdict structure
Includes theme information if theme recognition was run
Structure matches the save_classification_results output format

get_raw_classification_outputs()[source]#

Retrieves raw classification outputs from all processed abstracts.

Provides access to the complete, unprocessed outputs from the classification chain, including all prompt variables and intermediate results.

Returns:

List of raw classification outputs (List[Dict[str, Any]]), where each output contains:

- classifications: List of classifications with categories and confidence scores
- abstract: The original abstract text
- method_json_output: Output from method extraction
- sentence_analysis_output: Output from sentence analysis
- abstract_summary_output: Output from abstract summarization
- Other chain variables and outputs

Return type:

List[Dict[str, Any]]

Notes

Contains all chain variables and outputs
Includes pre-classification results
Useful for debugging and analysis
May contain large amounts of data

get_raw_theme_results()[source]#

Retrieves raw theme analysis results for all processed abstracts.

Provides access to the complete, unprocessed outputs from the theme recognition chain for each abstract.

Parameters:

None

Returns:

Dictionary (Dict[str, Dict[str, Any]]) where:

- Keys are DOI strings
- Values are raw theme analysis results with structure:

      {
          "themes": ["theme1", "theme2"],
          "confidence_scores": {"theme1": 0.95, "theme2": 0.85},
          "analysis": "Theme analysis text…",
          # Other theme recognition outputs
      }

Return type:

Dict[str, Dict[str, Any]]

Notes

Contains complete theme recognition outputs
Includes confidence scores and analysis text
Available after theme recognition stage
Empty dictionaries for unprocessed DOIs

save_raw_classification_results(output_path)[source]#

Saves raw classification outputs to a JSON file.

Writes the complete, unprocessed outputs from the classification chain to a JSON file, creating any necessary directories in the process. Includes all prompt variables and intermediate results.

Parameters:

output_path (str) – Path where the JSON file should be saved. Can be an absolute or a relative path.

Returns:

self, to allow method chaining (academic_metrics.AI.abstract_classifier.AbstractClassifier).

Return type:

Self

Notes

Output Format:

    [
        {
            "classifications": [
                {"categories": ["category1", "category2"], "confidence": 0.95}
            ],
            "abstract": "original abstract text",
            "method_json_output": {…},
            "sentence_analysis_output": {…},
            "abstract_summary_output": {…},
            # Other chain variables and outputs
        },
        # Additional classification outputs…
    ]

save_raw_theme_results(output_path)[source]#

Saves raw theme analysis results to a JSON file.

Writes the complete, unprocessed outputs from the theme recognition chain to a JSON file, creating any necessary directories in the process. Includes theme analysis results for each processed abstract.

Parameters:

output_path (str) – Path where the JSON file should be saved. Can be an absolute or a relative path.

Returns:

self, to allow method chaining (academic_metrics.AI.abstract_classifier.AbstractClassifier).

Return type:

Self

Notes

Output Format:

    {
        "10.1234/example": {
            "themes": ["theme1", "theme2"],
            "confidence_scores": {"theme1": 0.95, "theme2": 0.85},
            "analysis": "Theme analysis text…",
            # Other theme recognition outputs
        },
        # Additional DOIs and their theme results…
    }
