utils package#
Submodules#
utils.api_key_validator module#
- class academic_metrics.utils.api_key_validator.ValidationResult(openai=False, anthropic=False, google=False)[source]#
Bases:
object
- class academic_metrics.utils.api_key_validator.APIKeyValidator[source]#
Bases:
object
Validator for LLM API keys across different services.
Example
>>> validator = APIKeyValidator(api_key="sk-...")
>>> if validator.is_valid():
...     print("Key is valid!")
>>> validator.print_results()  # See which services work
- is_valid(api_key, model=None)[source]#
Check if the API key is valid for any service. Validates if not already done.
- Return type:
- get_results_for_api_key(api_key)[source]#
Get detailed validation results. Validates if not already done.
utils.minhash_util module#
- class academic_metrics.utils.minhash_util.MinHashUtility(num_hashes, large_prime=999983)[source]#
Bases:
object
A utility class for performing MinHash calculations to estimate the similarity between sets of data.
This class provides methods for generating hash functions, tokenizing strings into n-grams, computing MinHash signatures, and comparing these signatures to estimate the similarity between sets. The MinHash technique is particularly useful in applications where exact matches are not necessary, but approximate matches are sufficient, such as duplicate detection, document similarity, and clustering.
- num_hashes#
The number of hash functions to use in MinHash calculations, affecting the accuracy and performance of the similarity estimation.
- Type:
- large_prime#
A large prime number used as the modulus in hash functions to minimize collisions.
- Type:
- hash_fns#
A list of pre-generated hash functions used for computing MinHash signatures.
- Type:
list[callable]
- generate_coefficients()#
Generates random coefficients for hash functions.
- generate_hash_functions()[source]#
Creates a list of hash functions based on generated coefficients.
- compare_signatures(signature1, signature2)[source]#
Compares two MinHash signatures and returns their estimated similarity.
The class utilizes linear hash functions of the form h(x) = (a * x + b) % large_prime, where ‘a’ and ‘b’ are randomly generated coefficients. This approach helps in reducing the likelihood of hash collisions and ensures a uniform distribution of hash values.
- Example usage:
minhash_util = MinHashUtility(num_hashes=200)
tokens = minhash_util.tokenize("example string", n=3)
signature = minhash_util.compute_signature(tokens)
# Further operations such as comparing signatures can be performed.
More on MinHash: https://en.wikipedia.org/wiki/MinHash
- __init__(num_hashes, large_prime=999983)[source]#
Initialize the MinHashUtility with the specified number of hash functions.
- tokenize(string, n=3)[source]#
Tokenize the given string into n-grams to facilitate the identification of similar strings.
N-grams are contiguous sequences of ‘n’ characters extracted from a string. This method is useful in various applications such as text similarity, search, and indexing where the exact match is not necessary, but approximate matches are useful.
More on n-grams: https://en.wikipedia.org/wiki/N-gram
- Parameters:
- Returns:
A set containing unique n-grams derived from the input string.
- Return type:
- Raises:
ValueError – If ‘n’ is greater than the length of the string or less than 1.
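The n-gram extraction described above can be sketched as follows. This is a minimal standalone illustration, not the library's implementation: the function name `char_ngrams` is hypothetical, and only the behavior stated in this entry (a set of unique character n-grams, `ValueError` for an out-of-range `n`) is assumed.

```python
from typing import Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the unique contiguous character n-grams of `text` (hypothetical helper)."""
    if n < 1 or n > len(text):
        raise ValueError("n must be between 1 and the length of the string")
    # Slide a window of width n across the string; the set drops duplicates.
    return {text[i:i + n] for i in range(len(text) - n + 1)}


print(sorted(char_ngrams("hello", n=3)))  # ['ell', 'hel', 'llo']
```

Returning a set rather than a list matters for MinHash, which estimates similarity between sets and ignores how often a token repeats.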
- generate_coeeficients()[source]#
Generate a list of tuples, each containing a pair of coefficients (a, b) used for hash functions.
Each tuple consists of:
- a (int): A randomly chosen multiplier coefficient.
- b (int): A randomly chosen additive coefficient.
These coefficients are used in the linear hash functions for MinHash calculations.
- generate_hash_functions()[source]#
Generate a list of linear hash functions for use in MinHash calculations.
Each hash function is defined by a unique pair of coefficients (a, b) and is created using a factory function. These hash functions are used to compute hash values for elements in a set, which are essential for estimating the similarity between sets using the MinHash technique.
The hash functions are of the form: h(x) = (a * x + b) % large_prime, where ‘large_prime’ is a large prime number used to reduce collisions in hash values.
Overview of hash functions: https://en.wikipedia.org/wiki/Hash_function
- Returns:
A list of lambda functions, each representing a linear hash function.
- Return type:
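A rough sketch of the factory approach described above, under stated assumptions: the helper names `make_hash_fn` and `build_hash_functions`, the `seed` parameter, and the coefficient ranges are illustrative choices, not taken from the library.

```python
import random
from typing import Callable, List


def make_hash_fn(a: int, b: int, large_prime: int) -> Callable[[int], int]:
    """Factory: bind one (a, b) pair into h(x) = (a * x + b) % large_prime."""
    def hash_fn(x: int) -> int:
        return (a * x + b) % large_prime
    return hash_fn


def build_hash_functions(num_hashes: int, large_prime: int = 999983,
                         seed: int = 0) -> List[Callable[[int], int]]:
    """Create num_hashes linear hash functions (seed and ranges are assumptions)."""
    rng = random.Random(seed)
    coefficients = [
        (rng.randint(1, large_prime - 1), rng.randint(0, large_prime - 1))
        for _ in range(num_hashes)
    ]
    return [make_hash_fn(a, b, large_prime) for a, b in coefficients]


print(make_hash_fn(2, 3, 7)(5))  # (2 * 5 + 3) % 7 = 6
```

The factory gives each function its own `a` and `b`. A lambda written inline in a loop would instead capture the loop variables by reference, so every function would end up using the last coefficient pair.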
- compute_signature(tokens)[source]#
Compute MinHash signature for a set of tokens. A MinHash signature consists of the minimum hash value produced by each hash function across all tokens, which is used to estimate the similarity between sets of data.
Detailed explanation of MinHash and its computation: https://en.wikipedia.org/wiki/MinHash
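As a hedged illustration of the signature computation, the sketch below takes the minimum hash value per hash function across all tokens. Two details are assumptions, since this page does not specify them: string tokens are mapped to integers with CRC32 before hashing, and an empty token set is rejected. The names `make_hash_fns` and `minhash_signature` are hypothetical.

```python
import random
import zlib
from typing import Callable, Iterable, List

LARGE_PRIME = 999983  # same default modulus as the class signature above


def make_hash_fns(num_hashes: int, seed: int = 0) -> List[Callable[[int], int]]:
    """Build linear hash functions h(x) = (a * x + b) % LARGE_PRIME."""
    rng = random.Random(seed)
    pairs = [(rng.randint(1, LARGE_PRIME - 1), rng.randint(0, LARGE_PRIME - 1))
             for _ in range(num_hashes)]
    # Default arguments bind each (a, b) pair to its own lambda.
    return [lambda x, a=a, b=b: (a * x + b) % LARGE_PRIME for a, b in pairs]


def minhash_signature(tokens: Iterable[str],
                      hash_fns: List[Callable[[int], int]]) -> List[int]:
    """Return one minimum hash value per hash function."""
    # Assumption: tokens become stable integers via CRC32 of their UTF-8 bytes.
    token_ints = [zlib.crc32(tok.encode("utf-8")) for tok in tokens]
    if not token_ints:
        raise ValueError("cannot compute a signature for an empty token set")
    return [min(h(x) for x in token_ints) for h in hash_fns]


hash_fns = make_hash_fns(num_hashes=8)
signature = minhash_signature({"exa", "xam", "amp"}, hash_fns)
print(len(signature))  # 8
```

The signature length always equals the number of hash functions, regardless of how many tokens the set contains.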
- compare_signatures(signature1, signature2)[source]#
Compare two MinHash signatures and return their similarity. The similarity is calculated as the fraction of hash values that are identical in the two signatures, which estimates the Jaccard similarity of the original sets from which these signatures were derived.
This method is based on the principle that the more similar the sets are, the more hash values they will share, thus providing a proxy for the Jaccard index of the sets.
More on estimating similarity with MinHash: https://en.wikipedia.org/wiki/Jaccard_index#MinHash
- Parameters:
- Returns:
The estimated similarity between the two sets, based on their MinHash signatures.
- Return type:
- Raises:
AssertionError – If the two signatures do not have the same length.
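The comparison described above reduces to counting matching positions. The following is a minimal sketch with a standalone function of the same name plus a hypothetical `jaccard` helper that computes the exact index the estimate approximates.

```python
from typing import Sequence, Set


def compare_signatures(signature1: Sequence[int], signature2: Sequence[int]) -> float:
    """Fraction of positions at which two equal-length signatures agree."""
    assert len(signature1) == len(signature2), "signatures must have the same length"
    matches = sum(1 for x, y in zip(signature1, signature2) if x == y)
    return matches / len(signature1)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard index |A & B| / |A | B| (hypothetical helper for comparison)."""
    return len(a & b) / len(a | b)


print(compare_signatures([3, 7, 1, 9], [3, 7, 4, 9]))  # 0.75
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))       # 0.5
```

With more hash functions the matching fraction tends to track the exact Jaccard index more closely, at the cost of longer signatures.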
utils.taxonomy_util module#
- academic_metrics.utils.taxonomy_util.TaxonomyDict#
Type alias representing the taxonomy dictionary structure.
- This type represents a three-level nested dictionary structure where:
The outer dictionary maps top-level category names to mid-level dictionaries
The mid-level dictionaries map mid-level category names to lists of low-level categories
The innermost lists contain strings representing low-level category names
- Type Structure:
- Dict[str, Dict[str, List[str]]] where:
First str: Top-level category name
Second str: Mid-level category name
List[str]: List of low-level category names
- Example Structure:
{
    "Computer Science": {                  # Top-level category
        "Artificial Intelligence": [       # Mid-level category
            "Machine Learning",            # Low-level category
            "Natural Language Processing", # Low-level category
            "Computer Vision"              # Low-level category
        ],
        "Software Engineering": [
            "Software Design",
            "Software Testing",
            "DevOps"
        ]
    }
}
- academic_metrics.utils.taxonomy_util.TaxonomyLevel#
Type alias representing valid taxonomy levels.
This type represents the three possible levels in the taxonomy hierarchy using string literals.
- Type Structure:
- Literal[“top”, “mid”, “low”] where:
“top”: Represents the highest level categories (e.g., “Computer Science”)
“mid”: Represents middle-level categories (e.g., “Artificial Intelligence”)
“low”: Represents the most specific categories (e.g., “Machine Learning”)
- Usage:
def example_function(level: TaxonomyLevel) -> None:
    match level:
        case "top":
            # Handle top-level category
            pass
        case "mid":
            # Handle mid-level category
            pass
        case "low":
            # Handle low-level category
            pass
Note
A static type checker (such as mypy) will accept only these three string literals where a TaxonomyLevel is expected and will report any other string as a type error. Python itself does not enforce this at runtime.
Example
# Valid usage
level: TaxonomyLevel = "top"  # OK
level: TaxonomyLevel = "mid"  # OK
level: TaxonomyLevel = "low"  # OK

# Invalid usage (would cause type error)
# level: TaxonomyLevel = "other"  # Type error!
alias of
Literal[‘top’, ‘mid’, ‘low’]
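Both aliases can be written out with the standard `typing` module, as in the sketch below. The alias definitions follow the type structures given above; the runtime check `require_valid_level` is a hypothetical helper, added because `Literal` is only checked by static tools.

```python
from typing import Dict, List, Literal, get_args

# Aliases as described in this section.
TaxonomyDict = Dict[str, Dict[str, List[str]]]
TaxonomyLevel = Literal["top", "mid", "low"]


def require_valid_level(level: str) -> None:
    """Raise ValueError at runtime if `level` is not one of the literals."""
    valid_levels = get_args(TaxonomyLevel)  # ('top', 'mid', 'low')
    if level not in valid_levels:
        raise ValueError(f"invalid level {level!r}; expected one of {valid_levels}")


require_valid_level("mid")          # passes silently
print(get_args(TaxonomyLevel))      # ('top', 'mid', 'low')
```

Deriving the valid values from the alias with `get_args` keeps the runtime check and the static type in sync.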
- class academic_metrics.utils.taxonomy_util.Taxonomy[source]#
Bases:
object
A class for managing and querying a three-level taxonomy structure.
This class provides functionality to work with a hierarchical taxonomy that has three levels: top, mid, and low. It allows for querying categories at each level, validating categories, and finding relationships between categories at different levels.
- _taxonomy#
The complete taxonomy structure as a nested dictionary.
- Type:
TaxonomyDict
- _valid_levels#
List of valid taxonomy levels [“top”, “mid”, “low”].
- Type:
List[TaxonomyLevel]
- logger#
Logger instance for this class.
- Type:
- Public Methods
get_top_categories(): Get all top-level categories.
get_mid_categories(top_category): Get mid-level categories for a top category.
get_low_categories(top_category, mid_category): Get low-level categories.
get_top_cat_for_mid_cat(mid_cat): Find parent top category of a mid category.
get_mid_cat_for_low_cat(low_cat): Find parent mid category of a low category.
is_valid_category(category, level): Check if a category exists at a level.
get_taxonomy(): Get the complete taxonomy dictionary.
- Private Methods
_set_all_top_categories(): Initialize list of top categories.
_set_all_mid_categories(): Initialize list of mid categories.
_set_all_low_categories(): Initialize list of low categories.
_load_taxonomy_from_string(taxonomy_str, logger): Load taxonomy from JSON.
Examples
# Create a taxonomy instance
taxonomy = Taxonomy()

# Get categories at different levels
top_cats = taxonomy.get_top_categories()
mid_cats = taxonomy.get_mid_categories(top_cats[0])
low_cats = taxonomy.get_low_categories(top_cats[0], mid_cats[0])

# Validate categories
taxonomy.is_valid_category(top_cats[0], "top")
True

# Find parent categories
parent_top = taxonomy.get_top_cat_for_mid_cat(mid_cats[0])
parent_mid = taxonomy.get_mid_cat_for_low_cat(low_cats[0])
- __init__()[source]#
Initializes a new Taxonomy instance.
This constructor initializes the taxonomy by loading the taxonomy data from a predefined string constant (TAXONOMY_AS_STRING). It sets up logging and initializes internal lists of categories at all levels (top, mid, and low).
The taxonomy follows a three-level hierarchical structure:
- Top level: Broad categories
- Mid level: Sub-categories under each top category
- Low level: Specific categories under each mid category
Examples
# Create a new taxonomy instance
taxonomy = Taxonomy()
isinstance(taxonomy._taxonomy, dict)
True

# Verify initialization of category lists
all(isinstance(cats, list) for cats in [
    taxonomy._all_top_categories,
    taxonomy._all_mid_categories,
    taxonomy._all_low_categories
])
True

# Check that valid levels are properly set
taxonomy._valid_levels == ["top", "mid", "low"]
True
- __str__()[source]#
Returns a string representation of the taxonomy.
Converts the taxonomy dictionary into a formatted JSON string with proper indentation. This method is useful for debugging and displaying the taxonomy structure.
- Returns:
A JSON-formatted string representation of the taxonomy.
- Return type:
Examples
import json

taxonomy = Taxonomy()
taxonomy_str = str(taxonomy)
isinstance(taxonomy_str, str)
True

# Verify it's valid JSON
json.loads(taxonomy_str) == taxonomy._taxonomy
True
- _set_all_top_categories()[source]#
Sets and returns all top-level categories from the taxonomy.
This private method initializes the list of all top-level categories by calling get_top_categories(). It’s used during taxonomy initialization to cache the top-level categories for faster access.
- Returns:
A list of all top-level categories in the taxonomy.
- Return type:
List[str]
Examples
taxonomy = Taxonomy()
top_cats = taxonomy._set_all_top_categories()
isinstance(top_cats, list)
True

# Verify all elements are strings
all(isinstance(cat, str) for cat in top_cats)
True

# Verify it matches direct access to top categories
top_cats == taxonomy.get_top_categories()
True
- _set_all_mid_categories()[source]#
Sets and returns all mid-level categories from the taxonomy.
This private method collects all mid-level categories across all top-level categories by iterating through the taxonomy structure. It’s used during taxonomy initialization to cache the mid-level categories for faster access.
- Returns:
A list of all mid-level categories in the taxonomy.
- Return type:
List[str]
Examples
taxonomy = Taxonomy()
mid_cats = taxonomy._set_all_mid_categories()
isinstance(mid_cats, list)
True

# Verify all elements are strings
all(isinstance(cat, str) for cat in mid_cats)
True

# Verify the first mid category belongs to some top category
any(mid_cats[0] in taxonomy.get_mid_categories(top_cat)
    for top_cat in taxonomy.get_top_categories())
True
- _set_all_low_categories()[source]#
Sets and returns all low-level categories from the taxonomy.
This private method collects all low-level categories by iterating through all top and mid-level categories in the taxonomy structure. It’s used during taxonomy initialization to cache the low-level categories for faster access.
- Returns:
A list of all low-level categories in the taxonomy.
- Return type:
List[str]
Examples
taxonomy = Taxonomy()
low_cats = taxonomy._set_all_low_categories()
isinstance(low_cats, list)
True

# Verify all elements are strings
all(isinstance(cat, str) for cat in low_cats)
True

# Verify first low category exists in taxonomy structure
top_cat = taxonomy.get_top_categories()[0]
mid_cat = taxonomy.get_mid_categories(top_cat)[0]
low_cats[0] in taxonomy.get_low_categories(top_cat, mid_cat)
True
- get_top_categories()[source]#
Retrieves all top-level categories from the taxonomy.
- Returns:
A list of all top-level category names.
- Return type:
List[str]
Examples
taxonomy = Taxonomy()
top_cats = taxonomy.get_top_categories()
isinstance(top_cats, list)
True

# Verify all elements are strings
all(isinstance(cat, str) for cat in top_cats)
True

# Verify returned list matches taxonomy keys
top_cats == list(taxonomy._taxonomy.keys())
True
- get_mid_categories(top_category)[source]#
Retrieves all mid-level categories for a given top-level category.
- Parameters:
top_category (str) – The top-level category name to get mid-level categories for.
- Returns:
A list of all mid-level category names under the specified top category.
- Return type:
List[str]
- Raises:
KeyError – If the top_category doesn’t exist in the taxonomy.
Examples
taxonomy = Taxonomy()
top_cat = taxonomy.get_top_categories()[0]
mid_cats = taxonomy.get_mid_categories(top_cat)
isinstance(mid_cats, list)
True

# Verify all elements are strings
all(isinstance(cat, str) for cat in mid_cats)
True

# Verify error handling
try:
    taxonomy.get_mid_categories("nonexistent_category")
except KeyError:
    True
- get_low_categories(top_category, mid_category)[source]#
Retrieves all low-level categories for given top and mid-level categories.
- Parameters:
- Returns:
A list of all low-level category names under the specified categories.
- Return type:
List[str]
- Raises:
KeyError – If either the top_category or mid_category doesn’t exist in the taxonomy.
Examples
taxonomy = Taxonomy()
top_cat = taxonomy.get_top_categories()[0]
mid_cat = taxonomy.get_mid_categories(top_cat)[0]
low_cats = taxonomy.get_low_categories(top_cat, mid_cat)
isinstance(low_cats, list)
True

# Verify all elements are strings
all(isinstance(cat, str) for cat in low_cats)
True

# Verify error handling for invalid categories
try:
    taxonomy.get_low_categories("nonexistent", "category")
except KeyError:
    True
- get_top_cat_for_mid_cat(mid_cat)[source]#
Finds the top-level category that contains a given mid-level category.
This method searches through the taxonomy structure to find which top-level category contains the given mid-level category name.
- Parameters:
mid_cat (str) – The mid-level category to find the parent for.
- Returns:
The name of the top-level category containing the mid-level category.
- Return type:
- Raises:
ValueError – If the mid_cat is not found in any top-level category.
Example
- get_mid_cat_for_low_cat(low_cat)[source]#
Finds the mid-level category that contains a given low-level category.
This method searches through the taxonomy structure to find which mid-level category contains the given low-level category name.
- Parameters:
low_cat (str) – The low-level category to find the parent for.
- Returns:
The name of the mid-level category containing the low-level category.
- Return type:
- Raises:
ValueError – If the low_cat is not found in any mid-level category.
Examples
from academic_metrics.utils import Taxonomy

taxonomy = Taxonomy()

# Get a known low category and its parent categories
top_cat = taxonomy.get_top_categories()[0]
mid_cat = taxonomy.get_mid_categories(top_cat)[0]
low_cat = taxonomy.get_low_categories(top_cat, mid_cat)[0]

# Verify we can find the mid category
found_mid = taxonomy.get_mid_cat_for_low_cat(low_cat)
assert found_mid == mid_cat

# Verify error handling for invalid low category
try:
    taxonomy.get_mid_cat_for_low_cat("nonexistent_category")
except ValueError:
    pass  # Expected behavior
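The two parent lookups can be illustrated on a plain nested dictionary. This is a sketch of the search described above, not the class's code: the function names `find_top_for_mid` and `find_mid_for_low` are hypothetical, and `SAMPLE` reuses the example structure from the TaxonomyDict entry.

```python
from typing import Dict, List

SAMPLE: Dict[str, Dict[str, List[str]]] = {
    "Computer Science": {
        "Artificial Intelligence": ["Machine Learning", "Computer Vision"],
        "Software Engineering": ["Software Design", "DevOps"],
    }
}


def find_top_for_mid(taxonomy: Dict[str, Dict[str, List[str]]], mid_cat: str) -> str:
    """Return the top-level category whose mid-level keys include mid_cat."""
    for top_cat, mid_map in taxonomy.items():
        if mid_cat in mid_map:
            return top_cat
    raise ValueError(f"mid category {mid_cat!r} not found")


def find_mid_for_low(taxonomy: Dict[str, Dict[str, List[str]]], low_cat: str) -> str:
    """Return the mid-level category whose list includes low_cat."""
    for mid_map in taxonomy.values():
        for mid_cat, low_cats in mid_map.items():
            if low_cat in low_cats:
                return mid_cat
    raise ValueError(f"low category {low_cat!r} not found")


print(find_top_for_mid(SAMPLE, "Software Engineering"))  # Computer Science
print(find_mid_for_low(SAMPLE, "Machine Learning"))      # Artificial Intelligence
```

Both searches return the first match, so a category name that appears under more than one parent would resolve to whichever parent is iterated first.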
- is_valid_category(category, level)[source]#
Validates whether a category exists in the taxonomy at the specified level.
- Parameters:
category (str) – The name of the category to validate.
level (TaxonomyLevel) – The taxonomy level to validate against. TaxonomyLevel is a type alias for the taxonomy levels; it can be one of the following: “top”, “mid”, or “low”.
- Returns:
True if the category exists at the specified level; otherwise, False.
- Return type:
- Raises:
ValueError – If the provided taxonomy level is invalid.
Examples
taxonomy = Taxonomy()

# Get known categories at each level
top_cat = taxonomy.get_top_categories()[0]
mid_cat = taxonomy.get_mid_categories(top_cat)[0]
low_cat = taxonomy.get_low_categories(top_cat, mid_cat)[0]

# Test valid categories at each level
taxonomy.is_valid_category(top_cat, "top")
True
taxonomy.is_valid_category(mid_cat, "mid")
True
taxonomy.is_valid_category(low_cat, "low")
True

# Test invalid categories
taxonomy.is_valid_category("nonexistent_category", "top")
False

# Test category at wrong level
taxonomy.is_valid_category(top_cat, "low")
False

# Test invalid level
try:
    taxonomy.is_valid_category(top_cat, "invalid_level")  # type: ignore
except ValueError:
    True
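One way to picture the level-aware validation is the standalone sketch below. It searches a nested dictionary directly, whereas the class is described as caching category lists at initialization; the sample data and the free-function form are assumptions for illustration.

```python
from typing import Dict, List

SAMPLE: Dict[str, Dict[str, List[str]]] = {
    "Computer Science": {
        "Artificial Intelligence": ["Machine Learning", "Computer Vision"],
    }
}


def is_valid_category(taxonomy: Dict[str, Dict[str, List[str]]],
                      category: str, level: str) -> bool:
    """Check whether `category` exists at the given level of a nested taxonomy."""
    if level == "top":
        return category in taxonomy
    if level == "mid":
        return any(category in mid_map for mid_map in taxonomy.values())
    if level == "low":
        return any(category in low_cats
                   for mid_map in taxonomy.values()
                   for low_cats in mid_map.values())
    # Any level outside "top", "mid", "low" is rejected, as documented above.
    raise ValueError(f"invalid taxonomy level: {level!r}")


print(is_valid_category(SAMPLE, "Machine Learning", "low"))   # True
print(is_valid_category(SAMPLE, "Computer Science", "low"))   # False
```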
- get_taxonomy()[source]#
Returns the complete taxonomy dictionary.
- Returns:
The complete taxonomy structure as a dictionary.
- Return type:
TaxonomyDict
Note
The structure follows the format:
{
    "top_category": {
        "mid_category": ["low_category1", "low_category2", ...]
    }
}
Examples
taxonomy = Taxonomy()
tax_dict = taxonomy.get_taxonomy()
isinstance(tax_dict, dict)
True

# Verify structure
top_cat = list(tax_dict.keys())[0]
isinstance(tax_dict[top_cat], dict)
True
mid_cat = list(tax_dict[top_cat].keys())[0]
isinstance(tax_dict[top_cat][mid_cat], list)
True
- static _load_taxonomy_from_string(taxonomy_str, logger=None)[source]#
Loads and parses a taxonomy from a JSON string.
- Parameters:
taxonomy_str (str) – JSON string containing the taxonomy structure.
logger (logging.Logger | None, optional) – Logger instance for logging operations. Defaults to None.
- Returns:
The parsed taxonomy dictionary.
- Return type:
TaxonomyDict
- Raises:
json.JSONDecodeError – If the taxonomy string is not valid JSON.
Examples
import json

# Create a simple valid taxonomy string
tax_str = '{"top": {"mid": ["low1", "low2"]}}'
taxonomy = Taxonomy._load_taxonomy_from_string(tax_str)
isinstance(taxonomy, dict)
True

# Verify structure
list(taxonomy.keys()) == ["top"]
True

# Test invalid JSON
try:
    Taxonomy._load_taxonomy_from_string("{invalid json}")
except json.JSONDecodeError:
    True
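A minimal sketch of loading a taxonomy from a JSON string with an optional logger follows. The function name `load_taxonomy` and the choice to log before re-raising are assumptions; only the parameters and the `json.JSONDecodeError` behavior come from the entry above.

```python
import json
import logging
from typing import Dict, List, Optional


def load_taxonomy(taxonomy_str: str,
                  logger: Optional[logging.Logger] = None) -> Dict[str, Dict[str, List[str]]]:
    """Parse a JSON taxonomy string; log and re-raise on invalid JSON."""
    log = logger or logging.getLogger(__name__)
    try:
        taxonomy = json.loads(taxonomy_str)
    except json.JSONDecodeError:
        # Assumption: the error is logged, then propagated to the caller.
        log.error("Taxonomy string is not valid JSON")
        raise
    return taxonomy


taxonomy = load_taxonomy('{"top": {"mid": ["low1", "low2"]}}')
print(taxonomy["top"]["mid"])  # ['low1', 'low2']
```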
utils.unicode_chars_dict module#
utils.utilities module#
- class academic_metrics.utils.utilities.Utilities(*, strategy_factory, warning_manager)[source]#
Bases:
object
A class containing various utility methods for processing and analyzing academic data.
- strategy_factory#
An instance of the StrategyFactory class.
- Type:
- warning_manager#
An instance of the WarningManager class.
- Type:
- get_attributes(self, data, attributes)[source]#
Extracts specified attributes from the data and returns them in a dictionary.
- crossref_file_splitter(self, *, path_to_file, split_files_dir_path)[source]#
Splits a crossref file into individual entries and creates a separate file for each entry in the specified output directory.
- make_files(self, *, path_to_file: str, split_files_dir_path: str)
Splits a document into individual entries and creates a separate file for each entry in the specified output directory.
- __init__(*, strategy_factory, warning_manager)[source]#
Initializes the Utilities class with the provided strategy factory and warning manager.
- Parameters:
strategy_factory (StrategyFactory) – An instance of the StrategyFactory class.
warning_manager (WarningManager) – An instance of the WarningManager class.
- get_attributes(data, attributes)[source]#
Extracts specified attributes from the article entry and returns them in a dictionary. It also warns about missing or invalid attributes.
- Parameters:
- Returns:
A dictionary where keys are attribute names and values are tuples. Each tuple contains a boolean indicating success or failure of extraction, and the extracted attribute value or None.
- Return type:
- Raises:
ValueError – If an attribute not defined in self.attribute_patterns is requested.
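The return shape described above, a dictionary of `(success, value)` tuples, can be sketched as follows. Everything specific here is hypothetical: the regex table `ATTRIBUTE_PATTERNS`, the `TI`/`PY` field tags, and the free-function form are invented for illustration, and the real method works through its strategy factory and warning manager.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical patterns; the real attribute definitions are not shown on this page.
ATTRIBUTE_PATTERNS = {
    "title": re.compile(r"^TI (.+)$", re.MULTILINE),
    "year": re.compile(r"^PY (\d{4})$", re.MULTILINE),
}


def get_attributes(data: str, attributes: List[str]) -> Dict[str, Tuple[bool, Optional[str]]]:
    """Map each requested attribute to (found, value-or-None)."""
    results: Dict[str, Tuple[bool, Optional[str]]] = {}
    for name in attributes:
        if name not in ATTRIBUTE_PATTERNS:
            raise ValueError(f"unknown attribute: {name!r}")
        match = ATTRIBUTE_PATTERNS[name].search(data)
        results[name] = (True, match.group(1)) if match else (False, None)
    return results


entry = "TI A Study of Sets\nPY 2021\n"
print(get_attributes(entry, ["title", "year"]))
# {'title': (True, 'A Study of Sets'), 'year': (True, '2021')}
```

Returning a success flag alongside the value lets callers distinguish a missing attribute from one whose extracted value happens to be empty.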
- crossref_file_splitter(*, path_to_file, split_files_dir_path)[source]#
Splits a crossref file into individual entries and creates a separate file for each entry in the specified output directory.
- make_files(*, path_to_file, split_files_dir_path)[source]#
Splits a document into individual entries and creates a separate file for each entry in the specified output directory.
- Parameters:
- Returns:
A dictionary where each key is the number of the entry (starting from 1) and each value is the path to the corresponding file.
- Return type:
file_paths
This method first splits the document into individual entries using the splitter method. It then iterates over each entry, extracts the attributes needed to form a filename, ensures the output directory exists, and writes each entry’s content to a new file in the output directory. Finally, it returns the file_paths dictionary so that any specific document can be referenced easily later.
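The split-and-write flow described above might look like the sketch below. It makes two simplifying assumptions that differ from the documented method: entries are separated by blank lines, and files are named by entry number rather than by extracted attributes.

```python
import tempfile
from pathlib import Path
from typing import Dict


def make_files(*, path_to_file: str, split_files_dir_path: str) -> Dict[int, str]:
    """Write each entry of a document to its own file; return {entry number: path}."""
    text = Path(path_to_file).read_text(encoding="utf-8")
    # Assumption: entries are separated by blank lines.
    entries = [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]

    out_dir = Path(split_files_dir_path)
    out_dir.mkdir(parents=True, exist_ok=True)  # ensure the output directory exists

    file_paths: Dict[int, str] = {}
    for number, entry in enumerate(entries, start=1):
        # Assumption: numbered filenames instead of attribute-based ones.
        target = out_dir / f"entry_{number}.txt"
        target.write_text(entry + "\n", encoding="utf-8")
        file_paths[number] = str(target)
    return file_paths


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "document.txt"
    source.write_text("first entry\n\nsecond entry\n", encoding="utf-8")
    paths = make_files(path_to_file=str(source),
                       split_files_dir_path=str(Path(tmp) / "split"))
    print(sorted(paths))  # [1, 2]
```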
utils.warning_manager module#
- exception academic_metrics.utils.warning_manager.CustomWarning(category, message, entry_id=None)[source]#
Bases:
Warning
Custom warning class to store warning details.
- __init__(self, category: str, message: str, entry_id: str = None)
Initializes the CustomWarning class with the provided category, message, and entry ID.
- Summary:
A custom warning class that stores warning details (category, message, and an optional entry ID) in a structured way.
- class academic_metrics.utils.warning_manager.WarningManager[source]#
Bases:
object
Class to manage warnings.
- log_warning(self, category: str, warning_message: str, entry_id: str = None) -> CustomWarning
Logs a warning with the provided category, message, and entry ID.
- Parameters:
- Returns:
The warning that was logged.
- Return type:
- display_warning_summary(self)[source]#
Displays the summary of the warnings.
- Parameters:
None
- Returns:
None
- Summary:
This class manages warnings: it stores them in a list and displays a summary of them.
- __init__()[source]#
Initializes the WarningManager class.
- Parameters:
None
- Summary:
This method initializes the WarningManager class.
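To show how the two classes in this module fit together, here is a minimal sketch under stated assumptions. The method signatures follow the entries above; the attribute names (`warnings`, `category`, `message`, `entry_id`) and the per-category count format of the summary are illustrative guesses, not the library's actual output.

```python
from collections import Counter
from typing import List, Optional


class CustomWarning(Warning):
    """Warning that carries a category, a message, and an optional entry ID."""

    def __init__(self, category: str, message: str, entry_id: Optional[str] = None):
        super().__init__(message)
        self.category = category
        self.message = message
        self.entry_id = entry_id


class WarningManager:
    """Stores warnings in a list and prints a per-category summary."""

    def __init__(self) -> None:
        self.warnings: List[CustomWarning] = []  # assumed attribute name

    def log_warning(self, category: str, warning_message: str,
                    entry_id: Optional[str] = None) -> CustomWarning:
        warning = CustomWarning(category, warning_message, entry_id)
        self.warnings.append(warning)
        return warning

    def display_warning_summary(self) -> None:
        # Assumed format: one "category: count" line per category.
        counts = Counter(w.category for w in self.warnings)
        for category, count in counts.items():
            print(f"{category}: {count}")


manager = WarningManager()
manager.log_warning("Missing Attribute", "title not found", entry_id="entry-1")
manager.log_warning("Missing Attribute", "year not found", entry_id="entry-2")
manager.display_warning_summary()  # Missing Attribute: 2
```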