postprocessing package
Submodules
postprocessing.BasePostprocessor module
- class academic_metrics.postprocessing.BasePostprocessor.BasePostprocessor(attribute_name, minhash_util, threshold=0.5)
Bases: object
A class responsible for processing and standardizing attribute data across different categories.
This class provides methods to extract attribute sets from category data, remove near-duplicate values, and standardize attribute values so that they are consistent across categories. It uses MinHash to estimate the similarity between values and thereby identify near-duplicates. It also maintains a dictionary of value variations that tracks the most frequent spelling of each value.
- processed_sets_list
Stores processed attribute sets after deduplication and standardization.
- Type: List[Set[str]]
- minhash_util
Utility for generating MinHash signatures and comparing them.
- Type: MinHashUtility
- extract_sets(category_dict)
Extracts the specified attribute from each CategoryInfo object in the provided dictionary.
- remove_near_duplicates(category_dict)
Processes each CategoryInfo object to remove near-duplicate attribute values and standardize them.
- standardized_data_update(category_dict, standardized_sets)
Updates CategoryInfo objects with standardized attribute sets.
- standardize_attribute(category_dict)
Standardizes attribute values across all categories based on the most frequent spelling variations.
- remove_update_attribute(category_dict, attribute_sets_list)
Removes near-duplicate attribute values within each attribute set.
- duplicate_postprocessor(attribute_set, attribute_sets, similarity_threshold)
Processes a set of attribute values to remove near-duplicates.
- process_value_pair(similarity_threshold, most_frequent_variation, value_signatures, to_remove, v1, v2)
Compares two attribute values and determines which to keep.
- value_to_remove(most_frequent_variation, v1, v2, v1_normalized, v2_normalized)
Determines which of two attribute values to remove based on their variations.
- get_duplicate_utilities(attribute_set, attribute_sets)
Generates utilities needed for duplicate removal.
- generate_signatures(attribute_set)
Generates MinHash signatures for each value in an attribute set.
- get_most_frequent_variation(attribute_sets_list)
Maps each normalized value to its most frequent spelling variation.
- standardize_values_across_sets(attribute_sets_list)
Standardizes values in attribute sets based on the most frequent value variation.
- extract_sets(category_dict)
Extracts the specified attribute from each CategoryInfo object in the provided dictionary.
This method iterates over a dictionary of CategoryInfo objects and collects the set from the attribute specified by self.attribute_name from each object. These sets are typically used for further processing, such as deduplication or analysis.
- Parameters:
category_dict (Dict[str, CategoryInfo]) – A dictionary where the keys are category identifiers and the values are CategoryInfo objects.
- Returns:
A list containing the attribute set from each CategoryInfo object.
- Return type:
List[Set[str]]
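A minimal sketch of this extraction step, assuming a simplified stand-in for CategoryInfo. The CategoryInfoStub dataclass, its faculty field, and the extract_sets_sketch function are hypothetical names used only for illustration; the real method reads the attribute name from self.attribute_name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CategoryInfoStub:
    """Hypothetical stand-in for CategoryInfo with one set-valued attribute."""
    faculty: Set[str] = field(default_factory=set)


def extract_sets_sketch(
    category_dict: Dict[str, CategoryInfoStub], attribute_name: str
) -> List[Set[str]]:
    # Collect the named attribute from every category, in dictionary order.
    return [getattr(info, attribute_name) for info in category_dict.values()]


categories = {
    "math": CategoryInfoStub(faculty={"Ada Lovelace", "ada lovelace"}),
    "cs": CategoryInfoStub(faculty={"Alan Turing"}),
}
extracted = extract_sets_sketch(categories, "faculty")
```

The result holds one set per category, in the same order as the dictionary, which is what the later deduplication steps appear to rely on.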
- remove_near_duplicates(*, category_dict)
Processes each CategoryInfo object to remove near-duplicate values and standardize them across categories.
This method orchestrates several steps to enhance the integrity and consistency of the specified attribute:
1. Extract sets from each category
2. Remove near-duplicate values within each set
3. Standardize values across all categories
4. Update the category data with cleaned and standardized sets
- Parameters:
category_dict (Dict[str, CategoryInfo]) – A dictionary where the keys are category identifiers and the values are CategoryInfo objects.
- Returns:
The updated dictionary with cleaned and standardized values across all CategoryInfo objects.
- Return type:
Dict[str, CategoryInfo]
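The four steps listed for remove_near_duplicates can be sketched as a small orchestration function. This is an illustration under stated assumptions, not the library's implementation: the deduplication and standardization steps are passed in as plain callables (drop_case_variants and identity are hypothetical stand-ins), and SimpleNamespace objects stand in for CategoryInfo.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, List, Set


def remove_near_duplicates_sketch(
    category_dict: Dict[str, Any],
    attribute_name: str,
    dedupe_set: Callable[[Set[str]], Set[str]],
    standardize_sets: Callable[[List[Set[str]]], List[Set[str]]],
) -> Dict[str, Any]:
    # 1. Extract the attribute set from each category.
    sets = [getattr(info, attribute_name) for info in category_dict.values()]
    # 2. Remove near-duplicates within each set.
    deduped = [dedupe_set(values) for values in sets]
    # 3. Standardize values across all categories.
    standardized = standardize_sets(deduped)
    # 4. Write the cleaned sets back onto the category objects.
    for info, cleaned in zip(category_dict.values(), standardized):
        setattr(info, attribute_name, cleaned)
    return category_dict


def drop_case_variants(values: Set[str]) -> Set[str]:
    # Hypothetical stand-in for the MinHash step: keep one spelling per lowercase form.
    kept: Dict[str, str] = {}
    for value in sorted(values):
        kept.setdefault(value.lower(), value)
    return set(kept.values())


def identity(sets: List[Set[str]]) -> List[Set[str]]:
    # Hypothetical stand-in for cross-category standardization: no change.
    return sets


categories = {"math": SimpleNamespace(departments={"Math", "math", "Physics"})}
result = remove_near_duplicates_sketch(
    categories, "departments", drop_case_variants, identity
)
```

Passing the steps in as callables keeps the sketch short; in the documented class they correspond to remove_update_attribute, standardize_attribute, and standardized_data_update.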
- standardized_data_update(category_dict, standardized_sets)
Updates the CategoryInfo objects in the dictionary with standardized sets.
This method iterates over the category dictionary and updates each CategoryInfo object with the corresponding standardized set for the specified attribute.
- Parameters:
category_dict (Dict[str, CategoryInfo]) – A dictionary where the keys are category identifiers and the values are CategoryInfo objects.
standardized_sets (List[Set[str]]) – A list of sets containing standardized values.
- Return type:
None
- standardize_attribute(category_dict)
Standardizes attribute values across all categories based on the most frequent spelling variations.
This method extracts updated attribute sets after duplicate removal and standardizes values across all sets based on the most frequent global variation.
- Parameters:
category_dict (Dict[str, CategoryInfo]) – A dictionary where the keys are category identifiers and the values are CategoryInfo objects.
- Returns:
A list of sets containing the standardized attribute values across all categories.
- Return type:
List[Set[str]]
- remove_update_attribute(category_dict, attribute_sets_list)
Removes near-duplicate values within each set based on MinHash similarity.
This method iterates over each category and processes the attribute set to remove near-duplicates, updating the specified attribute of each CategoryInfo object.
- Parameters:
category_dict (Dict[str, CategoryInfo]) – A dictionary where the keys are category identifiers and the values are CategoryInfo objects.
attribute_sets_list (List[Set[str]]) – A list containing the attribute set from each CategoryInfo object.
- Return type:
None
- duplicate_postprocessor(attribute_set, attribute_sets, similarity_threshold=0.5)
Processes a set of values to remove near-duplicates based on MinHash similarity and most frequent variations.
This method first generates the necessary utilities for comparison and removal. It then compares each value against all others in the set for near duplicates. If a value is deemed to be a duplicate based on MinHash similarity and the most frequent variation, it is added to the set of values to be removed. Finally, the refined set is returned, excluding any values deemed to be duplicates.
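- Parameters:
attribute_set (Set[str]) – The set of values to process.
attribute_sets (List[Set[str]]) – The attribute sets from all categories.
similarity_threshold (float) – Threshold for considering values as duplicates.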
- Returns:
The refined set, excluding any values deemed to be duplicates.
- Return type:
Set[str]
- process_value_pair(similarity_threshold, most_frequent_variation, value_signatures, to_remove, v1, v2)
Compares two values and determines which one to keep based on MinHash similarity and most frequent variation.
This method first compares the MinHash signatures of the two values to determine their similarity. If the similarity exceeds the specified threshold, it then determines which value to remove based on the most frequent variation. The value not chosen as the most frequent variation is added to the set of values to be removed.
- Parameters:
similarity_threshold (float) – Threshold for considering values as duplicates.
most_frequent_variation (Dict[str, str]) – Dictionary mapping normalized values to their most frequent variations.
value_signatures (Dict[str, List[int]]) – Dictionary of MinHash signatures.
to_remove (Set[str]) – Set of values to be removed.
v1 (str) – First value to compare.
v2 (str) – Second value to compare.
- Return type:
None
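The similarity check described for process_value_pair can be illustrated with the standard MinHash estimator, in which the fraction of matching signature positions approximates the Jaccard similarity. How MinHashUtility actually compares signatures is not documented here, so estimated_similarity and is_near_duplicate below are hypothetical helpers, not the library's API.

```python
from typing import List


def estimated_similarity(sig_a: List[int], sig_b: List[int]) -> float:
    # Standard MinHash estimate: share of positions where the two signatures agree.
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    if not sig_a:
        return 0.0
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def is_near_duplicate(sig_a: List[int], sig_b: List[int], threshold: float = 0.5) -> bool:
    # Values count as duplicates only when the estimate exceeds the threshold.
    return estimated_similarity(sig_a, sig_b) > threshold


close_pair = is_near_duplicate([1, 2, 3, 4], [1, 2, 9, 4])
borderline_pair = is_near_duplicate([1, 2, 3, 4], [1, 9, 9, 4])
```

With a strict "exceeds" comparison, a pair whose estimate equals the threshold exactly is not treated as a duplicate in this sketch.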
- value_to_remove(most_frequent_variation, v1, v2, v1_normalized, v2_normalized)
Determines which of two values to remove based on their normalized forms and most frequent variations.
This method checks if the normalized form of each value matches its most frequent variation. If one value matches its most frequent variation and the other does not, the non-matching value is chosen for removal. If neither or both values match their most frequent variations, the lexicographically greater value is chosen for removal.
- Parameters:
most_frequent_variation (Dict[str, str]) – Dictionary mapping normalized values to their most frequent variations.
v1 (str) – First original value to compare.
v2 (str) – Second original value to compare.
v1_normalized (str) – Normalized form of v1.
v2_normalized (str) – Normalized form of v2.
- Returns:
The value to be removed.
- Return type:
str
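The selection rule for value_to_remove can be sketched as follows. The sketch assumes that "matches its most frequent variation" means the original value is the spelling recorded for its normalized form; that reading, and the function name value_to_remove_sketch, are assumptions rather than confirmed behavior.

```python
from typing import Dict


def value_to_remove_sketch(
    most_frequent_variation: Dict[str, str],
    v1: str,
    v2: str,
    v1_normalized: str,
    v2_normalized: str,
) -> str:
    # Assumption: a value "matches" when it is the recorded preferred spelling.
    v1_preferred = most_frequent_variation.get(v1_normalized) == v1
    v2_preferred = most_frequent_variation.get(v2_normalized) == v2
    if v1_preferred and not v2_preferred:
        return v2
    if v2_preferred and not v1_preferred:
        return v1
    # Neither or both match: remove the lexicographically greater value.
    return max(v1, v2)


mapping = {"datascience": "Data Science"}
removed = value_to_remove_sketch(
    mapping, "Data Science", "data science", "datascience", "datascience"
)
tie_removed = value_to_remove_sketch({}, "Alpha", "Beta", "alpha", "beta")
```

The lexicographic tie-break makes the outcome deterministic regardless of the order in which the pair is compared.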
- get_duplicate_utilities(attribute_set, attribute_sets)
Generates utilities needed for duplicate removal.
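- Parameters:
attribute_set (Set[str]) – The set of values to process.
attribute_sets (List[Set[str]]) – The attribute sets from all categories.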
- Returns:
A tuple containing:
- most_frequent_variation: Dictionary mapping normalized values to their most frequent variations
- value_signatures: Dictionary mapping each value to its MinHash signature
- to_remove: Empty set for collecting values to be removed
- Return type:
Tuple[Dict[str, str], Dict[str, List[int]], Set[str]]
- generate_signatures(attribute_set)
Generates MinHash signatures for each value in the given set.
This method tokenizes each value into n-grams, computes a MinHash signature for these n-grams, and stores the result. A MinHash signature is a compact representation of the set of n-grams and is used to estimate the similarity between sets of values.
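A rough sketch of n-gram tokenization followed by MinHash, to make the idea concrete. The real work is delegated to MinHashUtility, whose interface is not shown here; the character 3-grams, lowercasing, 64 hash functions, and salted BLAKE2b hashing below are all assumptions chosen for illustration.

```python
import hashlib
from typing import Dict, List, Set


def char_ngrams(value: str, n: int = 3) -> Set[str]:
    # Assumption: lowercase character n-grams; short strings yield a single token.
    text = value.lower()
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(tokens: Set[str], num_hashes: int = 64) -> List[int]:
    # One salted hash function per position; keep the minimum hash over all tokens.
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        token.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "big",
                )
                for token in tokens
            )
        )
    return signature


def generate_signatures_sketch(attribute_set: Set[str]) -> Dict[str, List[int]]:
    # Map each original value to the signature of its n-gram set.
    return {value: minhash_signature(char_ngrams(value)) for value in attribute_set}


signatures = generate_signatures_sketch(
    {"Computer Science", "computer science", "Biology"}
)
```

Because this sketch lowercases before tokenizing, spellings that differ only in case receive identical signatures.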
- get_most_frequent_variation(attribute_sets_list)
Creates a dictionary that maps each unique normalized value to its most commonly occurring spelling variation across all provided sets.
A ‘normalized value’ is derived by converting the original value to lowercase and removing all spaces, which helps in identifying different spellings of the same value as equivalent. The ‘most frequent variation’ refers to the spelling of the value that appears most often in the data, maintaining the original case and spaces.
- Parameters:
attribute_sets_list (List[Set[str]]) – A list where each set contains values from a specific category. Each set may include various spelling variations.
- Returns:
A dictionary with normalized values as keys and their most frequent original spelling variations as values.
- Return type:
Dict[str, str]
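A minimal sketch of this mapping, using the normalization described for get_most_frequent_variation (lowercase, spaces removed). The function names and the alphabetical tie-break between equally frequent spellings are assumptions; the source does not say how ties are resolved.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Dict, List, Set


def normalize(value: str) -> str:
    # Lowercase and strip spaces so spelling variants share one key.
    return value.lower().replace(" ", "")


def most_frequent_variation_sketch(attribute_sets_list: List[Set[str]]) -> Dict[str, str]:
    counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for values in attribute_sets_list:
        for value in values:
            counts[normalize(value)][value] += 1
    # Pick the most common spelling; ties go to the alphabetically first (assumption).
    return {
        key: min(counter, key=lambda spelling: (-counter[spelling], spelling))
        for key, counter in counts.items()
    }


sets = [{"Data Science", "data science"}, {"Data Science"}, {"DataScience"}]
variation_map = most_frequent_variation_sketch(sets)
```

Here "Data Science" appears in two sets while the other spellings appear once each, so it becomes the variation recorded for the key "datascience".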
- standardize_values_across_sets(attribute_sets_list)
Standardizes values across all sets by mapping each value to its most frequent variation across all sets.
This method first generates a mapping of the most frequent variations for all values, then uses this mapping to standardize each value in each set. If a value has no recorded frequent variation, it remains unchanged.
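The standardization step can be sketched as a lookup with a fallback. To keep the example short, the mapping is passed in rather than generated inside the function, which differs from the documented method; standardize_sets_sketch and normalize are hypothetical names.

```python
from typing import Dict, List, Set


def normalize(value: str) -> str:
    # Same normalization as described above: lowercase, spaces removed.
    return value.lower().replace(" ", "")


def standardize_sets_sketch(
    attribute_sets_list: List[Set[str]], most_frequent_variation: Dict[str, str]
) -> List[Set[str]]:
    # Replace each value with its recorded variation; unknown values stay unchanged.
    return [
        {most_frequent_variation.get(normalize(value), value) for value in values}
        for values in attribute_sets_list
    ]


mapping = {"datascience": "Data Science"}
standardized = standardize_sets_sketch(
    [{"data science", "Physics"}, {"DataScience"}], mapping
)
```

Because the results are sets, two variants in the same category collapse into a single entry once they map to the same spelling.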
postprocessing.DepartmentPostprocessor module
- class academic_metrics.postprocessing.DepartmentPostprocessor.DepartmentPostprocessor(minhash_util, threshold=0.7)
Bases: BasePostprocessor
- __init__(minhash_util, threshold=0.7)
Initialize the DepartmentPostprocessor with a MinHashUtility instance.
- Parameters:
minhash_util (MinHashUtility) – A MinHashUtility instance for minhash operations.
- remove_near_duplicates(*, category_dict)
Remove near-duplicate department names from the category dictionary.
- Parameters:
category_dict (Dict[str, CategoryInfo]) – A dictionary mapping category names to CategoryInfo objects.
- Returns:
A dictionary mapping category names to CategoryInfo objects with near-duplicate department names removed.
- Return type:
Dict[str, CategoryInfo]
postprocessing.FacultyPostprocessor module
- class academic_metrics.postprocessing.FacultyPostprocessor.FacultyPostprocessor(minhash_util, threshold=0.5)
Bases: BasePostprocessor
- __init__(minhash_util, threshold=0.5)
Initialize the FacultyPostprocessor with a MinHashUtility instance.
- Parameters:
minhash_util (MinHashUtility) – A MinHashUtility instance for minhash operations.
- remove_near_duplicates(*, category_dict)
Remove near-duplicate faculty names from the category dictionary.
- Parameters:
category_dict (Dict[str, CategoryInfo]) – A dictionary mapping category names to CategoryInfo objects.
- Returns:
A dictionary mapping category names to CategoryInfo objects with near-duplicate faculty names removed.
- Return type:
Dict[str, CategoryInfo]