postprocessing package#

Submodules#

postprocessing.BasePostprocessor module#

class academic_metrics.postprocessing.BasePostprocessor.BasePostprocessor(attribute_name, minhash_util, threshold=0.5)[source]#

Bases: object

A class responsible for processing and standardizing attribute data across different categories.

This class provides methods to extract attribute sets from category data, remove near-duplicate values, and standardize attribute values so that they are consistent across categories. It uses MinHash to estimate the similarity between values, which lets it identify and remove near-duplicates. It also maintains a dictionary of value variations to track the most frequent spelling variation of each value.

processed_sets_list#

Stores processed attribute sets after deduplication and standardization.

Type:

list

minhash_util#

Utility for generating MinHash signatures and comparing them.

Type:

MinHashUtility

value_variations#

Stores NameVariation objects for each normalized attribute value.

Type:

dict

extract_sets(category_dict)[source]#

Extracts the specified attribute from each CategoryInfo object in the provided dictionary.

remove_near_duplicates(category_dict)[source]#

Processes each CategoryInfo object to remove near-duplicate attribute values and standardize them.

standardized_data_update(category_dict, standardized_sets)[source]#

Updates CategoryInfo objects with standardized attribute sets.

standardize_attribute(category_dict)[source]#

Standardizes attribute values across all categories based on the most frequent spelling variations.

remove_update_attribute(category_dict, attribute_sets_list)[source]#

Removes near-duplicate attribute values within each attribute set.

duplicate_postprocessor(attribute_set, attribute_sets, similarity_threshold)[source]#

Processes a set of attribute values to remove near-duplicates.

process_value_pair(similarity_threshold, most_frequent_variation, value_signatures, to_remove, v1, v2)[source]#

Compares two attribute values and determines which to keep.

value_to_remove(most_frequent_variation, v1, v2, v1_normalized, v2_normalized)[source]#

Determines which of two attribute values to remove based on their variations.

get_duplicate_utilities(attribute_set, attribute_sets)[source]#

Generates utilities needed for duplicate removal.

generate_signatures(attribute_set)[source]#

Generates MinHash signatures for each value in an attribute set.

get_most_frequent_variation(attribute_sets_list)#

Maps each normalized value to its most frequent spelling variation.

standardize_values_across_sets(attribute_sets_list)[source]#

Standardizes values in attribute sets based on the most frequent value variation.

extract_sets(category_dict)[source]#

Extracts the specified attribute from each CategoryInfo object in the provided dictionary.

This method iterates over a dictionary of CategoryInfo objects and collects, from each object, the set stored in the attribute named by self.attribute_name. These sets are typically used for further processing, such as deduplication or analysis.

Parameters:

category_dict (Dict[str, CategoryInfo]) – A dictionary where the keys are category identifiers and the values are CategoryInfo objects.

Returns:

A list containing the attribute set from each CategoryInfo object.

Return type:

List[Set[str]]
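
The extraction step can be illustrated with a minimal sketch. The CategoryInfo dataclass below is a hypothetical stand-in with only two fields, and attribute_name is passed as an argument rather than read from self; neither is taken from the library's source.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CategoryInfo:
    """Hypothetical stand-in; the real CategoryInfo is assumed to have more fields."""
    faculty: Set[str] = field(default_factory=set)
    departments: Set[str] = field(default_factory=set)


def extract_sets(category_dict: Dict[str, CategoryInfo], attribute_name: str) -> List[Set[str]]:
    # Collect the named attribute from every CategoryInfo, in dictionary order.
    return [getattr(info, attribute_name) for info in category_dict.values()]


categories = {
    "Biology": CategoryInfo(faculty={"Jane Doe", "jane doe"}),
    "Chemistry": CategoryInfo(faculty={"John Roe"}),
}
faculty_sets = extract_sets(categories, "faculty")
```

Because the result is a list ordered like the dictionary, later steps can pair each set back with its category by position.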

remove_near_duplicates(*, category_dict)[source]#

Processes each CategoryInfo object to remove near-duplicate values and standardize them across categories.

This method orchestrates several steps to enhance the integrity and consistency of the specified attribute:

  1. Extract sets from each category.

  2. Remove near-duplicate values within each set.

  3. Standardize values across all categories.

  4. Update the category data with the cleaned and standardized sets.

Parameters:

category_dict (Dict[str, CategoryInfo]) – A dictionary where the keys are category identifiers and the values are CategoryInfo objects.

Returns:

The updated dictionary with cleaned and standardized values across all CategoryInfo objects.

Return type:

Dict[str, CategoryInfo]

standardized_data_update(category_dict, standardized_sets)[source]#

Updates the CategoryInfo objects in the dictionary with standardized sets.

Parameters:
  • category_dict (Dict[str, CategoryInfo]) – A dictionary where the keys are category identifiers and the values are CategoryInfo objects.

  • standardized_sets (List[Set[str]]) – A list of sets containing standardized values.

Return type:

None

This method iterates over the category dictionary and updates each CategoryInfo object with the corresponding standardized set for the specified attribute.
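
A rough sketch of this update, assuming the standardized sets are in the same order as the dictionary's values. SimpleNamespace stands in for CategoryInfo, and the attribute_name argument is a simplification of the instance attribute; both are assumptions for illustration.

```python
from types import SimpleNamespace
from typing import Dict, List, Set


def standardized_data_update(
    category_dict: Dict[str, SimpleNamespace],
    standardized_sets: List[Set[str]],
    attribute_name: str,
) -> None:
    # zip pairs each category with the set at the same position,
    # so this relies on the dictionary keeping its insertion order.
    for info, standardized in zip(category_dict.values(), standardized_sets):
        setattr(info, attribute_name, standardized)


categories = {
    "Biology": SimpleNamespace(departments={"Dept. of Biology", "dept. of biology"}),
    "Chemistry": SimpleNamespace(departments={"Chemistry"}),
}
standardized_data_update(categories, [{"Dept. of Biology"}, {"Chemistry"}], "departments")
```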

standardize_attribute(category_dict)[source]#

Standardizes attribute values across all categories based on the most frequent spelling variations.

Parameters:

category_dict (Dict[str, CategoryInfo]) – A dictionary where the keys are category identifiers and the values are CategoryInfo objects.

Returns:

A list of sets containing the standardized attribute values across all categories.

Return type:

List[Set[str]]

This method extracts updated attribute sets after duplicate removal and standardizes values across all sets based on the most frequent global variation.

remove_update_attribute(category_dict, attribute_sets_list)[source]#

Removes near-duplicate values within each set based on MinHash similarity.

Parameters:
  • category_dict (Dict[str, CategoryInfo]) – A dictionary where the keys are category identifiers and the values are CategoryInfo objects.

  • attribute_sets_list (List[Set[str]]) – A list containing the attribute set from each CategoryInfo object.

Return type:

None

This method iterates over each category and processes the attribute set to remove near-duplicates, updating the specified attribute of each CategoryInfo object.

duplicate_postprocessor(attribute_set, attribute_sets, similarity_threshold=0.5)[source]#

Processes a set of values to remove near-duplicates based on MinHash similarity and most frequent variations.

This method first generates the necessary utilities for comparison and removal. It then compares each value against all others in the set for near duplicates. If a value is deemed to be a duplicate based on MinHash similarity and the most frequent variation, it is added to the set of values to be removed. Finally, the refined set is returned, excluding any values deemed to be duplicates.

Parameters:
  • attribute_set (Set[str] | List[str]) – A set or list of values to be processed.

  • attribute_sets (List[Set[str]]) – A list of sets, where each set contains values from a different category.

  • similarity_threshold (float) – The threshold for considering values as duplicates based on MinHash similarity.

Returns:

The refined set, excluding any values deemed to be duplicates.

Return type:

Set[str]
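
The pairwise comparison described above might look roughly like the following. This sketch swaps in exact Jaccard similarity over character 3-grams in place of MinHash estimation, and represents the most frequent variations as a plain set named preferred; the function name dedupe_values and both simplifications are assumptions, not the library's implementation.

```python
from itertools import combinations
from typing import Iterable, Set


def ngrams(value: str, n: int = 3) -> Set[str]:
    # Character n-grams; strings shorter than n fall back to a single token.
    if len(value) < n:
        return {value}
    return {value[i:i + n] for i in range(len(value) - n + 1)}


def jaccard(a: str, b: str) -> float:
    grams_a, grams_b = ngrams(a), ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def dedupe_values(values: Iterable[str], preferred: Set[str], threshold: float = 0.5) -> Set[str]:
    values = set(values)
    to_remove: Set[str] = set()
    # Compare every unordered pair once.
    for v1, v2 in combinations(sorted(values), 2):
        if jaccard(v1, v2) > threshold:
            if v1 in preferred and v2 not in preferred:
                to_remove.add(v2)
            elif v2 in preferred and v1 not in preferred:
                to_remove.add(v1)
            else:
                # Tie: drop the lexicographically greater value.
                to_remove.add(max(v1, v2))
    return values - to_remove


result = dedupe_values({"Jane Doe", "Jane Doe.", "John Roe"}, preferred={"Jane Doe"})
```

Here "Jane Doe" and "Jane Doe." share six of seven 3-grams, so the pair exceeds the 0.5 threshold and the non-preferred spelling is dropped, while "John Roe" shares no 3-grams with either and is kept.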

process_value_pair(similarity_threshold, most_frequent_variation, value_signatures, to_remove, v1, v2)[source]#

Compares two values and determines which one to keep based on MinHash similarity and most frequent variation.

This method first compares the MinHash signatures of the two values to determine their similarity. If the similarity exceeds the specified threshold, it then determines which value to remove based on the most frequent variation. The value not chosen as the most frequent variation is added to the set of values to be removed.

Parameters:
  • similarity_threshold (float) – Threshold for considering values as duplicates.

  • most_frequent_variation (Dict[str, str]) – Dictionary mapping normalized values to their most frequent variations.

  • value_signatures (Dict[str, List[int]]) – Dictionary of MinHash signatures.

  • to_remove (Set[str]) – Set of values to be removed.

  • v1 (str) – First value to compare.

  • v2 (str) – Second value to compare.

Return type:

None

value_to_remove(most_frequent_variation, v1, v2, v1_normalized, v2_normalized)[source]#

Determines which of two values to remove based on their normalized forms and most frequent variations.

This method checks if the normalized form of each value matches its most frequent variation. If one value matches its most frequent variation and the other does not, the non-matching value is chosen for removal. If neither or both values match their most frequent variations, the lexicographically greater value is chosen for removal.

Parameters:
  • most_frequent_variation (Dict[str, str]) – Dictionary mapping normalized values to their most frequent variations.

  • v1 (str) – First original value to compare.

  • v2 (str) – Second original value to compare.

  • v1_normalized (str) – Normalized form of v1.

  • v2_normalized (str) – Normalized form of v2.

Returns:

The value to be removed.

Return type:

str
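
The decision rule can be sketched as below. It assumes that a value "matches" when it is identical to the most frequent variation recorded for its normalized form; that reading of the description is an interpretation and may differ from the actual source.

```python
from typing import Dict


def value_to_remove(
    most_frequent_variation: Dict[str, str],
    v1: str,
    v2: str,
    v1_normalized: str,
    v2_normalized: str,
) -> str:
    # Assumed meaning of "matches": the value equals the recorded variation.
    v1_matches = most_frequent_variation.get(v1_normalized) == v1
    v2_matches = most_frequent_variation.get(v2_normalized) == v2
    if v1_matches and not v2_matches:
        return v2
    if v2_matches and not v1_matches:
        return v1
    # Neither or both match: remove the lexicographically greater value.
    return max(v1, v2)


mapping = {"janedoe": "Jane Doe"}
removed = value_to_remove(mapping, "Jane Doe", "jane doe", "janedoe", "janedoe")
tie_removed = value_to_remove({}, "Ann Lee", "Anne Lee", "annlee", "annelee")
```

Falling back to lexicographic order keeps the outcome deterministic regardless of the order in which the two values are passed.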

get_duplicate_utilities(attribute_set, attribute_sets)[source]#

Generates utilities needed for duplicate removal.

Parameters:
  • attribute_set (set[str]) – A set of values for which to generate MinHash signatures.

  • attribute_sets (list[set[str]]) – A list of sets, where each set contains values from a different category.

Returns:

A tuple containing:
  • most_frequent_variation: Dictionary mapping normalized values to their most frequent variations

  • value_signatures: Dictionary mapping each value to its MinHash signature

  • to_remove: Empty set for collecting values to be removed

Return type:

tuple[Dict[str, str], Dict[str, List[int]], Set[str]]

generate_signatures(attribute_set)[source]#

Generates MinHash signatures for each value in the given set.

This method tokenizes each value into n-grams, computes a MinHash signature for these n-grams, and stores the result. A MinHash signature is a compact representation of the set of n-grams and is used to estimate the similarity between sets of values.

Parameters:

attribute_set (set[str]) – A set of values for which to generate MinHash signatures.

Returns:

A dictionary mapping each value to its corresponding MinHash signature.

Return type:

dict[str, list[int]]
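
For readers unfamiliar with MinHash, the following is a minimal, self-contained sketch of the technique. The n-gram width, the signature length, the SHA-256-based hash family, and the helper names are all assumptions chosen for illustration; MinHashUtility may tokenize and hash differently.

```python
import hashlib
from typing import Dict, Iterable, List, Set

NUM_HASHES = 64  # assumed signature length
NGRAM_SIZE = 3   # assumed n-gram width


def ngrams(value: str, n: int = NGRAM_SIZE) -> Set[str]:
    # Character n-grams; strings shorter than n fall back to a single token.
    if len(value) < n:
        return {value}
    return {value[i:i + n] for i in range(len(value) - n + 1)}


def seeded_hash(token: str, seed: int) -> int:
    # One deterministic hash function per seed, built from SHA-256.
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens: Iterable[str], num_hashes: int = NUM_HASHES) -> List[int]:
    tokens = list(tokens)
    # For each hash function, keep the minimum hash over all tokens.
    return [min(seeded_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def generate_signatures(attribute_set: Set[str]) -> Dict[str, List[int]]:
    return {value: minhash_signature(ngrams(value)) for value in attribute_set}


def estimated_similarity(sig_a: List[int], sig_b: List[int]) -> float:
    # The fraction of matching positions estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


signatures = generate_signatures({"Jane Doe", "Jane Doe."})
similarity = estimated_similarity(signatures["Jane Doe"], signatures["Jane Doe."])
```

The estimate becomes more accurate as the number of hash functions grows, at the cost of longer signatures.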

get_most_frequent_variation(attribute_sets_list)[source]#

Creates a dictionary that maps each unique normalized value to its most commonly occurring spelling variation across all provided sets.

A ‘normalized value’ is derived by converting the original value to lowercase and removing all spaces, which helps in identifying different spellings of the same value as equivalent. The ‘most frequent variation’ refers to the spelling of the value that appears most often in the data, maintaining the original case and spaces.

Parameters:

attribute_sets_list (List[Set[str]]) – A list where each set contains values from a specific category. Each set may include various spelling variations.

Returns:

A dictionary with normalized values as keys and their most frequent original spelling variations as values.

Return type:

Dict[str, str]
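
A minimal sketch of this mapping, using the normalization described above (lowercase, spaces removed). The normalize helper is a hypothetical name, and the behavior when two spellings are equally frequent is not specified in this documentation, so the sketch makes no promise about ties.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set


def normalize(value: str) -> str:
    # Lowercase and strip spaces so spelling variants share one key.
    return value.lower().replace(" ", "")


def get_most_frequent_variation(attribute_sets_list: List[Set[str]]) -> Dict[str, str]:
    counts: Dict[str, Counter] = defaultdict(Counter)
    for attribute_set in attribute_sets_list:
        for value in attribute_set:
            counts[normalize(value)][value] += 1
    # Keep the spelling seen most often for each normalized key.
    return {key: counter.most_common(1)[0][0] for key, counter in counts.items()}


sets = [{"Jane Doe", "John Roe"}, {"Jane Doe"}, {"jane doe"}]
variations = get_most_frequent_variation(sets)
```

In this example "Jane Doe" appears in two sets and "jane doe" in one, so the key "janedoe" maps to "Jane Doe".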

standardize_values_across_sets(attribute_sets_list)[source]#

Standardizes values across all sets by mapping each value to its most frequent variation across all sets.

This method first generates a mapping of the most frequent variations for all values, then uses this mapping to standardize each value in each set. If a value has no recorded frequent variation, it remains unchanged.

Parameters:

attribute_sets_list (List[Set[str]]) – A list of sets, where each set contains values from a different category.

Returns:

A list of sets containing the standardized values across all categories.

Return type:

List[Set[str]]
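
The standardization pass can be sketched as follows. For brevity the mapping is passed in as an argument, whereas the method described above builds it internally; that difference and the normalize helper are assumptions of this sketch.

```python
from typing import Dict, List, Set


def normalize(value: str) -> str:
    # Lowercase and strip spaces so spelling variants share one key.
    return value.lower().replace(" ", "")


def standardize_values_across_sets(
    attribute_sets_list: List[Set[str]],
    most_frequent_variation: Dict[str, str],
) -> List[Set[str]]:
    # Replace each value with its preferred spelling; values with no
    # recorded variation are left unchanged.
    return [
        {most_frequent_variation.get(normalize(value), value) for value in attribute_set}
        for attribute_set in attribute_sets_list
    ]


mapping = {"janedoe": "Jane Doe"}
standardized = standardize_values_across_sets(
    [{"jane doe", "Jane Doe"}, {"John Roe"}], mapping
)
```

Because each result is a set, two spellings that map to the same preferred form collapse into a single entry.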

postprocessing.DepartmentPostprocessor module#

class academic_metrics.postprocessing.DepartmentPostprocessor.DepartmentPostprocessor(minhash_util, threshold=0.7)[source]#

Bases: BasePostprocessor

__init__(minhash_util, threshold=0.7)[source]#

Initialize the DepartmentPostprocessor with a MinHashUtility instance.

Parameters:

minhash_util (MinHashUtility) – A MinHashUtility instance for minhash operations.

remove_near_duplicates(*, category_dict)[source]#

Remove near-duplicate department names from the category dictionary.

Parameters:

category_dict (Dict[str, CategoryInfo]) – A dictionary mapping category names to CategoryInfo objects.

Returns:

A dictionary mapping category names to CategoryInfo objects with near-duplicate department names removed.

Return type:

Dict[str, CategoryInfo]

postprocessing.FacultyPostprocessor module#

class academic_metrics.postprocessing.FacultyPostprocessor.FacultyPostprocessor(minhash_util, threshold=0.5)[source]#

Bases: BasePostprocessor

__init__(minhash_util, threshold=0.5)[source]#

Initialize the FacultyPostprocessor with a MinHashUtility instance.

Parameters:

minhash_util (MinHashUtility) – A MinHashUtility instance for minhash operations.

remove_near_duplicates(*, category_dict)[source]#

Remove near-duplicate faculty names from the category dictionary.

Parameters:

category_dict (Dict[str, CategoryInfo]) – A dictionary mapping category names to CategoryInfo objects.

Returns:

A dictionary mapping category names to CategoryInfo objects with near-duplicate faculty names removed.

Return type:

Dict[str, CategoryInfo]

Module contents#