tfidf_based_kg#

class predict_backend.ml.nlp.tfidf_based_kg.TfIdfBasedKG(asset_label, asset_name, asset_metadata=None, asset_version=1, min_df=1, max_df=1.0, description=None, stopwords=None)#

Bases: Asset

add_extra_features(kg, corpus_vectors, feature_names, nlp_module, num_top_entities=10, num_top_events=5)#

Add extra features (from the original dataset) to the KG nodes

Parameters:
  • kg (Graph) – The input NetworkX graph

  • corpus_vectors – Sparse matrix

  • feature_names (List[str]) – List of feature names

  • nlp_module (LanguageProcessor) – LanguageProcessor from which to extract the extra features

  • num_top_entities (int) – Number of top-entities to add to every node

  • num_top_events (int) – Number of top-events to add to every node

Return type:

Graph

Returns:

The NetworkX graph with the augmented nodes

compute_nlp_feature_vectors(nlp_module, return_out=False)#

Create a TF-IDF model from the extracted features stored in the nlp_module object. This function can also be used outside the context of the asset via the return_out parameter.

Parameters:
  • nlp_module (LanguageProcessor) – used to retrieve the features needed to build a TF-IDF model

  • return_out (bool) – controls whether the outputs are stored in the Asset or returned to the caller

Return type:

Optional[Tuple[TfidfVectorizer, csr_matrix]]

Returns:

If return_out is True, the created TF-IDF model and feature vector matrix; otherwise None

construct_knowledge_graph(nlp_module, similarity_threshold, drop_singletons=False, num_top_entities=10, num_top_events=5, include_nlp_features=True, save_kg=False)#
Return type:

Optional[Graph]
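The idea behind threshold-based KG construction can be sketched as follows: documents become nodes, and an edge is added whenever the cosine similarity of two documents' TF-IDF vectors exceeds similarity_threshold; singletons can optionally be dropped. The `build_kg` helper below is a hypothetical illustration, not the library implementation:

```python
# Hypothetical sketch of threshold-based KG construction: nodes are
# documents; an edge connects two documents when the cosine
# similarity of their TF-IDF vectors exceeds `similarity_threshold`.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_kg(docs, similarity_threshold=0.2, drop_singletons=False):
    vectors = TfidfVectorizer().fit_transform(docs)
    sim = cosine_similarity(vectors)
    kg = nx.Graph()
    kg.add_nodes_from(range(len(docs)))
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if sim[i, j] > similarity_threshold:
                kg.add_edge(i, j, weight=float(sim[i, j]))
    if drop_singletons:
        kg.remove_nodes_from([n for n in kg.nodes if kg.degree(n) == 0])
    return kg

docs = ["pump seal failure", "seal failure on pump", "annual budget report"]
kg = build_kg(docs, similarity_threshold=0.2)
```

With drop_singletons=True the unrelated third document would be removed from the graph, mirroring the drop_singletons parameter above.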

get_f_vectors()#

Retrieve the feature vector matrix from the persistence store and return it

Returns:

The stored sparse feature vector matrix

get_kg_corpus()#

Retrieve the KG corpus from the persistence store and return it

Return type:

DataFrame

Returns:

The stored kg corpus

get_tfidif_model()#

Retrieve the TF-IDF model from the persistence store and return it

Return type:

TfidfVectorizer

Returns:

The stored TF-IDF model

set_f_vectors(feature_vectors)#

Store the KG feature vectors in the persistence store and save their id as an object parameter

Parameters:

feature_vectors – sparse matrix

set_kg_corpus(corpus)#

Store the KG corpus in the persistence store and save its id as an object parameter

Parameters:

corpus (DataFrame) – The corpus to store

set_tfidif_model(tfidf_model)#

Store the TF-IDF model in the persistence store and save its id as an object parameter

Parameters:

tfidf_model (TfidfVectorizer) – Sklearn tfidf model

predict_backend.ml.nlp.tfidf_based_kg.analyze_entity_proportions(tf_idf_kg, nlp_module, feature='Segments', min_group_entity_count=10, min_group_entity_total=10, min_complement_group_entity_count=3, min_complement_group_entity_total=10, entity_ratio_difference=0.3, alpha=0.05, apply_bonferroni_correction=False, verbose=False)#

Apply the difference-of-proportions test to identify, for each feature value, the entities that appear in a higher proportion of the documents associated with that feature value than in the rest of the corpus.

predict_backend.ml.nlp.tfidf_based_kg.compute_opt_sim_mat(corpus_vectors)#

Computes the similarity matrix keeping only one strict triangle, filtering out the diagonal and the mirrored half. This way each pair appears only once as (a, b); (b, a), (a, a), and (b, b) are dropped.

Parameters:

corpus_vectors – Sparse matrix.

Returns:

SciPy COO sparse matrix.
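Assuming cosine similarity over the TF-IDF vectors, the triangle optimization can be sketched with scipy.sparse.triu, which returns a COO matrix containing only entries above the chosen diagonal (the `opt_sim_mat` helper is illustrative, not the module's implementation):

```python
# Hypothetical sketch of the triangle optimization, assuming cosine
# similarity: compute the document-document similarity matrix, then
# keep only the strict upper triangle so each pair (a, b) appears
# once and the diagonal (a, a) is dropped.
import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import cosine_similarity

def opt_sim_mat(corpus_vectors):
    sim = cosine_similarity(corpus_vectors, dense_output=False)
    return sp.triu(sim, k=1)  # k=1 excludes the main diagonal; COO output

vectors = sp.csr_matrix(np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]))
sim = opt_sim_mat(vectors)
```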

predict_backend.ml.nlp.tfidf_based_kg.difference_of_proportion_test(p1, n1, p2, n2, verbose=False)#

A two-tailed difference-of-proportions test. For more details, see: https://stattrek.com/hypothesis-test/difference-in-proportions.aspx
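A self-contained sketch of the linked two-proportion z-test, using a pooled proportion for the standard error as in the standard formulation (this is not necessarily the library's exact implementation):

```python
# Two-tailed two-proportion z-test with a pooled standard error.
# p1, p2 are the sample proportions; n1, n2 the sample sizes.
import math
from scipy.stats import norm

def diff_of_proportions_test(p1, n1, p2, n2):
    p = (p1 * n1 + p2 * n2) / (n1 + n2)      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))     # two-tailed
    return z, p_value

z, p_value = diff_of_proportions_test(0.38, 100, 0.51, 100)
```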

predict_backend.ml.nlp.tfidf_based_kg.get_ranked_nlp_features(feature_index, feature_vector_mapping, feature_names)#
predict_backend.ml.nlp.tfidf_based_kg.preprocessor(x)#
predict_backend.ml.nlp.tfidf_based_kg.similarity_threshold_experiment(tfidf_basedkg, nlp_module, thresholds, flow_metadata=None, starting_progress=None, target_progress=None, tfidf_model=None, corpus_vectors=None)#

Compute a Similarity Threshold experiment. For testing purposes, or to use this function outside Predict, pass the tfidf_model and corpus_vectors parameters instead of a tfidf_basedkg object. This lets you keep them in memory rather than going through the store interface.

Parameters:
  • tfidf_basedkg (Optional[TfIdfBasedKG]) – Optional. TfIdfBasedKG object

  • nlp_module (LanguageProcessor) – LanguageProcessor

  • thresholds (List[float]) – Thresholds to use in the Similarity Threshold experiment

  • flow_metadata (Dict) – The flow_metadata dict

  • starting_progress (Union[int, float, None]) – Base progress percentage

  • target_progress (Union[int, float, None]) – Target progress percentage

  • tfidf_model (Optional[TfidfVectorizer]) – Optional. TF-IDF model, used to extract the feature names

  • corpus_vectors – Optional. Sparse matrix with the score for every document and feature

Return type:

Tuple[Asset, DataFrame]

Returns:

A tuple with the experiment Asset and an output pandas DataFrame
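Stripped of the Asset and progress-reporting machinery, a threshold sweep reduces to counting, per candidate threshold, the document pairs whose similarity exceeds it. A hypothetical in-memory sketch (the `threshold_sweep` helper and its column names are illustrative):

```python
# Hypothetical in-memory version of a similarity-threshold sweep:
# for each candidate threshold, count the document pairs whose
# cosine similarity exceeds it, and collect the counts in a
# DataFrame for inspection.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def threshold_sweep(docs, thresholds):
    vectors = TfidfVectorizer().fit_transform(docs)
    sim = cosine_similarity(vectors)
    pair_sims = sim[np.triu_indices(len(docs), k=1)]  # each pair once
    rows = [{"threshold": t, "num_edges": int((pair_sims > t).sum())}
            for t in thresholds]
    return pd.DataFrame(rows)

docs = ["pump seal failure", "seal failure on pump", "annual budget report"]
df = threshold_sweep(docs, [0.1, 0.5, 0.9])
```

Inspecting how the edge count drops as the threshold rises is one way to choose similarity_threshold for construct_knowledge_graph.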

predict_backend.ml.nlp.tfidf_based_kg.tokenizer(x)#