tfidf_based_kg#
- class predict_backend.ml.nlp.tfidf_based_kg.TfIdfBasedKG(asset_label, asset_name, asset_metadata=None, asset_version=1, min_df=1, max_df=1.0, description=None, stopwords=None)#
Bases: Asset
- add_extra_features(kg, corpus_vectors, feature_names, nlp_module, num_top_entities=10, num_top_events=5)#
Add extra features (from the original dataset) to the KG nodes
- Parameters:
kg (Graph) – The input networkX graph
corpus_vectors – Sparse matrix
feature_names (List[str]) – List of feature names
nlp_module (LanguageProcessor) – LanguageProcessor from which to extract the extra features
num_top_entities (int) – Number of top entities to add to every node
num_top_events (int) – Number of top events to add to every node
- Return type:
Graph
- Returns:
The networkX object with the augmented nodes
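The internals of this method aren't shown here, but the general idea of attaching top-ranked items to graph nodes can be sketched with plain networkx. The helper name `add_top_items` and the `node_items` mapping are illustrative, not the library's API:

```python
import networkx as nx

def add_top_items(kg, node_items, num_top=3):
    # Attach the top-N scored items to each node as a node attribute.
    # node_items maps node -> list of (item, score) pairs, standing in
    # for the per-node entities/events extracted from the dataset.
    for node, items in node_items.items():
        ranked = sorted(items, key=lambda p: p[1], reverse=True)
        kg.nodes[node]["top_items"] = [name for name, _ in ranked[:num_top]]
    return kg

g = nx.Graph()
g.add_nodes_from(["doc1", "doc2"])
scores = {
    "doc1": [("acme", 0.9), ("bolt", 0.2), ("corp", 0.5)],
    "doc2": [("delta", 0.7)],
}
g = add_top_items(g, scores, num_top=2)
print(g.nodes["doc1"]["top_items"])  # ['acme', 'corp']
```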
- compute_nlp_feature_vectors(nlp_module, return_out=False)#
Create a TF-IDF model using the extracted features stored in the nlp_module object. This function can also be used outside the context of the asset via the return_out parameter.
- Parameters:
nlp_module (LanguageProcessor) – Used to retrieve the features needed to build the TF-IDF model
return_out (bool) – Controls whether to store the outputs in the Asset or return them
- Return type:
Optional
[Tuple
[TfidfVectorizer
,csr_matrix
]]- Returns:
If return_out, return the created TF-IDF model and feature vector matrix
- construct_knowledge_graph(nlp_module, similarity_threshold, drop_singletons=False, num_top_entities=10, num_top_events=5, include_nlp_features=True, save_kg=False)#
Construct the knowledge graph from the features stored in the nlp_module, connecting documents whose similarity exceeds similarity_threshold.
- Return type:
Optional[Graph]
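The method's internals aren't documented here, but the core of a similarity-threshold graph construction can be sketched with scikit-learn and networkx. This is a simplified stand-in, not the library's actual implementation; drop_singletons is mimicked by removing isolated nodes:

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["apple banana", "banana cherry", "unrelated text"]
vecs = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(vecs)

threshold = 0.1
kg = nx.Graph()
kg.add_nodes_from(range(len(docs)))
# Connect document pairs whose cosine similarity clears the threshold.
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if sim[i, j] >= threshold:
            kg.add_edge(i, j, weight=float(sim[i, j]))

# Equivalent of drop_singletons: remove nodes left without any edge.
singletons = [n for n in kg.nodes if kg.degree(n) == 0]
kg.remove_nodes_from(singletons)
print(sorted(kg.edges))  # docs 0 and 1 share "banana"
```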
- get_f_vectors()#
Retrieves the feature vector matrix from the persistence store and returns it
- Returns:
The stored sparse matrix
- get_kg_corpus()#
Retrieves the KG corpus from the persistence store and returns it
- Return type:
DataFrame
- Returns:
The stored KG corpus
- get_tfidif_model()#
Retrieves the TF-IDF model from the persistence store and returns it
- Return type:
TfidfVectorizer
- Returns:
The stored TF-IDF model
- set_f_vectors(feature_vectors)#
Stores the KG feature vectors in the persistence store and saves their id as an object parameter
- Parameters:
feature_vectors – Sparse matrix
- set_kg_corpus(corpus)#
Stores the KG corpus in the persistence store and saves its id as an object parameter
- Parameters:
corpus (DataFrame) – The corpus to store
- set_tfidif_model(tfidf_model)#
Stores the TF-IDF model in the persistence store and saves its id as an object parameter
- Parameters:
tfidf_model (TfidfVectorizer) – Scikit-learn TF-IDF model
- predict_backend.ml.nlp.tfidf_based_kg.analyze_entity_proportions(tf_idf_kg, nlp_module, feature='Segments', min_group_entity_count=10, min_group_entity_total=10, min_complement_group_entity_count=3, min_complement_group_entity_total=10, entity_ratio_difference=0.3, alpha=0.05, apply_bonferroni_correction=False, verbose=False)#
Applies a difference-of-proportions test to identify, for each feature value, the entities that appear in a higher proportion of the documents associated with that feature value than in the rest of the corpus.
- predict_backend.ml.nlp.tfidf_based_kg.compute_opt_sim_mat(corpus_vectors)#
Computes the lower triangle of the similarity matrix, filtering out the diagonal and everything above it. This way we keep only (a, b) and drop (b, a), (a, a), and (b, b).
- Parameters:
corpus_vectors – Sparse matrix.
- Returns:
COO SciPy sparse matrix.
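The strictly-lower-triangle extraction is a one-liner in SciPy, which also returns a COO matrix by default. A small sketch with made-up similarity values:

```python
import numpy as np
from scipy.sparse import csr_matrix, tril

# Dense symmetric similarity matrix (illustrative values).
sim = csr_matrix(np.array([
    [1.0, 0.4, 0.7],
    [0.4, 1.0, 0.2],
    [0.7, 0.2, 1.0],
]))

# k=-1 keeps only entries strictly below the diagonal: each pair (a, b)
# survives once, while (b, a) duplicates and (a, a) self-similarities go.
lower = tril(sim, k=-1)
print(lower.toarray())
```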
- predict_backend.ml.nlp.tfidf_based_kg.difference_of_proportion_test(p1, n1, p2, n2, verbose=False)#
A two-tailed difference-of-proportions test. For more details, see: https://stattrek.com/hypothesis-test/difference-in-proportions.aspx
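The textbook form of this test is easy to reproduce with the standard library alone. A sketch under the usual pooled-proportion formulation (the function name and return shape are illustrative; the library's own signature may differ):

```python
import math

def diff_of_proportions_test(p1, n1, p2, n2):
    # Two-tailed difference-of-proportions z-test.
    # p1, p2 are sample proportions; n1, n2 are sample sizes.
    # Pooled proportion under H0: p1 == p2.
    p = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-tailed p-value from the standard normal distribution:
    # erfc(|z| / sqrt(2)) == 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p_value = diff_of_proportions_test(0.38, 100, 0.51, 200)
print(round(z, 3), round(p_value, 4))
```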
- predict_backend.ml.nlp.tfidf_based_kg.get_ranked_nlp_features(feature_index, feature_vector_mapping, feature_names)#
- predict_backend.ml.nlp.tfidf_based_kg.preprocessor(x)#
- predict_backend.ml.nlp.tfidf_based_kg.similarity_threshold_experiment(tfidf_basedkg, nlp_module, thresholds, flow_metadata=None, starting_progress=None, target_progress=None, tfidf_model=None, corpus_vectors=None)#
Compute a Similarity Threshold experiment. For testing purposes, or to use this function outside Predict, pass the tfidf_model and corpus_vectors params instead of a tfidf_basedkg object. This lets you keep them in memory (or elsewhere) rather than going through the store interface.
- Parameters:
tfidf_basedkg (Optional[TfIdfBasedKG]) – Optional. TfIdfBasedKG object
nlp_module (LanguageProcessor) – LanguageProcessor
thresholds (List[float]) – Thresholds to use in the Similarity Threshold experiment
flow_metadata (Dict) – The flow_metadata dict
starting_progress (Union[int, float, None]) – Base progress percentage
target_progress (Union[int, float, None]) – Target progress percentage
tfidf_model (TfidfVectorizer) – TF-IDF model, used to extract the feature names
corpus_vectors – Sparse matrix with the score for every doc and feature
- Return type:
Tuple[Asset, DataFrame]
- Returns:
A tuple with the experiment asset and an output pandas DataFrame
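As a rough, self-contained sketch of what a threshold sweep measures, the loop below rebuilds the similarity graph at each threshold and records simple graph statistics in a DataFrame. It uses scikit-learn/networkx stand-ins rather than the TfIdfBasedKG store interface, and the output columns are assumptions, not the function's real schema:

```python
import networkx as nx
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["apple banana", "banana cherry", "cherry date", "zebra"]
vecs = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(vecs)

rows = []
for threshold in [0.1, 0.3, 0.9]:
    g = nx.Graph()
    g.add_nodes_from(range(len(docs)))
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if sim[i, j] >= threshold:
                g.add_edge(i, j)
    # Record how the graph degrades as the threshold tightens.
    rows.append({
        "threshold": threshold,
        "num_edges": g.number_of_edges(),
        "num_components": nx.number_connected_components(g),
    })

report = pd.DataFrame(rows)
print(report)
```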
- predict_backend.ml.nlp.tfidf_based_kg.tokenizer(x)#