knowledge_graph

class virtualitics_sdk.assets.knowledge_graph.TfIdfBasedKG(asset_label, asset_name, asset_metadata=None, asset_version=1, min_df=1, max_df=1.0, description=None, stopwords=None)

Bases: Asset

add_extra_features(kg, corpus_vectors, feature_names, nlp_module, num_top_entities=10, num_top_events=5)

Add extra features (from the original dataset) to the KG nodes.

Parameters:
  • kg (Graph) – The input NetworkX graph

  • corpus_vectors – Sparse matrix of feature vectors

  • feature_names (List[str]) – List of feature names

  • nlp_module (LanguageProcessor) – LanguageProcessor from which to extract the extra features

  • num_top_entities (int) – Number of top entities to add to every node

  • num_top_events (int) – Number of top events to add to every node

Return type:

Graph

Returns:

The NetworkX graph with the augmented nodes
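
A minimal call sketch. The graph, TF-IDF matrix, and LanguageProcessor are assumed to come from earlier steps; the variable names are illustrative, not part of this reference.

import networkx as nx

# Assumed inputs: kg_asset (TfIdfBasedKG), nlp_module (LanguageProcessor with
# extracted entities/events), and tfidf_model / corpus_vectors obtained from
# compute_nlp_feature_vectors(..., return_out=True).
kg = nx.Graph()
kg.add_nodes_from(range(corpus_vectors.shape[0]))  # one node per document

augmented_kg = kg_asset.add_extra_features(
    kg=kg,
    corpus_vectors=corpus_vectors,
    feature_names=list(tfidf_model.get_feature_names_out()),
    nlp_module=nlp_module,
    num_top_entities=10,
    num_top_events=5,
)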

compute_nlp_feature_vectors(nlp_module, return_out=False)

Create a TF-IDF model using the extracted features stored in the nlp_module object. This function can also be used outside the context of the asset by setting the return_out parameter.

Parameters:
  • nlp_module (LanguageProcessor) – Used to retrieve the features needed to build the TF-IDF model

  • return_out (bool) – Controls whether to store the outputs in the Asset or return them

Return type:

Optional[Tuple[TfidfVectorizer, csr_matrix]]

Returns:

If return_out is True, the created TF-IDF model and the feature vector matrix; otherwise None.
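
A hedged sketch of both usage modes, assuming kg_asset is a TfIdfBasedKG and nlp_module a LanguageProcessor whose features have already been extracted:

# Standalone use: return the outputs instead of persisting them.
tfidf_model, corpus_vectors = kg_asset.compute_nlp_feature_vectors(
    nlp_module=nlp_module,
    return_out=True,
)

# Asset-backed use: outputs are stored in the asset's persistence and
# retrieved later through the getters documented below.
kg_asset.compute_nlp_feature_vectors(nlp_module=nlp_module)
tfidf_model = kg_asset.get_tfidif_model()
corpus_vectors = kg_asset.get_f_vectors()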

construct_knowledge_graph(nlp_module, similarity_threshold, drop_singletons=False, num_top_entities=10, num_top_events=5, include_nlp_features=True, save_kg=False)
Return type:

Optional[Graph]
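
An end-to-end sketch of building the asset and constructing the graph. The parameter values and the LanguageProcessor setup are illustrative assumptions, not recommendations.

from virtualitics_sdk.assets.knowledge_graph import TfIdfBasedKG

# nlp_module is assumed to be a LanguageProcessor prepared elsewhere.
kg_asset = TfIdfBasedKG(
    asset_label="kg",
    asset_name="document-knowledge-graph",
    min_df=2,       # ignore terms appearing in fewer than 2 documents
    max_df=0.95,    # ignore terms appearing in more than 95% of documents
)

graph = kg_asset.construct_knowledge_graph(
    nlp_module=nlp_module,
    similarity_threshold=0.35,   # illustrative threshold on pairwise similarity
    drop_singletons=True,
    num_top_entities=10,
    num_top_events=5,
    include_nlp_features=True,
    save_kg=False,
)
# The return type is Optional[Graph], so the result may be None in some configurations.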

get_f_vectors()

Retrieve the feature vector matrix from the persistence store and return it.

Returns:

The stored sparse matrix of feature vectors

get_kg_corpus()

Retrieve the KG corpus from the persistence store and return it.

Return type:

DataFrame

Returns:

The stored KG corpus

get_tfidif_model()

Retrieve the TF-IDF model from the persistence store and return it.

Return type:

TfidfVectorizer

Returns:

The stored TF-IDF model

set_f_vectors(feature_vectors)

Store the KG feature vectors in the persistence store and save their ID as an object parameter.

Parameters:

feature_vectors – Sparse matrix of feature vectors to store

set_kg_corpus(corpus)

Store the KG corpus in the persistence store and save its ID as an object parameter.

Parameters:

corpus (DataFrame) – The corpus to store

set_tfidif_model(tfidf_model)

Store the TF-IDF model in the persistence store and save its ID as an object parameter.

Parameters:

tfidf_model (TfidfVectorizer) – scikit-learn TF-IDF model
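
The setters and getters above share a single pattern: store an artifact in the asset's persistence and read it back later. A minimal round-trip sketch, with illustrative contents:

import pandas as pd

# Store the artifacts produced earlier in the flow...
kg_asset.set_kg_corpus(pd.DataFrame({"doc_id": [0, 1], "text": ["first document", "second document"]}))
kg_asset.set_tfidif_model(tfidf_model)   # scikit-learn TfidfVectorizer
kg_asset.set_f_vectors(corpus_vectors)   # SciPy sparse matrix

# ...and read them back, e.g. in a later step.
corpus = kg_asset.get_kg_corpus()
model = kg_asset.get_tfidif_model()
vectors = kg_asset.get_f_vectors()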

virtualitics_sdk.assets.knowledge_graph.analyze_entity_proportions(tf_idf_kg, nlp_module, feature='Segments', min_group_entity_count=10, min_group_entity_total=10, min_complement_group_entity_count=3, min_complement_group_entity_total=10, entity_ratio_difference=0.3, alpha=0.05, apply_bonferroni_correction=False, verbose=False)

Apply the difference of proportions test to identify, for each feature value, the entities that appear in a higher proportion of the documents associated with that feature value than in the rest of the corpus.
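
A hedged call sketch using only the parameters shown in the signature; the structure of the return value is not documented here, so it is bound to a generic name.

from virtualitics_sdk.assets.knowledge_graph import analyze_entity_proportions

# Assumed inputs: kg_asset (TfIdfBasedKG) and nlp_module (LanguageProcessor).
results = analyze_entity_proportions(
    tf_idf_kg=kg_asset,
    nlp_module=nlp_module,
    feature="Segments",                 # feature whose values define the document groups
    entity_ratio_difference=0.3,        # minimum difference in proportions to report
    alpha=0.05,                         # significance level of the test
    apply_bonferroni_correction=True,   # correct for multiple comparisons
    verbose=False,
)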

virtualitics_sdk.assets.knowledge_graph.compute_opt_sim_mat(corpus_vectors)

Compute one triangle of the similarity matrix, filtering out the diagonal and the opposite triangle. This way each unordered pair is kept once: (a, b) is retained while (b, a), (a, a), and (b, b) are filtered out.

Parameters:

corpus_vectors – Sparse matrix.

Returns:

SciPy sparse matrix in COO format.
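
For intuition, the same filtering can be reproduced with plain SciPy. This is an equivalent sketch of the idea, assuming cosine similarity between TF-IDF vectors; it is not the function's actual implementation.

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: 4 documents x 5 features.
corpus_vectors = sparse.random(4, 5, density=0.6, random_state=0, format="csr")

# The full pairwise similarity matrix holds (a, a) on the diagonal and both
# (a, b) and (b, a) off the diagonal.
sim = cosine_similarity(corpus_vectors, dense_output=False)

# Keep strictly one triangle so each unordered pair appears once and
# self-similarities are dropped.
sim_triangle = sparse.triu(sim, k=1)   # COO sparse matrix
print(sim_triangle.nnz, "unique document pairs with nonzero similarity")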

virtualitics_sdk.assets.knowledge_graph.difference_of_proportion_test(p1, n1, p2, n2, verbose=False)

A two-tailed difference of proportions test. For more details, see: https://stattrek.com/hypothesis-test/difference-in-proportions.aspx
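
For reference, the statistic described at the linked page can be computed as below. This is a worked illustration of the standard two-proportion z-test, not necessarily the function's exact implementation.

from math import sqrt
from scipy.stats import norm

def two_tailed_diff_of_proportions(p1, n1, p2, n2):
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)              # pooled sample proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))  # standard error
    z = (p1 - p2) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))                  # two-tailed p-value
    return z, p_value

# Example: 38% of 100 group documents mention an entity vs 20% of 400 others.
z, p = two_tailed_diff_of_proportions(0.38, 100, 0.20, 400)
print(round(z, 2), round(p, 4))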

virtualitics_sdk.assets.knowledge_graph.get_ranked_nlp_features(feature_index, feature_vector_mapping, feature_names)
virtualitics_sdk.assets.knowledge_graph.preprocessor(x)
virtualitics_sdk.assets.knowledge_graph.similarity_threshold_experiment(tfidf_basedkg, nlp_module, thresholds, flow_metadata=None, starting_progress=None, target_progress=None, tfidf_model=None, corpus_vectors=None)

Compute a Similarity Threshold experiment. For testing purposes, or to use this function outside Predict, pass the tfidf_model and corpus_vectors parameters instead of a tfidf_basedkg object. This lets you keep them in memory rather than going through the store interface.

Parameters:
  • tfidf_basedkg (Optional[TfIdfBasedKG]) – Optional. TfIdfBasedKG object

  • nlp_module (LanguageProcessor) – LanguageProcessor

  • thresholds (List[float]) – Thresholds to use in the Similarity Threshold experiment

  • flow_metadata (Optional[Dict]) – The flow_metadata dict

  • starting_progress (Union[float, int, None]) – Base progress percentage

  • target_progress (Union[float, int, None]) – Target progress percentage

  • tfidf_model (Optional[TfidfVectorizer]) – Optional. TF-IDF model, used to extract the feature names

  • corpus_vectors – Sparse matrix with the score for every document and feature

Return type:

Tuple[Asset, DataFrame]

Returns:

A tuple containing the experiment Asset and an output pandas DataFrame
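
A hedged sketch of the standalone mode described above, passing an in-memory model and matrix instead of a TfIdfBasedKG object; the threshold values and the nlp_module setup are illustrative assumptions.

from virtualitics_sdk.assets.knowledge_graph import similarity_threshold_experiment

# Standalone/testing use outside Predict: no TfIdfBasedKG, no store interface.
experiment_asset, results_df = similarity_threshold_experiment(
    tfidf_basedkg=None,
    nlp_module=nlp_module,             # LanguageProcessor, assumed prepared elsewhere
    thresholds=[0.2, 0.3, 0.4, 0.5],   # similarity thresholds to compare
    tfidf_model=tfidf_model,           # in-memory scikit-learn TF-IDF model
    corpus_vectors=corpus_vectors,     # in-memory sparse doc x feature matrix
)
print(results_df.head())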

virtualitics_sdk.assets.knowledge_graph.tokenizer(x)