knowledge_graph

class virtualitics_sdk.assets.knowledge_graph.TfIdfBasedKG(asset_label, asset_name, asset_metadata=None, asset_version=1, min_df=1, max_df=1.0, description=None, stopwords=None, **kwargs)

Bases: Asset

Provides an interface to create, store, and manipulate a TF-IDF based knowledge graph. The class extends Asset, so it can be saved to and retrieved from the store. It also uses the Store to persist its internal assets independently: the TF-IDF model, the KG corpus, and the feature matrix generated with the TF-IDF model. To access one of these objects, for example the TF-IDF model, use the corresponding getter, which queries the store and returns the original object.

Parameters:
  • asset_label (str) – Asset parameter

  • asset_name (str) – Asset parameter

  • asset_metadata (Optional[Dict]) – Asset parameter

  • asset_version (Union[int, float]) – Asset parameter

  • min_df (int) – TF-IDF model parameter

  • max_df (float) – TF-IDF model parameter

  • description (Optional[str]) – Asset parameter

  • stopwords (Optional[List[str]]) – TF-IDF model parameter
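As a rough illustration of what min_df, max_df and stopwords control, the sketch below prunes a vocabulary by document frequency in plain Python. It assumes the scikit-learn TfidfVectorizer convention, where an integer bound is an absolute document count and a float bound is a proportion of the corpus. The helper prune_vocabulary and the toy corpus are made up for this example and are not part of virtualitics_sdk.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=1, max_df=1.0, stopwords=None):
    """Return the sorted terms whose document frequency lies in [min_df, max_df].

    Hypothetical helper: an int bound is an absolute document count,
    a float bound is a proportion of the corpus (assumed convention).
    """
    stop = set(stopwords or [])
    n_docs = len(docs)
    # Count in how many documents each term appears (not total occurrences).
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()) - stop)
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)


docs = ["pump failure reported", "pump seal leak", "valve leak reported", "pump inspection"]
vocabulary = prune_vocabulary(docs, min_df=2, max_df=0.5, stopwords=["inspection"])
```

Here "pump" is dropped because it appears in more than half of the documents, and the terms that occur in a single document fall below min_df.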

EXAMPLE:

# Imports 
from virtualitics_sdk.assets.knowledge_graph import TfIdfBasedKG
. . .
# Example usage
store_interface = StoreInterface(**flow_metadata)
# Knowledge Graph
tfidf_basedkg = TfIdfBasedKG(asset_label="label", asset_name="name", min_df=2)

add_extra_features(kg, corpus_vectors, feature_names, nlp_module, num_top_entities=10, num_top_events=5)

Adds extra features (taken from the original dataset) to the KG nodes.

Parameters:
  • kg (Graph) – The input NetworkX graph

  • corpus_vectors – Sparse Matrix

  • feature_names (List[str]) – List of feature names

  • nlp_module (LanguageProcessor) – LanguageProcessor from which to extract the extra features

  • num_top_entities (int) – Number of top-entities to add to every node

  • num_top_events (int) – Number of top-events to add to every node

Return type:

Graph

Returns:

The NetworkX graph with the augmented nodes
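How the top entities and events are chosen is not documented here. One plausible reading, sketched below, is that each node receives the highest-scoring features of its document row. The function top_features, the dense toy rows, and the "top_features" attribute name are hypothetical; the real method works on a sparse matrix, a LanguageProcessor, and a NetworkX graph.

```python
import heapq


def top_features(row, feature_names, k):
    """Return the k feature names with the highest positive scores in one row."""
    scored = [(score, name) for score, name in zip(row, feature_names) if score > 0]
    # nlargest compares by score first, then by name when scores tie.
    return [name for _, name in heapq.nlargest(k, scored)]


feature_names = ["bearing", "leak", "pump", "valve"]
corpus_rows = [
    [0.0, 0.7, 0.5, 0.1],  # document 0
    [0.9, 0.0, 0.3, 0.0],  # document 1
]
# Attach the top-2 features to each node as a plain attribute dict.
node_attributes = {
    node_id: {"top_features": top_features(row, feature_names, k=2)}
    for node_id, row in enumerate(corpus_rows)
}
```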

compute_nlp_feature_vectors(nlp_module, return_out=False)

Creates a TF-IDF model from the extracted features stored in the nlp_module object. Set return_out to use this function outside the context of the asset: the outputs are then returned instead of being stored.

Parameters:
  • nlp_module (LanguageProcessor) – Used to retrieve the features from which the TF-IDF model is built

  • return_out (bool) – Controls whether the outputs are returned (True) or stored in the Asset (False)

Return type:

Optional[Tuple[TfidfVectorizer, csr_matrix]]

Returns:

If return_out is True, the created TF-IDF model and the feature vector matrix
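To make the feature vector matrix more concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the default scikit-learn TfidfVectorizer scheme (raw term counts, smoothed IDF, L2-normalised rows); the function tfidf_rows and the three toy documents are illustrative only, and the real method returns a TfidfVectorizer and a csr_matrix, not lists.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (feature_names, rows) with smoothed-IDF, L2-normalised weights."""
    tokenized = [doc.lower().split() for doc in docs]
    feature_names = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in feature_names}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in feature_names]
        # Scale each row to unit length; guard against empty documents.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return feature_names, rows


feature_names, rows = tfidf_rows(["pump leak", "pump failure", "valve leak"])
```

In the second document, "failure" occurs in one document only and therefore outweighs "pump", which occurs in two.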

construct_knowledge_graph(nlp_module, similarity_threshold, drop_singletons=False, num_top_entities=10, num_top_events=5, include_nlp_features=True, save_kg=False)

Return type:

Optional[Graph]

get_f_vectors()

Retrieves the feature vector matrix from the persistence store and returns it.

Returns:

sparse matrix

get_kg_corpus()

Retrieves the KG corpus from the persistence store and returns it.

Return type:

DataFrame

Returns:

The stored KG corpus

get_tfidif_model()

Retrieves the TF-IDF model from the persistence store and returns it.

Return type:

TfidfVectorizer

Returns:

The TF-IDF model

set_f_vectors(feature_vectors)

Stores the KG feature vectors in the persistence store and saves their ID as an object parameter.

Parameters:

feature_vectors – sparse matrix

set_kg_corpus(corpus)

Stores the KG corpus in the persistence store and saves its ID as an object parameter.

Parameters:

corpus (DataFrame) – The corpus to store

set_tfidif_model(tfidf_model)

Stores the TF-IDF model in the persistence store and saves its ID as an object parameter.

Parameters:

tfidf_model (TfidfVectorizer) – scikit-learn TF-IDF model

virtualitics_sdk.assets.knowledge_graph.analyze_entity_proportions(tf_idf_kg, nlp_module, feature='Segments', min_group_entity_count=10, min_group_entity_total=10, min_complement_group_entity_count=3, min_complement_group_entity_total=10, entity_ratio_difference=0.3, alpha=0.05, apply_bonferroni_correction=False, verbose=False)

Applies the difference of proportions test to identify, for each feature value, the entities that appear in a higher proportion of the documents associated with that feature value than of all the other documents in the corpus.
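The alpha and apply_bonferroni_correction parameters suggest a standard multiple-testing adjustment. The sketch below shows the textbook Bonferroni rule (divide alpha by the number of tests); whether the library applies it exactly this way is an assumption, and significant_entities and the p-values are invented for illustration.

```python
def significant_entities(p_values, alpha=0.05, apply_bonferroni_correction=False):
    """Return the entities whose p-value is below the (possibly corrected) alpha."""
    threshold = alpha
    if apply_bonferroni_correction and p_values:
        # Bonferroni: divide alpha by the number of tests performed.
        threshold = alpha / len(p_values)
    return sorted(name for name, p in p_values.items() if p < threshold)


p_values = {"pump": 0.001, "valve": 0.03, "seal": 0.2}  # made-up p-values
uncorrected = significant_entities(p_values)
corrected = significant_entities(p_values, apply_bonferroni_correction=True)
```

With three tests the corrected threshold is 0.05 / 3, so "valve" (p = 0.03) is no longer reported.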

virtualitics_sdk.assets.knowledge_graph.compute_opt_sim_mat(corpus_vectors)

Computes the similarity matrix and filters out every entry on or below the diagonal. Only one entry (a, b) is kept per pair; the mirrored entry (b, a) and the self-pairs (a, a) and (b, b) are dropped.

Parameters:

corpus_vectors – Sparse matrix.

Returns:

SciPy sparse matrix in COO format.
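The idea of keeping each pair once can be sketched in plain Python as follows. Cosine similarity is assumed here as the similarity measure, and upper_triangle_similarities is a hypothetical name; the real function operates on a sparse matrix and returns a sparse result rather than a dict.

```python
import math
from itertools import combinations


def upper_triangle_similarities(rows):
    """Return {(a, b): cosine similarity} for a < b, skipping zero similarities."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in rows]
    pairs = {}
    # combinations yields each unordered pair once, with a < b, and never (a, a).
    for a, b in combinations(range(len(rows)), 2):
        dot = sum(x * y for x, y in zip(rows[a], rows[b]))
        if dot and norms[a] and norms[b]:
            pairs[(a, b)] = dot / (norms[a] * norms[b])
    return pairs


rows = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
pairs = upper_triangle_similarities(rows)
```

With SciPy, a comparable result is commonly obtained by calling scipy.sparse.triu(matrix, k=1) on a full similarity matrix, which returns COO format by default.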

virtualitics_sdk.assets.knowledge_graph.difference_of_proportion_test(p1, n1, p2, n2, verbose=False)

A two-tailed difference of proportions test. For more details, see: https://stattrek.com/hypothesis-test/difference-in-proportions.aspx
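For reference, the standard pooled two-proportion z-test can be written in a few lines. This is a sketch of the textbook test, not the library's code: the name two_proportion_z_test and its return value are assumptions, with p1 and p2 taken to be sample proportions and n1 and n2 the sample sizes.

```python
import math


def two_proportion_z_test(p1, n1, p2, n2):
    """Return (z, two_tailed_p_value) for H0: the two proportions are equal."""
    # Pooled proportion under the null hypothesis.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if std_err == 0:
        # Both samples are all-0 or all-1: no evidence of a difference.
        return 0.0, 1.0
    z = (p1 - p2) / std_err
    # Two-tailed p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p_value = two_proportion_z_test(p1=0.6, n1=100, p2=0.4, n2=100)
```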

virtualitics_sdk.assets.knowledge_graph.get_ranked_nlp_features(feature_index, feature_vector_mapping, feature_names)

virtualitics_sdk.assets.knowledge_graph.preprocessor(x)

virtualitics_sdk.assets.knowledge_graph.similarity_threshold_experiment(tfidf_basedkg, nlp_module, thresholds, flow_metadata=None, starting_progress=None, target_progress=None, tfidf_model=None, corpus_vectors=None)

Runs a similarity threshold experiment. For testing, or to work outside VAIP, you can pass the tfidf_model and corpus_vectors parameters instead of a tfidf_basedkg object. This lets you keep them in memory or elsewhere without using the store interface.

Parameters:
  • tfidf_basedkg (Optional[TfIdfBasedKG]) – Optional. TfIdfBasedKG object

  • nlp_module (LanguageProcessor) – LanguageProcessor

  • thresholds (List[float]) – Thresholds to use in the Similarity Threshold experiment

  • flow_metadata (Optional[FlowMetadata]) – The flow_metadata dict

  • starting_progress (Union[int, float, None]) – Base progress percentage

  • target_progress (Union[int, float, None]) – Target progress percentage

  • tfidf_model (Optional[TfidfVectorizer]) – TF-IDF model, used to extract the feature names

  • corpus_vectors – Sparse matrix with the score for every document and feature

Return type:

Tuple[Asset, DataFrame]

Returns:

A tuple with the experiment asset and an output pandas DataFrame
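The contents of the output DataFrame are not specified here. As a hedged sketch of what a threshold sweep might measure, the snippet below counts, for each threshold, the document pairs that would become edges and the documents left without any edge. The function threshold_sweep, the "keep if similarity >= threshold" rule, and the scores are all assumptions made for illustration.

```python
def threshold_sweep(pair_similarities, n_docs, thresholds):
    """Return one summary dict per threshold: edges kept and isolated documents."""
    results = []
    for threshold in thresholds:
        # Assumed rule: a pair becomes an edge when its similarity reaches the threshold.
        edges = [pair for pair, sim in pair_similarities.items() if sim >= threshold]
        connected = {node for pair in edges for node in pair}
        results.append({
            "threshold": threshold,
            "n_edges": len(edges),
            "n_singletons": n_docs - len(connected),
        })
    return results


pair_similarities = {(0, 1): 0.9, (0, 2): 0.4, (1, 2): 0.6, (2, 3): 0.2}  # made-up scores
summary = threshold_sweep(pair_similarities, n_docs=4, thresholds=[0.1, 0.5, 0.8])
```

Raising the threshold trades graph density for isolated nodes, which is the kind of trade-off such an experiment helps to inspect before choosing a similarity_threshold.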

virtualitics_sdk.assets.knowledge_graph.tokenizer(x)