knowledge_graph
- class virtualitics_sdk.assets.knowledge_graph.TfIdfBasedKG(asset_label, asset_name, asset_metadata=None, asset_version=1, min_df=1, max_df=1.0, description=None, stopwords=None, **kwargs)
Bases:
Asset
Provides an interface to create, store, and manipulate a TF-IDF-based knowledge graph (KG). The class extends Asset, so it can be stored in and retrieved from the store. It also uses the Store to persist its internal objects independently: the TF-IDF model, the KG corpus, and the feature matrix generated with the TF-IDF model. To access one of these objects, for example the TF-IDF model, use the corresponding getter, which queries the store and returns the original object.
- Parameters:
asset_label (str) – Asset parameter
asset_name (str) – Asset parameter
asset_metadata (Optional[Dict]) – Asset parameter
asset_version (Union[int, float]) – Asset parameter
min_df (int) – TF-IDF model parameter
max_df (float) – TF-IDF model parameter
description (Optional[str]) – Asset parameter
stopwords (Optional[List[str]]) – TF-IDF model parameter
EXAMPLE:
    # Imports
    from virtualitics_sdk.assets.knowledge_graph import TfIdfBasedKG
    . . .
    # Example usage
    store_interface = StoreInterface(**flow_metadata)
    # Knowledge Graph
    tfidf_basedkg = TfIdfBasedKG(asset_label="label", asset_name="name", min_df=2)
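The min_df and max_df parameters are documented only as "TF-IDF model parameter". Assuming they are forwarded unchanged to scikit-learn's TfidfVectorizer (the model type the getter returns), an integer min_df is an absolute document count and a float max_df is a proportion of documents. The following pure-Python sketch illustrates that filtering rule; the helper filter_vocabulary and the sample documents are hypothetical and not part of the SDK.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df * n_docs].

    Hypothetical helper: min_df is treated as an absolute count (int) and
    max_df as a proportion of documents (float), as scikit-learn does.
    """
    n_docs = len(tokenized_docs)
    # Count in how many documents each term appears (not how many times).
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    max_count = max_df * n_docs
    return sorted(t for t, count in doc_freq.items() if min_df <= count <= max_count)


docs = [
    ["pump", "leak"],
    ["pump", "noise"],
    ["pump", "leak", "valve"],
]

# min_df=2 drops terms that occur in a single document.
frequent_terms = filter_vocabulary(docs, min_df=2)

# max_df=0.7 drops terms that occur in more than 70% of the documents.
non_ubiquitous_terms = filter_vocabulary(docs, max_df=0.7)
```

With the defaults (min_df=1, max_df=1.0) no term is filtered out by document frequency.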
- add_extra_features(kg, corpus_vectors, feature_names, nlp_module, num_top_entities=10, num_top_events=5)
Add extra features (taken from the original dataset) to the KG nodes.
- Parameters:
kg (Graph) – The input NetworkX graph
corpus_vectors – Sparse matrix
feature_names (List[str]) – List of feature names
nlp_module (LanguageProcessor) – LanguageProcessor from which to extract the extra features
num_top_entities (int) – Number of top entities to add to every node
num_top_events (int) – Number of top events to add to every node
- Return type:
Graph
- Returns:
The NetworkX graph with the augmented nodes
- compute_nlp_feature_vectors(nlp_module, return_out=False)
Create a TF-IDF model from the extracted features stored in the nlp_module object. With the return_out parameter, this function can also be used outside the context of the asset.
- Parameters:
nlp_module (LanguageProcessor) – Used to retrieve the features from which the TF-IDF model is built
return_out (bool) – Controls whether the outputs are stored in the Asset or returned to the caller
- Return type:
Optional[Tuple[TfidfVectorizer, csr_matrix]]
- Returns:
If return_out is True, the created TF-IDF model and the feature vector matrix
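To make "feature vector matrix" concrete, here is a pure-Python sketch of the weighting that scikit-learn's TfidfVectorizer applies with its default settings (raw term counts, smoothed IDF, L2-normalized rows). It is illustrative only: the function tfidf_vectors and the whitespace tokenization are assumptions, and the SDK builds its model from the features held by nlp_module rather than from raw strings.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document.

    Hypothetical helper mirroring scikit-learn's default formula:
    idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}

    rows = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        raw = [term_freq[t] * idf[t] for t in vocab]
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


vocab, rows = tfidf_vectors(["pump leak", "pump noise", "pump leak valve"])
```

Terms that occur in every document (here "pump") receive the lowest IDF, so rarer terms dominate each row.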
- construct_knowledge_graph(nlp_module, similarity_threshold, drop_singletons=False, num_top_entities=10, num_top_events=5, include_nlp_features=True, save_kg=False)
- Return type:
Optional[Graph]
- get_f_vectors()
Retrieve the feature vector matrix from the persistence store and return it.
- Returns:
Sparse matrix
- get_kg_corpus()
Retrieve the KG corpus from the persistence store and return it.
- Return type:
DataFrame
- Returns:
The stored KG corpus
- get_tfidif_model()
Retrieve the TF-IDF model from the persistence store and return it.
- Return type:
TfidfVectorizer
- Returns:
The TF-IDF model
- set_f_vectors(feature_vectors)
Store the KG feature vectors in the persistence store and save the resulting ID as an attribute of the object.
- Parameters:
feature_vectors – Sparse matrix
- set_kg_corpus(corpus)
Store the KG corpus in the persistence store and save the resulting ID as an attribute of the object.
- Parameters:
corpus (DataFrame) – The corpus to store
- set_tfidif_model(tfidf_model)
Store the TF-IDF model in the persistence store and save the resulting ID as an attribute of the object.
- Parameters:
tfidf_model (TfidfVectorizer) – scikit-learn TF-IDF model
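The getter and setter pairs above follow a store-by-ID pattern: the setter persists the object and keeps only its ID on the asset, and the getter uses that ID to load the object again. The sketch below shows the idea with a hypothetical in-memory store. InMemoryStore, KGAssetSketch, and the kg_corpus_id attribute are illustrative names only; the real SDK store, its serialization format, and its attribute names may differ.

```python
import pickle
import uuid


class InMemoryStore:
    """Hypothetical stand-in for a persistence layer keyed by object ID."""

    def __init__(self):
        self._blobs = {}

    def save(self, obj):
        object_id = str(uuid.uuid4())
        self._blobs[object_id] = pickle.dumps(obj)
        return object_id

    def load(self, object_id):
        return pickle.loads(self._blobs[object_id])


class KGAssetSketch:
    """Holds only the ID of the stored corpus, never the corpus itself."""

    def __init__(self, store):
        self._store = store
        self.kg_corpus_id = None

    def set_kg_corpus(self, corpus):
        # Persist the object and remember its ID on the asset.
        self.kg_corpus_id = self._store.save(corpus)

    def get_kg_corpus(self):
        # Query the store with the saved ID and return the original object.
        return self._store.load(self.kg_corpus_id)


asset = KGAssetSketch(InMemoryStore())
asset.set_kg_corpus({"doc_id": [1, 2], "text": ["pump leak", "pump noise"]})
restored = asset.get_kg_corpus()
```

Keeping only IDs on the asset means the asset itself stays small when it is serialized, while the larger objects are loaded on demand.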
- virtualitics_sdk.assets.knowledge_graph.analyze_entity_proportions(tf_idf_kg, nlp_module, feature='Segments', min_group_entity_count=10, min_group_entity_total=10, min_complement_group_entity_count=3, min_complement_group_entity_total=10, entity_ratio_difference=0.3, alpha=0.05, apply_bonferroni_correction=False, verbose=False)
Apply the difference-of-proportions test to identify, for each feature value, the entities that appear in a higher proportion of the documents associated with that feature value than of all the other documents in the corpus.
- virtualitics_sdk.assets.knowledge_graph.compute_opt_sim_mat(corpus_vectors)
Computes the pairwise similarity matrix and keeps only one of its triangles, excluding the diagonal. Each pair (a, b) therefore appears once, and (b, a), (a, a), and (b, b) are filtered out.
- Parameters:
corpus_vectors – Sparse matrix.
- Returns:
SciPy sparse matrix in COO format.
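The effect of keeping a single triangle without the diagonal can be sketched in pure Python. This is not the SDK implementation: cosine similarity is an assumed metric, the helper names are hypothetical, the choice of i < j is arbitrary, and a dictionary keyed by (row, column) stands in for the COO sparse matrix.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def unique_pair_similarities(vectors):
    """Map each unordered pair (i, j) with i < j to its similarity.

    combinations() never yields (i, i) or both (i, j) and (j, i),
    which matches keeping one strict triangle of the full matrix.
    """
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
pairs = unique_pair_similarities(vectors)
```

For n documents this stores n(n-1)/2 entries instead of n², which is the saving the triangle filter is after.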
- virtualitics_sdk.assets.knowledge_graph.difference_of_proportion_test(p1, n1, p2, n2, verbose=False)
A two-tailed difference-of-proportions test. For more details, see: https://stattrek.com/hypothesis-test/difference-in-proportions.aspx
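For reference, the textbook form of this test can be sketched as follows. This is a minimal sketch of the standard pooled two-proportion z-test, not the SDK's implementation: the function name two_proportion_z_test is hypothetical, and p1 and p2 are assumed to be sample proportions observed over n1 and n2 documents.

```python
import math


def two_proportion_z_test(p1, n1, p2, n2):
    """Return (z, two_tailed_p_value) for H0: the two proportions are equal.

    Hypothetical helper using the pooled standard error.
    """
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if std_err == 0.0:
        # Both samples are all-0 or all-1, so there is no difference to test.
        return 0.0, 1.0
    z = (p1 - p2) / std_err
    # Two-tailed p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


z, p_value = two_proportion_z_test(0.5, 100, 0.3, 100)
significant = p_value < 0.05
```

The resulting p-value is compared against a significance level such as the alpha=0.05 default of analyze_entity_proportions.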
- virtualitics_sdk.assets.knowledge_graph.get_ranked_nlp_features(feature_index, feature_vector_mapping, feature_names)
- virtualitics_sdk.assets.knowledge_graph.preprocessor(x)
- virtualitics_sdk.assets.knowledge_graph.similarity_threshold_experiment(tfidf_basedkg, nlp_module, thresholds, flow_metadata=None, starting_progress=None, target_progress=None, tfidf_model=None, corpus_vectors=None)
Run a similarity threshold experiment. For testing, or to use this function outside VAIP, pass the tfidf_model and corpus_vectors parameters instead of a tfidf_basedkg object. This lets you keep them in memory (or elsewhere) without going through the store interface.
- Parameters:
tfidf_basedkg (Optional[TfIdfBasedKG]) – Optional. TfIdfBasedKG object
nlp_module (LanguageProcessor) – LanguageProcessor
thresholds (List[float]) – Thresholds to use in the similarity threshold experiment
flow_metadata (Optional[FlowMetadata]) – The flow_metadata dict
starting_progress (Union[int, float, None]) – Base progress percentage
target_progress (Union[int, float, None]) – Target progress percentage
tfidf_model (Optional[TfidfVectorizer]) – TF-IDF model, used to extract the feature names
corpus_vectors – Sparse matrix with the score for every document and feature
- Return type:
Tuple[Asset, DataFrame]
- Returns:
A tuple with the experiment asset and an output pandas DataFrame
- virtualitics_sdk.assets.knowledge_graph.tokenizer(x)