language_processor¶
- class virtualitics_sdk.assets.language_processor.LanguageProcessor(model_name, asset_label, asset_name, document_identifier, narrative_feature, feature_names=None, pipeline_task=None, description=None, version=0, metadata=None, remove_model_after=True, seed=None, **kwargs)¶
Bases: Asset
NLP pipeline responsible for extracting features such as entities and events from raw text. It extends Asset so that it can be easily stored and retrieved, and it provides a simple interface to ingest documents and store their metadata.
- Parameters:
  - model_name (str) – The spaCy model used under the hood
  - asset_label (str) – Asset parameter
  - asset_name (str) – Asset parameter
  - document_identifier (str) – The dataset feature to use as the document ID
  - narrative_feature (str) – The dataset feature that contains the text to process
  - feature_names (Optional[List[str]]) – The list of dataset features to store together with the features extracted from each doc
  - pipeline_task (List[str]) – List of task names to use inside the pipeline
  - description (Optional[str]) – Asset parameter
  - version (int) – Asset parameter
  - metadata – Asset parameter
  - remove_model_after (bool) – Remove the model after the ingestion process
  - seed (Optional[int]) – numpy seed value
EXAMPLE:
    # Imports
    from virtualitics_sdk.assets.language_processor import LanguageProcessor
    ...

    # Example usage
    selected_model = "en_core_web_lg"
    pipeline_config = ["Event Extraction", "Entity Extraction", "Corpus Statistics"]
    feature_names = [
        "Unnamed: 0",
        "DATE RAISE ANNOUNCED",
        "COMPANY",
        "AMOUNT",
        "HQ Location",
        "TOP INVESTORS (in this round)",
        "LINK",
        "Website",
        "Round ",
        "Category",
        "NOTES",
        "Expansion Plans",
        "Founder First Name",
        "Founder Last Name",
        "Founder LinkedIn",
        "Founder Twitter",
        "Founder AngelList",
        "Unnamed: 16",
    ]
    asset_label = '17181175-32-1f4a-442c-8f91-4d32e2b905fd_lp'
    asset_name = '_lp'
    id_col = 'COMPANY'
    narr_col = 'COMPANY'
    nlp_module = LanguageProcessor(
        model_name=selected_model,
        pipeline_task=pipeline_config,
        feature_names=feature_names,
        asset_label=asset_label,
        asset_name=asset_name,
        document_identifier=id_col,
        narrative_feature=narr_col,
    )
- DATASET_UUID = 10¶
- available_models = {'en_core_web_lg'}¶
- static available_tasks(model)¶
Returns the tasks (registered components) available with the specified model.
- Parameters:
  model (str) – The model name
- Return type:
  List
- Returns:
  List of available tasks
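For example, a minimal sketch of checking the registered tasks before building a pipeline config; the model name comes from available_models above, and the tasks in the comment are the ones used in the class example, not necessarily the full set:

    from virtualitics_sdk.assets.language_processor import LanguageProcessor

    # Query the tasks (registered components) available for a supported model.
    tasks = LanguageProcessor.available_tasks("en_core_web_lg")
    print(tasks)  # e.g. includes "Event Extraction", "Entity Extraction", "Corpus Statistics"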
- static entities2features(entities)¶
Transforms the entities df into features that can be passed to the tf-idf vectorizer.
- Parameters:
  entities (DataFrame) – LanguageProcessor entities table
- Return type:
  Optional[DataFrame]
- Returns:
  The same df with the output features
- static events2features(events)¶
Transforms the events df into features that can be passed to the tf-idf vectorizer.
- Parameters:
  events (DataFrame) – LanguageProcessor events table
- Return type:
  Optional[DataFrame]
- Returns:
  The same df with the output features
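A minimal sketch of feeding both tables through these helpers, assuming an already-ingested LanguageProcessor instance named nlp_module (as in the class example) and using get_table (documented below) to fetch the internal tables:

    # Fetch the internal tables produced during ingestion.
    entities = nlp_module.get_table("entities")
    events = nlp_module.get_table("events")

    # Turn each table into features ready for a tf-idf vectorizer.
    # Either call may return None (the return type is Optional[DataFrame]).
    entity_features = LanguageProcessor.entities2features(entities)
    event_features = LanguageProcessor.events2features(events)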
- get_doc_ids_and_dates()¶
- Returns:
List of tuples of format (doc_number, date)
- get_doc_nlp_features()¶
Extracts computed features from the docs. It also returns the list of doc names.
- Return type:
  Tuple[List[str], List[str]]
- Returns:
  A list of the doc names and a list of the extracted features
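A minimal usage sketch, again assuming an ingested nlp_module:

    # Doc names and extracted features are parallel lists.
    doc_names, doc_features = nlp_module.get_doc_nlp_features()
    for name, features in zip(doc_names, doc_features):
        print(name, features)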
- get_doc_node_attributes(doc_number)¶
Returns the base information of the doc identified by doc_number.
- Parameters:
doc_number – The doc name
- Return type:
Dict
- Returns:
a dict with the format {'feature1': 'val1', 'feature2': 'val2'}
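A minimal sketch, assuming an ingested nlp_module; with the class example configuration, doc names are values of the COMPANY column:

    # Look up the base attributes of the first ingested doc.
    first_doc = nlp_module.get_doc_numbers()[0]
    attrs = nlp_module.get_doc_node_attributes(first_doc)
    # attrs looks like {'feature1': 'val1', 'feature2': 'val2'}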
- get_doc_numbers()¶
- Returns:
A list with doc names
- get_single_doc_nlp_features(doc_number)¶
Produces a Counter of string -> count for entities and events.
- get_table(table)¶
Useful if you want to access a specific LanguageProcessor internal table. Available tables:
- doc_data, which contains the original dataset features
- entities
- events
- Parameters:
  table (str) – The table name
- Return type:
  DataFrame
- Returns:
  The requested table as a pd.DataFrame
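For example, assuming an ingested nlp_module:

    # Access each of the internal tables by name.
    doc_data = nlp_module.get_table("doc_data")  # original dataset features
    entities = nlp_module.get_table("entities")
    events = nlp_module.get_table("events")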
- ingest(data, flow_metadata=None, extract_doc_bin=False, starting_progress=None, target_progress=None)¶
Runs the whole pipeline on the provided pandas DataFrame. For validation purposes, it checks that the init params provided are present inside the data df. For documentation of the store, starting_progress, and target_progress params, see the StepProgressTqdm docs.
- Parameters:
  - data (DataFrame) – Mandatory. The data used to feed the pipeline.
  - flow_metadata (Optional[FlowMetadata]) – The flow metadata necessary to create a store interface.
  - extract_doc_bin (bool) – If true, the method returns a DocBin.
  - starting_progress (Union[int, float, None]) – Used to init a StepProgressTqdm. Mandatory if a store is provided.
  - target_progress (Union[int, float, None]) – Used to init a StepProgressTqdm.
- Return type:
  Optional[DocBin]
- Returns:
  A DocBin object, if requested.
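A minimal sketch, assuming the nlp_module from the class example and a pandas DataFrame that contains the document_identifier, narrative_feature, and feature_names columns (the CSV path is hypothetical):

    import pandas as pd

    data = pd.read_csv("companies.csv")  # hypothetical file with the columns above

    # Run the whole pipeline; request the spaCy DocBin back.
    doc_bin = nlp_module.ingest(data, extract_doc_bin=True)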
- initialize_model(model_name)¶
Initializes the internal spaCy model with the provided model name and saves it as an instance attribute.
- Parameters:
  model_name (str)
- virtualitics_sdk.assets.language_processor.counter_to_str_repr(c)¶
Constructs a list of tokens from a counter, with each token appearing as many times as its count in the counter.
- Parameters:
  c (Counter)
- Return type:
  List[str]
- Returns:
  List of tokens
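A minimal sketch of the expansion behavior (the token order within the output list is not specified here):

    from collections import Counter
    from virtualitics_sdk.assets.language_processor import counter_to_str_repr

    c = Counter({"acquisition": 2, "funding": 1})
    tokens = counter_to_str_repr(c)
    # tokens contains "acquisition" twice and "funding" once,
    # e.g. ["acquisition", "acquisition", "funding"]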