language_processor

class virtualitics_sdk.assets.language_processor.LanguageProcessor(model_name, asset_label, asset_name, document_identifier, narrative_feature, feature_names=None, pipeline_task=None, description=None, version=0, metadata=None, remove_model_after=True, seed=None, **kwargs)

Bases: Asset

NLP pipeline responsible for extracting features such as entities and events from raw text. It extends Asset so that it can be easily stored and retrieved. It provides a simple interface for ingesting documents and storing their metadata.

Parameters:
  • model_name (str) – The spaCy model used under the hood

  • asset_label (str) – Asset parameter

  • asset_name (str) – Asset parameter

  • document_identifier (str) – The dataset feature to use as document id

  • narrative_feature (str) – The dataset feature that contains the text to process

  • feature_names (Optional[List[str]]) – The list of dataset features to store alongside the features extracted from each document

  • pipeline_task (Optional[List[str]]) – List of task names to use inside the pipeline

  • description (Optional[str]) – Asset parameter

  • version (int) – Asset parameter

  • metadata – Asset parameter

  • remove_model_after (bool) – Remove the model after the ingestion process

  • seed (Optional[int]) – numpy seed value

EXAMPLE:

# Imports 
from virtualitics_sdk.assets.language_processor import LanguageProcessor
. . .
# Example usage
selected_model = "en_core_web_lg"
pipeline_config = ["Event Extraction", "Entity Extraction", "Corpus Statistics"]
feature_names = [
    "Unnamed: 0",
    "DATE RAISE ANNOUNCED",
    "COMPANY",
    "AMOUNT",
    "HQ Location",
    "TOP INVESTORS (in this round)",
    "LINK",
    "Website",
    "Round ",
    "Category",
    "NOTES",
    "Expansion Plans",
    "Founder First Name",
    "Founder Last Name",
    "Founder LinkedIn",
    "Founder Twitter",
    "Founder AngelList",
    "Unnamed: 16",
]
asset_label = '17181175-32-1f4a-442c-8f91-4d32e2b905fd_lp'
asset_name = '_lp'
id_col = 'COMPANY'
narr_col = 'COMPANY'
nlp_module = LanguageProcessor(
    model_name=selected_model,
    pipeline_task=pipeline_config,
    feature_names=feature_names,
    asset_label=asset_label,
    asset_name=asset_name,
    document_identifier=id_col,
    narrative_feature=narr_col,
)
DATASET_UUID = 10
available_models = {'en_core_web_lg'}
static available_tasks(model)

Returns the available tasks (registered components) for the specified model.

Parameters:

model (str) – The model name

Return type:

List

Returns:

List of available tasks
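
EXAMPLE (a minimal sketch; assumes the "en_core_web_lg" model is available in the environment):

# Inspect which tasks a model supports before building a pipeline config
tasks = LanguageProcessor.available_tasks("en_core_web_lg")
print(tasks)  # e.g. ['Event Extraction', 'Entity Extraction', 'Corpus Statistics']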

static entities2features(entities)

Transforms the entities DataFrame into features that can be passed to a TF-IDF vectorizer.

Parameters:

entities (DataFrame) – LanguageProcessor entities table

Return type:

Optional[DataFrame]

Returns:

The same DataFrame with the output features

static events2features(events)

Transforms the events DataFrame into features that can be passed to a TF-IDF vectorizer.

Parameters:

events (DataFrame) – LanguageProcessor events table

Return type:

Optional[DataFrame]

Returns:

The same DataFrame with the output features
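
EXAMPLE (a minimal sketch; reuses nlp_module from the class example and assumes documents have already been ingested via ingest below):

# Convert the internal entities/events tables into TF-IDF-ready features
entities_df = nlp_module.get_table("entities")
events_df = nlp_module.get_table("events")
entity_features = LanguageProcessor.entities2features(entities_df)
event_features = LanguageProcessor.events2features(events_df)
# Note: either result may be None (the return type is Optional[DataFrame])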

get_doc_ids_and_dates()
Returns:

List of tuples in the format (doc_number, date)

get_doc_nlp_features()

Extracts the computed features from the documents, along with a list of document names.

Return type:

Tuple[List[str], List[str]]

Returns:

A list of document names and a list of the extracted features

get_doc_node_attributes(doc_number)

Returns the base information of the document identified by doc_number.

Parameters:

doc_number – The doc name

Return type:

Dict

Returns:

a dict with the format {'feature1': 'val1', 'feature2': 'val2'}

get_doc_numbers()
Returns:

A list of doc names
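
EXAMPLE (a minimal sketch of the inspection helpers above; reuses nlp_module from the class example and assumes at least one document has been ingested):

doc_ids_and_dates = nlp_module.get_doc_ids_and_dates()  # [(doc_number, date), ...]
doc_numbers = nlp_module.get_doc_numbers()              # list of doc names
doc_names, doc_features = nlp_module.get_doc_nlp_features()
attrs = nlp_module.get_doc_node_attributes(doc_numbers[0])  # {'feature1': 'val1', ...}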

get_single_doc_nlp_features(doc_number)

Produces a Counter of string -> count for the entities and events of the given document.

get_table(table)

Useful if you want to access a specific LanguageProcessor internal table. Available tables:

  • doc_data, which contains the original dataset features

  • entities

  • events

Parameters:

table (str) – table name

Return type:

DataFrame

Returns:

The requested table as a pd.DataFrame
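
EXAMPLE (a minimal sketch; reuses nlp_module from the class example and assumes ingestion has already completed):

# Inspect the shape of each internal table
for name in ("doc_data", "entities", "events"):
    print(name, nlp_module.get_table(name).shape)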

ingest(data, flow_metadata=None, extract_doc_bin=False, starting_progress=None, target_progress=None)

Runs the whole pipeline on the provided pandas DataFrame. For validation purposes, it checks that the parameters provided at init time are present inside the data DataFrame. For documentation of the store, starting_progress, and target_progress parameters, check the StepProgressTqdm docs.

Parameters:
  • data (DataFrame) – Mandatory. The data used to feed the pipeline.

  • flow_metadata (Optional[FlowMetadata]) – The flow metadata necessary to create a store interface.

  • extract_doc_bin (bool) – If True, the method returns a DocBin.

  • starting_progress (Union[int, float, None]) – Used to init a StepProgressTqdm. Mandatory if a store is provided.

  • target_progress (Union[int, float, None]) – Used to init a StepProgressTqdm.

Return type:

Optional[DocBin]

Returns:

A DocBin object, if requested.
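
EXAMPLE (a minimal sketch; reuses nlp_module from the class example. The CSV path is hypothetical; the DataFrame must contain the document_identifier, narrative_feature, and feature_names columns passed at init):

import pandas as pd

df = pd.read_csv("funding_rounds.csv")  # hypothetical file with the expected columns
doc_bin = nlp_module.ingest(df, extract_doc_bin=True)  # returns a spacy DocBin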

initialize_model(model_name)

Initializes the internal spaCy model with the provided model name and saves it as an instance attribute.

Parameters:

model_name (str)

virtualitics_sdk.assets.language_processor.counter_to_str_repr(c)

Constructs a list of tokens from a counter, with each token appearing as many times as its count in the counter.

Parameters:

c (Counter)

Return type:

List[str]

Returns:

List of tokens
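
EXAMPLE (a minimal sketch of the behavior described above; the input Counter is illustrative and pairs naturally with the output of get_single_doc_nlp_features):

from collections import Counter
from virtualitics_sdk.assets.language_processor import counter_to_str_repr

c = Counter({"acme": 2, "funding": 1})
counter_to_str_repr(c)  # -> ['acme', 'acme', 'funding'] (each token repeated per its count; order may vary)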