entity_extractor#

V3

class predict_backend.ml.nlp.components.entity_extractor.NerSummarization(nlp)#

Bases: SqlCompliant, PandasCompliant

spaCy custom component responsible for extracting, cleaning, and counting NER entities from spaCy docs. It sets ents_summary as a custom extension on the doc object. It also implements PandasCompliant so that it is compatible with the persistence handler.

assigns = ['doc._.ents_summary']#
beautiful_name = 'Entity Extraction'#
static create_component(nlp, name, depends_on)#

Factory function. Create a new instance of this component.

Parameters:
  • nlp (Any) – The spaCy nlp object

  • name (str) – Unused. Present only as an example of a registered parameter

  • depends_on (List[str]) – Unused. Present only as an example of a registered parameter

Returns:

A new instance of this component

depends_on = ['ner', 'tagger', 'attribute_ruler']#
feature_columns = ['entity']#
classmethod get_default()#
Returns:

The default configuration for the component

classmethod get_df_table_name()#
Return type:

str

Returns:

The table name of the component

classmethod get_feature_columns()#
Return type:

List

Returns:

The pandas column names of the features this component extracts

classmethod get_sql_table_name()#
Return type:

str

Returns:

The name of the table that should contain the information extracted by this component.

classmethod init_dataframe()#
Return type:

DataFrame

Returns:

An empty dataframe with the necessary columns to store the information extracted by this component.

classmethod init_extension()#

Register the doc extension used by this component.

classmethod init_sql_db()#
Return type:

List[str]

Returns:

A list with the necessary columns to store the information extracted by this component.
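
As a rough illustration of how column definitions such as those listed in sql_cols could be turned into a table, the sketch below joins them into a CREATE TABLE statement. The helper build_create_table is hypothetical and not part of this module; how the persistence handler actually consumes the returned list is not documented here. Note that SERIAL is a PostgreSQL type, so the resulting statement targets PostgreSQL-compatible databases.

```python
from typing import List

# Values copied from the sql_cols and table_name attributes documented on this page.
SQL_COLS: List[str] = ["rowid SERIAL", "doc_number text", "entity text", "count integer"]
TABLE_NAME = "entities"


def build_create_table(table_name: str, columns: List[str]) -> str:
    """Hypothetical helper: join column definitions into a CREATE TABLE statement."""
    return f"CREATE TABLE IF NOT EXISTS {table_name} ({', '.join(columns)})"


statement = build_create_table(TABLE_NAME, SQL_COLS)
print(statement)
```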

name = 'virtualitics_ner_summary'#
pandas_cols = ['doc_number', 'entity', 'count']#
requires = ['doc.ents']#
sql_cols = ['rowid SERIAL', 'doc_number text', 'entity text', 'count integer']#
table_name = 'entities'#
classmethod to_df(doc, idx_col)#
Parameters:
  • doc – Spacy doc object.

  • idx_col (str) – the column to use as idx.

Return type:

DataFrame

Returns:

A DataFrame with the ents_summary information attached to a document.

classmethod to_dict(doc, idx_col)#
Parameters:
  • doc – Spacy doc object.

  • idx_col (str) – the column to use as idx.

Return type:

List[Dict]

Returns:

A list of dicts with the ents_summary information attached to a document.
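
The shape of the returned rows can be pictured with a small sketch. It assumes, without confirmation from the source, that the summary amounts to a count per distinct entity string and that each row carries the three fields named in pandas_cols. The helper summary_to_rows and the sample values are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def summary_to_rows(entities: List[str], doc_number: str) -> List[Dict]:
    """Hypothetical helper: count entity strings and emit one row per distinct entity."""
    counts = Counter(entities)
    # Keys mirror the documented pandas_cols: doc_number, entity, count.
    return [
        {"doc_number": doc_number, "entity": entity, "count": count}
        for entity, count in counts.items()
    ]


rows = summary_to_rows(["Acme", "Paris", "Acme"], doc_number="doc-1")
for row in rows:
    print(row)
```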

classmethod transform_sql(doc, idx_col, value_placeholder='%s')#
Parameters:
  • doc – The spacy doc object

  • idx_col (str) – The column to use as id

  • value_placeholder – The type of placeholder to use (it can vary based on the client you’re using)

Return type:

Tuple[str, List]

Returns:

A tuple containing a string and a list of elements ready to be inserted into a database
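
To show why value_placeholder is configurable, here is a minimal sketch of building a parameterized INSERT. It is not the module's implementation: build_insert, the query layout, and the sample row are assumptions made for illustration. The standard-library sqlite3 driver expects "?" placeholders, while psycopg2-style clients expect "%s", which matches the documented default.

```python
import sqlite3
from typing import Dict, List, Tuple


def build_insert(rows: List[Dict], value_placeholder: str = "%s") -> Tuple[str, List]:
    """Hypothetical helper: build a parameterized INSERT and its list of values."""
    columns = ["doc_number", "entity", "count"]
    placeholders = ", ".join([value_placeholder] * len(columns))
    query = f"INSERT INTO entities ({', '.join(columns)}) VALUES ({placeholders})"
    values = [tuple(row[col] for col in columns) for row in rows]
    return query, values


rows = [{"doc_number": "doc-1", "entity": "Acme", "count": 2}]

# sqlite3 uses "?" placeholders, so the default "%s" is overridden here.
query, values = build_insert(rows, value_placeholder="?")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (doc_number text, entity text, count integer)")
conn.executemany(query, values)
stored = conn.execute("SELECT doc_number, entity, count FROM entities").fetchall()
conn.close()
```

Passing values separately from the query string lets the database driver handle quoting, which is the usual reason for returning the two as a pair.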

predict_backend.ml.nlp.components.entity_extractor.clean_items(item)#
predict_backend.ml.nlp.components.entity_extractor.concat_ner_and_chunks(entities, chunks, remove_pos=True)#

Take entities and chunks and merge them into a single result. It also resolves conflicts between overlapping spans by selecting the longer one.
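
The conflict-resolution rule described above can be sketched on plain tuples. This is an illustration under stated assumptions, not the function's actual code: spans are modelled as (start, end, text) with an exclusive end, merge_keep_longest is a hypothetical name, and the remove_pos parameter is not modelled because its behaviour is not documented here.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, text), end exclusive (assumption)


def merge_keep_longest(entities: List[Span], chunks: List[Span]) -> List[Span]:
    """Hypothetical helper: merge two span lists, keeping the longer span on overlap."""
    # Visit longer spans first so they win over any shorter span they overlap.
    candidates = sorted(entities + chunks, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept: List[Span] = []
    for start, end, text in candidates:
        # Keep the span only if it is disjoint from every span already kept.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, text))
    # Return the surviving spans in document order.
    return sorted(kept, key=lambda s: s[0])


entities = [(0, 2, "New York"), (5, 6, "Monday")]
chunks = [(0, 3, "New York City"), (8, 10, "the meeting")]
merged = merge_keep_longest(entities, chunks)
print(merged)
```

spaCy itself provides spacy.util.filter_spans, which applies a similar longest-span-first rule to Span objects. Whether this module relies on it is not stated in the source.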

predict_backend.ml.nlp.components.entity_extractor.segment_overlaps(a, b)#

Check if two segments overlap.

Parameters:
  • a (Tuple[int, int]) – First segment

  • b (Tuple[int, int]) – Second segment

Return type:

bool

Returns:

True if the two segments overlap
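
A minimal sketch of such an overlap check follows. It assumes half-open segments (start inclusive, end exclusive), which the source does not confirm, and it uses the hypothetical name overlaps_half_open to avoid implying that this is the module's own code.

```python
from typing import Tuple


def overlaps_half_open(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """Sketch: True if two half-open [start, end) segments share at least one position."""
    # Each segment must start before the other one ends.
    return a[0] < b[1] and b[0] < a[1]


print(overlaps_half_open((0, 3), (2, 5)))  # partial overlap
print(overlaps_half_open((0, 2), (2, 4)))  # touching ends only
```

Under the half-open assumption, segments that merely touch at an endpoint do not count as overlapping. With closed intervals the comparisons would use <= instead.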