corpus_stats#

class predict_backend.ml.nlp.components.corpus_stats.CorpusStats(nlp)#

Bases: PandasCompliant

spaCy custom component that extracts statistics from the data. It registers corpus_stats as a custom extension on the doc object. It also implements PandasCompliant and SqlCompliant, so it is compatible with the various persistence handlers.

assigns = ['doc._.stats']#
beautiful_name = 'Corpus Statistics'#
static create_component(nlp, name, depends_on)#

Factory function. Create a new instance of this component.

Parameters:
  • nlp (Any) – The spaCy nlp object.

  • name (str) – Unused. Present only as an example of a registered parameter.

  • depends_on (List[str]) – Unused. Present only as an example of a registered parameter.

Returns:

A new instance of the CorpusStats component

depends_on = ['ner', 'tagger', 'parser']#
feature_columns = ['n_sents', 'n_token', 'n_char', 'avg_tok_x_sent', 'avg_char_x_sent', 'n_ents', 'n_events', 'unique_words', 'pos_freqs', 'average_dep_depth']#
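As a rough illustration of what some of these columns might contain, the sketch below computes a few of them from pre-tokenized sentences in plain Python. The helper basic_stats is hypothetical, and the definitions (for example, counting characters per token and lowercasing before counting unique words) are assumptions inferred from the column names, not the component's actual formulas.

```python
from typing import Dict, List


def basic_stats(sentences: List[List[str]]) -> Dict[str, float]:
    """Compute a few surface statistics from pre-tokenized sentences.

    Hypothetical helper: the definitions below are assumptions inferred
    from the column names, not the component's real implementation.
    """
    tokens = [tok for sent in sentences for tok in sent]
    n_sents = len(sentences)
    n_token = len(tokens)
    # Assumption: n_char counts characters inside tokens only (no whitespace).
    n_char = sum(len(tok) for tok in tokens)
    return {
        "n_sents": n_sents,
        "n_token": n_token,
        "n_char": n_char,
        # Guard against empty input to avoid division by zero.
        "avg_tok_x_sent": n_token / n_sents if n_sents else 0.0,
        "avg_char_x_sent": n_char / n_sents if n_sents else 0.0,
        # Assumption: uniqueness is case-insensitive.
        "unique_words": len({tok.lower() for tok in tokens}),
    }


stats = basic_stats([["The", "cat", "sat"], ["the", "dog", "ran", "off"]])
```

The remaining columns (n_ents, n_events, pos_freqs, average_dep_depth) depend on the ner, tagger and parser output listed in depends_on, so they are left out of this sketch.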
classmethod get_default()#
Returns:

The default configuration for the component

classmethod get_df_table_name()#
Return type:

str

Returns:

The table name of the component.

classmethod get_feature_columns()#
Return type:

List

Returns:

The pandas column names of the features this component extracts.

classmethod init_dataframe()#
Return type:

DataFrame

Returns:

An empty dataframe with the necessary columns to store the information extracted by this component.

classmethod init_extension()#

Register the doc extension used by this component.

name = 'virtualitics_corpus_stats'#
pandas_cols = ['doc_number', 'n_sents', 'n_token', 'n_char', 'avg_tok_x_sent', 'avg_char_x_sent', 'n_ents', 'n_events', 'unique_words', 'pos_freqs', 'average_dep_depth']#
requires = ['doc.ents']#
table_name = 'stats'#
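Comparing the two documented lists shows that pandas_cols is feature_columns with doc_number prepended. The short check below copies both values from this page and verifies that relationship; whether doc_number is the column passed as idx_col is an assumption, not something this page states.

```python
# Values copied from the class attributes documented on this page.
feature_columns = [
    "n_sents", "n_token", "n_char", "avg_tok_x_sent", "avg_char_x_sent",
    "n_ents", "n_events", "unique_words", "pos_freqs", "average_dep_depth",
]
pandas_cols = [
    "doc_number", "n_sents", "n_token", "n_char", "avg_tok_x_sent",
    "avg_char_x_sent", "n_ents", "n_events", "unique_words", "pos_freqs",
    "average_dep_depth",
]

# pandas_cols adds a single index-like column in front of the feature columns.
extra_cols = [col for col in pandas_cols if col not in feature_columns]
same_order = pandas_cols == ["doc_number"] + feature_columns
```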
classmethod to_df(doc, idx_col)#
Parameters:
  • doc – spaCy Doc object.

  • idx_col (str) – The column to use as idx.

Return type:

DataFrame

Returns:

A DataFrame with the corpus_stats information attached to a document.

classmethod to_dict(doc, idx_col)#
Parameters:
  • doc – spaCy Doc object.

  • idx_col (str) – The column to use as idx.

Return type:

List[Dict]

Returns:

A list of dicts with the corpus_stats information attached to a document.
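To make the List[Dict] return shape concrete, here is a minimal sketch of how a per-document stats mapping could be turned into row dicts keyed by an index column. The function stats_to_rows and its idx_value parameter are hypothetical; the real to_dict takes a doc and reads the statistics from its extension, and how it fills idx_col is not documented here.

```python
from typing import Any, Dict, List


def stats_to_rows(
    stats: Dict[str, Any], idx_col: str, idx_value: Any
) -> List[Dict[str, Any]]:
    """Wrap one document's statistics in a single-row list of dicts.

    Hypothetical helper: assumes one row per document, with the index
    column placed first and the statistics copied alongside it.
    """
    row: Dict[str, Any] = {idx_col: idx_value}
    row.update(stats)
    return [row]


rows = stats_to_rows(
    {"n_sents": 2, "n_token": 7}, idx_col="doc_number", idx_value=0
)
```

A list of such row dicts can be passed directly to the pandas DataFrame constructor, which is one plausible reason for to_dict and to_df sharing the same parameters.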