significant_words module

class wayward.significant_words.SignificantWordsLM(documents: Iterable[Iterable[str]], lambdas: Tuple[numpy.floating, numpy.floating, numpy.floating], thresh: int = 0)[source]

Bases: wayward.parsimonious.ParsimoniousLM

Language model that consists of three sub-models:

  • Corpus model: represents term probabilities in a (large) background collection;
  • Group model: parsimonious term probabilities in a group of documents;
  • Specific model: represents the same group, but is biased towards terms that occur with a high frequency in individual documents and a low frequency in the others.

References

M. Dehghani, H. Azarbonyad, J. Kamps, D. Hiemstra, and M. Marx (2016). Luhn Revisited: Significant Words Language Models. Proc. CIKM '16.

Parameters:
  • documents (iterable over iterable of str terms) – All documents that should be included in the corpus model.
  • lambdas (3-tuple of float) – Weight of corpus, group, and specific models. Will be normalized if the weights in the tuple don’t sum to one.
  • thresh (int) – Don’t include words that occur fewer than thresh times.
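
The effect of thresh can be pictured with a small standalone sketch. It does not call wayward; the helper name build_vocab, the use of total corpus counts, and the sorted index order are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable


def build_vocab(documents: Iterable[Iterable[str]], thresh: int = 0) -> Dict[str, int]:
    """Map each sufficiently frequent term to a numeric index (hypothetical helper)."""
    # Count every term occurrence across all documents (assumed counting scheme).
    counts = Counter(term for doc in documents for term in doc)
    # Keep terms seen at least `thresh` times; sorting makes the indices reproducible.
    kept = sorted(term for term, count in counts.items() if count >= thresh)
    return {term: index for index, term in enumerate(kept)}


docs = [["significant", "words", "model"], ["words", "model"], ["model"]]
vocab = build_vocab(docs, thresh=2)
```

Here "significant" occurs only once, so it is dropped when thresh=2, while "model" and "words" are kept.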
vocab

Mapping of terms to numeric indices

Type: dict of term -> int
p_corpus

Log probability of terms in background model (indexed by vocab)

Type: array of float
p_group

Log probability of terms in the last processed group model (indexed by vocab)

Type: array of float
p_specific

Log probability of terms in the last processed specific model (indexed by vocab)

Type: array of float
lambda_corpus

Log probability (weight) of corpus model for documents

Type: array of float
lambda_group

Log probability (weight) of group model for documents

Type: array of float
lambda_specific

Log probability (weight) of specific model for documents

Type: array of float
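
Because the term models are stored as log probabilities indexed by vocab, reading a single term's probability takes two steps: look up the index, then exponentiate. The sketch below uses plain Python lists and made-up values in place of the real arrays; term_probability is a hypothetical helper, not part of the class.

```python
import math
from typing import Dict, Sequence

# Made-up stand-ins for the `vocab` and `p_corpus` attributes.
vocab = {"language": 0, "model": 1, "the": 2}
p_corpus = [math.log(0.1), math.log(0.2), math.log(0.7)]


def term_probability(term: str, vocab: Dict[str, int], log_probs: Sequence[float]) -> float:
    """Return the linear-space probability of `term`, or 0.0 if it is out of vocabulary."""
    index = vocab.get(term)
    if index is None:
        return 0.0
    # Stored values are log probabilities, so convert back with exp().
    return math.exp(log_probs[index])


prob_model = term_probability("model", vocab, p_corpus)
prob_unseen = term_probability("unseen", vocab, p_corpus)
```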

See also

wayward.parsimonious.ParsimoniousLM
one-sided parsimonious model
fit_parsimonious_group(document_group: Iterable[Iterable[str]], max_iter: int = 50, eps: float = 1e-05, lambdas: Optional[Tuple[numpy.floating, numpy.floating, numpy.floating]] = None, fix_lambdas: bool = False, parsimonize_specific: bool = False, post_parsimonize: bool = False, specific_estimator: Callable[[Sequence[numpy.ndarray]], numpy.ndarray] = <function mutual_exclusion>) → Dict[str, float][source]

Estimate a document group model, and parsimonize it against fixed corpus and specific models. The documents may be unseen, but any terms that are not in the vocabulary will be ignored.

Parameters:
  • document_group (iterable over iterable of str terms) – All documents that should be included in the group model.
  • max_iter (int, optional) – Maximum number of iterations of the EM algorithm to run.
  • eps (float, optional) – Epsilon: convergence threshold for the EM algorithm.
  • lambdas (3-tuple of float, optional) – Weight of corpus, group, and specific models. Will be normalized if the weights in the tuple don’t sum to one.
  • fix_lambdas (bool, optional) – Fix the weights of the three sub-models (i.e. don’t estimate lambdas as part of the M-step).
  • parsimonize_specific (bool, optional) – Bias the specific model towards uncommon terms before applying the EM algorithm to the group model. This generally results in a group model that stands out less from the corpus model.
  • post_parsimonize (bool, optional) – Bias the group model towards uncommon terms after applying the EM algorithm. This may be used to compensate when the frequency of common terms varies widely between the documents in the group.
  • specific_estimator (callable, optional) – Function that estimates the specific terms model based on the document term frequencies of the doc group.
Returns:

t_p_map – Dictionary of terms and their probabilities in the group model.

Return type:

dict of term -> float
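
To make the idea of parsimonizing a group model against fixed corpus and specific models more concrete, here is a rough pure-Python sketch of an EM loop over a three-component mixture. It is not the library's implementation: it works in linear space rather than log space, keeps the lambdas fixed (comparable to fix_lambdas=True), and the function name em_group_model and all input values are assumptions.

```python
from typing import Dict, Tuple


def em_group_model(
    tf: Dict[str, int],
    p_corpus: Dict[str, float],
    p_specific: Dict[str, float],
    lambdas: Tuple[float, float, float] = (0.7, 0.2, 0.1),
    max_iter: int = 50,
    eps: float = 1e-5,
) -> Dict[str, float]:
    """Estimate group term probabilities with fixed corpus and specific models."""
    lam_c, lam_g, lam_s = lambdas
    total = sum(tf.values())
    # Start from the maximum-likelihood estimate of the group model.
    p_group = {t: c / total for t, c in tf.items()}
    for _ in range(max_iter):
        # E-step: share of each term's count attributed to the group model.
        expected = {}
        for t, c in tf.items():
            num = lam_g * p_group[t]
            denom = lam_c * p_corpus.get(t, 0.0) + num + lam_s * p_specific.get(t, 0.0)
            expected[t] = c * num / denom if denom > 0 else 0.0
        # M-step: renormalize the expected counts into probabilities.
        norm = sum(expected.values())
        updated = {t: v / norm for t, v in expected.items()}
        delta = max(abs(updated[t] - p_group[t]) for t in tf)
        p_group = updated
        if delta < eps:
            break
    return p_group


tf = {"the": 10, "luhn": 5}
p_corpus = {"the": 0.9, "luhn": 0.1}
p_specific = {"the": 0.5, "luhn": 0.5}
p_group = em_group_model(tf, p_corpus, p_specific)
```

In this toy input "the" is more frequent in the group, but the corpus model already explains it well, so the loop shifts probability mass towards "luhn".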

group_top(k: int, document_group: Iterable[Iterable[str]], **kwargs) → List[Tuple[str, float]][source]

Get the top k terms of a document_group and their probabilities. This is a shortcut to retrieve the top terms found by fit_parsimonious_group().

Parameters:
  • k (int) – Number of top terms to return.
  • document_group (iterable over iterable of str terms) – All documents that should be included in the group model.
  • kwargs – Optional keyword arguments for fit_parsimonious_group().
Returns:

t_p – Terms and their probabilities in the group model.

Return type:

list of (str, float)
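
Conceptually, the shortcut amounts to selecting the k highest-probability entries from the term-to-probability mapping. The sketch below shows that selection step on a made-up mapping; top_terms is a hypothetical helper and may differ from how the library does it internally.

```python
import heapq
from typing import Dict, List, Tuple


def top_terms(t_p_map: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k (term, probability) pairs with the highest probability."""
    return heapq.nlargest(k, t_p_map.items(), key=lambda item: item[1])


# Made-up group model, standing in for the dictionary described above.
t_p_map = {"luhn": 0.5, "significant": 0.3, "the": 0.05, "words": 0.15}
top = top_terms(t_p_map, 2)
```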

static normalize_lambdas(lambdas: Tuple[numpy.floating, numpy.floating, numpy.floating]) → Tuple[numpy.floating, numpy.floating, numpy.floating][source]

Check and normalize the initial lambdas of the three sub-models.

Parameters: lambdas (3-tuple of float) – Weight of corpus, group, and specific models.
Returns: lambdas – Normalized probability of corpus, group, and specific models.
Return type: 3-tuple of float
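
As an illustration of what checking and normalizing three weights can look like, the following standalone sketch rescales a 3-tuple so that it sums to one. The name normalize_weights and the specific validation rules are assumptions; the actual checks performed by normalize_lambdas may differ.

```python
from typing import Sequence, Tuple


def normalize_weights(lambdas: Sequence[float]) -> Tuple[float, float, float]:
    """Validate three non-negative weights and rescale them to sum to one."""
    if len(lambdas) != 3:
        raise ValueError("expected exactly three weights (corpus, group, specific)")
    if any(weight < 0 for weight in lambdas):
        raise ValueError("weights must be non-negative")
    total = sum(lambdas)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    corpus, group, specific = (weight / total for weight in lambdas)
    return corpus, group, specific


normalized = normalize_weights((2.0, 1.0, 1.0))
```

Weights that already sum to one pass through unchanged, so the call is safe to apply unconditionally.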