parsimonious module

class wayward.parsimonious.ParsimoniousLM(documents: Iterable[Iterable[str]], w: numpy.floating, thresh: int = 0)

Bases: object

Language model for a set of documents.

Constructing an object of this class fits a background model. The top method can then be used to fit document-specific models, including for unseen documents, as long as they use the same vocabulary as the background corpus.

References

D. Hiemstra, S. Robertson, and H. Zaragoza (2004). Parsimonious Language Models for Information Retrieval. Proc. SIGIR ’04.

Parameters:
  • documents (iterable over iterable of str terms) – All documents that should be included in the corpus model.
  • w (float) – Weight of the document model; the corpus model gets weight 1 - w.
  • thresh (int) – Don’t include words that occur fewer than thresh times.
vocab

Mapping of terms to numeric indices

Type: dict of term -> int
p_corpus

Log probability of terms in background model (indexed by vocab)

Type: array of float
p_document

Log probability of terms in the last processed document model (indexed by vocab)

Type: array of float
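
As a rough illustration of how vocab and p_corpus relate, the sketch below builds both from a tiny corpus in plain Python. It is not the library's implementation: the helper name fit_background is hypothetical, and normalising over only the terms that survive thresh is an assumption.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def fit_background(
    documents: Iterable[Iterable[str]], thresh: int = 0
) -> Tuple[Dict[str, int], List[float]]:
    """Hypothetical helper: build a term index and corpus log probabilities."""
    counts = Counter(term for doc in documents for term in doc)
    # Drop terms that occur fewer than `thresh` times in the whole corpus.
    kept = {t: c for t, c in counts.items() if c >= thresh}
    # Assumption: probabilities are normalised over the kept terms only.
    total = sum(kept.values())
    vocab = {term: i for i, term in enumerate(sorted(kept))}
    p_corpus = [0.0] * len(vocab)
    for term, i in vocab.items():
        p_corpus[i] = math.log(kept[term] / total)
    return vocab, p_corpus


docs = [["a", "b", "a"], ["a", "c"]]
vocab, p_corpus = fit_background(docs)
```

Each entry of p_corpus is read through vocab, so p_corpus[vocab["a"]] is the log probability of the term "a" in this toy background model.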
get_term_probabilities(log_prob_distribution: numpy.ndarray) → Dict[str, float]

Align a term distribution with the vocabulary, and transform the term log probabilities to linear probabilities.

Parameters: log_prob_distribution (array of float) – Log probabilities of terms, indexed by the vocabulary.
Returns: t_p_map – Dictionary of terms and their probabilities in the (sub-)model.
Return type: dict of term -> float
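
The alignment described above amounts to looking up each term's index and exponentiating the log probability stored there. A minimal plain-Python sketch, assuming a vocab mapping of term -> index as documented; the function name term_probabilities is hypothetical and this is not the library's code:

```python
import math
from typing import Dict, Sequence


def term_probabilities(
    vocab: Dict[str, int], log_prob_distribution: Sequence[float]
) -> Dict[str, float]:
    """Hypothetical helper: map each term to its linear probability."""
    # vocab[term] gives the position of the term in the distribution;
    # exp() turns the stored log probability back into a probability.
    return {
        term: math.exp(log_prob_distribution[i]) for term, i in vocab.items()
    }


example_vocab = {"romeo": 0, "juliet": 1}
example_log_probs = [math.log(0.75), math.log(0.25)]
t_p_map = term_probabilities(example_vocab, example_log_probs)
```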
top(k: int, d: Iterable[str], max_iter: int = 50, eps: float = 1e-05, w: Optional[numpy.floating] = None) → List[Tuple[str, float]]

Get the top k terms of a document d and their log probabilities.

Uses the Expectation Maximization (EM) algorithm to estimate term probabilities.

Parameters:
  • k (int) – Number of top terms to return.
  • d (iterable of str terms) – Terms that make up the document.
  • max_iter (int, optional) – Maximum number of iterations of EM algorithm to run.
  • eps (float, optional) – Epsilon: convergence threshold for EM algorithm.
  • w (float, optional) – Weight of the document model; overrides the value given to the ParsimoniousLM constructor.
Returns: t_p – Terms and their log probabilities in the parsimonious model.
Return type: list of (str, float)
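
The EM procedure from the referenced paper can be sketched in plain Python as follows. This is a simplified illustration, not the library's implementation: the function name parsimonious_top is hypothetical, the corpus model is passed as a dict of linear probabilities rather than a log-probability array, and the convergence test (largest absolute change below eps) is an assumption.

```python
import math
from collections import Counter
from heapq import nlargest
from typing import Dict, Iterable, List, Tuple


def parsimonious_top(
    k: int,
    d: Iterable[str],
    p_corpus: Dict[str, float],
    w: float,
    max_iter: int = 50,
    eps: float = 1e-5,
) -> List[Tuple[str, float]]:
    """Hypothetical sketch: top-k terms with log probabilities via EM."""
    # Term frequencies, restricted to terms known to the background model.
    tf = Counter(t for t in d if t in p_corpus)
    if not tf:
        return []
    total = sum(tf.values())
    # Initialise the document model with maximum-likelihood estimates.
    p_doc = {t: c / total for t, c in tf.items()}
    for _ in range(max_iter):
        # E-step: expected count of each term under the document model.
        e = {
            t: tf[t] * w * p / ((1 - w) * p_corpus[t] + w * p)
            for t, p in p_doc.items()
        }
        # M-step: renormalise the expected counts into probabilities.
        norm = sum(e.values())
        new_p = {t: v / norm for t, v in e.items()}
        change = max(abs(new_p[t] - p_doc[t]) for t in p_doc)
        p_doc = new_p
        if change < eps:  # assumed convergence criterion
            break
    # Report log probabilities; skip terms whose probability reached zero.
    scored = [(t, math.log(p)) for t, p in p_doc.items() if p > 0.0]
    return nlargest(k, scored, key=lambda tp: tp[1])


toy_corpus = {"the": 0.5, "of": 0.3, "romeo": 0.1, "juliet": 0.1}
toy_doc = ["the", "the", "of", "romeo", "romeo"]
top_terms = parsimonious_top(2, toy_doc, toy_corpus, w=0.1)
```

In this toy run "the" and "romeo" are equally frequent in the document, but "romeo" is rare in the corpus, so the EM updates shift probability mass toward it and it ranks first.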