Dickens Example
In this example, three books by Charles Dickens serve as a background corpus. Each book is then used as a foreground model and parsimonized against the background corpus. This yields top terms that are characteristic of a specific book when compared to common Dickensian language.
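Parsimonization can be sketched as a small EM loop: the E-step attributes each term occurrence either to the background corpus model or to the document-specific model, and the M-step renormalizes the document-specific probabilities. The following minimal NumPy sketch illustrates the idea; the function name, parameters, and array layout are illustrative and are not wayward's actual API:

```python
import numpy as np

def parsimonize(doc_tf, corpus_p, w=0.1, max_iter=50, eps=1e-5):
    """EM for a parsimonious term distribution (illustrative sketch).

    doc_tf:   term frequencies of the foreground document (1-D array)
    corpus_p: background corpus term probabilities (same shape)
    w:        weight of the document-specific model (1 - w for the corpus)
    """
    p = doc_tf / doc_tf.sum()  # initialize with the document's MLE
    for _ in range(max_iter):
        # E-step: expected term counts attributed to the specific model
        e = doc_tf * (w * p) / ((1 - w) * corpus_p + w * p)
        # M-step: renormalize expected counts into a distribution
        p_new = e / e.sum()
        if np.abs(p_new - p).max() < eps:
            p = p_new
            break
        p = p_new
    return p
```

Running this on a toy vocabulary shows the intended effect: terms that are common in the background corpus lose probability mass, while terms specific to the document gain it.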
This is a minimalistic example: it analyzes only unigrams and uses a background corpus of limited size. As an exercise, one could extend it with phrase modeling (e.g. as provided by gensim.models.phrases) to analyze higher-order n-grams.
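As a taste of what such an extension involves, here is a self-contained bigram scorer in the spirit of gensim's Phrases model. The function name, the PMI-style score, and the default thresholds are simplifications for illustration; gensim's actual scoring formula and parameters differ:

```python
from collections import Counter

def frequent_bigrams(tokens, min_count=2, threshold=2.0):
    """Score adjacent word pairs and keep those that co-occur more often
    than their parts would predict under independence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    phrases = []
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue
        # PMI-style score: observed pair count vs. expected if independent
        score = (n_ab * total) / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases.append((f"{a}_{b}", n_ab))
    return sorted(phrases, key=lambda pair: -pair[1])
```

Detected pairs such as `oliver_twist` could then be joined into single tokens before building the language models, so that multi-word names are treated as one term.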
The full text of the input books was obtained from Project Gutenberg.
Output
INFO:__main__:Fetching terms from Oliver Twist
INFO:__main__:Fetching terms from David Copperfield
INFO:__main__:Fetching terms from Great Expectations
INFO:wayward.parsimonious:Building corpus model
INFO:wayward.parsimonious:Building corpus model
INFO:wayward.parsimonious:Gathering term probabilities
INFO:wayward.parsimonious:EM with max_iter=50, eps=1e-05
... *omitted numpy warnings*
INFO:wayward.significant_words:Lambdas initialized to: Corpus=0.9, Group=0.01, Specific=0.09
Top 20 words in Oliver Twist:
PLM term      PLM p     SWLM term    SWLM p
oliver        0.0824    oliver       0.1361
bumble        0.0372    sikes        0.0526
sikes         0.0332    bumble       0.0520
jew           0.0297    fagin        0.0477
fagin         0.0289    jew          0.0475
brownlow      0.0163    replied      0.0372
monks         0.0126    brownlow     0.0244
noah          0.0124    rose         0.0235
rose          0.0116    gentleman    0.0223
giles         0.0112    girl         0.0178
nancy         0.0109    nancy        0.0164
dodger        0.0107    dodger       0.0161
maylie        0.0093    monks        0.0159
bates         0.0088    noah         0.0156
beadle        0.0081    bates        0.0133
sowerberry    0.0079    giles        0.0118
yer           0.0077    maylie       0.0117
grimwig       0.0062    bill         0.0115
charley       0.0062    rejoined     0.0113
corney        0.0061    lady         0.0110
INFO:wayward.parsimonious:Gathering term probabilities
INFO:wayward.parsimonious:EM with max_iter=50, eps=1e-05
... *omitted wayward logging output*
INFO:wayward.significant_words:Lambdas initialized to: Corpus=0.9, Group=0.01, Specific=0.09
Top 20 words in David Copperfield:
PLM term       PLM p     SWLM term      SWLM p
micawber       0.0367    micawber       0.0584
peggotty       0.0335    peggotty       0.0533
aunt           0.0330    aunt           0.0517
copperfield    0.0226    copperfield    0.0359
traddles       0.0218    traddles       0.0346
dora           0.0216    my             0.0295
agnes          0.0182    dora           0.0290
steerforth     0.0169    agnes          0.0285
murdstone      0.0138    steerforth     0.0259
uriah          0.0100    murdstone      0.0200
ly             0.0088    her            0.0171
dick           0.0085    mother         0.0157
wickfield      0.0084    uriah          0.0145
davy           0.0073    dick           0.0142
barkis         0.0067    ly             0.0140
trotwood       0.0065    wickfield      0.0128
spenlow        0.0064    davy           0.0105
ham            0.0057    trotwood       0.0099
heep           0.0055    barkis         0.0097
creakle        0.0054    ham            0.0094
INFO:wayward.parsimonious:Gathering term probabilities
INFO:wayward.parsimonious:EM with max_iter=50, eps=1e-05
... *omitted wayward logging output*
INFO:wayward.significant_words:Lambdas initialized to: Corpus=0.9, Group=0.01, Specific=0.09
Top 20 words in Great Expectations:
PLM term       PLM p     SWLM term      SWLM p
joe            0.0732    joe            0.1346
pip            0.0335    pip            0.0614
havisham       0.0314    havisham       0.0559
herbert        0.0309    herbert        0.0502
wemmick        0.0280    estella        0.0471
estella        0.0265    wemmick        0.0456
jaggers        0.0239    jaggers        0.0409
biddy          0.0227    biddy          0.0404
pumblechook    0.0161    pumblechook    0.0275
wopsle         0.0118    wopsle         0.0192
drummle        0.0087    pocket         0.0186
provis         0.0067    sister         0.0152
orlick         0.0058    drummle        0.0132
compeyson      0.0057    aged           0.0097
aged           0.0056    marshes        0.0092
marshes        0.0052    orlick         0.0088
handel         0.0051    forge          0.0088
forge          0.0050    handel         0.0082
guardian       0.0047    provis         0.0074
trabb          0.0045    convict        0.0068