
Topic Modeling in Python and R: A Rather Curious Analysis of the Enron Email Corpus

Having only ever played with Latent Dirichlet Allocation using gensim in Python, I was very interested to see a nice example of this kind of topic modeling in R. Whenever I see a really cool analysis, I get the urge to do it myself. And what could be better for topic modeling than the Enron email dataset?!?! Let me tell you, this thing is a monster! According to the website I got it from, it contains about 500k messages coming from 151 mostly senior-management employees, organized into user folders. I didn't want to pull everything into my analysis, so I decided I would only look at messages contained in each user's "sent" folders (any folder with "sent" in its name).

Being a big proponent of R, I really did try to do all of the processing and analysis in R, but it proved too cumbersome and was taking longer than I wanted. So I dusted off my Python skills (thank you, grad school!) and did the bulk of the data processing/preparation in Python and the text analysis in R. Below is the (hopefully well-commented) code I used to process the corpus in Python:

docs = []
from os import listdir, chdir
import re

   
# Here's my attempt at coming up with regular expressions to filter out
# parts of the enron emails that I deem as useless.

email_pat = re.compile(".+@.+")
to_pat = re.compile("To:.+\n")
cc_pat = re.compile("cc:.+\n")
subject_pat = re.compile("Subject:.+\n")
from_pat = re.compile("From:.+\n")
sent_pat = re.compile("Sent:.+\n")
received_pat = re.compile("Received:.+\n")
ctype_pat = re.compile("Content-Type:.+\n")
reply_pat = re.compile("Reply- Organization:.+\n")
date_pat = re.compile("Date:.+\n")
xmail_pat = re.compile("X-Mailer:.+\n")
mimver_pat = re.compile("MIME-Version:.+\n")
contentinfo_pat = re.compile("----------------------------------------.+----------------------------------------")
forwardedby_pat = re.compile("----------------------.+----------------------")
caution_pat = re.compile('''\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*.+\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*''')
privacy_pat = re.compile(" _______________________________________________________________.+ _______________________________________________________________")

# The enron emails are in 151 directories, one for each senior management
# employee whose email account was entered into the dataset.
# The task here is to go into each folder, and enter each 
# email text file into one long nested list.
# I've used readlines() to read in the emails because read() 
# didn't seem to work with these email files.

chdir("/home/inkhorn/enron")
names = [d for d in listdir(".") if "." not in d]
for name in names:
    chdir("/home/inkhorn/enron/%s" % name)
    subfolders = listdir('.')
    sent_dirs = [n for n, sf in enumerate(subfolders) if "sent" in sf]
    sent_dirs_words = [subfolders[i] for i in sent_dirs]
    for d in sent_dirs_words:
        chdir('/home/inkhorn/enron/%s/%s' % (name,d))
        file_list = listdir('.')
        docs.append([" ".join(open(f, 'r').readlines()) for f in file_list if "." in f])
        
# Here I go into each email from each employee, try to filter out all the useless stuff,
# then paste the email into one long flat list.  This is probably inefficient, but oh well - python
# is pretty fast anyway!

docs_final = []
for subfolder in docs:
    for email in subfolder:
        if ".nsf" in email:
            etype = ".nsf"
        elif ".pst" in email:
            etype = ".pst"
        else:
            # Skip emails where neither mailbox file type shows up,
            # so that etype is never used while undefined.
            continue
        email_new = email[email.find(etype)+4:]
        email_new = to_pat.sub('', email_new)
        email_new = cc_pat.sub('', email_new)
        email_new = subject_pat.sub('', email_new)
        email_new = from_pat.sub('', email_new)
        email_new = sent_pat.sub('', email_new)
        email_new = email_pat.sub('', email_new)
        if "-----Original Message-----" in email_new:
            email_new = email_new.replace("-----Original Message-----","")
        email_new = ctype_pat.sub('', email_new)
        email_new = reply_pat.sub('', email_new)
        email_new = date_pat.sub('', email_new)
        email_new = xmail_pat.sub('', email_new)
        email_new = mimver_pat.sub('', email_new)
        email_new = contentinfo_pat.sub('', email_new)
        email_new = forwardedby_pat.sub('', email_new)
        email_new = caution_pat.sub('', email_new)
        email_new = privacy_pat.sub('', email_new)
        docs_final.append(email_new)

# Here I proceed to dump each and every email into about 126 thousand separate 
# txt files in a newly created 'data' directory.  This gets it ready for entry into a Corpus using the tm (textmining)
# package from R.

for n, doc in enumerate(docs_final):
    outfile = open("/home/inkhorn/enron/data/%s.txt" % n,'w')
    outfile.write(doc)
    outfile.close()
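As a quick sanity check of the header-stripping approach above, here is a tiny self-contained sketch that applies the same style of patterns to a made-up email (the addresses and body text are invented for the demo):

```python
import re

# Same style of patterns as in the processing script above.
to_pat = re.compile(r"To:.+\n")
from_pat = re.compile(r"From:.+\n")
subject_pat = re.compile(r"Subject:.+\n")
email_pat = re.compile(r".+@.+")

raw = (
    "From: jane.doe@enron.com\n"
    "To: john.roe@enron.com\n"
    "Subject: gas scheduling\n"
    "Please send the updated schedule.\n"
)

# Strip the header lines first, then any remaining lines containing
# an email address (the dot does not cross newlines, so email_pat
# only removes single lines).
cleaned = raw
for pat in (to_pat, from_pat, subject_pat, email_pat):
    cleaned = pat.sub("", cleaned)

print(cleaned.strip())  # -> Please send the updated schedule.
```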

After watching Python plow through those emails, I was very impressed! It made remarkably quick work of creating a directory with more files in it than I had ever seen on my computer!

So now I had a directory filled with an enormous number of text files. The next step was to get them into R so I could feed them to LDA. Below is the R code I used:
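The R code that follows samples 60k of the emails by renaming them, then moves the rest aside via the shell. The same step could also be done directly in Python; here is a sketch with made-up counts (100 files, a 60-file sample) in a temporary directory instead of the real ~126k:

```python
import os
import random
import shutil
import tempfile

# Create a throwaway 'data' directory with 100 stand-in email files.
data_dir = tempfile.mkdtemp()
for i in range(100):
    with open(os.path.join(data_dir, "%s.txt" % i), "w") as f:
        f.write("email %s" % i)

# Sample 60 files to keep, and move the rest into nonsampled/
# (the Python equivalent of "mv *.txt nonsampled/").
nonsampled = os.path.join(data_dir, "nonsampled")
os.mkdir(nonsampled)

all_txts = [f for f in os.listdir(data_dir) if f.endswith(".txt")]
keep = set(random.sample(all_txts, 60))

for f in all_txts:
    if f not in keep:
        shutil.move(os.path.join(data_dir, f), os.path.join(nonsampled, f))

print(len([f for f in os.listdir(data_dir) if f.endswith(".txt")]))  # -> 60
```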

library(stringr)
library(plyr)
library(tm)
library(tm.plugin.mail)
library(SnowballC)
library(topicmodels)

# At this point, the python script should have been run, 
# creating about 126 thousand txt files.  I was very much afraid
# to import that many txt files into the tm package in R (my computer only
# runs on 8GB of RAM), so I decided to mark 60k of them for a sample, and move the
# rest of them into a separate directory

email_txts = list.files('data/')
email_txts_sample = sample(email_txts, 60000)
email_rename = data.frame(orig=email_txts_sample, new=sub(".txt",".rxr", email_txts_sample))
file.rename(str_c('data/',email_rename$orig), str_c('data/',email_rename$new))

# At this point, all of the non-sampled emails (labelled .txt, not .rxr)
# need to go into a different directory. I created a directory that I called
# nonsampled/ and moved the files there via the terminal command "mv *.txt nonsampled/".
# It's very important that you don't try to do this via a file explorer, windows or linux,
# as the act of trying to display that many file icons is apparently very difficult for a regular machine :$

enron = Corpus(DirSource("/home/inkhorn/enron/data"))

extendedstopwords=c("a","about","above","across","after","MIME Version","forwarded","again","against","all","almost","alone","along","already","also","although","always","am","among","an","and","another","any","anybody","anyone","anything","anywhere","are","area","areas","aren't","around","as","ask","asked","asking","asks","at","away","b","back","backed","backing","backs","be","became","because","become","becomes","been","before","began","behind","being","beings","below","best","better","between","big","both","but","by","c","came","can","cannot","can't","case","cases","certain","certainly","clear","clearly","come","could","couldn't","d","did","didn't","differ","different","differently","do","does","doesn't","doing","done","don't","down","downed","downing","downs","during","e","each","early","either","end","ended","ending","ends","enough","even","evenly","ever","every","everybody","everyone","everything","everywhere","f","face","faces","fact","facts","far","felt","few","find","finds","first","for","four","from","full","fully","further","furthered","furthering","furthers","g","gave","general","generally","get","gets","give","given","gives","go","going","good","goods","got","great","greater","greatest","group","grouped","grouping","groups","h","had","hadn't","has","hasn't","have","haven't","having","he","he'd","he'll","her","here","here's","hers","herself","he's","high","higher","highest","him","himself","his","how","however","how's","i","i'd","if","i'll","i'm","important","in","interest","interested","interesting","interests","into","is","isn't","it","its","it's","itself","i've","j","just","k","keep","keeps","kind","knew","know","known","knows","l","large","largely","last","later","latest","least","less","let","lets","let's","like","likely","long","longer","longest","m","made","make","making","man","many","may","me","member","members","men","might","more","most","mostly","mr","mrs","much","must","mustn't","my","myself","n","necessary","need","needed","needing","needs"
,"never","new","newer","newest","next","no","nobody","non","noone","nor","not","nothing","now","nowhere","number","numbers","o","of","off","often","old","older","oldest","on","once","one","only","open","opened","opening","opens","or","order","ordered","ordering","orders","other","others","ought","our","ours","ourselves","out","over","own","p","part","parted","parting","parts","per","perhaps","place","places","point","pointed","pointing","points","possible","present","presented","presenting","presents","problem","problems","put","puts","q","quite","r","rather","really","right","room","rooms","s","said","same","saw","say","says","second","seconds","see","seem","seemed","seeming","seems","sees","several","shall","shan't","she","she'd","she'll","she's","should","shouldn't","show","showed","showing","shows","side","sides","since","small","smaller","smallest","so","some","somebody","someone","something","somewhere","state","states","still","such","sure","t","take","taken","than","that","that's","the","their","theirs","them","themselves","then","there","therefore","there's","these","they","they'd","they'll","they're","they've","thing","things","think","thinks","this","those","though","thought","thoughts","three","through","thus","to","today","together","too","took","toward","turn","turned","turning","turns","two","u","under","until","up","upon","us","use","used","uses","v","very","w","want","wanted","wanting","wants","was","wasn't","way","ways","we","we'd","well","we'll","wells","went","were","we're","weren't","we've","what","what's","when","when's","where","where's","whether","which","while","who","whole","whom","who's","whose","why","why's","will","with","within","without","won't","work","worked","working","works","would","wouldn't","x","y","year","years","yes","yet","you","you'd","you'll","young","younger","youngest","your","you're","yours","yourself","yourselves","you've","z")
dtm.control = list(
  tolower           = T,
  removePunctuation = T,
  removeNumbers     = T,
  stopwords         = c(stopwords("english"), extendedstopwords),
  stemming          = T,
  wordLengths       = c(3, Inf),
  weighting         = weightTf)

dtm = DocumentTermMatrix(enron, control=dtm.control)
dtm = removeSparseTerms(dtm,0.999)
dtm = dtm[rowSums(as.matrix(dtm))>0,]

k = 4

# Beware: this step takes a lot of patience!  My computer was chugging along for probably 10 or so minutes before it completed the LDA here.
lda.model = LDA(dtm, k)

# This enables you to examine the words that make up each topic that was calculated.  Bear in mind that I've chosen to stem all words possible in this corpus, so some of the words output will look a little weird.
terms(lda.model,20)

# Here I construct a dataframe that scores each document according to how closely its content 
# matches up with each topic.  The closer the score is to 1, the more closely its content matches
# up with that particular topic. 

emails.topics = posterior(lda.model, dtm)$topics
df.emails.topics = as.data.frame(emails.topics)
df.emails.topics = cbind(email=as.character(rownames(df.emails.topics)), 
                         df.emails.topics, stringsAsFactors=F)
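To make the DocumentTermMatrix / removeSparseTerms step above more concrete, here is a toy pure-Python analogue: count terms per document, drop stopwords, and prune terms that occur in too few documents. The documents and the document-frequency threshold are invented for the demo (removeSparseTerms works on a sparsity proportion, 0.999 above, rather than a raw count):

```python
from collections import Counter

docs = [
    "the gas market price",
    "gas price report",
    "dinner on friday",
]
stopwords = {"the", "on"}

# Tokenize, drop stopwords, and count terms per document:
# a rough stand-in for DocumentTermMatrix with dtm.control.
tokenized = [[w for w in d.split() if w not in stopwords] for d in docs]
dtm = [Counter(toks) for toks in tokenized]

# Document frequency of each term across the corpus.
vocab = set(w for toks in tokenized for w in toks)
df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}

# Keep only terms appearing in at least 2 documents:
# a rough stand-in for removeSparseTerms.
kept = sorted(w for w in vocab if df[w] >= 2)
print(kept)  # -> ['gas', 'price']
```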

Phew, that took a lot of computing power! Now that it's done, let's look at the results of the terms(lda.model, 20) call from the listing above:

      Topic 1   Topic 2     Topic 3      Topic 4     
 [1,] "time"    "thank"     "market"     "email"     
 [2,] "vinc"    "pleas"     "enron"      "pleas"     
 [3,] "week"    "deal"      "power"      "messag"    
 [4,] "thank"   "enron"     "compani"    "inform"    
 [5,] "look"    "attach"    "energi"     "receiv"    
 [6,] "day"     "chang"     "price"      "intend"    
 [7,] "dont"    "call"      "gas"        "copi"      
 [8,] "call"    "agreement" "busi"       "attach"    
 [9,] "meet"    "question"  "manag"      "recipi"    
[10,] "hope"    "fax"       "servic"     "enron"     
[11,] "talk"    "america"   "rate"       "confidenti"
[12,] "ill"     "meet"      "trade"      "file"      
[13,] "tri"     "mark"      "provid"     "agreement" 
[14,] "night"   "kay"       "issu"       "thank"     
[15,] "friday"  "corp"      "custom"     "contain"   
[16,] "peopl"   "trade"     "california" "address"   
[17,] "bit"     "ena"       "oper"       "contact"   
[18,] "guy"     "north"     "cost"       "review"    
[19,] "love"    "discuss"   "electr"     "parti"     
[20,] "houston" "regard"    "report"     "contract"
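The odd-looking terms in the table above ("pleas", "messag", "copi") are word stems, produced because of stemming = T in the dtm.control list. As a crude illustration of suffix stripping (emphatically not the actual Porter/Snowball algorithm that SnowballC implements), consider:

```python
# A deliberately crude stemmer: strip one common suffix if the
# remaining stem is at least 3 characters long. Real stemmers
# apply many ordered rules; this just shows why stems look truncated.
def crude_stem(word):
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["please", "message", "copies"]])
# -> ['pleas', 'messag', 'copi']
```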

This is where some truly subjective interpretation is required, much as in a PCA analysis. Let's try to interpret the topics one at a time:

  1. In this topic I see a lot of words relating to time, and then I see the word "meet". I'll call this the meetings topic (business or otherwise)!
  2. I'm not sure how to interpret this second topic, so I may just leave it out of my analysis!
  3. This topic contains a lot of "business content" words, so it appears to be the "shop talk" topic.
  4. This topic, while still fairly business-like, seems to be less about business content and more about business process, or perhaps the legalities of business.

For each of the sensible topics (1, 3, and 4), let's pull up a few emails that score highly on that topic and see whether the analysis holds up:
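The R selection used below, sample(which(df.emails.topics$"1" > .95), 10), just picks documents whose posterior probability for a given topic exceeds 0.95 and samples from them. A toy Python mirror of that selection, with an invented posterior matrix:

```python
# Rows are documents, columns are the 4 topic probabilities
# (values made up for illustration).
posteriors = {
    "10001.txt": [0.97, 0.01, 0.01, 0.01],
    "10002.txt": [0.25, 0.25, 0.25, 0.25],
    "10003.txt": [0.01, 0.01, 0.97, 0.01],
}

topic = 0  # topic 1, zero-indexed
strong = [doc for doc, p in posteriors.items() if p[topic] > 0.95]
print(strong)  # -> ['10001.txt']
```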

sample(which(df.emails.topics$"1" > .95), 10)
 [1] 53749 32102 16478 36204 29296 29243 47654 38733 28515 53254
enron[[32102]]

 I will be out of the office next week on Spring Break. Can you participate on 
 this call? Please report what is said to Christian Yoder 503-464-7845 or 
 Steve Hall 503-4647795

 	03/09/2001 05:48 PM

 I don't know, but I will check with our client.

 Our client Avista Energy has received the communication, below, from the ISO
 regarding withholding of payments to creditors of monies the ISO has
 received from PG&E.  We are interested in whether any of your clients have
 received this communication, are interested in this issue and, if so,
 whether you have any thoughts about how to proceed.

 You are invited to participate in a conference call to discuss this issue on
 Monday, March 12, at 10:00 a.m.

 Call-in number: (888) 320-6636
 Host: Pritchard
 Confirmation number: 1827-1922

 Diane Pritchard
 Morrison & Foerster LLP
 425 Market Street
 San Francisco, California 94105
 (415) 268-7188

So this is not a business meeting in the physical sense, but rather a conference call, which still falls under the general category of meetings.

enron[[29243]]
 Hey Fritz.  I am going to send you an email that attaches a referral form to your job postings.  In addition, I will also personally tell the hiring manager that I have done this and I can also give him an extra copy of youe resume.  Hopefully we can get something going here....

 Tori,

 I received your name from Diane Hoyuela. You and I spoke
 back in 1999 about the gas industry. I tried briefly back
 in 1999 and found few opportunities during de-regulations
 first few steps. Well,...I'm trying again. I've been
 applying for a few job openings at Enron and was wondering
 if you could give me an internal referral. Also, any advice
 on landing a position at Enron or in general as a scheduler
 or analyst.
 Last week I applied for these positions at Enron; gas
 scheduler 110360, gas analyst 110247, and book admin.
 110129. I have a pretty good understanding of the gas
 market.

 I've attached my resume for you. Congrats. on the baby!
 I'll give you a call this afternoon to follow-up, I know
 mornings are your time.
 Regards,

 Fritz Hiser

 __________________________________________________
 Do You Yahoo!?
 Get email alerts & NEW webcam video instant messaging with Yahoo! Messenger. http://im.yahoo.com

This one obviously shows someone who was trying to get a job at Enron and wanted to make a call "this afternoon". Again, a call rather than a physical meeting.

Finally,

enron[[29296]]

 Susan,

 Well you have either had a week from hell so far or its just taking you time
 to come up with some good bs.  Without being too forward I will be in town
 next Friday and wanted to know if you would like to go to dinner or
 something.  At least that will give us a chance to talk face to face.  If
 your busy don't worry about it I thought I would just throw it out there.

 I'll keep this one short and sweet since the last one was rather lengthy.
 Hope this Thursday is a little better then last week.

 Kyle

 _________________________________________________________________________
 Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com.

 Share information about yourself, create your own public profile at

http://profiles.msn.com.

Ahh, this one is particularly juicy. Kyle here wants to go to dinner "or something" (heh heh) with Susan so that they get a chance to talk face to face. At last, a physical meeting (perhaps a very physical one…), lumped into the same category as the other in-person and phone business meetings.

Okay, now let's move on to topic 3, the "business content" topic.

sample(which(df.emails.topics$"3" > .95), 10)
 [1] 40671 26644  5398 52918 37708  5548 15167 56149 47215 26683

enron[[40671]]

 Please change the counterparty on deal 806589 from TP2 to TP3 (sorry about that).

Okay, this seems to fall squarely within the realm of business content, but I have no idea what the heck it means. Let's try another one:

enron[[5548]]

Phillip, Scott, Hunter, Tom and John -

 Just to reiterate the new trading guidelines on PG&E Energy Trading:

 1.  Both financial and physical trading are approved, with a maximum tenor of 18 months

 2.  Approved entities are:	PG&E Energy Trading - Gas Corporation
 				PG&E Energy Trading - Canada Corporation

 				NO OTHER PG&E ENTITIES ARE APPROVED FOR TRADING

 3.  Both EOL and OTC transactions are OK

 4.  Please call Credit (ext. 31803) with details on every OTC transaction.  We need to track all new positions with PG&E Energy Trading on an ongoing basis.  Please ask the traders and originators on your desks to notify us with the details on any new transactions immediately upon execution.  For large transactions (greater than 2 contracts/day or 5 BCF total), please call for approval before transacting.

 Thanks for your assistance; please call me (ext. 53923) or Russell Diamond (ext. 57095) if you have any questions.

 Jay

This one is positively dripping with business content. Note terms like "Energy Trading", "Gas Corporation", and so on. Finally, one more:

enron[[26683]]

Hi Kathleen, Randy, Chris, and Trish,

 Attached is the text of the August issue of The Islander.  The headings will
 be lined up when Trish adds the art and ads.  A calendar, also, which is in
 the next e-mail.

 I'll appreciate your comments by the end of tomorrow, Monday.

 There are open issues which I sure hope get resolved before printing:

 1.  I'm waiting for a reply from Mike Bass regarding tenses on the Home Depot
 article.  Don't know if there's one developer or more and what the name(s)
 is/are.

 2.  Didn't hear back from Ted Weir regarding minutes for July's water board
 meeting.  I think there are 2 meetings minutes missed, 6/22 and July.

 3.  Waiting to hear back from Cheryl Hanks about the 7/6 City Council and 6/7
 BOA meetings minutes.

 4.  Don't know the name of the folks who were honored with Yard of the Month.
  They're at 509 Narcissus.

 I'm not feeling very good about the missing parts but need to move on
 schedule!  I'm also looking for a good dictionary to check the spellings of
 ettouffe, tree-house and orneryness.  (Makes me feel kind of ornery, come to
 think about it!)

 Please let me know if you have revisions.  Hope your week is starting out
 well.

 'Nita

Okay, this one seems to be a mix of business content and business process. So I can see how it got slotted into this topic, but it's not quite the clean fit I would have liked.

Finally, let's move on to topic 4, which struck me as the "business process" topic. I'm suspicious of this one, as I don't think I successfully filtered out everything I wanted to:

sample(which(df.emails.topics$"4" > .95), 10)
 [1] 51205  5129 48826 51214 55337 15843 52543 11978 48337  2609

enron[[5129]]

very funny today...during the free fall, couldn't price jv and xh low enough 
 on eol, just kept getting cracked.  when we stabilized, customers came in to 
 buy and couldnt price it high enough.  winter versus apr went from +23 cents 
 when we were at the bottom to +27 when april rallied at the end even though 
 it should have tightened theoretically.  however, april is being supported 
 just off the strip.  getting word a lot of utilities are going in front of 
 the puc trying to get approval for hedging programs this year.  

 hey johnny. hope all is well. what u think hrere? utuilites buying this break
 down? charts look awful but 4.86 ish is next big level.
 jut back from skiing in co, fun but took 17 hrs to get home and a 1.5 days to
 get there cuz of twa and weather.

Hmm, this one seems to be some kind of shop talk, and not particularly general at that. I'm not sure how it relates to the topic 4 words. Let's try another one:

enron[[55337]]

Fran, do you have an updated org chart that I could send to the Measurement group?
 	Thanks. Lynn

    Cc:	Estalee Russi

 Lynn,

 Attached are the org charts for ETS Gas Logistics:

 Have a great weekend.  Thanks!

 Miranda

There we go. This one seems to fall much more within the realm of "business process". Let's see if I can find one more good example:

enron[[11978]]

 Bill,

 As per our conversation today, I am sending you an outline of what we intend to be doing in Ercot and in particular on the real-time desk. For 2002 Ercot is split into 4 zones with TCRs between 3 of the zones. The zones are fairly diverse from a supply/demand perspective. Ercot has an average load of 38,000 MW, a peak of 57,000 MW with a breakdown of 30% industrial, 30% commercial and 40% residential. There are already several successful aggregators that are looking to pass on their wholesale risk to a credit-worthy QSE (Qualified Scheduling Entity). 

 Our expectation is that we will be a fully qualified QSE by mid-March with the APX covering us up to that point. Our initial on-line products will include a bal day and next day financial product. (There is no day ahead settlement in this market). There are more than 10 industrial loads with greater than 150 MW concentrated at single meters offering good opportunities for real-time optimization. Our intent is to secure one of these within the next 2 months.

 I have included some price history to show the hourly volatility and a business plan to show the scope of the opportunity. In addition, we have very solid analytics that use power flow simulations to map out expected outcomes in the real-time market.

 The initial job opportunity will involve an analysis of the real-time market as it stands today with a view to trading around our information. This will also drive which specific assets we approach to manage. As we are loosely combining our Texas gas and Ercot power desks our information flow will be superior and I believe we will have all the tools needed for a successful real-time operation.

 Let me know if you have any further questions.

 Thanks,

 Doug

Once again, I seem to have found an email that straddles the line between business process and business content. Okay, I guess this topic isn't the clearest at describing each of the examples I found!

Overall, I probably could have done a bit more to filter out the useless stuff in order to build topics that were better at describing the examples they represent. Also, I'm not sure whether or not I should be surprised that I didn't pick up some sort of "social banter" topic where people emailed each other about non-business subjects. I suppose social emails may be less predictable in their content, but perhaps somebody much smarter than I am can tell me the answer :)

If you know how I could substantially improve the quality of this analysis, please don't hesitate to leave a comment!