Моделирование тем в Python и R: Корпорация электронной почты Enron, часть 2

После публикации моего анализа почтового корпуса Enron я понял, что шаблоны регулярных выражений, которые я настроил для сбора и фильтрации предупреждений / сообщений о конфиденциальности в нижней части электронных писем людей, не работают. Давайте посмотрим на мой исправленный код Python для обработки корпуса:

docs = []
from os import listdir, chdir
import re

# Here's the section where I try to filter useless stuff out.
# Notice near the end all of the regex patterns where I've called 
# "re.DOTALL".  This is pretty key here.  What it means is that the
# .+ I have referenced within the regex pattern should be able to 
# pick up alphanumeric characters, in addition to newline characters
# (\n).  Since I did not have this in the first version, the cautionary/
# privacy messages people were pasting at the ends of their emails
# were not getting filtered out and were being entered into the 
# LDA analysis, putting noise in the topics that were modelled.

email_pat = re.compile(".+@.+")
to_pat = re.compile("To:.+\n")
cc_pat = re.compile("cc:.+\n")
subject_pat = re.compile("Subject:.+\n")
from_pat = re.compile("From:.+\n")
sent_pat = re.compile("Sent:.+\n")
received_pat = re.compile("Received:.+\n")
ctype_pat = re.compile("Content-Type:.+\n")
reply_pat = re.compile("Reply- Organization:.+\n")
date_pat = re.compile("Date:.+\n")
xmail_pat = re.compile("X-Mailer:.+\n")
mimver_pat = re.compile("MIME-Version:.+\n")
dash_pat = re.compile("--+.+--+", re.DOTALL)
star_pat = re.compile('\*\*+.+\*\*+', re.DOTALL)
uscore_pat = re.compile(" __+.+__+", re.DOTALL)
equals_pat = re.compile("==+.+==+", re.DOTALL)

# (the below is the same note as before)
# The enron emails are in 151 directories representing each each senior management
# employee whose email account was entered into the dataset.
# The task here is to go into each folder, and enter each 
# email text file into one long nested list.
# I've used readlines() to read in the emails because read() 
# didn't seem to work with these email files.

names = [d for d in listdir(".") if "." not in d]
for name in names:
    chdir("/home/inkhorn/enron/%s" % name)
    subfolders = listdir('.')
    sent_dirs = [n for n, sf in enumerate(subfolders) if "sent" in sf]
    sent_dirs_words = [subfolders[i] for i in sent_dirs]
    for d in sent_dirs_words:
        chdir('/home/inkhorn/enron/%s/%s' % (name,d))
        file_list = listdir('.')
        docs.append([" ".join(open(f, 'r').readlines()) for f in file_list if "." in f])

# (the below is the same note as before)
# Here i go into each email from each employee, try to filter out all the useless stuff,
# then paste the email into one long flat list.  This is probably inefficient, but oh well - python
# is pretty fast anyway!

docs_final = []
for subfolder in docs:
    for email in subfolder:
        if ".nsf" in email:
            etype = ".nsf"
        elif ".pst" in email:
            etype = ".pst"
        email_new = email[email.find(etype)+4:]
        email_new = to_pat.sub('', email_new)
        email_new = cc_pat.sub('', email_new)
        email_new = subject_pat.sub('', email_new)
        email_new = from_pat.sub('', email_new)
        email_new = sent_pat.sub('', email_new)
        email_new = received_pat.sub('', email_new)
        email_new = email_pat.sub('', email_new)
        email_new = ctype_pat.sub('', email_new)
        email_new = reply_pat.sub('', email_new)
        email_new = date_pat.sub('', email_new)
        email_new = xmail_pat.sub('', email_new)
        email_new = mimver_pat.sub('', email_new)
        email_new = dash_pat.sub('', email_new)
        email_new = star_pat.sub('', email_new)
        email_new = uscore_pat.sub('', email_new)
        email_new = equals_pat.sub('', email_new)

# (the below is the same note as before)
# Here I proceed to dump each and every email into about 126 thousand separate 
# txt files in a newly created 'data' directory.  This gets it ready for entry into a Corpus using the tm (textmining)
# package from R.

for n, doc in enumerate(docs_final):
    outfile = open("/home/inkhorn/enron/data/%s.txt" % n,'w')

Поскольку я не изменял код R со времени последнего поста, давайте посмотрим на результаты:

      Topic 1   Topic 2   Topic 3     Topic 4   
 [1,] "enron"   "time"    "pleas"     "deal"    
 [2,] "busi"    "thank"   "thank"     "gas"     
 [3,] "manag"   "day"     "attach"    "price"   
 [4,] "meet"    "dont"    "email"     "contract"
 [5,] "market"  "call"    "enron"     "power"   
 [6,] "compani" "week"    "agreement" "market"  
 [7,] "vinc"    "look"    "fax"       "chang"   
 [8,] "report"  "talk"    "call"      "rate"    
 [9,] "time"    "hope"    "copi"      "trade"   
[10,] "energi"  "ill"     "file"      "day"     
[11,] "inform"  "tri"     "messag"    "month"   
[12,] "pleas"   "bit"     "inform"    "compani" 
[13,] "trade"   "guy"     "phone"     "energi"  
[14,] "risk"    "night"   "send"      "transact"
[15,] "discuss" "friday"  "corp"      "product" 
[16,] "regard"  "weekend" "kay"       "term"    
[17,] "team"    "love"    "review"    "custom"  
[18,] "plan"    "item"    "receiv"    "cost"    
[19,] "servic"  "email"   "question"  "thank"   
[20,] "offic"   "peopl"   "draft"     "purchas"

One at a time, I will try to interpret what each topic is trying to describe:

  1. This one appears to be a business process topic, containing a lot of general business terms, with a few even relating to meetings.
  2. Similar to the last model that I derived, this topic has a lot of time related words in it such as: time, day, week, night, friday, weekend.  I’ll be interested to see if this is another business meeting/interview/social meeting topic, or whether it describes something more social.
  3. Hrm, this topic seems to contain a lot of general terms used when we talk about communication: email, agreement, fax, call, message, inform, phone, send, review, question.  It even has please and thank you!  I suppose it’s very formal and you could perhaps interpret this as professional sounding administrative emails.
  4. This topic seems to be another case of emails containing a lot of ‘shop talk’

Okay, let’s see if we can find some examples for each topic:

sample(which(df.emails.topics$"1" > .95),3)
[1] 27771 45197 27597


 Christi's call.
 	Christi has asked me to schedule the above meeting/conference call.  September 11th (p.m.) seems to be the best date.  Question:  Does this meeting need to be a 1/2 day meeting?  Christi and I were wondering.
 	Give us your thoughts.

Yup, business process, meeting. This email fits the bill! Next!


 I didn't check voice mail until this morning (I don't have a blinking light.  
 The assistants pick up our lines and amtel us when voice mails have been 
 left.)  Anyway, with the uncertainty of the future business under the Texas 
 Desk, the following are my goals for the next six months:
 1)  Ensure a smooth transition of HPL to AEP, with minimal upsets to Texas 
 2)  Develop operations processes and controls for the new Texas Desk.   
 3)  Develop a replacement
  a.  Strong push to improve Liz (if she remains with Enron and )
  b.  Hire new person, internally or externally
 4)  Assist in develop a strong logisitcs team.  With the new business, we 
 will need strong performers who know and accept their responsibilites.
 1 and 2 are open-ended.  How I accomplish these goals and what they entail 
 will depend how the Texas Desk (if we have one) is set up and what type of 
 activity the desk will be invovled in, which is unknown to me at this time.  
 I'm sure as we get further into the finalization of the sale, additional and 
 possibly more urgent goals will develop.  So, in short, who knows what I need 
 to do.

This one also seems to fit the bill. “D” here is writing about his/her goals for the next six months and considers briefly how to accomplish them. Not heavy into the content of the business, so I’m happy here. On to topic 2:

sample(which(df.emails.topics$"2" > .95),3)
[1] 50356 22651 19259


I agree it is Matt, and  I believe he has reviewed this tax stuff (or at 
 least other turbine K's) before.  His concern will be us getting some amount 
 of advance notice before title transfer (ie, delivery).  Obviously, he might 
 have some other comments as well.  I'm happy to send him the latest, or maybe 
 he can access the site?
 Given that the present form of GE world hunger seems to be more domestic than 
 international it would appear that Matt Gockerman would be a good choice for 
 the Enron- GE tax discussion.  Do you want to contact him or do you want me 
 to.   I would be interested in listening in on the conversation for 

Here, the conversants seem to be talking about having a phone conversation with “Matt” to get his ideas on a tax discussion. This fits in with the meeting theme. Next!



Well, that was pretty social, wasn’t it? :) Okay one more from the same topic:


  Mime-Version: 1.0
  Content-Transfer-Encoding: 7bit
 X- X- X- X-b X-Folder: \ExMerge - Giron, Darron C.\Sent Items
 X-Origin: GIRON-D
 X-FileName: darron giron 6-26-02.PST
 Sorry.  I've got a UBS meeting all day.  Catch you later.  I was looking forward to the conversation.
 It seems everyone agreed to Ninfa's.  Let's meet at 11:45; let me know if a
 different time is better.  Ninfa's is located in the tunnel under the JP
 Morgan Chase Tower at 600 Travis.  See you there.

Woops, header info that I didn’t manage to filter out :( . Anyway, DG writes about an impending conversation, and Schroeder writes about a specific time for their meeting. This fits! Next topic!

sample(which(df.emails.topics$"3" > .95),3)
[1] 24147 51673 29717


Kaye:  Can you please email the prior report to me?  Thanks.
 Sara Shackleton
 Enron North America Corp.
 1400 Smith Street, EB 3801a
 Houston, Texas  77002
 713-853-5620 (phone)
 713-646-3490 (fax)

 	04/10/2001 05:56 PM
 At Alan's request, please provide to me by e-mail (with a  Thursday of this week your suggested changes to the March 2001 Monthly 
 Report, so that we can issue the April 2001 Monthly Report by the end of this 
 week.  Thanks for your attention to this matter.

This one definitely fits in with the professional sounding administrative emails interpretation. Emailing reports and such. Next!

I believe this was intended for Susan Scott with ETS...I'm with Nat Gas trading.
 FYI...another executed capacity transaction on EOL for Transwestern.
 This message is to confirm your EOL transaction with Transwestern Pipeline Company.
 You have successfully acquired the package(s) listed below.  If you have questions or
 concerns regarding the transaction(s), please call Michelle Lokay at (713) 345-7932
 prior to placing your nominations for these volumes.
 Product No.:	39096
 Time Stamp:	3/27/01	09:03:47 am
 Product Name:	US PLCapTW Frm CenPool-OasisBlock16
 Shipper Name:  E Prime, Inc.
 Volume:	10,000 Dth/d  
 Rate:	$0.0500 /dth 1-part rate (combined  Res + Com) 100% Load Factor
 		+ applicable fuel and unaccounted for
 TW K#: 27548		
 Points:	RP- (POI# 58649)  Central Pool      10,000 Dth/d
 		DP- (POI# 8516)   Oasis Block 16  10,000 Dth/d
 Alternate Point(s):  NONE
 Note:     	In order to place a nomination with this agreement, you must log 
 	            	off the TW system and then log back on.  This action will update
 	            	the agreement's information on your PC and allow you to place
 		nominations under the agreement number shown above.
 Contact Info:		Michelle Lokay
 	 			Phone (713) 345-7932
               			Fax       (713) 646-8000

Rather long, but even the short part at the beginning falls under the right category for this topic! Okay, let’s look at the final topic:

sample(which(df.emails.topics$"4" > .95),3)
[1] 39100  31681  6427


 Randy, your proposal is fine by me.  Jim

Hrm, this is supposed to be a ‘business content’ topic, so I suppose I can see why this email was classified as such. It doesn’t take long to go from ‘proposal’ to ‘contract’ if you free associate, right? Next!


 Attached is the latest version of the Wildhorse Entrada Letter.  Please 
 review.  I reviewed the letter with Jim Osborne and Ken Krisa yesterday and 
 should get their comments today.  My plan is to Fedex to Midland for Ken's 
 signature tomorrow morning and from there it will got to Wildhorse.

This one makes me feel a little better, referencing a specific business letter that the emailer probably wants the emailed person to see. Let’s find one more for good luck:


 At a ratio of 10:1, you should have your 4th one signed and have the fifth 
 one on the way...
  	09/19/2000 05:40 PM
 ONLY 450!  Why, I thought you guys hit 450 a long time ago.
 Marie Heard
 Senior Legal Specialist
 Enron Broadband Services
 Phone:  (713) 853-3907
 Fax:  (713) 646-8537

 	09/19/00 05:34 PM
 Well, I do believe this makes 450!  A nice round number if I do say so myself!

 	Susan Bailey
 	09/19/2000 05:30 PM

 We have received an executed Master Agreement:
 Type of Contract:  ISDA Master Agreement (Multicurrency-Cross Border)
 Enron Entity:   Enron North America Corp.
 Counterparty:   Arizona Public Service Company
 Transactions Covered:  Approved for all products with the exception of: 
           Foreign Exchange
           Pulp & Paper
 Special Note:  The Counterparty has three (3) Local Business Days after the 
 receipt of a Confirmation from ENA to accept or dispute the Confirmation.  
 Also, ENA is the Calculation Agent unless it should become a Defaulting 
 Party, in which case the Counterparty shall be the Calculation Agent.
 Susan S. Bailey
 Enron North America Corp.
 1400 Smith Street, Suite 3806A
 Houston, Texas 77002
 Phone: (713) 853-4737
 Fax: (713) 646-3490

That one was very long, but there’s definitely some good business content in it (along with some happy banter about the contract that I guess was acquired).

All in all, I’d say that fixing those regex patterns that were supposed to filter out the caution/privacy messages at the ends of peoples’ emails was a big boon to the LDA analysis here.

Let that be a lesson: half the battle in LDA is in filtering out the noise!