GETTING STARTED
============================================
Install necessary gems:
$ bundle
Create a config file with the necessary secret:
$ sed -e s/CHANGEME/$(pwgen -s 30)/ config/config.yml.example > config/config.yml
USAGE
============================================
rake reset
cat postfix.log.1 | bin/parse-email-logs
DB NOTES
============================================
There is one record for each message delivery. This means that there might be
many records with duplicate queue_ids. Some messages never get successfully
delivered, and have status of 'deferred' or 'bounced'.
Many messages never show up in the db at all, such as those rejected because
they contained viruses or were blocked by RBLs.
Email addresses are cleaned and then hashed using HMAC. For example:
1. Elijah <elijah@riseup.net>
2. elijah@riseup.net
3. 31b8edad2227cc37ecead62bb14dcfe9@ff437a33d77574732ae1e09add6cfe49
The username and the domain parts are hashed separately.
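The three steps above might be sketched in Ruby as follows. This is only an
illustration: the helper names clean_address and hash_address, the choice of
SHA-256, and the truncation to 32 hex characters are assumptions, not
necessarily what bin/parse-email-logs does. The secret is assumed to be the
value stored in config/config.yml.

```ruby
require 'openssl'

# Hypothetical helper: strip any display name and angle brackets, then
# lowercase. The real cleaning rules may differ.
def clean_address(raw)
  addr = raw[/<([^>]*)>/, 1] || raw
  addr.strip.downcase
end

# Hypothetical helper: HMAC the username and the domain separately and
# rejoin them with "@". SHA-256 truncated to 32 hex chars is an assumption.
def hash_address(raw, secret)
  digest = OpenSSL::Digest.new('SHA256')
  clean_address(raw).split('@', 2).map { |part|
    OpenSSL::HMAC.hexdigest(digest, secret, part)[0, 32]
  }.join('@')
end

puts hash_address('Elijah <elijah@riseup.net>', 'CHANGEME')
```

Because the two parts are hashed separately, two addresses at the same domain
share the same hashed domain part, so per-domain statistics remain possible.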
The fields:
* id: sequence number. ignore it.
* message_id: a hash of the actual message id in the headers. It might be empty
or missing.
* queue_id: the id assigned to this delivery by postfix. messages with many
recipients might be spread across multiple queue_ids
* first_seen_at: the first time a log entry with this queue_id appeared in the
logs.
* date: the actual "Date" header.
* sent_at:
* for incoming: the date header
* for outgoing: first_seen_at
* received_at:
* for incoming: first_seen_at
* for outgoing: when the mx server logs status=sent
* sender: the envelope sender, hashed
* recipient: the envelope recipient, hashed.
* from, to, cc, bcc: the addresses in respective headers, hashed.
* message_size: the byte size of the entire message
* spam_score: not currently gathered
* subject_size: the number of characters in the "Subject" header
* is_list: true if message was sent by a mailing list
* is_outgoing: true if the message is outgoing
* re_message_id: message id of another message that this message is in reply to
* status: one of deferred, bounced, or sent. You can ignore all messages that
are not "sent". Deferred messages might be delivered later, so we keep
these records when scanning the logs.
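* delay: the total time, in seconds, from when postfix received the message
until this delivery attempt, as logged by postfix.
* delays: postfix's breakdown of delay into four slash-separated stages: time
before the queue manager, time in the queue manager, connection setup, and
message transmission.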
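The sent_at and received_at rules in the field list can be sketched as a small
function. The helper name delivery_times and the key :status_sent_at (the time
the mx server logged status=sent) are made up for this example, and the
timestamps are illustrative values only.

```ruby
require 'time'

# Hypothetical helper: derive sent_at and received_at for one record.
# Key names mirror the field list; :status_sent_at is an assumed name.
def delivery_times(rec)
  if rec[:is_outgoing]
    { sent_at: rec[:first_seen_at], received_at: rec[:status_sent_at] }
  else
    { sent_at: rec[:date], received_at: rec[:first_seen_at] }
  end
end

# Illustrative incoming record.
rec = { is_outgoing: false,
        date: Time.parse('2016-05-20 22:16:07 UTC'),
        first_seen_at: Time.parse('2016-05-20 22:16:47 UTC') }
times = delivery_times(rec)
puts times[:received_at] - times[:sent_at] # seconds between the two timestamps
```

For incoming mail sent_at comes from a header written by the sender, so the
difference is only as reliable as the sender's clock.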
NOTES
============================================
encoded list sender
--------------------------------------------
The envelope "from" for mailing lists often encodes the recipient. For example:
bounce-debian-backports=micah=debian.org@lists.debian.org
This is an entry for the mailing list debian-backports@lists.debian.org
delivering mail to micah@debian.org.
So, the data will appear to have many more unique envelope from addresses than
there really are.
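If the list and the recipient need to be separated, the encoded sender could
be decoded before hashing. The sketch below handles only the
"bounce-LIST=USER=DOMAIN@LISTHOST" layout shown above; decode_list_sender is a
hypothetical helper, and other list software encodes recipients differently.

```ruby
# Hypothetical helper: split an encoded list sender into the list address
# and the recipient address. Returns nil for anything that does not match
# the single layout handled here.
def decode_list_sender(sender)
  m = /\Abounce-([^=@]+)=([^=@]+)=([^=@]+)@([^@]+)\z/.match(sender)
  return nil unless m
  { list: "#{m[1]}@#{m[4]}", recipient: "#{m[2]}@#{m[3]}" }
end

p decode_list_sender('bounce-debian-backports=micah=debian.org@lists.debian.org')
```

Normalizing such senders to the list address before hashing would collapse
the many per-recipient variants into one unique sender.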
quota
--------------------------------------------
The way we have postfix configured, we reject messages for users who are over
quota very early on in the pipeline. By doing this, we radically reduce the
overhead that the mail servers have for dealing with users who are over quota.
One consequence of this is that incoming messages to users who are over quota
never get a queue ID and will never show up in the dataset.
TODO
============================================
handle over quota errors?
NOQUEUE: reject: RCPT from hotmail.com[0.0.0.0]: 450 4.7.1 <bob@riseup.net>: Recipient address rejected: Sorry, your message cannot be delivered to that person because their mailbox is full. If you can contact them another way, you may wish to tell them of this problem; from=<alice@hotmail.com> to=<bob@riseup.net> proto=ESMTP helo=<mx100.hotmail.com>
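A possible starting point for this item is a matcher for such lines. The
helper parse_over_quota is hypothetical, and matching on the text "mailbox is
full" assumes the reject message shown above stays unchanged.

```ruby
# Hypothetical helper: extract sender, recipient and SMTP code from an
# over-quota NOQUEUE reject line. Returns nil for any other line.
def parse_over_quota(line)
  return nil unless line.include?('NOQUEUE: reject:') &&
                    line.include?('mailbox is full')
  { from: line[/ from=<([^>]*)>/, 1],
    to: line[/ to=<([^>]*)>/, 1],
    smtp_code: line[/\]: (\d{3}) /, 1] }
end
```

These lines carry no queue ID, so any records built from them would need some
other key, and the addresses would still have to be hashed like the rest.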
what is "resent-message-id"?
May 20 22:16:47 mx1 postfix/smtpd[23894]: 106FC1A1FCB: client=bendel.debian.org[0.0.0.0]
May 20 22:16:47 mx1 postfix/cleanup[21313]: 106FC1A1FCB: message-id=<20160520221607.GA5201@riseup.net>
May 20 22:16:47 mx1 postfix/cleanup[21313]: 106FC1A1FCB: resent-message-id=<36m1DyUlwO.A.vVC.Nz4PXB@bendel>
May 20 22:16:47 mx1 postfix/qmgr[5938]: 106FC1A1FCB: from=<bounce-debian-glibc=xxxx=debian.org@lists.debian.org>, size=32505, nrcpt=1 (queue active)
May 20 22:16:47 mx1 postfix/smtp[21920]: 106FC1A1FCB: to=<xxxx@riseup.net>, relay=0.0.0.0[0.0.0.0]:25, delay=1.1, delays=1.1/0/0.04/0.02, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 87B333F0)
May 20 22:16:47 mx1 postfix/qmgr[5938]: 106FC1A1FCB: removed
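For reference, Resent-Message-ID is one of the "resent" header fields defined
in RFC 5322, added when a message is reintroduced into the mail system. To see
how often it appears alongside message-id, the key=value pairs of each
queue_id could be collected as sketched below. The names LINE_RE and
collect_by_queue_id are made up for this example, and the pattern assumes
short hexadecimal queue IDs like the ones in the log excerpt above.

```ruby
# Hypothetical pattern: "postfix/<daemon>[<pid>]: <QUEUEID>: <rest>".
LINE_RE = /postfix\/\w+\[\d+\]: ([0-9A-F]+): (.*)\z/

# Hypothetical helper: gather key=value pairs per queue_id.
# Later lines overwrite earlier values for the same key.
def collect_by_queue_id(lines)
  records = Hash.new { |h, k| h[k] = {} }
  lines.each do |line|
    m = LINE_RE.match(line.chomp)
    next unless m
    rec = records[m[1]]
    m[2].scan(/([\w-]+)=(<[^>]*>|[^,\s]+)/) do |key, value|
      rec[key] = value.sub(/\A<(.*)>\z/, '\1') # drop surrounding <>
    end
  end
  records
end
```

Feeding it the six lines above would give a single record for 106FC1A1FCB
holding message-id, resent-message-id, from, to, delay, delays and status.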