GETTING STARTED
============================================

Install necessary gems:

    $ bundle

Create a config file with the necessary secret:

    $ sed -e s/CHANGEME/$(pwgen -s 30)/ config/config.yml.example > config/config.yml

USAGE
============================================

    rake reset
    cat postfix.log.1 | bin/parse-email-logs

DB NOTES
============================================

There is one record for each message delivery. This means that there might be
many records with duplicate queue_ids.

Some messages never get successfully delivered, and have a status of
'deferred' or 'bounced'.

Many messages never show up in the db, such as those rejected because they
contained viruses or were blocked by RBLs.

Email addresses are cleaned and then hashed using HMAC. For example:

1. Elijah
2. elijah@riseup.net
3. 31b8edad2227cc37ecead62bb14dcfe9@ff437a33d77574732ae1e09add6cfe49

The username and the domain parts are hashed separately.

The fields:

* id: sequence number. Ignore it.
* message_id: a hash of the actual message id in the headers. It might be
  empty or missing.
* queue_id: the id assigned to this delivery by postfix. Messages with many
  recipients might be spread across multiple queue_ids.
* first_seen_at: the first time a log entry with this queue_id appeared in
  the logs.
* date: the actual "Date" header.
* sent_at:
  * for incoming: the date header
  * for outgoing: first_seen_at
* received_at:
  * for incoming: first_seen_at
  * for outgoing: when the mx server logs status=sent
* sender: the envelope sender, hashed.
* recipient: the envelope recipient, hashed.
* from, to, cc, bcc: the addresses in the respective headers, hashed.
* message_size: the byte size of the entire message.
* spam_score: not currently gathered.
* subject_size: the number of characters in the "Subject" header.
* is_list: true if the message was sent by a mailing list.
* is_outgoing: true if the message is outgoing.
* re_message_id: message id of another message that this message is in
  reply to.
* status: one of deferred, bounced, or sent. You can ignore all messages that
  are not "sent". Deferred messages might later get delivered, so we keep
  these records when scanning the logs.
* delay: the total time in seconds between the message entering postfix and
  this delivery attempt, as logged by postfix.
* delays: the postfix breakdown of delay, logged as a/b/c/d: time before the
  queue manager, time in the queue manager, connection setup time, and
  message transmission time.

NOTES
============================================

encoded list sender
--------------------------------------------

The envelope "from" for mailing lists often encodes the recipient. For
example:

    bounce-debian-backports=micah=debian.org@lists.debian.org

This is an entry for the mailing list debian-backports@lists.debian.org
delivering mail to micah@debian.org.

As a result, the data will appear to have many more unique envelope "from"
addresses than there really are.

quota
--------------------------------------------

The way we have postfix configured, we reject messages for users who are over
quota very early in the pipeline. This radically reduces the overhead that
the mail servers incur when dealing with users who are over quota.

One consequence is that incoming messages to users who are over quota never
get a queue ID and will never show up in the dataset.

TODO
============================================

handle over quota errors?

    NOQUEUE: reject: RCPT from hotmail.com[0.0.0.0]: 450 4.7.1 : Recipient address rejected: Sorry, your message cannot be delivered to that person because their mailbox is full. If you can contact them another way, you may wish to tell them of this problem; from= to= proto=ESMTP helo=

what is "resent-message-id"?

    May 20 22:16:47 mx1 postfix/smtpd[23894]: 106FC1A1FCB: client=bendel.debian.org[0.0.0.0]
    May 20 22:16:47 mx1 postfix/cleanup[21313]: 106FC1A1FCB: message-id=<20160520221607.GA5201@riseup.net>
    May 20 22:16:47 mx1 postfix/cleanup[21313]: 106FC1A1FCB: resent-message-id=<36m1DyUlwO.A.vVC.Nz4PXB@bendel>
    May 20 22:16:47 mx1 postfix/qmgr[5938]: 106FC1A1FCB: from=, size=32505, nrcpt=1 (queue active)
    May 20 22:16:47 mx1 postfix/smtp[21920]: 106FC1A1FCB: to=, relay=0.0.0.0[0.0.0.0]:25, delay=1.1, delays=1.1/0/0.04/0.02, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 87B333F0)
    May 20 22:16:47 mx1 postfix/qmgr[5938]: 106FC1A1FCB: removed
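
EXAMPLE: ADDRESS HASHING
============================================

The address hashing described under DB NOTES can be sketched roughly as
follows. This is an illustration only, not the code in bin/parse-email-logs:
the helper names (clean_address, hmac_hex, hash_address), the cleaning rules,
the SHA256 digest, and the truncation to 32 hex characters are all
assumptions, chosen only so that the output has the same shape as the example
in DB NOTES.

```ruby
require 'openssl'

# Hypothetical cleaning step: take the part inside <...> if present,
# then strip whitespace and lowercase. The real cleaning rules may differ.
def clean_address(raw)
  addr = raw[/<([^>]+)>/, 1] || raw
  addr.strip.downcase
end

# HMAC of a single value, hex encoded. SHA256 and the 32-character
# truncation are assumptions, not necessarily what the project uses.
def hmac_hex(secret, value)
  digest = OpenSSL::Digest.new('SHA256')
  OpenSSL::HMAC.hexdigest(digest, secret, value)[0, 32]
end

# Hash the username and the domain separately, so that two users at the
# same domain still share the same hashed domain part.
def hash_address(secret, raw)
  user, domain = clean_address(raw).split('@', 2)
  return nil if user.nil? || domain.nil? || user.empty? || domain.empty?
  "#{hmac_hex(secret, user)}@#{hmac_hex(secret, domain)}"
end
```

With a secret loaded from config/config.yml, hash_address(secret, address)
returns a string of the form <hex>@<hex>, or nil when the input has no
usable username and domain. Because the two parts are hashed independently,
records can still be grouped by domain without revealing it.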
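
EXAMPLE: DECODING A LIST SENDER
============================================

The encoded list sender described under NOTES could be split back into list
and recipient before hashing, which would avoid inflating the number of
unique envelope senders. The sketch below is hypothetical and is not part of
the parser: decode_list_sender is an invented name, and the "bounce-" prefix
and "=" separators are assumed from the single example in NOTES. Other
mailing list software encodes the recipient differently.

```ruby
# Assumed pattern: bounce-LIST=USER=DOMAIN@LISTHOST
LIST_SENDER = /\Abounce-(?<list>[^=@]+)=(?<user>[^@]+)=(?<domain>[^=@]+)@(?<host>[^@]+)\z/

# Returns a hash with the list address and the encoded recipient,
# or nil if the sender does not match the assumed pattern.
def decode_list_sender(sender)
  m = LIST_SENDER.match(sender)
  return nil unless m
  {
    list: "#{m[:list]}@#{m[:host]}",
    recipient: "#{m[:user]}@#{m[:domain]}"
  }
end

result = decode_list_sender('bounce-debian-backports=micah=debian.org@lists.debian.org')
# result[:list]      is "debian-backports@lists.debian.org"
# result[:recipient] is "micah@debian.org"
```

Senders that do not match, such as ordinary addresses, return nil and can be
hashed unchanged.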