Add Parsing for .olm (Outlook for Mac) Files. #244
Comments
This is great, thank you. Of note:
It should also be noted that, despite that filtering, it seems to have created (potentially) full folder structures as if it were going to write that data, but just didn't put the actual files in them.
Oh my! It is leaking info like krazy! Typical Microsoft "betrayal of their duty to users".
I wasn't intending to add the test file you gave me to the official testing, so no worries there. However, if I can adequately build an instance of Outlook that exists only for tests, where everything it touches is intended to be public anyway, then files from it should be safe to add to testing. Unfortunately, this suddenly became a lot more complicated: the format is not part of the Microsoft Open Specifications, meaning it has no official public documentation and is not guaranteed to be consistent. This will make any attempt at a parser much more of a challenge.
Thank you!
I wanted to see if I could make something dead simple to parse it, just grabbing each of the tags and making anything found accessible directly, but some of the tags are lists (categories, attachments, addresses) and some have properties inside the tag. It would probably need some additional code to find those patterns and parse them correctly. I'll probably wait for that testing environment before really getting to work on this, so that I can get a much better idea of how this should look.
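For what it's worth, the two patterns mentioned here (list-like tags and tags that carry properties) could be handled by a small recursive converter. This is only a sketch: the name `element_to_value`, the `@`-prefixed attribute keys, and the `#text` key are arbitrary choices, and it has not been run against a real .olm file.

```python
import xml.etree.ElementTree as ET


def element_to_value(elem):
    """Convert an element to a str, or to a dict if it has attributes or children.

    Attributes are stored under '@name' keys, repeated child tags are collected
    into lists, and non-blank text of a non-leaf element goes under '#text'.
    """
    children = list(elem)
    text = (elem.text or "").strip()
    if not children and not elem.attrib:
        return text  # plain leaf tag: just its text
    result = {"@" + key: value for key, value in elem.attrib.items()}
    for child in children:
        value = element_to_value(child)
        if child.tag in result:
            existing = result[child.tag]
            if isinstance(existing, list):
                existing.append(value)
            else:
                # second occurrence of this tag: promote to a list
                result[child.tag] = [existing, value]
        else:
            result[child.tag] = value
    if text:
        result["#text"] = text
    return result
```

One known weakness: a list tag that happens to contain a single child comes back as a plain value rather than a one-item list, so real code would likely need a set of tag names that are always treated as lists.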
I've built a "proof of concept", mostly to convince myself and as a learning exercise. It is clear that some mails are more straightforward to handle than others. It is easy when the bodies/contents are plain text marked up with vanilla HTML. What stumped me were those where the body content was in what looked like the format of .docx documents. Maybe there are parsers for that already? There may be other formats that are hard to parse too, but I have not encountered them yet. Have you got an idea how to set up a dedicated instance of Outlook to generate tests?
A lot of the time they are built with a form of HTML that has additional tags for formatting in Word and the like, but it renders fine as plain HTML. Not sure if these are what you are talking about, but they also sometimes have branching conditions that check whether Word is present before activating, with something to fall through to if it's not available so that the content renders correctly in something like a browser. The way I would do it is to set up a computer with a fresh install of Outlook, designed to have nothing on it aside from information that can be public.
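If those branching conditions are the usual conditional comments that Word and Outlook emit (an assumption, since no sample body is quoted here), a rough way to reduce such a body to what a browser would show is to drop the Word-only blocks and unwrap the fallback ones. The function name and the regular expressions below are illustrative only, not a complete treatment of Word HTML.

```python
import re

# Fallback content wrapped so that browsers still see it:
#   <!--[if !mso]><!--> ... <!--<![endif]-->
REVEALED_WRAPPED = re.compile(r"<!--\[if[^\]]*\]><!-->|<!--<!\[endif\]-->", re.IGNORECASE)
# Word-only blocks that browsers treat as one big comment:
#   <!--[if gte mso 9]> ... <![endif]-->
HIDDEN = re.compile(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE)
# Bare fallback markers: <![if !mso]> ... <![endif]>
REVEALED_BARE = re.compile(r"<!\[if[^\]]*\]>|<!\[endif\]>", re.IGNORECASE)


def strip_conditional_comments(html):
    """Remove Word-only blocks and keep the fallback content.

    Order matters: the wrapped fallback markers are removed first, otherwise
    the HIDDEN pattern would swallow the fallback content along with them.
    """
    html = REVEALED_WRAPPED.sub("", html)
    html = HIDDEN.sub("", html)
    return REVEALED_BARE.sub("", html)
```

This does nothing about the extra Word-specific tags and styles themselves; it only deals with the conditional branches.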
Add support for generally parsing and handling .olm files. I need to see if I can track down proper documentation for these, but from what I have observed they seem rather simple. They are a renamed zip file composed of folders and (mostly) XML files. If a directory has emails, it seems to use the following format:
- Message files are named `__message_attachment__{id}.xml` (right now I have only seen the ID as a 6-digit number; it is unclear whether it is hex or decimal).
- Attachments are stored under `com.microsoft.__Attachments`, in files using the message ID for the name, followed by an underscore and a 4-digit number, presumably the ID of the attachment for the specified message.
- Message files contain an `<emails>` tag, presumably allowing them to store more than one email (each of which is denoted by an `<email>` tag). Names of properties within it have, so far, been reliably observed to be in the format `OPFMessageCopy{name}` (so `AttachmentList` would become `OPFMessageCopyAttachmentList`, `Body` would become `OPFMessageCopyBody`, etc.).
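The layout described above can be explored with a short Python sketch. It assumes only what has been observed so far: the archive is an ordinary zip, message files are XML containing `<email>` elements, and property tags carry the `OPFMessageCopy` prefix. The names `iter_emails` and `strip_prefix` are made up for illustration and are not part of any existing API.

```python
import zipfile
import xml.etree.ElementTree as ET

PREFIX = "OPFMessageCopy"  # prefix observed on property tags so far


def strip_prefix(tag):
    """Turn 'OPFMessageCopyBody' into 'Body'; leave other tags untouched."""
    return tag[len(PREFIX):] if tag.startswith(PREFIX) else tag


def iter_emails(olm):
    """Yield (member_name, {property: text}) for every <email> in the archive.

    `olm` is a path or a binary file-like object. Only the direct children of
    <email> are read, and only their text, so list-like properties such as
    AttachmentList are not expanded here.
    """
    with zipfile.ZipFile(olm) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                continue  # not every member is guaranteed to be valid XML
            for email in root.iter("email"):
                yield name, {strip_prefix(c.tag): (c.text or "") for c in email}
```

Something like `for name, props in iter_emails("export.olm"): print(name, sorted(props))` (with a hypothetical `export.olm`) would then show which properties each message actually carries, which seems like a reasonable first step given the lack of official documentation.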