Office Documents
There are two kinds of Office Documents: OLE2 and OOXML, aka "2007+". The two formats are totally different, and we handle them separately.
Parsing and writing are both done by Apache POI.
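To illustrate how different the two containers are, here is a minimal sketch that tells them apart by their leading bytes. It is not how DocBleach or Apache POI performs detection; the helper name `sniff_office_format` and its return labels are made up for this example. The only facts it relies on are the well-known OLE2 compound file signature and the ZIP local file header.

```python
# Well-known file signatures.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file header
ZIP_MAGIC = b"PK\x03\x04"                       # first entry of a non-empty ZIP


def sniff_office_format(data: bytes) -> str:
    """Guess the container family from the first bytes (hypothetical helper)."""
    if data.startswith(OLE2_MAGIC):
        return "OLE2"
    if data.startswith(ZIP_MAGIC):
        # A ZIP archive is only OOXML if it also contains the expected parts,
        # so this is a candidate, not a confirmation.
        return "OOXML candidate"
    return "unknown"
```

A real check would go on to open the archive and confirm its contents, since many unrelated formats are also ZIP files.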
The first format used by Office Documents is OLE2; OLE stands for "Object Linking and Embedding". It is a binary format that can embed objects (hence the name), including other OLE2 objects. Once parsed, the structure looks like a file system: the file header gives the paths, sizes and locations of the blocks of data. Because of this, the file cannot simply be changed in place: removing content would mean rewriting the header and all its pointers. Instead, we read the file and copy only what we want to preserve into a new one.
OLE2 files usually have a simple and short extension, like .doc, .ppt, .xls.
Word and other Office apps use a convention to store dynamic content: all the macros are in a "directory" named "Macros". We filter it out.
That's all. Neither hard nor strange.
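The copy-and-filter idea for OLE2 can be sketched on a toy model. Here the OLE2 directory tree is represented as nested dicts (storages) holding bytes (streams); this is an assumption made for illustration, not Apache POI's data model, and `copy_without_macros` is a hypothetical name. The sketch also drops a "Macros" entry at any depth, which is a choice made for this example.

```python
def copy_without_macros(directory: dict, blocked=("Macros",)) -> dict:
    """Return a filtered copy of a toy OLE2-like tree.

    Storages are modeled as dicts and streams as bytes. Entries whose
    name is in `blocked` are skipped instead of being copied.
    """
    clean = {}
    for name, entry in directory.items():
        if name in blocked:
            continue  # drop the whole macro storage and everything under it
        if isinstance(entry, dict):
            clean[name] = copy_without_macros(entry, blocked)  # recurse into storage
        else:
            clean[name] = entry  # plain stream: copy as-is
    return clean


# Toy document: two ordinary streams and one macro storage.
doc = {
    "WordDocument": b"...",
    "1Table": b"...",
    "Macros": {"VBA": {"ThisDocument": b"..."}},
}
print(sorted(copy_without_macros(doc)))  # ['1Table', 'WordDocument']
```

The original tree is never modified: a new tree is built from the entries that pass the filter, which mirrors the "read and copy" approach described above.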
According to Wikipedia:
Office Open XML (also informally known as OOXML or OpenXML) is a zipped, XML-based file format developed by Microsoft for representing spreadsheets, charts, presentations and word processing documents.
OOXML is the default format for files created with Office 2007 or later.
The file extension usually ends with x or m: .docx, .docm, and the same goes for the .xlsx and .pptx families.
This format has two interesting properties: a list of elements and a list of their relationships. Using the list of elements, we know exactly what is what: a macro has to declare that it is a macro. This makes macros very easy to detect and remove. The relationships are used to clean up dependencies, and may help improve the sanitization process in the future.
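As an illustration of the "list of elements" idea, the sketch below reads `[Content_Types].xml` from an OOXML package and reports the parts declared with the VBA project content type. It uses only Python's standard library, not Apache POI, and it is not DocBleach's implementation; `find_macro_parts` is a hypothetical helper, and only one macro-related content type is checked here. Relationship cleanup is not shown.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the package content-types part, and the VBA project type.
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
MACRO_TYPE = "application/vnd.ms-office.vbaProject"


def find_macro_parts(package: bytes) -> list:
    """List part names declared as VBA projects (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        root = ET.fromstring(zf.read("[Content_Types].xml"))
        names = zf.namelist()

    # <Override> declares the type of one specific part.
    parts = {
        o.get("PartName")
        for o in root.iter(CT_NS + "Override")
        if o.get("ContentType") == MACRO_TYPE and o.get("PartName")
    }

    # <Default> declares the type of every part with a given extension.
    macro_exts = {
        d.get("Extension").lower()
        for d in root.iter(CT_NS + "Default")
        if d.get("ContentType") == MACRO_TYPE and d.get("Extension")
    }
    for name in names:
        if "." in name and name.rsplit(".", 1)[1].lower() in macro_exts:
            parts.add("/" + name)

    return sorted(parts)
```

A sanitizer built on this idea would then copy every other entry into a new archive and remove the matching declarations and relationships, rather than editing the original package.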
This approach is the one Microsoft recommends in its knowledge-base. See Developing Solutions Using the Office XML Formats > Document Security in Introducing the Office (2007) Open XML File Formats.
DocBleach might break macros and ActiveX objects; this is the intended behavior.