This article describes several techniques for extracting information from Excel workbooks, PowerPoint presentations, and Word documents.
Office Open XML
The Office Open XML Formats are designed so that multiple applications on multiple platforms can create and access Office Open XML documents. By using the Office Open XML Format, you can directly manipulate the file format. You do not have to use Microsoft Office applications to create or to access the files.
Benefits of Office Open XML
- It is open. Office Open XML is openly licensed and documented. It is refined in the open Ecma process so that it works across a wide variety of platforms, applications, and usages.
- It is XML. Office Open XML is a standard technology that many tools and applications can easily and transparently use.
- It is backward compatible and interoperable. This enables you to preserve documents in their original form while they are converted to an open, modern format. Additionally, different applications can use the Office Open XML Format with predictable results.
- It works with what you have through custom XML schema support, through free updates for existing versions of Office, and through support of important accessibility functions for disabled workers.
- It is ready for the future. With Office Open XML, you can use all the features in the 2007 Microsoft Office programs to create documents. Office Open XML provides ways to subset or to extend these features while maintaining conformance.
- It can help improve security. IT security procedures and applications can more easily discover and fix potential problems, while documents are less likely to be corrupted.
For more information about the Office Open XML Format, read the Office Open XML v1.0 draft on the following Ecma International Web site:
Additionally, visit the following OpenXMLDeveloper.org Web site:
The Office Open XML Formats use the Open Packaging Conventions to store the Office Open XML file information on disk. For more information about the Open Packaging Conventions as used by Office Open XML, see the Office Open XML v1.0 draft, part 2, "Open Packaging Conventions".
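Because the Open Packaging Conventions store a document as a ZIP archive of parts, a package can be inspected with general-purpose tools. The following is a minimal sketch that uses only the Python standard library. The helper names (list_package_parts, read_content_type_overrides, build_demo_package) are hypothetical, and the demo package built in memory is an illustration only, not a valid Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the [Content_Types].xml part of a package.
CONTENT_TYPES_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def list_package_parts(source):
    """Return the names of all ZIP entries in a package (path or file-like object)."""
    with zipfile.ZipFile(source) as package:
        return package.namelist()


def read_content_type_overrides(source):
    """Map each explicitly listed part name to its declared content type."""
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("[Content_Types].xml"))
    return {
        item.get("PartName"): item.get("ContentType")
        for item in root.findall(f"{{{CONTENT_TYPES_NS}}}Override")
    }


def build_demo_package():
    """Build a tiny, hypothetical package in memory (NOT a valid Word file)."""
    content_types = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<Types xmlns="{CONTENT_TYPES_NS}">'
        '<Override PartName="/word/document.xml" '
        'ContentType="application/vnd.openxmlformats-officedocument.'
        'wordprocessingml.document.main+xml"/>'
        '</Types>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", content_types)
        package.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


if __name__ == "__main__":
    demo = build_demo_package()
    print(list_package_parts(io.BytesIO(demo)))
    print(read_content_type_overrides(io.BytesIO(demo)))
```

To try the same functions on a real file, pass the path of a .docx, .xlsx, or .pptx file instead of the in-memory demo package.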
Office Application Programming Interfaces (APIs)
Office binary file formats are designed to be accessed through the Office Application Programming Interfaces (APIs), instead of by direct manipulation of the file format. Because of the complexity of the formats, direct manipulation can cause corruption and is strongly discouraged.
For more information about the Office APIs, visit the following Microsoft Web site:
The Office 97-2003 binary file formats use the Windows Structured Storage APIs. The Office-specific information is stored as streams in this more generalized format. Common elements, such as document properties, can be accessed through the Structured Storage APIs and do not require access to the Office binary file format documentation.
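Before choosing an extraction technique, it can help to confirm which container a file actually uses. The following minimal sketch only inspects the leading bytes of a file. It does not read streams or document properties, which still requires the Structured Storage APIs or a library built for that purpose. The function names and the returned labels are hypothetical.

```python
# Leading bytes of a structured storage (compound) file.
COMPOUND_FILE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
# Leading bytes of a ZIP archive, the container used by Office Open XML packages.
ZIP_SIGNATURE = b"PK\x03\x04"


def sniff_container(header):
    """Classify a file by its first bytes; the labels are illustrative only."""
    if header.startswith(COMPOUND_FILE_SIGNATURE):
        return "structured-storage"
    if header.startswith(ZIP_SIGNATURE):
        return "zip-package"
    return "unknown"


def sniff_file(path):
    """Read the first eight bytes of a file and classify its container."""
    with open(path, "rb") as handle:
        return sniff_container(handle.read(8))


if __name__ == "__main__":
    print(sniff_container(COMPOUND_FILE_SIGNATURE + b"\x00" * 8))
    print(sniff_container(b"PK\x03\x04rest-of-archive"))
    print(sniff_container(b"{\\rtf1"))
```

A matching signature identifies only the container. It does not confirm that the contents are a well-formed Office document.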
For more information about the Windows Structured Storage APIs, visit the following Microsoft Web site:
The Microsoft Excel 2007 binary format (*.xlsb) stores binary records. This format uses the same part and packaging technologies that are found in SpreadsheetML, which is part of the Office Open XML Format.
Important: Reading or manipulating the structure directly can cause corruption and is strongly discouraged.
XML
XML is a plain-text, Unicode-based metalanguage (a language for defining markup languages). XML is not tied to any programming language, operating system, or software vendor. XML provides access to a wide range of technologies for manipulating, structuring, transforming, and querying data. As the use of XML has grown, it has become generally accepted that XML is useful not only for describing new document formats for the Web but also for describing structured data. Examples of structured data include information that is typically contained in spreadsheets, program configuration files, and network protocols.
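As an illustration of querying structured data that is stored as XML, the following minimal sketch parses a small, spreadsheet-like data set with the Python standard library. The element names, attribute names, and values in SAMPLE_XML are hypothetical and do not correspond to any Office schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical data of the kind that might otherwise live in a spreadsheet.
SAMPLE_XML = """<orders>
  <order id="1001"><customer>Contoso</customer><total>250.00</total></order>
  <order id="1002"><customer>Fabrikam</customer><total>99.50</total></order>
</orders>"""


def order_totals(xml_text):
    """Return a dictionary that maps each order id to its total as a float."""
    root = ET.fromstring(xml_text)
    return {
        order.get("id"): float(order.findtext("total"))
        for order in root.findall("order")
    }


def customers(xml_text):
    """Return the customer names in document order."""
    root = ET.fromstring(xml_text)
    return [order.findtext("customer") for order in root.iter("order")]


if __name__ == "__main__":
    print(order_totals(SAMPLE_XML))
    print(customers(SAMPLE_XML))
```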
Microsoft Office includes support for XML schemas. Microsoft maintains a licensing program for certain Office XML schemas.
To learn more about Office XML schemas, visit the following Microsoft Web site to view "Microsoft Office System and XML: Bringing XML to the Desktop":
Rich Text Format (RTF)
The Rich Text Format (RTF) specification is a method of encoding formatted text and graphics for easy transfer between programs. The RTF specification provides a format for text and graphics interchange that can be used with different output devices, operating environments, and operating systems. RTF uses the American National Standards Institute (ANSI), PC-8, Macintosh, or IBM PC character set to control the representation and the formatting of a document, both on the screen and in print. With the RTF specification, documents that are created under different operating systems and with different software programs can be transferred between those operating systems and programs.
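To make the encoding concrete, the following minimal sketch wraps plain text in a small RTF document. It assumes the ANSI character set declaration and a single font table entry, and it escapes the three characters that RTF reserves (backslash and both braces). Characters outside ASCII are written with the \u control word followed by a "?" fallback character. The function names and the choice of Arial are assumptions made for illustration, and the sketch covers only a small part of the specification.

```python
def escape_rtf(text):
    """Escape plain text so that it can be placed inside an RTF group."""
    pieces = []
    for char in text:
        if char in "\\{}":
            # Reserved characters are prefixed with a backslash.
            pieces.append("\\" + char)
        elif char == "\n":
            # A line break in the input becomes a paragraph mark.
            pieces.append("\\par\n")
        elif ord(char) < 128:
            pieces.append(char)
        else:
            # Emit each UTF-16 code unit as a signed 16-bit decimal value.
            units = char.encode("utf-16-le")
            for index in range(0, len(units), 2):
                unit = int.from_bytes(units[index:index + 2], "little")
                if unit >= 0x8000:
                    unit -= 0x10000
                pieces.append(f"\\u{unit}?")
    return "".join(pieces)


def make_rtf(text):
    """Return a minimal RTF document that contains the given text."""
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\\f0 "
    return header + escape_rtf(text) + "}"


if __name__ == "__main__":
    print(make_rtf("Totals {draft}\nCaf\u00e9 sales"))
```

Writing the returned string to a file with an .rtf extension should produce a document that RTF-aware programs can open, although rendering details depend on the reader.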
For more information about how to write or how to implement a sample RTF reader, visit the following Microsoft Web site, and then type RTF Reader in the Search MSDN For box:
Visio XML schema
Through Microsoft documentation and a royalty-free license, customers and partners can take advantage of the XML schema that is used by Visio, the Microsoft diagramming and data visualization tool. The Visio schema provides a complete and W3C-compliant description of the Visio Extensible Markup Language (XML) file format. This enables organizations to access the information that is captured in their Visio diagrams and to use it with other XML-enabled programs, such as customer relationship management (CRM) and enterprise resource planning (ERP) systems, as part of their business processes. For more information and download capabilities, visit the following Microsoft Web site:
HTML
HTML files are text files that contain the information that users will see, together with tags that specify how that information is formatted for display. You can use HTML to store, distribute, and present Office documents and data in a format that can be viewed by using most Web browsers while retaining the rich content and functionality of Office documents.
Note: In Microsoft Excel 2007, the HTML file format does not save features that are specific to Excel. Additionally, the HTML format does not support or render all the features in Excel 2007 when you save a workbook as HTML.
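When a workbook has been saved as HTML, the visible cell values can be recovered by reading the table markup. The following minimal sketch uses the Python standard library HTML parser and assumes a simple, non-nested table. The class name, the function name, and the sample markup are hypothetical, and HTML that Excel actually produces contains additional markup that this sketch ignores.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html_text):
    """Return a list of rows, where each row is a list of cell strings."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    sample = (
        "<table><tr><th>Item</th><th>Qty</th></tr>"
        "<tr><td>Pens</td><td>12</td></tr></table>"
    )
    print(extract_rows(sample))
```

Only the displayed text is recovered this way. As the note above says, Excel-specific features are not preserved in the HTML file.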
For more information about how to edit HTML, visit the following Microsoft Web site:
For more information about how to work with code, HTML, and resource files, visit the following Microsoft Web site:
Royalty-Free File Format Programs
Microsoft Office Binary File Formats
Microsoft makes its .doc, .xls, .xlsb, and .ppt binary file format specifications available, under a royalty-free covenant not to sue, to anyone who wishes to implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation for analysis and for forensic reference purposes.
Microsoft Office Drawing File Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are also available under this program. The documentation that covers the binary file format specifications is cumulative and covers the most current form of the binary file formats as well as earlier versions.
Office Binary File Format specifications are available under the Open Specification Promise. To obtain documentation, visit the following Microsoft Web site: