How to extract information from Office files by using Office file formats and schemas
Article ID: 840817
If you have to extract information from Microsoft Excel workbooks, Microsoft PowerPoint presentations, or Microsoft Word documents, you can use several methods. These methods include API programming calls, Office Open XML, XML, RTF, and HTML. If these methods do not address your needs, you may be eligible to participate in a Royalty-Free File Format Program and to receive technical documentation for certain Microsoft Office binary file formats.
This article describes several techniques that are available for extracting information from Excel workbooks, PowerPoint presentations, and Word documents.
Office Open XML
The Office Open XML Formats are designed so that multiple applications on multiple platforms can create and access Office Open XML documents. By using the Office Open XML Format, you can directly manipulate the file format. You do not have to use Microsoft Office applications to create or to access the files.
Benefits of Office Open XML
http://www.ecma-international.org/news/TC45_current_work/TC45-2006-50_final_draft.htm
Additionally, visit the following OpenXMLDeveloper.org Web site:
http://openxmldeveloper.org
The Office Open XML Formats use the Open Packaging Conventions to store the Office Open XML file information on disk. For more information about the Open Packaging Conventions as used by Office Open XML, see the Office Open XML v1.0 draft, part 2, "Open Packaging Conventions".
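Because an Open Packaging Conventions package is stored as a ZIP archive, its parts can be read with general-purpose tools. The following Python sketch lists the parts in a package and collects paragraph text from a WordprocessingML part. It assumes that the main document part is named word/document.xml, which is the usual convention but is properly discovered through the package relationships. The helper names and the in-memory sample package are hypothetical and exist only for illustration; the sample is not a complete Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by the main document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_parts(package):
    """Return the names of all parts stored in a ZIP-based package."""
    with zipfile.ZipFile(package) as zf:
        return zf.namelist()


def extract_paragraphs(package, part="word/document.xml"):
    """Return the text of each w:p paragraph found in the given part.

    Assumption: 'part' names the main document part. A robust reader
    would locate it through _rels/.rels instead of hard-coding the name.
    """
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read(part))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more w:t run elements.
        runs = [t.text or "" for t in p.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


def build_sample_package():
    """Build a tiny illustrative package in memory (not a full .docx)."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", body)
    buf.seek(0)
    return buf
```

For a file on disk, a path such as "report.docx" (a hypothetical name) can be passed in place of the in-memory buffer. The sketch only reads the package and does not modify it.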
Office Application Programming Interfaces (APIs)
Office binary file formats are designed to be accessed through the Office Application Programming Interfaces (APIs), instead of by direct manipulation of the file format. Because of the complexity of the formats, direct manipulation can cause corruption and is strongly discouraged.
For more information about the Office APIs, visit the following Microsoft Web site:
http://msdn2.microsoft.com/en-us/library/aa165081(office.10).aspx
The Office 97-2003 binary file formats use the Windows Structured Storage APIs. The Office-specific information is stored as streams in this more generalized format. Common elements, such as document properties, can be accessed through the Structured Storage APIs and do not require access to the Office binary file format documentation.
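Before choosing an access method, it can help to determine which container a file uses. The following Python sketch checks a stream for the 8-byte signature that begins a Structured Storage (compound file) container, and otherwise tests whether the stream is a ZIP archive, which is the container used by the Open Packaging Conventions. The function name and the returned labels are hypothetical. The sketch inspects only the container and does not read or interpret any Office-specific streams.

```python
import io
import zipfile

# Signature at the start of every compound file (Structured Storage) container.
OLE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_container(stream):
    """Classify a seekable binary stream by its container format.

    Returns 'compound-file', 'zip-package', or 'unknown'.
    """
    stream.seek(0)
    header = stream.read(len(OLE_SIGNATURE))
    stream.seek(0)
    if header == OLE_SIGNATURE:
        return "compound-file"
    if zipfile.is_zipfile(stream):
        return "zip-package"
    return "unknown"
```

For a file on disk, open it in binary mode, for example open(path, "rb"), and pass the resulting file object. A "compound-file" result suggests using the Structured Storage APIs or the Office APIs; a "zip-package" result suggests package-based access.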
For more information about the Windows Structured Storage APIs, visit the following Microsoft Web site:
http://msdn2.microsoft.com/en-us/library/aa380369.aspx
The Microsoft Excel 2007 binary format (*.xlsb) stores binary records. This format uses the same part and packaging technologies that are found in SpreadsheetML. SpreadsheetML is part of the Office Open XML Format.
Important: Reading or manipulating the structure directly can cause corruption and is strongly discouraged.
XML
XML is a plain-text, Unicode-based metalanguage (a language for defining markup languages). XML is not tied to any programming language, operating system, or software vendor. XML provides access to a plethora of technologies for manipulating, structuring, transforming, and querying data. As the use of XML has grown, it is now typically accepted that XML is not only useful for describing new document formats for the Web, but is also suitable to describe structured data. Examples of structured data include information that is typically contained in spreadsheets, program configuration files, and network protocols.
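As a small illustration of querying structured data that is expressed in XML, the following Python sketch parses spreadsheet-like rows and filters them. The element names, attribute names, and sample values are hypothetical and do not correspond to any Office schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; not an Office XML schema.
SAMPLE = """<orders>
  <order id="1001"><customer>Contoso</customer><total currency="USD">250.00</total></order>
  <order id="1002"><customer>Fabrikam</customer><total currency="EUR">99.50</total></order>
</orders>"""


def read_orders(xml_text):
    """Convert each <order> element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    rows = []
    for order in root.findall("order"):
        total = order.find("total")
        rows.append({
            "id": order.get("id"),
            "customer": order.findtext("customer"),
            "total": float(total.text),
            "currency": total.get("currency"),
        })
    return rows


def customers_over(xml_text, minimum):
    """Return the customers whose order total exceeds 'minimum'."""
    root = ET.fromstring(xml_text)
    return [
        order.findtext("customer")
        for order in root.findall("order")
        if float(order.findtext("total")) > minimum
    ]
```

With the sample data, customers_over(SAMPLE, 100) returns ["Contoso"]. The same approach applies to any XML vocabulary once its element names are known.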
Microsoft Office includes support for XML schemas. Microsoft maintains a licensing program for certain Office XML schemas.
To learn more about Office XML schemas, visit the following Microsoft Web site to view the Microsoft Office System and XML: Bringing XML to the Desktop article:
Rich Text Format (RTF)
The Rich Text Format (RTF) specification is a method of encoding formatted text and graphics for easy transfer between programs. The RTF specification provides a format for text and graphics interchange that can be used with different output devices, operating environments, and operating systems. RTF uses the American National Standards Institute (ANSI), PC-8, Macintosh, or IBM PC character set to control the representation and the formatting of a document, both on the screen and in print. With the RTF specification, documents that are created under different operating systems and that are created by using different software programs can be transferred between those operating systems and those programs.
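Because RTF is plain text made up of groups, control words, and literal text, a rough text extractor can be sketched in a few lines. The following Python sketch is a simplified illustration and is not a complete RTF reader. It assumes the ANSI (Windows-1252) character set, skips a small hypothetical list of non-text destinations, and does not handle \u Unicode escapes, embedded binary data, or tables. The function and constant names are hypothetical.

```python
import re

# Destinations whose contents are metadata rather than document text.
# This list is illustrative, not exhaustive.
SKIP_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}

TOKEN = re.compile(
    r"\\'([0-9a-fA-F]{2})"        # 1: hex-escaped character, such as \'e9
    r"|\\([a-zA-Z]+)(-?\d+)? ?"   # 2, 3: control word, optional numeric parameter
    r"|\\([^a-zA-Z])"             # 4: control symbol, such as \{ or \*
    r"|([{}])"                    # 5: group delimiter
    r"|([^\\{}]+)"                # 6: plain text
)


def rtf_to_text(rtf, encoding="cp1252"):
    """Return an approximation of the plain text in an RTF string."""
    out = []
    stack = []        # saved 'skipping' state for each open group
    skipping = False  # True while inside a destination that is ignored
    for match in TOKEN.finditer(rtf):
        hexchar, word, _param, symbol, brace, text = match.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif skipping:
            continue
        elif symbol == "*" or word in SKIP_DESTINATIONS:
            skipping = True  # ignore the rest of this group
        elif word in ("par", "line"):
            out.append("\n")
        elif word == "tab":
            out.append("\t")
        elif hexchar:
            out.append(bytes.fromhex(hexchar).decode(encoding, errors="replace"))
        elif symbol in ("{", "}", "\\"):
            out.append(symbol)  # escaped literal character
        elif text:
            # Raw line breaks in RTF source are not part of the text.
            out.append(text.replace("\r", "").replace("\n", ""))
    return "".join(out)


SAMPLE_RTF = r"{\rtf1\ansi{\fonttbl{\f0 Arial;}}\f0 Hello, \b world\b0 !\par Caf\'e9}"
```

Calling rtf_to_text(SAMPLE_RTF) yields the text "Hello, world!" followed by a line break and "Café". Formatting control words such as \b are recognized only so that they can be discarded.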
For more information about how to write or how to implement a sample RTF reader, visit the following Microsoft Web site, and then type RTF Reader in the Search MSDN For box:
Visio XML schema
Through the Microsoft documentation and a royalty-free license, customers and partners can take advantage of the XML schema for Visio, the Microsoft diagramming and data visualization tool. The Visio schema provides a complete and W3C-compliant description of the Visio Extensible Markup Language (XML) file format. This enables organizations to access information that is captured in their Visio diagrams and to use it with other XML-enabled programs, such as customer relationship management (CRM) and enterprise resource planning (ERP) systems, as part of their business processes. For more information and to download the schema, visit the following Microsoft Web site:
HTML
HTML files are text files that include the information that users will see, and tags that specify formatting information about how the information will be presented for display purposes. You can use HTML to store, distribute, and present Office documents and data in a format that can be viewed by using most Web browsers while retaining the rich content and functionality of Office documents.
Note: In Microsoft Excel 2007, the HTML file format does not save features that are specific to Excel. Additionally, the HTML format does not support or render all the features in Excel 2007 when you save a workbook as HTML.
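When a workbook or document has been saved as HTML, the visible data can be read back with an ordinary HTML parser. The following Python sketch collects the cell text of each table row by using the standard html.parser module. It assumes that the table markup uses explicit tr, td, and th start and end tags; the class name, function name, and sample markup are hypothetical and are not taken from an actual Office export.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html_text):
    """Return table contents as a list of rows of cell strings."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.rows


SAMPLE_HTML = """<html><body><table>
<tr><th>Item</th><th>Qty</th></tr>
<tr><td>Widget</td><td><b>3</b></td></tr>
<tr><td>Gadget &amp; Co</td><td>12</td></tr>
</table></body></html>"""
```

With the sample markup, extract_rows(SAMPLE_HTML) returns three rows: the header row, ["Widget", "3"], and ["Gadget & Co", "12"]. Formatting tags inside a cell are ignored, and only the text is kept.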
For more information about how to edit HTML, visit the following Microsoft Web site:
http://msdn2.microsoft.com/en-us/library/aa730778(vs.71).aspx
For more information about how to work with code, HTML, and resource files, visit the following Microsoft Web site:
Royalty-Free File Format Programs
Microsoft Office Binary File Formats
Microsoft makes its .doc, .xls, .xlsb, and .ppt binary file format specifications available under a royalty-free covenant not to sue to anyone who wishes to implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation for analysis and for forensic reference purposes.
Microsoft Office Drawing File Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are also available under this program. The documentation that covers the binary file format specifications is cumulative and covers the most current form of the binary file formats as well as earlier versions.
Office Binary File Format specifications are available under the Open Specification Promise. To obtain documentation, visit the following Microsoft Web site:
Article ID: 840817 - Last Review: February 26, 2008 - Revision: 8.1