How to extract information from Office files by using Office file formats and schemas

Article ID: 840817

SUMMARY

If you have to extract information from Microsoft Excel workbooks, Microsoft PowerPoint presentations, or Microsoft Word documents, you can use several methods. These methods include API programming calls, Office Open XML, XML, RTF, and HTML. If these methods do not address your needs, you may be eligible to participate in a Royalty-Free File Format Program and to receive technical documentation for certain Microsoft Office binary file formats.

INTRODUCTION

This article describes several techniques that are available for extracting information from Excel workbooks, PowerPoint presentations, and Word documents.

MORE INFORMATION

Office Open XML

The Office Open XML Formats are designed so that multiple applications on multiple platforms can create and access Office Open XML documents. By using the Office Open XML Format, you can directly manipulate the file format. You do not have to use Microsoft Office applications to create or to access the files.

Benefits of Office Open XML

  • It is open. Office Open XML is openly licensed and documented. It is refined in the open Ecma process so that it works across a wide variety of platforms, applications, and usages.
  • It is XML. Office Open XML is a standard technology that many tools and applications can easily and transparently use.
  • It is backward compatible and interoperable. This enables you to preserve documents in their original form while they are converted to an open, modern format. Additionally, different applications can use the Office Open XML Format with predictable results.
  • It works with what you have through custom XML schema support, through free updates for existing versions of Office, and through support of important accessibility functions for workers with disabilities.
  • It is ready for the future. With Office Open XML, you can use all the features in the 2007 Microsoft Office programs to create documents. Office Open XML provides ways to subset or to extend these features while it maintains conformity.
  • It can help improve security. IT security procedures and applications can more easily discover and fix potential problems, while documents are less likely to be corrupted.
For more information about the Office Open XML Format, read the Office Open XML v1.0 draft on the following Ecma International Web site:
http://www.ecma-international.org/news/TC45_current_work/TC45-2006-50_final_draft.htm
Additionally, visit the following OpenXMLDeveloper.org Web site:
http://openxmldeveloper.org
The Office Open XML Formats use the Open Packaging Conventions to store the Office Open XML file information on disk. For more information about the Open Packaging Conventions as used by Office Open XML, see the Office Open XML v1.0 draft, part 2, "Open Packaging Conventions".
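
Because an Office Open XML file is a package of XML parts, the text of a Word document can be read without Word. The following is a minimal sketch in Python that uses only the standard library. It assumes that the main document part is stored under the conventional name word/document.xml instead of resolving that name through the package relationships. The helper make_sample_docx is hypothetical and exists only so that the sketch runs without a real document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by the body of a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each paragraph found in word/document.xml.

    `source` is a file path or a binary file object. The part name is
    assumed, not looked up through the package relationships.
    """
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # Paragraph text is split across runs; join every w:t inside it.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs


def make_sample_docx():
    """Build a tiny in-memory package (hypothetical sample data)."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Quarterly </w:t></w:r><w:r><w:t>report</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Total: 42</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for line in docx_paragraphs(make_sample_docx()):
        print(line)
```

A more complete reader would follow the relationship parts that the Open Packaging Conventions define and would handle tables, headers, and footnotes, which this sketch ignores.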

Office Application Programming Interfaces (APIs)

Office binary file formats are designed to be accessed through the Office Application Programming Interfaces (APIs), instead of by direct manipulation of the file format. Because of the complexity of the formats, direct manipulation can cause corruption and is strongly discouraged.

For more information about the Office APIs, visit the following Microsoft Web site:
http://msdn2.microsoft.com/en-us/library/aa165081(office.10).aspx
The Office 97-2003 binary file formats use the Windows Structured Storage APIs. The Office-specific information is stored as streams in this more generalized format. Common elements, such as document properties, can be accessed through the Structured Storage APIs and do not require access to the Office binary file format documentation.

For more information about the Windows Structured Storage APIs, visit the following Microsoft Web site:
http://msdn2.microsoft.com/en-us/library/aa380369.aspx
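
Before choosing an extraction method, it can help to confirm that a file really is a structured storage (compound file) container. The following Python sketch checks the 8-byte signature at the start of the file and reads the version and sector size from the header. It only identifies the container and does not read Office streams or document properties. The function names and the sample header are illustrative assumptions.

```python
import struct

# Signature that opens every compound file (structured storage) container.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def read_cfb_header(data: bytes):
    """Return (major_version, sector_size), or None if not a compound file."""
    if len(data) < 32 or data[:8] != CFB_SIGNATURE:
        return None
    # Offsets 24..31 hold four little-endian 16-bit values:
    # minor version, major version, byte-order mark, sector shift.
    _minor, major, byte_order, sector_shift = struct.unpack_from("<4H", data, 24)
    if byte_order != 0xFFFE:
        return None
    return major, 1 << sector_shift


def sample_header() -> bytes:
    """Hypothetical version 3 header prefix: signature, empty CLSID, fields."""
    return CFB_SIGNATURE + bytes(16) + struct.pack("<4H", 0x3E, 3, 0xFFFE, 9)


if __name__ == "__main__":
    print(read_cfb_header(sample_header()))
    # A ZIP-based (Office Open XML) file starts with "PK" and is rejected.
    print(read_cfb_header(b"PK\x03\x04" + bytes(28)))
```

To read the streams or the document properties themselves, use the Structured Storage APIs as described above.
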
The Microsoft Excel 2007 binary format (*.xlsb) stores binary records. This format uses the same part and packaging technologies that are found in SpreadsheetML. SpreadsheetML is part of the Office Open XML Format.

Important Reading or manipulating the structure directly can cause corruption and is strongly discouraged.

XML

XML is a plain-text, Unicode-based metalanguage (a language for defining markup languages). XML is not tied to any programming language, operating system, or software vendor, and a wide range of technologies exists for manipulating, structuring, transforming, and querying XML data. As the use of XML has grown, it has become widely accepted that XML is useful not only for describing new document formats for the Web but also for describing structured data. Examples of structured data include information that is typically contained in spreadsheets, program configuration files, and network protocols.

Microsoft Office includes support for XML schemas. Microsoft maintains a licensing program for certain Office XML schemas.

To learn more about Office XML schemas, visit the following Microsoft Web site to view the Microsoft Office System and XML: Bringing XML to the Desktop article:
http://msdn2.microsoft.com/en-us/library/aa159914(office.11).aspx
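
As an illustration of querying Office data that is saved as XML, the following Python sketch reads the rows of a workbook in the Excel 2003 XML Spreadsheet format. The sample workbook is made up for this example. The sketch assumes a simple sheet in which every cell is present, so it ignores sparse-cell attributes such as ss:Index.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Excel 2003 XML Spreadsheet format.
SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}

# Hypothetical sample workbook.
SAMPLE = f"""<?xml version="1.0"?>
<Workbook xmlns="{SS}" xmlns:ss="{SS}">
 <Worksheet ss:Name="Sales">
  <Table>
   <Row>
    <Cell><Data ss:Type="String">Region</Data></Cell>
    <Cell><Data ss:Type="String">Units</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="String">North</Data></Cell>
    <Cell><Data ss:Type="Number">120</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>"""


def read_rows(xml_text):
    """Return {worksheet name: list of rows}, converting numeric cells."""
    root = ET.fromstring(xml_text)
    sheets = {}
    for sheet in root.findall("ss:Worksheet", NS):
        rows = []
        for row in sheet.findall("ss:Table/ss:Row", NS):
            values = []
            for data in row.findall("ss:Cell/ss:Data", NS):
                text = data.text or ""
                # Convert numeric cells; leave every other type as text.
                if data.get(f"{{{SS}}}Type") == "Number":
                    values.append(float(text))
                else:
                    values.append(text)
            rows.append(values)
        sheets[sheet.get(f"{{{SS}}}Name")] = rows
    return sheets


if __name__ == "__main__":
    print(read_rows(SAMPLE))
```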

Rich Text Format (RTF)

The Rich Text Format (RTF) specification is a method of encoding formatted text and graphics for easy transfer between programs. The RTF specification provides a format for text and graphics interchange that can be used with different output devices, operating environments, and operating systems. RTF uses the American National Standards Institute (ANSI), PC-8, Macintosh, or IBM PC character set to control the representation and the formatting of a document, both on the screen and in print. With the RTF specification, documents that are created under different operating systems and with different software programs can be transferred between those operating systems and programs.

For more information about how to write or how to implement a sample RTF reader, visit the following Microsoft Web site, and then type RTF Reader in the Search MSDN For box:
http://msdn.microsoft.com
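
To give a sense of what an RTF reader does, the following Python sketch extracts plain text from a small RTF string. It is a deliberately incomplete illustration and not the sample reader that Microsoft documents. It skips a short, assumed list of destination groups, decodes \'hh escapes with an assumed Windows-1252 code page, and does not handle \u Unicode escapes, pictures, or tables.

```python
import re

# Destinations whose contents are metadata, not body text (incomplete list).
SKIPPED_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}

TOKEN = re.compile(
    r"\\([a-z]+)(-?\d+)? ?"    # control word, optional number, delimiter space
    r"|\\'([0-9a-fA-F]{2})"    # hex-escaped byte such as \'e9
    r"|\\([^a-z'])"            # control symbol such as \\ \{ \} \*
    r"|([{}])"                 # group open or close
    r"|([^\\{}]+)"             # run of plain text
)


def rtf_to_text(rtf: str, codepage: str = "cp1252") -> str:
    out = []
    stack = []        # saved "skipping" flag for each open group
    skipping = False  # True while inside a destination that is ignored
    for match in TOKEN.finditer(rtf):
        word, _number, hexbyte, symbol, brace, text = match.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif symbol is not None:
            if symbol == "*":
                skipping = True          # \* marks an optional destination
            elif symbol in "\\{}" and not skipping:
                out.append(symbol)       # escaped literal character
        elif word is not None:
            if word in SKIPPED_DESTINATIONS:
                skipping = True
            elif not skipping and word in ("par", "line"):
                out.append("\n")
            elif not skipping and word == "tab":
                out.append("\t")
        elif hexbyte is not None:
            if not skipping:
                out.append(bytes.fromhex(hexbyte).decode(codepage, errors="replace"))
        elif text is not None and not skipping:
            # Raw line breaks in the RTF source carry no meaning.
            out.append(text.replace("\r", "").replace("\n", ""))
    return "".join(out)


SAMPLE = r"{\rtf1\ansi{\fonttbl{\f0 Arial;}}\f0 Hello, \b world\b0 !\par Caf\'e9}"

if __name__ == "__main__":
    print(rtf_to_text(SAMPLE))
```

The group stack matters because formatting and destination state are scoped to the enclosing braces: when a group closes, the reader returns to the state that was in effect before the group opened.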

Visio XML schema

Through Microsoft documentation and a royalty-free license, customers and partners can take advantage of the XML schema of Microsoft Visio, the Microsoft diagramming and data visualization tool. The Visio schema provides a complete and W3C-compliant description of the Visio Extensible Markup Language (XML) file format. This enables organizations to access information that is captured in their Visio diagrams and to use it with other XML-enabled programs, such as customer relationship management (CRM) and enterprise resource planning (ERP) systems, as part of their business processes. To obtain more information and to download the schema, visit the following Microsoft Web site:
http://www.microsoft.com/downloads/details.aspx?FamilyID=fe118952-3547-420a-a412-00a2662442d9
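
As a rough sketch of how another program might pull information out of a Visio XML drawing, the following Python code collects the text of each shape. The element names (Shape, Text), the ID attribute, and the namespace in the sample are assumptions for illustration and should be checked against the schema documentation. The code matches elements by local name so that it does not depend on the exact namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified drawing; real files carry far more detail.
SAMPLE = """<VisioDocument xmlns="http://schemas.microsoft.com/visio/2003/core">
 <Pages><Page ID="0"><Shapes>
  <Shape ID="1"><Text>Customer</Text></Shape>
  <Shape ID="2"><Text>Order</Text></Shape>
  <Shape ID="3"/>
 </Shapes></Page></Pages>
</VisioDocument>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def shape_texts(xml_text):
    """Return {shape ID: text} for every shape that has a Text child."""
    root = ET.fromstring(xml_text)
    result = {}
    for element in root.iter():
        if local_name(element.tag) != "Shape":
            continue
        for child in element:
            if local_name(child.tag) == "Text":
                # itertext also collects text that follows inline child elements.
                result[element.get("ID")] = "".join(child.itertext()).strip()
    return result


if __name__ == "__main__":
    print(shape_texts(SAMPLE))
```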

HTML

HTML files are text files that include the information that users will see, and tags that specify formatting information about how the information will be presented for display purposes. You can use HTML to store, distribute, and present Office documents and data in a format that can be viewed by using most Web browsers while retaining the rich content and functionality of Office documents.

Note In Microsoft Excel 2007, the HTML file format does not save features that are specific to Excel. Additionally, the HTML format does not support or render all the features in Excel 2007 when you save a workbook as HTML.

For more information about how to edit HTML, visit the following Microsoft Web site:
http://msdn2.microsoft.com/en-us/library/aa730778(vs.71).aspx
For more information about how to work with code, HTML, and resource files, visit the following Microsoft Web site:
http://msdn2.microsoft.com/en-us/library/efc4xwkb(vs.71).aspx
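
When a workbook or a document has been saved as HTML, the visible data can be recovered with an ordinary HTML parser. The following Python sketch uses the standard library html.parser module to collect the text of table cells row by row. The class name TableText and the sample markup are made up for this example. Markup that Office actually produces is considerably more verbose, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableText(HTMLParser):
    """Collect the text of <td> and <th> cells, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # list of text fragments while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:          # cell that appears outside any row
                self.rows.append([])
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        # Formatting tags inside a cell are ignored; only their text is kept.
        if self._cell is not None:
            self._cell.append(data)


def table_rows(html_text):
    parser = TableText()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Hypothetical sample markup.
SAMPLE = """<html><body><table>
<tr><th>Item</th><th>Qty</th></tr>
<tr><td>Pens</td><td><b>12</b></td></tr>
</table></body></html>"""

if __name__ == "__main__":
    print(table_rows(SAMPLE))
```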

Royalty-Free File Format Programs

Microsoft Office Binary File Formats

Microsoft makes its .doc, .xls, .xlsb, and .ppt binary file format specifications available under a royalty-free covenant not to sue to anyone who wishes to implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation for analysis and for forensic reference purposes.

Microsoft Office Drawing File Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are also available under this program. The documentation that covers the binary file format specifications is cumulative and covers the most current form of the binary file formats as well as earlier versions.

Office Binary File Format specifications are available under the Open Specification Promise. To obtain documentation, visit the following Microsoft Web site:
http://www.microsoft.com/interop/docs/officebinaryformats.mspx

Properties

Article ID: 840817 - Last Review: February 26, 2008 - Revision: 8.1
APPLIES TO
  • Microsoft Office Excel 2007
  • Microsoft Office Excel 2003
  • Microsoft Excel 2002 Standard Edition
  • Microsoft Excel 2000 Standard Edition
  • Microsoft Excel 97 Standard Edition
  • Microsoft Office PowerPoint 2007
  • Microsoft Office PowerPoint 2003
  • Microsoft PowerPoint 2002 Standard Edition
  • Microsoft PowerPoint 2000 Standard Edition
  • Microsoft PowerPoint 97 Standard Edition
  • Microsoft Office Word 2007
  • Microsoft Office Word 2003
  • Microsoft Word 2002 Standard Edition
  • Microsoft Word 2000 Standard Edition
  • Microsoft Word 97 Standard Edition
Keywords: 
kbhowto kbexpertiseinter kbinfo KB840817
