Display Contents of Different File Formats (Word/Excel/PowerPoint/PDF/RTF) as HTML
This is a common problem, which I also raised on Stack Overflow (http://stackoverflow.com/questions/11061929/php-extract-text-from-different-file-formats-word-excel-powerpoint-pdf-rtf#comment14475398_11061929). There seemed to be no single resource on the web that solves it, so now that I have solved it, I thought it would make sense to share an approach and a solution. It can be refined over time.
Problem: We have a web application that allows users to upload files to share with others; the permitted file types are limited. However, just before downloading a file, the user needs to see a preview of its contents, which is where it gets tricky, since each file type is different.
Approach:
1. Identify the extension of each file
2. Display the text or HTML from each file using the appropriate library for its file type, since there is no unified library
The base class is https://gist.github.com/2941076 and it requires the path to the file and its extension (since the extension is already stored in the database, I do not try to extract it from the file).
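To illustrate the two-step approach, here is a minimal sketch of dispatching on the stored extension. It is not the code from the gist: the function name previewAsHtml, the $converters map, and the placeholder converters are all hypothetical, and in a real setup each placeholder would call the library for that file type.

```php
<?php
// Hypothetical sketch: pick a converter based on the extension stored in the database.
// $converters maps a lowercase extension to a callable that takes a file path
// and returns an HTML string.
function previewAsHtml($path, $extension, array $converters)
{
    // Normalise ".PDF", "PDF" and "pdf" to the same key.
    $ext = strtolower(ltrim($extension, '.'));

    if (!isset($converters[$ext])) {
        return '<p>No preview available for .' . htmlspecialchars($ext) . ' files.</p>';
    }

    return call_user_func($converters[$ext], $path);
}

// Placeholder converters (assumptions): each one stands in for a real library call.
$word = function ($path) {
    return '<p>Word preview of ' . htmlspecialchars(basename($path)) . '</p>';
};
$excel = function ($path) {
    return '<p>Excel preview of ' . htmlspecialchars(basename($path)) . '</p>';
};
$pdf = function ($path) {
    return '<p>PDF preview of ' . htmlspecialchars(basename($path)) . '</p>';
};

$converters = array(
    'doc'  => $word,
    'docx' => $word,
    'rtf'  => $word,
    'xls'  => $excel,
    'xlsx' => $excel,
    'pdf'  => $pdf,
);
```

Calling previewAsHtml('/uploads/report.xls', 'XLS', $converters) would then return whatever the Excel converter produces, and an extension with no entry in the map falls through to the "No preview available" message.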
The libraries used for the different types of files are the key to the solution:
1. MS Word documents - LiveDocx service within Zend Framework (http://www.phplivedocx.org/2009/08/13/convert-docx-doc-rtf-to-html-in-php/). The steps are:
- Create a Mail Merge document using the Word document as a template
- Connect to the LiveDocx service (here you need SOAP and SSL enabled on your local LAMP installation)
- Save the generated document, and render it as HTML
2. MS Excel - PHPExcel from CodePlex, so simple it's scary (http://phpexcel.codeplex.com)
- Create a PHPExcel object by reading the file
- Create an HTML writer to render the HTML
- Save the HTML to a file and read its contents
3. PDF text extraction - http://pastebin.com/hRviHKp1
- Create an instance of the PDF2Text class holding a reference to the PDF file
- Decode the PDF, which extracts the text from the file
4. PowerPoint - work in progress, to be added later
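The Excel steps in item 2 can be sketched roughly as follows. This assumes PHPExcel is already loaded (for example via require_once on its IOFactory.php, whose location depends on your installation); the function name excelToHtml and the use of a temporary file are my own choices for illustration.

```php
<?php
// Sketch only. Assumption: PHPExcel has been loaded beforehand, e.g.
// require_once 'PHPExcel/IOFactory.php'; (adjust the path to your installation).
function excelToHtml($xlsPath)
{
    if (!class_exists('PHPExcel_IOFactory')) {
        throw new RuntimeException('PHPExcel is not loaded');
    }

    // Step 1: read the spreadsheet into a PHPExcel object.
    $workbook = PHPExcel_IOFactory::load($xlsPath);

    // Step 2: create an HTML writer for that object.
    $writer = PHPExcel_IOFactory::createWriter($workbook, 'HTML');

    // Step 3: save the HTML to a temporary file and read its contents back.
    $htmlPath = tempnam(sys_get_temp_dir(), 'xls');
    $writer->save($htmlPath);
    $html = file_get_contents($htmlPath);
    unlink($htmlPath);

    return $html;
}
```

The returned string is a complete HTML page, so for an inline preview you may want to show it in an iframe or strip it down to the table markup.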
More to follow as I work on the PowerPoint extraction.