Display Contents of Different File Formats (Word/Excel/PowerPoint/PDF/RTF) as HTML

This is a typical problem, which I also raised on Stack Overflow (http://stackoverflow.com/questions/11061929/php-extract-text-from-different-file-formats-word-excel-powerpoint-pdf-rtf#comment14475398_11061929). There seemed to be no single resource on the web that solves it, so now that I have solved it, I thought it would make sense to share an approach and a solution. It can be refined over time.

Problem: We have a web application that allows users to upload files to share with others; the permitted file types are limited. Just before downloading a file, however, the user needs to see a preview of its contents, which is where it gets tricky, since each file type is different.

Approach:

1. Identify the extension of each file

2. Display the text or HTML from each file using the appropriate library for the file type, since there is no unified library

The base class is https://gist.github.com/2941076, which requires the path to the file and the extension (the extension is already stored in the database, so I do not try to extract it from the file).
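To illustrate the two-step approach, here is a minimal sketch of a dispatcher that picks a handler by extension. It is not the code from the gist: the function name `previewFile`, the `$handlers` array and the plain-text handler are hypothetical and only show the shape of the idea.

```php
<?php
// Hypothetical dispatcher: maps a stored extension to a callable that
// turns a file path into previewable HTML.
function previewFile($path, $extension, array $handlers)
{
    // Normalise ".PDF", "pdf", "Pdf" and so on to a single key.
    $extension = strtolower(ltrim($extension, '.'));
    if (!isset($handlers[$extension])) {
        throw new InvalidArgumentException("No preview handler for '$extension'");
    }
    return call_user_func($handlers[$extension], $path);
}

// Placeholder handler for illustration only; real entries would wrap
// the library chosen for each file type (Word, Excel, PDF, ...).
$handlers = array(
    'txt' => function ($path) {
        return '<pre>' . htmlspecialchars(file_get_contents($path)) . '</pre>';
    },
);
```

Keeping the mapping in one array means a new file type only needs one new entry, and an unsupported type fails loudly instead of rendering nothing.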

The libraries used for the different types of files are the key to the solution:

1. MS Word Documents – LiveDocx Service within Zend Framework (http://www.phplivedocx.org/2009/08/13/convert-docx-doc-rtf-to-html-in-php/). The steps are:

- Create a Mail Merge Document using the word document as a template

- Connect to the LiveDocx service (here you need SOAP and SSL enabled on your local LAMP installation)

- Save the generated document, and render it as HTML
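The Word steps could look roughly like the sketch below. It assumes Zend Framework 1 is on the include path and that you have a LiveDocx account; the function name `wordToHtml` and the credential parameters are hypothetical. The dummy field assignment is something I believe the service expects even when the document has no merge fields, so treat it as an assumption.

```php
<?php
// Sketch only: needs Zend Framework 1 (Zend_Service_LiveDocx_MailMerge)
// plus the SOAP and SSL extensions mentioned above.
function wordToHtml($path, $username, $password)
{
    if (!class_exists('Zend_Service_LiveDocx_MailMerge')) {
        throw new RuntimeException('Zend_Service_LiveDocx_MailMerge is not loaded');
    }
    $mailMerge = new Zend_Service_LiveDocx_MailMerge();
    $mailMerge->setUsername($username)
              ->setPassword($password);

    // Use the uploaded Word document itself as the mail-merge template.
    $mailMerge->setLocalTemplate($path);
    // Assumption: a dummy assignment is required before creating the document.
    $mailMerge->assign('dummyFieldName', 'dummyFieldValue');

    // Generate the document on the service and fetch it rendered as HTML.
    $mailMerge->createDocument();
    return $mailMerge->retrieveDocument('html');
}
```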

2. MS Excel – PHPExcel from CodePlex, so simple it's scary (http://phpexcel.codeplex.com)

- Create a PHPExcel object to read the file

- Create an HTML writer to render the HTML

- Save the HTML to a file and read its contents
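A minimal sketch of the Excel steps, assuming the PHPExcel classes are already loaded (for example through its autoloader). The function name `excelToHtml` and the use of a temporary file are my own choices for illustration.

```php
<?php
// Sketch only: requires the PHPExcel library to be loaded beforehand.
function excelToHtml($path)
{
    if (!class_exists('PHPExcel_IOFactory')) {
        throw new RuntimeException('PHPExcel is not loaded');
    }
    // Let the factory choose a reader that matches the file.
    $workbook = PHPExcel_IOFactory::load($path);

    // The HTML writer renders the workbook as an HTML document.
    $writer = PHPExcel_IOFactory::createWriter($workbook, 'HTML');

    // Save to a temporary file, read it back, then clean up.
    $htmlFile = tempnam(sys_get_temp_dir(), 'xls');
    $writer->save($htmlFile);
    $html = file_get_contents($htmlFile);
    unlink($htmlFile);

    return $html;
}
```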

3. PDF text extract – http://pastebin.com/hRviHKp1 

- Create an instance of the PDF2Text class holding a reference to the PDF file

- Decode the PDF which extracts the text out of the file
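The PDF steps might be sketched as follows. It assumes the PDF2Text class from the pastebin link has been included and that it exposes `setFilename()`, `decodePDF()` and `output()`; check those method names against the copy of the class you actually use. The wrapper name `pdfToText` is hypothetical.

```php
<?php
// Sketch only: the PDF2Text class must be included before calling this.
function pdfToText($path)
{
    if (!class_exists('PDF2Text')) {
        throw new RuntimeException('PDF2Text is not loaded');
    }
    $pdf = new PDF2Text();
    $pdf->setFilename($path); // assumed method: point the instance at the PDF
    $pdf->decodePDF();        // assumed method: extract the text
    return $pdf->output();    // assumed method: return the decoded text
}
```

Since the result is plain text rather than HTML, it is worth passing it through htmlspecialchars() before putting it on a page.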

4. PowerPoint – work in progress, to be added later

More to follow as I work on the PowerPoint extraction.


5 responses to this post.

  1. How about reading the first few pages as images? I saw someone suggest a solution at http://valokuva.org/?p=7, though I haven't tested it out.


  2. Hi, I use Ghostscript (gs), which is open source and available for all platforms, to generate JPGs out of PDFs:

    JPEG_QUALITY=80

    gs -sDEVICE=jpeg -o "$outputFileName" -dJPEGQ=$JPEG_QUALITY -dFirstPage=1 -dLastPage=1 -r200x200 "$sourcePDF"

    Call that using exec, or however you think is best, and it will generate a JPG of the first page. Play around with the parameters to extract the whole document; there is a trick to naming the output files in sequential order. Google "pdf to jpg using ghostscript" and you'll find it.

    Thanks for the info on Word. Let me know if you ever find out how to do it in PowerPoint; I've failed.
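As a rough PHP sketch of calling the Ghostscript command from the comment above through exec: the function names are hypothetical, and it assumes the gs binary is on the PATH. The file names are passed through escapeshellarg() so that uploaded file names cannot inject shell commands.

```php
<?php
// Build the Ghostscript command line for rendering page 1 as a JPEG.
function buildFirstPageJpegCommand($sourcePdf, $outputFile, $jpegQuality = 80)
{
    return sprintf(
        'gs -sDEVICE=jpeg -o %s -dJPEGQ=%d -dFirstPage=1 -dLastPage=1 -r200x200 %s',
        escapeshellarg($outputFile),
        (int) $jpegQuality,
        escapeshellarg($sourcePdf)
    );
}

// Run the command; returns true when Ghostscript exits with status 0.
function renderFirstPageJpeg($sourcePdf, $outputFile, $jpegQuality = 80)
{
    exec(buildFirstPageJpegCommand($sourcePdf, $outputFile, $jpegQuality), $output, $exitCode);
    return $exitCode === 0;
}
```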


