This is a common problem that I also raised on Stack Overflow (http://stackoverflow.com/questions/11061929/php-extract-text-from-different-file-formats-word-excel-powerpoint-pdf-rtf#comment14475398_11061929). There seemed to be no single resource on the web that solves it, so having solved it myself, I thought it would make sense to share my approach and solution. It can be refined over time.
Problem: We have a web application that lets users upload files to share with others; the allowed file types are limited. Before downloading a file, however, the user needs to see a preview of its contents, which is where it gets tricky, since each file type has to be handled differently.
Approach:
1. Identify the extension of each file
2. Display the text or HTML from each file using the appropriate library for its file type, since there is no unified library
The base class is at https://gist.github.com/2941076 and requires the path to the file and the extension (since the extension is already stored in the database, I do not try to extract it from the file).
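To illustrate the dispatch step, here is a minimal sketch that maps a stored extension to the library that handles it. The function name previewStrategyFor and the strategy labels are made up for this example and are not taken from the gist; the mapping itself just follows the libraries listed in this post.

```php
<?php
// Hypothetical helper: choose a preview strategy from the stored extension.
// The labels are illustrative; they name the libraries discussed in this post.
function previewStrategyFor(string $extension): ?string
{
    $map = [
        'doc'  => 'livedocx',  // Word documents go through LiveDocx
        'docx' => 'livedocx',
        'rtf'  => 'livedocx',
        'xls'  => 'phpexcel',  // Excel workbooks go through PHPExcel
        'xlsx' => 'phpexcel',
        'pdf'  => 'pdf2text',  // PDFs go through the PDF2Text class
    ];

    // Normalise input such as ".PDF" or " docx " before the lookup.
    $key = strtolower(ltrim(trim($extension), '.'));

    // Unsupported types (PowerPoint, for now) return null.
    return $map[$key] ?? null;
}
```

A caller would then branch on the returned label and show a "no preview available" message when it gets null.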
The libraries used for the different types of files are the key to the solution:
1. MS Word Documents – LiveDocx Service within Zend Framework (http://www.phplivedocx.org/2009/08/13/convert-docx-doc-rtf-to-html-in-php/). The steps are:
– Create a Mail Merge Document using the word document as a template
– Connect to the LiveDocx service (here you need SOAP and SSL enabled on your local LAMP installation)
– Save the generated document, and render it as HTML
2. MS Excel – PHPExcel from Codeplex, so simple it's scary (http://phpexcel.codeplex.com)
– Create a PHPExcel object to read the file
– Create an HTML writer to render the HTML
– Save the HTML to a file and read its contents
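The three Excel steps can be sketched roughly as follows. This is not the code from the gist: the function name excelToHtml is hypothetical, and the sketch assumes the PHPExcel classes have already been loaded (for example by including PHPExcel's IOFactory.php from wherever it lives in your install).

```php
<?php
// Sketch only: assumes the PHPExcel library has already been included.
// excelToHtml is a hypothetical name chosen for this example.
function excelToHtml(string $path): string
{
    if (!is_readable($path)) {
        throw new InvalidArgumentException("Cannot read file: $path");
    }
    if (!class_exists('PHPExcel_IOFactory')) {
        throw new RuntimeException('PHPExcel is not loaded');
    }

    // Step 1: let PHPExcel detect the format and read the workbook.
    $workbook = PHPExcel_IOFactory::load($path);

    // Step 2: create an HTML writer for the workbook.
    $writer = PHPExcel_IOFactory::createWriter($workbook, 'HTML');

    // Step 3: save the HTML to a temporary file and read it back.
    $tmp = tempnam(sys_get_temp_dir(), 'xls');
    $writer->save($tmp);
    $html = file_get_contents($tmp);
    unlink($tmp);

    if ($html === false) {
        throw new RuntimeException('Could not read the generated HTML');
    }
    return $html;
}
```

The returned string can be echoed into the preview page as it is, or trimmed first if only part of the sheet should be shown.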
3. PDF text extract – http://pastebin.com/hRviHKp1
– Create an instance of the PDF2Text class holding a reference to the PDF file
– Decode the PDF, which extracts the text from the file
4. PowerPoint – work in progress, to be added later
More to come as I work on the PowerPoint extraction.
How about rendering the first few pages as images? I saw someone suggest a solution at http://valokuva.org/?p=7, though I haven't tested it.
Thanks, I will look into that. Currently I am just extracting the text from the PDF and displaying it.
Hi, I use Ghostscript (gs), which is open source and available for all platforms, to generate JPGs from PDFs:
JPEG_QUALITY=80
gs -sDEVICE=jpeg -o "$outputFileName" -dJPEGQ=$JPEG_QUALITY -dFirstPage=1 -dLastPage=1 -r200x200 "$sourcePDF"
Call that using exec, or however you think is best, and it will generate a JPG of the first page. Play around with the parameters to extract the whole document; there is a trick for naming the output files in sequential order. Google "pdf to jpg using ghostscript" and you’ll find it.
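For calling this from PHP, a small helper that builds the command with escaped arguments might look like the sketch below. The function name buildGsCommand and its parameters are hypothetical. The sequential naming works because Ghostscript substitutes a printf-style placeholder such as %03d in the output file name with the page number.

```php
<?php
// Hypothetical helper: build a Ghostscript command line for PDF-to-JPEG conversion.
// Note: on Windows, escapeshellarg() replaces % characters, so the %03d
// pattern would need different handling there.
function buildGsCommand(
    string $sourcePdf,
    string $outputPattern,
    int $firstPage = 1,
    int $lastPage = 1,
    int $quality = 80
): string {
    return sprintf(
        'gs -sDEVICE=jpeg -o %s -dJPEGQ=%d -dFirstPage=%d -dLastPage=%d -r200x200 %s',
        escapeshellarg($outputPattern), // e.g. page-%03d.jpg for numbered output files
        $quality,
        $firstPage,
        $lastPage,
        escapeshellarg($sourcePdf)
    );
}
```

The resulting string can be handed to exec() along with an output array and a status variable, for example exec(buildGsCommand('in.pdf', 'page-%03d.jpg', 1, 3), $out, $status), after which a non-zero $status indicates that Ghostscript failed.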
Thanks for the info on Word. Let me know if you ever find out how to do it for PowerPoint; I’ve failed.
Thanks, I will look into that and will post an update if I get a lead on PowerPoint.
I want to read a Korean-language PPT file in PHP. Can anyone help me solve this problem?