This is a common problem, which I also raised on Stack Overflow (http://stackoverflow.com/questions/11061929/php-extract-text-from-different-file-formats-word-excel-powerpoint-pdf-rtf#comment14475398_11061929). There seemed to be no single resource on the web that solves it, so having solved it myself I thought it would make sense to share an approach and a solution. It can be refined over time.
Problem: We have a web application that lets users upload files to share with others; the allowed file types are limited. However, just before downloading a file, the user needs to see a preview of its contents, which is where it becomes tricky, since each file type is different.
Approach:
1. Identify the extension of each file
2. Display the text or HTML from each file using the appropriate library for that file type, since there is no unified library
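The two steps above can be sketched as a small dispatcher. This is only a minimal illustration, not the code in the base class: the function name previewFile and the $handlers map are hypothetical, and the single plain-text handler is a placeholder standing in for the real library calls.

```php
<?php
/**
 * Return preview HTML for a file, choosing a handler by its stored extension.
 * $handlers maps a lowercase extension to a callable that takes the file path.
 * (previewFile and $handlers are hypothetical names used for illustration.)
 */
function previewFile(string $path, string $extension, array $handlers): string
{
    // Normalise ".TXT" or "txt" to "txt" before looking up the handler.
    $key = strtolower(ltrim($extension, '.'));
    if (!isset($handlers[$key])) {
        throw new InvalidArgumentException("No preview handler for extension: $key");
    }
    return call_user_func($handlers[$key], $path);
}

// Placeholder handler: escapes plain text and wraps it in <pre>.
// Handlers for doc, xls, pdf and so on would wrap the matching library.
$handlers = [
    'txt' => function (string $path): string {
        return '<pre>' . htmlspecialchars((string) file_get_contents($path)) . '</pre>';
    },
];
```

Adding support for a new file type then only means registering one more entry in the map, and an unsupported extension fails loudly instead of producing an empty preview.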
The base class is https://gist.github.com/2941076, which requires the path to the file and its extension (since the extension is already stored in the database, I do not try to extract it from the file).
The libraries used for the different types of files are the key to the solution:
1. MS Word documents – LiveDocx service within Zend Framework (http://www.phplivedocx.org/2009/08/13/convert-docx-doc-rtf-to-html-in-php/). The steps are:
– Create a Mail Merge Document using the word document as a template
– Connect to the LiveDocx service (here you need SOAP and SSL enabled on your local LAMP installation)
– Save the generated document, and render it as HTML
2. MS Excel – PHPExcel from CodePlex, so simple it's scary (http://phpexcel.codeplex.com)
– Create a PHPExcel object to read the file
– Create an HTML writer to render the HTML
– Save the HTML to a file and read its contents
3. PDF text extract – http://pastebin.com/hRviHKp1
– Create an instance of the PDF2Text class holding a reference to the PDF file
– Decode the PDF, which extracts the text out of the file
4. PowerPoint – work in progress, to be added later
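As an illustration of the Excel steps in item 2, here is a minimal sketch. It assumes the PHPExcel library is already loaded (for example via require_once 'PHPExcel/IOFactory.php'); the function name excelToHtml is hypothetical and this is not necessarily how the base class does it.

```php
<?php
/**
 * Convert a spreadsheet to an HTML string using PHPExcel.
 * Assumption: PHPExcel has been loaded before this function is called.
 * (excelToHtml is a hypothetical name used for illustration.)
 */
function excelToHtml(string $path): string
{
    if (!class_exists('PHPExcel_IOFactory')) {
        throw new RuntimeException('PHPExcel is not loaded');
    }

    // Step 1: let PHPExcel pick a suitable reader and load the workbook.
    $workbook = PHPExcel_IOFactory::load($path);

    // Step 2: create an HTML writer for the loaded workbook.
    $writer = PHPExcel_IOFactory::createWriter($workbook, 'HTML');

    // Step 3: save the HTML to a scratch file and read its contents back.
    $tmp = tempnam(sys_get_temp_dir(), 'xls');
    try {
        $writer->save($tmp);
        return (string) file_get_contents($tmp);
    } finally {
        unlink($tmp); // always remove the scratch file
    }
}
```

As far as I know, the HTML writer renders only the first worksheet by default, which is usually enough for a preview.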
More to come as I work on the PowerPoint extraction.
Posted by Herman on June 18, 2012 at 17:13
How about reading the first few pages as images? I saw someone suggest a solution at http://valokuva.org/?p=7, though I haven't tested it out.
Posted by ssmusoke on June 19, 2012 at 08:22
Thanks, I will look into that. Currently I am just extracting the text from the PDF and displaying it.
Posted by gandazgul on June 28, 2012 at 05:11
Hi, I use Ghostscript (gs), which is open source and available for all platforms, to generate JPGs out of PDFs:
JPEG_QUALITY=80
gs -sDEVICE=jpeg -o "$outputFileName" -dJPEGQ=$JPEG_QUALITY -dFirstPage=1 -dLastPage=1 -r200x200 "$sourcePDF"
Call that using exec or however you think is best and it will generate a JPG of the first page. Play around with the parameters to extract the whole thing; there is a trick to naming the output files in sequential order: google "pdf to jpg using ghostscript" and you’ll find it.
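For anyone calling the Ghostscript command above from PHP, a rough sketch could look like the following. The function names buildGsCommand and renderPdfPage are made up for illustration, and it assumes gs is installed and on the PATH of the web server user.

```php
<?php
/**
 * Build the Ghostscript command that renders one page of a PDF as a JPEG.
 * (buildGsCommand is a hypothetical helper, not part of any library.)
 */
function buildGsCommand(string $sourcePdf, string $outputFile, int $page = 1, int $quality = 80): string
{
    // escapeshellarg() protects against spaces and shell metacharacters in paths.
    return sprintf(
        'gs -sDEVICE=jpeg -o %s -dJPEGQ=%d -dFirstPage=%d -dLastPage=%d -r200x200 %s',
        escapeshellarg($outputFile),
        $quality,
        $page,
        $page,
        escapeshellarg($sourcePdf)
    );
}

/**
 * Run the command; returns true when gs exits with status 0.
 * Assumption: the gs binary is available on the PATH.
 */
function renderPdfPage(string $sourcePdf, string $outputFile, int $page = 1): bool
{
    exec(buildGsCommand($sourcePdf, $outputFile, $page), $output, $status);
    return $status === 0;
}
```

Keeping the command builder separate from the exec call makes it easy to log or inspect the exact command when a conversion fails.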
Thanks for the info on Word. Let me know if you ever find out how to do it in PowerPoint; I’ve failed.
Posted by ssmusoke on June 28, 2012 at 08:13
Thanks, I will look into that and will update if I get a lead on PowerPoint.
Posted by lalbahadur on November 7, 2014 at 08:43
I want to read a Korean-language PPT file in PHP. Can anyone help me solve this problem?