Fri, Oct 23, 2009
Digital Life, The Straits Times
Taxing texts

PULLING text from e-mail messages and PDF files can be as pleasant as a root canal treatment.

Here is a survival guide to text document formats and how to edit the copy.

File formats

PDF (portable document format)

From enticing product brochures to arty annual reports, this format, introduced by software giant Adobe in 1993, preserves the design of a document, from fonts to layout.

So the document looks the same whether it is opened on a Mac, Linux PC or even a cellphone. The contents do not run willy-nilly across the page.

As the format is meant to protect a document from unauthorised changes, tweaking a PDF file is difficult, even with Adobe Acrobat, the program that allows you to create PDF files.

HTML (HyperText Markup Language)

HTML is the bedrock format for Web pages. In its basic form, its contents are easy to extract. But when a page is spiced up with animation such as Adobe Flash objects, or with frames (multiple windows in one browser view), the task gets harder.

DOC (Microsoft Word document)

This is the format of the most widely used word processing software. Word also produces large files.

Files from the latest version, Word 2007, are saved in a different format: docx.

Earlier Word versions need help to read and amend docx files. To convert them to .doc files, search for the Microsoft Office Compatibility Pack at microsoft.com.

TXT (text)

This is the universal plain-vanilla text format that is sans formatting and gives the smallest files. Notepad, the bare-bones text editor that comes with Windows, saves its files in txt format.

So can Word or any word processing software. Just choose Save As and TXT as the format.

ODT (open document text)

OpenOffice, the free office application suite from Sun Microsystems, saves its text files in this format by default.

While OpenOffice can read and save its documents as Word files, the reverse does not apply.

RTF (rich text format)

This is Microsoft's universal text format. Unlike plain text, it keeps formatting, which lets you, among other things, change fonts, apply bold or set the spacing between lines. RTF files can be read by most word processors.

However, compared to Word files, RTF files can be bloated in size - particularly those laden with pictures.

Editing text

Should there be a PDF file or a webpage that you want to extract some text from, here are some suggestions.

Pulling text from PDF files: Check if the file's contents can be copied.

Do not waste your time if the document has been intentionally crippled by its author. Usually the word "SECURED" will appear after the file name on the top of the Adobe Reader window.

Here is a simple test to confirm whether the contents of a PDF file can be copied: click and drag the cursor over a few words in the file. Go to Edit on the menu bar.

Look at the Copy function. If it is greyed out, you will not be able to copy and paste a single word from the file.

If copying is allowed, attach the file in an e-mail and send it to pdf2html@adobe.com. Within minutes, you will get an e-mail message from Adobe with a zipped HTML file.

Unzip the attachment (get a free zip app from 7-zip.org if you need one) and open the HTML file with your browser.

Select all the text by pressing Ctrl and A. Then press Ctrl and C to copy what you have selected.

Fire up your word program, go to Edit on the menu bar, select Paste Special and choose Unformatted Text.

The words will be pasted into your document in unadorned text, perfect for editing.

Mining HTML pages: First, realise that if you cannot click and select the text on a webpage, for instance because it sits inside an animation, you cannot copy it.

If there is a printable or print version available, choose it.

Hit Ctrl A to select all, then copy and paste as plain text as above.

The same steps apply if you want to copy the contents of an e-mail message.
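
For readers comfortable with a little scripting, the same select-copy-paste-as-plain-text routine can be roughed out in Python. This is only a sketch, not a method from the tools named in this article: the names TextExtractor and html_to_text are made up for illustration, and it uses the html.parser module that ships with Python. It keeps the words, drops the tags and skips script and style blocks. Note that text inside inline tags such as bold ends up on its own line.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the readable text of a page, ignoring script and style blocks."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []        # pieces of text found so far
        self._skip_depth = 0   # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text of an HTML string, one text fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


if __name__ == "__main__":
    page = "<html><body><h1>Taxing texts</h1><p>Plain words only.</p></body></html>"
    print(html_to_text(page))
```

As with copying by hand, this only works on text that is actually in the page. Words locked inside an animation or an image will not come out.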

Cleaning up: Long threads of e-mail messages come with the dreaded >>> indentations.

Hitting Delete to clear every single sign manually is a real pain. A program called eClean 2000 will do the job for you. Try it for free but pay US$10 (S$14) for the shareware if you want to continue using it. Get it at: jd-software.com/eClean2000/download.html.

Once downloaded and installed, it sits in the system tray at the bottom right corner of the screen.

After you copy text from an e-mail message, simply right-click the "e" icon. Choose Clean Clipboard Contents and click the OK button.

The program removes all the pesky signs from that long e-mail message.

When you paste the results into, say, a Word document, you get clean text.
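
If you would rather not install anything, the basic clean-up can be approximated with a few lines of Python. This is a minimal sketch and not how eClean 2000 itself works: the function name strip_quote_markers is made up for illustration, and it only removes the > signs (and the spaces around them) at the start of each line. It does not rejoin broken lines or tidy anything else.

```python
import re

# One or more ">" signs at the start of a line, allowing spaces or tabs
# before, between and after them (for example ">>> " or "> > ").
QUOTE_PREFIX = re.compile(r"^[ \t]*(?:>[ \t]*)+")


def strip_quote_markers(text: str) -> str:
    """Remove leading '>' reply markers from every line of an e-mail thread."""
    cleaned = [QUOTE_PREFIX.sub("", line) for line in text.splitlines()]
    return "\n".join(cleaned)


if __name__ == "__main__":
    thread = (
        ">>> Are we meeting on Friday?\n"
        ">> Yes, at 3pm.\n"
        "> See you then.\n"
        "Great!"
    )
    print(strip_quote_markers(thread))
```

A > sign in the middle of a line is left alone, since only signs at the start of a line are treated as reply markers.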

cytan@sph.com.sg

This story was first published in The Straits Times Digital Life.

