Stripping the tags from an HTML file


Recently, I was moving my webpages created through FrontPage 2003 (please stop snickering -it was a great tool for sites with flat web files) to WordPress.

This required clearing of formatting of the HTML file and one could use http://StripHTML.com for doing so – “StripHTML.com gives you a quick, easy and satisfying way to transform your ugly formatted and/or HTMLified text into a clean and pretty text for you to enjoy.”  However, most of us want to preserve the URL links, titles and the paragraph tags.

This VBA program that can be used within Excel takes an input text file, reads it as a single string, removes all tags except the URLs, paragraphs and titles, and writes to a new text file.

The VBA program below has the input file at “C:\Users\Tranq\Documents\abc\inputhtml.txt”, and outputs to “C:\Users\Tranq\Documents\abc\outputhtml.txt” and you can change these locations in the TextFile_PullData and TextFile_Create functions, respectively.  You can modify the program to read several files in a directory and make it even more efficient.

How to Use: Just open an Excel file. Press Alt+F11, and open a new module. Cut and paste what is given below, or download the text file.  Go to a cell in the excel file and enter =CleanTags(A1) in cell B1.  Make sure you have the input file at C:\Users\Tranq\Documents\abc\inputhtml.txt”.   The location can be modified.

**********************************************

Function CleanTags(HTML As String) As String
‘PURPOSE: Clean HTML tags except the paragraph, weblinks and title
‘ You can comment out if you want these to supress these as well
‘ The instructions are given at the spots.
‘ SOURCE: Unknown and AutarKaw.org
Dim result As String, StripIt As Boolean, c As String, i As Long
HTML = TextFile_PullData()

‘StripIt is used to figure out to include or exclude in output
StripIt = False

‘Looking at each character in the HTML file
For i = 1 To Len(HTML)
‘c is each character
c = Mid(HTML, i, 1)
‘Some conditions to take care for end of input file
If i <= Len(HTML) – 1 Then
‘d is last two characters of file to capture <a and <p
‘Just in case
d = Mid(HTML, i, 2)
Else
d = “”
End If
If i <= Len(HTML) – 3 Then
‘e is last four characters of file to capture </a> and </p>
e = Mid(HTML, i, 4)
Else
e = “”
End If

‘Checking for < character that begins an HTML tag
If c = “<” Then StripIt = True

‘Comment this if you want to strip paragraphs
If d = “<p” Then StripIt = False
If e = “</p>” Then StripIt = False

‘Comment this if you want to strip URL tags and title tags as well.
If d = “<a” Then StripIt = False
If e = “</a>” Then StripIt = False

‘Adds to output or skips it
If StripIt = False Then result = result & c
‘Taking care of closing tag to change the StripIt Boolean
If c = “>” Then StripIt = False
Next i
CleanTags = result

‘Putting the output in a new file
abc = TextFile_Create(result)
‘Run the program by entering =CleanTags(A1) in a blank excel file
‘where you have this module. Puts Done in cell if it runs correctly
CleanTags = “Done”
‘This lets you know the work is done. Comment if you like.
‘MsgBox (“Done”)
End Function

Function TextFile_PullData()
‘PURPOSE: Send All Data From Text File To A String Variable
‘SOURCE: http://www.TheSpreadsheetGuru.com

Dim TextFile As Integer
Dim FilePath As String
Dim FileContent As String

‘File Path of Text File
FilePath = “C:\Users\Tranq\Documents\abc\inputhtml.txt”

‘Determine the next file number available for use by the FileOpen function
TextFile = FreeFile

‘Open the text file
Open FilePath For Input As TextFile

‘Store file content inside a variable
FileContent = Input(LOF(TextFile), TextFile)

‘Report Out Text File Contents
‘MsgBox FileContent

‘Close Text File
Close TextFile
TextFile_PullData = FileContent

End Function

Function TextFile_Create(HTML As String)
‘PURPOSE: Create A New Text File
‘SOURCE: http://www.TheSpreadsheetGuru.com

Dim TextFile As Integer
Dim FilePath As String

‘What is the file path and name for the new text file?
FilePath = “C:\Users\Tranq\Documents\abc\outputhtml.txt”

‘Determine the next file number available for use by the FileOpen function
TextFile = FreeFile

‘Open the text file
Open FilePath For Output As TextFile

‘Write some lines of text
Print #TextFile, HTML

‘Save & Close Text File
Close TextFile

End Function


This post is brought to you by

Advertisements

Author: Autar Kaw

Autar Kaw (http://autarkaw.com) is a Professor of Mechanical Engineering at the University of South Florida. He has been at USF since 1987, the same year in which he received his Ph. D. in Engineering Mechanics from Clemson University. He is a recipient of the 2012 U.S. Professor of the Year Award. With major funding from NSF, he is the principal and managing contributor in developing the multiple award-winning online open courseware for an undergraduate course in Numerical Methods. The OpenCourseWare (nm.MathForCollege.com) annually receives 1,000,000+ page views, 1,000,000+ views of the YouTube audiovisual lectures, and 150,000+ page views at the NumericalMethodsGuy blog. His current research interests include engineering education research methods, adaptive learning, open courseware, massive open online courses, flipped classrooms, and learning strategies. He has written four textbooks and 80 refereed technical papers, and his opinion editorials have appeared in the St. Petersburg Times and Tampa Tribune.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s