Stripping the tags from an HTML file

Recently, I was moving my webpages created through FrontPage 2003 (please stop snickering -it was a great tool for sites with flat web files) to WordPress.

This required clearing of formatting of the HTML file and one could use http://StripHTML.com for doing so – “StripHTML.com gives you a quick, easy and satisfying way to transform your ugly formatted and/or HTMLified text into a clean and pretty text for you to enjoy.”  However, most of us want to preserve the URL links, titles and the paragraph tags.

This VBA program that can be used within Excel takes an input text file, reads it as a single string, removes all tags except the URLs, paragraphs and titles, and writes to a new text file.

The VBA program below has the input file at “C:\Users\Tranq\Documents\abc\inputhtml.txt”, and outputs to “C:\Users\Tranq\Documents\abc\outputhtml.txt” and you can change these locations in the TextFile_PullData and TextFile_Create functions, respectively.  You can modify the program to read several files in a directory and make it even more efficient.

How to Use: Just open an Excel file. Press Alt+F11, and open a new module. Cut and paste what is given below, or download the text file.  Go to a cell in the excel file and enter =CleanTags(A1) in cell B1.  Make sure you have the input file at C:\Users\Tranq\Documents\abc\inputhtml.txt”.   The location can be modified.

**********************************************

Function CleanTags(HTML As String) As String
‘PURPOSE: Clean HTML tags except the paragraph, weblinks and title
‘ You can comment out if you want these to supress these as well
‘ The instructions are given at the spots.
‘ SOURCE: Unknown and AutarKaw.org
Dim result As String, StripIt As Boolean, c As String, i As Long
HTML = TextFile_PullData()

‘StripIt is used to figure out to include or exclude in output
StripIt = False

‘Looking at each character in the HTML file
For i = 1 To Len(HTML)
‘c is each character
c = Mid(HTML, i, 1)
‘Some conditions to take care for end of input file
If i <= Len(HTML) – 1 Then
‘d is last two characters of file to capture <a and <p
‘Just in case
d = Mid(HTML, i, 2)
Else
d = “”
End If
If i <= Len(HTML) – 3 Then
‘e is last four characters of file to capture </a> and </p>
e = Mid(HTML, i, 4)
Else
e = “”
End If

‘Checking for < character that begins an HTML tag
If c = “<” Then StripIt = True

‘Comment this if you want to strip paragraphs
If d = “<p” Then StripIt = False
If e = “</p>” Then StripIt = False

‘Comment this if you want to strip URL tags and title tags as well.
If d = “<a” Then StripIt = False
If e = “</a>” Then StripIt = False

‘Adds to output or skips it
If StripIt = False Then result = result & c
‘Taking care of closing tag to change the StripIt Boolean
If c = “>” Then StripIt = False
Next i
CleanTags = result

‘Putting the output in a new file
abc = TextFile_Create(result)
‘Run the program by entering =CleanTags(A1) in a blank excel file
‘where you have this module. Puts Done in cell if it runs correctly
CleanTags = “Done”
‘This lets you know the work is done. Comment if you like.
‘MsgBox (“Done”)
End Function

Function TextFile_PullData()
‘PURPOSE: Send All Data From Text File To A String Variable
‘SOURCE: http://www.TheSpreadsheetGuru.com

Dim TextFile As Integer
Dim FilePath As String
Dim FileContent As String

‘File Path of Text File
FilePath = “C:\Users\Tranq\Documents\abc\inputhtml.txt”

‘Determine the next file number available for use by the FileOpen function
TextFile = FreeFile

‘Open the text file
Open FilePath For Input As TextFile

‘Store file content inside a variable
FileContent = Input(LOF(TextFile), TextFile)

‘Report Out Text File Contents
‘MsgBox FileContent

‘Close Text File
Close TextFile
TextFile_PullData = FileContent

End Function

Function TextFile_Create(HTML As String)
‘PURPOSE: Create A New Text File
‘SOURCE: http://www.TheSpreadsheetGuru.com

Dim TextFile As Integer
Dim FilePath As String

‘What is the file path and name for the new text file?
FilePath = “C:\Users\Tranq\Documents\abc\outputhtml.txt”

‘Determine the next file number available for use by the FileOpen function
TextFile = FreeFile

‘Open the text file
Open FilePath For Output As TextFile

‘Write some lines of text
Print #TextFile, HTML

‘Save & Close Text File
Close TextFile

End Function


This post is brought to you by

Open Education Resource Repository Links

  • Open education resources (OER) or open course wares (OCW) are everywhere you look.  Many have spent valuable time to make repositories of these courses where one can find links to several courses and these are everywhere as well.  So, when I was answering a recent survey on OERs, they asked users which repositories they had used.  I thought it would be good to have links to all the mentioned repositories, and they are given below.

And if you are wondering what an open education resource is, here is how Open Education Consortium defines it. “An OpenCourseWare (OCW) is a free and open digital publication of high quality college and university‐level educational materials. These materials are organized as courses, and often include course planning materials and evaluation tools as well as thematic content. OpenCourseWare are free and openly licensed, accessible to anyone, anytime via the internet.” Acknowledgement:  I would like to thank USF student Brain G for finding and entering the links.


This post is brought to you by

WordPress does not update image

Till recently I used to place images used on my blog at my own website. I would do this as I was using the free WordPress.com site where storage was limited. Well, I found out that if I update the image on my personal website, WordPress will take an unknown time to update it at the blog.

Till recently I used to place images within my WordPress blog at my own website.  I would do this as I was using the free WordPress.com site where storage is limited.  Well, I found out that if I update the image on my personal website, WordPress will take an unknown time to update it at the blog.

As of this date, October 21, 2016, 2:50PM EDT, the two images are different (the image is updated now as of October 30, 2016).  It is because I updated the image on my website but it did not get updated on the wordpress.com site.  I have cleared my cache, used different computers and browsers in the house that do not belong to me, used my desktop computer at school, but the result is the same – the two images are still different after a week of updating.

WordPress support has been helpful and ultimately says that it is an image cacheing problem and there is no set time for it to be cleared.  I have resolved the problem by avoiding it.  I now upload images to the WordPress site as I started using the paid premium site.

What do you think?

_________________________________________________________

This post is brought to you by