Java HTML Tidy
Updated 3 Jun 2000
This is a Java version of HTML Tidy Release 30 Apr 2000.
Copyright © 2000 W3C; see Tidy.java for the copyright notice.
Thanks to the people at Docuverse for assistance with DOM support
for Java HTML Tidy. Thanks to Susan Levine for assistance with DOM
support and bug fixes.
I have made available:
To use the Tidy Java Bean, just include JTidy\lib\Tidy.jar in
your classpath.
To build Tidy from the source, you need a Java compiler/runtime
environment supporting Java 1.1 or higher. First, download and
expand the archive. For Win 9x/NT, build it using the batch file
JTidy\make\build.bat as follows:
    cd JTidy\make
    build c: 30apr2000
where c: is the root where you expanded the JTidy archive, and
30apr2000 is the directory under JTidy\src where the source is located.
NOTE: build.bat assumes that the environment variable java_home
points to your JDK installation, and that the JDK tools are in
your path.
For Unix environments, either Cygwin or true Unix, use the makefile
in JTidy\make.
The main class is: org.w3c.tidy.Tidy
Java Tidy Support
The public email list devoted to HTML Tidy is: <html-tidy@w3.org>. To
subscribe, send an email to <html-tidy-request@w3.org> with the word
subscribe in the subject line (use the word unsubscribe instead if you
want to unsubscribe). The archive for this list is accessible online.
Please use this list to report errors or enhancement requests for
Java Tidy.
Note that I am responsible for the maintenance of Java Tidy,
not Dave Raggett. I will endeavour to monitor the HTML Tidy
list and maintain Java Tidy as best I can, given that I am not
financially supported for this work (and that is not my intention),
and so I cannot devote much of my time to it. It is my hope that
people interested in the ongoing development of Java Tidy will
contribute by fixing bugs and providing source code.
My email address is: Andy Quick.
What's New
- 3 Jun 2000 - Updated Java Tidy to match HTML Tidy Release 30 Apr 2000.
Added a makefile for UNIX.
- 22 Apr 2000 - Updated this page to clarify the situation with
support of Java Tidy.
- 20 Apr 2000 - Fixed ArrayIndexOutOfBoundsException that occurred
when pretty-printing a large document.
- 13 Apr 2000 - Fixed bugs in Node.clone() and Node.insertNodeAtStart.
- 11 Apr 2000 - Fixed bug in DOMElementImpl.getAttribute(String).
Fixed bugs in cloning Node objects. Added properties ParseErrors and
ParseWarnings and method createEmptyDocument to the Tidy Bean.
- 4 Mar 2000 - Implemented more DOM functionality. The DOM tree
can now be modified by inserting new nodes, or deleting/replacing existing
nodes. Attributes and their values can now be set on elements.
- 12 Feb 2000 - Fixed bug in Lexer.addGenerator(). Got rid of
references to Lexer.lexbuf in class PPrint and other classes, and used
node.textarray instead. This is in preparation for more DOM
implementation.
- 22 Jan 2000 - Updated Java Tidy to match HTML Tidy Release 13 Jan 2000.
Included CM_HEAD fix from Dave Raggett (posted to the HTML Tidy
mailing list) for the object entry in TagTable. Fixed cause of
NullPointerExceptions in Node.insertNodeAfterElement.
- 29 Dec 1999 - More DOM support.
- 7 Dec 1999 - Fixed bug in Lexer.getToken.
- 6 Dec 1999 - Updated Java Tidy to match HTML Tidy Release 30 Nov 1999.
- 16 Nov 1999 - Minor bug fix with UTF8 encoding string. Added
makefile.
- 7 Nov 1999 - Changed Lexer.lexbuf to type byte[]. Since Tidy stores
lexbuf as UTF-8 encoded bytes, conversions of byte sequences from
lexbuf to Strings need to take the UTF-8 encoding into account. It was
also a waste of memory to represent it as a char[]. Thanks to Mark
Diekhans for contributing this change.
- 1 Nov 1999 - Updated Java Tidy to match HTML Tidy Release 22 Oct 1999.
This seems to fix some severe bugs (such as infinite loops) that were
present in the 27 Sep 1999 update. However, my testing has revealed
that there are still some severe bugs, so I have included the
4sep1999dom source tree with this release in case you want to stay
at a stable release until the severe bugs are fixed.
- 23 Oct 1999 - Moved TidyMessages.properties to the org.w3c.tidy
package. Propagated MissingResourceException in the static initializer
of Report as an Error, since it represents a severe error.
- 23 Oct 1999 - Updated Java Tidy to match HTML Tidy Release 27 Sep 1999.
The following new features of C HTML Tidy are NOT supported by Java Tidy:
(1) the "keep-time" option for preserving file times, and (2) the new
command-line option parsing that supports parsing options prefixed by "--"
in the same way as parsing from the configuration file. The reason for
(1) is that the core Java API doesn't support altering file
modification times. The
reason for (2) is that I implemented the configuration file as a
properties file, and as such the option parsing code cannot be re-used
for command-line option parsing. This means I need an independent
method to parse options from the command line. It's on my list.
- 2 Oct 1999 - Added limited DOM support. Basically, all you
can do right now is read elements and attributes of the parse tree. You
cannot modify the parse tree in any way. I have made fields of
org.w3c.tidy.Node protected, as I would like to phase out
external use of this class, and phase in DOM-style access to the parse
tree. Below is a code example of DOM-style parse tree traversal
and printing. Note that Java Tidy cannot yet be called DOM-compliant,
but it's getting there.
- 23 Sep 1999 - Applied bug fix from tidy mailing list (15 Aug 1999)
to ParserImpl.ParseList.
- 3 Sep 1999 - Fixed "thread-safeness" issue in ParserImpl class.
Added InputStreamName property to Bean. Tried speed optimization
in Lexer.wstrcasecmp.
- 28 Aug 1999 - Changed property docTypeStr to docType, which is now
handled the same as the configuration file 'doctype' string. Fixed
potential IndexOutOfBoundsExceptions in Clean.createProps.
- 30 Jul 1999 - Updated Java Tidy to match HTML Tidy Release 26 Jul 1999.
Repackaged Java Tidy.
- 27 Jul 1999 - Fixed bug in Node.clone()
- 17 Jul 1999 - Fixed some bugs. Added code examples to document.
- 10 Jul 1999 - Updated Java Tidy to match HTML Tidy Release 7 Jul 1999
- 18 Jun 1999 - Java Tidy is now a Java Bean.
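The 7 Nov 1999 entry above explains why lexbuf is held as UTF-8 encoded
bytes and why byte ranges must be decoded as UTF-8 when they become
Strings. The following is a minimal sketch of that point, not Tidy's own
code: the class Utf8BufferDemo and its methods are hypothetical names
chosen for illustration, and it uses only the standard Java library.

```java
import java.nio.charset.StandardCharsets;

public class Utf8BufferDemo {

    /** Encodes a String into UTF-8 bytes, as a stand-in for a lexer buffer. */
    public static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes a range of a UTF-8 byte buffer into a String. */
    public static String decode(byte[] buf, int start, int len) {
        return new String(buf, start, len, StandardCharsets.UTF_8);
    }

    /** Naive conversion: treats every byte as one character. */
    public static String naive(byte[] buf, int start, int len) {
        StringBuffer sb = new StringBuffer(len);
        for (int i = start; i < start + len; i++) {
            sb.append((char) (buf[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "caf\u00e9" has 4 characters; U+00E9 takes two bytes in UTF-8.
        byte[] buf = encode("caf\u00e9");
        System.out.println(buf.length);                           // 5
        System.out.println(decode(buf, 0, buf.length).length());  // 4
        System.out.println(naive(buf, 0, buf.length).length());   // 5
    }
}
```

The naive conversion yields one character per byte, so any non-ASCII
text comes out garbled. For mostly-ASCII markup the byte[] form also
needs roughly half the memory of a char[] holding the same text.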
Code example of how to use the Tidy Java Bean
import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.URL;
import org.w3c.tidy.Tidy;

/**
 * This program shows how HTML could be tidied directly from
 * a URL stream, and running on separate threads. Note the use
 * of the 'parse' method to parse from an InputStream, and send
 * the pretty-printed result to an OutputStream.
 * In this example thread th1 outputs XML, and thread th2 outputs
 * HTML. This shows that properties are per instance of Tidy.
 */
public class Test16 implements Runnable {

    private String url;
    private String outFileName;
    private String errOutFileName;
    private boolean xmlOut;

    public Test16(String url, String outFileName,
                  String errOutFileName, boolean xmlOut)
    {
        this.url = url;
        this.outFileName = outFileName;
        this.errOutFileName = errOutFileName;
        this.xmlOut = xmlOut;
    }

    public void run()
    {
        BufferedInputStream in = null;
        FileOutputStream out = null;
        Tidy tidy = new Tidy();

        // Properties are set per instance, so each thread has its own Tidy.
        tidy.setXmlOut(xmlOut);
        try {
            // Send Tidy's messages to the error file.
            tidy.setErrout(new PrintWriter(new FileWriter(errOutFileName), true));
            URL u = new URL(url);
            in = new BufferedInputStream(u.openStream());
            out = new FileOutputStream(outFileName);
            tidy.parse(in, out);
        }
        catch ( IOException e ) {
            System.out.println( this.toString() + e.toString() );
        }
        finally {
            // Release the streams whether or not parsing succeeded.
            try { if ( in != null ) in.close(); } catch ( IOException e ) { }
            try { if ( out != null ) out.close(); } catch ( IOException e ) { }
        }
    }

    public static void main( String[] args ) {
        // Expects: url1 out1 err1 url2 out2 err2
        if ( args.length < 6 ) {
            System.err.println("usage: Test16 url1 out1 err1 url2 out2 err2");
            System.exit(1);
        }
        Test16 t1 = new Test16(args[0], args[1], args[2], true);
        Test16 t2 = new Test16(args[3], args[4], args[5], false);
        Thread th1 = new Thread(t1);
        Thread th2 = new Thread(t2);
        th1.start();
        th2.start();
    }
}
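The example above sets options in code. The 23 Oct 1999 entry in What's
New notes that options prefixed by "--" on the command line are not yet
parsed, because the configuration file is handled as a properties file.
One possible approach, sketched here under stated assumptions and not
taken from the Java Tidy source, is to collect "--name value" pairs into
a java.util.Properties object. The class OptionArgs and the method
toProperties are hypothetical names, and the option names in main are
sample strings only.

```java
import java.util.Properties;

public class OptionArgs {

    /**
     * Collects "--name value" pairs from args into a Properties object.
     * Arguments that are not part of such a pair are skipped, as is a
     * trailing "--name" that has no value after it.
     */
    public static Properties toProperties(String[] args) {
        Properties props = new Properties();
        int i = 0;
        while (i < args.length) {
            String arg = args[i];
            if (arg.startsWith("--") && arg.length() > 2 && i + 1 < args.length) {
                props.setProperty(arg.substring(2), args[i + 1]);
                i += 2;   // consumed the option name and its value
            } else {
                i++;      // not an option pair (for example a file name)
            }
        }
        return props;
    }

    public static void main(String[] args) {
        String[] sample = { "--indent", "yes", "--wrap", "72", "input.html" };
        Properties props = toProperties(sample);
        System.out.println(props.getProperty("indent"));  // yes
        System.out.println(props.getProperty("wrap"));    // 72
    }
}
```

Putting command-line options into the same Properties form as the
configuration file would let one code path interpret both. How such an
object would be handed to Tidy is not covered here.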
Code example of using Java Tidy as a parser
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintWriter;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

/**
 * A sample DOM writer. This sample program illustrates how to
 * traverse a DOM tree in order to print a document that is parsed.
 */
public class TestDOM {

    protected PrintWriter out;

    public TestDOM() {
        out = new PrintWriter(System.out);
    }

    /** Prints the specified node, recursively. */
    public void print(Node node) {
        if ( node == null ) {
            return;
        }

        int type = node.getNodeType();
        switch ( type ) {
        case Node.DOCUMENT_NODE:
            out.println("");
            print(((Document)node).getDocumentElement());
            out.flush();
            break;

        case Node.ELEMENT_NODE:
            // Start tag with its attributes.
            out.print('<');
            out.print(node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for ( int i = 0; i < attrs.getLength(); i++ ) {
                out.print(' ');
                out.print(attrs.item(i).getNodeName());
                out.print("=\"");
                out.print(attrs.item(i).getNodeValue());
                out.print('"');
            }
            out.print('>');
            out.println(); // HACK
            // Child nodes, in document order.
            NodeList children = node.getChildNodes();
            if ( children != null ) {
                int len = children.getLength();
                for ( int i = 0; i < len; i++ ) {
                    print(children.item(i));
                }
            }
            break;

        case Node.TEXT_NODE:
            out.print(node.getNodeValue());
            break;
        }

        // End tag for elements.
        if ( type == Node.ELEMENT_NODE ) {
            out.print("</");
            out.print(node.getNodeName());
            out.print('>');
            out.println(); // HACK
        }
        out.flush();
    }

    public static void main(String args[]) {
        if ( args.length == 0 ) {
            System.err.println("usage: TestDOM file");
            System.exit(1);
        }
        System.err.println(args[0]);

        FileInputStream in;
        Tidy tidy = new Tidy();
        TestDOM t = new TestDOM();
        try {
            in = new FileInputStream(args[0]);
            tidy.setMakeClean(true);
            tidy.setXmlTags(true);
            // Passing null as the output stream returns the DOM without writing it.
            t.print(tidy.parseDOM(in, null));
        }
        catch ( IOException e ) {
            System.err.println( e.toString() );
        }
    }
}
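The 4 Mar 2000 entry in What's New says the DOM tree can be modified by
inserting, deleting, or replacing nodes and by setting attributes on
elements. The sketch below shows those operations through the standard
org.w3c.dom interfaces. As an assumption made for the sake of a
self-contained example, it builds its tree with the JDK's own DOM
implementation instead of a Tidy-produced Document; the class
DomEditDemo is a hypothetical name.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class DomEditDemo {

    /** Builds a small tree, then inserts, replaces, and deletes nodes. */
    public static Document build() {
        Document doc;
        try {
            // Assumption: the JDK DOM stands in for a Tidy DOM tree.
            doc = DocumentBuilderFactory.newInstance()
                                        .newDocumentBuilder()
                                        .newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation", e);
        }
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        doc.appendChild(html);
        html.appendChild(body);

        // Insert new nodes.
        Element p = doc.createElement("p");
        p.appendChild(doc.createTextNode("Hello"));
        body.appendChild(p);
        Element h1 = doc.createElement("h1");
        body.insertBefore(h1, p);      // h1 now precedes p

        // Set an attribute and its value on an element.
        p.setAttribute("class", "intro");

        // Replace one node, then delete the replacement.
        Element h2 = doc.createElement("h2");
        body.replaceChild(h2, h1);     // h1 is swapped for h2
        body.removeChild(h2);          // only p remains under body
        return doc;
    }

    public static void main(String[] args) {
        Document doc = build();
        Node body = doc.getDocumentElement().getFirstChild();
        System.out.println(body.getChildNodes().getLength());                       // 1
        System.out.println(((Element) body.getFirstChild()).getAttribute("class")); // intro
    }
}
```

The same interface calls would apply to a Document returned by parseDOM
only to the extent that Java Tidy implements them, since Java Tidy
cannot yet be called DOM-compliant.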