A Java DTD Parser
DTD Parser News
4-1-2003 DTDParser is now available under a dual license. You can use it either under the terms of the LGPL, as before, or under an Apache-style license. Using one license doesn't obligate you to the terms of the other.
10-1-2002 Okay, I was wrong about one thing with the PCDATA: it CAN have * after it if it appears by itself. Thanks to Steen Lehmann for pointing this out. This is fixed in version 1.20.
7-30-2002 Wow! Even more bug fixes! I fixed the parsing of (#PCDATA) so it works as the XML spec says it should, which may mean that the parser now rejects DTDs it used to accept. If you use (#PCDATA) by itself, it can't have * after it (the parser previously allowed it). If you use #PCDATA at the beginning of a list, like (#PCDATA|foo|bar|baz), you *must* have a * after the list: (#PCDATA|foo|bar|baz)*. I checked this against some of the more popular, large DTDs and it still parses them well. Also, identifiers may now begin with _ or :, which the parser previously didn't allow.
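The rule in this entry, together with the 10-1-2002 correction that (#PCDATA)* is also legal, can be sketched as a small standalone check. This is not DTDParser's own code: the class MixedContentCheck and its method are hypothetical names, and the name pattern is simplified to ASCII for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of DTDParser: checks a mixed-content model
// against the rules described above (ASCII names only, a simplification).
class MixedContentCheck {
    // Simplified name: starts with a letter, '_' or ':'.
    private static final String NAME = "[A-Za-z_:][A-Za-z0-9._:-]*";

    // (#PCDATA) by itself, optionally followed by '*'.
    private static final Pattern PCDATA_ONLY =
        Pattern.compile("\\(\\s*#PCDATA\\s*\\)\\*?");

    // (#PCDATA|a|b...) with at least one name: the trailing '*' is mandatory.
    private static final Pattern PCDATA_LIST =
        Pattern.compile("\\(\\s*#PCDATA(\\s*\\|\\s*" + NAME + ")+\\s*\\)\\*");

    static boolean isValidMixed(String model) {
        String m = model.trim();
        return PCDATA_ONLY.matcher(m).matches()
            || PCDATA_LIST.matcher(m).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "(#PCDATA)",
            "(#PCDATA)*",
            "(#PCDATA|foo|bar|baz)*",
            "(#PCDATA|foo|bar|baz)"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValidMixed(s));
        }
    }
}
```

Only the last sample is rejected, because a list that starts with #PCDATA and names other elements must end in *.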
7-29-2002 Oops! I had uncommented a println while testing version 1.17 and forgot to comment it out again. Also, the Ant build wasn't including the Main-Class attribute for the DTDParser JAR file, so you weren't able to use the -jar option. That is fixed as well now.
7-28-2002 Dmitriy Kulakov pointed out that although the parser can include entity definitions from files using a relative path, this doesn't work when using a URL. I fixed the entity class to work properly with URLs; it now handles both absolute and relative URLs. Also, to help with testing URLs, I changed the Tokenize program to look for "://" in the filename and, if it finds it, to use a URL instead of a File.
While parsing the DOCBOOK DTD, I found that the ATTLIST parsing was forgetting to consume the > at the end of the list. At the top level, the parser was ignoring unexpected tokens, so this error never showed up before. I fixed the ATTLIST bug, then changed the main parsing loop to report an error when it hits an unexpected token. This may result in the parser reporting errors that it never reported before, but these really are errors.
Also, since the parser now reports unexpected tokens, I had to change the scanner because it was returning unresolvable entities as identifiers. This wasn't a problem before because the parser would ignore them, but now the behavior should be correct (that is, you expand %foobar; and if there is no foobar, the scanner just keeps going rather than returning %foobar; as an identifier).
Finally, I played with the Ant build for a while to get it to generate the ZIP and TGZ files with a root of dtdparser-m.n so you don't have to create a new directory before unpacking the files. I used to package the files this way, but stopped after switching to Ant.
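The URL handling described in this entry can be illustrated with a minimal standalone sketch. It is not the parser's actual code: SourceLocator and both method names are hypothetical, the example.com addresses are placeholders, and it uses java.net.URI for resolution, which may differ from what the library does internally.

```java
import java.net.URI;

// Hypothetical sketch, not DTDParser's code.
class SourceLocator {
    // The same test the entry describes for Tokenize:
    // a name containing "://" is treated as a URL, anything else as a file.
    static boolean looksLikeUrl(String name) {
        return name.contains("://");
    }

    // Resolve an entity's system identifier against the DTD's own location.
    // An absolute identifier replaces the base; a relative one is resolved
    // against the directory part of the base.
    static String resolve(String baseUrl, String systemId) {
        return URI.create(baseUrl).resolve(systemId).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/dtds/docbook.dtd";
        System.out.println(looksLikeUrl(base));
        System.out.println(looksLikeUrl("docbook.dtd"));
        System.out.println(resolve(base, "ent/iso.ent"));
        System.out.println(resolve(base, "http://example.org/other.ent"));
    }
}
```

With these placeholder inputs, the relative identifier resolves to http://example.com/dtds/ent/iso.ent, while the absolute one is returned unchanged.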
7-18-2002 I finally got around to fixing some of the outstanding bugs. Here is a list of the changes I made:
- When encountering an empty list, the parser now chooses a sequence as the default instead of a choice (as per the XML DTD spec).
- When creating a list of objects matching a particular type, the parser was adding the item type to the vector instead of the item itself. This may have been fixed in 1.15, but it doesn't look like I got it checked back into CVS correctly.
- Fixed an infinite loop in the case of an unterminated string.
- Fixed the parsing of notations in ATTLIST.
The parser can now parse both the DOCBOOK and the FIXML DTDs, which it had trouble with before. Thank you to everyone who sent in bug reports and potential fixes. Hopefully the library should be pretty stable now.
11-12-2001 Someone hacked the server running wutka.com and wiped everything out. I seem to have lost a few news items from this file, but otherwise we are fine. Thanks to Elian Carsenat for sending a fix for getElementsByType, which was just returning a vector of types instead of elements. The fix is available in version 1.15.
9-18-2000 Thanks to Paul Libbrecht for pointing out the few minor places where the DTD Parser used JDK 1.2 features where there was a JDK 1.1 equivalent. The DTD Parser should now be JDK 1.1 compliant. Also, thanks to Peter Kriegesmann for pointing out a bug in DTD.getItemsByType.
9-15-2000 I've been working on a utility to print out the contents of a DTD as a tree. It's text-only, but it's pretty useful. The program is called ShowDTDTree and it can show all or part of a DTD tree. It can deal with some circular paths but it still seems to choke on a few.
9-01-2000 I have updated the DTD parser to allow you to specify a File or a URL as opposed to a Reader. When the DTD includes an entity with a relative path, the parser can then use the file or URL to figure out the path for the included entity.
8-26-2000 I have released a new utility called BeanToDTD that uses introspection to create a DTD from a Java bean.
8-15-2000 Thanks again to Bob Withers for his updates to keep track of the filenames for external entities, which allow the parsing exceptions to tell you which file had the error. Bill La Forge also made a correction to make DTDMixed print the same way as sequences and choices. I also added getItemsByType to the DTD class, which lets you fetch all the items of a specific type.
8-10-2000 Added support for reading external entities(!). Added get/set methods to make the DTD data model properties work like Java Bean properties. The original public fields are still directly accessible to maintain backwards compatibility. Thanks to Bob Withers of BEA for pointing out a bug in the INCLUDE handling and for providing a fix.
8-08-2000 Added the line number and column to the exceptions thrown by the parser. Bill La Forge also updated the DTDEmpty tag to make it print properly.
8-06-2000 Bill La Forge made a few changes to aid in the development of Quick.
8-05-2000 Congratulations to Dr. Patricia Brown Graham for getting her Ph.D. in Sociology from U. of South Carolina. Way to go, Mom!
8-04-2000 Probably the most requested feature in the DTD parser is to preserve the order of all the items in the DTD, both for printing and for general examination. The DTD class now includes a Vector called "items" that contains the items in the original DTD file in the order they were read. Also, you'll find that there are two new classes that may appear in the items vector: DTDComment and DTDProcessingInstruction.
7-29-2000 Bill La Forge has kindly updated the DTD, DTDElement and DTDAttribute classes to sort the DTD elements before writing them to a PrintWriter.
7-26-2000 The DTD object and its subordinate objects now have a write method that lets the objects write themselves out to a PrintWriter. This allows you to write out DTDs as well as read them. Added code to guess the root element of the DTD by process of elimination. By default, the DTDParser.parse method does NOT try to figure out the root. Instead, you must call DTDParser.parse(true) to make it guess. The dtdparser14.jar file includes a Main-Class manifest entry so you can now run the tokenizer with:
java -jar dtdparser14.jar somedtdfile.dtd
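The root guessing by process of elimination mentioned in this entry could work roughly like the sketch below: any element that appears in some other element's content model cannot be the root, and if exactly one candidate survives, that is the guess. This is an assumption about the idea, not the parser's actual implementation; RootGuesser, its input map and the sample element names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not DTDParser's code.
class RootGuesser {
    // 'children' maps each declared element to the element names
    // referenced in its content model.
    static String guessRoot(Map<String, Set<String>> children) {
        Set<String> candidates = new TreeSet<>(children.keySet());
        for (Map.Entry<String, Set<String>> e : children.entrySet()) {
            for (String child : e.getValue()) {
                // Assumption: an element that only refers to itself
                // is not eliminated by that self-reference.
                if (!child.equals(e.getKey())) {
                    candidates.remove(child);
                }
            }
        }
        // Only guess when exactly one element is never referenced.
        return candidates.size() == 1 ? candidates.iterator().next() : null;
    }

    // A tiny made-up DTD: book contains chapter and title, and so on.
    static Map<String, Set<String>> sampleDtd() {
        Map<String, Set<String>> m = new HashMap<>();
        m.put("book", new HashSet<>(Arrays.asList("title", "chapter")));
        m.put("chapter", new HashSet<>(Arrays.asList("title", "para")));
        m.put("title", new HashSet<>());
        m.put("para", new HashSet<>());
        return m;
    }

    public static void main(String[] args) {
        System.out.println(guessRoot(sampleDtd())); // prints "book"
    }
}
```

Returning null when more than one element is unreferenced keeps the guess conservative; a DTD with several possible document roots has no single answer.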
7-12-2000 Fixed the parsing of enumerations to match what the specification calls "Nmtoken". Basically, an identifier must start with a letter and may contain only certain characters; an Nmtoken contains only valid identifier characters but doesn't need to start with a letter.
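The difference between an identifier and an Nmtoken can be made concrete with a short standalone sketch. The class NameKinds and its methods are hypothetical and not part of DTDParser, and the character test is a simplification of the XML specification's character classes. Following the 7-30-2002 entry, it also lets an identifier start with _ or :.

```java
// Hypothetical sketch, not DTDParser's code.
class NameKinds {
    // Simplified set of characters allowed inside a name or Nmtoken.
    static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c)
            || c == '.' || c == '-' || c == '_' || c == ':';
    }

    // An Nmtoken is one or more name characters, in any order.
    static boolean isNmtoken(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!isNameChar(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // An identifier is an Nmtoken whose first character is a letter, '_' or ':'.
    static boolean isName(String s) {
        if (!isNmtoken(s)) {
            return false;
        }
        char first = s.charAt(0);
        return Character.isLetter(first) || first == '_' || first == ':';
    }

    public static void main(String[] args) {
        for (String s : new String[] {"first", "1st", "-dash", "a b"}) {
            System.out.println(s + ": name=" + isName(s)
                + " nmtoken=" + isNmtoken(s));
        }
    }
}
```

An enumeration value such as 1st is therefore a valid Nmtoken even though it could not be used as an element or attribute name.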
7-7-2000 Fixed the parser and scanner to interpret ?> as two separate tokens. The only downside to this is that the scanner may accept:
<? blah blah blah ? >
Also fixed the element parser to explicitly look for > at the end of the element definition. This shouldn't affect anything, except that the parser will recognize more errors instead of ignoring them.
7-3-2000 After some great feedback from Bill La Forge and Pankaj Kamthan, I have been able to fix several parsing bugs and expand the parser to include other DTD items. Bill spent a lot of time trying things out and giving me feedback about the object model. Thanks, Bill! The newly supported features are:
I ran the parser against every DTD on my system and it only failed on two of them, both of which had legitimate errors.
Got any questions or suggestions? Feel free to write me at mark@wutka.com