JEP 105: DocTree API
Summary
Extend the Compiler Tree API to provide structured access to the content of javadoc comments.
Goals
Provide access to the syntactic elements of a javadoc comment.
Non-Goals
It is a non-goal to check the HTML tags in a javadoc comment for semantic well-formedness, that is, to validate the HTML against a DTD or similar. However, it should be possible to build such tools on top of the API.
Motivation
This API will enable a new generation of doc comment tools. Such tools could either be written using the Compiler and Tree API, or could be written as annotation processors. One tool that is long overdue is an updated equivalent of the old DocCheck doclet, which checks simple rules and guidelines for the contents of doc comments, and which has never been updated for the language changes in Java 5 and later.
javadoc could also be rewritten to take advantage of the new structured doc comment objects, and to use the additional information, such as source positions, in its error messages. The HTML parsing would also help javadoc generate valid XHTML. Although the work in JDK 7 javadoc makes it easy to generate XHTML for the sections generated by javadoc itself, javadoc does not currently have the means to check or verify the use of XHTML within the doc comments of the source files it is processing.
Description
The problem(s) ...
In JDK 5, there was a single scanner, capable of reading doc comments as needed by javadoc. In JDK 6, the code was refactored into two scanners: one that was capable of reading doc comments, suitable for use by javadoc, and one that was not, suitable for use by javac. That worked until we added the various public APIs to javac, by which clients and annotation processors could get access to doc comments if they so desired.
This means there are 3 types of clients for doc comments:
- comments not required -- javac, when no annotation processors need to be run
- comments definitely required -- javadoc
- comments maybe required -- clients of javac public API, including annotation processors run by javac
The problem is that reading and maintaining the doc comment table is expensive, and in the third category (comments maybe required) we have to support the doc comment table on the off chance that clients might need it, even though most do not.
Another problem with the doc comment table is that it is very low-tech. It is simply a map of tree node to string, where the string is the doc comment as needed by javadoc, meaning that the beginning of each line (the white space and typical '*') has been stripped away. This makes it very difficult indeed to relate positions within the doc comment back to positions in the original source file, which is why you don't see any traditional "emacs-style" error messages coming from javadoc about "parameter name not found" or "exception not declared to be thrown".
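To make the position problem concrete, here is a minimal sketch of a stripped comment that also records, for each line, where its first kept character sat in the source. It is not javac's code: the class StrippedComment, its methods, and the simplified stripping rule (blanks, then asterisks, then at most one blank) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a stripped doc comment that remembers where each line came from. */
final class StrippedComment {
    final String text;                     // decoration-free text, roughly what javadoc sees
    private final int[] lineStartInText;   // offset of each line within text
    private final int[] lineStartInSource; // offset of that line's first kept character in the source

    private StrippedComment(String text, List<Integer> inText, List<Integer> inSource) {
        this.text = text;
        this.lineStartInText = toArray(inText);
        this.lineStartInSource = toArray(inSource);
    }

    /** Strips leading white space and '*' from each line of the comment starting at commentStart. */
    static StrippedComment strip(String source, int commentStart) {
        int end = source.indexOf("*/", commentStart + 3);
        if (!source.startsWith("/**", commentStart) || end < 0) {
            throw new IllegalArgumentException("no doc comment at " + commentStart);
        }
        StringBuilder text = new StringBuilder();
        List<Integer> inText = new ArrayList<>();
        List<Integer> inSource = new ArrayList<>();
        int pos = commentStart + 3;
        while (pos < end) {
            // Simplified decoration rule (an assumption): blanks, then asterisks, then at most one blank.
            while (pos < end && (source.charAt(pos) == ' ' || source.charAt(pos) == '\t')) pos++;
            while (pos < end && source.charAt(pos) == '*') pos++;
            if (pos < end && source.charAt(pos) == ' ') pos++;
            int lineEnd = source.indexOf('\n', pos);
            if (lineEnd < 0 || lineEnd > end) lineEnd = end;
            inText.add(text.length());
            inSource.add(pos);
            text.append(source, pos, lineEnd);
            if (lineEnd < end) text.append('\n');
            pos = lineEnd + 1;
        }
        return new StrippedComment(text.toString(), inText, inSource);
    }

    /** Maps an offset in the stripped text back to an offset in the original source. */
    int toSourcePosition(int textOffset) {
        int line = 0;
        while (line + 1 < lineStartInText.length && lineStartInText[line + 1] <= textOffset) line++;
        return lineStartInSource[line] + (textOffset - lineStartInText[line]);
    }

    private static int[] toArray(List<Integer> list) {
        int[] a = new int[list.size()];
        for (int i = 0; i < a.length; i++) a[i] = list.get(i);
        return a;
    }
}
```

With only the plain string in the table, the two offset arrays are exactly the information that has been thrown away, so an offset inside the comment text cannot be turned into a file position for an error message.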
The same low-tech doc comment table is exposed to clients of the Tree API. For any tree node, you can get the doc comment string. That's it; beyond that, you're on your own.
The idea(s) ...
First up is to upgrade the doc comment table stored in each compilation unit. Replace
Map<JCTree, String> docComments;
by
Map<JCTree, JCDocComment> docComments;
JCDocComment is a new object internal to javac, and provides lazy access to the doc comment string. At a minimum it contains the starting position of the doc comment in the source file: the position of the opening "/" character.
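interface JCDocComment {
    // Starting position of the doc comment in the source file.
    int getPosition();
    // Text of the doc comment, which may be recovered lazily.
    String getComment();
}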
This allows us to have possibly three different doc comment scanners, for the three different kinds of client. For javac with no annotation processors, we continue to use the standard Scanner and leave the docComments table empty, as now. For javadoc, we continue to read the doc comments as now, except that now we store them in JCDocComment objects. For javac when we don't know whether doc comments are required or not, we simply store the starting position of the doc comment. This saves us storing the text of all the doc comments when they are not required.
The price is that when we do need the comments, we have to go back and recover the text of the comment from the source file. Different strategies are possible. If any doc comment in a source file is required, we could scan all of them. Note we don't have to scan the source text between the comments because we have the starting position of the comment available, so we can just skip the text between the comments. Or we could just read the comments as needed, and rely on the content cache to save us having to read the source file contents for each individual comment.
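The "store only the position, recover the text on demand" strategy could look roughly like the sketch below. The interface is the one given above; LazyDocComment and the Supplier that stands in for javac's content cache are assumptions for illustration, not the actual javac implementation.

```java
import java.util.function.Supplier;

/** The interface as given in the text. */
interface JCDocComment {
    int getPosition();
    String getComment();
}

/** Hypothetical lazy implementation: stores only the start position until the text is requested. */
final class LazyDocComment implements JCDocComment {
    private final Supplier<String> sourceText; // assumption: stands in for javac's content cache
    private final int position;                // position of the opening '/' in the source file
    private String comment;                    // null until first requested

    LazyDocComment(Supplier<String> sourceText, int position) {
        this.sourceText = sourceText;
        this.position = position;
    }

    @Override
    public int getPosition() {
        return position;
    }

    @Override
    public String getComment() {
        if (comment == null) {
            String source = sourceText.get(); // go back to the source only when needed
            int end = source.indexOf("*/", position + 3);
            if (!source.startsWith("/**", position) || end < 0) {
                throw new IllegalStateException("no doc comment at " + position);
            }
            // Raw text including the delimiters; stripping the line decoration is omitted here.
            comment = source.substring(position, end + 2);
        }
        return comment;
    }
}
```

A client that never calls getComment() pays only for one int per comment, and a client that calls it repeatedly reads the source once per comment, which is the trade-off described above.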
The next idea is a better, parsed, representation of doc comments.
A doc comment consists of
- an initial sentence
- the rest of the main description
- a list of tags each followed by a description
Each of these contains a sequence of fragments where each fragment can be one of
- plain text, including characters from malformed fragments like '<', '>', '&', '{', etc.
- an HTML start-entity, which contains a name and a list of name-value pairs, such as '<a href="Object.html">'
- an HTML end-entity, which contains a name, such as '</a>'
- an HTML character entity, such as '&amp;'
- a taglet, such as {@link Object}
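The fragment kinds listed above can be sketched as a small scanner over the text of a description. This is only an illustration under stated assumptions: FragmentScanner, Fragment and Kind are hypothetical names, attributes are not split into name-value pairs, nested braces inside an inline tag are not handled, and anything that does not match one of the patterns falls back to plain text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of splitting a description into the fragment kinds listed above. */
final class FragmentScanner {
    enum Kind { TEXT, START_ELEMENT, END_ELEMENT, ENTITY, INLINE_TAG }

    static final class Fragment {
        final Kind kind;
        final int start;   // offset of the fragment within the scanned string
        final String text; // the fragment exactly as written

        Fragment(Kind kind, int start, String text) {
            this.kind = kind;
            this.start = start;
            this.text = text;
        }
    }

    static List<Fragment> scan(String s) {
        List<Fragment> out = new ArrayList<>();
        int textStart = 0;
        int i = 0;
        while (i < s.length()) {
            Fragment f = tryFragment(s, i);
            if (f == null) { // malformed or ordinary character: stays in the plain text run
                i++;
                continue;
            }
            if (i > textStart) out.add(new Fragment(Kind.TEXT, textStart, s.substring(textStart, i)));
            out.add(f);
            i = textStart = i + f.text.length();
        }
        if (textStart < s.length()) out.add(new Fragment(Kind.TEXT, textStart, s.substring(textStart)));
        return out;
    }

    /** Returns the non-text fragment starting at i, or null if none starts there. */
    private static Fragment tryFragment(String s, int i) {
        char c = s.charAt(i);
        if (c == '<') {
            boolean closing = s.startsWith("</", i);
            int nameAt = closing ? i + 2 : i + 1;
            int close = s.indexOf('>', i);
            if (nameAt < s.length() && Character.isLetter(s.charAt(nameAt)) && close > 0) {
                return new Fragment(closing ? Kind.END_ELEMENT : Kind.START_ELEMENT, i, s.substring(i, close + 1));
            }
        } else if (c == '&') {
            int j = i + 1;
            while (j < s.length() && (Character.isLetterOrDigit(s.charAt(j)) || s.charAt(j) == '#')) j++;
            if (j > i + 1 && j < s.length() && s.charAt(j) == ';') {
                return new Fragment(Kind.ENTITY, i, s.substring(i, j + 1));
            }
        } else if (s.startsWith("{@", i)) {
            int close = s.indexOf('}', i);
            if (close > 0) return new Fragment(Kind.INLINE_TAG, i, s.substring(i, close + 1));
        }
        return null;
    }
}
```

Each fragment carries its own start offset, so a tool built on such a representation can report a problem at the exact fragment rather than at the comment as a whole.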
Obviously, these can be modeled with a simple hierarchy of tree nodes, so I suggest a new package, com.sun.source.doccomments, to contain the interfaces for these tree nodes. It would best be a separate hierarchy from the existing com.sun.source.tree.Tree, so I suggest a new common super-interface com.sun.source.doccomments.DTree. We can then extend the utility methods in com.sun.source.util to provide access to the parsed doc comment for any tree node, and to provide source position info for any DTree node.
Note, the HTML start-entity and end-entity are handled as separate items to avoid getting into issues of parsing HTML and knowing which tags require a closing tag and which not. That layer can be built on top of this abstraction by those applications that need it.
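As an illustration of the kind of layer an application could build on top, the sketch below pairs start and end items with a stack. Everything in it is an assumption: TagBalanceCheck and TagItem are hypothetical names, and the set of elements that take no end tag is deliberately incomplete. It shows why this knowledge can live outside the doc comment parser.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical application-level check built on separate start and end items. */
final class TagBalanceCheck {
    // Assumed, deliberately incomplete set of elements that take no end tag.
    private static final Set<String> NO_END_TAG = new HashSet<>(Arrays.asList("br", "hr", "img"));

    /** One start or end item, in the shape a doc comment parser might report it. */
    static final class TagItem {
        final String name;
        final boolean isEnd;
        final int pos; // position used in messages

        TagItem(String name, boolean isEnd, int pos) {
            this.name = name;
            this.isEnd = isEnd;
            this.pos = pos;
        }
    }

    /** Returns one message per unexpected end item or unclosed start item. */
    static List<String> check(List<TagItem> items) {
        List<String> problems = new ArrayList<>();
        Deque<TagItem> open = new ArrayDeque<>();
        for (TagItem t : items) {
            if (!t.isEnd) {
                if (!NO_END_TAG.contains(t.name)) open.push(t);
            } else if (!open.isEmpty() && open.peek().name.equals(t.name)) {
                open.pop();
            } else {
                problems.add(t.pos + ": unexpected </" + t.name + ">");
            }
        }
        for (TagItem t : open) {
            problems.add(t.pos + ": unclosed <" + t.name + ">");
        }
        return problems;
    }
}
```

A stricter or more lenient tool would only need to change the set and the matching rule, without touching how the comment itself is parsed.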
Parsing these comments will not be cheap; nothing involving lexing and parsing ever is. And so this is another reason to provide and use the lazy access to doc comments via the JCDocComment table described earlier. Except now, there is a more interesting method on it as well, to get the DTree for a comment -- which is another reason not to bother to keep the simple string that is currently provided.
Testing
langtools regression tests will be written to exercise the new API. One specific test will be to read and process all the JDK API comments.
There are no special platform or hardware requirements.
Dependences
This work has no dependences on other JEPs.
It is expected that other JEPs will depend on this one.
Impact
- Other JDK components: javadoc
- Compatibility: minimal
- Internationalization: minimal
- Localization: minimal