Geoffrey Hoffman

February 17, 1998

MEng. Progress Report #1

So Far:

The initial part of the Java code formatter came along pretty quickly. Mainly, I have been pulling together the different pieces of what will be the formatter. It now has close to the same functionality as the 501 final project we did last semester, since a lot of that code could be reused, and the parsing and analysis is done with a much simpler tool.

The first part was to get Jlex running, which was not simple. Jlex is a simple Java lexer generator: it takes a grammar file and generates a Java lexer class that returns pre-defined tokens as it scans the input. Installation was a little confusing, mainly in seeing how it ties in with MS Dev. Jlex comes as a packaged Java file that in turn puts files into different directories, so it was unclear where the grammar file should be placed, and I kept getting assorted errors. I finally sorted that out by tweaking the project directory settings in MS Dev, and now I have it set up to re-process the grammar file whenever I want.

After getting Jlex to run (with some help from Max Khavin), I received the grammar file that Max had been working on for the cs211 final project. It is a fairly thorough but comfortably short grammar file covering most of the Java language specification. Running it through Jlex produced the lexer class. This lexer automatically eliminates white space and sequentially returns a token for each match it finds. Max also created a TokenType class containing constants for all of the Java keywords, which lets us use constants for comparisons and for the logic that reconstructs the code. The Token class has two elements: the token type (equal to one of the pre-defined constants) and the text that matched the token specification.
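To make the token-type and token pairing concrete, here is a minimal sketch. It is not Max's code: the class names SketchTokenType and SketchToken, the constant values, and the describe() helper are all assumptions made for illustration.

```java
// Hypothetical stand-in for the TokenType class: one int constant per kind.
// The names and values here are assumptions, not the real file.
final class SketchTokenType {
    static final int LCURLY = 1;
    static final int RCURLY = 2;
    static final int SEMICOLON = 3;
    static final int IDENTIFIER = 4;

    private SketchTokenType() {}
}

// Hypothetical stand-in for the Token class: a type plus the matched text.
final class SketchToken {
    private final int type;
    private final String text;

    SketchToken(int type, String text) {
        this.type = type;
        this.text = text;
    }

    int getType() { return type; }
    String getText() { return text; }
}

class TokenSketchDemo {
    // Compares the token type against the constants, as the formatter does.
    static String describe(SketchToken t) {
        switch (t.getType()) {
            case SketchTokenType.LCURLY:
                return "open block";
            case SketchTokenType.RCURLY:
                return "close block";
            case SketchTokenType.SEMICOLON:
                return "end of statement";
            default:
                return "other: " + t.getText();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(new SketchToken(SketchTokenType.LCURLY, "{")));
        System.out.println(describe(new SketchToken(SketchTokenType.IDENTIFIER, "count")));
    }
}
```

Because the constants are compile-time ints, they can be used directly as case labels, which is what makes a switch on the token type practical.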

The next step was to take the user interface from the Beautify 501 project last semester, cut out the Javacc calls that we were using, and replace them with the new lexer. The nice thing about this is that the UI we had already made can load and save source, has settings boxes and update buttons, and has a lot of code already done. The main difference between the two versions is that in the Javacc version we did the construction of the new “Beautified” code inside the Javacc parser, so the main code would just call the parser and get back a new string. That was much easier for us at the time, because Javacc did a very robust analysis of the Java language, and it would have been too complicated to listen for all of the different tokens as they came up and deal with them in the main code. With Jlex, there are only about 60 or so tokens, and a lot of them have similar behavior.

The main logic of the reconstruction occurs in the updateCode() function, which takes the original code, starts the lexer class that we created with Jlex, and then loops, pulling tokens until the token is null. The loop has switch statements on the token type and pretty much reassembles the code, depending on what kind of tokens it receives and on the user-specified settings. To deal with dependent tokens (some things behave differently depending on what comes before or after), I made the loop pull the next token ahead of time and compare against it when needed, which is useful for things like right brackets and other formatting issues (explained below).

As far as functions go, there are some simple functions that return the values of some of the user interface settings. I plan to add variables that read these values once and store them, to remove the numerous function calls. A global variable keeps the indent count, which is incremented and decremented as necessary. This allowed me to create a very useful function for dealing with the indent: newLine() returns a “\n” concatenated with the proper indent spacing, which it reads from the settings and repeats as many times as the indent count says. When it comes time for a new line, I just concatenate a call to this function onto the code. (One of the times I need to know the next token is when it is a right bracket, because I need to decrement the indent count before calling newLine(), so that the right bracket sits one indent less than the code it follows.)
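The interaction between the indent count and the one-token lookahead can be sketched in isolation. This is a minimal sketch, not the formatter's code: the IndentSketch class, its plain string-array input, and the indent unit passed to the constructor are assumptions made for illustration.

```java
// Hypothetical, stripped-down version of the indent logic described above.
class IndentSketch {
    private final String indentUnit; // e.g. two spaces or one tab (assumed setting)
    private int indentCount = 0;

    IndentSketch(String indentUnit) {
        this.indentUnit = indentUnit;
    }

    // Returns a line break followed by the indent unit repeated indentCount times.
    String newLine() {
        StringBuffer sb = new StringBuffer("\n");
        for (int i = 0; i < indentCount; i++) {
            sb.append(indentUnit);
        }
        return sb.toString();
    }

    // Reassembles the tokens, looking one token ahead for a right curly bracket.
    String format(String[] tokens) {
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            String next = (i + 1 < tokens.length) ? tokens[i + 1] : null;

            if (token.equals("{")) {
                indentCount++;
            }
            out.append(token);

            // Decrement before newLine() so the bracket lands one level out.
            if ("}".equals(next)) {
                indentCount--;
            }
            if (token.equals("{") || token.equals("}") || token.equals(";")) {
                out.append(newLine());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] tokens = { "{", "x", ";", "}" };
        System.out.print(new IndentSketch("  ").format(tokens));
    }
}
```

Without the lookahead, the decrement would happen only after the line break before the bracket had already been written, and the bracket would line up with the body instead of with the statement that opened the block.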

Otherwise, the logic will turn out to be a lot of busy work figuring out in which situations a space or an extra line should occur.

Problems/Plans/Questions

One thing that I noticed is that the grammar file needs to be filled out a little more in the case of comments and such. At the moment, it treats all comments the same and puts the text into a comment token, but a multiple-line comment loses its “\n” characters, and single-line “//” comments are not handled very well. I also believe it does not support 'x'-style character literals. I plan to meet with Max when he gets a chance so that he and I can fill it out a little more, which in turn may make it better for 211.

Except for comments, the formatter actually does a pretty good job already. I added a bunch of conditions, looking at semicolons, brackets, parentheses, curly brackets, and operators.

Some future concerns include the string manipulation. I do a lot of string operations, and it would probably be beneficial to optimize them if I can. Java strings are immutable, so concatenating two strings creates a new String each time, whereas a StringBuffer appends onto its existing buffer. I concatenate about 2-3 times per token, so I want to find out whether switching would give a big speed gain. I also want to change the code so that it pre-fetches the values of the UI components once before formatting, both for speed and because reading them live could potentially create weird code if a setting is changed mid-beautify.
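Both planned changes can be sketched together. This is a minimal sketch under stated assumptions: the SettingsSnapshot and BufferSketch classes and the spaceBetweenTokens flag are hypothetical names, and the spacing rule is deliberately far simpler than the real formatter's.

```java
// Hypothetical snapshot of the UI settings, read once before formatting starts,
// so a change made in the UI mid-run cannot affect the output.
final class SettingsSnapshot {
    final boolean spaceBetweenTokens;

    SettingsSnapshot(boolean spaceBetweenTokens) {
        this.spaceBetweenTokens = spaceBetweenTokens;
    }
}

class BufferSketch {
    // Accumulates output with StringBuffer.append instead of repeated String +=,
    // so no intermediate String is created for each token.
    static String assemble(String[] tokenTexts, SettingsSnapshot settings) {
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < tokenTexts.length; i++) {
            if (i > 0 && settings.spaceBetweenTokens) {
                out.append(' ');
            }
            out.append(tokenTexts[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] tokens = { "a", "=", "b", ";" };
        System.out.println(assemble(tokens, new SettingsSnapshot(true)));
        System.out.println(assemble(tokens, new SettingsSnapshot(false)));
    }
}
```

Whether this gives a noticeable speed gain for typical source files would still need to be measured; the sketch only shows the shape of the change.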

From here, I need to fill out the logic and try to get a thorough formatter that will handle most if not all Java constructs. I have a link to Sun's Java language specification, which is an enormous document, but I will go through it to see what situations I have missed.

After getting a full working version, I need to start looking into what I need to do to put a C or C++ wrapper on this, so that it can be invoked by Codewarrior, and start looking at the API for Codewarrior.

Following are some chunks of code, including the updateCode() function and some other functions. The grammar file is fairly long and uninteresting.

 

// calls the parser, and updates the text area with the new code.
public void updateCode() {
  Yytoken token, next_token;
  show();
  output_field.setText("Updating...");
  Yylex lexAnalyzer = new Yylex(new StringBufferInputStream(Original_Frame.getText()));
  temp_code = "";

  try {
    // compare string contents with equals(); == only compares references
    if ("Standard Formatting".equals(parse_choice.getSelectedItem())) {
      next_token = lexAnalyzer.yylex();
      while (next_token != null) {
        token = next_token;
        next_token = lexAnalyzer.yylex();
        switch (token.getType()) {
          case TokenType.LCURLY:
            if (bracketLineBox.getState())
              temp_code += newLine();
            indent_count++;
            break;
          case TokenType.COMMENT:
            temp_code += "/*";
            break;
          case TokenType.OPERATOR:
            if (opSpace())
              temp_code += " ";
            break;
        }

        temp_code += token.getText();

        if (next_token != null && next_token.getType() == TokenType.RCURLY)
          indent_count--;

        switch (token.getType()) {
          case TokenType.OPERATOR:
            if (opSpace() && next_token != null &&
                       next_token.getType() != TokenType.OPERATOR)
              temp_code += " ";
            break;
          case TokenType.COMMENT:
            temp_code += "*/" + newLine();
            break;
          case TokenType.RCURLY:
            if (extraLine())
              temp_code += newLine();
            // intentional fall-through: a right curly always ends the line
          case TokenType.LCURLY:
          case TokenType.SEMICOLON:
            temp_code += newLine();
            break;
          case TokenType.LPARAN:
            break;
          default:
            if (next_token != null) {
              switch (next_token.getType()) {
                case TokenType.SEMICOLON:
                case TokenType.LPARAN:
                case TokenType.RPARAN:
                case TokenType.LBRACKET:
                case TokenType.RBRACKET:
                case TokenType.RCURLY:
                case TokenType.LCURLY:
                case TokenType.OPERATOR:
                  break;
                default:
                  temp_code += " ";
              }

            }
        }
      } 
    } else if ("Token List".equals(parse_choice.getSelectedItem())) {
      token = lexAnalyzer.yylex();
      while (token != null) {
        temp_code += "Token \""  + token.getText() + "\" has type " + 
                                               token.getType() + "\n";
        token = lexAnalyzer.yylex();
      }
    }

    updated_code = temp_code;
    textarea.setText(updated_code);
    output_field.setText("Updated Successfully");
  
  }
  catch (Exception e) {System.out.println("updateCode failed: " + e);}
}


// returns a new line and the proper indent
public String newLine() {
  String input = indent_choice.getSelectedItem();
  String temp = "\n";
  String indy = "";

  // compare string contents with equals(); == only compares references
  if ("1 spaces".equals(input)) {
    indy = " ";
  } else if ("2 spaces".equals(input)) {
    indy =  "  ";
  } else if ("3 spaces".equals(input)) {
    indy =  "   ";
  } else if ("4 spaces".equals(input)) {
    indy =  "    ";
  } else if ("6 spaces".equals(input)) {
    indy =  "      ";
  } else if ("8 spaces".equals(input)) {
    indy =  "        ";
  } else if ("1 tab".equals(input)) {
    indy =  "\t";
  } else if ("2 tabs".equals(input)) {
    indy =  "\t\t";
  }

  // repeat the indent unit once per nesting level
  for (int i = 0; i < indent_count; i++)
    temp += indy;
  return temp;
}