\documentclass[a4paper, oneside, 11pt]{report}
\usepackage{epsfig,pifont,float,multirow,amsmath,amssymb}
\newcommand{\mc}{\multicolumn{1}{c|}}
\newcommand{\mb}{\mathbf}
\newcommand{\mi}{\mathit}
\newcommand{\oa}{\overrightarrow}
\newcommand{\bs}{\boldsymbol}
\newcommand{\ra}{\rightarrow}
\newcommand{\la}{\leftarrow}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{amsmath}
\usepackage{mathtools}
\usepackage{graphicx}
\graphicspath{ {./images/} }
\topmargin = 0pt
\voffset = -80pt
\oddsidemargin = 15pt
\textwidth = 425pt
\textheight = 750pt
\begin{document}
\begin{titlepage}
\begin{center}
\rule{12cm}{1mm} \\
\vspace{1cm}
{\large CMP-6048A / CMP-7009A Advanced Programming Concepts and Techniques}
\vspace{7.5cm}
\\{\Large Project Report - 15 December 2021}
\vspace{1.5cm}
\\{\LARGE Esoteric Compiler / Interpreter Project}
\vspace{1.0cm}
\\{\Large Group members: \\ Eden Attenborough, Alfie Eagleton, Aiden Rushbrooke, Chris Sutcliffe}
\vspace{10.0cm}
\\{\large School of Computing Sciences, University of East Anglia}
\\ \rule{12cm}{0.5mm}
\\ \hspace{8.5cm} {\large Version 1.0}
\end{center}
\end{titlepage}
\setcounter{page}{1}
%\pagenumbering{roman}
%\newpage
\chapter{Introduction}
\label{chap:intro}
\section{Project Statement}
This report details the work done on creating a compiler for an esoteric programming language, or Esolang. Both the language and the compiler are designed and created as part of this project, with the final result being a working compiler for a Turing-complete language. The plan for the project is to first research other compiler implementations and existing esoteric languages to learn the basic structure and fundamentals. We will then take this knowledge and develop the compiler using advanced programming concepts, before running a series of designed experiments to test the compiler's capabilities and effectiveness.
\section{Technology}
To develop the compiler we plan on using the programming language Java for the majority of the implementation. We plan on either compiling directly into Java bytecode, or into a secondary language such as C.
\section{MoSCoW Analysis}
Before implementation we have decided on a series of requirements, split into sections using the MoSCoW analysis method.
Must have:
\begin{itemize}
\item A working compiler without any crashes or unexpected behaviour
\item A Turing complete language with features including expressions, conditionals and loops
\item Basic error handling and reporting
\item Reading and executing single lines of code from command line
\item Output to command line
\item Basic numerical variable handling
\end{itemize}
Should have:
\begin{itemize}
\item Reading and executing from source files
\item Additional features including strings and logical expressions
\item Improved error handling, providing useful information to the user
\item Optimization steps during compilation and before runtime
\end{itemize}
Could have:
\begin{itemize}
\item Further features such as functions and recursion
\item Array handling
\item Further string handling such as indexing
\item Memory allocation and disposal
\end{itemize}
Won't have:
\begin{itemize}
\item User interface
\item Object oriented programming features
\item IDE features
\end{itemize}
\section{Report structure}
This report will first cover our research into compilers and esoteric languages in Chapter 2. Then, we will describe the overall design and structure of a compiler based on our research in Chapter 3. Chapter 4 will focus on the design of our own esoteric language, including grammar tables. Chapter 5 will describe the implementation and documentation of the compiler, explaining how it was implemented using agile methods. Chapter 6 will show evidence of testing the compiler, with both unit tests and overall test harnesses for each section. Finally, Chapter 7 will be the conclusion, where we will compare the final result to our initial requirements and outline any further work we would do given more time.
\chapter{Research of Similar Systems}
\section{Compilers vs Interpreters}
Within the field of computer programming languages, there are two main types: compiled languages and interpreted languages. Both of these styles have distinct advantages and disadvantages. Compilers and interpreters both convert code written in human readable higher-level languages to instructions that a computer can execute.
In the case of a compiler, code is converted from high-level program code to assembly language, then to machine language. This process can be slow, especially when there is a large amount of code to compile, but execution is fast, as only the instructions directly executed need to be kept in memory.
In contrast, an interpreter directly runs the code given to it; in other words, an interpreter is a separate program that reads and executes the instructions itself, rather than converting them into the assembly language of the particular architecture being run on. The principal advantage of this is that a computationally expensive compilation process does not have to be run every time, leading to faster startup times. However, this approach carries a significant disadvantage: because every line has to be interpreted, execution is slower than the equivalent instructions run in a compiled language.
\section{Esoteric Languages (Esolangs)}
Esolangs are programming languages designed as jokes or proofs of concept, rather than for use in actual programming tasks. Typically, they are created for fun rather than out of any serious desire to solve a particular programming problem, and so most such languages aim to be funny, to be as difficult to use as possible, or to have as few features as possible while still being Turing complete. Some esoteric languages are also designed to adapt existing languages into simpler or more usable forms.
\section{Shakespeare Programming Language}
The Shakespeare Programming Language is an esoteric language whose code parodies extracts from Romeo and Juliet. Here, code is written so that it reads like a script, such as \texttt{[Enter Juliet]}, where \texttt{Juliet} may be a variable name. Other examples include using \texttt{Acts} or \texttt{Scenes} as GOTO statements, where the interpreter can jump to a certain point in the code and read from there; otherwise, these are ignored, similar to the title or character descriptions. Other languages in this vein include LOLCODE, Hodor and Whitespace. The aim is to replace typical language features so that the code can still be read by a compiler or interpreter but looks and reads very differently to the user, sometimes for humour.
\section{Befunge}
Befunge is one of the most popular and well-known esoteric languages, intentionally designed to be hard to compile. Befunge code is written on a 2D plane, where ASCII characters represent instructions. An instruction pointer starts in the top left of this plane and begins travelling right. Data is stored on a stack, as in many other esoteric languages. Another feature of Befunge is that it is self-modifying: instructions can modify the plane the code is on. Since Befunge-98 there is no fixed size of the plane, making Befunge a Turing-complete language.
\section{Pyth}
Pyth is a language designed to be both concise and clear to read. Pyth is designed to be similar to Python, but with every instruction a single character. When run, Pyth code is first compiled into Python before being executed. Unlike many other esoteric languages, Pyth is procedural, not stack-based. Print statements are also implicit in Pyth, and all constructs are terminated at the end of the line.
\chapter{Compiler Design and Methodology}
This chapter will discuss how compilers work and the design behind them. The goal of a compiler is to convert the source code into a form which the CPU can execute. As mentioned in Chapter 2, a compiler works by converting the entire source code in one go and then executing it, unlike an interpreter, which executes one line at a time. However, before the code can be executed, a number of steps have to be taken.
\section{The Lexer (Lexical Analysis)}
The first step for a compiler is to convert the initial source code into groups of characters which the computer can understand, generally known as tokens. This process is also known as tokenisation, particularly in the field of natural language processing (NLP). When designing a programming language, the task of lexical analysis or tokenisation is performed by a lexer or scanner. This lexer looks through each character and creates equivalent tokens based on the character values. For example, the character '+' may be converted into the token "plus", which in a programming language would likely represent addition. Some tokens may also be multiple characters long, such as strings, identifiers and keywords like "if" or "int". This stage can also detect some syntax errors, for any unrecognised characters. The end result of lexical analysis is a series of tokens representing the starting source code.
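To illustrate the idea, a minimal lexer can loop over the characters and emit tokens from a switch statement. This is a simplified sketch with illustrative names, not our actual \texttt{TokenScanner} implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the scanning loop described above; the names and
// token format are illustrative, not the project's actual TokenScanner code.
class MiniLexer {
    static List<String> scan(String source) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            switch (c) {
                case '+': tokens.add("PLUS"); i++; break;
                case '-': tokens.add("MINUS"); i++; break;
                case '*': tokens.add("STAR"); i++; break;
                case '/': tokens.add("SLASH"); i++; break;
                case ' ': i++; break; // skip whitespace
                default:
                    if (Character.isDigit(c)) {
                        // A number token may span several characters
                        int start = i;
                        while (i < source.length()
                                && (Character.isDigit(source.charAt(i)) || source.charAt(i) == '.')) {
                            i++;
                        }
                        tokens.add("NUMBER " + source.substring(start, i));
                    } else {
                        i++; // a real lexer would report a syntax error here
                    }
            }
        }
        return tokens;
    }
}
```

Scanning the string \texttt{3-1+2} with this sketch yields the tokens \texttt{NUMBER 3}, \texttt{MINUS}, \texttt{NUMBER 1}, \texttt{PLUS}, \texttt{NUMBER 2}.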
\section{The Parser}
The second stage of the compiler is to work out how the tokens link together to form a program. The parser takes the list of tokens created and compares it to the language grammar. This allows it to understand how the tokens are linked, and it can then use this information to create an abstract syntax tree (AST), which represents the code in a logical fashion. The parser can also detect any syntax errors in the initial source code when comparing against the language grammar.
The specific parser we have chosen is a recursive descent parser, a top-down parser where each rule in the grammar maps to a specific function in the parser implementation. A recursive descent parser works by starting at the first rule in the grammar, then working downward, finding each statement or expression. We chose this parser design as it is simple to implement and gives much more control over how the parser functions and reports errors, compared to parsers created through a parser generator and grammar table. Many compiler implementations also use recursive descent parsers, such as GCC \cite{GCCParser} and Clang.
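As a simplified illustration of recursive descent (the names below are illustrative, not our actual \texttt{Parser} class), each grammar rule such as \texttt{<term>} and \texttt{<factor>} becomes one method, and operator precedence falls out of which method calls which:

```java
import java.util.List;

// Hedged sketch of recursive descent for two arithmetic rules; it
// evaluates directly instead of building an AST, to keep it short.
class MiniParser {
    private final List<String> tokens;
    private int pos = 0;
    MiniParser(List<String> tokens) { this.tokens = tokens; }

    // <term> ::= <factor> (("+" | "-") <factor>)*
    double term() {
        double value = factor();
        while (pos < tokens.size()
                && (tokens.get(pos).equals("+") || tokens.get(pos).equals("-"))) {
            String op = tokens.get(pos++);
            double rhs = factor();
            value = op.equals("+") ? value + rhs : value - rhs;
        }
        return value;
    }

    // <factor> ::= <number> (("*" | "/") <number>)*
    double factor() {
        double value = Double.parseDouble(tokens.get(pos++));
        while (pos < tokens.size()
                && (tokens.get(pos).equals("*") || tokens.get(pos).equals("/"))) {
            String op = tokens.get(pos++);
            double rhs = Double.parseDouble(tokens.get(pos++));
            value = op.equals("*") ? value * rhs : value / rhs;
        }
        return value;
    }
}
```

Because \texttt{term()} calls \texttt{factor()} for each operand, multiplication binds tighter than addition, and the left-to-right loop gives left-associativity: \texttt{3 - 1 + 2} evaluates to 4.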
\section{Optimization}
This step involves making a variety of optimizations to the current program to speed up execution. This may involve simplifying loops, or pre-calculating complicated expressions which contain only constants. However, this step is not vital to the function of an effective compiler, and some compilers ignore optimization entirely.
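The pre-calculation mentioned above is usually called constant folding. A minimal sketch of the idea (the class and method names here are hypothetical, not part of our compiler) replaces an operation on two literal numbers with its computed value, and leaves anything else for runtime:

```java
// Illustrative constant-folding pass: an expression whose operands are
// both numeric literals is replaced by its computed value before code
// generation. The names here are hypothetical.
class Folder {
    static String fold(String left, String op, String right) {
        try {
            double l = Double.parseDouble(left);
            double r = Double.parseDouble(right);
            switch (op) {
                case "+": return String.valueOf(l + r);
                case "-": return String.valueOf(l - r);
                case "*": return String.valueOf(l * r);
                case "/": return String.valueOf(l / r);
            }
        } catch (NumberFormatException e) {
            // One side is not a constant: leave the expression for runtime.
        }
        return left + op + right;
    }
}
```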
\section{Code Generation}
At this stage, the code is converted into a form which can be executed by the CPU, such as machine code. However, due to the complexity of converting directly to machine code, an alternative is commonly used: the compiler writes to an alternate form designed to run on a virtual machine, known as bytecode. Java is a well-known example of this method, where Java programs are compiled to run on the Java Virtual Machine.
One other method is to compile into another, existing programming language. A compiler that does this is known as a source-to-source compiler. Here, instead of writing machine code or bytecode, the compiler creates a string of valid source code for the target language, based on the initial source code given to the main compiler. This can then be executed using other, pre-existing compilers. One common use of such compilers is for web browsers, where many languages compile to JavaScript.
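As an illustrative sketch of the source-to-source idea (the class and method names here are hypothetical, not part of our compiler), the code generator simply concatenates strings of valid C around the translated statements:

```java
// Sketch of source-to-source generation: instead of machine code, the
// generator builds a string of valid C. Names are hypothetical.
class CEmitter {
    // Wraps a numeric expression in a C print statement.
    static String emitPrint(String expr) {
        return "printf(\"%f\\n\", " + expr + ");";
    }

    // Wraps a body of C statements in a complete, compilable program.
    static String emitProgram(String body) {
        return "#include <stdio.h>\nint main(){\n" + body + "\n}";
    }
}
```

The resulting string can be written to a \texttt{.c} file and handed to an existing compiler such as gcc.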
\chapter{Our Esoteric Language}
\section{EsotericFortran}
For our esoteric language we decided to adapt an existing language into a modern form. Specifically, we chose Fortran, as it contained all the features we were interested in implementing, but was also different enough from most currently popular languages.
Of course, we have decided to make some changes from Fortran. Most importantly, we decided to include reserved keywords in our language, unlike Fortran, as we thought that allowing variables to share names with keywords adds unnecessary confusion. We have also made other, smaller changes from Fortran, either due to time constraints or to simplify the language into a more modern form.
\section{Language Grammar}
Before we can begin implementation, we need to design the grammar of our language. The grammar not only gives us an understanding of how our language will work and allows us to spot any issues, but is also relied upon heavily by our chosen parser, a recursive descent parser, for its implementation.
To define the language we are using formal grammars. Formal grammars are a way to represent the language as a set of rules. In order to fully represent the language we are implementing, we use a context-free grammar, a type 2 grammar in the Chomsky hierarchy. Context-free grammars are formal grammars whose rules take the form
A $\rightarrow$ t
where A is a nonterminal and t is a series of terminals and nonterminals. Nonterminals are references to other rules in the language, and a terminal is an end point in the language. Such grammars are context-free because they do not look at the context surrounding the nonterminal.
In order to define the language in a readable way, we have used a slightly adjusted form of the Backus-Naur form (BNF).
\begin{verbatim}<program> ::= <statement>*
\end{verbatim}
This example means that a program is defined as some number of statements. Here, we have used the '*' symbol to mean one or more, and surrounded nonterminals with '<' and '>'. Terminals, such as keywords, identifiers or numbers, are surrounded with '"'. If a nonterminal has multiple possible rules, we have separated them using '|', meaning "or".
\newpage
\subsection{Statements Grammar}
\begin{table}[!htb]
\begin{center}
\begin{tabular}{|l|r|}
\hline
\textbf{Abbreviation} & \textbf{Term}\\
\hline
$<$program$>$ ::= & $<$function-call$>$* \\
\hline
$<$function-call$>$ ::= & $<$main-function$> |$ \\
& $<$function$> |$ \\
& $<$subroutine$>$\\
\hline
$<$main-function$>$ ::= & "program" $<$identifier$>$ $<$block$>$ "program "$<$identifier$>$\\
\hline
$<$function$>$::=&$<$type$>$$<$identifier$>$"("$<$expression$>$","*")"$<$block$>$\\
\hline
$<$subroutine$>$::=&$<$identifier$>$"("$<$expression$>$","*")"$<$block$>$\\
\hline
$<$block$>$::=& $<$statement$>$* "end" $|$\\
& $<$statement$>$* $<$return$>$ "end" $|$\\
\hline
$<$statement$>$ ::= & $<$declaration$>$ $|$ \\
& $<$expr-Statement$>$ $|$\\
& $<$print-Statement$>$ $|$ \\
& $<$if-Statement$>$ $|$\\
& $<$do-Statement$>$ $|$\\
& $<$do-while-Statement$>$ \\
\hline
$<$declaration$>$ ::= & "character (len = "$<$number$>$")::"$<$identifier$>$ $|$ \\
& "int::"$<$identifier$>$ $|$\\
& "real::"$<$identifier$>$ \\
& $<$type$>$ "dimension("$<$expression$>$*")"::$<$identifier$>$\\
\hline
$<$print-Statement$>$ ::= & "print *" (","$<$expression$>$)* \\
\hline
$<$if-Statement$>$ ::= & "if ("$<$expression$>$") then" $<$block$>$ \\
& ("else" $<$block$>$)?\\
& if"\\
\hline
$<$do-Statement$>$ ::= & "do" $<$identifier$>$ "=" $<$number$>$","$<$number$>$(","$<$number$>$)?\\
&$<$block$>$ \\
& "do"\\
\hline
$<$do-while-Statement$>$ ::= & "do while ("$<$expression$>$")"\\
& $<$block$>$ \\
& "do"\\
\hline
$<$return$>$ ::= & "return" $<$expression$>$\\
\hline
$<$expr-statement$>$ ::= & $<$expression$>$\\
\hline
\end{tabular}
\label{tab:table1}
\caption{Grammar table for EsotericFortran Statements}
\end{center}
\end{table}
\newpage
\subsection{Expression Grammar}
\begin{table}[!htb]
\begin{center}
\begin{tabular}{|l|r|}
\hline
\textbf{Abbreviation} & \textbf{Term}\\
\hline
\hline
$<$expression$>$ ::= & $<$assignment$>$\\
\hline
$<$assignment$>$ ::= & $<$identifier$>$"="$<$equality$>$$|$\\
& $<$equality$>$$|$\\
\hline
$<$equality$>$ ::= & $<$logical$>$("=="$|$"/=")$<$logical$>$$|$\\
& $<$logical$>$$|$\\
\hline
$<$logical$>$ ::= & $<$comparison$>$(".and."$|$".or.")$<$comparison$>$$|$\\
& $<$".not."$>$$<$comparison$>$$|$\\
& $<$comparison$>$$|$\\
\hline
$<$comparison$>$ ::= & $<$term$>$("$>$"$|$"$<$"$|$"$>=$"$|$"$<=$")$<$term$>$$|$\\
& $<$term$>$$|$\\
\hline
$<$term$>$ ::= & $<$factor$>$("+"$|$"-")$<$factor$>$$|$\\
& $<$factor$>$$|$\\
\hline
$<$factor$>$ ::= & $<$exponent$>$("*"$|$"/")$<$exponent$>$$|$\\
& $<$exponent$>$$|$\\
\hline
$<$exponent$>$ ::= & $<$primary$>$"**"$<$primary$>$$|$\\
& $<$primary$>$$|$\\
\hline
$<$primary$>$ ::= & $<$number$>$$|$\\
& $<$identifier$>$$|$\\
& $<$string$>$$|$\\
& $<$identifier$>$"("$<$expression$>$","*")"\\
\hline
$<$number$>$ ::= & $<$digit$>$*("."$<$digit$>$*)?\\
\hline
$<$identifier$>$ ::= & $<$alpha$>$*\\
\hline
$<$string$>$ ::= & """$<$alpha$>$*"""\\
\hline
$<$digit$>$::=&0\textbar1\textbar2\textbar3\textbar4\textbar5\textbar6\textbar7\textbar8\textbar9 \\
\hline
$<$alpha$>$::=&"a"..."z"$|$"A"..."Z"\\
\hline
$<$type$>$::=&"int"$|$"real"$|$"string"\\
\hline
\end{tabular}
\label{tab:table2}
\caption{Grammar table for EsotericFortran Expressions}
\end{center}
\end{table}
\chapter{Documentation}\label{MethLab}
We decided it would be a good idea to record our application stage by stage while developing our Esolang, so that we may refer back to previous parts for reference. It also helps track our progress throughout the project to ensure we stay on track.
\section{Language.java}
Of course, the very first thing required when translating any kind of code into execution is the ability to read it, either from the command line or from a file. In this case, we use \texttt{Language.java} to divide the source text line by line, essentially a while loop running until end-of-file (EOF) to read everything from the file. Our program keeps dividing the source text into lines, then into characters, and then starts building tokens, expressions and strings until it can recognise commands. \newline
\begin{figure}[h!]
\begin{center}
\includegraphics{Language.JPG}
\caption{A series of steps to examine source text by line using \texttt{while}}
\end{center}
\end{figure}
\section{TokenScanner.java \& TokenType.java}
This begins the lexical analysis of our program, examining each string character by character and identifying what type of symbol it is, whether it is text or something further such as an operator. When the program reads a particular character, such as ``+'', it needs to know firstly that it is an operator and secondly that it is a plus; we can then start to identify left and right expressions to calculate arithmetic, which is covered in more detail later in the document alongside the sprints and implementation. Beyond the addition operator, we have also prepared for ``-'' (subtract), and a lookahead function, since ``=='' is conveyed differently to ``=''. All of these are converted to tokens one by one before being sent to our parser algorithms.
\begin{figure}[h!]
\begin{center}
\includegraphics{TokenScanner.JPG}
\caption{An example of using \texttt{switch} to identify operands, including looking ahead for multiple character tokens}
\end{center}
\end{figure}
In the image above we have identified a means of using switch statements to recognise strings and turn them into fixed tokens. This is also why we use \texttt{TokenType.java}, which uses a fixed number of enums so the program knows exactly what it has read. In this case, for example, case `*' maps to \texttt{TokenType.STAR}, and so on. As mentioned, some tokens are multiple characters long, and so we have further cases to handle these. They usually consist of double equals (==) or comparison operators such as $<$= and $>$=.
\section{Parser.java}
This class is the primary code of our program. It contains everything we need to start recognising the code given and actually \emph{do} something with it. It takes a list of tokens from our analyser and goes through our list of defined grammar rules, attempting to work out what each token translates to, essentially breaking down each line to find out what it is. Is it an if statement? Is it a print? Is it an expression or arithmetic equation? We have many conditions in this class, and each one is tested until one of them is met, denoting the result. Each possible grammar rule must match one of the conditions in the parser; otherwise it does not make sense and the code may fail to compile. We have supported this via exceptions and default-case error messages. \newline
Conditions could include ``have they written \texttt{int} before this keyword?'', ``is this a single colon or a double colon?'' or ``is there a string after said colons?'' If all of these conditions are met, we can identify that a variable can be declared, for example.
\begin{figure}[h!]
\begin{center}
\includegraphics{ParserStatement.jpg}
\caption{If the next line is a statement, this function identifies which statement it is using the enums from \texttt{TokenType.java}}
\end{center}
\end{figure}
The above code represents our decision making when it comes to identifying statements. We have two primary algorithms in this class: \texttt{matchAndAdvance()} and \texttt{matchOrError()}.
\subsection{matchAndAdvance()}
We have a function which tries to match the grammar of each token, and executes code depending on whether all the conditions were met. \texttt{matchAndAdvance()} takes a TokenType value and checks whether the next token matches what was expected. If so, it advances and keeps iterating until the end of the file is reached. If not, it returns, since something unexpected occurred. An example of where this function is used is in our statement methods, where we check whether the following code is a statement or not. This is shown below, in Figure 5.3.
\subsection{matchOrError()}
Our \texttt{matchOrError()} function behaves very similarly to the previously mentioned \texttt{matchAndAdvance()}, except that if something unexpected happens during the program's analysis, it throws an error and displays it on the UI of our program, giving us feedback on what is causing the problem in our code.
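The two helpers can be sketched as follows. This is a hedged reconstruction of the behaviour described above; the real signatures in \texttt{Parser.java} may differ:

```java
import java.util.List;

// Illustrative reconstruction of the two parser helpers described above;
// token types are represented as strings here for brevity.
class MatchHelpers {
    private final List<String> tokens;
    private int current = 0;
    MatchHelpers(List<String> tokens) { this.tokens = tokens; }

    // Consumes the next token only if it has the expected type.
    boolean matchAndAdvance(String expected) {
        if (current < tokens.size() && tokens.get(current).equals(expected)) {
            current++;
            return true;
        }
        return false;
    }

    // Like matchAndAdvance, but a mismatch is a syntax error.
    void matchOrError(String expected, String message) {
        if (!matchAndAdvance(expected)) {
            throw new RuntimeException(message);
        }
    }
}
```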
\section{Expression.java \& Statement.java}
Programs are generally made up of expressions and statements. This is why we made the abstract classes Expression.java and Statement.java. In our Expression class, we create a binary expression, in which left and right objects are created, separated by an operator. This could, for example, be \texttt{5 * 4 + 3 * 6}. Here, our left and right operands are 5 * 4 and 3 * 6, separated by a PLUS token type. These can be further broken down into the numbers separated by the multiplication sign. Our Statements are lists of blocks which have a certain type, whether they are Prints or conditionals, or even iterables such as Whiles. An expression is also a statement, so we also accounted for having an ExpressionStatement. Ultimately, we have many functions in these abstract classes which return certain types for each statement or expression, and this can be used for identification when our Parser eventually reads input.
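The binary-expression structure can be sketched as a small class hierarchy. The names below are illustrative, simplified from the real Expression.java:

```java
// Minimal sketch of the binary-expression idea: left and right subtrees
// separated by an operator token. Names are illustrative.
abstract class Expr {
    abstract double evaluate();
}

class NumberExpr extends Expr {
    final double value;
    NumberExpr(double value) { this.value = value; }
    double evaluate() { return value; }
}

class BinaryExpr extends Expr {
    final Expr left, right;
    final String operator; // e.g. "PLUS" or "STAR"
    BinaryExpr(Expr left, String operator, Expr right) {
        this.left = left; this.operator = operator; this.right = right;
    }
    double evaluate() {
        double l = left.evaluate(), r = right.evaluate();
        return operator.equals("PLUS") ? l + r : l * r;
    }
}
```

With this structure, \texttt{5 * 4 + 3 * 6} becomes a PLUS node whose children are two STAR nodes, and evaluating the root gives 38.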
\section{Translator.java \& ExecuteC.java}
Our translator file takes the list of statements previously analysed by our chain of lexer and parser, and converts them one by one into C code, which can then be executed as normal to provide an output. We store these statements in an ArrayList and iterate over them using functions such as compileToC(), where each statement is evaluated and added to a text file which is eventually printed to the UI.
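The translation loop can be sketched as below. Here compileToC() is a simplified stand-in for the real method, and the statements are represented as already-translated C fragments for brevity:

```java
import java.util.List;

// Sketch of the translation loop: each analysed statement is converted to
// a line of C and appended to an output buffer, which is then written out.
class MiniTranslator {
    static String compileToC(List<String> statements) {
        StringBuilder out = new StringBuilder("#include <stdio.h>\nint main(){\n");
        for (String stmt : statements) {
            out.append("    ").append(stmt).append(";\n"); // one C line per statement
        }
        return out.append("}\n").toString();
    }
}
```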
\section{Agile \& Project Management}
We worked well as a team, and there were few issues. We decided to use the Agile method since it would lead not only to increased team communication, but to better organisation when it came to our code. It also meant our development was flexible; we could choose what we wanted to work on at a given time with the use of sprints. \newline
During our allocated project time, we completed four major sprints. Our initial priority was creating the main framework for the compiler of our esoteric language. We then handled variables and print statements, then implemented conditionals and loops, and finally tackled arrays and functions. Once we had tied up these loose ends, we could make a full Sieve of Eratosthenes algorithm to show the capability of the program. \\
\subsection{Sprint 1: Creating the framework for the compiler}
\begin{itemize}
\item Language.java
\item Token.java
\item TokenScanner.java
\item TokenType.java
\end{itemize}
\subsection{Sprint 2: Variables and Print statements}
\begin{itemize}
\item Parser.java
\item Expression.java
\item Statement.java
\end{itemize}
\subsection{Sprint 3: Conditionals and loops}
\begin{itemize}
\item Parser.java
\item Utils.java
\end{itemize}
\subsection{Sprint 4: Arrays \& functions}
\begin{itemize}
\item Parser.java
\item Environment.java
\item ExecuteC.java
\item Translator.java
\end{itemize}
\chapter{Implementation}\label{Impl}
The long string following each sprint number is the git commit hash in the repository at which the tests were conducted.
\section{Early sprints}
\subsection{Sprint 1 - \texttt{cb29252f1e0d29d555fb232f39d343930fc76105}}
The first sprint focused on developing the initial stages of the compiler. The main goal of the first sprint was to create and define the grammar for our custom esoteric language, then to create a working program to handle simple expressions, first applying lexical analysis, then parsing. The first section of code implemented was the framework for the compiler, handling the user's input and linking each section together. We decided to give the user the choice of either running the compiler with a single line of code, or running from a file containing source code, with the path given as an argument. Once completed, work on the lexical scanner could begin. The bulk of the scanner class is a function which takes the source code string as an argument and returns an ArrayList of tokens. This was done by looping through each character and assigning the correct token using a switch statement. The parser then takes this list of tokens and forms an abstract syntax tree, where each type of expression in our grammar is represented as its own class, extending a base expression class.
For example:
\begin{verbatim}
Code: 3-1+2
\end{verbatim}
Produces the output:
\begin{verbatim}
NUMBER 3 3.0
MINUS - null
NUMBER 1 1.0
PLUS + null
NUMBER 2 2.0
\end{verbatim}
A more complex example:
\begin{verbatim}
Code: (36/2 + 45.2) * 3
\end{verbatim}
Produces the result:
\begin{verbatim}
LEFT_PAREN ( null
NUMBER 36 36.0
SLASH / null
NUMBER 2 2.0
PLUS + null
NUMBER 45.2 45.2
RIGHT_PAREN ) null
STAR * null
NUMBER 3 3.0
\end{verbatim}
\subsection{Sprint 2 - \texttt{69b0ad07bac30beca1397ff187468e7597203c44}}
The goal of sprint 2 was to introduce statements to the language, starting with the simple print statement. The secondary goal was to add variable definition and assignment. Firstly, text scanning was added to the lexer by first matching an alphabetic character, then looking ahead until the last character. The extracted text was compared to a list of keywords, and either assigned a matching keyword or set as an identifier. The parser was also improved, changing the return result to a list of statements. In order to more accurately check for undefined variables, an environment class was implemented, which stores any defined variables. The environment is then checked when a variable is used, which allows for more informative error messages.
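A minimal sketch of such an environment class (the names are illustrative, not our actual Environment.java) might be:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an environment of the kind described: defined variables are
// recorded, and use of an undefined name is reported as an error.
class MiniEnvironment {
    private final Map<String, Double> values = new HashMap<>();

    void define(String name, double value) {
        values.put(name, value);
    }

    double get(String name) {
        if (!values.containsKey(name)) {
            throw new RuntimeException("Variable undefined");
        }
        return values.get(name);
    }
}
```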
For example, running \texttt{java Interpreter.Language example2.txt} where \texttt{example2.txt} is the following:
\begin{verbatim}
var :: a
a=5
a=a+1
print a
\end{verbatim}
Produces the result:
\begin{verbatim}
6.0
\end{verbatim}
Testing involved making sure the program worked with the rules of left-associativity; for example, the program:
\begin{verbatim}
var :: a
a=3-1+2
print a
\end{verbatim}
Produces the result \texttt{4.0}.
We also tested BIDMAS rules by using complex expressions:
\begin{verbatim}
var :: a
a=(36/2 + 45.2) * 3
print a
\end{verbatim}
Returns \texttt{189.60000000000002}, which is correct given the inaccuracy of floating-point arithmetic in computers. Finally, a test of a more complex program:
\begin{verbatim}
var :: a
a=5
a=a+1
print a
a=7
a=a*2
print a
var :: b
b = 10
print a+b
\end{verbatim}
Produces the result:
\begin{verbatim}
6.0
14.0
24.0
\end{verbatim}
You can test the error handling in several ways. For example the program
\begin{verbatim}
a=3
print a
\end{verbatim}
Uses an undefined variable \texttt{a}. The output of running this is:
\begin{verbatim}
An error was encountered
Variable undefined
\end{verbatim}
You can also try and run a program with a syntax error, for example:
\begin{verbatim}
var :: a
a=3+
print a
\end{verbatim}
Produces:
\begin{verbatim}
An error was encountered
Expected Expression
\end{verbatim}
This is how it reacts to using an undefined token:
\begin{verbatim}
var :: a
a=3
pprint a
\end{verbatim}
Produces:
\begin{verbatim}
An error was encountered
Undefined Variable
\end{verbatim}
\subsection{Sprint 3 - \texttt{a12094123dcacee41a7472031db6fe6027b083a7}}
This version added full compilation to a binary by translating the program to C and immediately compiling it with gcc. It also added strings as lists of characters, and simple if statements. For example, running \texttt{java Compiler.Language example.txt} where \texttt{example.txt} is the program:
\begin{verbatim}
character (len=10)::hello
hello="hello"
if 5==5 then
hello="goodbye "
endif
print *,hello,6," test" endprint
\end{verbatim}
Produces, in the \texttt{build/} folder, \texttt{example.c}:
\begin{verbatim}
#include <stdio.h>
#include <string.h>
int main(){
char hello[11];
strcpy(hello,"hello");
if(5==5){
strcpy(hello,"goodbye ");
}
printf("%s%d%s",hello,6," test");
}
\end{verbatim}
It also compiles this file automatically, checking for errors during the process, and produces the executable binary \texttt{example.exe} on Windows, or \texttt{./example} on a GNU/Linux distribution. It then runs the binary straight away, checking for runtime errors. The output of running this binary is:
\begin{verbatim}
goodbye 6 test
\end{verbatim}
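The compile-and-run step can be sketched in Java as follows (a simplified illustration; the paths, flags and class names here are assumptions, not the project's exact invocation):

```java
import java.io.File;
import java.io.IOException;

// Sketch: invoke gcc on the generated C file, then run the resulting
// binary, checking the exit code of each step for errors.
public class CompileAndRun {
    static int run(String... command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command).inheritIO().start();
        return p.waitFor();  // a non-zero exit code signals an error
    }

    public static void main(String[] args) throws Exception {
        String source = "build/example.c";
        String binary = System.getProperty("os.name").startsWith("Windows")
                ? "build/example.exe" : "build/example";
        if (run("gcc", source, "-o", binary) != 0) {
            System.err.println("An error was encountered during compilation");
            return;
        }
        if (run(new File(binary).getAbsolutePath()) != 0) {
            System.err.println("Runtime Error");
        }
    }
}
```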
Here's an example of a program which produces a runtime error:
\begin{verbatim}
character (len=3)::hello
hello="hello"
if 5==5 then
hello="goodbye "
endif
print *,hello,6," test" endprint
\end{verbatim}
It produces a runtime error since the character variable is too short to hold the string. Running it gives:
\begin{verbatim}
An error was encountered
Runtime Error
\end{verbatim}
This version also added better error messages. For example, \texttt{hello="hello} produces:
\begin{verbatim}
An error was encountered
Strings must end with "
\end{verbatim}
Moreover, the program
\begin{verbatim}
character (len=10)::hello
hello="hello"
if 5==5 then
hello="goodbye "
print *,hello,6," test" endprint
\end{verbatim}
Produces the error message:
\begin{verbatim}
An error was encountered
endif missing
\end{verbatim}
Finally, let's test if statements with another program:
\begin{verbatim}
character (len=10)::hello
hello="hello"
if 4==5 then
hello="goodbye "
endif
print *,hello,6," world" endprint
\end{verbatim}
Produces the generated C file, followed by the output of running the binary:
\begin{verbatim}
#include <stdio.h>
#include <string.h>
int main(){
char hello[11];
strcpy(hello,"hello");
if(4==5){
strcpy(hello,"goodbye ");
}
printf("%s%d%s",hello,6," world");
}
hello6 world
\end{verbatim}
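The translation of print statements seen above (\texttt{\%s} for strings and character variables, \texttt{\%d} for integer literals) can be sketched as follows (a hypothetical helper, not the compiler's actual code):

```java
import java.util.List;

// Sketch: build a printf call from a list of print arguments, choosing
// %d for integer literals and %s for everything else.
public class EmitPrint {
    static String emitPrint(List<Object> args) {
        StringBuilder fmt = new StringBuilder();
        StringBuilder vals = new StringBuilder();
        for (Object arg : args) {
            fmt.append(arg instanceof Integer ? "%d" : "%s");
            vals.append(',').append(arg.toString());
        }
        return "printf(\"" + fmt + "\"" + vals + ");";
    }

    public static void main(String[] args) {
        // Arguments corresponding to: print *,hello,6," test" endprint
        System.out.println(emitPrint(List.of("hello", 6, "\" test\"")));
        // printf("%s%d%s",hello,6," test");
    }
}
```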
\section{Final implementation}
\chapter{Testing}
The testing was done in two main stages: first, unit tests after each sprint checked the correctness of the additions made; then, large-scale tests checked that the compiler works as expected, is Turing complete, and can handle arbitrary inputs.
\section{Sprint testing}
The testing that occurred after each sprint was done in two main stages. Firstly, the lexical scanner was unit tested independently of the rest of the code; then, once we were confident that the scanner was working correctly, we tested the scanner and parser together. Due to the nature of the parser's output, we utilised a simple tree-walk interpreter for the first two sprints, then the full compiler for the later sprints. The results of each test are shown in the tables below, giving the reason for the test, the input and output, and whether the test was successful.
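A tree-walk interpreter of the kind used for the early sprints can be sketched as follows (illustrative AST classes, not the project's actual implementation):

```java
// Sketch of a tree-walk interpreter: each AST node evaluates itself by
// recursively evaluating its children.
public class TreeWalkSketch {
    interface Expr { double eval(); }

    record Literal(double value) implements Expr {
        public double eval() { return value; }
    }

    record Binary(Expr left, char op, Expr right) implements Expr {
        public double eval() {
            double l = left.eval(), r = right.eval();
            return switch (op) {
                case '+' -> l + r;
                case '-' -> l - r;
                case '*' -> l * r;
                case '/' -> l / r;
                default -> throw new RuntimeException("Syntax Error");
            };
        }
    }

    public static void main(String[] args) {
        // 5*(3+1)-9/3, the full-expression test from the Sprint 1 table
        Expr expr = new Binary(
            new Binary(new Literal(5), '*',
                       new Binary(new Literal(3), '+', new Literal(1))),
            '-',
            new Binary(new Literal(9), '/', new Literal(3)));
        System.out.println(expr.eval()); // 17.0
    }
}
```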
\subsection{Sprint 1 testing}
\begin{table}[h]
\caption[]{Sprint 1 Tests and their results}
\begin{center}
\begin{tabular}{|l|l|l|l|}\hline
Reason & Input & Output & Success \\ \hline
Basic expression scanning & 1+1 & \parbox{2.5cm}{NUMBER 1 \\ PLUS + \\ NUMBER 1 \\EOF} & True \\ \hline
Decimal number scanning & 1.5 & \parbox{2.5cm}{NUMBER 1.5\\EOF} & True \\ \hline
Multiple character token scanning & 4==6 & \parbox{3.5cm}{NUMBER 4\\EQUALITY ==\\NUMBER 6\\EOF} & True \\ \hline
Invalid token character scanning & \# & \parbox{3cm}{Unexpected Character Error} & True \\ \hline
Literal expression parsing & 5 & 5 & True \\ \hline
Binary expression parsing & 1+1 & 2 & True \\ \hline
Bracketed expression parsing & (2*3) & 6 & True \\ \hline
Comparison parsing& 4$>$3 & true & True \\ \hline
Invalid syntax checking& 5+(3+2 & Error: Expected ')' & True \\ \hline
Invalid expression checking& 5++1 & Error: Syntax Error & True \\ \hline
Full expression test& 5*(3+1)-9/3 & 17 & True \\ \hline
\end{tabular}
\label{sprint1Test}
\end{center}
\end{table}
\newpage
\subsection{Sprint 2 testing}
\begin{table}[h]
\caption[]{Sprint 2 Tests and their results}
\begin{center}
\begin{tabular}{|l|l|l|l|}\hline
Reason & Input & Output & Success \\ \hline
Keyword scanning & print 5 & \parbox{2.5cm}{PRINT print \\ NUMBER 5 \\ EOF} & True \\ \hline
Identifier scanning & test & \parbox{3.5cm}{IDENTIFIER test\\EOF} & True \\ \hline
Print Expression test & print 5 & \parbox{3cm}{5} & True \\ \hline
Variable test & \parbox{3cm}{var::a\\a=5\\print a} & \parbox{3cm}{5} & True \\ \hline
Undefined variable test & \parbox{3cm}{a=5\\print a} & Error: Undefined Variable & True \\ \hline
Invalid syntax error & \parbox{3cm}{a=5\\prit 5} & Error: Syntax Error & True \\ \hline
\end{tabular}
\label{sprint2Test}
\end{center}
\end{table}
\subsection{Sprint 3 testing}
\begin{table}[h]
\caption[]{Sprint 3 Tests and their results}
\begin{center}
\begin{tabular}{|l|l|l|l|}\hline
Reason & Input & Output & Success \\ \hline
If statement scanning & \parbox{3.5cm}{if 5==5 then\\endif} & \parbox{4cm}{IF \\ NUMBER 5 \\ EQUALS == \\ NUMBER 5\\ THEN\\ ENDIF \\EOF} & True \\ \hline
Print statement scanning & print*,5 endprint & \parbox{3.5cm}{PRINT\\STAR *\\COMMA ,\\NUMBER 5\\ENDPRINT\\EOF} & True \\ \hline
String scanning test & "test" & \parbox{3cm}{STRING test} & True \\ \hline
If statement parsing & \parbox{3cm}{int::a\\if 5==5 then\\a=3\\endif\\print*, a endprint} & \parbox{3cm}{3} & True \\ \hline
String parsing & \parbox{3cm}{print*,"hello world" endprint} & \parbox{3cm}{hello world} & True \\ \hline
Multiple print values & \parbox{3cm}{print*,5,"hello",6} & \parbox{3cm}{5hello6} & True \\ \hline
\end{tabular}
\label{sprint3Test}
\end{center}
\end{table}
\chapter{Discussion, conclusion and future work}
Overall, we are happy with the end product produced during this project. We have created a working, capable compiler for our modified version of Fortran. From our initial MoSCoW analysis in Chapter 1, we have achieved all of the "must have" requirements: the language is Turing complete and can detect a wide variety of both syntax and runtime errors. From the "should have" section, we have functionality to read from files, along with all of the additional features. However, since we decided to compile to C, we excluded optimisation methods, as the C compiler performs these anyway. From the "could have" section, we implemented the additional features described, such as functions, but decided against direct memory allocation as our language is still reasonably simple, so the memory handling from C suffices. Finally, even though we said the end product would not have a GUI, we decided to implement a simple interface to test the compiler and more clearly show the final product.
Future work would include expanding the language to cover further Fortran features. Most obviously, we would develop string handling further to include string indexing and slicing. We would also implement additional input and output features, including reading from files. Other features to implement would include modules, pointers and some built-in functions. Finally, we would further improve the error handling, detecting more errors before the code is compiled to C and displaying additional error information.
\bibliographystyle{unsrt}
\bibliography{References}
\chapter*{Contributions}
\chapter*{Appendix A}
\begin{figure}[h!]
\begin{center}
\includegraphics[scale=0.5]{UMLsprint1.pdf}
\caption{UML diagram for Sprint 1}
\end{center}
\end{figure}
\begin{figure}[h!]
\begin{center}
\includegraphics[scale=0.35]{UMLsprint2.pdf}
\caption{UML diagram for Sprint 2}
\end{center}
\end{figure}
\begin{figure}[h!]
\begin{center}
\includegraphics[scale=0.35]{UMLsprint3.pdf}
\caption{UML diagram for Sprint 3}
\end{center}
\end{figure}
\begin{figure}[h!]
\begin{center}
\includegraphics[scale=0.35]{UMLsprint4.pdf}
\caption{UML diagram for Sprint 4, equivalent to the final compiler design}
\end{center}
\end{figure}
\end{document}
|