Comment Filter
==============

[![CircleCI](https://circleci.com/gh/codeauroraforum/comment-filter.svg?style=svg)](https://circleci.com/gh/codeauroraforum/comment-filter)

A Python library and command-line utility that filters comments from a source
code file, replacing each non-comment character with a space.

When run from the command line, the `comments` utility generates an output
file in which any comment text remains at the same line and column as in the
input file.

The Python library provides one function `parse_file()`, which streams the
input file and returns the filtered file contents via a generator that yields
one line at a time.

Both interfaces support an 'only code' option that inverts the functionality
such that comment text is replaced by spaces and the code text is preserved in
its original location.

There is also a 'no tokens' option that preserves the comment text but
replaces the comment tokens (in addition to the code text) with spaces.


hello.c:

```c
/* multi-line
   comment */
// single-line comment

int main() {
  return 0;
}
```

Example of getting the comments in a C file:

```bash
$ comments hello.c
/* multi-line
   comment */
// single-line comment
```

When filtering for comments, any character that is not part of a comment is
replaced with a space.

To get comments without the comment tokens:

```bash
$ comments --notokens hello.c
   multi-line
   comment
   single-line comment
```

Filter out the comments:

```bash
$ comments --onlycode hello.c




int main() {
  return 0;
}
```


Python library
--------------

Alternatively, one can use the provided Python library directly.  It provides
one function `parse_file()`, which streams the input file and returns
the filtered file via a generator.  The generator yields one line at a time.
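
The exact signature of `parse_file()` is not shown in this README, so the
sketch below does not call it.  Instead it defines a hypothetical stand-in,
`filter_line_comments()`, to illustrate the shape of the interface: a
generator that consumes an iterable of lines and yields one filtered line at a
time, with every surviving character left in its original column.  The
function name, its parameters, and the `//`-only handling are assumptions made
for illustration; string literals and multi-line comments are ignored here.

```python
import io


def filter_line_comments(lines, start="//", only_code=False):
    """Yield one filtered line per input line, keeping columns aligned.

    Hypothetical stand-in for illustration; this is not parse_file().
    """
    for line in lines:
        body = line.rstrip("\n")
        newline = line[len(body):]
        index = body.find(start)
        if index == -1:
            index = len(body)  # no comment on this line
        code, comment = body[:index], body[index:]
        if only_code:
            # Keep the code, blank out the comment.
            yield code + " " * len(comment) + newline
        else:
            # Keep the comment, blank out the code.
            yield " " * len(code) + comment + newline


# Any iterable of lines works, including an open file object.
source = io.StringIO("int x = 1; // one\nreturn x;\n")
for filtered in filter_line_comments(source):
    print(filtered, end="")
```

Because the result is a generator, the caller never needs to hold the whole
file in memory; each input line is read, filtered, and handed back before the
next one is requested.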


Implementation Notes
--------------------

A challenging requirement is that the parser is only fed one line at a time.
This means that we cannot leverage most Python parsing libraries, including
PyParsing, PyPEG, or even the Haskell Parsec-inspired funcparserlib.  Instead,
we need stream parsing combinators, like those provided by Haskell's Conduit
or Iteratee.  But in Python, and for this small parser, implementing that
infrastructure seemed like overkill.  Unlike Haskell, Python cannot optimize
out the additional abstraction layer.  So this library implements streaming,
recursive-descent parsers by hand.  Lots of ugly noise in the code, but lots
and lots of unit tests to keep complexity under control.
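
To make the one-line-at-a-time constraint concrete, here is a minimal sketch
of a hand-written streaming filter.  It is not the library's code: the name
`filter_block_comments()` is hypothetical, it only understands a single
multi-line comment form, and it ignores nesting, line comments, and string
literals.  The point is the `in_comment` flag, which is the parser state that
has to survive from one line to the next.

```python
def filter_block_comments(lines, start="/*", end="*/"):
    """Yield lines with all text outside start...end comments blanked out.

    Illustrative sketch only; assumes non-nested comments.
    """
    in_comment = False  # state carried across line boundaries
    for line in lines:
        body = line.rstrip("\n")
        newline = line[len(body):]
        out = []
        i = 0
        while i < len(body):
            if in_comment:
                j = body.find(end, i)
                if j == -1:
                    # Comment continues past this line: keep the rest.
                    out.append(body[i:])
                    i = len(body)
                else:
                    out.append(body[i:j + len(end)])
                    i = j + len(end)
                    in_comment = False
            else:
                j = body.find(start, i)
                if j == -1:
                    # Only code remains on this line: blank it out.
                    out.append(" " * (len(body) - i))
                    i = len(body)
                else:
                    out.append(" " * (j - i))
                    out.append(start)
                    i = j + len(start)
                    in_comment = True
        yield "".join(out) + newline
```

A comment opened on one line and closed on the next is handled without ever
looking at both lines together, which is the property that rules out parsers
expecting the whole input up front.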


Grammar
-------

```antlr
file                 : declaration* ;

declaration          : line_comment | multiline_comment | code ;

line_comment         : line_comment_start (~endl)* endl ;

multiline_comment    : multiline_comment_start multiline_contents multiline_comment_end ;
multiline_contents   : (multiline_char | multiline_comment)* ;
multiline_char       : ~(multiline_comment_start | multiline_comment_end) ;

code                 : (string_literal | code_char)* ;
code_char            : ~(string_literal_start | line_comment_start | multiline_comment_start) ;

string_literal       : string_literal_start string_literal_char* string_literal_end ;
string_literal_char  : escape_char (string_literal_start | string_literal_end)
                     | ~(escape_char | string_literal_end) ;
```

The syntax for the following tokens is provided by the `language` module:

  * line_comment_start
  * multiline_comment_start
  * multiline_comment_end
  * string_literal_start
  * string_literal_end
  * escape_char
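
How the `language` module actually stores these tokens is not described here.
As one possible shape, the sketch below groups the six tokens in a
hypothetical `Language` named tuple and uses them to scan a string literal
according to the `string_literal_char` rule.  The names `Language`, `C_LIKE`,
and `string_literal_end_index()` are assumptions for illustration, and the
scanner skips any single character after an escape, which is slightly more
permissive than the grammar above.

```python
from collections import namedtuple

# Hypothetical container for the per-language tokens listed above.
Language = namedtuple("Language", [
    "line_comment_start",
    "multiline_comment_start",
    "multiline_comment_end",
    "string_literal_start",
    "string_literal_end",
    "escape_char",
])

# Assumed token values for a C-like language.
C_LIKE = Language("//", "/*", "*/", '"', '"', "\\")


def string_literal_end_index(text, begin, lang):
    """Return the index just past the literal that opens at `begin`.

    Returns -1 if the literal does not close within `text`.
    """
    i = begin + len(lang.string_literal_start)
    while i < len(text):
        if text.startswith(lang.escape_char, i):
            # Skip the escape and the character it protects.
            i += len(lang.escape_char) + 1
        elif text.startswith(lang.string_literal_end, i):
            return i + len(lang.string_literal_end)
        else:
            i += 1
    return -1
```

Skipping string literals this way is what keeps a comment token inside a
string, such as `"// not a comment"`, from being treated as a comment.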


Recognized Languages
--------------------

  * C
  * C++
  * Go
  * Haskell
  * Java
  * Lua
  * Python
  * Perl
  * Ruby


Developing
----------

This assumes the following are installed and in your system path:

   * Python 2.7.x OR Python 3.4.x
   * tox

To build and test, run `tox`.

```bash
$ tox
```

To remove all files ignored by git (for example, build and test artifacts):

```bash
$ git clean -Xdf
```