CodeParser Release Notes

0.7

Jan 9, 2019

Separate Concrete Syntax Trees and Abstract Syntax Trees.

Moved unused generated files out of paclet layout.

Introduced expression depth and expression breadth warnings.

Added wl-ast as a Paclet Resource.

Added notes in README.md about Antivirus problems on Windows.

Added TernaryOperatorsToSymbol
Renamed LinearSyntaxBang -> PrefixLinearSyntaxBang
Added PrefixInvisiblePrefixScriptBase
Added PostfixInvisiblePostfixScriptBase
Renamed InfixImplicitPlus -> BinaryImplicitPlus
Added BinaryInvisibleTimes

Added WLCharacter

brenton2maclap:MacOSX-x86-64 brenton$ ./wl-ast -format characters
>>> 1+1
{
WLCharacter[49, "1", <|Source->{{1, 1}, {1, 1}}|>],
WLCharacter[43, "+", <|Source->{{1, 2}, {1, 2}}|>],
WLCharacter[49, "1", <|Source->{{1, 3}, {1, 3}}|>],
WLCharacter[-1, "", <|Source->{{2, 0}, {2, 0}}|>],
Nothing
}
brenton2maclap:MacOSX-x86-64 brenton$

Added logic to use PacletResources to find wl-ast executable.

Added ConcreteParse functions and start separating out concrete and abstract parse trees.

Added warning for input like:

1.2.3

0.8

Feb 4, 2019

Correct concrete and abstract (Derivative) parsing

Abstract parsing errors for a b c and a b c

Correctly abstract a b

Add a stop-gap for parsing large expressions containing + and -
The concrete syntax tree originally treated runs of + and runs of - as separate infix nodes, which were then combined when abstracted. This is a problem for expressions that alternate heavily between + and -, e.g., a + b - c + d - e + f - g ... Such input creates a deeply nested expression, and the internal call to ToExpression then fails, even though the kernel can parse the original expression. The limitation of ToExpression is understood. The stop-gap treats a + b - c + d - e + f - g ... as a single infix node, with InternalMinusNode for minus nodes. Once we move to something like LibraryLink / WSTP, we can go back to separate parse trees for + and -.
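The nesting problem can be illustrated with a minimal Python sketch. This is not the CodeParser implementation: the tuple-based node shapes and the function names parse_separate, parse_combined, and depth are hypothetical, and "InternalMinus" only stands in for InternalMinusNode.

```python
# Hypothetical node shapes: ("Infix", op, [children]) and ("InternalMinus", child).
# Leaves are plain strings.

def parse_separate(tokens):
    """Treat + and - as separate infix operators: every change of operator
    wraps everything parsed so far in a new node."""
    tree = tokens[0]
    for i in range(1, len(tokens), 2):
        op, operand = tokens[i], tokens[i + 1]
        if isinstance(tree, tuple) and tree[1] == op:
            tree[2].append(operand)                # same operator: extend the flat node
        else:
            tree = ("Infix", op, [tree, operand])  # operator changed: nest one level deeper
    return tree

def parse_combined(tokens):
    """Stop-gap: one infix node for the whole run, with subtracted operands
    wrapped in a marker node."""
    children = [tokens[0]]
    for i in range(1, len(tokens), 2):
        op, operand = tokens[i], tokens[i + 1]
        children.append(("InternalMinus", operand) if op == "-" else operand)
    return ("Infix", "+", children)

def depth(tree):
    """Nesting depth of a tree; a bare leaf has depth 0."""
    if isinstance(tree, str):
        return 0
    if tree[0] == "InternalMinus":
        return 1 + depth(tree[1])
    return 1 + max(depth(child) for child in tree[2])

tokens = "a + b - c + d - e + f - g".split()
print(depth(parse_separate(tokens)))  # 6: one level per operator change
print(depth(parse_combined(tokens)))  # 2: constant, however long the run is
```

With separate nodes the depth grows with the number of operator alternations, while the combined node keeps it constant.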

Use FindFile to help with things like ParseFile["Foo`"]

Fix more cases for DeclarationName

Add support for all set relation operators

Fix certain infix longname operators being parsed as binary operators

Better error handling for ToInputFormString and ToFullFormString

Make sure to delete all new'd memory in wl-ast

Some work on documentation notebooks.

Add Boxes.wl, providing a rudimentary CSTToBoxes function

Make Divide binary instead of infix

Enforce NonASCII restriction on strings and files. This comes from a restriction of RunProcess and will be addressed in a future update.

Preserve the difference between characters provided as byte-encoded and characters provided as \ encoded
For example, the byte 0x0a and the bytes 0x5c 0x6e both encode the newline character. There are times when the distinction is important and must be preserved. \ syntax for the newline character cannot be used outside of strings.

Also it is just a bit nicer to preserve the encoding that was provided. One caveat is that \ encoding does canonicalize to one form. For example, \n, \:000a, \.0a, and \012 all encode the newline character, but it is canonicalized to \n. The general pattern for canonicalization is to first prefer the short \x form if possible (e.g., \n \r \t), then the long name \[Name] syntax, and then the \:xxxx syntax.
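The preference order can be sketched in Python as follows. This is an illustration under assumptions, not the actual CodeParser code: the function name canonical_escape is hypothetical, the two tables are tiny illustrative subsets, and only code points up to 0xffff are handled.

```python
# Illustrative subsets only; the real tables are much larger.
SHORT_FORMS = {0x0A: "\\n", 0x0D: "\\r", 0x09: "\\t"}
LONG_NAMES = {0x03B1: "Alpha", 0x2205: "EmptySet"}

def canonical_escape(code_point):
    """Return the canonical \\ encoding for a code point (assumed <= 0xffff)."""
    if code_point in SHORT_FORMS:
        return SHORT_FORMS[code_point]                # 1. short form, e.g. \n
    if code_point in LONG_NAMES:
        return "\\[" + LONG_NAMES[code_point] + "]"   # 2. long name, e.g. \[Alpha]
    return "\\:%04x" % code_point                     # 3. 4-digit hex form

# \n, \:000a, \.0a and \012 all decode to code point 0x0a, so all four
# come out as the same canonical form.
print(canonical_escape(0x0A))
print(canonical_escape(0x03B1))
print(canonical_escape(0x1234))
```

Working on the decoded code point rather than the original spelling is what makes every \ spelling of a character collapse to one form.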

0.9

Apr 1, 2019

Remove explicit lists of letterlike long names, and add explicit list of uninterpretable long names.
All valid letterlike characters should work now, and any \:xxxx characters are flagged as strange.
Fill out the other lists of characters.
All characters should now be categorized (letterlike, operator, space, newline, comma, uninterpretable)

Add >>> operator.
Properly stringify args after ::, <<, >>, and >>>.

Attach comments to CST nodes.

Stop using an executable.
Build a shared library and use LibraryLink and MathLink.
Support building with earlier versions of Mathematica. Building with version 11.0 is supported.

Separate SyntaxIssues and AbstractSyntaxIssues.

Allow File[] wrappers to work.

Various bug fixes.

0.10

May 6, 2019

Put Source information on link as a packed array.

Properly abstract strings in operators that do their own stringification (::, >>, >>>, <<, #).

Handle parsing #123 and #a`b

Sandbox mode is now respected in the library.

Support abstracting VectorLess, VectorGreater, etc.

Updated build scripts.

Re-add remarks about invalid UTF-8 sequences.

Add a remark about stray \r characters.

Add more ambiguity remarks.

Support abstracting BeginPackage[]/EndPackage[], Begin[]/End[], and BeginStaticAnalysisIgnore[]/EndStaticAnalysisIgnore[].

Various bug fixes.

0.11

June 10, 2019

Include operator tokens in concrete syntax trees.

Include comments as a separate list returned by ConcreteParse functions.

Include warnings about using unsupported and undocumented characters.

Include warnings about strange top-level expressions.

Various bug fixes.

0.12

August 5, 2019

Consolidate various atom Nodes into single LeafNode

\[EmptySet] is not strange.

Fixed handling of runs of multiple ;;

Comments, whitespace, and newlines are now returned in concrete syntax.

Introduce new concept of aggregate syntax which is concrete syntax, but with comments, whitespace, and newlines removed.
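As a rough illustration of the relationship between concrete and aggregate syntax, here is a minimal Python sketch. The (kind, payload) tuples, the kind labels, and the function name aggregate are hypothetical and do not reflect the real node format.

```python
# Hypothetical node shape: (kind, text) for leaves, (kind, [children]) otherwise.
TRIVIA = {"Comment", "Whitespace", "Newline"}

def aggregate(node):
    """Return a copy of a concrete tree with comments, whitespace, and newlines removed."""
    kind, payload = node
    if isinstance(payload, str):
        return node  # leaf: nothing to strip
    return (kind, [aggregate(child) for child in payload if child[0] not in TRIVIA])

concrete = ("Infix", [
    ("Symbol", "a"),
    ("Whitespace", " "),
    ("Plus", "+"),
    ("Comment", "(* note *)"),
    ("Newline", "\n"),
    ("Symbol", "b"),
])

print(aggregate(concrete))
# ('Infix', [('Symbol', 'a'), ('Plus', '+'), ('Symbol', 'b')])
```

The concrete tree keeps everything needed to reproduce the input, and the aggregate tree keeps only what matters for structure.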

Aborts are handled more gracefully.

Handle more operators.

Add CubeRoot character.

Various bug fixes.

0.13

September 16, 2019

Add ConcreteParseBox, ToStandardFormBoxes, and ToSourceCharacterString functions.

Add progress monitoring.

Switch to sending MakeLeafNode calls over MathLink, for performance.

Add some missing operators, \[Colon], \[CupCap], etc.

Other performance improvements.

Add some CodeActions.

Add Did You Mean for /@ for / at top-level

Move to using a struct with bitfields for WLCharacter.

Read entire file into buffer and store in SourceManager

Add ParseLeaf function

Create single nodes for Inequalities and VectorInequalities

Preserve line continuations in concrete syntax

Introduce a token for =. which is needed for box support

The characters \" and \\ are mapped into special codes

Warn about line continuations inside comments

Rename implicit tokens ImplicitNull, ImplicitOne, ImplicitAll

Add InfixOperatorWithTrailingParselet, for commas and semis

Only report UnlikelyEscapeSequence if not a valid character

Report strange space characters

Report strange newline characters

Various bug fixes.

0.14

October 28, 2019

Switch to using unique_ptr implementation.

Work on messages

Simplify handling line continuation of just \r

No need to issue warning for errors being strange

Simplify handling \ at end of file

Return TOKEN_ERROR_EMPTYSTRING at EOF when appropriate

Prevent SourceManager from advancing past EOF

Do not count SyntaxErrorNodes as being strange at top-level

Do the favor of combining naked \ with next character for better error reporting

Tighter error handling

Call back to kernel for LongName suggestions

Add MultiBoxNode, for handling multiple inputs in an Input cell

Make sure that ParseLeafs remembers any SyntaxIssues

Support different SourceStyles

Space or Newline chars that are directly encoded are not strange

Remove string methods from SourceCharacter and WLCharacter, and provide iterators instead

Differentiate between line continuations with different newlines

Optimizing: Collect flag fields into single bitset fields

Optimizing: eliminate dynamic_cast

Develop system for using RAII to automatically handle queueing unhandled whitespace

Work on CodeActions

Distinguish between SyntaxIssues and FormatIssues

Add ConfidenceLevel to issues

Allow :: to work when parsing boxes

Allow ? to work when parsing boxes

Convert all syntax issues in C++ code to use CodeActions

Add LeafSeqNode and NodeSeqNode classes in order to alleviate the need to constantly iterate through vectors

Employ some strategies to reduce copying Tokens so much

Letterlike characters can be strange, or very strange (with higher confidence that it is a problem)

Better error handling for f[1\[Alpa]2]

Added Intra[] construct for specifying positions within tokens

Handle \[Alpa] being parsed as boxes

Start work to make TokenEnum contain other useful bits

Make sure that SyntaxErrorNodes always have children, and are not just a leaf node

Convert Token errors into appropriate SyntaxErrorNodes when abstracting

Allow e.g., { + } to be parsed correctly as boxes

Revamped the error handling in the parser, so that unexpected closing brackets and unexpected operators do not eat any unnecessary whitespace. Lot of work and kind of ugly. But maybe error handling has to be ugly

Use LongName for making characters graphical, if available

Add check for strange characters

Allow boxes with << to be parsed properly

Drop ImplicitNull when converting back to boxes

Add checks for strange Unicode characters

When parsing a leaf, treat multiple tokens of whitespace as a single token of whitespace, similar to how FE works

Remove special handling of NonAssociativity with DirectedEdge and UndirectedEdge. This was bug 206938 and is now fixed.

Add a Quirks mode

Introduce AbstractFormatIssues, to allow warning about unneeded line continuations

Have ScopedIFS manage the data buffer from a file, and pass it to SourceManager

Treat \r\n as a single Newline token

Fixes

Fix text mode error found on Windows

Fix bug where \\ in a string, at end of line, on Windows, gave an assert

Fix ToSourceCharacterString

Fix OptionalDefaultPatternNode

Fix when UnhandledDot can happen

Fix ParseLeaf of <<

Fix parsing TagSetDelayed and TagUnset

Fix when ReplaceNode CodeAction is the entire expression (ReplacePart does not work)

Fix reporting of EOF in escape sequences

Cleanup

Move NonAssociative error handling to Abstract.wl

Convert tokenQueue to a deque, since there are so many insertions in the front

Remove append from Parser, and only have prepend

Completely remove Metadatas

Move library-related stuff to Library.wl

Organize Token and LongName files

Remove unneeded use of unique_ptr

Remove LOOKAHEAD wrappers

0.15

January 15, 2020

Update Source of nodes to be half-open. This is a change from earlier versions, where the Source was always inclusive.

For example, here is the old Source for the integer 123:

In[2]:= ConcreteParseString["123"]

Out[2]= ContainerNode[String, {LeafNode[Integer,
123, <|Source -> {{1, 1}, {1, 3}}|>]}, <||>]

And now here is the new Source:

In[3]:= ConcreteParseString["123"]

Out[3]= ContainerNode[String, {LeafNode[Integer,
123, <|Source -> {{1, 1}, {1, 4}}|>]}, <||>]

This change has a number of nice qualities. It is now easy to determine the length of the token by subtracting the start from the end, and 0-length tokens can now be represented accurately.
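The arithmetic behind these two points can be sketched in Python. The function names token_length and inclusive_to_half_open are hypothetical, and the sketch only handles tokens that start and end on the same line.

```python
def token_length(source):
    """Length of a single-line token from a half-open {{line, col}, {line, col}} Source."""
    (start_line, start_col), (end_line, end_col) = source
    if start_line != end_line:
        raise ValueError("this sketch only handles single-line tokens")
    return end_col - start_col

def inclusive_to_half_open(source):
    """Convert an old inclusive Source to the half-open convention."""
    start, (end_line, end_col) = source
    return (start, (end_line, end_col + 1))

old = ((1, 1), (1, 3))              # inclusive Source of the integer 123
new = inclusive_to_half_open(old)   # ((1, 1), (1, 4))

print(token_length(new))               # 3
print(token_length(((1, 2), (1, 2))))  # 0: a zero-length token at column 2
```

Under the inclusive convention the smallest Source, with start equal to end, already denotes a one-character token, so a zero-length token has no natural representation.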

Add Creator field to paclet.

Standardize on using ContainerNode as the outer-most node

Add InsertNodeAfter CodeAction command

Add ParseBytes function

Add InsertTextAfter CodeAction command

FormatIssues now explicitly supply their CodeActions

Do not complain about unexpected line continuations in comments

Add more operators:
DoubleRightTee
DoubleLeftTee
UpTee
DownTee
RoundImplies
Perpendicular
etc.

Disable treating BMP PUA as strange for now

REPLACEMENT CHARACTER 0xfffd is strange. This will allow flagging of bad UTF-8 in the linter

Introduce \r\n as a single SourceCharacter. This greatly simplifies newline handling.
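A minimal Python sketch of the idea, assuming the input is already decoded text. The function name source_characters is hypothetical and this is not the C++ implementation.

```python
def source_characters(text):
    """Split text into source characters, treating \\r\\n as one unit."""
    chars = []
    i = 0
    while i < len(text):
        if text[i] == "\r" and i + 1 < len(text) and text[i + 1] == "\n":
            chars.append("\r\n")   # CR LF pair becomes a single source character
            i += 2
        else:
            chars.append(text[i])  # everything else, including a lone \r, stays separate
            i += 1
    return chars

print(source_characters("a\r\nb\rc\n"))
# ['a', '\r\n', 'b', '\r', 'c', '\n']
```

Because the pairing happens once at this level, later stages see one newline per line ending and do not need a separate case for a \r followed by a \n.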

Standardize on using Whitespace as a token

Require using File[] wrapper

Add Listable version of Tokenize

Add SafeString function

InlinePart longname is unsupported

Fixes

Bring in several fixes found from fuzz testing

When parsing a - b + c, make sure to give the abstracted Times expression the correct Source.

Fix implicit Times in boxes by giving it the same Source as the RHS

Fix parsing single-digit precision

Treat prefix !! properly

Fix line continuations in # and % tokens

Abstract HermitianConjugate into ConjugateTranspose

Handle a-EOF and a/EOF

Handle Unicode non-characters and BOM

Fix precedence problem of ++a++ and --a--

Cleanup

Remove unused MissingOpener parts

Do not treat - as separate binary operator from +. Combine - parsing with +.

Do not check for strange characters in comments

0.15.1

January 23, 2020

Only load expr lib functions if expr lib exists

Fix assert when parsing 

Fix FileExistsQ::fstr that can happen with earlier versions

1.0

March 24, 2020

Properly abstract unquoted strings

Rename ParseLeaf -> ConcreteParseLeaf

Rename AST -> CodeParser

Add support for System`Private`NewContextPath / System`Private`RestoreContextPath

Do better job with ToStandardFormBoxes handling multiple inputs separated by newlines

Simplify newline handling so that newline tokens are contiguous and half-open, just like all other tokens

Enable multi-line mode for matching chunks.

Add SourceConvention option

Generate Parselet registrations at build-time

Convert various std::maps to sorted std::arrays

Add EncodingIssues

Template-ize UnderParselet

Make all parsing functions listable

Insert "Definition" metadata for functions

Allow + +a to parse as +a, the same as kernel

Change from old syntax CodeParse[str, h] to new syntax CodeParse[str, ContainerNode -> h]

Call setupLibraries[] lazily

Transition to purely parselet-driven parsing

Remove implicit Times logic from parser, and properly handle inside parselets.

No longer need to pay the cost of checking implicitTimes boolean for every parse.

Fixes

Fix crash found by afl

Cleanup

Combine handling of Inequality and VectorInequality

Combine handling of Infix and Inequality

Cleanup several issues from fuzz testing