CodeParser Release Notes

0.7

Jan 9, 2019

Separate Concrete Syntax Trees and Abstract Syntax Trees.

Moved unused generated files out of paclet layout.

Introduced expression depth and expression breadth warnings.

Added wl-ast as a Paclet Resource.

Added notes in README.md about Antivirus problems on Windows.

Added TernaryOperatorsToSymbol
Renamed LinearSyntaxBang -> PrefixLinearSyntaxBang
Added PrefixInvisiblePrefixScriptBase
Added PostfixInvisiblePostfixScriptBase
Renamed InfixImplicitPlus -> BinaryImplicitPlus
Added BinaryInvisibleTimes

Added WLCharacter

brenton2maclap:MacOSX-x86-64 brenton$ ./wl-ast -format characters
>>> 1+1
{
WLCharacter[49, "1", <|Source->{{1, 1}, {1, 1}}|>],
WLCharacter[43, "+", <|Source->{{1, 2}, {1, 2}}|>],
WLCharacter[49, "1", <|Source->{{1, 3}, {1, 3}}|>],
WLCharacter[-1, "", <|Source->{{2, 0}, {2, 0}}|>],
Nothing
}
brenton2maclap:MacOSX-x86-64 brenton$

Added logic to use PacletResources to find wl-ast executable.

Added ConcreteParse functions and start separating out concrete and abstract parse trees.

Added warning for input like:

1.2.3

0.8

Feb 4, 2019

Correct concrete and abstract (Derivative) parsing

Abstract parsing errors for a b c and a b c

Correctly abstract a b

Add a stop-gap for parsing large expressions containing + and -
The concrete syntax tree was originally treating exprs of + and exprs of - as separate infix nodes, and then when abstracted, they would be combined. However, this is a problem for expressions that heavily alternate between + and -, e.g., a + b - c + d - e + f - g ... This would create a deeply nested expression and then the internal call to ToExpression would fail, even though the kernel can parse the original expression. The limitation of ToExpression is understood. Introduce a stop-gap that treats a + b - c + d - e + f - g ... as a single infix node, with InternalMinusNode for minus nodes. Eventually, when we move to something like LibraryLink / WSTP, we can move back to separate parse trees for + and -.
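
To picture why the single infix node helps, here is a minimal C++ sketch. It is not the parser's actual code: `Term`, `FlatInfixNode`, `flattenPlusMinus`, and `nestedDepth` are hypothetical names used only for illustration. It compares the nesting depth of the old representation with the constant depth of the flat one.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for illustration; these are not CodeParser's real node classes.
struct Term {
    std::string operand;
    bool minus; // true if the operand was preceded by '-', playing the role of InternalMinusNode
};

// One flat infix node: its depth stays constant however many operands it holds.
struct FlatInfixNode {
    std::vector<Term> terms;
};

// Build a single flat node. ops[i] is the operator between operands[i] and operands[i + 1].
FlatInfixNode flattenPlusMinus(const std::vector<std::string>& operands,
                               const std::vector<char>& ops) {
    FlatInfixNode node;
    for (std::size_t i = 0; i < operands.size(); ++i) {
        bool minus = (i > 0) && (i - 1 < ops.size()) && (ops[i - 1] == '-');
        node.terms.push_back({operands[i], minus});
    }
    return node;
}

// Depth of the old representation, assuming every switch between '+' and '-'
// starts a new nested infix node.
std::size_t nestedDepth(const std::vector<char>& ops) {
    std::size_t depth = 0;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        if (i == 0 || ops[i] != ops[i - 1]) {
            ++depth;
        }
    }
    return depth;
}
```

Under these assumptions, a + b - c + d - e nests four levels deep in the old scheme, while the flat node simply holds five terms.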

Use FindFile to help with things like ParseFile["Foo`"]

Fix more cases for DeclarationName

Add support for all set relation operators

Fix certain infix longname operators being parsed as binary operators

Better error handling for ToInputFormString and ToFullFormString

Make sure to delete all new'd memory in wl-ast

Some work on documentation notebooks.

Add Boxes.wl, providing a rudimentary CSTToBoxes function

Make Divide binary instead of infix

Enforce NonASCII restriction on strings and files. This comes from a restriction of RunProcess and will be addressed in a future update.

Preserve the difference between characters provided as byte-encoded and characters provided as \ encoded
For example, the byte 0x0a and the bytes 0x5c 0x6e both encode the newline character. And there are times where the distinction is important and must be preserved. \ syntax for the newline character cannot be used outside of strings.

Also it is just a bit nicer to preserve the encoding that was provided. One caveat is that \ encoding does canonicalize to one form. For example, \n, \:000a, \.0a and \012 all encode the newline character. But it is canonicalized to \n. The general pattern for canonicalization is to first prefer the short \x form if possible (e.g., \n \r \t), and then prefer long names \[Name] syntax, and then prefer \:xxxx syntax.
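
The preference order described above can be sketched in C++ as follows. This is an illustration under stated assumptions, not the library's implementation: `canonicalEscape` and `kLongNames` are hypothetical names, the long-name table is deliberately tiny, and code points above 0xFFFF are not handled.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// Hypothetical, tiny long-name table for illustration; a real table is far larger.
static const std::map<std::uint32_t, std::string> kLongNames = {
    {0x03B1, "Alpha"},
    {0x03B2, "Beta"},
};

// Return the canonical backslash escape for a code point, preferring:
//   1. the short form (\n, \r, \t)
//   2. the \[Name] long-name form
//   3. the \:xxxx 4-hex-digit form
// Sketch only: code points above 0xFFFF are not handled here.
std::string canonicalEscape(std::uint32_t cp) {
    switch (cp) {
        case 0x0A: return "\\n";
        case 0x0D: return "\\r";
        case 0x09: return "\\t";
        default: break;
    }
    auto it = kLongNames.find(cp);
    if (it != kLongNames.end()) {
        return "\\[" + it->second + "]";
    }
    char buf[16];
    std::snprintf(buf, sizeof buf, "\\:%04x", static_cast<unsigned>(cp));
    return buf;
}
```

With this table, every spelling of newline maps to \n, U+03B1 maps to \[Alpha], and a code point with no entry falls through to the \:xxxx form.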

0.9

Apr 1, 2019

Remove explicit lists of letterlike long names, and add explicit list of uninterpretable long names.
All valid letterlike characters should work now, and any \:xxxx characters are flagged as strange.
Fill out the other lists of characters.
All characters should now be categorized (letterlike, operator, space, newline, comma, uninterpretable)

Add >>> operator.
Properly stringify args after ::, <<, >>, and >>>.

Attach comments to CST nodes.

Stop using an executable.
Build a shared library and use LibraryLink and MathLink.
Support building with earlier versions of Mathematica. Building with version 11.0 is supported.

Separate SyntaxIssues and AbstractSyntaxIssues.

Allow File[] wrappers to work.

Various bug fixes.

0.10

May 6, 2019

Put Source information on link as a packed array.

Properly abstract strings in operators that do their own stringification (::, >>, >>>, <<, #).

Handle parsing #123 and #a`b

Sandbox mode is now respected in the library.

Support abstracting VectorLess, VectorGreater, etc.

Updated build scripts.

Re-add remarks about invalid UTF-8 sequences.

Add a remark about stray \r characters.

Add more ambiguity remarks.

Support abstracting BeginPackage[]/EndPackage[], Begin[]/End[], and BeginStaticAnalysisIgnore[]/EndStaticAnalysisIgnore.

Various bug fixes.

0.11

June 10, 2019

Include operator tokens in concrete syntax trees.

Include comments as a separate list returned by ConcreteParse functions.

Include warnings about using unsupported and undocumented characters.

Include warnings about strange top-level expressions.

Various bug fixes.

0.12

August 5, 2019

Consolidate various atom Nodes into single LeafNode

\[EmptySet] is not strange.

Fixed handling of runs of multiple ;;

Comments, whitespace, and newlines are now returned in concrete syntax.

Introduce new concept of aggregate syntax which is concrete syntax, but with comments, whitespace, and newlines removed.
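
As a rough picture of the relationship between concrete and aggregate syntax, here is a C++ sketch that filters a flat list of leaves. The real trees are nested, and `Kind`, `Leaf`, and `toAggregate` are hypothetical names chosen for this example only.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical leaf kinds; the real token set is much larger.
enum class Kind { Symbol, Operator, Integer, Whitespace, Newline, Comment };

struct Leaf {
    Kind kind;
    std::string text;
};

// Aggregate syntax = concrete syntax with comments, whitespace, and newlines removed.
// This sketch works on a flat list; a real implementation would recurse into child nodes.
std::vector<Leaf> toAggregate(const std::vector<Leaf>& concrete) {
    std::vector<Leaf> agg;
    std::copy_if(concrete.begin(), concrete.end(), std::back_inserter(agg),
                 [](const Leaf& l) {
                     return l.kind != Kind::Whitespace &&
                            l.kind != Kind::Newline &&
                            l.kind != Kind::Comment;
                 });
    return agg;
}
```

Everything needed to reproduce the input exactly stays in the concrete form; the aggregate form keeps only what matters for structure.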

Aborts are handled more gracefully.

Handle more operators.

Add CubeRoot character.

Various bug fixes.

0.13

September 16, 2019

Add ConcreteParseBox, ToStandardFormBoxes, and ToSourceCharacterString functions.

Add progress monitoring.

Switch to sending MakeLeafNode calls over MathLink, for performance.

Add some missing operators, \[Colon], \[CupCap], etc.

Other performance improvements.

Add some CodeActions.

Add Did You Mean for /@ for / at top-level

Move to using a struct with bitfields for WLCharacter.
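
For readers unfamiliar with the technique, a struct with bitfields might look like the C++ sketch below. The field names, widths, and the `EscapeStyle` values are assumptions made for illustration; they are not the actual WLCharacter layout.

```cpp
#include <cstdint>

// Hypothetical escape styles; names and values are assumptions for this sketch.
enum EscapeStyle : std::uint8_t {
    ESCAPE_NONE = 0,
    ESCAPE_SINGLE = 1,   // e.g. \n
    ESCAPE_LONGNAME = 2, // e.g. \[Alpha]
    ESCAPE_4HEX = 3      // e.g. \:03b1
};

// Pack the code point and the escape style into adjacent bitfields.
// 24 signed bits cover all of Unicode (up to 0x10FFFF) and leave room
// for negative sentinels such as -1 for end of file.
struct PackedChar {
    signed int codePoint : 24;
    unsigned int escape : 3;
};

inline PackedChar makeChar(int codePoint, EscapeStyle escape) {
    PackedChar c;
    c.codePoint = codePoint;
    c.escape = escape;
    return c;
}
```

On common compilers both fields typically share a single 32-bit word, although the exact layout of bitfields is implementation-defined.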

Read entire file into buffer and store in SourceManager

Add ParseLeaf function

Create single nodes for Inequalities and VectorInequalities

Preserve line continuations in concrete syntax

Introduce a token for =. which is needed for box support

The characters \" and \\ are mapped into special codes

Warn about line continuations inside comments

Rename implicit tokens ImplicitNull, ImplicitOne, ImplicitAll

Add InfixOperatorWithTrailingParselet, for commas and semis

Only report UnlikelyEscapeSequence if not a valid character

Report strange space characters

Report strange newline characters

Various bug fixes.

0.14

October 28, 2019

Switch to using unique_ptr implementation.

Work on messages

Simplify handling line continuation of just \r

No need to issue warning for errors being strange

Simplify handling \ at end of file

Return TOKEN_ERROR_EMPTYSTRING at EOF when appropriate

Prevent SourceManager from advancing past EOF

Do not count SyntaxErrorNodes as being strange at top-level

Do the favor of combining naked \ with next character for better error reporting

Tighter error handling

Call back to kernel for LongName suggestions

Add MultiBoxNode, for handling multiple inputs in an Input cell

Make sure that ParseLeafs remembers any SyntaxIssues

Support different SourceStyles

Space or Newline chars that are directly encoded are not strange

Remove string methods from SourceCharacter and WLCharacter, and provide iterators instead

Differentiate between line continuations with different newlines

Optimizing: Collect flag fields into single bitset fields

Optimizing: eliminate dynamic_cast

Develop system for using RAII to automatically handle queueing unhandled whitespace

Work on CodeActions

Distinguish between SyntaxIssues and FormatIssues

Add ConfidenceLevel to issues

Allow :: to work when parsing boxes

Allow ? to work when parsing boxes

Convert all syntax issues in C++ code to use CodeActions

Add LeafSeqNode and NodeSeqNode classes in order to alleviate the need to constantly iterate through vectors

Employ some strategies to reduce copying Tokens so much

Letterlike characters can be strange, or very strange (with higher confidence that it is a problem)

Better error handling for f[1\[Alpa]2]

Added Intra[] construct for specifying positions within tokens

Handle \[Alpa] being parsed as boxes

Start work to make TokenEnum contain other useful bits

Make sure that SyntaxErrorNodes always have children, and are not just a leaf node

Convert Token errors into appropriate SyntaxErrorNodes when abstracting

Allow e.g., { + } to be parsed correctly as boxes

Revamped the error handling in the parser, so that unexpected closing brackets and unexpected operators do not eat any unnecessary whitespace. Lot of work and kind of ugly. But maybe error handling has to be ugly

Use LongName for making characters graphical, if available

Add check for strange characters

Allow boxes with << to be parsed properly

Drop ImplicitNull when converting back to boxes

Add checks for strange Unicode characters

When parsing a leaf, treat multiple tokens of whitespace as a single token of whitespace, similar to how FE works

Remove special handling of NonAssociativity with DirectedEdge and UndirectedEdge. This was bug 206938 and is now fixed.

Add a Quirks mode

Introduce AbstractFormatIssues, to allow warning about unneeded line continuations

Have ScopedIFS manage the data buffer from a file, and pass it to SourceManager

Treat \r\n as a single Newline token

Fixes

Fix text mode error found on Windows

Fix bug where \\ in a string, at end of line, on Windows, gave an assert

Fix ToSourceCharacterString

Fix OptionalDefaultPatternNode

Fix when UnhandledDot can happen

Fix ParseLeaf of <<

Fix parsing TagSetDelayed and TagUnset

Fix when ReplaceNode CodeAction is the entire expression (ReplacePart does not work)

Fix reporting of EOF in escape sequences

Cleanup

Move NonAssociative error handling to Abstract.wl

Convert tokenQueue to a deque, since there are so many insertions in the front

Remove append from Parser, and only have prepend

Completely remove Metadatas

Move library-related stuff to Library.wl

Organize Token and LongName files

Remove unneeded use of unique_ptr

Remove LOOKAHEAD wrappers

0.15

January 15, 2020

Update Source of nodes to be half-open. This is a change from earlier versions, where the Source was always inclusive.

For example, here is the old Source for the integer 123:

In[2]:= ConcreteParseString["123"]

Out[2]= ContainerNode[String, {LeafNode[Integer,
123, <|Source -> {{1, 1}, {1, 3}}|>]}, <||>]

And now here is the new Source:

In[3]:= ConcreteParseString["123"]

Out[3]= ContainerNode[String, {LeafNode[Integer,
123, <|Source -> {{1, 1}, {1, 4}}|>]}, <||>]

This change has a number of nice qualities. It is now easy to determine the length of the token by subtracting the start from the end, and 0-length tokens can now be represented accurately.
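
The arithmetic behind the half-open convention can be sketched in C++. The `Position` and `Source` structs and the helper names are hypothetical and only mirror the {{line, column}, {line, column}} shape shown above; the helpers assume a token that sits on a single line.

```cpp
// Hypothetical mirror of Source -> {{line, col}, {line, col}}.
struct Position {
    int line;
    int column;
};

struct Source {
    Position start;
    Position end; // half-open: points one past the last character
};

// Length of a token that sits on a single line: simply end minus start.
int lengthOnLine(const Source& s) {
    return s.end.column - s.start.column;
}

// A 0-length token has start == end, which an inclusive range cannot express.
bool isEmptyOnLine(const Source& s) {
    return lengthOnLine(s) == 0;
}

// Convert an old inclusive single-line Source to the new half-open form.
Source fromInclusive(Source old) {
    ++old.end.column;
    return old;
}
```

For the integer 123, the half-open Source {{1, 1}, {1, 4}} gives 4 - 1 = 3, which matches the three characters of the token.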

Add Creator field to paclet.

Standardize on using ContainerNode as the outer-most node

Add InsertNodeAfter CodeAction command

Add ParseBytes function

Add InsertTextAfter CodeAction command

FormatIssues now explicitly supply their CodeActions

Do not complain about unexpected line continuations in comments

Add more operators:
DoubleRightTee
DoubleLeftTee
UpTee
DownTee
RoundImplies
Perpendicular
etc.

Disable treating BMP PUA as strange for now

REPLACEMENT CHARACTER 0xfffd is strange, this will allow flagging of bad UTF8 in the linter

Introduce \r\n as a single SourceCharacter. This greatly simplifies newline handling.
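
A minimal C++ sketch of the idea, assuming ASCII input and a hypothetical sentinel value `CODEPOINT_CRLF` (UTF-8 decoding is omitted, and neither name is taken from the library):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sentinel outside the Unicode range, standing for the \r\n pair.
constexpr int CODEPOINT_CRLF = -2;

// Turn bytes into source characters, collapsing \r\n into one character.
// A lone \r or a lone \n is passed through unchanged.
std::vector<int> toSourceCharacters(const std::string& bytes) {
    std::vector<int> out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] == '\r' && i + 1 < bytes.size() && bytes[i + 1] == '\n') {
            out.push_back(CODEPOINT_CRLF);
            ++i; // consume the \n as part of the same source character
        } else {
            out.push_back(static_cast<unsigned char>(bytes[i]));
        }
    }
    return out;
}
```

Later stages then see one newline character instead of two. The trade-off is that character indices no longer equal byte offsets once a \r\n pair has been collapsed.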

Standardize on using Whitespace as a token

Require using File[] wrapper

Add Listable version of Tokenize

Add SafeString function

InlinePart longname is unsupported

Fixes

Bring in several fixes found from fuzz testing

When parsing a - b + c, make sure to give the abstracted Times expression the correct Source.

Fix implicit Times in boxes by giving it the same Source as the RHS

Fix parsing single-digit precision

Treat prefix !! properly

Fix line continuations in # and % tokens

Abstract HermitianConjugate into ConjugateTranspose

Handle a-EOF and a/EOF

Handle Unicode non-characters and BOM

Fix precedence problem of ++a++ and --a--

Cleanup

Remove unused MissingOpener parts

Do not treat - as separate binary operator from +. Combine - parsing with +.

Do not check for strange characters in comments

0.15.1

January 23, 2020

Only load expr lib functions if expr lib exists

Fix assert when parsing 

Fix FileExistsQ::fstr that can happen with earlier versions

1.0

March 24, 2020

Properly abstract unquoted strings

Rename ParseLeaf -> ConcreteParseLeaf

Rename AST -> CodeParser

Add support for System`Private`NewContextPath / System`Private`RestoreContextPath

Do better job with ToStandardFormBoxes handling multiple inputs separated by newlines

Simplify newline handling so that newline tokens are contiguous and half-open, just like all other tokens

Enable multi-line mode for matching chunks.

Add SourceConvention option

Generate Parselet registrations at build-time

Convert various std::maps to sorted std::arrays
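
The general technique looks like the C++ sketch below: a table sorted at build time and searched with std::lower_bound. The table contents and the names `Entry`, `kSortedNames`, and `lookupName` are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <string>

struct Entry {
    std::uint32_t codePoint;
    const char* name;
};

// Hypothetical table; it must stay sorted by codePoint for the binary search to work.
static const std::array<Entry, 3> kSortedNames = {{
    {0x03B1, "Alpha"},
    {0x03B2, "Beta"},
    {0x03B3, "Gamma"},
}};

// Binary search in contiguous memory; returns an empty string if not found.
std::string lookupName(std::uint32_t cp) {
    auto it = std::lower_bound(
        kSortedNames.begin(), kSortedNames.end(), cp,
        [](const Entry& e, std::uint32_t value) { return e.codePoint < value; });
    if (it != kSortedNames.end() && it->codePoint == cp) {
        return it->name;
    }
    return std::string();
}
```

Lookups remain O(log n) as with std::map, but the data is contiguous and needs no per-node allocation, which is usually friendlier to the cache for tables that never change at run time.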

Add EncodingIssues

Template-ize UnderParselet

Make all parsing functions listable

Insert "Definition" metadata for functions

Allow + +a to parse as +a, the same as kernel

Change from old syntax CodeParse[str, h] to new syntax CodeParse[str, ContainerNode -> h]

Call setupLibraries[] lazily

Transition to purely parselet-driven parsing

Remove implicit Times logic from parser, and properly handle inside parselets.

No longer need to pay the cost of checking implicitTimes boolean for every parse.

Fixes

Fix crash found by afl

Cleanup

Combine handling of Inequality and VectorInequality

Combine handling of Infix and Inequality

Cleanup several issues from fuzz testing

1.1

September 30, 2020

API changes

Return a Rational when parsing something like 1*^-2
This returned an Integer before.
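
The reason is plain arithmetic: a negative exponent divides by a power of ten, so the value is generally not an integer. A small C++ sketch, with hypothetical names and no overflow handling for large exponents:

```cpp
#include <cstdint>

struct Rational {
    std::int64_t num;
    std::int64_t den;
};

// Greatest common divisor on the absolute value of a; b is assumed positive.
std::int64_t gcd64(std::int64_t a, std::int64_t b) {
    if (a < 0) {
        a = -a;
    }
    while (b != 0) {
        std::int64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Value of mantissa*^exponent as a reduced fraction.
// Sketch only: large exponents overflow std::int64_t.
Rational fromMantissaExponent(std::int64_t mantissa, int exponent) {
    std::int64_t num = mantissa;
    std::int64_t den = 1;
    for (int i = 0; i < exponent; ++i) {
        num *= 10;
    }
    for (int i = 0; i > exponent; --i) {
        den *= 10;
    }
    std::int64_t g = gcd64(num, den);
    return {num / g, den / g};
}
```

Here 1*^-2 comes out as 1/100, while a non-negative exponent such as 3*^2 still reduces to a denominator of 1, i.e. an integer.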

Add Token`PercentPercent as a variable length token

Add CompoundNode

Add CodeSyntaxQ, CodeSyntaxCSTQ, CodeStructuralSyntaxQ

Add Token`Error`UnterminatedFileString

Add Token`Boxes`CommentContent

Add UnterminatedTokenErrorNeedsReparseNode

Add Token`Error`UnexpectedCloser

Add GroupMissingOpenerNode (only used in boxes)

Add CallMissingCloserNode

Add QuaternaryNode and FragmentNode

Add new nodes and tags: MemoizedSetDelayed, MemoizedTagSetDelayed, UnterminatedGroupNode, UnterminatedCallNode

Add "TabWidth" option to functions and correctly calculate columns using tab stops
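
One common way to compute columns with tab stops is sketched below in C++. It assumes 1-based columns with tab stops at columns 1, 1 + TabWidth, 1 + 2*TabWidth, and so on; that convention and the name `columnAfter` are assumptions for this sketch, not a statement of what the library does internally.

```cpp
#include <string>

// Return the 1-based column just past the last character of `line`.
// A tab advances to the next tab stop; any other character advances by one.
// Assumes tabWidth > 0 and tab stops at columns 1, 1 + tabWidth, 1 + 2*tabWidth, ...
int columnAfter(const std::string& line, int tabWidth) {
    int column = 1;
    for (char c : line) {
        if (c == '\t') {
            column = ((column - 1) / tabWidth + 1) * tabWidth + 1;
        } else {
            ++column;
        }
    }
    return column;
}
```

With a tab width of 4, a tab at the start of a line moves to column 5, and so does a tab that follows a single character, since both jump to the same tab stop.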

Graduate SourceConvention option to be a symbol

Add -check flag to command-line parser

Remove OptionalDefaultNode and just use Token`UnderDot directly.

Remove Token`LineContinuation. The actual line continuation string is now always attached to the previous or next token.
Currently, it's always the next token, but that is an implementation detail.

Remove the 3 fake line continuation code points. These are no longer used.

Remove UnexpectedLineContinuation FormatIssue. This is no longer used.

NOTE: line continuations are NOT currently handled in places where SourceCharacters are used directly: comments and file stringification. This is not relevant for comments, but it is wrong to not handle line continuations during file stringification.

Propagate EncodingIssue in a few places

Treat EqualDot as a proper binary operator

Add new |-> operator for Function.

Performance improvements

Improve stack shim performance

Use results from some profiling to speed up tokenizing numbers.

Dynamically set GroupMissingCloserNeedsReparseNode and UnterminatedTokenErrorNeedsReparseNode while parsing, instead of doing Replace.

Use Normal[ReadByteArray[#]]& instead of Import[#, "Byte"]

Remove asserts from aggregate and try to not use local variables when possible

Error handling

Add better handling for syntax with .. and ... where backtracking is needed
Examples are 2^^.. and 123*^2..

Continue parsing numbers with invalid bases or unrecognized digits (because of the base)

Miscellaneous

A massive number of minor bug fixes, typo fixes, reorganizations, tweaks, and cleanup

1.1.1

December 8, 2020

Included in Mathematica 12.2

Fixes

The parser side of what needs to be fixed for 398836: implicit Times tokens do need to be present in the box cst.

I had been remiss in treating unhandled boxes as implicit Times.

So start doing that now. But do the easy thing and simply riffle in ImplicitTimes tokens.

This is not strictly correct, but the analysis needed to properly insert ImplicitTimes tokens is too complicated in the current location.

Note: Mixing explicit Times operators and implicit Times does not currently work
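
"Riffling" here means placing a token between every pair of adjacent children. A C++ sketch, using plain strings as stand-ins for box children and a hypothetical function name:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Insert an "ImplicitTimes" marker between every pair of adjacent children.
// Strings stand in for nodes; this blind approach is why mixing explicit
// and implicit Times is not handled.
std::vector<std::string> riffleImplicitTimes(const std::vector<std::string>& children) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < children.size(); ++i) {
        if (i > 0) {
            out.push_back("ImplicitTimes");
        }
        out.push_back(children[i]);
    }
    return out;
}
```

Three children a, b, c become a, ImplicitTimes, b, ImplicitTimes, c. No analysis of the children is done, which keeps the change simple at the cost of the limitation noted above.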

Fix LinearSyntaxBlob in wrong context

1.2

March 25, 2021

Provide more ASCII approximations for long names

Handle linear syntax as tokens, but they are unsupported

A workflow is removing the \( \) from a LinearSyntaxBlob and then tokenizing.

>> and >>> should not give top-level lints

Use the source of just the ; when reporting top-level CompoundExpression

Add CodeStructuralSyntaxAggQ

Move stray comma handling back to concrete parser and teach -check flag to fail with stray commas

For unexpected characters, display both the actual encoded representation and the escaped form.

This will help with diagnosing problems with invisible characters

Add a new Kernel file RowBox.wl that is generated at build-time.

Hard-code more values for parseBox

In the fall-through cases of parseBox, be more efficient with testing different string cases

Remove StringifyMode 3 (passthrough). Just hard-code these cases

Allow "FileFormat" -> "Script" for ignoring shebang in .wls files
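
Conceptually, ignoring a shebang means skipping the first line when it starts with #!. The C++ sketch below shows one way to do that; `scriptBodyOffset` is a hypothetical helper, and stopping at the newline (rather than after it) is an assumption made so that line numbers of the remaining code stay unchanged.

```cpp
#include <cstddef>
#include <string>

// Return the offset at which parsing should start.
// If the source begins with "#!", skip up to (but not including) the first
// newline, so the following code is still reported on line 2.
std::size_t scriptBodyOffset(const std::string& src) {
    if (src.compare(0, 2, "#!") != 0) {
        return 0;
    }
    std::size_t nl = src.find('\n');
    return nl == std::string::npos ? src.size() : nl;
}
```

A file without a shebang is parsed from offset 0, so the same routine is safe to call on ordinary package files.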

Provide better error reporting for \Alpha]

Add ScopingData[ast] function for gathering information about scoping constructs

Introduce DefinitionSymbols function and start using it

This is a saner function than the older DeclarationName

Update ASCII replacements

Provide some selectors for CodeAction objects

Move ToInputFormString

Compile` also has a ToInputFormString symbol, so to prevent shadowing messages,
move CodeParser`ToInputFormString to CodeParser`ToString`ToInputFormString
and rely on CodeParser`ToString` not being on $ContextPath (unless explicitly loaded, of course)

Insert "FileName" into CSTs

Allow CodeConcreteParse of Cells, Notebooks, CellObjects, and NotebookObjects

Lower confidence of package-related issues that are not 100% errors

Fixes

Fix warning from MSVC: using universal-character-name \u2423 for ␣ was technically not correct, since MSVC is set up with a different code page by default

Fix handling #"foo"

Fix bug 404282, and related issues

CodeConcreteParse["\"\r"] would give messages because the \r newline was not being handled correctly

Fix that, and also fix issues with \r\n newlines where the indices from the SourceCharacterIndex were wrong because of treating \r\n as a single source character

Handle TernaryTilde at top-level

Add UpSet and UpSetDelayed to top-level definitions

Syntax such as Attributes[foo] = {HoldAll} is a definition for foo and not for Attributes

.wl files may be scripts with #! or not, so must handle

In the doc page:
tutorial/FilesStreamsAndExternalOperations

there is a paragraph that starts:

"Scripts may be stored either in normal .wl package files or in dedicated .wls script files."

Fix 406933: specific comment makes CodeParser give messages and go into infinite loop

1.3

August 30, 2021

Notes on compatibility have been added to docs/compatibility.md

Introduce BatchMode option for Abstract

BatchMode -> True where Begin[] and End[] nodes will be at top-level (e.g., .wl files):
  PackageNodes[] and ContextNodes[] WILL be created
  Issues about unbalanced directives WILL be created

BatchMode -> False otherwise, i.e., where Begin[] and End[] nodes are separate or not easily scanned together (e.g., cells in notebooks):
  PackageNodes[] and ContextNodes[] will NOT be created
  Issues about unbalanced directives will NOT be created

Fixes

Fix 409304: Teach parser about TernaryOptionalPattern
TernaryOptionalPattern comes from boxes

Multi-pronged fix for 409210:
\[Prime] is not strange
Add SubscriptBox and SuperscriptBox as not strange for head of calls

Fix 409216: Teach abstract and scoping about GridBox

Fix 409472: The abstracting code assumed that the Source was always LineColumn convention, and this was breaking for boxes
So introduce some functions to handle the different source conventions

410337: Fix DefinitionSymbols for errors

410404: Fix parsing \[Integral] a + 2
Trivia2 was not scoped properly and was not being cleaned up at the correct time

Fix 405322: Symbols occurring in options to DynamicModule have same scope as body