CodeParser Release Notes
0.7
Separate Concrete Syntax Trees and Abstract Syntax Trees.
Moved unused generated files out of paclet layout.
Introduced expression depth and expression breadth warnings.
Added wl-ast as a Paclet Resource.
Added notes in README.md about Antivirus problems on Windows.
Added TernaryOperatorsToSymbol
Renamed LinearSyntaxBang -> PrefixLinearSyntaxBang
Added PrefixInvisiblePrefixScriptBase
Added PostfixInvisiblePostfixScriptBase
Renamed InfixImplicitPlus -> BinaryImplicitPlus
Added BinaryInvisibleTimes
brenton2maclap:MacOSX-x86-64 brenton$ ./wl-ast -format characters
>>> 1+1
{
WLCharacter[49, "1", <|Source->{{1, 1}, {1, 1}}|>],
WLCharacter[43, "+", <|Source->{{1, 2}, {1, 2}}|>],
WLCharacter[49, "1", <|Source->{{1, 3}, {1, 3}}|>],
WLCharacter[-1, "", <|Source->{{2, 0}, {2, 0}}|>],
Nothing
}
brenton2maclap:MacOSX-x86-64 brenton$
Added logic to use PacletResources to find wl-ast executable.
Added ConcreteParse functions and start separating out concrete and abstract parse trees.
0.8
Correct concrete and abstract ' (Derivative) parsing
Abstract parsing errors for a b c and a b c
Add a stop-gap for parsing large expressions containing + and -
The concrete syntax tree originally treated runs of + and runs of - as separate infix nodes, which were then combined when abstracted. This is a problem for expressions that alternate heavily between + and -, e.g., a + b - c + d - e + f - g ... Such input creates a deeply nested expression, and the internal call to ToExpression then fails, even though the kernel can parse the original expression. The limitation of ToExpression is understood. As a stop-gap, a + b - c + d - e + f - g ... is now treated as a single infix node, with InternalMinusNode for minus nodes. Once we move to something like LibraryLink / WSTP, we can go back to separate parse trees for + and -.
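The nesting problem can be pictured with a small Python sketch. This is not CodeParser's implementation: the tuple-based node shapes, the "Plus", "Minus" and "InfixPlus" labels, and the function names are all hypothetical; only InternalMinusNode comes from the note above. The sketch builds both tree shapes from the same operands and compares their depth.

```python
# Illustrative sketch only. Node shapes and function names are hypothetical,
# not CodeParser's actual data structures.

def nested_tree(operands, operators):
    """Old shape: runs of the same operator share a node, but every change
    of operator wraps the tree built so far in a new node."""
    tree = operands[0]
    for op, operand in zip(operators, operands[1:]):
        name = "Plus" if op == "+" else "Minus"
        if isinstance(tree, tuple) and tree[0] == name:
            tree[1].append(operand)         # same operator: extend current node
        else:
            tree = (name, [tree, operand])  # operator changed: one more level
    return tree


def flat_node(operands, operators):
    """Stop-gap shape: one infix node, with subtracted operands wrapped
    in InternalMinusNode."""
    children = [operands[0]]
    for op, operand in zip(operators, operands[1:]):
        if op == "-":
            children.append(("InternalMinusNode", [operand]))
        else:
            children.append(operand)
    return ("InfixPlus", children)


def depth(tree):
    """Nesting depth; plain strings are leaves of depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(child) for child in tree[1])
```

For a + b - c + d - e + f - g, the nested shape in this sketch has depth 6 and grows with every alternation, while the flat shape stays at depth 2 however long the chain is.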
Use FindFile to help with things like ParseFile["Foo`"]
Fix more cases for DeclarationName
Add support for all set relation operators
Fix certain infix longname operators being parsed as binary operators
Better error handling for ToInputFormString and ToFullFormString
Make sure to delete all new'd memory in wl-ast
Some work on documentation notebooks.
Add Boxes.wl, providing a rudimentary CSTToBoxes function
Make Divide binary instead of infix
Enforce NonASCII restriction on strings and files. This comes from a restriction of RunProcess and will be addressed in a future update.
Preserve the difference between characters provided as byte encoded and characters provided as \ encoded
For example, the byte 0x0a and the bytes 0x5c 0x6e both encode the newline character. There are times when the distinction is important and must be preserved: for instance, \ syntax for the newline character cannot be used outside of strings.
It is also a bit nicer to preserve the encoding that was provided. One caveat is that \ encoding is canonicalized to one form. For example, \n, \:000a, \.0a, and \012 all encode the newline character, but all are canonicalized to \n. The general pattern for canonicalization is to prefer the short \x form if possible (e.g., \n, \r, \t), then the long name \[Name] syntax, and then the \:xxxx syntax.
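The canonicalization order described above can be sketched in Python as follows. This is an assumption-laden illustration, not the library's code: the function names are hypothetical, and the long-name table holds only two entries for demonstration, whereas the real list is far larger.

```python
# Illustrative sketch only; names and tables here are hypothetical.

SHORT_FORMS = {0x0A: "\\n", 0x0D: "\\r", 0x09: "\\t"}

# Tiny demonstration table of long names (the real list is much larger).
LONG_NAMES = {"Alpha": 0x03B1, "Beta": 0x03B2}
CODE_TO_LONG_NAME = {code: name for name, code in LONG_NAMES.items()}


def decode_escape(escape):
    """Return the code point encoded by a single backslash escape."""
    for code, short in SHORT_FORMS.items():
        if escape == short:
            return code
    body = escape[1:]                      # drop the leading backslash
    if body.startswith("[") and body.endswith("]"):
        return LONG_NAMES[body[1:-1]]      # \[Name]
    if body.startswith(":") or body.startswith("."):
        return int(body[1:], 16)           # \:xxxx or \.xx (hexadecimal)
    return int(body, 8)                    # \ooo (octal)


def canonicalize_escape(escape):
    """Prefer the short form, then a long name, then \\:xxxx."""
    code = decode_escape(escape)
    if code in SHORT_FORMS:
        return SHORT_FORMS[code]
    if code in CODE_TO_LONG_NAME:
        return "\\[" + CODE_TO_LONG_NAME[code] + "]"
    return "\\:%04x" % code
```

With this sketch, the four newline spellings mentioned above all canonicalize to \n, and \:03b1 canonicalizes to \[Alpha].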
0.9
Remove explicit lists of letterlike long names, and add explicit list of uninterpretable long names.
All valid letterlike characters should work now, and any \:xxxx characters are flagged as strange.
Fill out the other lists of characters.
All characters should now be categorized (letterlike, operator, space, newline, comma, uninterpretable)
Add >>> operator.
Properly stringify args after ::, <<, >>, and >>>.
Stop using an executable.
Build a shared library and use LibraryLink and MathLink.
Support building with earlier versions of Mathematica. Building with version 11.0 is supported.
Separate SyntaxIssues and AbstractSyntaxIssues.
Allow File[] wrappers to work.
0.10
Put Source information on link as a packed array.
Properly abstract strings in operators that do their own stringification (::, >>, >>>, <<, #).
Handle parsing #"123" and #a`b
Sandbox mode is now respected in the library.
Support abstracting VectorLess, VectorGreater, etc.
Re-add remarks about invalid UTF-8 sequences.
Add a remark about stray \r characters.
Support abstracting BeginPackage[]/EndPackage[], Begin[]/End[], and BeginStaticAnalysisIgnore[]/EndStaticAnalysisIgnore[].
0.11
Include operator tokens in concrete syntax trees.
Include comments as a separate list returned by ConcreteParse functions.
Include warnings about using unsupported and undocumented characters.
Include warnings about strange top-level expressions.
0.12
Consolidate various atom Nodes into single LeafNode
Fixed handling of runs of multiple ;;
Comments, whitespace, and newlines are now returned in concrete syntax.
Introduce new concept of aggregate syntax which is concrete syntax, but with comments, whitespace, and newlines removed.
Aborts are handled more gracefully.
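The relationship between concrete and aggregate syntax introduced in this release can be sketched in Python. The node representation (kind, payload) and the kind names used here are assumptions made for illustration; they are not the actual CodeParser node format.

```python
# Illustrative sketch only; the (kind, payload) tuples and kind names
# are hypothetical stand-ins for real syntax tree nodes.

TRIVIA = {"Comment", "Whitespace", "Newline"}


def aggregate(node):
    """Return a copy of a concrete tree with comments, whitespace,
    and newlines removed at every level."""
    kind, payload = node
    if isinstance(payload, list):          # inner node: payload is children
        children = [aggregate(child) for child in payload
                    if child[0] not in TRIVIA]
        return (kind, children)
    return node                            # leaf: payload is the source text


# Concrete tree for the input: 1 + (* c *)2
concrete = ("InfixPlus", [
    ("Integer", "1"),
    ("Whitespace", " "),
    ("Token", "+"),
    ("Whitespace", " "),
    ("Comment", "(* c *)"),
    ("Integer", "2"),
])
```

Calling aggregate(concrete) keeps only the Integer, Token, and Integer leaves, which is the sense in which aggregate syntax is concrete syntax with the trivia removed.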
0.13
Add ConcreteParseBox, ToStandardFormBoxes, and ToSourceCharacterString functions.
Switch to sending MakeLeafNode calls over MathLink, for performance.
Add some missing operators, \[Colon], \[CupCap], etc.
Other performance improvements.
Add Did You Mean for /@ for / at top-level
Move to using a struct with bitfields for WLCharacter.
Read entire file into buffer and store in SourceManager
Create single nodes for Inequalities and VectorInequalities
Preserve line continuations in concrete syntax
Introduce a token for =. which is needed for box support
The characters \" and \\ are mapped into special codes
Warn about line continuations inside comments
Rename implicit tokens ImplicitNull, ImplicitOne, ImplicitAll
Add InfixOperatorWithTrailingParselet, for commas and semis
Only report UnlikelyEscapeSequence if not a valid character
Report strange space characters
Report strange newline characters
0.14
Switch to using unique_ptr implementation.
Simplify handling line continuation of just \r
No need to issue warning for errors being strange
Simplify handling \ at end of file
Return TOKEN_ERROR_EMPTYSTRING at EOF when appropriate
Prevent SourceManager from advancing past EOF
Do not count SyntaxErrorNodes as being strange at top-level
Do the favor of combining naked \ with next character for better error reporting
Call back to kernel for LongName suggestions
Add MultiBoxNode, for handling multiple inputs in an Input cell
Make sure that ParseLeafs remembers any SyntaxIssues
Support different SourceStyles
Space or Newline chars that are directly encoded are not strange
Remove string methods from SourceCharacter and WLCharacter, and provide iterators instead
Differentiate between line continuations with different newlines
Optimizing: Collect flag fields into single bitset fields
Optimizing: eliminate dynamic_cast
Develop system for using RAII to automatically handle queueing unhandled whitespace
Distinguish between SyntaxIssues and FormatIssues
Allow :: to work when parsing boxes
Allow ? to work when parsing boxes
Convert all syntax issues in C++ code to use CodeActions
Add LeafSeqNode and NodeSeqNode classes in order to alleviate the need to constantly iterate through vectors
Employ some strategies to reduce copying Tokens so much
Letterlike characters can be strange, or very strange (with higher confidence that it is a problem)
Better error handling for f[1\[Alpa]2]
Added Intra[] construct for specifying positions within tokens
Handle \[Alpa] being parsed as boxes
Start work to make TokenEnum contain other useful bits
Make sure that SyntaxErrorNodes always have children, and are not just a leaf node
Convert Token errors into appropriate SyntaxErrorNodes when abstracting
Allow e.g., { + } to be parsed correctly as boxes
Revamped the error handling in the parser, so that unexpected closing brackets and unexpected operators do not eat any unnecessary whitespace. A lot of work and kind of ugly, but maybe error handling has to be ugly.
Use LongName for making characters graphical, if available
Add check for strange characters
Allow boxes with << to be parsed properly
Drop ImplicitNull when converting back to boxes
Add checks for strange Unicode characters
When parsing a leaf, treat multiple tokens of whitespace as a single token of whitespace, similar to how FE works
Remove special handling of NonAssociativity with DirectedEdge and UndirectedEdge. This was bug 206938 and is now fixed.
Introduce AbstractFormatIssues, to allow warning about unneeded line continuations
Have ScopedIFS manage the data buffer from a file, and pass it to SourceManager
Treat \r\n as a single Newline token
Fixes
Fix text mode error found on Windows
Fix bug where \\ in a string, at end of line, on Windows, gave an assert
Fix OptionalDefaultPatternNode
Fix when UnhandledDot can happen
Fix parsing TagSetDelayed and TagUnset
Fix when ReplaceNode CodeAction is the entire expression (ReplacePart does not work)
Fix reporting of EOF in escape sequences
Cleanup
Move NonAssociative error handling to Abstract.wl
Convert tokenQueue to a deque, since there are so many insertions in the front
Remove append from Parser, and only have prepend
Move library-related stuff to Library.wl
Organize Token and LongName files
Remove unneeded use of unique_ptr
0.15
Update Source of nodes to be half-open. This is a change from earlier versions, where the Source was always inclusive.
For example, here is the old Source for the integer 123:
In[2]:= ConcreteParseString["123"]
Out[2]= ContainerNode[String, {LeafNode[Integer,
"123", <|Source -> {{1, 1}, {1, 3}}|>]}, <||>]
And now here is the new Source:
In[3]:= ConcreteParseString["123"]
Out[3]= ContainerNode[String, {LeafNode[Integer,
"123", <|Source -> {{1, 1}, {1, 4}}|>]}, <||>]
This change has a number of nice qualities. It is now easy to determine the length of the token by subtracting the start from the end, and 0-length tokens can now be represented accurately.
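As a small Python sketch of that arithmetic (the function name is hypothetical, and the sketch assumes a token that sits on a single line):

```python
# Illustrative sketch only; token_length is a hypothetical helper.

def token_length(source):
    """Length of a single-line token from a half-open Source
    of the form ((start_line, start_col), (end_line, end_col))."""
    (start_line, start_col), (end_line, end_col) = source
    if start_line != end_line:
        raise ValueError("this sketch only handles single-line tokens")
    return end_col - start_col   # no "+ 1" needed with half-open ranges


integer_123 = ((1, 1), (1, 4))   # new Source for the integer 123
empty_token = ((1, 4), (1, 4))   # a 0-length token just after it
```

Here token_length(integer_123) is 3 and token_length(empty_token) is 0. With the old inclusive Source, {{1, 1}, {1, 3}}, the same computation would need an extra + 1, and an empty token would have no natural representation.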
Standardize on using ContainerNode as the outer-most node
Add InsertNodeAfter CodeAction command
Add InsertTextAfter CodeAction command
FormatIssues now explicitly supply their CodeActions
Do not complain about unexpected line continuations in comments
Add more operators:
DoubleRightTee
DoubleLeftTee
UpTee
DownTee
RoundImplies
Perpendicular
etc.
Disable treating BMP PUA as strange for now
REPLACEMENT CHARACTER 0xfffd is strange; this allows flagging of bad UTF-8 in the linter
Introduce \r\n as a single SourceCharacter. This greatly simplifies newline handling.
Standardize on using Whitespace as a token
Add Listable version of Tokenize
InlinePart longname is unsupported
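The idea behind treating \r\n as a single SourceCharacter, listed above, can be sketched in Python. This is only a picture of the concept under assumed names; it says nothing about how the C++ code is actually organized.

```python
# Illustrative sketch only; function names are hypothetical.

def source_characters(text):
    """Split text into source characters, keeping \\r\\n together as one."""
    chars = []
    i = 0
    while i < len(text):
        if text.startswith("\r\n", i):
            chars.append("\r\n")   # one source character, two bytes
            i += 2
        else:
            chars.append(text[i])
            i += 1
    return chars


def count_newlines(text):
    """Later stages can test each source character on its own,
    with no lookahead for a \\n following a \\r."""
    newlines = ("\r\n", "\n", "\r")
    return sum(1 for ch in source_characters(text) if ch in newlines)
```

Once the pairing is done at this level, code that counts or classifies newlines no longer has to special-case a \r followed by \n.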
Fixes
Bring in several fixes found from fuzz testing
When parsing a - b + c, make sure to give the abstracted Times expression the correct Source.
Fix implicit Times in boxes by giving it the same Source as the RHS
Fix parsing single-digit precision
Fix line continuations in # and % tokens
Abstract HermitianConjugate into ConjugateTranspose
Handle Unicode non-characters and BOM
Fix precedence problem of ++a++ and --a--
Cleanup
Remove unused MissingOpener parts
Do not treat - as separate binary operator from +. Combine - parsing with +.
Do not check for strange characters in comments
0.15.1
Only load expr lib functions if expr lib exists
Fix FileExistsQ::fstr that can happen with earlier versions
1.0
Properly abstract unquoted strings
Rename ParseLeaf -> ConcreteParseLeaf
Add support for System`Private`NewContextPath / System`Private`RestoreContextPath
Do better job with ToStandardFormBoxes handling multiple inputs separated by newlines
Simplify newline handling so that newline tokens are contiguous and half-open, just like all other tokens
Enable multi-line mode for matching chunks.
Generate Parselet registrations at build-time
Convert various std::maps to sorted std::arrays
Make all parsing functions listable
Insert "Definition" metadata for functions
Allow + +a to parse as +a, the same as kernel
Change from old syntax CodeParse[str, h] to new syntax CodeParse[str, ContainerNode -> h]
Transition to purely parselet-driven parsing
Remove implicit Times logic from parser, and properly handle inside parselets.
No longer need to pay the cost of checking implicitTimes boolean for every parse.
Fixes
Cleanup
Combine handling of Inequality and VectorInequality
Combine handling of Infix and Inequality
Cleanup several issues from fuzz testing