CodeParser Release Notes
0.7
Separate Concrete Syntax Trees and Abstract Syntax Trees.
Moved unused generated files out of paclet layout.
Introduced expression depth and expression breadth warnings.
Added wl-ast as a Paclet Resource.
Added notes in README.md about Antivirus problems on Windows.
Added TernaryOperatorsToSymbol
Renamed LinearSyntaxBang -> PrefixLinearSyntaxBang
Added PrefixInvisiblePrefixScriptBase
Added PostfixInvisiblePostfixScriptBase
Renamed InfixImplicitPlus -> BinaryImplicitPlus
Added BinaryInvisibleTimes
brenton2maclap:MacOSX-x86-64 brenton$ ./wl-ast -format characters
>>> 1+1
{
WLCharacter[49, "1", <|Source->{{1, 1}, {1, 1}}|>],
WLCharacter[43, "+", <|Source->{{1, 2}, {1, 2}}|>],
WLCharacter[49, "1", <|Source->{{1, 3}, {1, 3}}|>],
WLCharacter[-1, "", <|Source->{{2, 0}, {2, 0}}|>],
Nothing
}
brenton2maclap:MacOSX-x86-64 brenton$
Added logic to use PacletResources to find wl-ast executable.
Added ConcreteParse functions and start separating out concrete and abstract parse trees.
0.8
Correct concrete and abstract ' (Derivative) parsing
Abstract parsing errors for a b c and a b c
Add a stop-gap for parsing large expressions containing + and -
The concrete syntax tree was originally treating exprs of + and exprs of - as separate infix nodes, which were then combined when abstracted. However, this is a problem for expressions that heavily alternate between + and -, e.g., a + b - c + d - e + f - g ... This would create a deeply nested expression, and the internal call to ToExpression would then fail, even though the kernel can parse the original expression. The limitation of ToExpression is understood. Introduce a stop-gap that treats a + b - c + d - e + f - g ... as a single infix node, with InternalMinusNode for minus nodes. Eventually, when we move to something like LibraryLink / WSTP, we can move back to separate parse trees for + and -.
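The nesting problem described above can be pictured with a small language-neutral sketch. This is not CodeParser's implementation: the tuple node shapes and the tags "Plus", "Minus", "InfixPlus", and "InternalMinus" are hypothetical stand-ins chosen for illustration. It only shows why alternating operators nest deeply when + and - are separate nodes, and why a single infix node stays shallow.

```python
# Illustrative sketch only; node shapes and tags are hypothetical.
# A node is a tuple (tag, children); a leaf is a plain string.

def parse_nested(tokens):
    """Build a tree where + and - are separate infix nodes."""
    tree = tokens[0]
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        tag = "Plus" if op == "+" else "Minus"
        if isinstance(tree, tuple) and tree[0] == tag:
            # Same operator as the current node: extend it.
            tree = (tag, tree[1] + [operand])
        else:
            # Operator changed: wrap the tree so far in a new node.
            tree = (tag, [tree, operand])
    return tree


def parse_flat(tokens):
    """Build one infix node; subtracted operands get a minus wrapper."""
    children = [tokens[0]]
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        children.append(operand if op == "+" else ("InternalMinus", [operand]))
    return ("InfixPlus", children)


def depth(tree):
    """Nesting depth of a tree; leaves have depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(child) for child in tree[1])


tokens = "a + b - c + d - e + f - g".split()
print(depth(parse_nested(tokens)))  # grows with each alternation
print(depth(parse_flat(tokens)))    # independent of expression length
```

In the nested form the depth grows with every change of operator, while the flat form has a fixed depth no matter how long the expression is.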
Use FindFile to help with things like ParseFile["Foo`"]
Fix more cases for DeclarationName
Add support for all set relation operators
Fix certain infix longname operators being parsed as binary operators
Better error handling for ToInputFormString and ToFullFormString
Make sure to delete all new'd memory in wl-ast
Some work on documentation notebooks.
Add Boxes.wl, providing a rudimentary CSTToBoxes function
Make Divide binary instead of infix
Enforce NonASCII restriction on strings and files. This comes from a restriction of RunProcess and will be addressed in a future update.
Preserve the difference between characters provided as byte-encoded and characters provided as \-encoded
For example, the byte 0x0a and the bytes 0x5c 0x6e both encode the newline character. There are times when the distinction is important and must be preserved: \ syntax for the newline character cannot be used outside of strings.
It is also a bit nicer to preserve the encoding that was provided. One caveat is that \ encoding does canonicalize to one form. For example, \n, \:000a, \.0a, and \012 all encode the newline character, but it is canonicalized to \n. The general pattern for canonicalization is to first prefer the short \x form if possible (e.g., \n, \r, \t), then prefer the long name \[Name] syntax, and then prefer \:xxxx syntax.
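The preference order for canonicalization can be sketched as follows. This is an illustration only, not the parser's code: the function names are made up, the two tables are tiny assumed subsets (the real lists of short escapes and long names are much larger), and the sketch only covers code points that fit in four hex digits.

```python
# Sketch of the preference order only; both tables are assumed samples.
SHORT_ESCAPES = {0x0A: "\\n", 0x0D: "\\r", 0x09: "\\t"}
LONG_NAMES = {0x03B1: "Alpha", 0x03B2: "Beta"}


def canonical_escape(code_point):
    """Pick the canonical \\ spelling for a code point."""
    if code_point in SHORT_ESCAPES:          # 1. short form, e.g. \n
        return SHORT_ESCAPES[code_point]
    if code_point in LONG_NAMES:             # 2. long name, \[Name]
        return "\\[" + LONG_NAMES[code_point] + "]"
    return "\\:{:04x}".format(code_point)    # 3. 4-digit hex, \:xxxx


def decode_escape(text):
    """Decode the short, \\:xxxx, \\.xx, and octal spellings to a code point."""
    short = {"\\n": 0x0A, "\\r": 0x0D, "\\t": 0x09}
    if text in short:
        return short[text]
    if text.startswith("\\:") or text.startswith("\\."):
        return int(text[2:], 16)
    return int(text[1:], 8)  # octal form such as \012


for spelling in ["\\n", "\\:000a", "\\.0a", "\\012"]:
    print(spelling, "->", canonical_escape(decode_escape(spelling)))
```

All four spellings of the newline character decode to the same code point and come back out as \n.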
0.9
Remove explicit lists of letterlike long names, and add explicit list of uninterpretable long names.
All valid letterlike characters should work now, and any \:xxxx characters are flagged as strange.
Fill out the other lists of characters.
All characters should now be categorized (letterlike, operator, space, newline, comma, uninterpretable)
Add >>> operator.
Properly stringify args after ::, <<, >>, and >>>.
Stop using an executable.
Build a shared library and use LibraryLink and MathLink.
Support building with earlier versions of Mathematica. Building with version 11.0 is supported.
Separate SyntaxIssues and AbstractSyntaxIssues.
Allow File[] wrappers to work.
0.10
Put Source information on link as a packed array.
Properly abstract strings in operators that do their own stringification (::, >>, >>>, <<, #).
Handle parsing #"123" and #a`b
Sandbox mode is now respected in the library.
Support abstracting VectorLess, VectorGreater, etc.
Re-add remarks about invalid UTF-8 sequences.
Add a remark about stray \r characters.
Support abstracting BeginPackage[]/EndPackage[], Begin[]/End[], and BeginStaticAnalysisIgnore[]/EndStaticAnalysisIgnore[].
0.11
Include operator tokens in concrete syntax trees.
Include comments as a separate list returned by ConcreteParse functions.
Include warnings about using unsupported and undocumented characters.
Include warnings about strange top-level expressions.
0.12
Consolidate various atom Nodes into single LeafNode
Fixed handling of runs of multiple ;;
Comments, whitespace, and newlines are now returned in concrete syntax.
Introduce new concept of aggregate syntax which is concrete syntax, but with comments, whitespace, and newlines removed.
Aborts are handled more gracefully.
0.13
Add ConcreteParseBox, ToStandardFormBoxes, and ToSourceCharacterString functions.
Switch to sending MakeLeafNode calls over MathLink, for performance.
Add some missing operators, \[Colon], \[CupCap], etc.
Other performance improvements.
Add Did You Mean for /@ for / at top-level
Move to using a struct with bitfields for WLCharacter.
Read entire file into buffer and store in SourceManager
Create single nodes for Inequalities and VectorInequalities
Preserve line continuations in concrete syntax
Introduce a token for =. which is needed for box support
The characters \" and \\ are mapped into special codes
Warn about line continuations inside comments
Rename implicit tokens ImplicitNull, ImplicitOne, ImplicitAll
Add InfixOperatorWithTrailingParselet, for commas and semis
Only report UnlikelyEscapeSequence if not a valid character
Report strange space characters
Report strange newline characters
0.14
Switch to using unique_ptr implementation.
Simplify handling line continuation of just \r
No need to issue warning for errors being strange
Simplify handling \ at end of file
Return TOKEN_ERROR_EMPTYSTRING at EOF when appropriate
Prevent SourceManager from advancing past EOF
Do not count SyntaxErrorNodes as being strange at top-level
Do the favor of combining naked \ with next character for better error reporting
Call back to kernel for LongName suggestions
Add MultiBoxNode, for handling multiple inputs in an Input cell
Make sure that ParseLeafs remembers any SyntaxIssues
Support different SourceStyles
Space or Newline chars that are directly encoded are not strange
Remove string methods from SourceCharacter and WLCharacter, and provide iterators instead
Differentiate between line continuations with different newlines
Optimizing: Collect flag fields into single bitset fields
Optimizing: eliminate dynamic_cast
Develop system for using RAII to automatically handle queueing unhandled whitespace
Distinguish between SyntaxIssues and FormatIssues
Allow :: to work when parsing boxes
Allow ? to work when parsing boxes
Convert all syntax issues in C++ code to use CodeActions
Add LeafSeqNode and NodeSeqNode classes in order to alleviate the need to constantly iterate through vectors
Employ some strategies to reduce copying Tokens so much
Letterlike characters can be strange, or very strange (with higher confidence that it is a problem)
Better error handling for f[1\[Alpa]2]
Added Intra[] construct for specifying positions within tokens
Handle \[Alpa] being parsed as boxes
Start work to make TokenEnum contain other useful bits
Make sure that SyntaxErrorNodes always have children, and are not just a leaf node
Convert Token errors into appropriate SyntaxErrorNodes when abstracting
Allow e.g., { + } to be parsed correctly as boxes
Revamped the error handling in the parser, so that unexpected closing brackets and unexpected operators do not eat any unnecessary whitespace. Lot of work and kind of ugly. But maybe error handling has to be ugly
Use LongName for making characters graphical, if available
Add check for strange characters
Allow boxes with << to be parsed properly
Drop ImplicitNull when converting back to boxes
Add checks for strange Unicode characters
When parsing a leaf, treat multiple tokens of whitespace as a single token of whitespace, similar to how FE works
Remove special handling of NonAssociativity with DirectedEdge and UndirectedEdge. This was bug 206938 and is now fixed.
Introduce AbstractFormatIssues, to allow warning about unneeded line continuations
Have ScopedIFS manage the data buffer from a file, and pass it to SourceManager
Treat \r\n as a single Newline token
Fixes
Fix text mode error found on Windows
Fix bug where \\ in a string, at end of line, on Windows, gave an assert
Fix OptionalDefaultPatternNode
Fix when UnhandledDot can happen
Fix parsing TagSetDelayed and TagUnset
Fix when ReplaceNode CodeAction is the entire expression (ReplacePart does not work)
Fix reporting of EOF in escape sequences
Cleanup
Move NonAssociative error handling to Abstract.wl
Convert tokenQueue to a deque, since there are so many insertions in the front
Remove append from Parser, and only have prepend
Move library-related stuff to Library.wl
Organize Token and LongName files
Remove unneeded use of unique_ptr
0.15
Update Source of nodes to be half-open. This is a change from earlier versions, where the Source was always inclusive.
For example, here is the old Source for the integer 123:
In[2]:= ConcreteParseString["123"]
Out[2]= ContainerNode[String, {LeafNode[Integer,
"123", <|Source -> {{1, 1}, {1, 3}}|>]}, <||>]
And now here is the new Source:
In[3]:= ConcreteParseString["123"]
Out[3]= ContainerNode[String, {LeafNode[Integer,
"123", <|Source -> {{1, 1}, {1, 4}}|>]}, <||>]
This change has a number of nice qualities. It is now easy to determine the length of the token by subtracting the start from the end, and 0-length tokens can now be represented accurately.
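The arithmetic behind the two conventions can be sketched in a few lines. This is an illustration, not CodeParser code: the function names are made up, a Source is modeled as a pair of (line, column) pairs, and only single-line tokens are handled.

```python
# Illustrative sketch; a Source is ((start_line, start_col), (end_line, end_col)).

def length_half_open(source):
    """Token length under the new half-open convention: end - start."""
    (start_line, start_col), (end_line, end_col) = source
    if start_line != end_line:
        raise ValueError("this sketch only handles single-line tokens")
    return end_col - start_col


def length_inclusive(source):
    """Token length under the old inclusive convention: end - start + 1."""
    (start_line, start_col), (end_line, end_col) = source
    if start_line != end_line:
        raise ValueError("this sketch only handles single-line tokens")
    return end_col - start_col + 1


print(length_inclusive(((1, 1), (1, 3))))  # old Source for 123
print(length_half_open(((1, 1), (1, 4))))  # new Source for 123
print(length_half_open(((1, 4), (1, 4))))  # a 0-length token
```

Under the inclusive convention the smallest Source, with start equal to end, already has length 1, so a 0-length token has no natural representation. Under the half-open convention, start equal to end means length 0.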
Standardize on using ContainerNode as the outer-most node
Add InsertNodeAfter CodeAction command
Add InsertTextAfter CodeAction command
FormatIssues now explicitly supply their CodeActions
Do not complain about unexpected line continuations in comments
Add more operators:
DoubleRightTee
DoubleLeftTee
UpTee
DownTee
RoundImplies
Perpendicular
etc.
Disable treating BMP PUA as strange for now
REPLACEMENT CHARACTER 0xfffd is strange; this allows flagging of bad UTF-8 in the linter
Introduce \r\n as a single SourceCharacter. This greatly simplifies newline handling.
Standardize on using Whitespace as a token
Add Listable version of Tokenize
InlinePart longname is unsupported
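As a rough picture of what treating \r\n as a single SourceCharacter means, the sketch below splits text into source characters so that a \r\n pair is one unit while a stray \r or a lone \n remains its own unit. The function name is hypothetical and this is not the parser's code.

```python
# Illustrative sketch; not CodeParser's implementation.

def source_characters(text):
    """Split text into units, keeping each \\r\\n pair together."""
    chars = []
    i = 0
    while i < len(text):
        if text.startswith("\r\n", i):
            chars.append("\r\n")  # one unit for the two-byte newline
            i += 2
        else:
            chars.append(text[i])
            i += 1
    return chars


print(source_characters("a\r\nb\rc\n"))
```

With newlines normalized at this level, later stages only ever see one newline unit per line ending, regardless of platform.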
Fixes
Bring in several fixes found from fuzz testing
When parsing a - b + c, make sure to give the abstracted Times expression the correct Source.
Fix implicit Times in boxes by giving it the same Source as the RHS
Fix parsing single-digit precision
Fix line continuations in # and % tokens
Abstract HermitianConjugate into ConjugateTranspose
Handle Unicode non-characters and BOM
Fix precedence problem of ++a++ and --a--
Cleanup
Remove unused MissingOpener parts
Do not treat - as separate binary operator from +. Combine - parsing with +.
Do not check for strange characters in comments
0.15.1
Only load expr lib functions if expr lib exists
Fix FileExistsQ::fstr that can happen with earlier versions
1.0
Properly abstract unquoted strings
Rename ParseLeaf -> ConcreteParseLeaf
Add support for System`Private`NewContextPath / System`Private`RestoreContextPath
Do better job with ToStandardFormBoxes handling multiple inputs separated by newlines
Simplify newline handling so that newline tokens are contiguous and half-open, just like all other tokens
Enable multi-line mode for matching chunks.
Generate Parselet registrations at build-time
Convert various std::maps to sorted std::arrays
Make all parsing functions listable
Insert "Definition" metadata for functions
Allow + +a to parse as +a, the same as kernel
Change from old syntax CodeParse[str, h] to new syntax CodeParse[str, ContainerNode -> h]
Transition to purely parselet-driven parsing
Remove implicit Times logic from parser, and properly handle inside parselets.
No longer need to pay the cost of checking implicitTimes boolean for every parse.
Fixes
Cleanup
Combine handling of Inequality and VectorInequality
Combine handling of Infix and Inequality
Cleanup several issues from fuzz testing
1.1
September 30, 2020
API changes
Return a Rational when parsing something like 1*^-2
This returned an Integer before.
Add Token`PercentPercent as a variable length token
Add CompoundNode
Add CodeSyntaxQ, CodeSyntaxCSTQ, CodeStructuralSyntaxQ
Add Token`Error`UnterminatedFileString
Add Token`Boxes`CommentContent
Add UnterminatedTokenErrorNeedsReparseNode
Add Token`Error`UnexpectedCloser
Add GroupMissingOpenerNode (only used in boxes)
Add CallMissingCloserNode
Add QuaternaryNode and FragmentNode
Add new nodes and tags: MemoizedSetDelayed, MemoizedTagSetDelayed, UnterminatedGroupNode, UnterminatedCallNode
Add "TabWidth" option to functions and correctly calculate columns using tab stops
Graduate SourceConvention option to be a symbol
Add -check flag to command-line parser
Remove OptionalDefaultNode and just use Token`UnderDot directly.
Remove Token`LineContinuation. The actual line continuation string is now always attached to the previous or next token.
Currently, it's always the next token, but that is an implementation detail.
Remove the 3 fake line continuation code points. These are no longer used.
Remove UnexpectedLineContinuation FormatIssue. This is no longer used.
NOTE: line continuations are NOT currently handled in places where SourceCharacters are used directly: comments and file stringification. This is not relevant for comments, but it is wrong to not handle line continuations during file stringification.
Propagate EncodingIssue in a few places
Treat EqualDot as a proper binary operator
Add new |-> operator for Function.
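For the "TabWidth" option listed above, column calculation with tab stops can be sketched as follows. This is a sketch under stated assumptions, not the library's code: it assumes 1-based columns and that a tab advances to the next tab stop, where tab stops fall every tab_width columns starting at column 1. The function names are hypothetical.

```python
# Illustrative sketch; assumes 1-based columns and tab stops at
# columns 1, 1 + tab_width, 1 + 2 * tab_width, ...

def next_column(column, char, tab_width):
    """Column reached after consuming char while at the given column."""
    if char == "\t":
        # Jump to the next tab stop strictly after the current column.
        return ((column - 1) // tab_width + 1) * tab_width + 1
    return column + 1


def end_column(line, tab_width):
    """Column just past the last character of a line."""
    column = 1
    for char in line:
        column = next_column(column, char, tab_width)
    return column


print(end_column("\tx", 4))
print(end_column("ab\tx", 4))
print(end_column("\tx", 8))
```

With a tab width of 4, both "\tx" and "ab\tx" end at the same column, because the tab in each case advances to the same tab stop.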
Performance improvements
Improve stack shim performance
Use results from some profiling to speed up tokenizing numbers.
Dynamically set GroupMissingCloserNeedsReparseNode and UnterminatedTokenErrorNeedsReparseNode while parsing, instead of doing Replace.
Use Normal[ReadByteArray[#]]& instead of Import[#, "Byte"]
Remove asserts from aggregate and try to not use local variables when possible
Error handling
Add better handling for syntax with .. and ... where backtracking is needed
Examples are 2^^.. and 123*^2..
Continue parsing numbers with invalid bases or unrecognized digits (because of the base)
Miscellaneous
A massive number of minor bug fixes, typo fixes, reorganizations, tweaks, and cleanup
1.1.1
December 8, 2020
Fixes
The parser side of what needs to be fixed for 398836: implicit Times tokens do need to be present in the box cst.
I had been remiss in treating unhandled boxes as implicit Times.
So start doing that now. But do the easy thing and simply riffle in ImplicitTimes tokens.
This is not strictly correct, but the analysis needed to properly insert ImplicitTimes tokens is too complicated in the current location.
Note: Mixing explicit Times operators and implicit Times does not currently work
Fix LinearSyntaxBlob in wrong context
1.2
Provide more ASCII approximations for long names
Handle linear syntax as tokens, but they are unsupported
A workflow is removing the \( \) from a LinearSyntaxBlob and then tokenizing.
>> and >>> should not give top-level lints
Use the source of just the ; when reporting top-level CompoundExpression
Move stray comma handling back to concrete parser and teach -check flag to fail with stray commas
For unexpected characters, display both the actual encoded representation and the escaped form.
This will help with diagnosing problems with invisible characters
Add a new Kernel file RowBox.wl that is generated at build-time.
Hard-code more values for parseBox
In the fall-through cases of parseBox, be more efficient with testing different string cases
Remove StringifyMode 3 (passthrough). Just hard-code these cases
Allow "FileFormat" -> "Script" for ignoring shebang in .wls files
Provide better error reporting for \Alpha]
Add ScopingData[ast] function for gathering information about scoping constructs
Introduce DefinitionSymbols function and start using it
This is a saner function than the older DeclarationName
Provide some selectors for CodeAction objects
Move ToInputFormString
Compile` also has a ToInputFormString symbol, so to prevent shadowing messages,
move CodeParser`ToInputFormString to CodeParser`ToString`ToInputFormString
and rely on CodeParser`ToString` not being on $ContextPath (unless explicitly loaded, of course)
Allow CodeConcreteParse of Cells, Notebooks, CellObjects, and NotebookObjects
Lower confidence of package-related issues that are not 100% errors
Fixes
Fix warning from MSVC: using universal-character-name \u2423 for ␣ was technically not correct, since MSVC is set up with a different code page by default
Fix bug 404282, and related issues
CodeConcreteParse["\"\r"] would give messages because the \r newline was not being handled correctly
Fix that, and also fix issues with \r\n newlines where the indices from the SourceCharacterIndex were wrong because of treating \r\n as a single source character
Handle TernaryTilde at top-level
Add UpSet and UpSetDelayed to top-level definitions
Syntax such as Attributes[foo] = {HoldAll} is a definition for foo and not for Attributes
.wl files may or may not be scripts starting with #!, so both cases must be handled
In the doc page:
tutorial/FilesStreamsAndExternalOperations
there is a paragraph that starts:
"Scripts may be stored either in normal .wl package files or in dedicated .wls script files."
Fix 406933: specific comment makes CodeParser give messages and go into infinite loop
1.3
Notes on compatibility have been added to docs/compatibility.md
Introduce BatchMode option for Abstract
BatchMode -> True where Begin[] and End[] nodes will be at top-level (e.g., .wl files)
PackageNodes[] and ContextNodes[] WILL be created
Issues about unbalanced directives WILL be created
BatchMode -> False otherwise, i.e., where Begin[] and End[] nodes are separate or not easily scanned together (e.g., cells in notebooks)
PackageNodes[] and ContextNodes[] will NOT be created
Issues about unbalanced directives will NOT be created
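The effect of BatchMode on unbalanced-directive issues can be illustrated with a small sketch. It is not the Abstract implementation: the function name, the "Unbalanced" tag, and the representation of directives as a list of names are all hypothetical. It only models the rule that issues about unbalanced directives are created when batch mode is on and not created otherwise.

```python
# Illustrative sketch; names and issue shapes are hypothetical.

def unbalanced_directive_issues(directives, batch_mode):
    """Report unmatched Begin/End-style directives, only in batch mode."""
    if not batch_mode:
        # Directives may live in separate cells, so nothing is reported.
        return []
    openers_for = {"End": "Begin", "EndPackage": "BeginPackage"}
    stack = []
    issues = []
    for index, name in enumerate(directives):
        if name in openers_for.values():
            stack.append((name, index))
        elif name in openers_for:
            if stack and stack[-1][0] == openers_for[name]:
                stack.pop()
            else:
                issues.append(("Unbalanced", name, index))
    # Anything still open at the end was never closed.
    issues.extend(("Unbalanced", name, index) for name, index in stack)
    return issues


print(unbalanced_directive_issues(["BeginPackage", "Begin", "End"], True))
print(unbalanced_directive_issues(["BeginPackage", "Begin", "End"], False))
```

The same input yields an issue for the unclosed BeginPackage in batch mode and no issues otherwise, which matches the intent of scanning a whole .wl file versus individual notebook cells.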
Fixes
Fix 409304: Teach parser about TernaryOptionalPattern
TernaryOptionalPattern comes from boxes
Multi-pronged fix for 409210:
\[Prime] is not strange
Add SubscriptBox and SuperscriptBox as not strange for head of calls
Fix 409216: Teach abstract and scoping about GridBox
Fix 409472: The abstracting code assumed that the Source was always LineColumn convention, and this was breaking for boxes
So introduce some functions to handle the different source conventions
410337: Fix DefinitionSymbols for errors
410404: Fix parsing \[Integral] a + 2
Trivia2 was not scoped properly and was not being cleaned up at the correct time
Fix 405322: Symbols occurring in options to DynamicModule have same scope as body