-
Notifications
You must be signed in to change notification settings - Fork 6.1k
Fix inconsistent negative subassembly indices between different sizeof(size_t) #15955
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
8871aa9
to
12b9a65
Compare
12b9a65
to
1ea2cc4
Compare
Isn't that the {
"language": "Yul",
"sources":
{
"C":
{
"content": "{ let x := mload(0) sstore(add(x, 0), 0) }"
}
},
"settings":
{
"outputSelection":
{
"*": { "*": ["evm.assembly"], "": [ "*" ] }
}
}
} |
The equivalent that is showing the |
1ea2cc4
to
c20a453
Compare
evmasm::AssemblyItem addSubroutine(evmasm::AssemblyPointer const& _assembly) { return m_asm->appendSubroutine(_assembly); } | ||
/// Pushes the size of the subroutine. | ||
void pushSubroutineSize(size_t _subRoutine) { m_asm->pushSubroutineSize(_subRoutine); } | ||
void pushSubroutineSize(evmasm::SubAssemblyID _subRoutine) { m_asm->pushSubroutineSize(_subRoutine); } | ||
/// Pushes the offset of the subroutine. | ||
void pushSubroutineOffset(size_t _subRoutine) { m_asm->pushSubroutineOffset(_subRoutine); } | ||
void pushSubroutineOffset(evmasm::SubAssemblyID _subRoutine) { m_asm->pushSubroutineOffset(_subRoutine); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it's the first time I see assemblies referred to as subroutines. It really stretches the definition. From Wikipedia:
In computer programming, a function (also procedure, method, subroutine, routine, or subprogram) is a callable unit[1] of software logic that has a well-defined interface and behavior and can be invoked multiple times.
It's going to become very confusing with EOF, where we have both functions (code sections) and assemblies (containers). I'd by default interpret "subroutine" to mean to former (there was even a competing "simple subroutines" EIP). We should really rename this at some point.
@@ -88,6 +88,6 @@ class EthAssemblyAdapter: public AbstractAssembly | |||
|
|||
evmasm::Assembly& m_assembly; | |||
std::map<SubID, u256> m_dataHashBySubId; | |||
size_t m_nextDataCounter = std::numeric_limits<size_t>::max() / 2; | |||
SubID::value_type m_nextDataCounter = std::numeric_limits<SubID::value_type>::max() / 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, I thought I understood what's happening and then I ran into this. WTF are we doing? It just makes my head hurt. If I understand correctly:
Assembly
keeps track of chunks of data and subassemblies separately.- Assemblies are stored in
m_subs
vector. - Data is stored by hash in
m_data
.- Except for metadata, which is singled out and stored in
m_auxiliaryData
.
- Except for metadata, which is singled out and stored in
- Assemblies are stored in
EthAssemblyAdapter
holds the assembly and independently keeps track of all the appended data chunks inm_dataHashBySubId
.- It gives them fake IDs in the upper half of the subassembly ID space
- When a Yul object is being compiled,
EVMObjectCompiler
adds subassemblies and data chunks via the adapter, receiving both real and fake IDs.- It stores the IDs in
BuiltinContext
, each one associated with a Yul node by name. - This is done only for children of the object. Anything nested deeper does not get an ID.
- It stores the IDs in
BuiltinContext
is passed to Yul code transform and later to builtins defined inEVMDialect
.- Builtins like
datasize
then useBuiltinContext
to map Yul node names to sub IDs.- If the name is present in
BuiltinContext
, they take the ID stored there. - If not, they assume the name is a dotted path and convert it to a sequence of IDs.
- In either case the sequence is passed into
EthAssemblyAdapter::appendDataSize()
.- It assumes that if the first ID in the sequence is present in
m_dataHashBySubId
, it must be a fake ID assigned to a data chunk. - Otherwise it converts the path to a sub ID using
Assembly::encodeSubPath()
and assumes it must be a subassembly.- There seems to be an assumption that Yul
Object
tree structure matches theAssembly
tree, which I'm not sure is true. For example we sometimes we move things (e.g. long Yul strings) to data chunks and the transition happens in evmasm optimizer, which means that they do not have correspondingdata
nodes at Yul level.
- There seems to be an assumption that Yul
- I actually don't understand how paths to nested data are handled. I don't think they're are added to
BuiltinContext
, but then how do they not end up treated as assemblies byappendDataSize()
?
- It assumes that if the first ID in the sequence is present in
- If the name is present in
- Builtins like
Overall, looks like we have several kinds of sub IDs sharing the same ID space, without any checks for conflicts (just assuming the space is large enough that they won't happen):
- Normal IDs assigned to subassemblies, starting at
0
.- EOF
ContainerID
s are a subset of this and are limited touint8_t
.
- EOF
- Empty ID at
max()
. - Fake data IDs starting at
max() / 2
. - "Negative" IDs representing deeply nested paths, starting at
max()
and going down.
I think we should assign a non-overlapping region to each of them and have asserts enforcing that.
The way we assign IDs should also be more prominently documented. For example m_subPaths
does not even say that the "negative" IDs mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good idea to define non-overlapping regions. That way each of the regions can also be independently and cleanly documented. Perhaps in a follow up?
Regarding documenting subpaths and negative IDs: @r0qs had written something up about it and iirc wanted to add it in some form to the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good idea to define non-overlapping regions. That way each of the regions can also be independently and cleanly documented. Perhaps in a follow up? Regarding documenting subpaths and negative IDs: @r0qs had written something up about it and iirc wanted to add it in some form to the documentation.
Yes, it is here: https://github.com/ethereum/solidity/wiki/Assembly-Indices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a comment to m_subPaths
that the subassembly IDs referenced there occupy the negative index space. I think a more thorough documentation of things should occur once we have defined non-overlapping regions. Then it's also straightforward where to put the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@r0qs I see that the doc does not mention the fake data IDs. Not sure if they can ever show up in assembly output, but even if they can't, they still affect it, because we can never safely allocate other kinds of IDs in the region they take up. So they effectively limit the range of those other kinds IDs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, that was not the initial intention of such documentation. However, I can extend it to cover your suggestion of defining non-overlapping regions for IDs.
@@ -47,9 +48,9 @@ using AssemblyPointer = std::shared_ptr<Assembly>; | |||
|
|||
class Assembly | |||
{ | |||
using TagRefs = std::map<size_t, std::pair<size_t, size_t>>; | |||
using TagRefs = std::map<size_t, std::pair<SubAssemblyID, size_t>>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use size_t
for tag ID but that should really be uint64_t
as well. Would be nice to change that too at some point.
The tag+sub pair would also be better off as a proper struct with methods for converting to/from u256
. We wouldn't then have to hard-code all those masks and 64s all over the place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, it would also make it easier to reason about the whole program and data flow.
6d3ec74
to
7881236
Compare
7881236
to
8e524b6
Compare
cf85d6d
to
3cb8070
Compare
3cb8070
to
bc503b2
Compare
0f05e3c
to
9115fbc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main things needed here are to replace validations with asserts and adjust tests a bit. Other than that mostly minor stuff.
This should be enough for the fix, but it needs to be followed by a PR that properly splits the index space into regions (#15955 (comment)) and documents it. And introduce a type for the tag ID (#15955 (comment)).
libevmasm/Assembly.cpp
Outdated
@@ -798,20 +799,20 @@ std::map<u256, u256> const& Assembly::optimiseInternal( | |||
|
|||
// Run optimisation for sub-assemblies. | |||
// TODO: verify and double-check this for EOF. | |||
for (size_t subId = 0; subId < m_subs.size(); ++subId) | |||
for (SubAssemblyID subID {0}; subID.value < m_subs.size(); ++subID.value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (SubAssemblyID subID {0}; subID.value < m_subs.size(); ++subID.value) | |
for (SubAssemblyID subID{0}; subID.value < m_subs.size(); ++subID.value) |
Also in AssemblyItem::splitForeignPushTag()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really wish we could just settle on one style that has clang format support (or find a close enough approximation of the current style) and be done with it :) the amount of time and energy things like this eat up is unreasonable
libevmasm/JumpdestRemover.cpp
Outdated
@@ -45,7 +45,7 @@ bool JumpdestRemover::optimise(std::set<size_t> const& _tagsReferencedFromOutsid | |||
if (_item.type() != Tag) | |||
return false; | |||
auto asmIdAndTag = _item.splitForeignPushTag(); | |||
assertThrow(asmIdAndTag.first == std::numeric_limits<size_t>::max(), OptimizerException, "Sub-assembly tag used as label."); | |||
solRequire(asmIdAndTag.first.empty(), OptimizerException, "Sub-assembly tag used as label."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this one as well is just a general assumption rather than something that we expect to fail at runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tags were actually always confusing to me, because I could not see in what situation we'd ever jump into another subassembly, so I dug into this now to find out. Turns out we never do. This mechanism seems to exist only to support internal function pointers in legacy codegen. Since we support storage variables of such pointer types and storage can be accessed both at runtime and at creation time, we actually combine two pointers in the slot: to a copy of the function that exists in the creation assembly and to one in the runtime assembly. This not new to me, but I only now realized that to support this we have to be able to refer to tags from the runtime assembly inside the creation assembly. I think that's the only situation where they'll appear.
Here's what it looks like in practice:
contract C {
function() fptr = f;
function f() internal {}
}
EVM assembly:
mstore(0x40, 0x80)
or(tag_0_2, shl(0x20, tag_1))
0x00
...
sstore
...
tag_1:
jump // out
...
stop
sub_0: assembly {
...
tag_2:
jump // out
...
}
Note that there's no tag named tag_0_2
. It refers to tag_2
in sub_0
. I don't think we ever insert such tags into the assembly item stream (and that's what the assertThrow()
is guarding against). We only push them on the stack using PushTag
, but this does not trigger the assert.
Such tags could, of course, appear in assembly import - we should have a validation against them there - but here it should be only an assert.
@@ -61,20 +61,21 @@ AssemblyItem AssemblyItem::toSubAssemblyTag(size_t _subId) const | |||
return r; | |||
} | |||
|
|||
std::pair<size_t, size_t> AssemblyItem::splitForeignPushTag() const | |||
std::pair<SubAssemblyID, size_t> AssemblyItem::splitForeignPushTag() const | |||
{ | |||
solAssert(m_type == PushTag || m_type == Tag || m_type == RelativeJump || m_type == ConditionalRelativeJump); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my investigation, I think we could limit it to just PushTag
:
solAssert(m_type == PushTag || m_type == Tag || m_type == RelativeJump || m_type == ConditionalRelativeJump); | |
solAssert(m_type == PushTag); |
And similarly in setPushTagSubIdAndTag()
and toSubAssemblyTag()
.
Though maybe it's better done in a separate PR. There could be other places like this and it would be good to also have assembly import validations against it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks for investigating this!
@@ -292,10 +292,10 @@ std::vector<std::optional<BuiltinFunctionForEVM>> createBuiltins(langutil::EVMVe | |||
_assembly.appendAssemblySize(); | |||
else | |||
{ | |||
std::vector<size_t> subIdPath = | |||
std::vector<AbstractAssembly::SubID> subIdPath = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just import SubassemblyID.h
and use the type directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it this way because they are used as argument to an method in AbstractAssembly
- which expects AbstractAssembly::SubID
. We could of course get rid of that typedef entirely.
}))); | ||
st.ok(output); | ||
|
||
function containsTarget(obj, target) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
function containsTarget(obj, target) { | |
function containsAssemblyItem(assemblyJSON, item) { |
Also, I think it would be best to limit the search to the legacyAssembly
subobject. It seems that now you're searching the whole object and there could be random stuff there. It would also make this check easier to understand - it took me a moment to understand what's happening here, because this is not at all what I expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also as a general remark for the future - we should at some point create a CI job that runs the emscripten binary with those tests that provide Standard JSON input/output. It would make this whole JS mess unnecessary...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think it would be best to limit the search to the
legacyAssembly
subobject. It seems that now you're searching the whole object and there could be random stuff there. It would also make this check easier to understand - it took me a moment to understand what's happening here, because this is not at all what I expected.
Fair enough, what did you expect then? In my mind it was fairly straightforward to search the result json for some target object and it also seemed reasonably unlikely to me to find the target in our specific case anywhere else in the output. It is not big deal in any case, I am changing it to only consider the legacyAssembly
subobject.
Also as a general remark for the future - we should at some point create a CI job that runs the emscripten binary with those tests that provide Standard JSON input/output. It would make this whole JS mess unnecessary...
Yes please!!
Changelog.md
Outdated
@@ -17,6 +17,7 @@ Compiler Features: | |||
|
|||
|
|||
Bugfixes: | |||
* Commandline Interface: Fix possible inconsistency in subassembly IDs between target architectures in `--asm-json` output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not just the interface:
* Commandline Interface: Fix possible inconsistency in subassembly IDs between target architectures in `--asm-json` output. | |
* Assembler: Fix different IDs being assigned to subassemblies nested more than one level away, resulting in inconsistent `--asm-json` output between target architectures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this needs to be moved to 0.8.31 now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not just the interface:
Well yeah, although assuming the ID space isn't exhausted, it does behave consistently inside of the assembler. I don't mind either way so let's go with your phrasing. Although I would minimally amend it:
* Assembler: Fix not using a fixed-width type for IDs being assigned to subassemblies nested more than one level away, resulting in inconsistent
--asm-json output between target architectures.
b033262
to
2a41349
Compare
2a41349
to
2cdb82f
Compare
2cdb82f
to
de0a985
Compare
When computing object ids for referencing subassemblies, these ids are currently determined as negative DFS enumeration based on
size_t
:solidity/libevmasm/Assembly.cpp
Line 1825 in a652292
This PR changes this to
uint64_t
, so that assembly text - in particularPUSH #[$]
instructions - is consistent over different type sizes ofsize_t
.Since these indices are stored in a map and later referenced for exporting and importing, as well as (currently)
numeric_limits<size_t>::max()
is used to indicate the root object or an empty state which is used in, eg, various asserts, I found it helpful to wrap theuint64_t
into a struct so that these places become apparent and dealing with the sizes is explicit. This caused a whole avalanche of changes, as such object ids (or more generally:SubAssemblyID
s) are used in many places. In particular, alsoyul::Object::subId
s are subject to the type.We were lacking a cmdline test that has non-zero
PUSH #[$]
es, so I added one. Since emscripten builds do not execute them, we have no direct regression test by this but I have added one to thesolc-js
tests. While doing that I realized that it is currently not possible (to me at least :P) to request ASM json output via standard json interface if the language is Yul.Fixes #15953