Skip to content

Fix inconsistent negative subassembly indices between different sizeof(size_t) #15955

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

clonker
Copy link
Member

@clonker clonker commented Mar 19, 2025

When computing object ids for referencing subassemblies, these ids are currently determined as negative DFS enumeration based on size_t:

size_t objectId = std::numeric_limits<size_t>::max() - m_subPaths.size();

This PR changes this to uint64_t, so that assembly text - in particular PUSH #[$] instructions - is consistent over different type sizes of size_t.

Since these indices are stored in a map and later referenced for exporting and importing, as well as (currently) numeric_limits<size_t>::max() is used to indicate the root object or an empty state which is used in, eg, various asserts, I found it helpful to wrap the uint64_t into a struct so that these places become apparent and dealing with the sizes is explicit. This caused a whole avalanche of changes, as such object ids (or more generally: SubAssemblyIDs) are used in many places. In particular, also yul::Object::subIds are subject to the type.

We were lacking a cmdline test that has non-zero PUSH #[$]es, so I added one. Since emscripten builds do not execute them, we have no direct regression test by this but I have added one to the solc-js tests. While doing that I realized that it is currently not possible (to me at least :P) to request ASM json output via standard json interface if the language is Yul.

Fixes #15953

@clonker clonker force-pushed the fix_asm_subpath_obj_id branch 2 times, most recently from 8871aa9 to 12b9a65 Compare March 19, 2025 11:29
@ethereum ethereum deleted a comment from stackenbotten Mar 19, 2025
@clonker clonker force-pushed the fix_asm_subpath_obj_id branch from 12b9a65 to 1ea2cc4 Compare March 19, 2025 11:32
@r0qs
Copy link
Member

r0qs commented Mar 19, 2025

While doing that I realized that it is currently not possible (to me at least :P) to request ASM json output via standard json interface if the language is Yul.

Isn't that the evm.assembly option?

{
	"language": "Yul",
	"sources":
	{
		"C":
		{
			"content": "{ let x := mload(0) sstore(add(x, 0), 0) }"
		}
	},
	"settings":
	{
		"outputSelection":
		{
			"*": { "*": ["evm.assembly"], "": [ "*" ] }
		}
	}
}

@clonker clonker requested a review from aarlt March 19, 2025 11:45
@clonker
Copy link
Member Author

clonker commented Mar 19, 2025

Isn't that the evm.assembly option?

The equivalent that is showing the PUSH #[$]es would be evm.legacyAssembly

@clonker clonker requested a review from r0qs March 19, 2025 11:47
@clonker clonker force-pushed the fix_asm_subpath_obj_id branch from 1ea2cc4 to c20a453 Compare March 21, 2025 08:09
Comment on lines 230 to +232
evmasm::AssemblyItem addSubroutine(evmasm::AssemblyPointer const& _assembly) { return m_asm->appendSubroutine(_assembly); }
/// Pushes the size of the subroutine.
void pushSubroutineSize(size_t _subRoutine) { m_asm->pushSubroutineSize(_subRoutine); }
void pushSubroutineSize(evmasm::SubAssemblyID _subRoutine) { m_asm->pushSubroutineSize(_subRoutine); }
/// Pushes the offset of the subroutine.
void pushSubroutineOffset(size_t _subRoutine) { m_asm->pushSubroutineOffset(_subRoutine); }
void pushSubroutineOffset(evmasm::SubAssemblyID _subRoutine) { m_asm->pushSubroutineOffset(_subRoutine); }
Copy link
Member

@cameel cameel Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it's the first time I see assemblies referred to as subroutines. It really stretches the definition. From Wikipedia:

In computer programming, a function (also procedure, method, subroutine, routine, or subprogram) is a callable unit[1] of software logic that has a well-defined interface and behavior and can be invoked multiple times.

It's going to become very confusing with EOF, where we have both functions (code sections) and assemblies (containers). I'd by default interpret "subroutine" to mean to former (there was even a competing "simple subroutines" EIP). We should really rename this at some point.

@@ -88,6 +88,6 @@ class EthAssemblyAdapter: public AbstractAssembly

evmasm::Assembly& m_assembly;
std::map<SubID, u256> m_dataHashBySubId;
size_t m_nextDataCounter = std::numeric_limits<size_t>::max() / 2;
SubID::value_type m_nextDataCounter = std::numeric_limits<SubID::value_type>::max() / 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, I thought I understood what's happening and then I ran into this. WTF are we doing? It just makes my head hurt. If I understand correctly:

  • Assembly keeps track of chunks of data and subassemblies separately.
    • Assemblies are stored in m_subs vector.
    • Data is stored by hash in m_data.
      • Except for metadata, which is singled out and stored in m_auxiliaryData.
  • EthAssemblyAdapter holds the assembly and independently keeps track of all the appended data chunks in m_dataHashBySubId.
    • It gives them fake IDs in the upper half of the subassembly ID space
  • When a Yul object is being compiled, EVMObjectCompiler adds subassemblies and data chunks via the adapter, receiving both real and fake IDs.
    • It stores the IDs in BuiltinContext, each one associated with a Yul node by name.
    • This is done only for children of the object. Anything nested deeper does not get an ID.
  • BuiltinContext is passed to Yul code transform and later to builtins defined in EVMDialect.
    • Builtins like datasize then use BuiltinContext to map Yul node names to sub IDs.
      • If the name is present in BuiltinContext, they take the ID stored there.
      • If not, they assume the name is a dotted path and convert it to a sequence of IDs.
        • BTW, this is a pretty questionable assumption: #13794, #15540.
      • In either case the sequence is passed into EthAssemblyAdapter::appendDataSize().
        • It assumes that if the first ID in the sequence is present in m_dataHashBySubId, it must be a fake ID assigned to a data chunk.
        • Otherwise it converts the path to a sub ID using Assembly::encodeSubPath() and assumes it must be a subassembly.
          • There seems to be an assumption that Yul Object tree structure matches the Assembly tree, which I'm not sure is true. For example we sometimes we move things (e.g. long Yul strings) to data chunks and the transition happens in evmasm optimizer, which means that they do not have corresponding data nodes at Yul level.
        • I actually don't understand how paths to nested data are handled. I don't think they're are added to BuiltinContext, but then how do they not end up treated as assemblies by appendDataSize()?

Overall, looks like we have several kinds of sub IDs sharing the same ID space, without any checks for conflicts (just assuming the space is large enough that they won't happen):

  • Normal IDs assigned to subassemblies, starting at 0.
    • EOF ContainerIDs are a subset of this and are limited to uint8_t.
  • Empty ID at max().
  • Fake data IDs starting at max() / 2.
  • "Negative" IDs representing deeply nested paths, starting at max() and going down.

I think we should assign a non-overlapping region to each of them and have asserts enforcing that.

The way we assign IDs should also be more prominently documented. For example m_subPaths does not even say that the "negative" IDs mean.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good idea to define non-overlapping regions. That way each of the regions can also be independently and cleanly documented. Perhaps in a follow up?
Regarding documenting subpaths and negative IDs: @r0qs had written something up about it and iirc wanted to add it in some form to the documentation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good idea to define non-overlapping regions. That way each of the regions can also be independently and cleanly documented. Perhaps in a follow up? Regarding documenting subpaths and negative IDs: @r0qs had written something up about it and iirc wanted to add it in some form to the documentation.

Yes, it is here: https://github.com/ethereum/solidity/wiki/Assembly-Indices

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a comment to m_subPaths that the subassembly IDs referenced there occupy the negative index space. I think a more thorough documentation of things should occur once we have defined non-overlapping regions. Then it's also straightforward where to put the documentation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@r0qs I see that the doc does not mention the fake data IDs. Not sure if they can ever show up in assembly output, but even if they can't, they still affect it, because we can never safely allocate other kinds of IDs in the region they take up. So they effectively limit the range of those other kinds IDs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, that was not the initial intention of such documentation. However, I can extend it to cover your suggestion of defining non-overlapping regions for IDs.

@@ -47,9 +48,9 @@ using AssemblyPointer = std::shared_ptr<Assembly>;

class Assembly
{
using TagRefs = std::map<size_t, std::pair<size_t, size_t>>;
using TagRefs = std::map<size_t, std::pair<SubAssemblyID, size_t>>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use size_t for tag ID but that should really be uint64_t as well. Would be nice to change that too at some point.

The tag+sub pair would also be better off as a proper struct with methods for converting to/from u256. We wouldn't then have to hard-code all those masks and 64s all over the place.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, it would also make it easier to reason about the whole program and data flow.

@clonker clonker force-pushed the fix_asm_subpath_obj_id branch 7 times, most recently from 6d3ec74 to 7881236 Compare March 24, 2025 11:43
@clonker clonker force-pushed the fix_asm_subpath_obj_id branch from 7881236 to 8e524b6 Compare April 11, 2025 08:33
@clonker clonker requested a review from cameel April 14, 2025 11:46
@clonker clonker force-pushed the fix_asm_subpath_obj_id branch 3 times, most recently from cf85d6d to 3cb8070 Compare April 23, 2025 11:57
@clonker clonker force-pushed the fix_asm_subpath_obj_id branch from 3cb8070 to bc503b2 Compare April 28, 2025 15:39
@clonker clonker force-pushed the fix_asm_subpath_obj_id branch 2 times, most recently from 0f05e3c to 9115fbc Compare May 9, 2025 09:37
Copy link
Member

@cameel cameel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main things needed here are to replace validations with asserts and adjust tests a bit. Other than that mostly minor stuff.

This should be enough for the fix, but it needs to be followed by a PR that properly splits the index space into regions (#15955 (comment)) and documents it. And introduce a type for the tag ID (#15955 (comment)).

@@ -798,20 +799,20 @@ std::map<u256, u256> const& Assembly::optimiseInternal(

// Run optimisation for sub-assemblies.
// TODO: verify and double-check this for EOF.
for (size_t subId = 0; subId < m_subs.size(); ++subId)
for (SubAssemblyID subID {0}; subID.value < m_subs.size(); ++subID.value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (SubAssemblyID subID {0}; subID.value < m_subs.size(); ++subID.value)
for (SubAssemblyID subID{0}; subID.value < m_subs.size(); ++subID.value)

Also in AssemblyItem::splitForeignPushTag().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really wish we could just settle on one style that has clang format support (or find a close enough approximation of the current style) and be done with it :) the amount of time and energy things like this eat up is unreasonable

@@ -45,7 +45,7 @@ bool JumpdestRemover::optimise(std::set<size_t> const& _tagsReferencedFromOutsid
if (_item.type() != Tag)
return false;
auto asmIdAndTag = _item.splitForeignPushTag();
assertThrow(asmIdAndTag.first == std::numeric_limits<size_t>::max(), OptimizerException, "Sub-assembly tag used as label.");
solRequire(asmIdAndTag.first.empty(), OptimizerException, "Sub-assembly tag used as label.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this one as well is just a general assumption rather than something that we expect to fail at runtime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tags were actually always confusing to me, because I could not see in what situation we'd ever jump into another subassembly, so I dug into this now to find out. Turns out we never do. This mechanism seems to exist only to support internal function pointers in legacy codegen. Since we support storage variables of such pointer types and storage can be accessed both at runtime and at creation time, we actually combine two pointers in the slot: to a copy of the function that exists in the creation assembly and to one in the runtime assembly. This not new to me, but I only now realized that to support this we have to be able to refer to tags from the runtime assembly inside the creation assembly. I think that's the only situation where they'll appear.

Here's what it looks like in practice:

contract C {
    function() fptr = f;
    function f() internal {}
}
EVM assembly:
  mstore(0x40, 0x80)
  or(tag_0_2, shl(0x20, tag_1))
  0x00
  ...
  sstore
  ...
tag_1:
  jump  // out
  ...
stop

sub_0: assembly {
    ...
    tag_2:
      jump      // out
    ...
}

Note that there's no tag named tag_0_2. It refers to tag_2 in sub_0. I don't think we ever insert such tags into the assembly item stream (and that's what the assertThrow() is guarding against). We only push them on the stack using PushTag, but this does not trigger the assert.

Such tags could, of course, appear in assembly import - we should have a validation against them there - but here it should be only an assert.

@@ -61,20 +61,21 @@ AssemblyItem AssemblyItem::toSubAssemblyTag(size_t _subId) const
return r;
}

std::pair<size_t, size_t> AssemblyItem::splitForeignPushTag() const
std::pair<SubAssemblyID, size_t> AssemblyItem::splitForeignPushTag() const
{
solAssert(m_type == PushTag || m_type == Tag || m_type == RelativeJump || m_type == ConditionalRelativeJump);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my investigation, I think we could limit it to just PushTag:

Suggested change
solAssert(m_type == PushTag || m_type == Tag || m_type == RelativeJump || m_type == ConditionalRelativeJump);
solAssert(m_type == PushTag);

And similarly in setPushTagSubIdAndTag() and toSubAssemblyTag().

Though maybe it's better done in a separate PR. There could be other places like this and it would be good to also have assembly import validations against it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks for investigating this!

@@ -292,10 +292,10 @@ std::vector<std::optional<BuiltinFunctionForEVM>> createBuiltins(langutil::EVMVe
_assembly.appendAssemblySize();
else
{
std::vector<size_t> subIdPath =
std::vector<AbstractAssembly::SubID> subIdPath =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just import SubassemblyID.h and use the type directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way because they are used as argument to an method in AbstractAssembly - which expects AbstractAssembly::SubID. We could of course get rid of that typedef entirely.

})));
st.ok(output);

function containsTarget(obj, target) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
function containsTarget(obj, target) {
function containsAssemblyItem(assemblyJSON, item) {

Also, I think it would be best to limit the search to the legacyAssembly subobject. It seems that now you're searching the whole object and there could be random stuff there. It would also make this check easier to understand - it took me a moment to understand what's happening here, because this is not at all what I expected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also as a general remark for the future - we should at some point create a CI job that runs the emscripten binary with those tests that provide Standard JSON input/output. It would make this whole JS mess unnecessary...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think it would be best to limit the search to the legacyAssembly subobject. It seems that now you're searching the whole object and there could be random stuff there. It would also make this check easier to understand - it took me a moment to understand what's happening here, because this is not at all what I expected.

Fair enough, what did you expect then? In my mind it was fairly straightforward to search the result json for some target object and it also seemed reasonably unlikely to me to find the target in our specific case anywhere else in the output. It is not big deal in any case, I am changing it to only consider the legacyAssembly subobject.

Also as a general remark for the future - we should at some point create a CI job that runs the emscripten binary with those tests that provide Standard JSON input/output. It would make this whole JS mess unnecessary...

Yes please!!

Changelog.md Outdated
@@ -17,6 +17,7 @@ Compiler Features:


Bugfixes:
* Commandline Interface: Fix possible inconsistency in subassembly IDs between target architectures in `--asm-json` output.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not just the interface:

Suggested change
* Commandline Interface: Fix possible inconsistency in subassembly IDs between target architectures in `--asm-json` output.
* Assembler: Fix different IDs being assigned to subassemblies nested more than one level away, resulting in inconsistent `--asm-json` output between target architectures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this needs to be moved to 0.8.31 now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not just the interface:

Well yeah, although assuming the ID space isn't exhausted, it does behave consistently inside of the assembler. I don't mind either way so let's go with your phrasing. Although I would minimally amend it:

* Assembler: Fix not using a fixed-width type for IDs being assigned to subassemblies nested more than one level away, resulting in inconsistent --asm-json output between target architectures.

@clonker clonker force-pushed the fix_asm_subpath_obj_id branch 4 times, most recently from b033262 to 2a41349 Compare May 21, 2025 08:03
@clonker clonker force-pushed the fix_asm_subpath_obj_id branch from 2a41349 to 2cdb82f Compare May 21, 2025 08:04
@clonker clonker requested a review from cameel May 21, 2025 08:05
@clonker clonker force-pushed the fix_asm_subpath_obj_id branch from 2cdb82f to de0a985 Compare May 21, 2025 08:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent subassembly object ID for different sizes of size_t
3 participants