Fix inconsistent negative subassembly indices between different sizeof(size_t) #15955

clonker · 2025-03-19T09:58:05Z

When computing object ids for referencing subassemblies, these ids are currently determined as negative DFS enumeration based on size_t:

solidity/libevmasm/Assembly.cpp

Line 1825 in a652292

size_t objectId = std::numeric_limits<size_t>::max() - m_subPaths.size();

This PR changes this to uint64_t, so that assembly text - in particular PUSH #[$] instructions - is consistent over different type sizes of size_t.

Since these indices are stored in a map and later referenced for exporting and importing, as well as (currently) numeric_limits<size_t>::max() is used to indicate the root object or an empty state which is used in, eg, various asserts, I found it helpful to wrap the uint64_t into a struct so that these places become apparent and dealing with the sizes is explicit. This caused a whole avalanche of changes, as such object ids (or more generally: SubAssemblyIDs) are used in many places. In particular, also yul::Object::subIds are subject to the type.

We were lacking a cmdline test that has non-zero PUSH #[$]es, so I added one. Since emscripten builds do not execute them, we have no direct regression test by this but I have added one to the solc-js tests. While doing that I realized that it is currently not possible (to me at least :P) to request ASM json output via standard json interface if the language is Yul.

Fixes #15953

r0qs · 2025-03-19T11:44:36Z

While doing that I realized that it is currently not possible (to me at least :P) to request ASM json output via standard json interface if the language is Yul.

Isn't that the evm.assembly option?

{
	"language": "Yul",
	"sources":
	{
		"C":
		{
			"content": "{ let x := mload(0) sstore(add(x, 0), 0) }"
		}
	},
	"settings":
	{
		"outputSelection":
		{
			"*": { "*": ["evm.assembly"], "": [ "*" ] }
		}
	}
}

clonker · 2025-03-19T11:46:49Z

Isn't that the evm.assembly option?

The equivalent that is showing the PUSH #[$]es would be evm.legacyAssembly

libevmasm/Assembly.h

libevmasm/AssemblyItem.cpp

libevmasm/Assembly.cpp

libevmasm/AssemblyItem.cpp

cameel · 2025-03-21T22:57:54Z

libsolidity/codegen/CompilerContext.h

 	evmasm::AssemblyItem addSubroutine(evmasm::AssemblyPointer const& _assembly) { return m_asm->appendSubroutine(_assembly); }
 	/// Pushes the size of the subroutine.
-	void pushSubroutineSize(size_t _subRoutine) { m_asm->pushSubroutineSize(_subRoutine); }
+	void pushSubroutineSize(evmasm::SubAssemblyID _subRoutine) { m_asm->pushSubroutineSize(_subRoutine); }
 	/// Pushes the offset of the subroutine.
-	void pushSubroutineOffset(size_t _subRoutine) { m_asm->pushSubroutineOffset(_subRoutine); }
+	void pushSubroutineOffset(evmasm::SubAssemblyID _subRoutine) { m_asm->pushSubroutineOffset(_subRoutine); }


Oh, it's the first time I see assemblies referred to as subroutines. It really stretches the definition. From Wikipedia:

In computer programming, a function (also procedure, method, subroutine, routine, or subprogram) is a callable unit[1] of software logic that has a well-defined interface and behavior and can be invoked multiple times.

It's going to become very confusing with EOF, where we have both functions (code sections) and assemblies (containers). I'd by default interpret "subroutine" to mean to former (there was even a competing "simple subroutines" EIP). We should really rename this at some point.

cameel · 2025-03-22T00:20:34Z

libyul/backends/evm/EthAssemblyAdapter.h

@@ -88,6 +88,6 @@ class EthAssemblyAdapter: public AbstractAssembly

 	evmasm::Assembly& m_assembly;
 	std::map<SubID, u256> m_dataHashBySubId;
-	size_t m_nextDataCounter = std::numeric_limits<size_t>::max() / 2;
+	SubID::value_type m_nextDataCounter = std::numeric_limits<SubID::value_type>::max() / 2;


Ugh, I thought I understood what's happening and then I ran into this. WTF are we doing? It just makes my head hurt. If I understand correctly:

Assembly keeps track of chunks of data and subassemblies separately.

Assemblies are stored in m_subs vector.

Data is stored by hash in m_data.

Except for metadata, which is singled out and stored in m_auxiliaryData.

EthAssemblyAdapter holds the assembly and independently keeps track of all the appended data chunks in m_dataHashBySubId.

It gives them fake IDs in the upper half of the subassembly ID space

When a Yul object is being compiled, EVMObjectCompiler adds subassemblies and data chunks via the adapter, receiving both real and fake IDs.

It stores the IDs in BuiltinContext, each one associated with a Yul node by name.

This is done only for children of the object. Anything nested deeper does not get an ID.

BuiltinContext is passed to Yul code transform and later to builtins defined in EVMDialect.

Builtins like datasize then use BuiltinContext to map Yul node names to sub IDs.

If the name is present in BuiltinContext, they take the ID stored there.

If not, they assume the name is a dotted path and convert it to a sequence of IDs.

BTW, this is a pretty questionable assumption: #13794, #15540.

In either case the sequence is passed into EthAssemblyAdapter::appendDataSize().

It assumes that if the first ID in the sequence is present in m_dataHashBySubId, it must be a fake ID assigned to a data chunk.

Otherwise it converts the path to a sub ID using Assembly::encodeSubPath() and assumes it must be a subassembly.

There seems to be an assumption that Yul Object tree structure matches the Assembly tree, which I'm not sure is true. For example we sometimes we move things (e.g. long Yul strings) to data chunks and the transition happens in evmasm optimizer, which means that they do not have corresponding data nodes at Yul level.

I actually don't understand how paths to nested data are handled. I don't think they're are added to BuiltinContext, but then how do they not end up treated as assemblies by appendDataSize()?

Overall, looks like we have several kinds of sub IDs sharing the same ID space, without any checks for conflicts (just assuming the space is large enough that they won't happen):

Normal IDs assigned to subassemblies, starting at 0.

EOF ContainerIDs are a subset of this and are limited to uint8_t.

Empty ID at max().

Fake data IDs starting at max() / 2.

"Negative" IDs representing deeply nested paths, starting at max() and going down.

I think we should assign a non-overlapping region to each of them and have asserts enforcing that.

The way we assign IDs should also be more prominently documented. For example m_subPaths does not even say that the "negative" IDs mean.

Sounds like a good idea to define non-overlapping regions. That way each of the regions can also be independently and cleanly documented. Perhaps in a follow up?
Regarding documenting subpaths and negative IDs: @r0qs had written something up about it and iirc wanted to add it in some form to the documentation.

Sounds like a good idea to define non-overlapping regions. That way each of the regions can also be independently and cleanly documented. Perhaps in a follow up? Regarding documenting subpaths and negative IDs: @r0qs had written something up about it and iirc wanted to add it in some form to the documentation.

Yes, it is here: https://github.com/ethereum/solidity/wiki/Assembly-Indices

I have added a comment to m_subPaths that the subassembly IDs referenced there occupy the negative index space. I think a more thorough documentation of things should occur once we have defined non-overlapping regions. Then it's also straightforward where to put the documentation.

@r0qs I see that the doc does not mention the fake data IDs. Not sure if they can ever show up in assembly output, but even if they can't, they still affect it, because we can never safely allocate other kinds of IDs in the region they take up. So they effectively limit the range of those other kinds IDs.

Indeed, that was not the initial intention of such documentation. However, I can extend it to cover your suggestion of defining non-overlapping regions for IDs.

cameel · 2025-03-22T01:34:48Z

libevmasm/Assembly.h

@@ -47,9 +48,9 @@ using AssemblyPointer = std::shared_ptr<Assembly>;

 class Assembly
 {
-	using TagRefs = std::map<size_t, std::pair<size_t, size_t>>;
+	using TagRefs = std::map<size_t, std::pair<SubAssemblyID, size_t>>;


We use size_t for tag ID but that should really be uint64_t as well. Would be nice to change that too at some point.

The tag+sub pair would also be better off as a proper struct with methods for converting to/from u256. We wouldn't then have to hard-code all those masks and 64s all over the place.

Agree, it would also make it easier to reason about the whole program and data flow.

cameel

Main things needed here are to replace validations with asserts and adjust tests a bit. Other than that mostly minor stuff.

This should be enough for the fix, but it needs to be followed by a PR that properly splits the index space into regions (#15955 (comment)) and documents it. And introduce a type for the tag ID (#15955 (comment)).

libevmasm/SubAssemblyID.h

cameel · 2025-05-16T09:12:25Z

libevmasm/Assembly.cpp

@@ -798,20 +799,20 @@ std::map<u256, u256> const& Assembly::optimiseInternal(

 	// Run optimisation for sub-assemblies.
 	// TODO: verify and double-check this for EOF.
-	for (size_t subId = 0; subId < m_subs.size(); ++subId)
+	for (SubAssemblyID subID {0}; subID.value < m_subs.size(); ++subID.value)


Suggested change

for (SubAssemblyID subID {0}; subID.value < m_subs.size(); ++subID.value)

for (SubAssemblyID subID{0}; subID.value < m_subs.size(); ++subID.value)

Also in AssemblyItem::splitForeignPushTag().

I really wish we could just settle on one style that has clang format support (or find a close enough approximation of the current style) and be done with it :) the amount of time and energy things like this eat up is unreasonable

libevmasm/Assembly.cpp

libevmasm/BlockDeduplicator.h

cameel · 2025-05-16T12:00:28Z

libevmasm/JumpdestRemover.cpp

@@ -45,7 +45,7 @@ bool JumpdestRemover::optimise(std::set<size_t> const& _tagsReferencedFromOutsid
 			if (_item.type() != Tag)
 				return false;
 			auto asmIdAndTag = _item.splitForeignPushTag();
-			assertThrow(asmIdAndTag.first == std::numeric_limits<size_t>::max(), OptimizerException, "Sub-assembly tag used as label.");
+			solRequire(asmIdAndTag.first.empty(), OptimizerException, "Sub-assembly tag used as label.");


I think that this one as well is just a general assumption rather than something that we expect to fail at runtime.

These tags were actually always confusing to me, because I could not see in what situation we'd ever jump into another subassembly, so I dug into this now to find out. Turns out we never do. This mechanism seems to exist only to support internal function pointers in legacy codegen. Since we support storage variables of such pointer types and storage can be accessed both at runtime and at creation time, we actually combine two pointers in the slot: to a copy of the function that exists in the creation assembly and to one in the runtime assembly. This not new to me, but I only now realized that to support this we have to be able to refer to tags from the runtime assembly inside the creation assembly. I think that's the only situation where they'll appear.

Here's what it looks like in practice:

contract C { function() fptr = f; function f() internal {} }

EVM assembly: mstore(0x40, 0x80) or(tag_0_2, shl(0x20, tag_1)) 0x00 ... sstore ... tag_1: jump // out ... stop sub_0: assembly { ... tag_2: jump // out ... }

Note that there's no tag named tag_0_2. It refers to tag_2 in sub_0. I don't think we ever insert such tags into the assembly item stream (and that's what the assertThrow() is guarding against). We only push them on the stack using PushTag, but this does not trigger the assert.

Such tags could, of course, appear in assembly import - we should have a validation against them there - but here it should be only an assert.

cameel · 2025-05-16T12:20:48Z

libevmasm/AssemblyItem.cpp

@@ -61,20 +61,21 @@ AssemblyItem AssemblyItem::toSubAssemblyTag(size_t _subId) const
 	return r;
 }

-std::pair<size_t, size_t> AssemblyItem::splitForeignPushTag() const
+std::pair<SubAssemblyID, size_t> AssemblyItem::splitForeignPushTag() const
 {
 	solAssert(m_type == PushTag || m_type == Tag || m_type == RelativeJump || m_type == ConditionalRelativeJump);


Based on my investigation, I think we could limit it to just PushTag:

Suggested change

solAssert(m_type == PushTag || m_type == Tag || m_type == RelativeJump || m_type == ConditionalRelativeJump);

solAssert(m_type == PushTag);

And similarly in setPushTagSubIdAndTag() and toSubAssemblyTag().

Though maybe it's better done in a separate PR. There could be other places like this and it would be good to also have assembly import validations against it.

Makes sense, thanks for investigating this!

cameel · 2025-05-16T12:34:01Z

libyul/backends/evm/EVMDialect.cpp

@@ -292,10 +292,10 @@ std::vector<std::optional<BuiltinFunctionForEVM>> createBuiltins(langutil::EVMVe
 				_assembly.appendAssemblySize();
 			else
 			{
-			std::vector<size_t> subIdPath =
+			std::vector<AbstractAssembly::SubID> subIdPath =


Why not just import SubassemblyID.h and use the type directly?

I did it this way because they are used as argument to an method in AbstractAssembly - which expects AbstractAssembly::SubID. We could of course get rid of that typedef entirely.

test/cmdlineTests/asm_json_nested_structure/args

cameel · 2025-05-16T13:11:22Z

test/externalTests/solc-js/asm-json.js

+        })));
+        st.ok(output);
+
+        function containsTarget(obj, target) {


Suggested change

function containsTarget(obj, target) {

function containsAssemblyItem(assemblyJSON, item) {

Also, I think it would be best to limit the search to the legacyAssembly subobject. It seems that now you're searching the whole object and there could be random stuff there. It would also make this check easier to understand - it took me a moment to understand what's happening here, because this is not at all what I expected.

Also as a general remark for the future - we should at some point create a CI job that runs the emscripten binary with those tests that provide Standard JSON input/output. It would make this whole JS mess unnecessary...

Also, I think it would be best to limit the search to the legacyAssembly subobject. It seems that now you're searching the whole object and there could be random stuff there. It would also make this check easier to understand - it took me a moment to understand what's happening here, because this is not at all what I expected.

Fair enough, what did you expect then? In my mind it was fairly straightforward to search the result json for some target object and it also seemed reasonably unlikely to me to find the target in our specific case anywhere else in the output. It is not big deal in any case, I am changing it to only consider the legacyAssembly subobject.

Also as a general remark for the future - we should at some point create a CI job that runs the emscripten binary with those tests that provide Standard JSON input/output. It would make this whole JS mess unnecessary...

Yes please!!

cameel · 2025-05-16T13:15:57Z

Changelog.md

@@ -17,6 +17,7 @@ Compiler Features:


 Bugfixes:
+* Commandline Interface: Fix possible inconsistency in subassembly IDs between target architectures in `--asm-json` output.


This is not just the interface:

Suggested change

* Commandline Interface: Fix possible inconsistency in subassembly IDs between target architectures in `--asm-json` output.

* Assembler: Fix different IDs being assigned to subassemblies nested more than one level away, resulting in inconsistent `--asm-json` output between target architectures.

Also, this needs to be moved to 0.8.31 now.

This is not just the interface:

Well yeah, although assuming the ID space isn't exhausted, it does behave consistently inside of the assembler. I don't mind either way so let's go with your phrasing. Although I would minimally amend it:

* Assembler: Fix not using a fixed-width type for IDs being assigned to subassemblies nested more than one level away, resulting in inconsistent --asm-json output between target architectures.

clonker force-pushed the fix_asm_subpath_obj_id branch 2 times, most recently from 8871aa9 to 12b9a65 Compare March 19, 2025 11:29

ethereum deleted a comment from stackenbotten Mar 19, 2025

clonker force-pushed the fix_asm_subpath_obj_id branch from 12b9a65 to 1ea2cc4 Compare March 19, 2025 11:32

clonker requested a review from aarlt March 19, 2025 11:45

clonker requested a review from r0qs March 19, 2025 11:47

clonker force-pushed the fix_asm_subpath_obj_id branch from 1ea2cc4 to c20a453 Compare March 21, 2025 08:09

cameel reviewed Mar 22, 2025

View reviewed changes

clonker force-pushed the fix_asm_subpath_obj_id branch 7 times, most recently from 6d3ec74 to 7881236 Compare March 24, 2025 11:43

clonker force-pushed the fix_asm_subpath_obj_id branch from 7881236 to 8e524b6 Compare April 11, 2025 08:33

clonker requested a review from cameel April 14, 2025 11:46

clonker force-pushed the fix_asm_subpath_obj_id branch 3 times, most recently from cf85d6d to 3cb8070 Compare April 23, 2025 11:57

clonker force-pushed the fix_asm_subpath_obj_id branch from 3cb8070 to bc503b2 Compare April 28, 2025 15:39

clonker force-pushed the fix_asm_subpath_obj_id branch 2 times, most recently from 0f05e3c to 9115fbc Compare May 9, 2025 09:37

cameel reviewed May 16, 2025

View reviewed changes

clonker force-pushed the fix_asm_subpath_obj_id branch 4 times, most recently from b033262 to 2a41349 Compare May 21, 2025 08:03

clonker force-pushed the fix_asm_subpath_obj_id branch from 2a41349 to 2cdb82f Compare May 21, 2025 08:04

clonker requested a review from cameel May 21, 2025 08:05

clonker added 3 commits May 21, 2025 10:52

Add tests with non-zero PUSH #[$] asm instructions

a7d9eb2

Make subassembly IDs based on fixed-size 64 bit uint

64df8ff

Add test with non-zero PUSH #[$] to solc-js tests

de0a985

clonker force-pushed the fix_asm_subpath_obj_id branch from 2cdb82f to de0a985 Compare May 21, 2025 08:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix inconsistent negative subassembly indices between different sizeof(size_t) #15955

Fix inconsistent negative subassembly indices between different sizeof(size_t) #15955

clonker commented Mar 19, 2025 •

edited

Loading

r0qs commented Mar 19, 2025

clonker commented Mar 19, 2025

cameel Mar 21, 2025 •

edited

Loading

cameel Mar 22, 2025

clonker Mar 24, 2025

r0qs Mar 28, 2025

clonker Apr 23, 2025

cameel May 16, 2025

r0qs May 19, 2025

cameel Mar 22, 2025

clonker Mar 24, 2025

cameel left a comment

cameel May 16, 2025

clonker May 19, 2025

cameel May 16, 2025

cameel May 16, 2025

cameel May 16, 2025

clonker May 19, 2025

cameel May 16, 2025

clonker May 19, 2025

cameel May 16, 2025

cameel May 16, 2025

clonker May 19, 2025

cameel May 16, 2025

cameel May 16, 2025

clonker May 19, 2025

	for (SubAssemblyID subID {0}; subID.value < m_subs.size(); ++subID.value)
	for (SubAssemblyID subID{0}; subID.value < m_subs.size(); ++subID.value)

	solAssert(m_type == PushTag \|\| m_type == Tag \|\| m_type == RelativeJump \|\| m_type == ConditionalRelativeJump);
	solAssert(m_type == PushTag);

	function containsTarget(obj, target) {
	function containsAssemblyItem(assemblyJSON, item) {

		@@ -17,6 +17,7 @@ Compiler Features:


		Bugfixes:
		* Commandline Interface: Fix possible inconsistency in subassembly IDs between target architectures in `--asm-json` output.

	* Commandline Interface: Fix possible inconsistency in subassembly IDs between target architectures in `--asm-json` output.
	* Assembler: Fix different IDs being assigned to subassemblies nested more than one level away, resulting in inconsistent `--asm-json` output between target architectures.

Fix inconsistent negative subassembly indices between different sizeof(size_t) #15955

Are you sure you want to change the base?

Fix inconsistent negative subassembly indices between different sizeof(size_t) #15955

Conversation

clonker commented Mar 19, 2025 • edited Loading

r0qs commented Mar 19, 2025

clonker commented Mar 19, 2025

cameel Mar 21, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cameel left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clonker commented Mar 19, 2025 •

edited

Loading

cameel Mar 21, 2025 •

edited

Loading