Use `StreamingDataFrame.join_asof()` to join two topics into a new stream where each left record
is merged with the latest right record with the same key whose timestamp is less than or equal to the left timestamp.

This join is built with time series enrichment use cases in mind, where the left side represents measurements and the right side represents events.

Some examples:

- Matching sensor measurements with events in the system.
- Joining purchases with the effective prices of the goods.

During an as-of join, records from the right side are stored in a lookup table in the state, and records from the left side query this state for matches.
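
The lookup described above can be illustrated with a small in-memory model. This is only a sketch of the matching rule, not the library's implementation: the class `AsOfLookup` and its methods are hypothetical names, and a plain sorted list stands in for the real state store.

```python
from bisect import bisect_right
from collections import defaultdict


class AsOfLookup:
    """Toy in-memory stand-in for the right-side state of an as-of join."""

    def __init__(self):
        # key -> sorted list of timestamps, plus a parallel list of values
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def store_right(self, key, timestamp, value):
        # Keep each key's right-side records sorted by timestamp.
        idx = bisect_right(self._timestamps[key], timestamp)
        self._timestamps[key].insert(idx, timestamp)
        self._values[key].insert(idx, value)

    def match_left(self, key, timestamp):
        # Latest right value with the same key and timestamp <= left timestamp.
        idx = bisect_right(self._timestamps.get(key, []), timestamp)
        return self._values[key][idx - 1] if idx > 0 else None


lookup = AsOfLookup()
lookup.store_right("sensor-1", 10, {"location": "hall"})
lookup.store_right("sensor-1", 20, {"location": "roof"})
print(lookup.match_left("sensor-1", 15))  # {'location': 'hall'}
```

A left record with timestamp 15 matches the right record stored at 10, because the record at 20 is not yet effective. A left record that arrives before any right record for its key finds no match.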

### Requirements

To perform a join, the underlying topics must meet these requirements:

1. **Both topics must have the same number of partitions.**
   A join is a stateful operation, and it requires the partitions of the left and right topics to be assigned to the same application during processing.

2. **Keys in both topics must be distributed across partitions using the same algorithm.**
   For example, messages with the key `A` must go to the same partition number in both the left and right topics. This is Kafka's default behaviour.
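
To see why both requirements matter, consider a simplified partitioner. The sketch below uses CRC32 purely as a stand-in hash (the hash your Kafka client actually uses may differ), and `partition_for` and `co_partitioned` are hypothetical helper names.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in hash (CRC32); real Kafka clients may use a different one.
    return zlib.crc32(key) % num_partitions


def co_partitioned(key: bytes, left_partitions: int, right_partitions: int) -> bool:
    # True when the key lands on the same partition number in both topics.
    return partition_for(key, left_partitions) == partition_for(key, right_partitions)


# Same algorithm and same partition count: the key always lines up.
print(co_partitioned(b"A", 4, 4))

# Different partition counts: some keys end up on different partition numbers.
keys = [str(i).encode() for i in range(100)]
print(sum(not co_partitioned(k, 4, 6) for k in keys), "of 100 keys are misaligned")
```

With equal partition counts and the same hashing algorithm, every key maps to the same partition number on both sides, which is what lets one application hold the matching state.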

### Example

Join records from the topic "measurements" with the latest effective records from
the topic "metadata" using the "inner" join strategy and a grace period of 14 days:
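
A minimal sketch of such a join is shown below. It assumes that `app` is a Quix Streams `Application` created elsewhere, and that `join_asof()` accepts the keyword arguments `right`, `how`, and `grace_ms`; only `grace_ms` and the "inner" strategy are named in this document, so check the other names against the API reference. The helper `build_asof_join` is a hypothetical name used for illustration.

```python
from datetime import timedelta


def build_asof_join(app):
    """Wire an as-of join between "measurements" (left) and "metadata" (right).

    `app` is expected to be a Quix Streams Application; it is passed in so the
    sketch does not assume any particular broker configuration.
    """
    sdf_measurements = app.dataframe(app.topic("measurements"))
    sdf_metadata = app.dataframe(app.topic("metadata"))

    # Assumed keyword names: `right` and `how`. Verify them before use.
    return sdf_measurements.join_asof(
        right=sdf_metadata,
        how="inner",                  # emit a result only when a match is found
        grace_ms=timedelta(days=14),  # keep right-side records for 14 days
    )
```

The returned dataframe can then be processed further like any other `StreamingDataFrame` before the application is started.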

`StreamingDataFrame.join_asof` stores the right records in the state.
The `grace_ms` parameter controls the state's lifetime (7 days by default) to prevent it from growing indefinitely.

It shares some similarities with `grace_ms` in [Windows](windowing.md/#lateness-and-out-of-order-processing):

- The timestamps are obtained from the records.
- The join state keeps track of the maximum observed timestamp for **each individual key**.
- Older values expire only when a larger timestamp is stored in the state.

Adjust `grace_ms` based on the expected time gap between the left and the right side of the join.
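
The expiration rules above can be modelled for a single key as follows. This is a simplified sketch under stated assumptions: `ExpiringKeyState` is a hypothetical class, timestamps are integers in milliseconds, and whether a record exactly at the cutoff is kept is a guess made for the example.

```python
class ExpiringKeyState:
    """Toy model of event-time expiration for one key's right-side records."""

    def __init__(self, grace_ms: int):
        self.grace_ms = grace_ms
        self.max_timestamp = None  # maximum timestamp observed for this key
        self.records = {}          # timestamp (ms) -> value

    def store(self, timestamp_ms: int, value) -> None:
        self.records[timestamp_ms] = value
        if self.max_timestamp is None or timestamp_ms > self.max_timestamp:
            # Expiration runs only when the maximum timestamp advances.
            self.max_timestamp = timestamp_ms
            cutoff = self.max_timestamp - self.grace_ms
            # Assumption: records exactly at the cutoff are kept.
            self.records = {
                ts: v for ts, v in self.records.items() if ts >= cutoff
            }


DAY_MS = 24 * 60 * 60 * 1000
state = ExpiringKeyState(grace_ms=7 * DAY_MS)
state.store(0, "a")
state.store(3 * DAY_MS, "b")
state.store(8 * DAY_MS, "c")   # advances the maximum, so "a" expires
state.store(2 * DAY_MS, "late")  # out of order: nothing expires
print(sorted(state.records.values()))  # ['b', 'c', 'late']
```

Because expiration is driven by record timestamps rather than wall-clock time, a key that stops receiving right-side records keeps its existing values until a newer record for that key arrives.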

### Limitations

- Joining dataframes that belong to the same topic (a "self-join") is not supported.
- As-of join preserves headers only for the left dataframe.
  If you need the headers of the right-side records, consider adding them to the value.
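
One way to carry right-side headers through the join is to copy them into the value before joining. The function below is a sketch that assumes the value is a `dict` and that headers arrive either as a dict or as a list of `(name, value)` pairs with `bytes` values. Its four-argument signature is meant to fit a callback passed to `apply(..., metadata=True)` on the right dataframe; that wiring is an assumption to verify against the library reference, and `with_headers` is a hypothetical name.

```python
def with_headers(value: dict, key, timestamp, headers) -> dict:
    """Return a copy of `value` with the record headers stored under "headers"."""
    pairs = headers.items() if isinstance(headers, dict) else (headers or [])
    decoded = {
        name: raw.decode("utf-8") if isinstance(raw, bytes) else raw
        for name, raw in pairs
    }
    return {**value, "headers": decoded}


print(with_headers({"price": 10}, "item-1", 0, [("source", b"erp")]))
# {'price': 10, 'headers': {'source': 'erp'}}
```

After this step the header data travels with the right-side value, so it is still available once the records are merged.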

## Message ordering between partitions

Joins use [`StreamingDataFrame.concat()`](concatenating.md) under the hood, which means that the application's internal consumer goes into a special "buffered" mode
when the join is used.

In this mode, it buffers messages per partition in order to process them in timestamp order across different topics.
Timestamp alignment is effective only for partitions **with the same number**: partition zero of one topic is aligned with partition zero of the other topics, but not with partition one.

Note that message ordering applies only when the messages are consumed from the topics.
If you change the timestamps of records during processing, the records are still processed in their original order.
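
The alignment rule can be illustrated with a batch model of the buffering. This is a conceptual sketch, not the consumer's actual implementation: `merge_by_timestamp` is a hypothetical helper, and it works on a finished list of messages rather than a live stream.

```python
import heapq
from collections import defaultdict


def merge_by_timestamp(messages):
    """Return, per partition number, messages from all topics in timestamp order.

    `messages` is an iterable of (topic, partition, timestamp, value) tuples
    that are already ordered by timestamp within each (topic, partition) pair.
    """
    # partition number -> topic -> buffered messages
    buffers = defaultdict(lambda: defaultdict(list))
    for msg in messages:
        topic, partition, _, _ = msg
        buffers[partition][topic].append(msg)

    # Only buffers with the same partition number are merged with each other.
    return {
        partition: list(heapq.merge(*by_topic.values(), key=lambda m: m[2]))
        for partition, by_topic in buffers.items()
    }


messages = [
    ("measurements", 0, 10, "m1"),
    ("measurements", 0, 30, "m2"),
    ("metadata", 0, 20, "e1"),
    ("metadata", 1, 5, "e2"),
]
ordered = merge_by_timestamp(messages)
print([m[3] for m in ordered[0]])  # ['m1', 'e1', 'm2']
print([m[3] for m in ordered[1]])  # ['e2']
```

The record on partition one has the smallest timestamp overall, yet it does not affect the order on partition zero, because alignment happens only between partitions that share a number.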