Beam / Dataflow :: ReadFromPubSub(id_label) :: Unexpected Behavior
Solution 1:
Those are different IDs. As explained here, every message published to a topic has a field named messageId that is guaranteed to be unique within the topic. Pub/Sub guarantees at-least-once delivery, so a subscription can contain duplicates (i.e. messages with the same messageId). Dataflow provides exactly-once processing semantics because it uses that field to de-duplicate messages when reading from a subscription. This is independent of the sink, which does not need to be BigQuery.
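For illustration, a publisher can attach its own unique attribute to each message; this is the field that id_label can later de-duplicate on. This is a minimal sketch, not from the original question: the project and topic names are placeholders and the order_id attribute is a hypothetical example of such a unique field.

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Placeholder project and topic names.
topic_path = publisher.topic_path("my-project", "my-topic")

# The attribute key ("order_id" here) is arbitrary; it only has to match the
# id_label passed to ReadFromPubSub later on, and its value should be unique
# per logical message.
future = publisher.publish(
    topic_path,
    data=b'{"item": "book", "qty": 1}',
    order_id="order-0001",
)
print(future.result())  # prints the server-assigned messageId
```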
Using id_label (or .withIdAttribute() in the Java SDK) we can force messages to be considered duplicates according to a different field that should be unique (such as an order ID or customer ID). The input source will read repeated messages only once, so you won't see them increase the count of input elements in the pipeline. Keep in mind that the Direct Runner is intended for testing purposes only and does not offer the same guarantees in terms of checkpointing, de-duplication, etc.; as an example, refer to this comment. That is the most likely reason why you are seeing the duplicates in the pipeline, especially given the NotImplementedError messages, so I'd suggest moving to the Dataflow Runner.
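A minimal sketch of the reading side, assuming the hypothetical order_id attribute from above and placeholder project/subscription names; the additional options Dataflow needs (project, region, temp_location, etc.) are omitted here:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions(streaming=True)
# id_label de-duplication is honoured by the Dataflow runner; the Direct
# Runner does not implement it (hence the NotImplementedError).
options.view_as(StandardOptions).runner = "DataflowRunner"

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub",  # placeholder
            id_label="order_id",  # attribute used for de-duplication
        )
        | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
        | "Log" >> beam.Map(print)
    )
```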
On the other hand, insertId is used, on a best-effort basis, to avoid duplicate rows when retrying streaming inserts into BigQuery. When using BigQueryIO it is created under the hood and cannot be specified manually. In your case, if N messages enter the pipeline and N rows are written to BigQuery, it is working as expected. If any insert had to be retried, the retried row carried the same insertId and was therefore discarded.
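For completeness, a minimal sketch of a streaming-insert write with WriteToBigQuery; the table name and schema are assumptions, and there is no parameter for setting insertId yourself, it is generated internally:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "CreateRows" >> beam.Create([{"order_id": "order-0001", "qty": 1}])
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            table="my-project:my_dataset.orders",  # placeholder table
            schema="order_id:STRING,qty:INTEGER",  # placeholder schema
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            # Streaming inserts are the path where insertId-based
            # de-duplication applies.
            method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
        )
    )
```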