
Beam / Dataflow ::readfrompubsub(id_label) :: Unexpected Behavior

Can someone clarify the purpose of the id_label argument in the ReadFromPubSub transform? I'm using a BigQuery sink, and my understanding is that it acts like an insertId for the BQ streaming API, T…

Solution 1:

Those are different IDs. As explained here, every message published to a topic has a field named messageId that is guaranteed to be unique within that topic. Pub/Sub guarantees at-least-once delivery, so a subscription can deliver duplicates (i.e. redeliveries of a message with the same messageId). Dataflow provides exactly-once processing semantics because it uses that field to de-duplicate messages when reading from a subscription. This is independent of the sink, which does not need to be BigQuery.

Using id_label (or .withIdAttribute() in the Java SDK) we can have messages de-duplicated according to a different attribute that should be unique (such as an order ID, customer ID, etc.). The source will then read repeated messages only once, so they won't increase the count of input elements in the pipeline. Keep in mind that the Direct Runner is intended for testing purposes only and does not offer the same guarantees in terms of checkpointing, de-duplication, etc.; see this comment for an example. That's the most likely reason you are seeing the duplicates in the pipeline, also taking into account the NotImplementedError messages, so I'd suggest moving to the Dataflow Runner.
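To illustrate the difference the two modes make, here is a toy pure-Python simulation of the de-duplication the Dataflow source performs (this is not the Beam API; the message shape and the order_id attribute are hypothetical):

```python
def dedupe(messages, id_label=None):
    """Toy model of Dataflow's exactly-once Pub/Sub source:
    de-duplicate by messageId, or by a custom attribute when
    id_label is set (hypothetical sketch, not the Beam API)."""
    seen = set()
    out = []
    for msg in messages:
        # With id_label, the named attribute is the identity key;
        # otherwise Pub/Sub's own messageId is used.
        key = msg["attributes"][id_label] if id_label else msg["message_id"]
        if key in seen:
            continue  # duplicate delivery: dropped by the source
        seen.add(key)
        out.append(msg)
    return out

messages = [
    {"message_id": "m1", "attributes": {"order_id": "A"}},
    {"message_id": "m1", "attributes": {"order_id": "A"}},  # redelivery of m1
    {"message_id": "m2", "attributes": {"order_id": "A"}},  # new publish, same order
]

# Default: dedup by messageId -> the redelivery of m1 is dropped, 2 survive
assert len(dedupe(messages)) == 2
# id_label="order_id": both m1 and m2 carry order A -> only 1 survives
assert len(dedupe(messages, id_label="order_id")) == 1
```

Note how id_label collapses m2 as well: two distinct publishes with the same order_id count as one element, which is exactly what you want when a producer may publish the same business event twice.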

On the other hand, insertId is used, on a best-effort basis, to avoid duplicate rows when streaming inserts into BigQuery are retried. With BigQueryIO it is created under the hood and can't be specified manually. In your case, if N messages enter the pipeline and N rows are written to BigQuery, it is working as expected: if any insert had to be retried, the retried row carried the same insertId and was therefore discarded.
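The insertId mechanism can be sketched the same way: a toy model of a table that drops a retried row whose insertId it has recently seen (hypothetical sketch of BigQuery's best-effort behavior, not a real client API):

```python
class FakeTable:
    """Toy model of BigQuery streaming-insert de-duplication:
    a row arriving with an insertId seen recently is discarded
    on a best-effort basis (hypothetical sketch)."""

    def __init__(self):
        self.rows = []
        self._recent_ids = set()

    def insert(self, row, insert_id):
        if insert_id in self._recent_ids:
            return  # retry of a row already buffered: dropped
        self._recent_ids.add(insert_id)
        self.rows.append(row)

table = FakeTable()
table.insert({"order": "A"}, insert_id="i-1")
table.insert({"order": "A"}, insert_id="i-1")  # retried streaming insert
table.insert({"order": "B"}, insert_id="i-2")

# The retry is discarded, so only two rows land in the table
assert len(table.rows) == 2
```

This is why N elements in the pipeline still produce N rows even when retries happen: the duplicate delivery exists at the API level but never becomes a second row.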
