
Exactly-once processing

In document Streaming Data.pdf (Page 164-168)

Consumer device capabilities and limitations

8.1 The core concepts

8.1.4 Exactly-once processing

There are plenty of use cases where being able to process a message exactly once is important. In the case from the previous section, we’re receiving data from an order stream and need to update a third-party order system. Because we didn’t implement the third-party API, we can’t make any assumptions regarding its idempotency. Therefore, we need to ensure that we don’t send it duplicate messages.

Ideally, to handle this we would be able to use an acknowledgment feature of a streaming API. This interaction pattern, in which the third-party API acknowledges each message and the streaming client then acknowledges to the streaming API that the message has been handled, is illustrated in figure 8.6. Again, due to the nature of this work, using a web browser as a client is impractical.

Figure 8.6 HML with full acknowledgment shown in context. Elements: streaming data API, streaming client (events, process messages), 3rd party API, connected by Send and ACK arrows. Callouts: "Record message on send and then delete log entry when getting back ACK and return ACK so it can be used." and "Record the message on receipt and then once the ACK is received, it is safe to delete it and send an ACK back."
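The record-then-acknowledge flow in the figure callouts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `AckJournal`, `handle_message`, `send_to_third_party`, and `ack_upstream` are hypothetical names, the journal is held in memory rather than in durable storage, and the third-party call is assumed to return `True` when it acknowledges the message.

```python
from typing import Callable, Dict


class AckJournal:
    """Tracks messages received but not yet acknowledged by the third-party API."""

    def __init__(self) -> None:
        self._in_flight: Dict[str, bytes] = {}

    def record(self, message_id: str, payload: bytes) -> None:
        self._in_flight[message_id] = payload

    def clear(self, message_id: str) -> None:
        self._in_flight.pop(message_id, None)

    def pending(self) -> Dict[str, bytes]:
        return dict(self._in_flight)


def handle_message(
    message_id: str,
    payload: bytes,
    journal: AckJournal,
    send_to_third_party: Callable[[bytes], bool],  # hypothetical: True means ACK
    ack_upstream: Callable[[str], None],           # hypothetical ACK to the streaming API
) -> bool:
    """Record on receipt, forward, and ACK upstream only after the downstream ACK."""
    journal.record(message_id, payload)
    if not send_to_third_party(payload):
        # No downstream ACK: keep the journal entry so the message is not forgotten.
        return False
    journal.clear(message_id)
    ack_upstream(message_id)
    return True
```

An entry that stays in the journal marks a message whose outcome is unknown. Blindly resending it could deliver a duplicate if only the ACK was lost, which is one reason the pattern is harder than it looks.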

This is fraught with challenges. First, you need to implement message acknowledgment for the third-party API as discussed in chapter 7. Then you need to implement it with the streaming API. Unfortunately, in many cases you don’t control the streaming API and may have to deal with the inability to send back acknowledgments. How can you ensure that you’ve processed a message exactly once? In this case we need to keep a record of all the messages we’ve seen for some time period.

If there is a guaranteed unique message ID, we can use it in our “been processed message store”; otherwise the data can be hashed, or perhaps another fingerprinting mechanism can be computed. By keeping a record of all the messages we’ve seen, we increase our client complexity and storage requirements, but we also protect ourselves from when the streaming API crashes. The architecture for this looks like figure 8.7.
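As a minimal sketch of such a store, the Python below keys each message by its ID when one is available and by a SHA-256 hash of the payload otherwise, and forgets entries older than a retention window. The names `fingerprint`, `ProcessedMessageStore`, and `mark_if_new` are illustrative assumptions, as is the choice of SHA-256 and an in-memory dictionary.

```python
import hashlib
import time
from typing import Callable, Dict, Optional


def fingerprint(payload: bytes, message_id: Optional[str] = None) -> str:
    """Use the message ID when one is guaranteed unique; otherwise hash the payload."""
    if message_id is not None:
        return "id:" + message_id
    return "sha256:" + hashlib.sha256(payload).hexdigest()


class ProcessedMessageStore:
    """In-memory record of the messages seen within a retention window."""

    def __init__(
        self,
        retention_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._retention = retention_seconds
        self._clock = clock
        self._seen: Dict[str, float] = {}

    def _expire(self) -> None:
        # Drop entries that are older than the retention window.
        cutoff = self._clock() - self._retention
        for key in [k for k, ts in self._seen.items() if ts < cutoff]:
            del self._seen[key]

    def mark_if_new(self, key: str) -> bool:
        """Return True and record the key if it is new; return False for a duplicate."""
        self._expire()
        if key in self._seen:
            return False
        self._seen[key] = self._clock()
        return True
```

A caller would process a message only when `store.mark_if_new(fingerprint(payload, message_id))` returns `True`. One caveat of hashing the payload: two legitimately distinct messages with identical content look like duplicates, so a unique ID is preferable when the stream provides one.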

This may seem simple enough, but what if there’s more than one streaming API server? At a certain point, one is going to crash or be taken down for maintenance. When this happens our streaming client will connect to a different API server and begin to consume the stream. How do we ensure that we don’t process a previously processed message again? Remember, we can’t assume that the streaming API is keeping track of what messages were already delivered to us. Perhaps it does during normal maintenance, but what if it crashed before it could record the last message sent to us and sends us the same message again? To adequately handle this, we’ll need to use a distributed store for recording the previously seen messages. This is illustrated in figure 8.8, where there is more than one streaming API and a streaming client.
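The failover case can be sketched as follows. `SharedProcessedStore` is a hypothetical in-process stand-in for a distributed store, and the sketch assumes the real store offers an atomic "record if absent" operation and that each message carries a unique ID. For brevity it records the ID before processing, which simplifies the record-after-ACK flow described in the figures.

```python
import threading
from typing import Iterable, List, Set, Tuple


class SharedProcessedStore:
    """In-process stand-in for a distributed store with an atomic put-if-absent."""

    def __init__(self) -> None:
        self._keys: Set[str] = set()
        self._lock = threading.Lock()

    def put_if_absent(self, key: str) -> bool:
        """Atomically record the key; return False if it was already present."""
        with self._lock:
            if key in self._keys:
                return False
            self._keys.add(key)
            return True


def consume(stream: Iterable[Tuple[str, str]], store: SharedProcessedStore) -> List[str]:
    """Process each (message_id, payload) pair unless its ID is already in the store."""
    processed = []
    for message_id, payload in stream:
        if store.put_if_absent(message_id):
            processed.append(payload)  # placeholder for the real processing step
    return processed


# Server 01 goes away after message 2; server 02 replays message 2 after failover.
store = SharedProcessedStore()
first = consume([("1", "a"), ("2", "b")], store)
second = consume([("2", "b"), ("3", "c")], store)
```

Because the record of processed IDs lives outside any single streaming API server, the replayed message 2 is discarded no matter which server delivers it.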

Figure 8.7 HML with partial ACKing and local storage. Elements: streaming data API, streaming client (events, process messages), 3rd party API, local storage, connected by Send and ACK arrows. Callouts: "Record message on send and then delete log entry when getting back ACK and return ACK so it can be used." and "Record the message on receipt and then once the ACK is received, it is safe to delete it and record it in our local storage."

To keep things simple in figure 8.8 I only show one streaming client, but I think you get the picture. As time goes on we see that Streaming API Server 01 goes away and Streaming API Server 02 sends our streaming client a message we’ve already seen. In this case, and in all cases, we need to check whether we’ve already processed the message, and if so we need to discard it.

There is a subtlety to this design that you need to keep in mind. We’re using a distributed store for keeping track of previously processed messages. Fantastic: now we’ve protected ourselves from double-processing messages. But if we’re processing a stream that has a significant velocity, any request we make to a service over the network may jeopardize our ability to meet an SLA. Keep this in mind, measure the performance, and if you notice an impact, consider using a local store that flushes its messages to a distributed store. Figure 8.9 illustrates how this may work.

When the streaming client detects that the network connection to the streaming API server has been lost, it needs to synchronize the local store with the distributed store. As long as this step happens prior to receiving another message to process, then we should be able to ensure that we don’t process a message more than once.
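A rough sketch of this local-first arrangement is shown below. `LocalFirstDeduper`, `should_process`, and `on_connection_lost` are hypothetical names, and a plain Python set stands in for the distributed store. The assumption is that the hot path touches only local memory and that the network round trip happens once, when the connection drops.

```python
from typing import Set


class LocalFirstDeduper:
    """Checks a local set per message and syncs with a shared store on connection loss."""

    def __init__(self, shared: Set[str]) -> None:
        self._shared = shared              # stand-in for the distributed store
        self._local: Set[str] = set()      # everything this client knows was processed
        self._unflushed: Set[str] = set()  # local entries not yet pushed to the store

    def should_process(self, key: str) -> bool:
        """Return True and record the key locally if it has not been processed."""
        if key in self._local:
            return False
        self._local.add(key)
        self._unflushed.add(key)
        return True

    def on_connection_lost(self) -> None:
        """Push local entries to the shared store, then pull in everyone else's."""
        self._shared.update(self._unflushed)
        self._unflushed.clear()
        self._local.update(self._shared)
```

The client would call `on_connection_lost()` before reconnecting to another server, so the synchronization completes before the next message arrives. If the client process itself crashes before flushing, the unflushed entries are lost, so this trades some safety for lower per-message latency.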

Figure 8.8 HML with partial ACKing and distributed storage for processed messages. Callout on the streaming client: "On receipt of the message we need to check the distributed store. If it is not found then record the message and once the ACK is received, it is safe to delete it and record it."

Figure 8.9 Time lapse of handling a streaming API server changing and keeping exactly-once processing. Callouts on the streaming client: "On receipt of the message check the local store. If it is not found then record the message and once the ACK is received, delete it and record it." (shown both before and after the server change); "Connection terminated"; "... lost, we need to synch the local store with the distributed store."
