The Beam & Correctness


 

I think that most folks at one time or another have interacted with Netflix, Hulu, Microsoft, Amazon, etc. Depending upon the importance of privacy in your life, it may or may not be cool to see that these companies have come to learn what is important to you. In my case, Amazon has a pretty good understanding of what books will interest me. All of these companies figure out your likes and dislikes in pretty much the same way. They create a profile for you. They then keep track of which movies, books, clothing, etc. you buy. Each of your purchases adds to the patterns that describe you. When one of these companies recognizes that you are on your computer, they collect all of your patterns, analyze them, and come up with a list of items that might interest you.

The collection of different items that you purchased is bundled into something called a batch and processed.

[Figure: a batch of eight purchased items]

The figure shows a batch of eight items. If you are a little on the techie side, the batch is composed of bounded data: you know the number of items (eight), so the data set is bounded. These companies are very efficient at building your specific profile and processing the patterns within it.
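As a minimal sketch of what "bounded" means in code, here is a fixed batch of purchases you can count and process in one pass (the item names are invented for illustration):

```python
from collections import Counter

# A bounded batch: a fixed, countable collection of purchases.
purchases = ["cookbook", "hiking boots", "sci-fi novel", "coffee grinder",
             "cookbook", "running shoes", "gardening book", "headphones"]

print(len(purchases))          # 8 -- the size is known up front

# Because the data is bounded, we can process it in one pass
# and know the result is complete.
profile = Counter(purchases)
print(profile.most_common(3))  # the patterns that describe this shopper
```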

 


The processing of your patterns can be described by terms like statistical analysis or machine learning. Basically what happens is that significant descriptors are extracted from your profile (you like cookbooks, for example), a model is built, and a prediction is made as to what might interest you.
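As a rough sketch of that extract-descriptors, build-model, predict loop (the features, labels, and use of scikit-learn below are my own illustration, not what any of these companies actually run):

```python
from sklearn.linear_model import LogisticRegression

# Each row is one shopper's extracted descriptors:
# [cookbooks bought, novels bought, gadgets bought]
profiles = [
    [5, 0, 1],
    [0, 4, 0],
    [3, 1, 0],
    [0, 0, 6],
]
# Label: did the shopper buy the new cookbook we recommended? (1 = yes)
bought_cookbook = [1, 0, 1, 0]

# Build a model from the descriptors ...
model = LogisticRegression()
model.fit(profiles, bought_cookbook)

# ... and predict whether a new profile might be interested.
print(model.predict_proba([[4, 1, 0]])[0][1])  # probability of interest
```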

On the plus side, batch processing does not need to worry about late-arriving data or incomplete data, it does not need to respond in sub-second time, and it most often returns correct answers.


Batch processing is nice, but it does not help us solve the type of problem that we are addressing. We have a building with hallways and rooms. Hallways and rooms have cameras that are emitting HD video and infrared images in real time. Each camera is sending a stream of images at 15 frames per second.


People move through our building. As they do, we collect physiological data (muscle movement, brain activity, respiration) along with what they see and what they hear. So as we are collecting what is happening in each room, we are also collecting the state of each person moving through the building. We stream the state information for processing: statistical analysis and machine learning.


Our goal is rather simple: train more efficiently and effectively so that our students can go home at night.

Our problem is both similar to and different from trying to predict what you would be most interested in buying. Our data is not collected as a batch, but as real-time streams of data. Streams of data are described as unbounded data: at any point in time you might have one item or you might have a thousand items of data.
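To contrast with the bounded batch above, an unbounded stream looks more like a generator you can never take the length of. A toy sketch (the camera name and frame rate handling are placeholders, not our real ingestion code):

```python
import itertools
import time

def camera_stream(camera_id):
    """Unbounded data: frames keep coming; there is no len() to take."""
    for frame_number in itertools.count():
        yield {"camera": camera_id, "frame": frame_number}
        time.sleep(1 / 15)   # roughly 15 frames per second, as our cameras emit

# At any point in time you may have seen one item or a thousand;
# the stream never "finishes".
for reading in itertools.islice(camera_stream("hallway-3"), 3):
    print(reading)
```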


Instead of developing patterns from your purchase history and grouping them together, we collect data from widely distributed sensors deployed around the world. With batch, there is the luxury of being confident that your data is correct and is in one place when needed. With streams of data fed through networks scattered around the world, you can never be sure what data will be where at any given time. Because we cannot be sure that we have all the data at any point in time, we cannot be sure that the data is correct, that it is a true representation of context.
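A toy illustration of why correctness is hard with streams: the readings below (timestamps and labels invented) arrive out of order over the network, so any answer computed at a given moment in processing time may not yet reflect what actually happened in event time.

```python
from datetime import datetime

# (event_time, reading) pairs, in the order they ARRIVE, not the order they OCCURRED.
arrivals = [
    (datetime(2018, 5, 1, 9, 0, 2), "room-3 frame"),
    (datetime(2018, 5, 1, 9, 0, 5), "room-1 frame"),
    (datetime(2018, 5, 1, 9, 0, 1), "room-2 frame"),   # late: occurred first, arrived last
]

seen = []
for event_time, reading in arrivals:
    seen.append((event_time, reading))
    # Any answer computed here is provisional: data for earlier
    # event times may still be on its way through the network.
    earliest_so_far = min(t for t, _ in seen)
    print(f"after {len(seen)} arrivals, earliest event seen so far: {earliest_so_far}")
```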

A deeper dive into stream versus batch processing can be found here.

What we needed was a set of tools that could help us process enormous amounts of varying data in real time and could help us determine whether our data was correct, that is, whether the data described the current context. And we needed to draw conclusions in sub-second time.


We knew that we needed an elastic type of environment, something that easily scales out and up. We looked at AWS, Azure, and Google Cloud. Any of the three could have provided our required environment. But when I think of massively parallel systems, machine learning, and processing large quantities of data, I think Google. I feel at home when I sit at the console in Google Cloud; I know the task is solvable.

We also needed a relatively easy-to-use tool set that could deal with time, batch, and stream processing. Google has a tool called Beam that has been made open source via Apache. Beam understands parallel processing of streams of data, how to handle early- and late-arriving data, event time versus processing time, watermarks, and triggers. There are other tools, like Flink, Spark (and its migration path), and Apache Storm. But when I looked at the code samples for Beam, I felt that this is obvious.
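To give a flavor of those ideas, here is a small sketch using the Beam Python SDK: event-time windows, a watermark-driven trigger with early and late firings, and allowed lateness for stragglers. The tiny Create source, the room names, and the fixed timestamp are stand-ins for our real camera and sensor streams, not our actual pipeline.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

with beam.Pipeline() as pipeline:
    readings = (
        pipeline
        | "FakeSensorData" >> beam.Create([("room-1", 0.7), ("room-2", 0.4)])
        # Attach an event-time timestamp (a fixed placeholder value here).
        | "StampEventTime" >> beam.Map(
            lambda kv: window.TimestampedValue(kv, 1525165200))
        | "OneMinuteWindows" >> beam.WindowInto(
            window.FixedWindows(60),                      # event-time windows
            trigger=trigger.AfterWatermark(               # fire when the watermark passes
                early=trigger.AfterProcessingTime(10),    # plus speculative early results
                late=trigger.AfterCount(1)),              # and one firing per late element
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
            allowed_lateness=300)                         # keep windows open 5 min for stragglers
        | "MeanPerRoom" >> beam.combiners.Mean.PerKey()
        | "Print" >> beam.Map(print)
    )
```

The appealing part is that swapping the Create step for a streaming source leaves the windowing and trigger declarations unchanged, which is a large part of why the code samples felt obvious to me.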

 

I will keep you updated on our progress.

 

Any questions?

troy@biosension.com

 

 

 
