Code good for passing the salt, but it won’t win you the lottery
However, before heading out for a lottery ticket, potential users should be aware that the software is currently at its best when predicting what a chef might be about to do or need when preparing a salad.
The research is concerned with predicting actions, and the self-learning software is pretty good at it, once it’s gone through a few hours of training videos.
In this case, the software was fed 40 videos of around six minutes each, in which different salad dishes were prepared in an average of 20 distinct actions per video. It also sat through 1,712 videos of 52 different actors making breakfast. A little like the worst season of MasterChef ever.
Why food? Team member Professor Juergen Gall explained to The Register that food preparation videos were used due to the large annotated data sets already available for use in research. The algorithm could then learn the sequence of actions, which could vary for each recipe, and how long each action would take.
Recurrent v Convolutional
The first part, called the decoder, uses a recurrent neural network (RNN) and analyses the video up to the current frame, working out which action is being performed in each frame observed so far.
The second part, also an RNN, takes that decoded sequence and, based on what it has learned, predicts which action will happen next, as well as when the current action will end and how long the next one will take.
The predicted action is then appended to the decoded sequence, and the process runs again and again, each pass lengthening the prediction.
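Schematically, the loop works something like the following Python sketch. This is not the team's code: decode_observed and predict_next are hypothetical stand-ins for the two trained RNNs, with hard-coded outputs so the sketch runs on its own.

# Minimal sketch of the recursive anticipation loop described above.
# The two functions below are hypothetical placeholders for the RNNs.

def decode_observed(frames):
    """First RNN: label the observed frames with the action in progress."""
    # Placeholder: pretend the whole observed clip was 'cut_tomato'.
    return [("cut_tomato", len(frames))]

def predict_next(sequence):
    """Second RNN: given the sequence so far (including earlier
    predictions), return the next action and its expected duration."""
    # Placeholder: always predict 'place_in_bowl' lasting 45 frames.
    return "place_in_bowl", 45

def anticipate(frames, horizon):
    """Extend the decoded sequence, prediction by prediction, until
    `horizon` frames of the future have been covered."""
    sequence = decode_observed(frames)       # (action, duration) pairs
    predicted = 0
    while predicted < horizon:
        action, duration = predict_next(sequence)
        sequence.append((action, duration))  # feed the prediction back in
        predicted += duration
    return sequence

print(anticipate(frames=list(range(300)), horizon=90))

Feeding each prediction back into the sequence is what lets the approach forecast arbitrarily far ahead, at the cost of compounding errors the further out it goes.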
A convolutional neural network (CNN), which could anticipate all actions in one pass, was also tested, but was found to be less accurate overall than the RNN approach.
The team showed the software between 20 and 30 per cent of previously unseen food preparation videos and let the algorithm predict what would happen in the rest of each video. The system worked well, with predictive accuracy at 40 per cent over short periods.
The further into the future the algorithm had to predict, the lower the accuracy. However, it still hit 15 per cent beyond the three-minute mark.
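To make the evaluation protocol concrete, here is a minimal Python sketch of that style of test, assuming per-frame action labels. The model, labels, and 30 per cent split below are illustrative, not taken from the paper.

# Hedged sketch of the evaluation: show the model the first chunk of a
# video, predict the rest, and score per-frame accuracy.

def frame_accuracy(predicted, ground_truth):
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)

def evaluate(model, video_labels, observed_fraction=0.3):
    cut = int(len(video_labels) * observed_fraction)
    observed, future = video_labels[:cut], video_labels[cut:]
    predicted = model(observed, horizon=len(future))  # one label per frame
    return frame_accuracy(predicted, future)

def dummy_model(observed, horizon):
    # Hypothetical baseline: just repeat the last observed label.
    return [observed[-1]] * horizon

labels = ["cut_cucumber"] * 60 + ["mix_dressing"] * 40  # toy ground truth
print(evaluate(dummy_model, labels, observed_fraction=0.3))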
Not just for tossing the salad
Gall told El Reg that the algorithm had potential for use outside of the kitchen: “It would work for industry applications as well. If the approach is trained on videos showing maintenance processes, it would predict what tool a mechanic might need next.” Certainly, the team working on LESTER could use a smarter toolbox.
As for the vital work of saving users from having to watch terrible movies, Gall reckoned that with training and work, the algorithm could also be used to generate spoilers: “The algorithm predicts the sequence of upcoming actions (discuss, fight, kiss…) but not the entire video of the next few minutes. Predicting entire video frames will be at some point in the future possible, but the current results in this direction are not very good yet.”
Gall was keen to emphasise that the study is only a first step in the field of action prediction and that further work is needed. The algorithm behaves noticeably worse if it is not told what has happened in the first part of the video, and is currently never 100 per cent correct. Future research will both improve accuracy and widen the scope of applications. ®