May 26, 2017

Microsoft Research Asia Academic Day 2017

Location: Yilan, Taiwan

  • Historically, AI, robotics, and computer vision shared the same origin. In the early 1970s, most of the AI laboratories in the world, such as the MIT AI Lab and the Stanford AI Lab, conducted research in all three areas under one roof. Researchers in these areas discussed research issues together face-to-face and published their papers in a common venue, IJCAI (International Joint Conference on Artificial Intelligence). Around the early 1980s, however, the three areas separated: ICRA (International Conference on Robotics and Automation) and ICCV (International Conference on Computer Vision) were spun off from IJCAI around that time. Such separation was inevitable for deeper research in the spirit of reductionism. Recently, however, a Cambrian explosion has been occurring in these areas, with too many fragmentary theories produced by too many researchers. It is time to apply holism and reorganize these areas to avoid further fragmentation and, possibly, even their extinction. I will examine why robotics needs AI, why AI needs robotics, and what the key issue is on the path toward holism. From this analysis, I will try to define the key directions for future robotics research.

  • The Internet of Things (IoT) refers to connecting devices to each other through the Internet. Most IoT systems manage physical devices (such as Apple Watches and Google Glass). In this talk we propose the concept of cyber IoT devices, which are computer animations. An example is the “Dandelion Mirror,” a cyber-physical integration merging the virtual and physical worlds; in other words, it is a cyber-physical system (CPS) integrating computation, networking, and physical processes. We use IoTtalk, an IoT device management platform, to develop cyber-physical IoT applications. IoTtalk connects input devices (such as heartbeat sensors) so that they can flexibly interact with cyber devices. We show how IoTtalk can easily accommodate cyber IoT devices such as a ball moving in an animation, how one can use a mobile phone (a physical device) to control a flower growing in an animation (a cyber device), and how a physical pendulum can guide the swing of a cyber pendulum.
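The wiring idea above can be illustrated with a toy sketch. Everything here (the `Broker` class, the feature name `"phone.tilt"`, the growth rule) is hypothetical and is not IoTtalk's actual API; it only shows the pattern of linking an input feature of a physical device to an output feature of a cyber (animated) device.

```python
class Broker:
    """Hypothetical stand-in for an IoT device management platform:
    routes values from input device features to cyber device callbacks."""
    def __init__(self):
        self.links = {}   # input feature name -> list of subscriber callbacks

    def join(self, feature, callback):
        self.links.setdefault(feature, []).append(callback)

    def push(self, feature, value):
        for cb in self.links.get(feature, []):
            cb(value)

class CyberFlower:
    """Animated flower whose growth is driven by a physical sensor."""
    def __init__(self):
        self.height = 0.0

    def on_tilt(self, tilt):
        # Grow only when the phone tilts upward (made-up rule)
        self.height += max(tilt, 0.0) * 0.1

broker = Broker()
flower = CyberFlower()
broker.join("phone.tilt", flower.on_tilt)
broker.push("phone.tilt", 5.0)   # the phone (physical) drives the flower (cyber)
```

The same broker could just as well route a pendulum sensor to an animated pendulum; the point is that cyber devices subscribe through the platform exactly as physical output devices would.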

  • Varying the types of shots is a fundamental element in the language of film, commonly used by directors for visual storytelling. The technique is often used in creating professional recordings of a live concert, but may not be appropriately applied in audience recordings of the same event. Such variations can make the task of classifying shots in concert videos, professional or amateur, very challenging. We propose a novel probabilistic approach, named the Coherent Classification Net (CC-Net), to tackle the problem by addressing three crucial issues. First, we focus on learning more effective features by fusing the layer-wise outputs extracted from a deep convolutional neural network (CNN) pre-trained on a large-scale dataset for object recognition. Second, we introduce a frame-wise classification scheme, the error-weighted deep cross-correlation model (EW-Deep-CCM), to boost classification accuracy. Specifically, the deep neural network-based cross-correlation model (Deep-CCM) is constructed not only to model the extracted feature hierarchies of the CNN independently but also to relate the statistical dependencies of paired features from different layers. Then, a Bayesian error-weighting scheme for classifier combination is adopted to exploit the contributions of the individual Deep-CCM classifiers and enhance the accuracy of shot classification in each image frame. Third, we feed the frame-wise classification results to a linear-chain conditional random field (CRF) module to refine the shot predictions by taking the global and temporal regularities into account. We provide extensive experimental results on a dataset of live concert videos to demonstrate the advantage of the proposed CC-Net over existing popular fusion approaches.
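The temporal refinement step can be sketched as follows. This is a minimal illustration, not CC-Net itself: the per-frame log-probabilities and the transition matrix are made-up inputs (a trained CRF would learn its potentials), and Viterbi decoding stands in for full linear-chain inference.

```python
import numpy as np

def viterbi_smooth(frame_log_probs, trans_log_probs):
    """Decode the most likely shot-class sequence under a linear-chain model.

    frame_log_probs: (T, K) per-frame class log-probabilities (unary scores).
    trans_log_probs: (K, K) log-score of moving from class i to class j.
    """
    T, K = frame_log_probs.shape
    dp = np.empty((T, K))
    back = np.zeros((T, K), dtype=int)
    dp[0] = frame_log_probs[0]
    for t in range(1, T):
        scores = dp[t - 1][:, None] + trans_log_probs  # (K, K): prev -> cur
        back[t] = np.argmax(scores, axis=0)            # best predecessor per class
        dp[t] = frame_log_probs[t] + np.max(scores, axis=0)
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(dp[-1])
    for t in range(T - 2, -1, -1):                     # backtrack
        path[t] = back[t + 1, path[t + 1]]
    return path

# A 5-frame clip with 2 shot classes: frame 2 flickers toward class 1,
# but sticky transitions (stay probability 0.9) smooth it back to class 0.
unary = np.log([[.9, .1], [.9, .1], [.4, .6], [.9, .1], [.9, .1]])
trans = np.log([[.9, .1], [.1, .9]])
smoothed = viterbi_smooth(unary, trans)   # -> [0, 0, 0, 0, 0]
```

Per-frame argmax alone would label frame 2 as class 1; the chain model trades a slightly worse unary score for avoiding two unlikely transitions, which is exactly the temporal regularity the CRF module exploits.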

  • Artificial intelligence, and its embodiment robotics, originally aimed at making complete human copies: 100% AI systems for replacing human workers. However, as seen in Prof. Reddy’s Turing Award Lecture, we have found that there is a huge boundary between artificial and human intelligence, referred to as the frame. There is always an exception beyond the frame within which an AI system can define its tasks. Human intelligence can easily overcome such a frame by handling exceptions, while artificial intelligence cannot and gets stuck there. Prof. Reddy thus proposes 90% AI, and renaming AI as augmented intelligence rather than artificial intelligence. Augmented intelligence, or 90% AI, usually works autonomously on routine work to ease the burden on human workers, and, when the system encounters exceptional cases beyond the frame, it consults fellow human co-workers for help. Augmented intelligence aims not to replace human workers but to cooperate with and help them. In this session, we consider the necessary requirements for such augmented intelligence robots. First, Prof. Luo of National Taiwan University will outline the influence of such systems on human society. Next, Prof. Inaba of the University of Tokyo proposes one of the key technologies for such robots: understanding the situation of fellow human workers in order to decide whether it is a good time to collaborate with them. Finally, Prof. Oishi of the University of Tokyo describes a 3D modeling technique for giving such AI systems their environmental frame.

  • In this session, we will go beyond machine learning and discuss topics in machine generation and discovery. Can a machine comment on a fashion photo like a young person who is familiar with Internet culture? Can a bot sense users’ emotions and react to them appropriately in conversation? And can machines discover something new without any labelled data? We will discuss further possibilities for machines in this AI era.

  • Having a natural language conversation with a computer has been envisioned in movies over the years, from HAL in “2001: A Space Odyssey” to C-3PO in “Star Wars,” Data in “Star Trek: The Next Generation,” and Samantha in “Her.” Yet the realization of true conversational understanding requires the following: robust speech recognition, natural language understanding, awareness of emotional and social cues, and a mental model of the world. In this session, three great speakers will describe the latest advances in research and point out future problems to work on in this very important and exciting area.

  • This session presents three topics:

    • Truly wearable small devices that do not need a local battery, thanks to wireless power transmission (given by Prof. Yoshihiro Kawahara).
    • Quick and accurate robot control using vision and DNN technology (given by Prof. Jenn-Jier James Lien).
    • Much smarter personal assistant systems that observe human behavior (given by Prof. Hao-Chuan Wang).
  • In this session, we have three presentations addressing three aspects of AI: machine learning, hardware, and language generation. The first talk, presented by Prof. James Kwok, describes a fast large-scale low-rank matrix learning method with a convergence rate of O(1/T), where T is the number of iterations. The second talk, given by Prof. Pascual Martínez-Gómez, explains how to leverage phrases of different forms mapped to similar images to recognize phrasal entailment relations. Prof. Yoshino closes the session by showing how to generate natural language sentences using a one-hot vector representation that can utilize information from various sources.
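To give a flavor of the low-rank matrix learning problem class in the first talk, here is a generic proximal-gradient sketch: singular-value soft-thresholding for a nuclear-norm-regularized least-squares objective, for which plain proximal gradient has the classic O(1/T) objective-gap rate. This is a textbook baseline for illustration, not Prof. Kwok's actual method.

```python
import numpy as np

def svt(M, tau):
    """Singular-value soft-thresholding: the prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def low_rank_fit(observed, mask, lam=0.01, step=1.0, iters=100):
    """Proximal gradient on 0.5*||mask*(X - observed)||_F^2 + lam*||X||_*."""
    X = np.zeros_like(observed)
    for _ in range(iters):
        grad = mask * (X - observed)          # gradient of the smooth term
        X = svt(X - step * grad, step * lam)  # gradient step, then prox
    return X

# Fit a fully observed rank-1 matrix; the result is a slightly
# shrunken copy of the input (shrinkage proportional to lam).
M = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
X = low_rank_fit(M, np.ones_like(M), lam=0.01, iters=50)
```

Each iteration costs one SVD, which is the bottleneck at scale; fast methods in this area typically avoid full SVDs or use acceleration (which improves the rate to O(1/T²)).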

  • Recent years have witnessed fast-growing research on artificial intelligence, especially breakthroughs in deep learning, leading to many exciting, ground-breaking applications in the computer vision and multimedia communities. On the other hand, many open problems and grand challenges remain in deep learning for vision and multimedia. In this session, we hope to share some reflections on this important research field and discuss what is missing and where the opportunities lie for academia and industry to advance it further.