Cindy M. Grimm, professor in the School of Mechanical, Industrial, and Manufacturing Engineering at Oregon State University, argues that when describing the behavior of robotic systems, we tend to rely on anthropomorphisms.
A visitor puts her hand against the glass in front of the “Nexi” robot during the “ROBOTS” exhibition at the Hong Kong Science Museum in Hong Kong on May 8, 2021. The exhibition explores the 500-year story of humanoid robots and the artistic and scientific quest to understand what it means to be human.
Photo: Miguel Candela / SOPA Images / Sipa via Reuters Connect
Cameras “see,” decision algorithms “think,” and classification systems “recognize.” But the use of such terms can set us up for failure, since they create expectations and assumptions that often do not hold, especially in the minds of people who have no training in the underlying technologies involved. This is particularly problematic because many of the tasks we envision for robotic technologies are ones that humans currently do (or could do) some part of. The natural tendency is to describe these tasks as a human would, using the “skills” a human has—which may be very different from how a robot performs the task. If the task specification relies only on “human” specifications—without making clear the differences between “robotic” skills and “human” ones—then the chance of a misalignment between the human-based description of the task and what the robot actually does increases.
Designing, procuring, and evaluating AI and robotic systems that are safe, effective, and behave in predictable ways represents a central challenge in contemporary artificial intelligence, and using a systematic approach in choosing the language that describes these systems is the first step toward mitigating risks associated with unexamined assumptions about AI and robotic capabilities. Specifically, actions we consider simple need to be broken down and their components carefully mapped to their algorithmic and sensor counterparts, while avoiding the pitfalls of anthropomorphic language. This serves two purposes. First, it helps to reveal underlying assumptions and biases by more clearly defining functionality. Second, it helps non-technical experts better understand the limitations and capabilities of the underlying technology, so they can better judge if it meets their application needs...
One could argue that the two statements “The robot sees an apple” and “The robot detects an object that has the appearance of an apple” are pretty much the same, but in their assumptions of cognitive ability, they are very different. “See” carries with it a host of internal models and assumptions: Apples are red or green, fit in the hand, smell like apples, crunch when you bite them, are found on trees and in fruit bowls, etc. We are used to seeing apples in a wide variety of lighting conditions and from varying viewpoints—and we have some notion of the context in which they are likely to appear. We can separate out pictures of apples from paintings or cartoons. We can recognize other objects in a scene that tell us whether something is likely to be an apple or another red object. In other words, we bring an entire internal representation of what an apple is when looking at an image—we don’t just see the pixels.

“Detect,” on the other hand, connotes fewer internal assumptions and evokes, instead, the image of someone pointing a sensor at an apple and having it go “ding.” This is more akin to how a robot “sees” and how it internally represents an apple. A sensor (the camera) is pointed at the apple, and the numeric distribution of pixel values is examined. If the pixel values “match” (numerically) the previously learned examples of pixel distributions for images labeled as “apples,” the algorithm returns the symbol “apple.” How does the algorithm get this set of example pixel distributions? Not by running around and picking up objects and seeing if they smell and taste like apples, but from millions of labeled images (thank you, Flickr). These images are largely taken with good lighting and from standard viewpoints, which means that the algorithm struggles to detect an apple in bad lighting or from an odd angle—and it has no way of knowing when an image that matches its criteria for an apple isn’t actually one.
Hence, it is more accurate to say that the robot has detected an object that has the appearance of an apple.
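To make the distinction concrete, here is a minimal sketch of what “detect” means in this sense: an image is reduced to a numeric distribution of pixel values (a color histogram) and compared against distributions learned from labeled example images. This is an illustrative toy, not the system any particular robot uses; all function names and thresholds below are hypothetical.

```python
# A toy "detector": it never reasons about apples, it only compares
# numbers derived from pixels -- the behavior described above.

def pixel_histogram(pixels, bins=4):
    """Reduce an image (a list of (r, g, b) tuples, values 0-255) to a
    normalized color histogram: the 'numeric distribution of pixel
    values' the text describes."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]

def detect(pixels, labeled_examples, threshold=0.5):
    """Return the label whose example histogram is numerically closest
    to the input image's histogram, or None if nothing 'matches'.
    labeled_examples maps a label (e.g. 'apple') to a histogram built
    from previously labeled training images."""
    query = pixel_histogram(pixels)
    best_label, best_dist = None, float("inf")
    for label, example in labeled_examples.items():
        # L1 distance between the two distributions -- a purely
        # numeric notion of 'match', with no concept of what an
        # apple is.
        dist = sum(abs(q - e) for q, e in zip(query, example))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist < threshold else None
```

Note the failure modes this makes visible: an image of a red ball photographed under similar lighting can produce a histogram close enough to the “apple” examples to return the symbol “apple,” while a real apple in unusual lighting or from an odd angle can fall outside the threshold and return nothing.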
Source: Brookings Institution