Vladimir Kulyukin
Autonomous Robots, No. 16, 2004
Summary
Human-robot interaction has become increasingly important in challenging
activities, from mining to space walks. Tight integration of perception,
language and vision is therefore an important characteristic for cooperating in
such environments, and a GUI interface is no longer acceptable. It is believed
that everyday language is somehow grounded in visually guided activities, and
that deep integration must be achieved at this level. The proposed system is a
speech-based method for commanding a robot to perform certain grasping tasks
(such as grasping a “pepsi can”). Two modes of interaction are possible: local
(when the robot and the operator are in each other’s sight) and remote (which
requires a camera so that the human can understand the robot’s environment).
The proposed robot has a standard 3T architecture with three tiers of
functionality: deliberation, execution and control. The execution tier, which
receives from the deliberation tier the inputs to be satisfied, is implemented
with the Reactive Action Package (RAP) system. A RAP is a set of methods for
achieving a specific goal under different circumstances. The steps needed to
execute a method are task nets, and a RAP may operate whether or not it knows
where the object is. A RAP becomes a task when its index clause matches a task
description. The control tier contains the robot’s skills. These are enabled
when the control tier is asked to execute them. The control tier is not aware
of success or failure, since these are handled by the execution tier.
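The relationship between a RAP, its index clause and its methods can be
sketched in Python as follows. This is only a minimal illustration under
assumptions, not the RAP system's own syntax: the names `Rap`, `Method`,
`select_method`, the goal "get-object" and the memory fact "location-known" are
all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set

Memory = Set[str]  # hypothetical: symbolic facts held in the robot's memory


@dataclass
class Method:
    name: str
    applicable: Callable[[Memory], bool]  # circumstances under which the method applies
    task_net: List[str]                   # ordered steps needed to execute the method


@dataclass
class Rap:
    index: str                            # index clause
    methods: List[Method] = field(default_factory=list)

    def matches(self, task_description: str) -> bool:
        # The RAP becomes a task when its index clause matches the description.
        return self.index == task_description

    def select_method(self, memory: Memory) -> Optional[Method]:
        # Pick the first method whose circumstances hold in memory.
        for method in self.methods:
            if method.applicable(memory):
                return method
        return None


# One goal, two methods: with and without knowing where the object is.
get_object = Rap(
    index="get-object",
    methods=[
        Method("known-location",
               lambda mem: "location-known" in mem,
               ["go-to-object", "grasp"]),
        Method("unknown-location",
               lambda mem: "location-known" not in mem,
               ["search-for-object", "go-to-object", "grasp"]),
    ],
)
```

Calling `get_object.select_method({"location-known"})` picks the shorter task
net, while an empty memory selects the method that searches first.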
Actions are then categorized as internal (for manipulating the robot’s memory)
and external (for manipulating external objects or moving). The output of a
skill is a set of symbolic assertions that the skill puts in the robot’s memory.
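A small Python sketch may help to picture this division of labour. It assumes
a memory made of tuples of symbols. The class `RobotMemory`, the functions
`grasp_skill`, `forget_failure` and `execution_tier_check`, and the facts
"holding" and "grasp-failed" are hypothetical names chosen for illustration,
not taken from the paper.

```python
from typing import Set, Tuple

Fact = Tuple[str, ...]


class RobotMemory:
    """Hypothetical store of symbolic assertions."""

    def __init__(self) -> None:
        self.facts: Set[Fact] = set()

    def add(self, *fact: str) -> None:
        self.facts.add(tuple(fact))

    def remove(self, *fact: str) -> None:
        self.facts.discard(tuple(fact))

    def holds(self, *fact: str) -> bool:
        return tuple(fact) in self.facts


def grasp_skill(memory: RobotMemory, obj: str, gripper_closed_on_object: bool) -> None:
    # External action: the skill reports nothing to its caller, it only
    # leaves symbolic assertions in memory.
    if gripper_closed_on_object:
        memory.add("holding", obj)
    else:
        memory.add("grasp-failed", obj)


def forget_failure(memory: RobotMemory, obj: str) -> None:
    # Internal action: manipulates memory only.
    memory.remove("grasp-failed", obj)


def execution_tier_check(memory: RobotMemory, obj: str) -> str:
    # Success or failure is decided here, by reading the assertions.
    return "succeeded" if memory.holds("holding", obj) else "failed"
```

The skill itself returns nothing, which mirrors the point that judging success
or failure is left to the execution tier.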
Goals, objects and actions are part of a semantic network, where each node
represents a memory organization package (MOP). Nodes are connected to each
other by abstraction links or package links. Nodes are activated by the
activation algorithm of direct memory access parsing (DMAP), which is driven by
token sequences. For a token T, all expectations that are waiting for T must be
advanced. For the expectations completed in this way, the target MOPs must be
activated, and for each activated MOP that has callbacks (Kulyukin and Settle,
2001), the callbacks must be run.
Object recognition is performed through the Generalized Hamming Distance (GHD),
a modification of the classical Hamming distance that is capable of approximate
similarity matching, and through color histogram models (CHM).
The system needs several pieces of a priori knowledge: the semantic network of
goals and objects, the library of object models, the context-free command and
control grammars for speech recognition, the library of RAPs and the library of
robotic skills.
Interaction is based on passive rarefication of knowledge through goal
disambiguation. Mutual understanding is necessary, which makes this a question
of cognitive machinery. Voice recognition is performed through the Microsoft
Speech API (SAPI), a middle layer between an application and a speech
recognition engine. It includes the Microsoft English SR Engine Version 5, which
has a vocabulary of 60,000 English words and can be extended with other
languages. SAPI converts voice input into a token sequence, which drives the
activation algorithm discussed above.
Goal disambiguation can take three forms: sensory, mnemonic and linguistic. The
robot is capable of asking for clarification in ambiguous situations. A great
advantage of the system is that it allows introspection: the operator can ask
the robot what it is capable of doing. This makes the machine easy to use for
non-expert operators and helps the learning performance of both robot and human.
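The token-driven activation described above (advance expectations, activate the
target MOPs, run callbacks) can be sketched in Python. This is a simplified
illustration, not the paper's implementation: it ignores abstraction links and
lets unrelated tokens pass without resetting an expectation. The names
`Expectation`, `Parser`, "M-GET-CAN" and "M-STOP" are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class Expectation:
    target: str               # MOP to activate when the pattern completes
    pattern: Tuple[str, ...]  # token sequence that must be seen in order
    position: int = 0         # index of the token currently waited for

    def advance(self, token: str) -> bool:
        """Advance on a matching token; return True when the pattern completes."""
        if self.pattern[self.position] != token:
            return False
        self.position += 1
        if self.position == len(self.pattern):
            self.position = 0  # reset so the pattern can be recognized again
            return True
        return False


@dataclass
class Parser:
    expectations: List[Expectation]
    callbacks: Dict[str, List[Callable[[str], None]]] = field(default_factory=dict)
    activated: List[str] = field(default_factory=list)

    def feed(self, token: str) -> None:
        for expectation in self.expectations:
            if expectation.advance(token):                # advance expectations
                self.activated.append(expectation.target)  # activate target MOP
                for callback in self.callbacks.get(expectation.target, []):
                    callback(expectation.target)          # run its callbacks

    def parse(self, tokens: Iterable[str]) -> None:
        for token in tokens:
            self.feed(token)


log: List[str] = []
parser = Parser(
    expectations=[
        Expectation("M-GET-CAN", ("get", "can")),
        Expectation("M-STOP", ("stop",)),
    ],
    callbacks={"M-GET-CAN": [log.append]},
)
# Stand-in for the token sequence a speech recognizer would produce.
parser.parse("get the pepsi can".split())
```

In this sketch the callback is where a recognized goal would be handed on to
the rest of the robot, for example as a task description for the execution tier.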
Key Concepts
Human-Robot Interaction
Key Results
Experiments have been performed that show the importance of introspection and
that it is mainly used in the first interactions and then abandoned as the need
for learning decreases. Limitations are still to be overcome: the system cannot
handle deictic references, quantification or negation, and it is capable of
interacting with only one person.