4. MACHINE LEARNING
4.3. MAS Learning
As demonstrated by VIENA, there are advantages to developing multi-agent systems that learn and adapt. In VIENA’s case, a single system can learn to adapt to multiple users. This area of adaptive systems is developed further by combining the fields of machine learning and multi-agent systems. Merging the two areas of research presents distinct challenges as well as advantages. Specifically, [Vidal, 2003] states that the definition of machine learning is essentially violated within multi-agent systems because an agent is no longer learning from a fixed set of experiences (training examples). Since the set of experiences E changes, the learned target function changes as well.
[Singh and Huhns, 1997] identify differences between the challenges faced by traditional machine learning research and those faced by research on machine learning within cooperative agent systems. In a traditional agent-based machine learning system, an agent must learn and adapt to an environment that is passive and has no intentions. The agent may also have imprecise sensors that cause it to learn inaccurate information about the environment. In machine learning with systems of multiple agents, an agent learns about an environment that is active, because it includes other agents that have intentions, commitments, beliefs, and abilities, and that can also learn. An agent might also be deliberately misled about the environment by other agents. These differing challenges highlight the fact that learning has moved from being single-agent oriented to multi-agent oriented. [Weiss, 1995] describes the two types as isolated learning and interactive learning, respectively. Agents in a MAS can learn communally because learning can be influenced by exchanged information, shared assumptions, commonly developed viewpoints of their environment, and commonly accepted social and cultural conventions and norms. Weiss also identifies two problems that researchers must address when determining the source of an impact on performance: credit (or blame) for an overall performance change must be assigned to an external agent-to-agent interaction, or credit (or blame) for an action must be assigned to an internal agent decision.
There are two major areas of application of machine learning techniques to multi-agent systems: learning to coordinate or cooperate, and learning from other agents through the exchange of information (cooperative learning) to improve the learning performance of each agent or of the system as a whole. [Nunes and Oliveira, 2003] pursue the latter by modeling human cooperative learning in a team based on the exchange of advice. The authors employ agents that are heterogeneous with respect to learning algorithms, in the hope that different algorithms solving similar problems may lead to different forms of exploration of the same search space, increasing the probability of finding a good solution. The problem domain is a simplified traffic-control problem in which each agent must control four traffic lights at an intersection.
Learning parameters are adapted using two methods: 1) reinforcement-based, unsupervised learning using a quality measure that is directly supplied by the environment, and 2) supervised learning using peer advice as the desired response.
Agents request advice when their current average quality since the beginning of the present time epoch drops below a certain percentage of the best average quality reported by their peers at the beginning of the present epoch. Average quality is assessed at the beginning of each green-yellow-red traffic cycle. Quality is determined by how well the agent has managed the traffic flow. When advice is requested, the advisee sends the current state of traffic to the advisor, which is the peer with the best overall score reported at the start of the epoch. The advisor then switches its internal learning representation back to what it was at the beginning of the epoch, and runs the state communicated by the advisee to give advice in the form of a suggested response to the current state. For a neural network implementation, this would simply involve setting the advisor's network weights back to the values present at the beginning of the epoch. The advisee would then use the response to update its own internal learning representation. In the case of a neural network implementation, the advisor’s response would be used as the target output, and the resulting error would be backpropagated to adjust the network weights accordingly. The researchers found that advice exchange causes a fast increase in quality at early stages as good responses are shared. In a comparison against agents that employed stand-alone, isolated learning, advice-seeking agents were found to fall into local optima less often because they are better at overcoming bad initial parameters. This is because supervised learning allows exploration of more promising regions of the search space. This is an important benefit of supervised learning that will be discussed in Section 5.2.3 with experimental examples where KMAS performs direct revision for known agents.
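The advice-request rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of [Nunes and Oliveira, 2003]: the names `PeerReport`, `should_request_advice`, and `choose_advisor` are hypothetical, and the threshold value of 0.8 is an assumption, since the text only says "a certain percentage".

```python
from dataclasses import dataclass


@dataclass
class PeerReport:
    """Average quality a peer reported at the start of the current epoch."""
    agent_id: str
    avg_quality: float


def should_request_advice(own_avg_quality, peer_reports, threshold=0.8):
    """Return True when the agent's own average quality falls below
    `threshold` times the best average quality reported by its peers.

    `threshold` is an assumed value, not one taken from the source.
    """
    if not peer_reports:
        return False  # nobody to ask
    best = max(report.avg_quality for report in peer_reports)
    return own_avg_quality < threshold * best


def choose_advisor(peer_reports):
    """Pick the peer with the best score reported at the start of the epoch."""
    return max(peer_reports, key=lambda report: report.avg_quality).agent_id
```

An agent would call `should_request_advice` at the start of each traffic cycle and, if it returns True, send its current traffic state to the peer returned by `choose_advisor`.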
Along with cooperative learning, researchers have found machine learning techniques to be valuable tools for aiding the coordination process in multi-agent systems.
Traditional coordination mechanisms such as negotiation must rely heavily on communication between agents. [Bazzan, 1997] identifies this communication bottleneck as a major shortcoming of existing coordination frameworks. Bazzan aims to show that the need for communication can be minimized or even eliminated when coordinating agent activities. As in the first example, the problem domain is traffic control, but only one learning technique is used, and agents do not communicate.
Agents only know their own utility payoffs, and not those of others. Reinforcement learning is applied by way of a critic, “nature”, that provides local and global payoff utility. The global payoff utility acts as an incentive to coordinate toward the global goal of stabilizing coordination such that traffic flows as long as possible without stopping at red lights.
The learning algorithm is a genetic algorithm that models strings of chosen strategies employed in the past. During the learning process, a fitness for each string is computed, and this influences the next generation of strategies used. Fitness is determined by calculating the cumulative payoff of a specific available strategy, with payoffs discounted more heavily the farther in the past the strategy was chosen. This cumulative payoff is then compared against the cumulative payoffs of all strategies. Payoff is only calculated for the time interval between the current learning period and the last time period in which a change in the normal traffic pattern was detected. At the beginning of each time step, if a change in the local or global traffic pattern has not occurred and a learning period has not started, each agent acts according to a strategy chosen by fitness. This strategy yields a payoff determined by nature, which is used in subsequent learning periods. If a change in the normal traffic pattern occurs, strategies are chosen according to the direction of the highest flow of traffic. A strategy simply corresponds to giving more green time to a certain direction of the traffic flow.
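One possible reading of the fitness computation described above is sketched below in Python. It is an interpretation rather than Bazzan's actual formulation: the function names, the geometric discount factor, the representation of the history as (strategy, payoff) pairs, and the choice to express the comparison as a share of the total are all assumptions made for illustration. Payoffs are assumed to be non-negative.

```python
def discounted_payoff(history, strategy, discount=0.9):
    """Sum the payoffs earned by `strategy`, discounting older entries more.

    `history` is a list of (strategy, payoff) pairs, oldest first, covering
    only the interval since the last detected change in traffic pattern.
    """
    total = 0.0
    # age 0 is the most recent time step, so it is not discounted
    for age, (chosen, payoff) in enumerate(reversed(history)):
        if chosen == strategy:
            total += (discount ** age) * payoff
    return total


def relative_fitness(history, strategies, discount=0.9):
    """Compare each strategy's discounted payoff against the total over
    all strategies, returning a share between 0 and 1 per strategy."""
    payoffs = {s: discounted_payoff(history, s, discount) for s in strategies}
    total = sum(payoffs.values())
    if total == 0:
        # no payoff information yet: treat all strategies as equally fit
        return {s: 1.0 / len(strategies) for s in strategies}
    return {s: p / total for s, p in payoffs.items()}
```

For example, with the hypothetical history `[("NS", 1.0), ("EW", 2.0), ("NS", 4.0)]` and a discount of 0.5, the strategy "NS" accumulates 4.0 + 0.25 × 1.0 = 4.25 and "EW" accumulates 0.5 × 2.0 = 1.0, so "NS" receives the larger share of fitness.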
The researchers found, not surprisingly, that coordination is reached faster when the global traffic pattern seldom changes. They also found that a higher learning frequency (more learning periods) provided a good countermeasure in environments with higher rates of individual traffic pattern changes at each intersection.