3.4 A Reinforcement Learning Model of Statistical Discrimination
3.5.1 Exploration
3.5.1.1 Finding Optimal Learning Parameters
Genetic algorithms (GAs; Holland 1975) are often used in stochastic optimisation problems. An optimisation problem is, for instance, the approximation of a function that estimates some empirically observable value. GAs are, in principle, a directed search process. At the start of the process, a pool of candidate solutions called chromosomes is initialised. A chromosome contains a number of genes, which represent, for example, the parameter values of the function to be approximated and which are typically initialised at random. A gene is represented as a binary string whose bits can encode different things, such as the digits of a number. The task of the algorithm is to evolve and select the best solutions from the chromosome pool by applying genetic operators such as mutation and crossover. These operators change and recombine the bits of the fittest genes. Mutation flips a bit of the string with a certain probability; crossover selects a fraction of the chromosomes and recombines genes of the same type from the resulting sample at a randomly selected point in the string. The new pool of chromosomes constitutes the next generation. Fitness is determined by a fitness function, which computes the distance from the candidate solution to the problem solution. While a subset of the fittest genes is reproduced in the next generation using the operators, unfit genes are removed from the population. The process stops when a termination criterion is met, e.g., after a maximum number of generations has been computed or some fitness threshold has been reached (for an introduction, see, for example, Goldberg 1989).
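The generic procedure described above can be sketched in a few lines of Python. This is only an illustrative toy, not the implementation used in this study: the function names (decode, mutate, crossover, evolve), the truncation selection that keeps the fitter half of the pool, the 8-bit gene length, the default mutation rate and the toy fitness function are all assumptions made for the example.

```python
import random

def decode(bits, low, high):
    """Map a binary string (list of 0/1) to a real value in [low, high]."""
    as_int = int("".join(map(str, bits)), 2)
    return low + (high - low) * as_int / (2 ** len(bits) - 1)

def mutate(chromosome, rate, rng):
    """Flip each bit of each gene independently with probability `rate`."""
    return [[bit ^ 1 if rng.random() < rate else bit for bit in gene]
            for gene in chromosome]

def crossover(parent_a, parent_b, rng):
    """One-point crossover between genes of the same type (same position)."""
    child_a, child_b = [], []
    for gene_a, gene_b in zip(parent_a, parent_b):
        point = rng.randrange(1, len(gene_a))  # random cut point in the string
        child_a.append(gene_a[:point] + gene_b[point:])
        child_b.append(gene_b[:point] + gene_a[point:])
    return child_a, child_b

def evolve(fitness, n_genes=2, n_bits=8, pop_size=20, generations=25,
           mutation_rate=0.05, seed=0):
    """Evolve a pool of chromosomes and return the fittest one found."""
    rng = random.Random(seed)
    pool = [[[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_genes)]
            for _ in range(pop_size)]
    for _ in range(generations):
        pool.sort(key=fitness, reverse=True)
        survivors = pool[: pop_size // 2]      # assumed selection: keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            for child in crossover(a, b, rng):
                children.append(mutate(child, mutation_rate, rng))
        pool = (survivors + children)[:pop_size]  # next generation
    return max(pool, key=fitness)

# Toy problem (hypothetical): approach the point (0.3, 0.7) in the unit square.
TARGET = (0.3, 0.7)

def toy_fitness(chromosome):
    """Negative squared distance to TARGET; larger is fitter."""
    x, y = (decode(gene, 0.0, 1.0) for gene in chromosome)
    return -((x - TARGET[0]) ** 2 + (y - TARGET[1]) ** 2)

best = evolve(toy_fitness)
```

Because the fitter half of each generation is carried over unchanged, the best fitness in the pool never decreases from one generation to the next in this sketch.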
Here, the GA is initialised with chromosomes containing four genes. The genes represent the parameters α_employer, γ_employer, α_worker and γ_worker. The population size of a generation is limited to 20 chromosomes; the maximum number of evolutions is limited to 25. The probability that mutation occurs is set at 0.35. The crossover rate is 12%. The framework used for implementation is JGAP (2011). The fitness function is given by the resulting employment discrimination after a simulation run of 5000 time steps. For this, the average difference between the employment levels of green and purple workers over all time steps is computed. The larger the difference, the ‘fitter’ the candidate parameter set. The time scale has no relation to the classroom game. The reason is that this model is of an exploratory nature and it is unknown whether the necessary agent learning can be achieved in ‘real time’. Moreover, large time scales can inform about the stability of the model in the long run.
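As a hedged illustration of the fitness criterion, the following Python sketch computes the average employment gap between the two groups from two per-time-step series. The names GA_SETTINGS and discrimination_fitness are hypothetical, the settings dictionary merely restates the values given in the text, and taking the absolute value of the gap is an assumption: the text does not state whether the difference is signed.

```python
from statistics import mean

# GA settings as stated in the text (the dictionary layout itself is hypothetical).
GA_SETTINGS = {
    "genes": ["alpha_employer", "gamma_employer", "alpha_worker", "gamma_worker"],
    "population_size": 20,
    "max_generations": 25,
    "mutation_rate": 0.35,
    "crossover_rate": 0.12,
    "time_steps": 5000,
}

def discrimination_fitness(employed_green, employed_purple):
    """Average gap between group employment levels over all time steps.

    Assumption: the gap is taken as an absolute value at each time step.
    """
    if len(employed_green) != len(employed_purple):
        raise ValueError("both series must cover the same time steps")
    return mean(abs(g - p) for g, p in zip(employed_green, employed_purple))

# Made-up employment levels for three time steps, for illustration only.
green = [0.8, 0.6, 0.7]
purple = [0.4, 0.5, 0.3]
gap = discrimination_fitness(green, purple)  # approximately 0.3
```

In an actual run, each series would contain one employment level per time step of the 5000-step simulation.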
Applying this algorithm to each model variant produces a different set of optimal parameter values for each variant. Before turning to some example runs in the next sections, this section presents the parameters and the outcome of the optimisation procedure.
Table 3.5 summarises the relevant parameters for each model variant.
χ is set to an arbitrarily high value (100 in this case), because the goal is to find out whether it is possible to generate discrimination at all. Therefore, a limitation of state descriptions provides no benefit at this stage. ζ is set to a small value to allow frequent re-evaluation of mappings and, consequently, adjustment of the beliefs. That is, it is easier to revise negative stereotypes. If discriminatory outcomes emerge, they are likely to be based on co-evolution and not on ignorance on the employer side. Similarly, the parameters ν and µ are set at intervals that allow reasonably large samples of rewards for single rules (about 100), yet let expansions change frequently enough over the 5000 time steps to allow reasonable variation in the expanded state-action mappings. Finally, ρ was fixed at a value > 0 to prevent traps in the search process, but small enough to prevent excessive switching.
Parameter            Value          Meaning

Discrimination parameters
fqθ                  0.5            Probability of good test result if invested
fuθ                  0.2            Probability of good test result if not invested
c                    0 - 0.1        Investment cost interval

BRA parameters
ζ                    0.05           Weight for revisiting expanded nodes
ρ                    0.3            Weight for switching paths in the tree
µ                    75             Interval for deleting inferior expansions
ν                    100            Interval for creating new expansions
χ                    100            Maximum number of nodes

Variant I
α_employer,r1        0.01 - 0.15    choice parameter for r1 - action set bound to descriptor L^1_0
                                    (test-result is ambiguous and (colour is green or colour is purple))
γ_employer,r1        0.01 - 0.5     discount parameter for r1
α_worker             0.01 - 0.2     choice parameter for worker rule
γ_worker             0.01 - 0.5     discount parameter for worker rule

Variant II
α_employer,r1        0.01 - 0.1     choice parameter for r1 - action set bound to descriptor L^1_0
                                    ((test-result=+-) and (colour=green or colour=purple))
γ_employer,r1        0.01 - 0.5     discount parameter for r1
α_employer,r2        0.01 - 0.1     choice parameter for r2 - action set bound to descriptor L^2_0
                                    ((test-result=++) and (colour=green or colour=purple))
γ_employer,r2        0.01 - 0.5     discount parameter for r2
α_employer,r3        0.01 - 0.1     choice parameter for r3 - action set bound to descriptor L^3_0
                                    ((test-result=--) and (colour=green or colour=purple))
γ_employer,r3        0.01 - 0.5     discount parameter for r3
α_worker             0.01 - 0.2     choice parameter for worker rule
γ_worker             0.01 - 0.5     discount parameter for worker rule

Variant III
α_employer,r1        0.01 - 0.15    choice parameter for r1 - action set bound to descriptor L^1_0
                                    ((test-result=++ or test-result=+- or test-result=--) and (colour=green or colour=purple))
γ_employer,r1        0.01 - 0.5     discount parameter for r1
α_worker             0.01 - 0.2     choice parameter for worker rule
γ_worker             0.01 - 0.5     discount parameter for worker rule

Table 3.5: Simulation parameters for finding optimal RL parameters.
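To make the role of the intervals concrete, the sketch below draws one random candidate parameter set for model variant I. It assumes that the intervals listed for variant I are the bounds within which the GA searches and that candidates are drawn uniformly; the names VARIANT_I_RANGES and random_chromosome are hypothetical and do not correspond to any JGAP call.

```python
import random

# Intervals for model variant I as listed in Table 3.5.
# Assumption: these are the search bounds of the four GA genes.
VARIANT_I_RANGES = {
    "alpha_employer_r1": (0.01, 0.15),
    "gamma_employer_r1": (0.01, 0.5),
    "alpha_worker": (0.01, 0.2),
    "gamma_worker": (0.01, 0.5),
}

def random_chromosome(ranges, rng):
    """Draw one candidate parameter set uniformly within the given bounds."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}

candidate = random_chromosome(VARIANT_I_RANGES, random.Random(42))
```

Each candidate produced this way would then be scored by running the simulation with these parameter values.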
Figure 3.2: Model fit for various parameter settings in model variant I.
Figures 3.2 to 3.4 summarise the different parameter settings that have been visited by the GA. The figures display all samples that were run, sorted by the difference in the employment levels of the two groups. The x-axis represents simulation runs, while the y-axis displays the various parameters of the model and the fitness criterion, which all vary between 0 and 1. The simulations that produce the strongest discrimination are at the right end of the graph.
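The ordering used in the figures amounts to sorting the visited samples by their fitness. A minimal sketch, with made-up sample values and hypothetical field names, could look as follows:

```python
# Hypothetical GA samples; all numbers are made up for illustration only.
samples = [
    {"alpha_worker": 0.15, "gamma_worker": 0.30, "fitness": 0.05},
    {"alpha_worker": 0.02, "gamma_worker": 0.10, "fitness": 0.40},
    {"alpha_worker": 0.08, "gamma_worker": 0.45, "fitness": 0.20},
]

def sort_by_fitness(runs):
    """Order runs by ascending fitness so the strongest discrimination is last."""
    return sorted(runs, key=lambda run: run["fitness"])

ordered = sort_by_fitness(samples)
strongest = ordered[-1]  # right end of the graph
```

Plotting each parameter of the ordered runs against its position in the list would give the kind of display described above.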
In general, settings with small α_worker (early lock-in into investment/non-investment behaviour) have the best chances of producing discrimination. Thus, if worker behaviour is relatively stable and differs, for some reason, between groups (here due to the variation of choice and learning behaviour), then employers discriminate between them accordingly.